This paper presents Bag-of-Concept Graph (BACON), which gives models with limited linguistic abilities access to the strengths of Vision Language Models (VLMs) and reduces hallucinations in downstream tasks such as detection, visual question answering (VQA), and image generation. Since visual scenes in the physical world are structured by complex relations between objects, BACON breaks annotations down into basic minimum elements and presents them in a graph structure. The element-wise style makes annotations easy to understand, and the structured composition makes individual elements easy to locate. We obtain BACON captions through careful prompt design, with the help of publicly available VLMs and segmentation methods. In this way, we gather a dataset of 100K annotated images, which endows VLMs with capabilities such as accurately generating BACON, transforming prompts into BACON format, envisioning scenarios in the style of BACON, and dynamically modifying elements within BACON through interactive dialogue. Extensive experiments on detection, VQA, and image generation show that BACON enables models to accomplish tasks that were previously out of reach or to surpass current state-of-the-art solutions.
BACON represents an image in three parts: an overall description, a list of objects, and the relationships between those objects.
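As a rough illustration, the three parts could be held in a small graph-like container. This is a minimal Python sketch under stated assumptions: the class and field names (`BaconGraph`, `BaconObject`, `Relationship`, `box`) are hypothetical and chosen for illustration; they are not the authors' released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BaconObject:
    name: str          # e.g. "Woman", "Vehicle 1"
    category: str      # "Living" or "Inanimate"
    layer: str         # "foreground" or "background"
    description: str
    color: str
    # Assumed slot for a grounding result; None until the object is localized.
    box: Optional[Tuple[float, float, float, float]] = None


@dataclass
class Relationship:
    subject: str
    predicate: str
    obj: str


@dataclass
class BaconGraph:
    style: str = ""
    theme: str = ""
    background: str = ""
    foreground: str = ""
    objects: List[BaconObject] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def neighbors(self, name: str) -> List[Relationship]:
        """Return every relationship that mentions the named object."""
        return [r for r in self.relationships if name in (r.subject, r.obj)]


graph = BaconGraph(style="photograph", theme="flooded street")
graph.objects.append(BaconObject("Woman", "Living", "foreground",
                                 "Her legs move through the water.", "black shirt"))
graph.objects.append(BaconObject("Water", "Inanimate", "foreground",
                                 "The water floods the street.", "murky blue-grey"))
graph.relationships.append(Relationship("Woman", "is walking through", "Water"))
```

Keeping objects and relationships as separate lists lets a downstream task read only the part it needs, for example the object names for detection or the triples for scene-graph evaluation.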
Method: The construction of the BACON representation has two stages: (1) Graph Construction and (2) Graph Grounding.
The BACON dataset is a graph dataset that simultaneously offers open-vocabulary capabilities, detailed object attributes, and comprehensive overall descriptions.
BACON can help multiple downstream tasks by flexibly using the desired parts of its information:
BACON can significantly enhance Grounding DINO's performance in the OVD task, outperforming baselines that include state-of-the-art grounding caption models and specialized OVD models.
Settings: To evaluate the performance of a captioner, we replace the image input of the QA model with the caption of that image. Intuitively, if a fixed QA model correctly answers more questions from a given caption, the caption contains more accurate information, which indicates better captioning performance.
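The protocol described above can be sketched as a small scoring loop. This is an illustrative sketch only: `caption_qa_accuracy`, the exact-match scoring rule, and the keyword-based `toy_qa_model` are assumptions introduced here so the example runs; the actual QA model and metric are not specified in this text.

```python
from typing import Callable, Iterable, Tuple

# (caption, question, gold answer); the caption stands in for the image.
Example = Tuple[str, str, str]


def caption_qa_accuracy(examples: Iterable[Example],
                        qa_model: Callable[[str, str], str]) -> float:
    """Fraction of questions a fixed QA model answers correctly from captions alone.

    Scoring is case-insensitive exact match, which is an assumption of this sketch.
    """
    total = 0
    correct = 0
    for caption, question, gold in examples:
        prediction = qa_model(caption, question)
        correct += int(prediction.strip().lower() == gold.strip().lower())
        total += 1
    return correct / total if total else 0.0


def toy_qa_model(caption: str, question: str) -> str:
    """Hypothetical stand-in: answers 'yes' if the last question word is in the caption."""
    keyword = question.rstrip("?").split()[-1].lower()
    return "yes" if keyword in caption.lower() else "no"


examples = [
    ("A woman walks through a flooded street.", "Is there a woman?", "yes"),
    ("A woman walks through a flooded street.", "Is there a dog?", "no"),
    ("A grey sky above two buildings.", "Is there a vehicle?", "yes"),
]
accuracy = caption_qa_accuracy(examples, toy_qa_model)
```

Because the QA model is held fixed, running the same loop over captions from two different captioners attributes any accuracy difference to the captions themselves.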
BACON-Captioner performs better than widely used captioners. Since the PointQA and PointingQA tasks require position information, we compare BACON-Captioner with grounding caption models rather than ordinary VLMs.
BACON significantly outperforms multiple specialized SGG approaches on the OV-SGG task.
BACON significantly enhances SDXL's ability to understand and follow complex prompts. Remarkably, it enables SDXL to surpass DALL-E 3 in faithfully reproducing the details specified in the text descriptions.
BACON achieves higher user preference ratings than widely used VLMs, including GPT-4V, as well as higher precision and recall scores in user evaluations.
Hello, I would like to ask for your help in describing an image. Please note that I would like the description to be as detailed as possible. Please strictly respond following my instructions and do not print any redundant words.
This description needs to include three parts. The title of each part should be '%%Part1: Overall description%%', '%%Part2: List of objects%%', and '%%Part3: Relationships%%'. All important nouns in your response have to be bounded by '<' and '>'!
The first part is an overall description of the image. Your answer to this part should consist of three parts, one sentence to describe the style of the image, one sentence to describe the theme of the image, and several sentences to describe the image. The titles of these parts are '&&Part1.1: Style&&', '&&Part1.2: Theme&&', '&&Part1.3: Global description of background&&', '&&Part1.4: Global description of foreground&&'. The global description should be as detailed as possible and at least 150 words in total. If there is text content in the image, you can also describe the text, which should be bound by quotation marks. All important nouns in your response have to be bounded by '<' and '>'!
The second part is to list all the objects in the image, as many as possible, in order of importance. Note that any object should not be a part of other objects. Note that the listed object should not be the plural. If there are multiple individuals of the same category of objects, please list them separately. For example, if there are three apples in the picture, they should be listed as 'Apple 1,' 'Apple 2,' and 'Apple 3.', respectively. Additionally, the objects should be classified into two categories: living and inanimate objects. Living refers to creatures such as humans, cats, dogs, and plants, while other lifeless objects belong to the category of inanimate objects. Finally, each object should have a very detailed description, with more important objects receiving more detailed descriptions. Each description should be at least 30 words and the important nouns in it have to be bounded by '<' and '>'. You should also identify whether this object belongs to the foreground or background. You should additionally provide a sentence to describe the color information of the object. Therefore, the format for listing each object should be 'Object Name (Category (Living/Inanimate); foreground/background; Description; Color information)'. Specifically, the detailed description of an object should focus on its part and its action. All descriptions should be in the forms of, object's + part + verb + object/adjective or object + is + present participle. The description should be detailed as well as possible, and try to describe all parts of this object. You should specifically notice if there is a sky, tree, sun, or other object in the background of the environment. All important nouns in your response have to be bounded by '<' and '>'!
The third part is to describe the relationships between all the objects in pairs. Please list them one by one. Additionally, please describe the relationship between object A and object B in the format of 'Object A' + 'Action' + 'Object B.' Please don't print the same relation twice. For example, if there is “A relation B”, you shouldn't print 'B relation A' again. All important nouns in your response have to be bounded by '<' and '>'!
I will provide you with an example of the last two parts of a description to show you the desired format. You should only focus on the format of this example instead of the content of it. You should use the same format to respond.
"%%Part2: List of objects%%
<Woman> (Living; foreground; The <woman>'s <hair> is bundled in a <scarf>. Her <torso> is covered with a <black shirt>. Her <lower body> is clad in <blue jeans>. Her <legs> move through the <water>. Her <right hand> holds a pair of <shoes>; Color information: <black> shirt, <blue> jeans, <orange> scarf.)
<Water> (Inanimate; foreground/background; The <water> floods the <street>, reflecting the <sky> and <surrounding objects>; Color information: <murky blue-grey>.)
<Building 1> (Inanimate; background; The <building> has a <façade> with <doors> and <windows>, showing signs of <water damage>; Color information: <pale yellow>.)
<Building 2> (Inanimate; background; This <building> is similar to <Building 1> but with a <red> roof visible above the <flood>; Color information: <light orange> walls, <red> roof.)
<Vehicle 1> (Inanimate; background; A <vehicle> is partially submerged, showing only the <roof> and <upper parts>; Color information: <white>.)
<Vehicle 2> (Inanimate; background; Another <vehicle>, also partially submerged, with a <visible logo>; Color information: <silver>.)
<Sky> (Inanimate; background; The <sky> is filled with <clouds>, implying recent or ongoing <precipitation>; Color information: <gray>.)
%%Part3: Relationships%%
<Woman> [is walking through] <Water>.
<Woman> [is moving away from] <Camera>.
<Water> [reflects] <Sky>.
<Water> [surrounds] <Vehicles>.
<Buildings> [line] <Street>.
<Vehicle 1> [is submerged by] <Water>.
<Vehicle 2> [is submerged by] <Water>.
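A response in the format of the example above can be turned back into structured data with a few regular expressions. The following is a minimal sketch, assuming each object and each relationship occupies exactly one line and that descriptions do not contain the literal text "; Color information:"; the function name `parse_bacon` and the returned dictionary keys are hypothetical, not part of any released BACON tooling.

```python
import re
from typing import Dict, List, Tuple

# <Name> (Category; layer; description; Color information: colors.)
OBJECT_RE = re.compile(
    r"^<(?P<name>[^>]+)>\s*\("
    r"(?P<category>[^;]+);\s*"
    r"(?P<layer>[^;]+);\s*"
    r"(?P<description>.*);\s*Color information:\s*"
    r"(?P<color>.*?)\.?\)$"
)
# <Subject> [predicate] <Object>.
RELATION_RE = re.compile(r"^<([^>]+)>\s*\[([^\]]+)\]\s*<([^>]+)>\.?$")
TAG_RE = re.compile(r"<([^>]+)>")


def strip_tags(text: str) -> str:
    """Remove the '<' and '>' that mark important nouns."""
    return TAG_RE.sub(r"\1", text)


def parse_bacon(text: str) -> Tuple[List[Dict[str, str]], List[Tuple[str, str, str]]]:
    """Parse Part2 (objects) and Part3 (relationships); unmatched lines are skipped."""
    objects: List[Dict[str, str]] = []
    relations: List[Tuple[str, str, str]] = []
    section = None
    for raw in text.splitlines():
        line = raw.strip().strip('"')
        if not line:
            continue
        if line.startswith("%%Part2"):
            section = "objects"
        elif line.startswith("%%Part3"):
            section = "relations"
        elif section == "objects":
            match = OBJECT_RE.match(line)
            if match:
                objects.append({k: strip_tags(v.strip())
                                for k, v in match.groupdict().items()})
        elif section == "relations":
            match = RELATION_RE.match(line)
            if match:
                relations.append((match.group(1), match.group(2), match.group(3)))
    return objects, relations


sample = """%%Part2: List of objects%%
<Woman> (Living; foreground; Her <legs> move through the <water>; Color information: <black> shirt, <blue> jeans.)
<Water> (Inanimate; foreground/background; The <water> floods the <street>; Color information: <murky blue-grey>.)
%%Part3: Relationships%%
<Woman> [is walking through] <Water>.
<Water> [reflects] <Sky>.
"""
objects, relations = parse_bacon(sample)
```

Lines that do not match either pattern are skipped silently here; a stricter variant might collect them so that malformed model output can be inspected.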
@inproceedings{yang2024bacon,
author = {Yang, Zhantao and Feng, Ruili and Yan, Keyu and Wang, Huangji and Wang, Zhicai and Zhu, Shangwen and Zhang, Han and Xiao, Jie and Wu, Pingyu and Zhu, Kai and Chen, Jixuan and Xie, Chen-Wei and Mao, Chaojie and Yang, Yue and Zhang, Hongyang and Liu, Yu and Cheng, Fan},
title = {BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations},
booktitle = {arXiv preprint arXiv:2407.03314},
year = {2024}
}