BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations

Abstract

This paper presents Bag-of-Concept Graph (BACON), which gives models with limited linguistic abilities access to the strengths of Vision Language Models (VLMs) and reduces hallucinations in downstream tasks such as detection, visual question answering (VQA), and image generation. Because visual scenes in the physical world are structured by complex relations between objects, BACON breaks annotations down into minimal basic elements and presents them in a graph structure. The element-wise style makes annotations easy to understand, and the structural composition makes individual elements easy to locate. Using careful prompt design together with publicly available VLMs and segmentation methods, we produce BACON captions and gather a dataset of 100K annotated images. This dataset endows VLMs with capabilities such as accurately generating BACON, transforming prompts into BACON format, envisioning scenarios in the style of BACON, and dynamically modifying elements within BACON through interactive dialogue. Extensive experiments on representative tasks, including detection, VQA, and image generation, show that BACON either makes previously out-of-reach tasks achievable or improves on current state-of-the-art solutions.

BACON: Bag-of-Concept Graph

BACON provides a representation of an image, including overall description, object list, and relationships.
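
As a rough illustration of this three-part representation, the sketch below models a BACON graph as plain Python dataclasses. The class and field names (`BaconObject`, `BaconRelation`, `BaconGraph`, `layer`, `box`, and so on) are assumptions made for this example and are not taken from any released BACON code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BaconObject:
    name: str          # e.g. "Woman" or "Building 1"
    category: str      # "Living" or "Inanimate"
    layer: str         # "foreground" or "background"
    description: str   # detailed free-text description
    color: str         # color information sentence
    # Region assigned later by grounding; the (x1, y1, x2, y2) layout is an assumption.
    box: Optional[Tuple[float, float, float, float]] = None


@dataclass
class BaconRelation:
    subject: str       # name of the first object
    predicate: str     # e.g. "is walking through"
    object: str        # name of the second object


@dataclass
class BaconGraph:
    overall_description: str
    objects: List[BaconObject] = field(default_factory=list)
    relations: List[BaconRelation] = field(default_factory=list)

    def relations_of(self, name: str) -> List[BaconRelation]:
        """Return every relation in which the named object takes part."""
        return [r for r in self.relations if name in (r.subject, r.object)]
```

Keeping objects and relations as separate lists lets a downstream task read only the part it needs, for instance the object list for detection or the relation list for scene graph generation.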

Method: The construction of the BACON representation has two stages: (1) Graph Construction and (2) Graph Grounding.

Graph Construction.

Deconstructing annotations: BACON decomposes the annotations produced by VLMs (GPT-4V in practice) into basic elements and then combines them according to a specific structure. Based on this approach, we develop the BACON dataset.
BACON-Captioner: As an alternative, we fine-tuned a 13B LLaVA model on the BACON dataset to function as a specialized captioner. The output distribution of the trained BACON-Captioner is highly similar to that of GPT-4V. Consequently, BACON-Captioner is a viable alternative and can be used to extend the dataset.

Graph Grounding.

1. Get BACON: Derive BACON by deconstructing annotations.
2. Grounding: Using Grounding DINO to obtain candidate regions.
3. Exclude outrageous answers: Using LLaVA to discard blatantly incorrect regions.
4. Match: Using CLIP to identify the optimal region by comparing the object description with the image content of each region.
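
The grounding steps above can be sketched as a small selection routine. This is only a minimal sketch of the control flow: the three callables `propose`, `is_plausible`, and `similarity` are hypothetical stand-ins for Grounding DINO, LLaVA, and CLIP respectively, and their signatures are assumptions rather than the real APIs of those models.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]


def ground_object(
    image: object,
    name: str,
    description: str,
    propose: Callable[[object, str], Sequence[Box]],
    is_plausible: Callable[[object, Box, str], bool],
    similarity: Callable[[object, Box, str], float],
) -> Optional[Box]:
    """Pick one region for an object, or None if no candidate survives."""
    # Step 2: candidate regions from an open-vocabulary detector
    # (Grounding DINO in the source; `propose` is a hypothetical wrapper).
    candidates: List[Box] = list(propose(image, name))

    # Step 3: drop regions that a VLM judges clearly wrong
    # (LLaVA in the source; `is_plausible` is a hypothetical wrapper).
    kept = [box for box in candidates if is_plausible(image, box, name)]
    if not kept:
        return None

    # Step 4: keep the region whose content best matches the object description
    # (CLIP in the source; `similarity` is a hypothetical wrapper).
    return max(kept, key=lambda box: similarity(image, box, description))
```

Passing the three models in as callables keeps the sketch independent of any particular library version and makes the selection logic easy to test with stubs.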

BACON Dataset

The BACON dataset is a graph dataset that simultaneously offers open-vocabulary capabilities, detailed object attributes, and comprehensive overall descriptions.

Training Set.

Scale: Refined 100k BACON-image pairs.
Methods: graph construction + graph grounding

Test Benchmark.

Scale: 3k images, 27k objects, and 148k relationships.
Detailed methods:

1. Segmentation: Using SAM to detect all objects in an image.
2. Obtain the object list part: Using VLMs to identify the objects (corrected by human annotation) and then describe each object based on its name (corrected by human annotation).
3. Obtain the overall description part: Using VLMs to generate the overall description annotation given the object list (corrected by human annotation).
4. Obtain the relationships part: Using VLMs to determine the relationship between two objects based on their combined masked image (corrected by human annotation).

BACON Captioner

Interactively Modify BACON with Captioner

Transform normal prompts into BACON style

Experiments

BACON can be applied to help multiple downstream tasks by flexibly utilizing desired parts of information:

1. Open-vocabulary Object Detection

BACON significantly enhances Grounding DINO's performance on the OVD task, outperforming baselines that include state-of-the-art grounding caption models and specialized OVD models.

2. Point/Pointing/Video Question Answering (PointQA/PointingQA/VQA)

Settings: To evaluate the performance of the captioner, we replace the images given to the QA model with their captions. Intuitively, if a fixed QA model can correctly answer more questions using a given caption, the caption contains more accurate information, which indicates superior captioning performance.
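
The evaluation idea in these settings can be sketched as follows. This is an illustration only: `answer` is a hypothetical stand-in for the fixed QA model, and scoring by case-insensitive exact match is an assumption, since the scoring rule is not specified here.

```python
from typing import Callable, Iterable, Tuple


def caption_qa_accuracy(
    samples: Iterable[Tuple[str, str, str]],
    answer: Callable[[str, str], str],
) -> float:
    """Score a captioner by how often a fixed QA model answers correctly from its captions.

    samples: (caption, question, gold_answer) triples, where the caption
             replaces the image as the QA model's input.
    answer:  the fixed QA model, mapping (caption, question) to a prediction.
    """
    total = 0
    correct = 0
    for caption, question, gold in samples:
        total += 1
        prediction = answer(caption, question)
        # Assumed scoring rule: case-insensitive exact match.
        if prediction.strip().lower() == gold.strip().lower():
            correct += 1
    return correct / total if total else 0.0
```

Running the same questions and the same `answer` function over the captions of two different captioners yields two accuracies that can be compared directly, because the QA model is held fixed.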

BACON-Captioner performs better than widely used captioners. Since the PointQA and PointingQA tasks require position information, we compare BACON-Captioner with grounding caption models rather than ordinary VLMs on these tasks.

3. Open-vocabulary Scene Graph Generation

BACON significantly outperforms multiple specialized SGG approaches on the OV-SGG task.

4. Image Generation

BACON significantly enhances SDXL's ability to understand and follow complex prompts. Remarkably, it enables SDXL to surpass DALL-E 3 in faithfully reproducing the details specified in the text descriptions.

5. Precision & Recall and User Study

BACON achieves higher user preference ratings than widely used VLMs, including GPT-4V, as well as higher precision and recall scores in user evaluations.

Visualization on Video Captioning

For video captioning, BACON likewise comprises three components (an overall description, an object list, and their relationships), each of which evolves over time. Relative to the previous frame, updates are color-coded: new elements in green, removed elements in red, altered elements in gold, and persistent elements in black. BACON thus captures the temporal changes and salient details of each video frame, and its structured nature may help downstream models comprehend the video.
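
The frame-to-frame color coding can be illustrated with a small diff routine. This is a sketch under the assumption that each frame's elements are available as a mapping from element name to description; the function name `diff_frames` and the use of plain string equality to detect an altered element are choices made for this example.

```python
from typing import Dict

# Color coding relative to the previous frame, as described in the text.
COLORS = {"new": "green", "removed": "red", "altered": "gold", "persistent": "black"}


def diff_frames(prev: Dict[str, str], curr: Dict[str, str]) -> Dict[str, str]:
    """Map each element name to its status relative to the previous frame."""
    status: Dict[str, str] = {}
    for name, description in curr.items():
        if name not in prev:
            status[name] = "new"
        elif prev[name] != description:
            status[name] = "altered"
        else:
            status[name] = "persistent"
    # Elements present before but absent now were removed.
    for name in prev:
        if name not in curr:
            status[name] = "removed"
    return status


prev_frame = {"Dog": "The dog sits.", "Ball": "A red ball.", "Cat": "A cat sleeps."}
curr_frame = {"Dog": "The dog runs.", "Ball": "A red ball.", "Child": "A child laughs."}
for element, state in diff_frames(prev_frame, curr_frame).items():
    print(f"{element}: {state} ({COLORS[state]})")
```

The same routine could be applied separately to the object list and to the relationships, since both can be keyed by name.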

Other Examples About BACON

Examples of BACON in String Format Obtained by GPT-4V

Instruction for GPT-4V to Obtain BACON

Hello, I would like to ask for your help in describing an image. Please note that I would like the description to be as detailed as possible. Please strictly respond following my instructions and do not print any redundant words.

This description needs to include three parts. The title of each part should be '%%Part1: Overall description%%', '%%Part2: List of objects%%', and '%%Part3: Relationships%%'. All important nouns in your response have to be bounded by '<' and '>'!

The first part is an overall description of the image. Your answer to this part should consist of three parts, one sentence to describe the style of the image, one sentence to describe the theme of the image, and several sentences to describe the image. The titles of these parts are '&&Part1.1: Style&&', '&&Part1.2: Theme&&', '&&Part1.3: Global description of background&&', '&&Part1.4: Global description of foreground&&'. The global description should be as detailed as possible and at least 150 words in total. If there is text content in the image, you can also describe the text, which should be bound by quotation marks. All important nouns in your response have to be bounded by '<' and '>'!

The second part is to list all the objects in the image, as many as possible, in order of importance. Note that any object should not be a part of other objects. Note that the listed object should not be the plural. If there are multiple individuals of the same category of objects, please list them separately. For example, if there are three apples in the picture, they should be listed as 'Apple 1,' 'Apple 2,' and 'Apple 3.', respectively. Additionally, the objects should be classified into two categories: living and inanimate objects. Living refers to creatures such as humans, cats, dogs, and plants, while other lifeless objects belong to the category of inanimate objects. Finally, each object should have a very detailed description, with more important objects receiving more detailed descriptions. Each description should be at least 30 words and the important nouns in it have to be bounded by '<' and '>'. You should also identify whether this object belongs to the foreground or background. You should additionally provide a sentence to describe the color information of the object. Therefore, the format for listing each object should be 'Object Name (Category (Living/Inanimate); foreground/background; Description; Color information)'. Specifically, the detailed description of an object should focus on its part and its action. All descriptions should be in the forms of, object's + part + verb + object/adjective or object + is + present participle. The description should be detailed as well as possible, and try to describe all parts of this object. You should specifically notice if there is a sky, tree, sun, or other object in the background of the environment. All important nouns in your response have to be bounded by '<' and '>'!

The third part is to describe the relationships between all the objects in pairs. Please list them one by one. Additionally, please describe the relationship between object A and object B in the format of 'Object A' + 'Action' + 'Object B.' Please don't print the same relation twice. For example, if there is “A relation B”, you shouldn't print 'B relation A' again. All important nouns in your response have to be bounded by '<' and '>'!

I will provide you with an example of the last two parts of a description to show you the desired format. You should only focus on the format of this example instead of the content of it. You should use the same format to respond.

"%%Part2: List of objects%%
<Woman> (Living; foreground; The <woman>'s <hair> is bundled in a <scarf>. Her <torso> is covered with a <black shirt>. Her <lower body> is clad in <blue jeans>. Her <legs> move through the <water>. Her <right hand> holds a pair of <shoes>; Color information: <black> shirt, <blue> jeans, <orange> scarf.)
<Water> (Inanimate; foreground/background; The <water> floods the <street>, reflecting the <sky> and <surrounding objects>; Color information: <murky blue-grey>.)
<Building 1> (Inanimate; background; The <building> has a <façade> with <doors> and <windows>, showing signs of <water damage>; Color information: <pale yellow>.)
<Building 2> (Inanimate; background; This <building> is similar to <Building 1> but with a <red> roof visible above the <flood>; Color information: <light orange> walls, <red> roof.)
<Vehicle 1> (Inanimate; background; A <vehicle> is partially submerged, showing only the <roof> and <upper parts>; Color information: <white>.)
<Vehicle 2> (Inanimate; background; Another <vehicle>, also partially submerged, with a <visible logo>; Color information: <silver>.)
<Sky> (Inanimate; background; The <sky> is filled with <clouds>, implying recent or ongoing <precipitation>; Color information: <gray>.)
%%Part3: Relationships%%
<Woman> [is walking through] <Water>.
<Woman> [is moving away from] <Camera>.
<Water> [reflects] <Sky>.
<Water> [surrounds] <Vehicles>.
<Buildings> [line] <Street>.
<Vehicle 1> [is submerged by] <Water>.
<Vehicle 2> [is submerged by] <Water>."
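
A response in the string format shown above can be turned back into structured data with a few regular expressions. The parser below is a minimal sketch modeled only on this example: the function names, the returned dictionary keys, and the assumption that the color information is always the last `;`-separated field are illustrative and not part of BACON itself.

```python
import re
from typing import Dict, List, Tuple

# Patterns modeled on the example response; they are assumptions about the format.
HEADER = re.compile(r"^%%(Part\d+):.*%%$")
OBJECT = re.compile(r"^<([^>]+)>\s*\((.*)\)$")
RELATION = re.compile(r"^<([^>]+)>\s*\[([^\]]+)\]\s*<([^>]+)>\.?$")


def strip_tags(text: str) -> str:
    """Remove the '<' and '>' markers placed around important nouns."""
    return re.sub(r"<([^<>]+)>", r"\1", text).strip()


def parse_bacon(text: str) -> Tuple[Dict[str, dict], List[Tuple[str, str, str]]]:
    """Parse the object list (Part2) and relationships (Part3) of a response."""
    objects: Dict[str, dict] = {}
    relations: List[Tuple[str, str, str]] = []
    section = None
    for raw in text.splitlines():
        line = raw.strip().strip('"')  # tolerate the quotes wrapping the example
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            continue
        if section == "Part2":
            match = OBJECT.match(line)
            if not match:
                continue
            parts = [p.strip() for p in match.group(2).split(";")]
            if len(parts) < 4:
                continue  # not in the expected 'category; layer; description; color' form
            objects[match.group(1)] = {
                "category": parts[0],
                "layer": parts[1],
                # The description may itself contain ';', so rejoin the middle fields.
                "description": strip_tags("; ".join(parts[2:-1])),
                "color": strip_tags(parts[-1].replace("Color information:", "", 1)),
            }
        elif section == "Part3":
            match = RELATION.match(line)
            if match:
                relations.append((match.group(1), match.group(2).strip(), match.group(3)))
    return objects, relations
```

Lines that do not match the expected patterns are skipped rather than raising an error, which is a deliberate choice here because a VLM response may contain stray text.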

BibTeX


        @inproceedings{yang2024bacon,
          author      = {Yang, Zhantao and Feng, Ruili and Yan, Keyu and Wang, Huangji and Wang, Zhicai and Zhu, Shangwen and Zhang, Han and Xiao, Jie and Wu, Pingyu and Zhu, Kai and Chen, Jixuan and Xie, Chen-Wei and Mao, Chaojie and Yang, Yue and Zhang, Hongyang and Liu, Yu and Cheng, Fan},
          title       = {BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations},
          booktitle   = {arXiv preprint arXiv:2407.03314},
          year        = {2024}
        }