ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

1Tsinghua University  2HKUST(GZ)  3Zhipu AI
4Chongqing University  5Shanghai Jiao Tong University

CVPR 2025


*Equal Contribution

Corresponding Authors

Comparison with Contrastive Decoding


Unlike contrastive decoding, which suppresses language priors to curb a model's over-reliance on the text modality, ICT leaves those priors intact and instead intervenes during the forward pass to strengthen the model's focus on both comprehensive visual information and fine-grained object details. As illustrated in the figure, after applying ICT the model attends more closely to details within the image, such as identifying the man as Curry, while still exploiting beneficial language priors (e.g., Curry is a basketball player) to infer the correct answer. Because the intervention shift vectors are pre-computed, ICT introduces no additional latency during the forward pass.
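The forward-pass intervention can be pictured as adding a pre-computed shift vector to a layer's activations at inference time. The sketch below illustrates this with a PyTorch forward hook on a toy layer; the layer, hook structure, and `alpha` scaling are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ToyAttentionLayer(nn.Module):
    """Stand-in for one transformer layer's attention output projection."""
    def __init__(self, d_model: int):
        super().__init__()
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_proj(x)

def make_intervention_hook(shift: torch.Tensor, alpha: float = 1.0):
    """Add a pre-computed 'trusted direction' to the layer output.

    Because `shift` is computed offline, the hook only adds one vector
    per forward pass and introduces no extra decoding latency.
    """
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer output.
        return output + alpha * shift
    return hook

d_model = 16
layer = ToyAttentionLayer(d_model)

# Pre-computed offline in ICT's setup (random here, for illustration only).
shift = torch.randn(d_model)

handle = layer.register_forward_hook(make_intervention_hook(shift, alpha=0.5))
x = torch.randn(2, 4, d_model)          # (batch, tokens, dim)
with torch.no_grad():
    shifted = layer(x)
handle.remove()
with torch.no_grad():
    base = layer(x)

# The intervention changes activations by exactly alpha * shift.
print(torch.allclose(shifted - base, 0.5 * shift.expand_as(base), atol=1e-5))
```

Since the hook is attached and removed without touching the model's weights, this style of intervention is training-free and plug-and-play, matching the property the paper emphasizes.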

ICT: Image-Object Cross-Level Trusted Intervention


We propose ICT, a novel, training-free, plug-and-play method that effectively reduces hallucinations in LVLMs by enhancing focus on both overall visual information and fine-grained object details during the forward pass, without eliminating beneficial language priors.
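One way such a shift vector could be obtained offline is as a mean activation difference between trusted and hallucinated responses. The sketch below uses this difference-of-means estimator on synthetic data; the estimator and all names are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def compute_shift_vector(trusted_acts: torch.Tensor,
                         hallucinated_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a 'trustworthy' direction in activation space.

    trusted_acts / hallucinated_acts: (num_samples, d_model) activations
    collected at a fixed layer on paired correct vs. hallucinated
    responses. The difference of means points from hallucinated toward
    trusted behavior.
    """
    direction = trusted_acts.mean(dim=0) - hallucinated_acts.mean(dim=0)
    return direction / direction.norm()  # unit-normalize for a stable scale

# Synthetic check: trusted activations are offset along a known axis,
# so the estimator should recover that axis.
d_model = 8
torch.manual_seed(0)
base = torch.randn(100, d_model)
true_axis = torch.zeros(d_model)
true_axis[0] = 1.0
trusted = base + 2.0 * true_axis
hallucinated = base

shift = compute_shift_vector(trusted, hallucinated)
print(shift[0].item())  # ≈ 1.0: the direction recovers the planted axis
```

Because the vector is estimated once from a calibration set, applying it later adds no per-query cost, consistent with ICT's training-free design.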

Qualitative Results


Top: After applying ICT, the model allocates a higher proportion of attention to visual tokens, especially the object tokens relevant to the question (e.g., "horse" and "fruits"). By prioritizing visual information, ICT correctly identifies the absence of a horse in the image, whereas VCD erroneously concludes that a horse is present due to insufficient attention to visual cues.
Bottom: When asked, "How many uncut fruits are in the picture?", VCD incorrectly answers "two" due to a lack of focus on visual details. Although ICT correctly identifies that there are four fruits in total, this question requires not only attention to visual content but also reasoning within the text modality: the model must recognize the overall count of fruits and also attend to the attribute "uncut". Because it fails to incorporate this attribute, ICT also outputs a wrong answer.

Quantitative Results


Table 1 presents the results of LLaVA-v1.5 and Qwen-VL on nine POPE dataset subsets, leading to the following conclusions:
ICT Improves Performance: Applying ICT boosts the F1 score by 7.09% for LLaVA-v1.5 and 5.44% for Qwen-VL, surpassing the previous SOTA baseline (OPERA) by 2.19% and 1.14%, respectively. The improvement stems from ICT enhancing attention to visual information without discarding useful language priors, thereby reducing hallucinations.
Multi-Level Interventions Help: Image-level and object-level interventions achieve average F1 gains of 5.76% and 5.47%, respectively, showing that enhancing visual attention at either level effectively mitigates hallucinations. Object-level intervention slightly outperforms, as it implicitly broadens the model's focus.
ICT Generalizes Well: An intervention shift vector trained on 1,500 MSCOCO samples improves the F1 score by 7.67% on MSCOCO and by 6.09% on average across the other subsets. This suggests that ICT captures a general trustworthiness direction rather than overfitting to a specific dataset.
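The cross-level idea above can be sketched as applying an image-level shift and an object-level shift to the same hidden state, each with its own strength. The per-head selection used in the paper is omitted here for brevity; the function name and `alpha` values are illustrative assumptions.

```python
import torch

def apply_cross_level_shift(hidden: torch.Tensor,
                            image_shift: torch.Tensor,
                            object_shift: torch.Tensor,
                            alpha_img: float = 0.5,
                            alpha_obj: float = 0.5) -> torch.Tensor:
    """hidden: (batch, tokens, d_model); shifts: (d_model,).

    Adds both pre-computed directions so the model attends to the
    overall image and to question-relevant objects at the same time.
    """
    return hidden + alpha_img * image_shift + alpha_obj * object_shift

# Toy check with known values: zero hidden state, opposing shifts.
d_model = 8
h = torch.zeros(1, 3, d_model)
img = torch.ones(d_model)
obj = -torch.ones(d_model)

out = apply_cross_level_shift(h, img, obj, alpha_img=1.0, alpha_obj=0.5)
# Each element becomes 1.0 * 1 + 0.5 * (-1) = 0.5.
print(out[0, 0, 0].item())
```

Keeping the two levels as separate vectors with separate strengths makes it easy to ablate each level independently, mirroring the image-level vs. object-level comparison in Table 1.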

Citation


        @article{chen2024ict,
          title={ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models},
          author={Chen, Junzhe and Zhang, Tianshu and Huang, Shiyu and Niu, Yuwei and Zhang, Linfeng and Wen, Lijie and Hu, Xuming},
          journal={arXiv preprint arXiv:2411.15268},
          year={2024}
        }