Unlike contrastive decoding, ICT does not eliminate language priors in order to curb the model's over-reliance on the text modality. Instead, it intervenes during the forward pass to strengthen the model's focus on both holistic visual information and fine-grained object details. As illustrated in the figure, after applying ICT the model attends more closely to details within the image, such as identifying the man as Curry, while still exploiting beneficial language priors (e.g., that Curry is a basketball player) to infer the correct answer. Because the intervention shift vectors are pre-computed, ICT introduces no additional latency during the forward pass.
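The intervention described above can be sketched with a PyTorch forward hook: a pre-computed shift vector is added to a layer's activations at inference time, so no extra decoding passes or training are needed. This is a minimal illustrative sketch, not the paper's implementation; the layer, the `alpha` strength parameter, and the random `shift` vector are all placeholders (in ICT the shifts are derived offline from model activations on trusted examples).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer of an LVLM (placeholder only).
hidden_dim = 8
layer = nn.Linear(hidden_dim, hidden_dim)

# Pre-computed intervention shift vector (random placeholder here; in ICT
# it would be computed offline, before decoding begins).
shift = torch.randn(hidden_dim)
alpha = 0.5  # hypothetical intervention-strength hyperparameter

def ict_hook(module, inputs, output):
    # Steer the layer's output activations by the pre-computed shift.
    return output + alpha * shift

handle = layer.register_forward_hook(ict_hook)

x = torch.randn(1, hidden_dim)
with torch.no_grad():
    shifted = layer(x)   # forward pass with the intervention applied
handle.remove()
with torch.no_grad():
    base = layer(x)      # same forward pass without the intervention

# The hook adds exactly alpha * shift on top of the normal output,
# costing one vector addition per hooked layer.
print(torch.allclose(shifted, base + alpha * shift))
```

Because the hook only performs a vector addition with a vector computed ahead of time, the per-token cost of the intervention is negligible, which is why the method adds no latency at decoding time.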
We propose ICT, a novel, training-free, plug-and-play method that effectively reduces hallucinations in LVLMs by enhancing focus on both overall visual information and fine-grained object details during the forward pass, without eliminating beneficial language priors.
@article{chen2024ict,
title={ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models},
author={Chen, Junzhe and Zhang, Tianshu and Huang, Shiyu and Niu, Yuwei and Zhang, Linfeng and Wen, Lijie and Hu, Xuming},
journal={arXiv preprint arXiv:2411.15268},
year={2024}
}