Vision Language Models Are Blind
Pooyan Rahmanzadehgervi*, Logan Bolton*, Mohammad Reza Taesiri, Anh Totti Nguyen
Links: pdf | code | project page
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as determining (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is circled in a word; and (d) how many circles are in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Sonnet-3.5 performs best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or are close together.
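For intuition, the sketch below shows how a BlindTest-style two-circle sample and its ground-truth overlap label could be generated. This is a minimal illustration under our own assumptions, not the authors' released generation code; the function name and all parameters (radius, canvas size, line width, output path) are hypothetical.

```python
# Minimal sketch of a "do two circles overlap?" BlindTest-style sample.
# Illustrative only; not the authors' exact generation code.
import math
import random

import matplotlib.pyplot as plt


def make_two_circle_sample(radius=0.5, canvas=5.0, line_width=2, path="sample.png"):
    """Draw two circles on a blank canvas; return the ground-truth overlap label."""
    # Sample two random centers that keep both circles fully inside the canvas.
    (x1, y1), (x2, y2) = [
        (random.uniform(radius, canvas - radius), random.uniform(radius, canvas - radius))
        for _ in range(2)
    ]
    distance = math.hypot(x2 - x1, y2 - y1)
    overlapping = distance < 2 * radius  # circles overlap iff centers are closer than 2r

    fig, ax = plt.subplots(figsize=(4, 4))
    for (x, y) in [(x1, y1), (x2, y2)]:
        ax.add_patch(plt.Circle((x, y), radius, fill=False, linewidth=line_width))
    ax.set_xlim(0, canvas)
    ax.set_ylim(0, canvas)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return overlapping


if __name__ == "__main__":
    # The saved image, paired with a question such as "Are the two circles
    # overlapping? Answer Yes or No.", would be sent to a VLM and its answer
    # compared against this ground-truth label.
    print("ground truth overlapping:", make_two_circle_sample())
```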
Acknowledgment: This work is supported by the National Science Foundation under Grant No. 2145767, Adobe Research, and the NaphCare Charitable Foundation.
Conference: ACCV 2024. Oral presentation (47/839 = 5.6% acceptance rate).
Press coverage:
- TechCrunch. Are visual AI models actually blind?
- ArsTechnica. Can you do better than top-level AI models on these basic vision tests?
- TechTalks. Why vision-language models fail on simple visual tests
- Yahoo! ‘Visual’ AI models might not see anything at all
- Substack. Vision Language Models are surprisingly blind!