Vision Language Models Are Blind

Pooyan Rahmanzadehgervi*, Logan Bolton*, Mohammad Reza Taesiri, Anh Totti Nguyen

Links: pdf | code | project page

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Sonnet-3.5 performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or are close together.
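For concreteness, the sketch below (not the paper's released code; the function name and parameters are hypothetical) shows how a BlindTest-style two-circle image might be generated and labeled. The ground truth follows directly from the geometry: two circles overlap exactly when the distance between their centers is less than the sum of their radii.

```python
# Minimal sketch (not the official BlindTest generator): render two circles
# and compute the ground-truth "do they overlap?" label from their geometry.
import math
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def make_two_circle_sample(c1, c2, r1, r2, path="two_circles.png"):
    """Draw two circles and return True if they overlap (centers closer than r1 + r2)."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    ax.add_patch(Circle(c1, r1, fill=False, color="tab:blue", linewidth=2))
    ax.add_patch(Circle(c2, r2, fill=False, color="tab:red", linewidth=2))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

    # Ground truth: overlap iff the distance between centers < sum of radii.
    return math.dist(c1, c2) < r1 + r2

if __name__ == "__main__":
    overlaps = make_two_circle_sample((4, 5), (6.5, 5), 1.5, 1.5)
    print("ground truth:", "overlapping" if overlaps else "non-overlapping")
```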

Acknowledgment: This work is supported by the National Science Foundation under Grant No. 2145767, Adobe Research, and the NaphCare Charitable Foundation.

Conference: ACCV 2024. Oral presentation (47/839 = 5.6% acceptance rate).


Figure 1: VLMs cannot reliably count the intersections between the blue and red plots.
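The ground truth for this task is purely geometric. A minimal sketch, assuming the two plots are piecewise-linear (as in the paper's figures) and using illustrative function names, counts intersections by testing every pair of segments:

```python
# Hedged sketch: count intersections between two piecewise-linear plots,
# the kind of ground truth the Figure 1 task relies on.
def segments_intersect(p1, p2, p3, p4):
    """Return True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
    d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)  # strict crossing; touching endpoints ignored

def count_intersections(poly_a, poly_b):
    """poly_a, poly_b: lists of (x, y) vertices of two piecewise-linear plots."""
    count = 0
    for i in range(len(poly_a) - 1):
        for j in range(len(poly_b) - 1):
            if segments_intersect(poly_a[i], poly_a[i + 1], poly_b[j], poly_b[j + 1]):
                count += 1
    return count

# Example: a rising blue line crosses a falling red line exactly once.
blue = [(0, 0), (1, 1), (2, 2)]
red = [(0, 2), (1, 1.2), (2, 0)]
print(count_intersections(blue, red))  # -> 1
```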

Figure 2: VLMs consistently fail when the distance between the two circles is small. However, even when the gap is large and clearly visible, GPT-4o remains unreliable. Sonnet-3.5 tends to conservatively answer “No” regardless of the actual distance between the two circles.

Figure 3: Counting nested squares is not easy for VLMs, even when there are only two squares (leftmost). The task becomes harder as the count increases from 2 to 5. Sonnet-3.5 performs the best (92.08%) but still falls short of the 100% expected of humans.

Figure 4: VLMs are often off by one or two when counting the rows and columns of an empty grid. The same is true when the grid is small (e.g., 3×4) and contains a word in each cell.

Figure 5: Counting overlapping circles is not easy for VLMs, regardless of circle colors, line widths, and image resolutions. Gemini-1.5 often predicts “5” regardless of the actual circle count, suggesting a strong bias towards the well-known Olympic logo.

Figure 6: Identifying the letter being circled is non-trivial for VLMs across both English words (Acknowledgement & Subdermatoglyphic) and a random string (tHyUiKaRbNqWeOpXcZvM). When making mistakes, VLMs tend to predict letters adjacent to the circled one.

Figure 7: Some VLMs (GPT-4o and Gemini-1.5) surprisingly fail even in extremely easy cases, across both line widths (leftmost). VLMs tend to perform worse as the number of paths connecting two stations increases.
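For illustration only (the paper's actual map generator and prompts may differ), the ground truth of "how many paths connect two stations" can be computed on a small graph. The sketch below counts simple paths with a depth-first search; the station names and edges are made up:

```python
# Hedged sketch: count simple (non-revisiting) paths between two stations
# in a small undirected graph, as a stand-in for the subway-map ground truth.
from collections import defaultdict

def count_simple_paths(edges, start, goal):
    """Count simple paths from start to goal in an undirected graph."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def dfs(node, visited):
        if node == goal:
            return 1
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                total += dfs(nxt, visited | {nxt})
        return total

    return dfs(start, {start})

# Example: stations A and D are connected directly and via B or C -> 3 paths.
edges = [("A", "D"), ("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
print(count_simple_paths(edges, "A", "D"))  # -> 3
```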