Vision Language Models Are Blind

Pooyan Rahmanzadehgervi*, Logan Bolton*, Mohammad Reza Taesiri, Anh Totti Nguyen

Links: pdf | code | project page

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Sonnet-3.5 performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or are close together.
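For concreteness, the sketch below (not the paper's released code; the function name and parameters are hypothetical) shows how a BlindTest-style two-circle image might be generated and labeled. The ground truth follows directly from the geometry: two circles overlap exactly when the distance between their centers is less than the sum of their radii.

```python
# Minimal sketch (not the official BlindTest generator): render two circles
# and compute the ground-truth "do they overlap?" label from their geometry.
import math
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def make_two_circle_sample(c1, c2, r1, r2, path="two_circles.png"):
    """Draw two circles and return True if they overlap (centers closer than r1 + r2)."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    ax.add_patch(Circle(c1, r1, fill=False, color="tab:blue", linewidth=2))
    ax.add_patch(Circle(c2, r2, fill=False, color="tab:red", linewidth=2))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

    # Ground truth: overlap iff the distance between centers < sum of radii.
    return math.dist(c1, c2) < r1 + r2

if __name__ == "__main__":
    overlaps = make_two_circle_sample((4, 5), (6.5, 5), 1.5, 1.5)
    print("ground truth:", "overlapping" if overlaps else "non-overlapping")
```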

Acknowledgment: This work is supported by the National Science Foundation under Grant No. 2145767, Adobe Research, and the NaphCare Charitable Foundation.

Conference: ACCV 2024. Oral presentation (47/839 = 5.6% acceptance rate).


Figure 1: VLMs cannot reliably count the intersections between the blue and red plots.
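The ground truth for this task is purely geometric. A minimal sketch, assuming the two plots are piecewise-linear (as in the paper's figures) and using illustrative function names, counts intersections by testing every pair of segments:

```python
# Hedged sketch: count intersections between two piecewise-linear plots,
# the kind of ground truth the Figure 1 task relies on.
def segments_intersect(p1, p2, p3, p4):
    """Return True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
    d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)  # strict crossing; touching endpoints ignored

def count_intersections(poly_a, poly_b):
    """poly_a, poly_b: lists of (x, y) vertices of two piecewise-linear plots."""
    count = 0
    for i in range(len(poly_a) - 1):
        for j in range(len(poly_b) - 1):
            if segments_intersect(poly_a[i], poly_a[i + 1], poly_b[j], poly_b[j + 1]):
                count += 1
    return count

# Example: a rising blue line crosses a falling red line exactly once.
blue = [(0, 0), (1, 1), (2, 2)]
red = [(0, 2), (1, 1.2), (2, 0)]
print(count_intersections(blue, red))  # -> 1
```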

Figure 2: VLMs consistently fail when the distance between the two circles is small. However, even when the gap is large and clearly visible, GPT-4o remains unreliable. Sonnet-3.5 tends to conservatively answer “No” regardless of the actual distance between the two circles.

Figure 3: Counting nested squares is not easy for VLMs, even when there are only two squares (leftmost). The task becomes harder as the count increases from 2 to 5. Sonnet-3.5 performs the best (92.08%) but still falls short of the 100% expected of humans.

Figure 4: VLMs are often off by one or two when counting the rows and columns of an empty grid. The same is true when the grid is small (e.g., 3×4) and contains a word in each cell.

Figure 5: Counting overlapping circles is not easy for VLMs, regardless of circle colors, line widths, and image resolutions. Gemini-1.5 often predicts “5” regardless of the actual circle count, suggesting a strong bias towards the well-known Olympic logo.

Figure 6: Identifying the letter being circled is non-trivial for VLMs across both English words (Acknowledgement & Subdermatoglyphic) and a random string (tHyUiKaRbNqWeOpXcZvM). When making mistakes, VLMs tend to predict letters adjacent to the circled one.

Figure 7: Some VLMs (GPT-4o and Gemini-1.5) surprisingly fail even in extremely easy cases, across both line widths (leftmost). VLMs tend to perform worse as the number of paths connecting two stations increases.
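For illustration only (the paper's actual map generator and prompts may differ), the ground truth of "how many paths connect two stations" can be computed on a small graph. The sketch below counts simple paths with a depth-first search; the station names and edges are made up:

```python
# Hedged sketch: count simple (non-revisiting) paths between two stations
# in a small undirected graph, as a stand-in for the subway-map ground truth.
from collections import defaultdict

def count_simple_paths(edges, start, goal):
    """Count simple paths from start to goal in an undirected graph."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def dfs(node, visited):
        if node == goal:
            return 1
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                total += dfs(nxt, visited | {nxt})
        return total

    return dfs(start, {start})

# Example: stations A and D are connected directly and via B or C -> 3 paths.
edges = [("A", "D"), ("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
print(count_simple_paths(edges, "A", "D"))  # -> 3
```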