Vision language models are blind
Abstract
The article examines the limitations of vision language models (VLMs), large language models augmented with vision capabilities, in performing simple visual tasks that are trivial for humans. It introduces a new benchmark, BlindTest, to systematically evaluate the low-level visual perception of four state-of-the-art VLMs: GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet.
Q&A
[01] Vision Language Models (VLMs)
1. What are the key findings about the vision capabilities of the tested VLMs?
- VLMs struggle with tasks such as counting the intersections of two line plots, judging whether two circles touch or overlap, recognizing which letter is circled in a word, counting overlapping or nested shapes, and tracing colored paths in a simplified subway map.
- The performance of VLMs on these simple visual tasks is surprisingly poor; the authors liken their vision to that of a person with myopia, who sees fine details as blurry.
- VLMs perform well on high-level vision benchmarks, but the authors argue that current benchmarks overlook low-level visual perception abilities.
2. How do the authors compare the performance of different VLMs on the BlindTest benchmark?
- The four tested VLMs vary in performance on the BlindTest tasks, with Claude-3.5 Sonnet generally performing best and GPT-4o and Gemini-1.5 Pro struggling the most.
- Factors like image resolution and shape orientation do not significantly impact the VLMs' performance, suggesting their limitations are not due to these rendering attributes.
- The authors also find that fine-tuning a VLM on the BlindTest tasks does not lead to substantial improvements, indicating the challenges may require more fundamental changes to model architecture or training (a minimal sketch of how such a query is posed to a VLM follows below).
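To make the evaluation setup concrete, here is a minimal sketch of posing one BlindTest-style question to a VLM. It assumes the OpenAI Python client and a task image saved locally; the helper name `ask_vlm`, the file name, and the prompt wording are illustrative, not the paper's exact protocol.

```python
# Minimal sketch of posing one BlindTest-style question to a VLM.
# Assumes the OpenAI Python client (`pip install openai`) with an API key
# in OPENAI_API_KEY; helper name, file name, and prompt are illustrative.
import base64

from openai import OpenAI

client = OpenAI()

def ask_vlm(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one image plus a question to a chat-style VLM and return the reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example, on the two-circle task:
# print(ask_vlm("circles.png", "Are the two circles touching? Answer yes or no."))
```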
[02] BlindTest Benchmark
1. What are the key tasks in the BlindTest benchmark?
- The benchmark consists of 7 tasks that test VLMs' ability to perceive simple geometric primitives (a generator for one of these tasks is sketched after the list):
- Counting the number of intersections between two line plots
- Determining whether two circles are touching or overlapping
- Identifying which letter is circled in a word
- Counting overlapping circles or pentagons
- Counting nested squares
- Counting the rows and columns in a grid
- Tracing and counting single-colored paths in a simplified subway map
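As a concrete illustration of the first task, the sketch below generates a line-intersection stimulus: two random polylines over a shared x grid, with the ground-truth intersection count derived from sign changes of their difference on each segment. It assumes NumPy and Matplotlib and mirrors the task description, not the paper's actual generation code.

```python
# Sketch of a BlindTest-style generator for the line-intersection task.
# Two random polylines share an x grid; on each x-interval both curves are
# straight, so they cross exactly where the difference y1 - y2 changes sign.
import numpy as np
import matplotlib.pyplot as plt

def make_line_intersection_task(n_points: int = 4, seed: int = 0) -> int:
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n_points)
    y1 = rng.uniform(0, 1, n_points)
    y2 = rng.uniform(0, 1, n_points)
    d = y1 - y2
    n_intersections = int(np.sum(d[:-1] * d[1:] < 0))  # strict sign changes

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(x, y1, color="blue", linewidth=2)
    ax.plot(x, y2, color="red", linewidth=2)
    ax.axis("off")
    fig.savefig("line_task.png", dpi=150)
    plt.close(fig)
    return n_intersections  # ground-truth label for the saved image

print(make_line_intersection_task())
```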
2. How are the tasks designed to be easy for humans but challenging for VLMs?
- The tasks involve only basic geometric shapes and minimal world knowledge, unlike existing benchmarks that test high-level vision and reasoning.
- The authors hypothesize that while these tasks are trivial for humans, they may be difficult for VLMs because the spatial information required is not easily expressible in natural language.
- The benchmark also controls for factors like image size, shape orientation, and line thickness to isolate the VLMs' visual perception capabilities (see the parameterized generator sketched below).
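A sketch of how such controls might look in code, assuming Matplotlib; the parameter names (`boundary_gap`, `line_width`, `dpi`) are ours for illustration, not the paper's. Exposing each factor as a separate parameter is what allows it to be varied independently and ruled in or out as a cause of failure.

```python
# Sketch of a controlled two-circle stimulus: boundary gap, line width,
# and image resolution are explicit, independently variable parameters.
# Parameter names are ours, not the paper's.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def make_circle_pair(boundary_gap: float, radius: float = 0.15,
                     line_width: float = 2.0, dpi: int = 100,
                     path: str = "circles.png") -> bool:
    """Draw two circles whose boundaries are `boundary_gap` apart
    (negative = overlapping); return the ground-truth touch/overlap label."""
    center_dist = 2 * radius + boundary_gap
    fig, ax = plt.subplots(figsize=(4, 4), dpi=dpi)
    for cx in (0.5 - center_dist / 2, 0.5 + center_dist / 2):
        ax.add_patch(Circle((cx, 0.5), radius, fill=False, linewidth=line_width))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path)
    plt.close(fig)
    return boundary_gap <= 0  # True if the circles touch or overlap

print(make_circle_pair(boundary_gap=0.02))  # small gap -> False (not touching)
```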
3. What are the key findings from the VLMs' performance on the BlindTest benchmark?
- VLMs struggle across all 7 tasks, often performing far below human expectations, with accuracies ranging from 23% to 93% (a simple answer-scoring sketch follows this list).
- VLMs tend to perform poorly when objects are close together, overlapping, or nested, suggesting their vision is "blurry" and unable to perceive fine spatial details.
- As noted above, fine-tuning a VLM on the BlindTest tasks does not lead to substantial improvements, reinforcing that the challenges may require more fundamental changes to model architecture or training.
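To round out the pipeline, here is a scoring sketch under the assumption that counting answers are graded by exact match on the first integer in the model's free-form reply; a real evaluation would need more robust parsing (spelled-out numbers, refusals), so this is only an illustration.

```python
# Sketch of scoring counting tasks: extract the first integer from each
# free-form reply and compare it to the ground-truth label.
import re

def parse_count(reply: str) -> int | None:
    m = re.search(r"\d+", reply)
    return int(m.group()) if m else None

def accuracy(replies: list[str], labels: list[int]) -> float:
    correct = sum(parse_count(r) == y for r, y in zip(replies, labels))
    return correct / len(labels)

# Second reply spells out the number, so this naive parser misses it.
print(accuracy(["There are 3 intersections.", "I count two."], [3, 2]))  # 0.5
```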