Chat BCG: Can AI Read Your Slide Deck?
Abstract
The article evaluates the accuracy of two large language models, GPT-4o and Gemini Flash-1.5, in reading and interpreting data from labeled and unlabeled charts. It aims to assess whether these advanced multimodal models can perform well on specific "reading and estimation" tasks, particularly in the context of visual charts in business decks.
Q&A
[01] Labeled Charts: What is the match rate across charts?
1. How accurately can multimodal models with advanced vision capabilities read data from labeled charts? The article found that both GPT-4o and Gemini Flash-1.5 are consistently inaccurate on specific types of labeled charts, such as those with multiple figures, stacked charts, and waterfall charts. The models' error rates on labeled charts are around 15%, which may be too high for high-stakes business applications.
2. Is there a consistent accuracy advantage for one model over the other? The article did not find a consistent accuracy advantage for either GPT-4o or Gemini Flash-1.5. Both models make similar types of mistakes, such as misreading labels or mislabeling negative numbers as positive. Across the labeled charts, neither model consistently outperformed the other.
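The article summary does not spell out how the match rate is computed. A minimal sketch, assuming it is the share of data points a model reads exactly right on a chart, might look like this. The function name `match_rate` and the sample values are hypothetical and are not taken from the article.

```python
def match_rate(truth, predicted):
    """Return the share of data points read exactly right (0.0 to 1.0)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("need at least one data point")
    # Count positions where the model's reading equals the source value.
    matches = sum(1 for t, p in zip(truth, predicted) if t == p)
    return matches / len(truth)


# Hypothetical waterfall chart: the model reads a negative value as positive,
# one of the mistake types the article describes.
truth = [120, -35, 40, 125]
predicted = [120, 35, 40, 125]

rate = match_rate(truth, predicted)
error_rate = 1 - rate
print(f"match rate: {rate:.0%}, error rate: {error_rate:.0%}")
```

Under this assumed definition, a single sign error on a four-point chart already drops the chart below a perfect match, which is why exact-match scoring is strict for labeled charts.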
[02] Unlabeled Charts: What is the magnitude of error across charts?
1. How accurately can the models estimate numerical data from unlabeled charts? On unlabeled charts, where the models have to estimate data points from the X and Y axes, both GPT-4o and Gemini Flash-1.5 struggle. They fail to exactly match the source-of-truth values at rates as high as 79% and 83%, respectively.
2. On average, how 'incorrect' is their estimation? The article measured the magnitude of the models' errors using Mean Absolute Percentage Error (MAPE). Both GPT-4o and Gemini Flash-1.5 had an average MAPE of around 55%, compared with an estimated human margin of error of 10-20%. The models' errors are driven both by small, consistent underestimations and by larger deviations from misreading labels or numbers.
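To make the MAPE figures concrete, here is a short sketch using the standard definition: the average of |actual - estimate| / |actual|, expressed as a percentage. The article's exact computation is not given in this summary, so treat the function `mape` and the sample numbers as illustrative assumptions.

```python
def mape(truth, predicted):
    """Mean Absolute Percentage Error, in percent (standard definition)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("need at least one data point")
    if any(t == 0 for t in truth):
        # Percentage error is undefined when the true value is zero.
        raise ValueError("truth values must be non-zero")
    total = sum(abs(t - p) / abs(t) for t, p in zip(truth, predicted))
    return 100 * total / len(truth)


# Hypothetical unlabeled bar chart: one small underestimate, one exact read,
# and one large miss such as a misread axis label.
truth = [50, 80, 200]
predicted = [45, 80, 300]

print(f"MAPE: {mape(truth, predicted):.1f}%")
```

Because each error is divided by the true value, a single estimate that is more than double the true value contributes over 100% for that point, so a chart-level MAPE above 100% is possible.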
[03] Conclusion
1. What are the key limitations of the current performance of these models? The article concludes that while GPT-4o and Gemini Flash-1.5 exhibit many advanced capabilities, they still require human oversight to achieve acceptable accuracy levels, particularly for high-stakes business applications. The models read only 7-8 of 15 labeled charts with 100% accuracy, and their performance on unlabeled charts is highly inconsistent, with errors exceeding 100% for more complex visuals.
2. Are these models ready to operate without human intervention for tasks demanding high precision? No. The article states that for any use case demanding high precision, these models are not yet ready to operate without human intervention. Given their current limitations in reading and interpreting charts, their output needs human review before it is relied upon.