ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
arXiv:2602.15769v1
Abstract: Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables in Markdown, JSON, and image form. While these models often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution (citation): the ability of a model to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution: while question answering accuracy is moderate, attribution accuracy is much lower across all models, and near random for JSON inputs. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than with images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
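To make the attribution task concrete, the following is a minimal sketch of how such an evaluation might be set up. The prompt wording, the gold-label structure, and the placeholder `query_model` call are assumptions for illustration; they are not the paper's actual benchmark or harness.

```python
# Minimal sketch of a table-attribution evaluation (hypothetical harness;
# the paper's actual prompts and scoring may differ).

markdown_table = """\
| Country | Capital  | Population (M) |
|---------|----------|----------------|
| France  | Paris    | 68.0           |
| Japan   | Tokyo    | 124.5          |
| Brazil  | Brasilia | 203.1          |
"""

question = "What is the capital of Japan?"

# Gold answer plus the 1-indexed row and the column name that support it.
gold = {"answer": "Tokyo", "rows": {2}, "cols": {"Capital"}}

prompt = (
    f"{markdown_table}\n"
    f"Question: {question}\n"
    "Answer the question, then cite the supporting evidence as JSON: "
    '{"answer": ..., "rows": [...], "cols": [...]}'
)

def score(prediction: dict, gold: dict) -> dict:
    """Score answer correctness and row/column attribution separately,
    mirroring the QA-vs-attribution gap the paper reports."""
    return {
        "answer_correct": prediction["answer"].strip() == gold["answer"],
        "rows_correct": set(prediction["rows"]) == gold["rows"],
        "cols_correct": set(prediction["cols"]) == gold["cols"],
    }

# result = score(query_model(prompt), gold)  # query_model is a placeholder
```

Scoring answer correctness and citation correctness separately, as in this sketch, is what makes the reported gap visible: a model can get `answer_correct` while failing both attribution checks.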
Executive Summary
This article evaluates the ability of Multimodal Large Language Models (mLLMs) to attribute their answers to the specific rows and columns of a table that support them. The study reveals a significant gap between question answering accuracy and attribution accuracy, with the latter near random for JSON inputs across all models. The results suggest that current mLLMs are unreliable at providing fine-grained, trustworthy attribution, which limits their usage in applications requiring transparency and traceability. The study also highlights differences in attribution accuracy across model families and table formats, with textual formats proving more challenging than images. These findings have important implications for the development and deployment of mLLMs in applications where attribution is crucial.
Key Points
- ▸ Multimodal Large Language Models (mLLMs) struggle with attribution in structured data
- ▸ Attribution accuracy is near random for JSON inputs across all models
- ▸ Attribution accuracy varies notably across model families and table formats
Merits
Contribution to the field
The study provides valuable insights into the limitations of mLLMs in structured data attribution, highlighting the need for further research and development in this area.
Methodological rigor
The study employs a systematic evaluation approach, assessing multiple mLLMs across different table formats and prompting strategies, which strengthens the robustness of the findings.
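To illustrate what "different table formats" can mean in practice, here is a minimal sketch that serializes one logical table into two of the textual formats named in the abstract, Markdown and JSON. The serialization conventions shown are assumptions; the paper's exact renderings are not specified here.

```python
import json

# One logical table, rendered in two of the textual formats named in the
# abstract. The exact conventions the paper uses are an assumption here.
header = ["Country", "Capital", "Population (M)"]
rows = [["France", "Paris", 68.0], ["Japan", "Tokyo", 124.5]]

# Markdown rendering: rows keep an explicit visual order.
md_lines = ["| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |"]
md_lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
markdown = "\n".join(md_lines)

# JSON rendering as a list of records. Explicit row indices are absent,
# which may be one reason attribution is hardest for JSON inputs.
records = [dict(zip(header, row)) for row in rows]
as_json = json.dumps(records, indent=2)

print(markdown)
print(as_json)
```

Holding the table content fixed while varying only the serialization, as above, is what lets format effects (e.g., the near-random attribution on JSON) be isolated from question difficulty.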
Demerits
Limited scope
The study focuses primarily on attribution accuracy, neglecting other important aspects of mLLMs, such as their ability to reason and generate text.
Lack of generalizability
The results may not be generalizable to other types of structured data, such as graphs or networks.
Expert Commentary
The study's findings on the limitations of mLLMs in structured data attribution are significant and warrant further investigation. They underscore the need for techniques that produce transparent and interpretable outputs, such as Explainable AI methods. The implications for developing and deploying mLLMs in applications requiring transparency and traceability are substantial: the limitations of these models must be weighed carefully before they are relied on in critical applications.
Recommendations
- ✓ Future research should focus on developing Explainable AI techniques to improve the transparency and trustworthiness of mLLMs.
- ✓ Developers and deployers of mLLMs should carefully evaluate the limitations of these models in attribution and consider alternative approaches to ensure transparency and traceability in critical applications.