
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

arXiv:2602.15758v1 Announce Type: new Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
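
For concreteness, here is a minimal sketch (not taken from the paper) of what one "modification chain" might look like: a base plotting script followed by edit instructions that each must be applied on top of the previous turn's code. The data, instructions, and file names below are hypothetical illustrations of the task format.

```python
# Hypothetical multi-turn chart-editing chain: each turn's instruction must be
# applied to the code produced in the previous turn, so the model has to track
# the accumulated edit history rather than regenerate the chart from scratch.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Turn 0: base chart the conversation starts from.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]
fig, ax = plt.subplots()
ax.plot(months, sales)
fig.savefig("turn0.png")

# Turn 1 (stylistic edit): "Make it a bar chart and add a title."
fig, ax = plt.subplots()
ax.bar(months, sales)
ax.set_title("Monthly Sales")
fig.savefig("turn1.png")

# Turn 2 (data-centric edit): "Drop January and show each month as a
# percentage of the remaining total."  Data transformations like this are
# where the paper reports frequent execution failures.
months2, sales2 = months[1:], sales[1:]
total = sum(sales2)
shares = [100 * s / total for s in sales2]
fig, ax = plt.subplots()
ax.bar(months2, shares)
ax.set_title("Monthly Sales (% of Feb-Apr total)")
ax.set_ylabel("Share (%)")
fig.savefig("turn2.png")
```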

Executive Summary

The article presents ChartEditBench, a benchmark for evaluating how well Multimodal Large Language Models (MLLMs) sustain multi-turn chart editing. The authors introduce an evaluation framework that goes beyond LLM-as-a-Judge metrics by incorporating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs show substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context: models handle stylistic edits well but frequently fail to execute data-centric transformations. The findings underscore the need for more robust evaluation of grounded, intent-aware multimodal programming and have direct implications for building models that support real-world exploratory data analysis.
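
A rough sketch of how the three evaluation signals described above could be combined follows; this is an illustration under assumed file names and helper functions, not the paper's actual implementation.

```python
# Illustrative (not the paper's) hybrid evaluation of one edit turn:
# 1) execution-based fidelity: does the candidate code run and produce a chart?
# 2) pixel-level visual similarity against a reference rendering;
# 3) a simple logical code check (e.g. the requested API calls are present).
import subprocess
import numpy as np
from PIL import Image

def execution_check(script_path: str) -> bool:
    """Run the candidate script in a subprocess; success = clean exit."""
    result = subprocess.run(["python", script_path], capture_output=True, timeout=60)
    return result.returncode == 0

def pixel_similarity(candidate_png: str, reference_png: str) -> float:
    """Mean per-pixel agreement in [0, 1] after resizing to a common shape."""
    ref = Image.open(reference_png).convert("RGB")
    cand = Image.open(candidate_png).convert("RGB").resize(ref.size)
    a = np.asarray(ref, dtype=np.float32) / 255.0
    b = np.asarray(cand, dtype=np.float32) / 255.0
    return float(1.0 - np.abs(a - b).mean())

def code_check(script_path: str, required_snippets: list[str]) -> bool:
    """Crude logical verification: the edit's key API calls appear in the code."""
    source = open(script_path).read()
    return all(snippet in source for snippet in required_snippets)

# Example usage with hypothetical file names for a single turn:
# ok = execution_check("candidate_turn1.py")
# sim = pixel_similarity("candidate_turn1.png", "reference_turn1.png")
# logic_ok = code_check("candidate_turn1.py", ["ax.bar", "set_title"])
```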

Key Points

  • ChartEditBench is a novel benchmark for evaluating MLLMs in multi-turn chart editing tasks
  • The evaluation framework integrates execution-based fidelity checks, pixel-level visual similarity, and logical code verification
  • State-of-the-art MLLMs perform poorly in multi-turn settings due to error accumulation and context breakdowns

Merits

Strength in evaluation framework

The proposed evaluation framework provides a more comprehensive assessment of MLLMs' performance in multi-turn chart editing tasks, addressing limitations of traditional metrics.

Insights into MLLMs' limitations

The study reveals substantial degradation in MLLMs' performance in multi-turn settings, showing how far current models are from grounded, intent-aware multimodal programming.

Demerits

Scope of evaluation

The study focuses on a specific domain (chart editing) and may not be representative of MLLMs' performance in other multimodal tasks.

Expert Commentary

The article presents a timely study that exposes how Multimodal Large Language Models degrade when chart edits must be sustained across multiple turns. The proposed evaluation framework is a meaningful contribution: by combining execution, visual, and code-level checks, it gives a more complete picture of model behavior than LLM-as-a-Judge scoring alone. The findings matter for anyone building multimodal assistants for real-world exploratory data analysis. That said, the evaluation is confined to chart editing, and future work should test whether the framework and the observed failure modes carry over to other multimodal tasks.

Recommendations

  • Future studies should investigate the application of the proposed evaluation framework to other multimodal tasks.
  • Researchers should focus on methods that keep MLLMs grounded and intent-aware across turns, addressing the error accumulation and context breakdowns observed in multi-turn editing.

Sources