
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

arXiv:2602.15758v1 Announce Type: new Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
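
For concreteness, here is a minimal sketch (not taken from the paper) of what one "modification chain" might look like: a base plotting script followed by edit instructions that each must be applied on top of the previous turn's code. The data, instructions, and file names below are hypothetical illustrations of the task format.

```python
# Hypothetical multi-turn chart-editing chain: each turn's instruction must be
# applied to the code produced in the previous turn, so the model has to track
# the accumulated edit history rather than regenerate the chart from scratch.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Turn 0: base chart the conversation starts from.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]
fig, ax = plt.subplots()
ax.plot(months, sales)
fig.savefig("turn0.png")

# Turn 1 (stylistic edit): "Make it a bar chart and add a title."
fig, ax = plt.subplots()
ax.bar(months, sales)
ax.set_title("Monthly Sales")
fig.savefig("turn1.png")

# Turn 2 (data-centric edit): "Drop January and show each month as a
# percentage of the remaining total."  Data transformations like this are
# where the paper reports frequent execution failures.
months2, sales2 = months[1:], sales[1:]
total = sum(sales2)
shares = [100 * s / total for s in sales2]
fig, ax = plt.subplots()
ax.bar(months2, shares)
ax.set_title("Monthly Sales (% of Feb-Apr total)")
ax.set_ylabel("Share (%)")
fig.savefig("turn2.png")
```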

Executive Summary

The article presents ChartEditBench, a benchmark for evaluating how well Multimodal Large Language Models (MLLMs) sustain multi-turn chart editing. The authors introduce an evaluation framework that goes beyond LLM-as-a-Judge metrics by incorporating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs show substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context: models handle stylistic edits well but frequently fail to execute data-centric transformations. The findings underscore the need for more robust evaluation of grounded, intent-aware multimodal programming and have direct implications for building models that support real-world exploratory data analysis.
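
A rough sketch of how the three evaluation signals described above could be combined follows; this is an illustration under assumed file names and helper functions, not the paper's actual implementation.

```python
# Illustrative (not the paper's) hybrid evaluation of one edit turn:
# 1) execution-based fidelity: does the candidate code run and produce a chart?
# 2) pixel-level visual similarity against a reference rendering;
# 3) a simple logical code check (e.g. the requested API calls are present).
import subprocess
import numpy as np
from PIL import Image

def execution_check(script_path: str) -> bool:
    """Run the candidate script in a subprocess; success = clean exit."""
    result = subprocess.run(["python", script_path], capture_output=True, timeout=60)
    return result.returncode == 0

def pixel_similarity(candidate_png: str, reference_png: str) -> float:
    """Mean per-pixel agreement in [0, 1] after resizing to a common shape."""
    ref = Image.open(reference_png).convert("RGB")
    cand = Image.open(candidate_png).convert("RGB").resize(ref.size)
    a = np.asarray(ref, dtype=np.float32) / 255.0
    b = np.asarray(cand, dtype=np.float32) / 255.0
    return float(1.0 - np.abs(a - b).mean())

def code_check(script_path: str, required_snippets: list[str]) -> bool:
    """Crude logical verification: the edit's key API calls appear in the code."""
    source = open(script_path).read()
    return all(snippet in source for snippet in required_snippets)

# Example usage with hypothetical file names for a single turn:
# ok = execution_check("candidate_turn1.py")
# sim = pixel_similarity("candidate_turn1.png", "reference_turn1.png")
# logic_ok = code_check("candidate_turn1.py", ["ax.bar", "set_title"])
```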

Key Points

  • ChartEditBench is a novel benchmark for evaluating MLLMs in multi-turn chart editing tasks
  • The evaluation framework integrates execution-based fidelity checks, pixel-level visual similarity, and logical code verification
  • State-of-the-art MLLMs perform poorly in multi-turn settings due to error accumulation and context breakdowns

Merits

Strength in evaluation framework

The proposed evaluation framework provides a more comprehensive assessment of MLLMs' performance in multi-turn chart editing tasks, addressing limitations of traditional metrics.

Insights into MLLMs' limitations

The study reveals substantial degradation in MLLMs' performance in multi-turn settings, showing how far current models are from grounded, intent-aware multimodal programming.

Demerits

Scope of evaluation

The study focuses on a specific domain (chart editing) and may not be representative of MLLMs' performance in other multimodal tasks.

Expert Commentary

The article presents a timely study that exposes how Multimodal Large Language Models degrade when chart edits must be sustained across multiple turns. The proposed evaluation framework is a meaningful contribution: by combining execution, visual, and code-level checks, it gives a more complete picture of model behavior than LLM-as-a-Judge scoring alone. The findings matter for anyone building multimodal assistants for real-world exploratory data analysis. That said, the evaluation is confined to chart editing, and future work should test whether the framework and the observed failure modes carry over to other multimodal tasks.

Recommendations

  • Future studies should investigate the application of the proposed evaluation framework to other multimodal tasks.
  • Researchers should focus on methods that keep MLLMs grounded and intent-aware across turns, addressing the error accumulation and context breakdowns observed in multi-turn editing.

Sources