Academic

OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

arXiv:2604.05514v1 Announce Type: new Abstract: The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (Viva). Unlike brittle syntax-based rules or pixel-level matching, Viva rewards the visual structure of rendered diagrams through a generative approach. Specifically, Viva actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3²Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our Viva-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

Executive Summary

OmniDiagram introduces a unified framework for diagram code generation, addressing the fragmentation in existing approaches by supporting diverse diagram types and task definitions. The paper’s core innovation is the Visual Interrogation Verifies All (Viva) mechanism, which leverages generative visual inquiries to provide fine-grained feedback on the visual fidelity of rendered diagrams, enabling a self-evolving training process without manually annotated ground-truth code. Complemented by the M3²Diagram dataset—comprising over 196,000 high-quality instances—OmniDiagram achieves state-of-the-art performance across multiple benchmarks. This work bridges the gap between code logic and visual accuracy, offering a scalable and adaptable solution for structured visualization tasks in both academic and industrial contexts.

Key Points

  • OmniDiagram presents a unified framework for diagram code generation, overcoming the limitations of narrow task formulations in prior work by integrating diverse diagram languages and task definitions.
  • The introduction of the Visual Interrogation Verifies All (Viva) mechanism marks a paradigm shift in reinforcement learning for diagram generation, replacing rigid syntax-based or pixel-level matching with dynamic, generative visual feedback to optimize visual fidelity.
  • The M3²Diagram dataset, with over 196,000 instances, is the first large-scale resource of its kind, enabling robust training and evaluation for diagram code generation tasks.
  • Experimental results demonstrate that OmniDiagram, when combined with Supervised Fine-Tuning (SFT) and Viva-based RL, achieves new state-of-the-art (SOTA) performance across multiple diagram code generation benchmarks.

Merits

Innovative Framework

OmniDiagram’s unified framework addresses the fragmentation in diagram code generation by supporting diverse languages and tasks, significantly expanding applicability compared to narrow, task-specific prior approaches.

Viva Mechanism

The Visual Interrogation Verifies All (Viva) mechanism introduces a generative, inquiry-based feedback loop for visual fidelity, replacing brittle syntax rules or pixel matching with adaptive, context-aware optimization.
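
To make the idea concrete, the inquiry-based feedback loop described above can be sketched as a reward function: render the candidate code, pose targeted visual questions about the result, and score the fraction answered as expected. This is a minimal illustrative sketch only; the helper names (`render`, `answer`, the toy renderer and VLM stand-ins) are assumptions, not the paper's actual API, and a real implementation would route the questions through a vision-language model.

```python
# Hypothetical sketch of a Viva-style reward. All function names are
# illustrative stand-ins, not the paper's implementation.

def viva_reward(code, inquiries, render, answer):
    """Score diagram code by visual interrogation.

    code: candidate diagram source (e.g. Mermaid, TikZ).
    inquiries: list of (question, expected_answer) pairs derived
        from the task specification.
    render: callable, code -> image representation (None on failure).
    answer: callable, (image, question) -> answer string (a VLM in
        practice; a stub below).
    Returns the fraction of visual inquiries answered correctly.
    """
    image = render(code)
    if image is None:  # unrenderable code receives zero reward
        return 0.0
    correct = sum(
        1 for question, expected in inquiries
        if answer(image, question).strip().lower() == expected.strip().lower()
    )
    return correct / len(inquiries)


# Toy stand-ins: a "renderer" and "VLM" over a fake arrow-chain diagram.
def toy_render(code):
    return {"nodes": code.split("->")} if "->" in code else None

def toy_answer(image, question):
    if question == "how many nodes?":
        return str(len(image["nodes"]))
    return "unknown"

reward = viva_reward(
    "A->B->C",
    [("how many nodes?", "3")],
    toy_render,
    toy_answer,
)  # all inquiries pass, so the reward is 1.0
```

The key property this captures is that the reward probes the *rendered structure* rather than the code text: two syntactically different programs that draw the same diagram would score identically.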

M3²Diagram Dataset

The introduction of M3²Diagram, a large-scale dataset with 196,000+ instances, fills a critical gap in the field, enabling more robust training and evaluation paradigms for diagram code generation.

Empirical Rigor

The paper demonstrates strong empirical performance, achieving SOTA results across benchmarks, which validates the effectiveness of the proposed framework and mechanisms.

Demerits

Dataset Bias

While M3²Diagram is a significant contribution, its representativeness across niche or highly specialized diagram types may be limited, potentially constraining generalization in edge cases.

Viva Mechanism Complexity

The generative inquiry process in Viva introduces computational overhead and complexity, which may pose scalability challenges for real-time or resource-constrained applications.

Dependency on Visual Fidelity

The reliance on visual interrogations for feedback assumes that visual accuracy correlates strongly with functional correctness, which may not always hold, particularly for diagrams where semantic precision outweighs visual rendering.

Expert Commentary

OmniDiagram represents a significant step forward at the intersection of code generation and structured visualization, addressing a critical gap in the field with its unified framework and innovative Viva mechanism. The paper’s contributions are both technically rigorous and conceptually novel, particularly in their departure from traditional syntax-based or pixel-level optimization strategies. The Viva mechanism, with its generative visual inquiries, introduces a self-evolving feedback loop that could redefine how visual fidelity is assessed in AI-generated outputs. However, the reliance on visual interrogations for feedback raises questions about the balance between computational efficiency and accuracy, particularly in real-time applications. Additionally, while the M3²Diagram dataset is a substantial contribution, its long-term utility will depend on its adaptability to emerging diagram types and use cases. Overall, OmniDiagram sets a new benchmark for the field, and its methodologies could inspire broader applications in AI-driven design and visualization.

Recommendations

  • Further research should explore hybrid optimization strategies that combine Viva’s generative feedback with traditional syntax-based validation to enhance robustness and computational efficiency.
  • The M3²Diagram dataset should be expanded to include more niche and specialized diagram types, as well as annotations for semantic correctness, to improve generalization and applicability in edge cases.
  • Industry adoption of OmniDiagram should be accompanied by case studies that evaluate its performance in real-world scenarios, particularly in domains where visual accuracy is critical (e.g., medical or engineering diagrams).
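
The first recommendation above, a hybrid of generative visual feedback and syntax-based validation, could take the shape of a weighted reward: a cheap syntactic check gates or supplements the more expensive visual score. This is a hedged sketch under assumed weights and a placeholder syntax check, not anything proposed in the paper.

```python
# Illustrative hybrid reward: blend a cheap syntax check with a
# Viva-style visual score. Weights and helpers are assumptions.

def balanced_brackets(code):
    """Minimal stand-in for a real parser: verify bracket balance."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

def hybrid_reward(code, visual_score, w_syntax=0.3, w_visual=0.7):
    """Weighted blend of syntactic validity and a visual score in [0, 1]."""
    syntax_ok = balanced_brackets(code)
    return w_syntax * float(syntax_ok) + w_visual * visual_score

# Valid code with a strong visual score:
r = hybrid_reward("graph { a -> b }", visual_score=0.9)
# 0.3 * 1.0 + 0.7 * 0.9 = 0.93
```

The appeal of this design is that the syntax term provides a dense, cheap signal early in training (when most generations fail to render), while the visual term dominates once candidates are routinely renderable.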

Sources

Original: arXiv - cs.AI