TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
arXiv:2603.03072v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
Executive Summary
This article presents TikZilla, a family of small open-source Qwen models (3B and 8B) for generating scientific figures from textual descriptions, trained with a two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). The authors first construct DaTikZ-V4, a Text-to-TikZ dataset more than four times larger and substantially cleaner than DaTikZ-V3, enriched with LLM-generated figure descriptions. For RL, they use an image encoder trained via inverse graphics to provide semantically faithful reward signals, exposing the model to the rendered semantics of its output. In human evaluations with over 1,000 judgments, TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, despite its much smaller size. Code, data, and models will be released, which could facilitate further research in scientific figure generation and adjacent areas such as computer-aided design and visualization.
Key Points
- ▸ TikZilla uses a two-stage pipeline of supervised fine-tuning and reinforcement learning for generating high-quality figures from textual descriptions.
- ▸ DaTikZ-V4 is a larger and higher-quality dataset for Text-to-TikZ, constructed by the authors.
- ▸ In human evaluations, TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation.
Merits
Strength in Addressing Dataset Limitations
The authors address the limitations of existing datasets for Text-to-TikZ by constructing a larger and higher-quality dataset, DaTikZ-V4.
Effective Use of Reinforcement Learning
Reinforcement learning with semantically faithful reward signals, derived from an image encoder trained via inverse graphics, exposes the model to the rendered semantics of its output and reduces errors such as looping, irrelevant content, and incorrect spatial relations.
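As a rough illustration of what a semantically faithful reward might look like, the sketch below scores a candidate figure by the cosine similarity between an embedding of the text description and an embedding of the rendered figure, rescaled to [0, 1]. The toy `encode` function is a deterministic placeholder standing in for the paper's inverse-graphics image encoder, whose architecture and API are not described in this summary.

```python
import zlib
import numpy as np

def encode(tokens, dim=64):
    """Toy embedding: a deterministic pseudo-random vector per token,
    mean-pooled and L2-normalized. This is a hypothetical stand-in for
    the paper's inverse-graphics encoder, not the authors' actual model."""
    vecs = []
    for tok in tokens:
        # Seed from a stable checksum so the same token always maps
        # to the same vector across runs.
        rng = np.random.default_rng(zlib.crc32(tok.encode("utf-8")))
        vecs.append(rng.standard_normal(dim))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def semantic_reward(caption_tokens, figure_tokens):
    """Cosine similarity between caption and figure embeddings,
    rescaled from [-1, 1] to [0, 1] for use as an RL reward."""
    sim = float(encode(caption_tokens) @ encode(figure_tokens))
    return 0.5 * (sim + 1.0)
```

A figure whose embedding matches the description scores near 1.0, while an unrelated one drifts toward 0.5 (near-orthogonal random embeddings); in the actual system this scalar would feed the RL policy update after each candidate TikZ program is rendered.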
Significant Improvements over Baseline Models
In human evaluation, TikZilla gains 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5, while using far fewer parameters.
Demerits
Limited Generalizability to Other Domains
The approach may not generalize well to other domains or applications beyond scientific figure generation.
Dependence on High-Quality Dataset
The performance of TikZilla relies heavily on the quality of the dataset used to train it, which may be challenging to obtain in other contexts.
Computational Requirements for Training
The training of TikZilla requires significant computational resources, which may be a barrier to adoption in certain settings.
Expert Commentary
The article presents a compelling approach to generating scientific figures from text, pairing a substantially larger and cleaner dataset with a two-stage SFT-and-RL pipeline whose reward signal reflects the rendered figure rather than the TikZ source alone. The reported gains over GPT-4o and parity with GPT-5 at a fraction of the model size are strong evidence for the approach. That said, the method depends on a high-quality domain dataset that may be hard to replicate elsewhere, and the RL stage, which requires rendering and scoring candidate figures, adds computational cost that could hinder adoption in resource-constrained settings. Even so, the results have clear implications for scientific figure generation and may transfer to adjacent areas such as computer-aided design and visualization.
Recommendations
- ✓ Future research should focus on developing more generalizable approaches that can be applied to other domains and applications beyond scientific figure generation.
- ✓ The authors should explore the use of other training datasets and evaluation metrics to further validate the effectiveness of the approach.