TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
arXiv:2603.03072v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
Executive Summary
This article presents TikZilla, a family of small open-source Qwen models (3B and 8B) for generating scientific figures from textual descriptions, trained with a two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). The authors first construct DaTikZ-V4, a Text-to-TikZ dataset more than four times larger and substantially cleaner than DaTikZ-V3, enriched with LLM-generated figure descriptions. For RL, they use an image encoder trained via inverse graphics to provide semantically faithful reward signals, exposing the model to the rendered semantics of its output. In human evaluations with over 1,000 judgments, TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, despite its much smaller size. Code, data, and models will be released, which could facilitate further research in scientific figure generation and adjacent areas such as computer-aided design and visualization.
Key Points
- ▸ TikZilla uses a two-stage pipeline of supervised fine-tuning and reinforcement learning for generating high-quality figures from textual descriptions.
- ▸ DaTikZ-V4 is a larger and higher-quality dataset for Text-to-TikZ, constructed by the authors.
- ▸ In human evaluations, TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation.
Merits
Strength in Addressing Dataset Limitations
The authors address the limitations of existing datasets for Text-to-TikZ by constructing a larger and higher-quality dataset, DaTikZ-V4.
Effective Use of Reinforcement Learning
Reinforcement learning with semantically faithful reward signals, derived from an image encoder trained via inverse graphics, exposes the model to the rendered semantics of its output and reduces errors such as looping, irrelevant content, and incorrect spatial relations.
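As a rough illustration of what a semantically faithful reward might look like, the sketch below scores a candidate figure by the cosine similarity between an embedding of the text description and an embedding of the rendered figure, rescaled to [0, 1]. The toy `encode` function is a deterministic placeholder standing in for the paper's inverse-graphics image encoder, whose architecture and API are not described in this summary.

```python
import zlib
import numpy as np

def encode(tokens, dim=64):
    """Toy embedding: a deterministic pseudo-random vector per token,
    mean-pooled and L2-normalized. This is a hypothetical stand-in for
    the paper's inverse-graphics encoder, not the authors' actual model."""
    vecs = []
    for tok in tokens:
        # Seed from a stable checksum so the same token always maps
        # to the same vector across runs.
        rng = np.random.default_rng(zlib.crc32(tok.encode("utf-8")))
        vecs.append(rng.standard_normal(dim))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def semantic_reward(caption_tokens, figure_tokens):
    """Cosine similarity between caption and figure embeddings,
    rescaled from [-1, 1] to [0, 1] for use as an RL reward."""
    sim = float(encode(caption_tokens) @ encode(figure_tokens))
    return 0.5 * (sim + 1.0)
```

A figure whose embedding matches the description scores near 1.0, while an unrelated one drifts toward 0.5 (near-orthogonal random embeddings); in the actual system this scalar would feed the RL policy update after each candidate TikZ program is rendered.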
Significant Improvements over Baseline Models
In human evaluation, TikZilla gains 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5, while using far fewer parameters.
Demerits
Limited Generalizability to Other Domains
The approach may not generalize well to other domains or applications beyond scientific figure generation.
Dependence on High-Quality Dataset
The performance of TikZilla relies heavily on the quality of the dataset used to train it, which may be challenging to obtain in other contexts.
Computational Requirements for Training
The training of TikZilla requires significant computational resources, which may be a barrier to adoption in certain settings.
Expert Commentary
The article presents a compelling approach to generating scientific figures from text, pairing a substantially larger and cleaner dataset with a two-stage SFT-and-RL pipeline whose reward signal reflects the rendered figure rather than the TikZ source alone. The reported gains over GPT-4o and parity with GPT-5 at a fraction of the model size are strong evidence for the approach. That said, the method depends on a high-quality domain dataset that may be hard to replicate elsewhere, and the RL stage, which requires rendering and scoring candidate figures, adds computational cost that could hinder adoption in resource-constrained settings. Even so, the results have clear implications for scientific figure generation and may transfer to adjacent areas such as computer-aided design and visualization.
Recommendations
- ✓ Future research should focus on developing more generalizable approaches that can be applied to other domains and applications beyond scientific figure generation.
- ✓ The authors should explore the use of other training datasets and evaluation metrics to further validate the effectiveness of the approach.