
How to Train Your Long-Context Visual Document Model


Austin Veselka

arXiv:2602.15257v1 Announce Type: cross Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long-context transfer to the reverse, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.
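Finding (i) above amounts to a training-data decision: sequences should be packed or truncated to the context length used at evaluation time rather than to a longer one. The abstract does not specify the packing scheme, so the helper below is only a minimal illustrative sketch of that choice.

```python
def pack_to_context(token_ids, eval_context_len, pad_id=0):
    """Fit a token sequence to the evaluation context length.

    Illustrates finding (i): train on context lengths that match the
    evaluation context length, instead of training on longer contexts.
    The truncate-or-pad scheme here is an assumption, not the paper's
    exact pipeline.
    """
    if len(token_ids) >= eval_context_len:
        # Truncate rather than training on a longer context.
        return token_ids[:eval_context_len]
    # Pad short sequences up to the shared context length.
    return token_ids + [pad_id] * (eval_context_len - len(token_ids))
```

In practice `eval_context_len` would be set to the benchmark's context budget (the paper trains at up to 344K tokens); the point is that it matches evaluation, not that it is maximized.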

Executive Summary

This article presents a comprehensive study of training long-context visual document models, targeting long-document visual question answering with measured transfer to long-context text. The authors systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations. They achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales and release MMLBD-C, a manually corrected version of the benchmark. Key findings cover the importance of matching training context lengths to evaluation context lengths, the benefit of training and evaluating with page indices, self-improvement via synthetic data pipelines, and the transfer of visual long-context training to long-context text performance. By providing reproducible training recipes and data pipelines, the study bridges a gap left by strong open-weight models whose recipes are not public.
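The page-index finding is the simplest to picture: each page image in the prompt is preceded by an explicit index marker. The abstract does not give the exact tag format or message schema, so both the `<page k>` tag and the dict layout below are assumptions, sketched in the interleaved image/text style common to vision-language chat APIs.

```python
def build_page_indexed_prompt(page_images, question):
    """Interleave page-index tags with page images before the question.

    Sketches the paper's finding that page indices boost long-document
    performance; the "<page k>" tag and the content-part dicts are
    hypothetical formats, not the authors' exact scheme.
    """
    parts = []
    for idx, image in enumerate(page_images, start=1):
        # Text marker announcing the page number of the next image.
        parts.append({"type": "text", "text": f"<page {idx}>"})
        parts.append({"type": "image", "image": image})
    # The question follows all indexed pages.
    parts.append({"type": "text", "text": question})
    return parts
```

With two pages this yields five parts: tag, image, tag, image, question. The same indices can then be referenced in answers ("see page 2"), which is presumably what makes the signal useful at evaluation time.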

Key Points

  • Comprehensive study on training long-context visual document models
  • Achieved state-of-the-art performance on MMLongBenchDoc for 24B and 32B parameter models
  • Released MMLBD-C, a manually corrected version of MMLongBenchDoc

Merits

Strength in Reproducibility

The authors provide reproducible training recipes and data pipelines, bridging the gap in long-context vision language models.

State-of-the-Art Performance

The authors achieve state-of-the-art performance on MMLongBenchDoc for both 24B and 32B parameter models.

Contributions to Research Community

The study provides new insights and findings that contribute to the development of long-context vision language models.

Demerits

Limited Generalizability

The study focuses on a specific domain and may not be generalizable to other domains or applications.

Computational Resources Required

Training and evaluating the models may require significant computational resources and expertise.

Expert Commentary

The study presents a comprehensive analysis of training long-context visual document models, yielding new insights that advance the development of these models. The authors' focus on reproducibility and transparency is commendable, and their release of MMLBD-C is a valuable resource for the research community. However, the study's limitations, such as its focus on a specific domain and the computational resources required, should be weighed carefully. The implications are significant for both practice and policy, and they underline the importance of continued research in this area.

Recommendations

  • Further studies should explore the generalizability of the findings to other domains and applications.
  • Researchers should prioritize reproducibility and transparency in their work, releasing data and code whenever possible.
