
Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal Models

arXiv:2602.21225v1 Announce Type: cross Abstract: We investigate whether progressive data scheduling -- a curriculum learning strategy that incrementally increases training data exposure (33% → 67% → 100%) -- yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT (ΔF1 = +0.023, p = 0.022, d_z = 3.83), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 (p = 0.621), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores (≥ 0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.

Executive Summary

The article examines the efficacy of curriculum learning, specifically progressive data scheduling (33% → 67% → 100% data exposure), across architecturally distinct document understanding models. Evaluating BERT (text-only) and LayoutLMv3 (multimodal) on the FUNSD and CORD benchmarks, the study finds that progressive scheduling cuts wall-clock training time by roughly 33% and, on FUNSD, yields a significant F1 improvement over a matched-compute baseline for BERT but not for LayoutLMv3. The findings position progressive scheduling as a reliable compute-reduction strategy, with curriculum-specific benefits depending on model capacity and task complexity.

Key Points

  • Progressive data scheduling (33% → 67% → 100%) reduces wall-clock training time by approximately 33%, tracking the drop from 10.0 to 6.67 effective epoch-equivalents of data.
  • Against a matched-compute baseline (Standard-7), BERT shows a significant F1 gain on FUNSD (ΔF1 = +0.023, p = 0.022), while LayoutLMv3 shows none (p = 0.621).
  • On CORD, all conditions converge to F1 ≥ 0.947, indicating a performance ceiling.
  • Schedule ablations show the efficiency gain derives from reduced data volume rather than example ordering; curriculum-specific benefits hinge on model capacity and task complexity.
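The schedule the paper describes can be sketched in a few lines. This is a minimal illustration of the 33% → 67% → 100% pacing and of the epoch-equivalent arithmetic quoted in the abstract, not the authors' implementation; the toy dataset size and even three-way phase split are assumptions.

```python
def progressive_subsets(dataset, fractions=(1/3, 2/3, 1.0)):
    """Return the growing training subsets for each curriculum phase."""
    return [dataset[: max(1, round(f * len(dataset)))] for f in fractions]

data = list(range(90))  # toy dataset of 90 examples (hypothetical size)
subsets = progressive_subsets(data)
print([len(s) for s in subsets])  # [30, 60, 90]

# Effective epoch-equivalents for a 10-epoch budget split into 3 equal phases:
# each phase runs 10/3 epochs over its fraction of the data, so the model
# sees (1/3 + 2/3 + 1) * 10/3 = 6.67 epoch-equivalents vs. 10.0 for standard training.
epoch_equivalents = sum(f * (10 / 3) for f in (1/3, 2/3, 1.0))
print(round(epoch_equivalents, 2))  # 6.67
```

This arithmetic is where the roughly 33% wall-clock saving comes from: 6.67 / 10.0 ≈ 0.67 of the standard data exposure.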

Merits

Rigorous Empirical Analysis

The study provides a thorough empirical analysis of curriculum learning across two architecturally distinct models and two benchmarks, reporting significance tests and effect sizes (e.g., d_z = 3.83 for BERT on FUNSD) rather than point estimates alone.

Controlled Experimentation

The matched-compute baselines (Standard-7) hold total gradient updates constant, ensuring that any observed benefit is attributable to the data schedule itself rather than to the accompanying reduction in compute.
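How a matched-compute baseline might be constructed can be sketched as follows: count the gradient updates the curriculum performs, then pick the number of standard full-data epochs that matches it. The batch size is a hypothetical choice for illustration (the 149 training forms match FUNSD, but the paper's exact hyperparameters are not given here).

```python
import math

# Hypothetical training configuration for illustration only.
n_examples, batch_size = 149, 8          # FUNSD has 149 training forms; batch size assumed
steps_per_epoch = math.ceil(n_examples / batch_size)

# Gradient updates under the 33/67/100 curriculum: 10 epochs in 3 equal phases,
# each phase iterating over its fraction of the data.
phase_epochs = 10 / 3
curriculum_steps = sum(
    math.ceil(f * n_examples / batch_size) * phase_epochs
    for f in (1/3, 2/3, 1.0)
)

# Standard full-data epochs that match that update count -> the "Standard-7" baseline.
matched_epochs = round(curriculum_steps / steps_per_epoch)
print(matched_epochs)  # 7
```

With the curriculum seeing 6.67 epoch-equivalents of data, the nearest whole-epoch standard run is 7 epochs, which is presumably where the "Standard-7" name comes from.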

Demerits

Limited Generalizability

The study focuses on only two models and two datasets, which may limit the generalizability of the findings to other models and tasks.

Performance Ceiling on CORD Dataset

The convergence of all conditions to F1 ≥ 0.947 on the CORD dataset suggests a performance ceiling, which may mask any benefit curriculum scheduling could otherwise provide.

Expert Commentary

The article presents a well-designed and executed study on the efficacy of curriculum learning in document understanding models. The rigorous empirical analysis and controlled experimentation provide strong evidence for the benefits of progressive data scheduling, particularly for models with limited capacity.

However, the study's focus on a narrow set of models and datasets limits the generalizability of the findings. Future research should explore the applicability of these findings to a broader range of models and tasks. Additionally, the performance ceiling observed on the CORD dataset underscores the need for more challenging benchmarks to fully evaluate the potential of curriculum learning. Overall, the study contributes valuable insights to the field of machine learning and curriculum learning, highlighting the importance of model capacity and task complexity in determining the effectiveness of these strategies.

Recommendations

  • Future studies should investigate the effectiveness of curriculum learning across a more diverse set of models and datasets to enhance the generalizability of the findings.
  • Researchers should develop more challenging benchmarks to better evaluate the potential benefits of curriculum learning, especially in scenarios where performance ceilings are not a limiting factor.
