Long-Context Encoder Models for Polish Language Understanding
arXiv:2603.12191v1 Announce Type: new Abstract: While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for the Polish language by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full-parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
Executive Summary
This paper introduces a novel encoder-only model tailored for Polish language understanding, capable of processing up to 8192 tokens, a significant advancement over conventional short-context encoders like BERT. The model leverages a two-stage training approach combining positional embedding adaptation and full-parameter continuous pre-training, with additional compressed variants produced via knowledge distillation. Evaluated across 25 tasks, including the newly introduced FinBench suite and the KLEJ benchmark, the model achieves the best average performance among Polish and multilingual models, with strong gains in long-context scenarios and parity on short texts. The work fills a critical gap in Polish NLP infrastructure by enabling scalable, efficient processing of long documents without compromising accuracy.
Key Points
- Introduction of a long-context Polish encoder (8192 tokens)
- Two-stage training with positional embedding adaptation and continuous pre-training
- Compressed variants via knowledge distillation
Merits
Context Expansion
The model successfully addresses the critical limitation of short context windows in encoder models by enabling processing of long documents (up to 8192 tokens) without degradation in performance on shorter texts.
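The paper does not spell out how the positional embeddings were adapted, but a common approach for extending encoders with learned absolute position embeddings is to linearly interpolate the existing table to the new length before continued pre-training. The sketch below illustrates that idea in plain Python; the function name and the 4-to-8 toy sizes are illustrative assumptions, not the authors' implementation.

```python
def interpolate_position_embeddings(old_emb, new_len):
    """Linearly interpolate a learned absolute position-embedding table
    (list of d-dimensional vectors) from len(old_emb) to new_len rows.

    Each new index i is mapped onto the old [0, old_len - 1] axis and its
    vector is a convex combination of the two nearest old rows, giving the
    extended model a sensible starting point for further pre-training.
    """
    old_len = len(old_emb)
    dim = len(old_emb[0])
    new_emb = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)  # position on the old axis
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        new_emb.append([
            (1 - frac) * old_emb[lo][d] + frac * old_emb[hi][d]
            for d in range(dim)
        ])
    return new_emb

# Toy example: stretch a 4-position table to 8 positions.
old = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new = interpolate_position_embeddings(old, 8)
```

In practice the same operation would be applied to a 512-row BERT-style table to produce an 8192-row table, after which the full model is trained further so the interpolated positions specialize.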
Evaluation Breadth
Comprehensive evaluation across diverse tasks—including specialized financial tasks (FinBench)—demonstrates generalizability and robustness, enhancing credibility.
Competitive Advantage
Achieves the best average performance among Polish and multilingual models, with significant gains on long-context tasks, validating the effectiveness of the training methodology.
Demerits
Training Complexity
The two-stage training process may increase computational overhead and development time compared to standard pre-training pipelines.
Knowledge Distillation Limitations
Compressed variants may introduce trade-offs in performance if distillation accuracy degrades significantly under extreme compression.
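The paper does not detail its distillation objective, but the standard formulation matches temperature-softened teacher and student output distributions with a KL-divergence loss (Hinton et al.'s recipe). The pure-Python sketch below shows that loss under those assumptions; the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a comparable magnitude across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# The loss is zero when the student matches the teacher exactly
# and grows as their predictions diverge.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The demerit above corresponds to this term failing to drive the student close enough to the teacher when the capacity gap is large; in that regime the temperature and the weighting against any hard-label loss become the key tuning knobs.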
Expert Commentary
The paper makes a meaningful contribution to the field by bridging a persistent gap in encoder-based NLP for underrepresented languages. The combination of a robust two-stage training strategy with knowledge-distilled variants offers a scalable blueprint for adapting encoder architectures beyond English. Notably, the authors' decision to evaluate on a newly created financial task suite (FinBench) signals a commitment to domain-specific relevance and application-driven innovation, a hallmark of high-impact research. While the computational cost of the two-stage approach warrants further scrutiny, the empirical results justify the effort. This work not only advances Polish NLP capabilities but also sets a precedent for similar adaptations in other less-resourced languages, potentially catalyzing a wave of localized encoder adaptations. As such, it merits recognition as a pivotal step toward equitable global NLP access.
Recommendations
- Extend evaluation to additional less-resourced languages using analogous task suites to validate generalizability.
- Publish model weights and training code openly to facilitate reproducibility and community adoption.