MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages
Abstract
Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
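The abstract mentions a reproducible filtering pipeline for MzansiText but does not describe its stages here. The sketch below is only a rough illustration of the kind of cleaning such a pipeline typically involves (length filtering, exact deduplication, language identification); the `detect_language` helper, its marker words, and all thresholds are invented placeholders, not the authors' implementation.

```python
import hashlib

def detect_language(text: str) -> str:
    """Toy stand-in for a real language identifier. The marker words are
    purely illustrative; a real pipeline needs a classifier covering all
    eleven official South African languages."""
    markers = {
        "xho": {"kwaye", "ukuba"},
        "zul": {"futhi", "ukuthi"},
        "afr": {"die", "van"},
        "eng": {"the", "and"},
    }
    words = set(text.lower().split())
    return max(markers, key=lambda lang: len(words & markers[lang]))

def filter_corpus(lines, allowed_langs, min_chars=50):
    """Illustrative cleaning pass: length filter, exact deduplication,
    then language filtering. All thresholds are placeholders."""
    seen = set()
    for line in lines:
        line = line.strip()
        if len(line) < min_chars:                   # drop short fragments
            continue
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest in seen:                          # drop exact duplicates
            continue
        seen.add(digest)
        if detect_language(line) in allowed_langs:  # keep target languages
            yield line
```

In practice, near-duplicate detection (e.g. MinHash) and a trained language identifier would replace the toy components above.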
Executive Summary
This research introduces MzansiText, a curated multilingual pretraining corpus, and MzansiLM, a 125M-parameter decoder-only language model trained from scratch for the eleven official written languages of South Africa. The study evaluates MzansiLM on natural language understanding and generation under three adaptation regimes and releases both the corpus and the model for reproducibility. The results show strong performance on supervised tasks, including 20.65 BLEU on isiXhosa data-to-text generation and 78.5% macro-F1 on isiXhosa news topic classification, but near-chance performance on few-shot reasoning. The contribution addresses the lack of publicly available decoder-only models for South Africa's low-resource languages and offers concrete guidance on adaptation strategies at small scale. The findings are relevant to natural language processing in linguistically diverse, multilingual regions such as South Africa.
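Beyond the 125M-parameter, decoder-only, trained-from-scratch description, this summary does not give MzansiLM's architecture. For orientation only, the sketch below instantiates a GPT-2-small-style configuration of roughly that size with Hugging Face transformers; the layer count, context length, and vocabulary size are assumptions for illustration, not the authors' published hyperparameters.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Assumed GPT-2-small-style shape (~124M parameters); MzansiLM's actual
# tokenizer and hyperparameters may differ.
config = GPT2Config(
    vocab_size=50_257,   # assumption: a multilingual tokenizer would likely differ
    n_positions=1024,    # assumed context length
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~124M with these settings
```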
Key Points
- ▸ MzansiText is a curated multilingual pretraining corpus for South African languages.
- ▸ MzansiLM is a 125M-parameter decoder-only language model trained from scratch.
- ▸ The study evaluates MzansiLM on natural language understanding and generation tasks using three adaptation regimes (illustrated in the sketch after this list).
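The three regimes differ mainly in what the model sees during finetuning: one task in one language, the same task pooled across languages, or a mixture of tasks formatted as instructions. The toy formatting functions below illustrate that distinction; the prompt template, field names, and the example record are invented for this sketch and do not reflect the paper's actual data format.

```python
def task_specific_example(record: dict) -> str:
    """Task-specific finetuning: plain input/target pairs for a single task,
    drawn from one language (monolingual) or pooled across languages
    (multilingual)."""
    return f"{record['input']}\n{record['target']}"

def instruction_example(record: dict) -> str:
    """Multi-task instruction finetuning: the task is named in a
    natural-language instruction so one model can cover many tasks."""
    return (
        f"### Instruction:\n{record['instruction']}\n"
        f"### Input:\n{record['input']}\n"
        f"### Response:\n{record['target']}"
    )

# Hypothetical isiXhosa news-classification record (text and label invented).
record = {
    "instruction": "Classify the topic of this isiXhosa news article.",
    "input": "Iqela lesizwe liphumelele umdlalo wokugqibela ...",
    "target": "ezemidlalo",  # 'sports'
}
print(instruction_example(record))
```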
Merits
Strength in Low-Resource Languages
The research addresses the significant challenge of developing language models for low-resource languages, providing a valuable contribution to the field of natural language processing.
Reproducibility and Open-Source Models
The release of MzansiText and MzansiLM enables reproducibility and facilitates further research on decoder-only language models for South African languages.
Demerits
Limited Performance in Few-Shot Reasoning
The study finds that few-shot reasoning remains challenging for MzansiLM, with performance near chance; the abstract notes that even much larger decoder-only models score near chance on these tasks.
Scope Limited to South African Languages
The research focuses on South African languages, which may limit the generalizability of the findings to other low-resource languages.
Expert Commentary
This research is a meaningful contribution to natural language processing for low-resource languages. MzansiText and MzansiLM give researchers and practitioners a reproducible, openly released baseline, lowering the barrier to further work on decoder-only models for South African languages. The practical and policy implications are clear: language technology for these languages depends on exactly this kind of open corpus and model release, and continued investment in such resources is needed. While the study is limited in language scope and in few-shot reasoning performance, it provides a solid foundation for future work on small-scale multilingual language models.
Recommendations
- ✓ Future research should focus on developing more robust decoder-only language models that can handle few-shot reasoning tasks effectively.
- ✓ The development of language models for other low-resource languages should be prioritized to address the significant challenges in natural language processing for multilingual regions.
Sources
Original: arXiv:2603.20732v1 (cs.CL)