Academic

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

arXiv:2603.03508v1. Abstract: The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models at the sub-billion-parameter range.
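
To make the corpus-construction step concrete, here is a minimal sketch of a two-stage filter in the style the abstract describes: cheap heuristics first, then a learned "LLM-as-a-judge" pass on the survivors. All rules, thresholds, and the judge placeholder below are illustrative assumptions, not the paper's actual GigaLekh pipeline.

```python
# Minimal sketch of a two-stage corpus filter: cheap heuristics first,
# then a learned "LLM-as-a-judge" pass on the survivors.
# All rules, thresholds, and the judge placeholder are illustrative
# assumptions, not the paper's actual GigaLekh pipeline.
import re
from typing import Iterable, Iterator

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def heuristic_ok(doc: str,
                 min_chars: int = 200,
                 min_hindi_ratio: float = 0.5,
                 max_dup_line_ratio: float = 0.3) -> bool:
    """Rule-based checks: minimum length, share of Devanagari characters,
    and fraction of duplicated lines."""
    if len(doc) < min_chars:
        return False
    if len(DEVANAGARI.findall(doc)) / len(doc) < min_hindi_ratio:
        return False
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_dup_line_ratio:
        return False
    return True

def llm_judge_score(doc: str) -> float:
    """Placeholder for an LLM-as-a-judge quality score in [0, 1].
    A real pipeline would prompt a strong model to rate the document;
    this crude sentence-length proxy only keeps the sketch runnable."""
    sentences = [s for s in re.split(r"[।.!?]", doc) if s.strip()]
    avg_len = sum(len(s) for s in sentences) / max(len(sentences), 1)
    return min(avg_len / 60.0, 1.0)

def filter_corpus(docs: Iterable[str], judge_threshold: float = 0.5) -> Iterator[str]:
    """Keep documents that pass both the heuristic and the learned filter."""
    for doc in docs:
        if heuristic_ok(doc) and llm_judge_score(doc) >= judge_threshold:
            yield doc

if __name__ == "__main__":
    sample = ["यह एक उदाहरण हिंदी दस्तावेज़ है। " * 20, "too short", "spam spam " * 100]
    print(f"kept {len(list(filter_corpus(sample)))} of {len(sample)} documents")
```

In a real pipeline the heuristic pass would run over the full crawl and the judge model would only score the (much smaller) set of survivors, since judge calls dominate the cost.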

Executive Summary

The paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained from scratch to address linguistic inequalities in NLP. Developed through a transparent, reproducible pipeline built on a high-quality Hindi corpus (GigaLekh) and bilingual augmentation with curated English data, LilMoo outperforms comparably sized multilingual baselines, demonstrating the effectiveness of well-designed language-specific pretraining at the sub-billion-parameter scale.

Key Points

  • Introduction of LilMoo, a Hindi language model trained from scratch
  • Use of a fully transparent and reproducible pipeline for development
  • Outperformance of comparably sized multilingual baselines (Qwen2.5-0.5B, Qwen3-0.6B); a hedged evaluation sketch follows this list
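
The sketch below shows one way such a head-to-head comparison can be run: scoring the two baselines named in the abstract by perplexity on a Hindi sentence with Hugging Face transformers. This is not the paper's evaluation suite, and no public LilMoo checkpoint is named in the abstract, so only the Qwen baselines appear here.

```python
# Hedged sketch: compare small causal LMs on Hindi text by perplexity.
# This is NOT the paper's evaluation suite; it only illustrates the kind
# of like-for-like comparison the abstract describes. The model IDs are
# the public Qwen baselines; LilMoo has no checkpoint name in the abstract.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    """Perplexity of `text` under a causal LM (lower is better)."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))

hindi_sample = "भारत एक विशाल और विविधतापूर्ण देश है, जहाँ अनेक भाषाएँ बोली जाती हैं।"
for model_id in ["Qwen/Qwen2.5-0.5B", "Qwen/Qwen3-0.6B"]:
    print(f"{model_id}: {perplexity(model_id, hindi_sample):.1f}")
```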

Merits

Language-Specific Pretraining

Because LilMoo is trained from scratch on a curated Hindi corpus rather than adapted from an opaque multilingual foundation, its training data and recipe can be optimized specifically for Hindi, which the reported evaluations suggest yields better performance than comparably sized multilingual baselines.
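
To illustrate what "from scratch" means at this scale, the sketch below trains a Hindi-specific tokenizer and then instantiates a randomly initialized decoder-only model of roughly 0.6B parameters. The architecture, vocabulary size, and every hyperparameter shown are illustrative assumptions; the abstract does not describe LilMoo's actual tokenizer or configuration.

```python
# Sketch of a from-scratch setup for a ~0.6B decoder-only model with a
# Hindi-specific tokenizer. Every hyperparameter here is an illustrative
# assumption; LilMoo's real tokenizer and architecture are not given in
# the abstract.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import LlamaConfig, LlamaForCausalLM

# 1. Train a BPE tokenizer on Hindi text so Devanagari is segmented into
#    meaningful subwords instead of the byte fragments a multilingual
#    vocabulary often produces. (Tiny in-memory stand-in for corpus shards.)
hindi_snippets = ["भारत एक विशाल देश है।", "हिंदी में प्रशिक्षण डेटा का उदाहरण।"]
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator(hindi_snippets, trainer=trainer)

# 2. Randomly initialize a decoder-only model of roughly 0.6B parameters;
#    "from scratch" means no weights are inherited from a multilingual model.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=1152,
    intermediate_size=4608,
    num_hidden_layers=24,
    num_attention_heads=18,
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```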

Transparency and Reproducibility

The transparent and reproducible pipeline behind LilMoo exposes every stage (corpus filtering, bilingual augmentation, training recipes) to scrutiny and makes replication feasible even in limited compute environments, supporting both accountability and future research.

Demerits

Limited Scope

LilMoo's focus on Hindi means its benefits do not transfer directly to other languages; each low-resource language would need a comparable corpus-construction and pretraining effort, so other underrepresented languages remain unaddressed until similar work targets them.

Expert Commentary

The introduction of LilMoo marks a significant step towards addressing linguistic inequalities in NLP. By demonstrating the effectiveness of language-specific pretraining, the authors highlight the importance of tailored approaches for low-resource languages. However, the limited scope of the model raises questions about its potential impact on the broader NLP landscape. Further research is needed to explore the applicability of LilMoo's approach to other languages and to fully realize its potential for promoting linguistic diversity in NLP.

Recommendations

  • Future research should focus on developing similar language-specific models for other low-resource languages to promote linguistic diversity in NLP.
  • The development of transparent and reproducible pipelines for language model development should be prioritized to ensure accountability and facilitate future research.

Sources

  • Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi (arXiv:2603.03508v1)