
Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Reuben Chagas Fernandes, Gaurang S. Patkar

arXiv:2603.23529v1 Abstract: Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training-data scarcity compounded by high script diversity across the Devanagari, Romi, and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated with Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures, including Llama 3.1, Qwen2.5, and Gemma 3, alongside proprietary closed-source models. Our primary contribution is Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.

Executive Summary

The article presents a substantial contribution to low-resource language modeling by focusing on Konkani, a language with significant script diversity (Devanagari, Romi, Kannada) and limited training data. The authors introduce Konkani-Instruct-100k, a synthetic instruction-tuning dataset generated via Gemini 3, and develop Konkani LLM, a series of fine-tuned models optimized for regional linguistic nuances. The study establishes baseline benchmarks for leading open-weights and proprietary architectures, while also developing the Multi-Script Konkani Benchmark for cross-script evaluation. The results demonstrate consistent performance gains in machine translation, with Konkani LLM either matching or surpassing proprietary baselines in several settings. This work not only advances the technical capabilities of LLMs for low-resource languages but also sets a methodological precedent for addressing script diversity in multilingual contexts.

Key Points

  • Konkani’s low-resource status is exacerbated by its multi-script orthographies (Devanagari, Romi, Kannada), complicating standard LLM training and evaluation.
  • The introduction of Konkani-Instruct-100k, a synthetic instruction-tuning dataset generated via Gemini 3, represents a scalable solution to data scarcity in low-resource languages.
  • Konkani LLM, fine-tuned on this dataset, demonstrates superior performance in machine translation compared to base models and competes favorably with proprietary baselines.
  • The development of the Multi-Script Konkani Benchmark addresses a critical gap in cross-script evaluation, enabling standardized assessment of linguistic performance across orthographies.
  • The study evaluates both open-weights (Llama 3.1, Qwen2.5, Gemma 3) and proprietary closed-source models, providing a comprehensive baseline for future research in low-resource language modeling.
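To make the multi-script data challenge concrete, the sketch below shows what a record in an instruction-tuning set like Konkani-Instruct-100k might look like, with the same instruction paired once per orthography. The field names, file name, and placeholder outputs are illustrative assumptions; the paper's actual schema is not reproduced here.

```python
import json

# Hypothetical multi-script instruction-tuning records. Each record tags its
# orthography so training and evaluation can be stratified by script.
records = [
    {
        "instruction": "Translate the following English sentence to Konkani.",
        "input": "Good morning.",
        "output": "<Konkani translation in Devanagari script>",
        "script": "Devanagari",
    },
    {
        "instruction": "Translate the following English sentence to Konkani.",
        "input": "Good morning.",
        "output": "<Konkani translation in Romi (Latin) script>",
        "script": "Romi",
    },
]

# Instruction-tuning sets are commonly stored as JSON Lines: one record per line.
with open("konkani_instruct_sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # ensure_ascii=False keeps Devanagari/Kannada text readable in the file.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Tagging every record with its script is what makes per-orthography benchmarking (as in the Multi-Script Konkani Benchmark) possible downstream.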

Merits

Innovative Dataset Construction

The creation of Konkani-Instruct-100k via synthetic data generation using Gemini 3 is a novel and scalable approach to addressing data scarcity in low-resource languages, overcoming the limitations of traditional corpora.
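A pipeline of this kind typically crosses task types with target scripts so that generation requests cover every cell of the grid. The sketch below shows only that prompt-construction step; the task list, template wording, and counts are illustrative assumptions, not the authors' actual pipeline, and the calls to the generator model itself are omitted.

```python
# Illustrative task/script grid for requesting synthetic instruction-response
# pairs from a generator model such as Gemini 3 (template text is hypothetical).
TASKS = ["translation", "summarization", "question answering"]
SCRIPTS = ["Devanagari", "Romi", "Kannada"]

PROMPT_TEMPLATE = (
    "Generate {n} instruction-response pairs in Konkani written in the "
    "{script} script for the task of {task}. Return JSON objects with the "
    "keys 'instruction', 'input', and 'output'."
)

def build_prompts(n_per_cell: int = 5) -> list[str]:
    """Cross every task with every script so script coverage stays balanced."""
    return [
        PROMPT_TEMPLATE.format(n=n_per_cell, script=script, task=task)
        for task in TASKS
        for script in SCRIPTS
    ]

prompts = build_prompts()
print(len(prompts))  # 3 tasks x 3 scripts = 9 prompts
```

Balancing the grid up front avoids a common failure mode of synthetic pipelines, where the dominant script (here, Devanagari) crowds out the others in the generated data.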

Comprehensive Benchmarking

The study establishes rigorous baseline benchmarks for both open-weights and proprietary models, offering a transparent and reproducible framework for future evaluations in Konkani and similar languages.

Cross-Script Evaluation Framework

The Multi-Script Konkani Benchmark provides a standardized method for assessing performance across diverse orthographies, addressing a critical gap in multilingual LLM research.
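Any cross-script benchmark needs a reliable way to tell which orthography a given output is written in. The Unicode block ranges below (Devanagari U+0900–U+097F, Kannada U+0C80–U+0CFF, Latin for Romi) make this mechanical; this is a generic sketch of such a classifier, not the benchmark's actual code.

```python
def detect_script(text: str) -> str:
    """Classify Konkani text by its dominant Unicode block.

    Devanagari: U+0900-U+097F; Kannada: U+0C80-U+0CFF;
    Romi (the Latin-based orthography): basic and extended Latin letters.
    """
    counts = {"Devanagari": 0, "Kannada": 0, "Romi": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            counts["Devanagari"] += 1
        elif 0x0C80 <= cp <= 0x0CFF:
            counts["Kannada"] += 1
        elif ch.isalpha() and cp < 0x0250:  # Latin and Latin Extended ranges
            counts["Romi"] += 1
    # Majority vote over characters; punctuation and digits are ignored.
    return max(counts, key=counts.get)
```

A check like this can both route benchmark items to per-script scoring and flag a failure mode specific to multi-script languages: a model answering in the wrong orthography.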

Performance Gains in Translation

Konkani LLM achieves consistent improvements in machine translation over base models and demonstrates competitiveness with proprietary baselines, validating the effectiveness of the proposed approach.
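Machine-translation gains of this kind are usually reported with character-level metrics such as chrF, which suit morphologically rich, multi-script languages better than word-level BLEU. The function below is a simplified chrF-style score for illustration only; the paper's actual evaluation setup is not specified here, and standard implementations (e.g. sacreBLEU's chrF) differ in details such as whitespace handling.

```python
from collections import Counter

def char_ngram_f(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style character n-gram F-score in [0, 1].

    Averages n-gram precision and recall over n = 1..max_n, then combines
    them with an F-beta that weights recall beta times as much as precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        if hyp_ngrams:
            precisions.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(overlap / sum(ref_ngrams.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

Because it operates on characters rather than tokens, a metric like this applies uniformly across Devanagari, Romi, and Kannada outputs without a script-specific tokenizer.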

Demerits

Synthetic Data Limitations

While synthetic data generation via Gemini 3 mitigates data scarcity, it may introduce biases or inaccuracies inherent in the model’s training, potentially affecting the quality and representativeness of Konkani-Instruct-100k.

Limited Generalizability

The study focuses exclusively on Konkani, a language with specific linguistic and orthographic characteristics. The applicability of the proposed methods to other low-resource languages with different linguistic structures remains untested.

Proprietary Model Dependence

The reliance on proprietary models (e.g., Gemini 3) for dataset generation introduces potential black-box dependencies and reproducibility concerns, particularly for researchers without access to such models.

Evaluation Scope Constraints

The benchmarks and evaluations are confined to machine translation and cross-script performance. Broader linguistic tasks (e.g., reasoning, summarization) are not addressed, limiting the scope of the study’s conclusions.

Expert Commentary

The authors have made a commendable contribution to the field of low-resource language modeling by addressing Konkani’s unique challenges with a multi-faceted approach. The introduction of Konkani-Instruct-100k and the subsequent fine-tuning of Konkani LLM represent significant advancements, particularly in the context of script diversity and data scarcity. The study’s rigorous benchmarking of both open-weights and proprietary models provides a valuable baseline for future research, while the development of the Multi-Script Konkani Benchmark is a critical step toward standardized cross-script evaluation. However, the reliance on synthetic data generation via a proprietary model introduces potential biases and reproducibility concerns, which warrant further scrutiny. Additionally, the study’s narrow focus on machine translation limits the broader applicability of its findings. That said, the methodological innovations, particularly the synthetic data pipeline and the cross-script benchmark, offer a blueprint for tackling similar challenges in other low-resource languages. This work not only advances Konkani-specific NLP but also contributes to the broader discourse on multilingual LLMs, making it a notable contribution to the field.

Recommendations

  • Extend the evaluation framework to include broader linguistic tasks such as reasoning, summarization, and question-answering to assess the generalizability of Konkani LLM beyond machine translation.
  • Explore alternative methods for synthetic data generation that minimize reliance on proprietary models, such as leveraging open-weights LLMs or crowdsourced data to enhance reproducibility and reduce bias.
  • Develop a community-driven effort to adapt and expand the Multi-Script Konkani Benchmark for other languages with script diversity, fostering collaboration and standardization in multilingual NLP evaluation.
  • Conduct a detailed analysis of the biases and inaccuracies introduced by synthetic data generation, particularly those stemming from the use of proprietary models like Gemini 3, to ensure the robustness of Konkani-Instruct-100k.
  • Engage with Konkani-speaking communities and linguists to validate the dataset and models, ensuring cultural and linguistic authenticity while mitigating potential biases in the synthetic data.

Sources

Original: arXiv - cs.CL