
Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

arXiv:2603.13045v1 Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on many low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1,400 language directions on the Flores-101 dataset.
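The reward design described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the function names, weights, and the hard language gate are illustrative assumptions. The idea is to blend a QE score with word-alignment coverage and to zero out the reward when the output is in the wrong language, so RL cannot exploit QE "holes" such as off-target or degenerate translations.

```python
# Hypothetical sketch of WALAR-style reward shaping (names and weights are
# illustrative, not from the paper). A QE score alone can be gamed; adding
# word-alignment coverage and a language-identity gate patches common holes.

def word_alignment_coverage(source_tokens, target_tokens, align_pairs):
    """Fraction of source tokens linked to at least one target token.

    align_pairs is a list of (source_index, target_index) links, e.g. as
    produced by a statistical word aligner.
    """
    covered = {i for i, _ in align_pairs}
    return len(covered) / max(len(source_tokens), 1)

def shaped_reward(qe_score, source_tokens, target_tokens, align_pairs,
                  predicted_lang, expected_lang,
                  qe_weight=0.6, align_weight=0.4):
    # Hard gate: output in the wrong language earns no reward, closing the
    # "off-target translation" hole of source-based QE models.
    if predicted_lang != expected_lang:
        return 0.0
    coverage = word_alignment_coverage(source_tokens, target_tokens, align_pairs)
    # Blend QE quality with alignment coverage so empty or source-copied
    # outputs, which QE models sometimes over-score, are penalized.
    return qe_weight * qe_score + align_weight * coverage
```

A fully aligned, correct-language output keeps most of its QE score, while an output that drops source content or switches language loses reward even if the QE model rates it highly.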

Executive Summary

This article introduces WALAR, a reinforcement training method that leverages only monolingual text to improve large language models' (LLMs) translation capabilities for low-resource languages. By addressing "holes" (failure modes) in existing source-based multilingual quality estimation models, WALAR improves performance on low-resource languages while maintaining it on high-resource ones. The resulting model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, across 1,400 language directions on the Flores-101 benchmark.

Key Points

  • WALAR uses monolingual text for reinforcement training
  • Techniques such as word alignment and language alignment mitigate 'holes' in quality estimation models
  • WALAR achieves significant improvements over existing models on low-resource languages
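The first key point, monolingual-only reinforcement training, can be sketched at a high level. This is a toy illustration under stated assumptions, not the paper's training loop: `generate` stands in for the LLM policy and `reward_fn` for the shaped QE reward, and the advantage computation is a bare REINFORCE-with-baseline step.

```python
import random

# Illustrative sketch of monolingual-only RL for translation (not the paper's
# implementation): sample a monolingual source sentence, let the policy
# produce a candidate translation, score it with a reward model, and compute
# an advantage to reinforce high-reward outputs. No parallel data is needed.

def rl_step(monolingual_corpus, generate, reward_fn, baseline=0.5):
    src = random.choice(monolingual_corpus)   # monolingual source only
    candidate = generate(src)                 # policy (stand-in for the LLM)
    reward = reward_fn(src, candidate)        # e.g. shaped QE-based reward
    advantage = reward - baseline             # positive -> reinforce output
    return src, candidate, advantage
```

In a real system, the advantage would weight a policy-gradient update of the LLM; here the point is only that the loop consumes monolingual text plus a reward model, with no reference translations.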

Merits

Improved Performance on Low-Resource Languages

WALAR's ability to leverage monolingual text enables significant improvements in translation capabilities for low-resource languages.

Demerits

Potential Overfitting to Monolingual Data

The reliance on monolingual text may lead to overfitting, potentially affecting the model's performance on high-resource languages or in certain translation tasks.

Expert Commentary

The introduction of WALAR marks a significant advancement in multilingual machine translation. By addressing the limitations of existing quality estimation models, it demonstrates how reinforcement learning can improve LLMs' performance on low-resource languages. However, further research is needed to fully explore the capabilities and limitations of this approach, particularly with regard to potential overfitting and the impact on high-resource languages. As the field continues to evolve, it is essential to consider the practical and policy implications of such advancements.

Recommendations

  • Further investigation into the potential applications and limitations of WALAR's approach
  • Exploration of strategies to mitigate potential overfitting and ensure the model's performance on high-resource languages