
Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

arXiv:2603.13045v1 Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on many low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1,400 language directions on the Flores-101 dataset.
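The reward design described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the function names, weights, and the hard language gate are illustrative assumptions. The idea is to blend a QE score with word-alignment coverage and to zero out the reward when the output is in the wrong language, so RL cannot exploit QE "holes" such as off-target or degenerate translations.

```python
# Hypothetical sketch of WALAR-style reward shaping (names and weights are
# illustrative, not from the paper). A QE score alone can be gamed; adding
# word-alignment coverage and a language-identity gate patches common holes.

def word_alignment_coverage(source_tokens, target_tokens, align_pairs):
    """Fraction of source tokens linked to at least one target token.

    align_pairs is a list of (source_index, target_index) links, e.g. as
    produced by a statistical word aligner.
    """
    covered = {i for i, _ in align_pairs}
    return len(covered) / max(len(source_tokens), 1)

def shaped_reward(qe_score, source_tokens, target_tokens, align_pairs,
                  predicted_lang, expected_lang,
                  qe_weight=0.6, align_weight=0.4):
    # Hard gate: output in the wrong language earns no reward, closing the
    # "off-target translation" hole of source-based QE models.
    if predicted_lang != expected_lang:
        return 0.0
    coverage = word_alignment_coverage(source_tokens, target_tokens, align_pairs)
    # Blend QE quality with alignment coverage so empty or source-copied
    # outputs, which QE models sometimes over-score, are penalized.
    return qe_weight * qe_score + align_weight * coverage
```

A fully aligned, correct-language output keeps most of its QE score, while an output that drops source content or switches language loses reward even if the QE model rates it highly.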

Executive Summary

This article introduces WALAR, a reinforcement training method that leverages only monolingual text to improve large language models' (LLMs) translation capabilities for low-resource languages. By addressing "holes" (failure modes) in existing source-based multilingual quality estimation models, WALAR improves performance on low-resource languages while maintaining it on high-resource ones. The resulting model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, across 1,400 language directions on the Flores-101 benchmark.

Key Points

  • WALAR uses monolingual text for reinforcement training
  • Techniques such as word alignment and language alignment mitigate 'holes' in quality estimation models
  • WALAR achieves significant improvements over existing models on low-resource languages
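The first key point, monolingual-only reinforcement training, can be sketched at a high level. This is a toy illustration under stated assumptions, not the paper's training loop: `generate` stands in for the LLM policy and `reward_fn` for the shaped QE reward, and the advantage computation is a bare REINFORCE-with-baseline step.

```python
import random

# Illustrative sketch of monolingual-only RL for translation (not the paper's
# implementation): sample a monolingual source sentence, let the policy
# produce a candidate translation, score it with a reward model, and compute
# an advantage to reinforce high-reward outputs. No parallel data is needed.

def rl_step(monolingual_corpus, generate, reward_fn, baseline=0.5):
    src = random.choice(monolingual_corpus)   # monolingual source only
    candidate = generate(src)                 # policy (stand-in for the LLM)
    reward = reward_fn(src, candidate)        # e.g. shaped QE-based reward
    advantage = reward - baseline             # positive -> reinforce output
    return src, candidate, advantage
```

In a real system, the advantage would weight a policy-gradient update of the LLM; here the point is only that the loop consumes monolingual text plus a reward model, with no reference translations.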

Merits

Improved Performance on Low-Resource Languages

WALAR's ability to leverage monolingual text enables significant improvements in translation capabilities for low-resource languages.

Demerits

Potential Overfitting to Monolingual Data

The reliance on monolingual text may lead to overfitting, potentially affecting the model's performance on high-resource languages or in certain translation tasks.

Expert Commentary

The introduction of WALAR marks a significant advancement in multilingual machine translation. By addressing the limitations of existing quality estimation models, it demonstrates how reinforcement learning can improve LLMs' performance on low-resource languages. However, further research is needed to fully explore the capabilities and limitations of this approach, particularly with regard to potential overfitting and the impact on high-resource languages. As the field continues to evolve, it is essential to consider the practical and policy implications of such advancements.

Recommendations

  • Further investigation into the potential applications and limitations of WALAR's approach
  • Exploration of strategies to mitigate potential overfitting and ensure the model's performance on high-resource languages