Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

arXiv:2603.07372v1

Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.

Executive Summary

This paper addresses a critical gap in machine translation quality estimation (QE) by evaluating domain-specific QE for low-resource English-to-Indic language pairs across four domains. Through a comparative analysis of zero-shot, few-shot, and guideline-anchored prompting across closed-weight and open-weight LLMs, the authors identify the fragility of prompt-only approaches in open-weight models, particularly in high-risk domains. To mitigate this, they adopt ALOPE, an adaptation framework that attaches regression heads to selected intermediate Transformer layers and fine-tunes them with low-rank methods, and further extend it with Low-Rank Multiplicative Adaptation (LoRMA). The findings demonstrate that intermediate-layer adaptation yields measurable gains, especially in semantically complex domains. The release of datasets and code enhances reproducibility and supports broader research. The work contributes meaningfully to the QE literature by offering a scalable, adaptable solution for practical, resource-constrained scenarios.
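The three prompting regimes compared in the paper can be sketched as simple prompt builders. This is an illustrative mock-up only: the wording, the 0-100 score scale, and the example sentences are assumptions for demonstration, not the authors' actual prompts.

```python
# Hypothetical prompt templates for the three QE prompting regimes
# compared in the paper (zero-shot, few-shot, guideline-anchored).

def zero_shot(src: str, mt: str) -> str:
    """Bare instruction plus the sentence pair to score."""
    return (
        "Rate the quality of this machine translation on a 0-100 scale.\n"
        f"Source (English): {src}\n"
        f"Translation: {mt}\n"
        "Score:"
    )

def few_shot(src: str, mt: str, examples: list[tuple[str, str, int]]) -> str:
    """Prepend a few scored demonstrations before the query pair."""
    demos = "\n".join(
        f"Source: {s}\nTranslation: {t}\nScore: {score}"
        for s, t, score in examples
    )
    return demos + "\n" + zero_shot(src, mt)

def guideline_anchored(src: str, mt: str, guidelines: str) -> str:
    """Anchor the model to explicit scoring guidelines."""
    return f"Scoring guidelines:\n{guidelines}\n\n" + zero_shot(src, mt)

prompt = guideline_anchored(
    "Take one tablet daily.",
    "रोज़ एक गोली लें।",
    "90-100: fully accurate and fluent; 0-10: unrelated or misleading.",
)
print(prompt)
```

In practice the returned string would be sent to the LLM and the numeric score parsed from its completion; the paper's finding is that this prompt-only route is reliable mainly for closed-weight models.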

Key Points

  • Identification of prompting fragility in open-weight LLMs
  • Introduction of ALOPE with low-rank adaptation on Transformer layers
  • Extension with LoRMA yielding performance improvements in complex domains

Merits

Methodological Innovation

The ALOPE framework introduces a novel intermediate-layer adaptation mechanism using low-rank methods, offering a scalable solution for QE in low-resource settings.
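The core idea can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the dimensions, initialization, and the particular multiplicative form shown for LoRMA are assumptions, and ALOPE's actual regression heads and layer selection follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 2                      # hidden size, low rank (r << d)
W = rng.normal(size=(d, d))       # frozen pretrained weight (illustrative)
x = rng.normal(size=d)            # one intermediate-layer hidden state

# LoRA (additive): W' = W + B @ A, where only the small A and B are trained.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))              # zero-init, so adaptation starts as a no-op
h_lora = W @ x + B @ (A @ x)

# LoRMA (multiplicative), one common form: W' = (I + B @ A) @ W.
h_lorma = (np.eye(d) + B @ A) @ (W @ x)

# ALOPE-style regression head on the adapted hidden state -> scalar QE score.
w_head = rng.normal(size=d) * 0.1
b_head = 0.0
score = float(w_head @ h_lorma + b_head)
```

With `B` zero-initialized, both adapted outputs coincide with the frozen model's hidden state at the start of training, which is the standard trick for stable low-rank fine-tuning; gradients then flow only through `A`, `B`, and the head, keeping the trainable parameter count small.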

Empirical Validation

The study provides systematic comparisons across multiple domains and language pairs, enhancing credibility and applicability of findings.

Demerits

Scope Limitation

The analysis is confined to specific language pairs and domains; broader applicability across additional Indic languages or other language families remains unexamined.

Generalizability Concern

Results may not extend to non-low-resource or non-domain-specific scenarios without further validation.

Expert Commentary

The paper makes a substantive contribution by bridging a persistent disconnect between QE efficacy and practical deployment in low-resource, domain-specific machine translation. The authors rightly identify that prompt-only approaches, while convenient, lack robustness, a critical insight for practitioners relying on open-weight LLMs. Their adoption of ALOPE and LoRMA represents a pragmatic evolution of adaptation strategies, moving beyond surface-level prompting to deeper, layer-wise fine-tuning that aligns with the architecture of Transformer-based models. Importantly, the choice to release datasets and code is not merely altruistic; it is a strategic move to accelerate reproducibility and foster collaboration. In an era where domain-specific accuracy is increasingly mandated in the legal, healthcare, and tourism sectors, the ability to reliably assess translation quality without reference texts is paramount. This work provides a concrete, empirically validated pathway toward that goal. Moreover, the focus on semantically complex domains reflects a nuanced understanding of where QE matters most, a level of sophistication that elevates the paper beyond typical comparative studies. It is a strong contribution that is likely to inform both academic research and industry deployment.

Recommendations

  • Adopt ALOPE or LoRMA frameworks in QE pipelines for domain-specific MT in low-resource contexts.
  • Expand future studies to include additional language pairs, domains, and open-source LLM variants to validate broader applicability.
