Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
arXiv:2602.24119v1 Announce Type: new Abstract: This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
Executive Summary
This study systematically evaluates large language models (LLMs) on Ancient Greek technical prose and finds that terminology rarity strongly predicts catastrophic translation failure. The researchers tested three commercial LLMs on two texts by Galen of Pergamum: translation quality was high on a previously translated expository text (mean MQM 95.2/100) but lower and far more variable on an untranslated pharmacological text (79.9/100). The findings expose the limits of automated metrics, which failed to discriminate among high-quality translations, and underscore the need for expert human evaluation. The results have significant implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
Key Points
- ▸ Large language models (LLMs) struggle with translating Ancient Greek technical prose, particularly when faced with rare terminology.
- ▸ Terminology rarity emerges as a strong predictor of catastrophic failure in LLM translation.
- ▸ Human evaluation is essential in assessing translation quality, particularly for low-resource ancient languages.
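The human evaluation behind these points uses a modified Multidimensional Quality Metrics (MQM) framework, which scores a translation by subtracting severity-weighted error penalties from a perfect score. The weights, normalization window, and error annotations below are illustrative placeholders, not the paper's actual rubric:

```python
# Illustrative MQM-style scoring. Severity weights and the per-100-words
# normalization are common MQM conventions, assumed here for the sketch.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors, word_count, per_words=100):
    """Start from 100 and subtract severity-weighted penalties,
    normalized to passage length (penalties per `per_words` words)."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for sev in errors)
    return max(0.0, 100.0 - penalty * per_words / word_count)

# A 150-word translation annotated with two minor errors and one major error
print(mqm_score(["minor", "minor", "major"], word_count=150))  # ≈ 95.33
```

Under a scheme like this, a single critical error (e.g., a mistranslated drug name) costs as much as ten minor ones, which is how a handful of dense terminological passages can drag aggregate scores down sharply.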
Merits
Methodological rigor
The study employs a systematic, reference-free human evaluation framework, which provides a robust assessment of LLM translation quality.
Generalizability
The evaluation framework and the rarity-based failure predictor are plausibly transferable to other low-resource ancient languages that face similar data scarcity, though this remains to be tested.
Implications for Classical scholarship
The study's results have significant implications for the use of LLMs in Classical scholarship, highlighting the limitations of relying on automated metrics and emphasizing the importance of human evaluation.
Demerits
Limited scope
The study covers a single language and genre (twenty passages of Ancient Greek technical prose, all from one author), which may limit the generalizability of the findings to other languages and text types.
Dependence on human evaluation
The study's results rely on human evaluation, which can be subjective and time-consuming, limiting the scalability of the evaluation framework.
Expert Commentary
The study's findings are significant because they show how little automated metrics reveal once translations cluster at high quality. Using terminology rarity, operationalized as corpus frequency, as a predictor of catastrophic failure is a novel insight, and the emphasis on expert human evaluation is timely. However, the reliance on specialist annotators and the narrow textual scope constrain how far the conclusions can be extended. Nevertheless, the implications for Classical scholarship and for evaluation pipelines for low-resource ancient languages are far-reaching and warrant further investigation.
Recommendations
- ✓ Future studies should investigate the use of terminology rarity as a predictor of catastrophic failure in other languages and genres.
- ✓ Researchers should develop more scalable and objective human evaluation frameworks to support the use of LLMs in low-resource languages.