
Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English


Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra

arXiv:2603.09998v1

Abstract: Although Large Language Models (LLMs) show exceptional performance in machine translation, only limited systematic assessment of translation quality has been carried out. The challenge lies in building automated frameworks, as human-expert evaluations are time-consuming given the fast-evolving LLMs and the diverse set of texts needed for a fair assessment of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation by Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts across several classes of high-profile Chinese texts, spanning modern and classical literature as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by the LLMs, and further evaluate them with an expert human translator. Our results indicate that the LLMs perform well in news media translation but diverge in performance on literary texts. Although both GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek was stronger at preserving cultural subtleties and grammatical rendering. Nevertheless, subtle challenges persist: maintaining cultural details, classical references, and figurative expressions remains an open problem for all the models.

Executive Summary

This study examines the performance of Large Language Models (LLMs) in machine translation from Mandarin Chinese to English, using an automated machine learning framework that combines semantic and sentiment analysis. The authors assess the quality of translations produced by Google Translate and three LLMs (GPT-4, GPT-4o, and DeepSeek) using novel similarity metrics alongside expert human evaluation. The results indicate that the LLMs perform well on news media translation but struggle with literary texts, with performance varying across models. The study highlights persistent challenges in translation, particularly in maintaining cultural details, classical references, and figurative expressions. The findings have implications for the development and application of LLMs in machine translation and language learning.
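To give a sense of the sentiment-analysis side of such a framework, here is a toy sketch of a sentiment-divergence check between two candidate translations of the same source sentence. The paper uses trained sentiment models; this lexicon-based scorer, with made-up word weights, is only illustrative of the comparison step.

```python
# Toy sentiment-divergence check between two candidate translations.
# The POLARITY lexicon and its weights are invented for illustration;
# a real framework would use a trained sentiment model instead.

POLARITY = {
    "joy": 1.0, "bright": 0.5, "calm": 0.3,
    "sorrow": -1.0, "dark": -0.5, "bitter": -0.7,
}

def sentiment_score(text: str) -> float:
    """Mean polarity of known words; 0.0 if no word is in the lexicon."""
    hits = [POLARITY[w] for w in text.lower().split() if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

def sentiment_divergence(translation_a: str, translation_b: str) -> float:
    """Absolute gap in sentiment between two candidate translations."""
    return abs(sentiment_score(translation_a) - sentiment_score(translation_b))

a = "a bright morning full of joy"
b = "a dark morning full of sorrow"
print(round(sentiment_divergence(a, b), 2))  # large gap -> translations disagree in tone
```

A large divergence between a machine translation and a reference flags a possible shift in emotional register, which is one way an automated pipeline can surface the tone-preservation issues the study discusses.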

Key Points

  • Automated machine learning framework utilized for assessment of LLMs in machine translation
  • Novel similarity metrics used to evaluate translation quality
  • GPT-4o and DeepSeek demonstrate better semantic conservation in complex situations
  • DeepSeek shows better performance in preserving cultural subtleties and grammatical rendering
  • Open challenges remain, including preserving cultural details, classical references, and figurative expressions

Merits

Comprehensive evaluation framework

The study presents a thorough evaluation framework for assessing LLMs in machine translation, incorporating both automated and human-expert-based assessments.

Novel similarity metrics

The authors propose novel similarity metrics for evaluating translation quality, providing a more nuanced understanding of LLM performance.

Demerits

Limited scope of analysis

The study focuses primarily on machine translation of Mandarin Chinese to English, limiting the generalizability of the findings to other languages and translation directions.

Insufficient attention to cultural context

While the study highlights the importance of cultural subtleties in translation, it does not fully explore the cultural context of the source texts, potentially limiting the validity of the findings.

Expert Commentary

This study makes a valuable contribution to machine translation research by pairing a comprehensive automated evaluation framework with novel similarity metrics. However, its limitations, notably the narrow scope of analysis and limited attention to cultural context, should be addressed in future work. The findings have significant implications for the development and application of LLMs in machine translation and language learning. As the field evolves, preserving cultural subtleties and grammatical rendering in LLM output remains a priority. Future research should develop more nuanced assessment frameworks and explore the cultural context of source texts more deeply.

Recommendations

  • Developers should prioritize the development of LLMs that can effectively preserve cultural subtleties and grammatical rendering.
  • Policymakers should consider the limitations of LLMs in machine translation when developing language learning programs or implementing translation policies.
