One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
arXiv:2603.09821v1 Announce Type: new Abstract: Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics \& Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
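To make the first pipeline stage concrete, here is a minimal sketch of NL2Bench-style intent structuring: a free-form evaluation request is mapped onto a structured, executable plan. The keyword index, benchmark names, and plan fields below are illustrative assumptions, not the actual One-Eval schema; a real planner would use an LLM rather than keyword matching.

```python
# Hypothetical sketch of intent structuring: turn a natural-language
# evaluation request into a structured plan. All names are illustrative.
from dataclasses import dataclass, field

# Toy capability-to-benchmark index (assumed, not One-Eval's real catalog).
BENCHMARK_INDEX = {
    "math": ["GSM8K", "MATH"],
    "code": ["HumanEval", "MBPP"],
    "knowledge": ["MMLU"],
}

@dataclass
class EvalPlan:
    model: str
    capabilities: list = field(default_factory=list)
    benchmarks: list = field(default_factory=list)

def plan_from_request(request: str, model: str) -> EvalPlan:
    """Structure a free-form request into an executable evaluation plan."""
    text = request.lower()
    caps = [c for c in BENCHMARK_INDEX if c in text]
    benches = [b for c in caps for b in BENCHMARK_INDEX[c]]
    return EvalPlan(model=model, capabilities=caps, benchmarks=benches)

plan = plan_from_request("Evaluate math and code reasoning", model="my-model")
print(plan.benchmarks)  # ['GSM8K', 'MATH', 'HumanEval', 'MBPP']
```

The point of the structured plan is that every downstream stage (dataset acquisition, metric selection, reporting) can operate on explicit fields instead of re-interpreting the user's prose.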
Executive Summary
One-Eval is an agentic evaluation system that addresses the challenges of reliable and efficient large language model (LLM) evaluation. Developed by the OpenDCAI team, it converts natural-language evaluation requests into executable, traceable workflows, reducing manual effort and enabling reproducible evaluations. One-Eval integrates three key components: NL2Bench for intent structuring and personalized benchmark planning, BenchResolve for benchmark resolution, dataset acquisition, and schema normalization, and Metrics & Reporting for task-aware metric selection and decision-oriented reporting. The system also includes human-in-the-loop checkpoints for review, editing, and rollback, preserving sample evidence trails for auditability. Experiments show that One-Eval executes end-to-end evaluations from diverse natural-language requests with minimal user effort, making it a practical tool for LLM development and deployment in industrial settings.
Key Points
- One-Eval is an agentic evaluation system that converts natural-language evaluation requests into executable workflows.
- The system integrates NL2Bench, BenchResolve, and Metrics & Reporting for intent structuring, benchmark resolution, and task-aware metric selection.
- One-Eval includes human-in-the-loop checkpoints for review, editing, and rollback, supporting auditability and reproducibility.
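The benchmark-resolution stage can be sketched as a schema-normalization step in the spirit of BenchResolve: heterogeneous benchmarks store questions and answers under different field names, and a per-benchmark field map rewrites each record into one canonical schema before evaluation runs. The field names and benchmark keys below are assumptions for illustration, not One-Eval's actual mappings.

```python
# Hypothetical schema normalization: rewrite heterogeneous benchmark records
# into one canonical schema. Field names here are illustrative assumptions.
CANONICAL_FIELDS = ("prompt", "reference")

# Per-benchmark field maps from native field names to canonical ones.
FIELD_MAPS = {
    "gsm8k": {"question": "prompt", "answer": "reference"},
    "mmlu": {"input": "prompt", "target": "reference"},
}

def normalize(record: dict, benchmark: str) -> dict:
    """Map a native record to the canonical schema, failing loudly on gaps."""
    fmap = FIELD_MAPS[benchmark]
    out = {fmap[k]: v for k, v in record.items() if k in fmap}
    missing = [f for f in CANONICAL_FIELDS if f not in out]
    if missing:
        raise ValueError(f"{benchmark}: missing canonical fields {missing}")
    return out

print(normalize({"question": "2+2?", "answer": "4"}, "gsm8k"))
# {'prompt': '2+2?', 'reference': '4'}
```

Failing loudly on missing fields is what makes the resulting workflow executable: a schema mismatch surfaces at resolution time rather than as a silent scoring error downstream.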
Merits
Strength in Efficiency
One-Eval significantly reduces manual effort required for LLM evaluation, enabling practitioners to execute end-to-end evaluations with minimal user input.
Enhanced Reproducibility
The system preserves per-sample evidence trails for debugging and auditability, while human-in-the-loop checkpoints let practitioners review, edit, and roll back workflow steps, supporting reproducible evaluations.
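A minimal sketch of how a per-sample evidence trail with checkpoint and rollback might work, assuming an append-only record log with snapshots (class and method names are hypothetical, not One-Eval's API):

```python
# Hypothetical evidence trail: append-only per-sample records, plus
# checkpoint/rollback in the spirit of human-in-the-loop review.
import copy

class EvidenceTrail:
    def __init__(self):
        self.records = []       # append-only per-sample evidence
        self._checkpoints = []  # snapshots for rollback

    def log(self, sample_id, prompt, output, score):
        self.records.append(
            {"id": sample_id, "prompt": prompt, "output": output, "score": score}
        )

    def checkpoint(self):
        """Snapshot the current trail so later steps can be undone."""
        self._checkpoints.append(copy.deepcopy(self.records))

    def rollback(self):
        """Restore the trail to the most recent checkpoint."""
        self.records = self._checkpoints.pop()

trail = EvidenceTrail()
trail.log(1, "2+2?", "4", 1.0)
trail.checkpoint()
trail.log(2, "3+3?", "7", 0.0)
trail.rollback()           # discard records logged after the checkpoint
print(len(trail.records))  # 1
```

Keeping raw prompt/output pairs alongside scores is what enables debugging and auditability: an aggregated metric can always be traced back to the individual samples that produced it.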
Customizability
One-Eval allows for customizable evaluation workflows tailored to specific use cases and requirements, supporting more efficient and effective LLM development and deployment.
Demerits
Limited Scalability
As LLMs and their applications grow in complexity, One-Eval may face challenges scaling to large evaluation workloads, which could limit its efficiency and effectiveness in practice.
Dependence on NL2Bench and BenchResolve
One-Eval's end-to-end reliability depends on the accuracy of its integrated components: an incorrect benchmark plan from NL2Bench or a failed resolution in BenchResolve propagates through the entire workflow, limiting the system's reliability and flexibility in unfamiliar evaluation scenarios.
Expert Commentary
One-Eval is a significant contribution to the field of LLM evaluation, offering a novel approach to the challenges of reliable and efficient evaluation. While the system demonstrates promising results, its scalability and its dependence on integrated components require further attention. Nevertheless, One-Eval has the potential to substantially streamline evaluation practice in LLM development and deployment, making it a useful tool for practitioners and researchers in the field.
Recommendations
- Future research should address the scalability limitations of One-Eval and explore its integration with other LLM evaluation frameworks.
- Development of One-Eval's integrated components, NL2Bench and BenchResolve, should be prioritized to ensure the system's reliability and flexibility in diverse evaluation scenarios.