One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
arXiv:2603.09821v1 Announce Type: new Abstract: Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics \& Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
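To make the first pipeline stage concrete, here is a minimal sketch of NL2Bench-style intent structuring: a free-form evaluation request is mapped onto a structured, executable plan. The keyword index, benchmark names, and plan fields below are illustrative assumptions, not the actual One-Eval schema; a real planner would use an LLM rather than keyword matching.

```python
# Hypothetical sketch of intent structuring: turn a natural-language
# evaluation request into a structured plan. All names are illustrative.
from dataclasses import dataclass, field

# Toy capability-to-benchmark index (assumed, not One-Eval's real catalog).
BENCHMARK_INDEX = {
    "math": ["GSM8K", "MATH"],
    "code": ["HumanEval", "MBPP"],
    "knowledge": ["MMLU"],
}

@dataclass
class EvalPlan:
    model: str
    capabilities: list = field(default_factory=list)
    benchmarks: list = field(default_factory=list)

def plan_from_request(request: str, model: str) -> EvalPlan:
    """Structure a free-form request into an executable evaluation plan."""
    text = request.lower()
    caps = [c for c in BENCHMARK_INDEX if c in text]
    benches = [b for c in caps for b in BENCHMARK_INDEX[c]]
    return EvalPlan(model=model, capabilities=caps, benchmarks=benches)

plan = plan_from_request("Evaluate math and code reasoning", model="my-model")
print(plan.benchmarks)  # ['GSM8K', 'MATH', 'HumanEval', 'MBPP']
```

The point of the structured plan is that every downstream stage (dataset acquisition, metric selection, reporting) can operate on explicit fields instead of re-interpreting the user's prose.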
Executive Summary
One-Eval is an agentic evaluation system that addresses the challenges of reliable and efficient large language model (LLM) evaluation. Developed by the OpenDCAI team, it converts natural-language evaluation requests into executable, traceable workflows, reducing manual effort and enabling reproducible evaluations. One-Eval integrates three key components: NL2Bench for intent structuring and personalized benchmark planning, BenchResolve for benchmark resolution, dataset acquisition, and schema normalization, and Metrics & Reporting for task-aware metric selection and decision-oriented reporting. The system also includes human-in-the-loop checkpoints for review, editing, and rollback, preserving sample evidence trails for auditability. Experiments show that One-Eval executes end-to-end evaluations from diverse natural-language requests with minimal user effort, making it a practical tool for LLM development and deployment in industrial settings.
Key Points
- One-Eval is an agentic evaluation system that converts natural-language evaluation requests into executable workflows.
- The system integrates NL2Bench, BenchResolve, and Metrics & Reporting for intent structuring, benchmark resolution, and task-aware metric selection.
- One-Eval includes human-in-the-loop checkpoints for review, editing, and rollback, supporting auditability and reproducibility.
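The benchmark-resolution stage can be sketched as a schema-normalization step in the spirit of BenchResolve: heterogeneous benchmarks store questions and answers under different field names, and a per-benchmark field map rewrites each record into one canonical schema before evaluation runs. The field names and benchmark keys below are assumptions for illustration, not One-Eval's actual mappings.

```python
# Hypothetical schema normalization: rewrite heterogeneous benchmark records
# into one canonical schema. Field names here are illustrative assumptions.
CANONICAL_FIELDS = ("prompt", "reference")

# Per-benchmark field maps from native field names to canonical ones.
FIELD_MAPS = {
    "gsm8k": {"question": "prompt", "answer": "reference"},
    "mmlu": {"input": "prompt", "target": "reference"},
}

def normalize(record: dict, benchmark: str) -> dict:
    """Map a native record to the canonical schema, failing loudly on gaps."""
    fmap = FIELD_MAPS[benchmark]
    out = {fmap[k]: v for k, v in record.items() if k in fmap}
    missing = [f for f in CANONICAL_FIELDS if f not in out]
    if missing:
        raise ValueError(f"{benchmark}: missing canonical fields {missing}")
    return out

print(normalize({"question": "2+2?", "answer": "4"}, "gsm8k"))
# {'prompt': '2+2?', 'reference': '4'}
```

Failing loudly on missing fields is what makes the resulting workflow executable: a schema mismatch surfaces at resolution time rather than as a silent scoring error downstream.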
Merits
Strength in Efficiency
One-Eval significantly reduces manual effort required for LLM evaluation, enabling practitioners to execute end-to-end evaluations with minimal user input.
Enhanced Reproducibility
The system preserves per-sample evidence trails for debugging and auditability, while human-in-the-loop checkpoints let practitioners review, edit, and roll back workflow steps, supporting reproducible evaluations.
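A minimal sketch of how a per-sample evidence trail with checkpoint and rollback might work, assuming an append-only record log with snapshots (class and method names are hypothetical, not One-Eval's API):

```python
# Hypothetical evidence trail: append-only per-sample records, plus
# checkpoint/rollback in the spirit of human-in-the-loop review.
import copy

class EvidenceTrail:
    def __init__(self):
        self.records = []       # append-only per-sample evidence
        self._checkpoints = []  # snapshots for rollback

    def log(self, sample_id, prompt, output, score):
        self.records.append(
            {"id": sample_id, "prompt": prompt, "output": output, "score": score}
        )

    def checkpoint(self):
        """Snapshot the current trail so later steps can be undone."""
        self._checkpoints.append(copy.deepcopy(self.records))

    def rollback(self):
        """Restore the trail to the most recent checkpoint."""
        self.records = self._checkpoints.pop()

trail = EvidenceTrail()
trail.log(1, "2+2?", "4", 1.0)
trail.checkpoint()
trail.log(2, "3+3?", "7", 0.0)
trail.rollback()           # discard records logged after the checkpoint
print(len(trail.records))  # 1
```

Keeping raw prompt/output pairs alongside scores is what enables debugging and auditability: an aggregated metric can always be traced back to the individual samples that produced it.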
Customizability
One-Eval allows for customizable evaluation workflows tailored to specific use cases and requirements, supporting more efficient and effective LLM development and deployment.
Demerits
Limited Scalability
As LLMs and their applications grow in complexity, One-Eval may face challenges scaling to large evaluation workloads, which could limit its efficiency and effectiveness in practice.
Dependence on NL2Bench and BenchResolve
One-Eval's end-to-end reliability depends on the accuracy of its integrated components: an incorrect benchmark plan from NL2Bench or a failed resolution in BenchResolve propagates through the entire workflow, limiting the system's reliability and flexibility in unfamiliar evaluation scenarios.
Expert Commentary
One-Eval is a significant contribution to the field of LLM evaluation, offering a novel approach to the challenges of reliable and efficient evaluation. While the system demonstrates promising results, its scalability and its dependence on integrated components require further attention. Nevertheless, One-Eval has the potential to substantially streamline evaluation practice in LLM development and deployment, making it a useful tool for practitioners and researchers in the field.
Recommendations
- Future research should address the scalability limitations of One-Eval and explore its integration with other LLM evaluation frameworks.
- Development of One-Eval's integrated components, NL2Bench and BenchResolve, should be prioritized to ensure the system's reliability and flexibility in diverse evaluation scenarios.