Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
arXiv:2603.02239v1 Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
Executive Summary
The Engineering Reasoning and Instruction (ERI) benchmark is a comprehensive dataset for evaluating the capabilities of large language models (LLMs) and agents in engineering fields. It crosses nine engineering fields and 55 subdomains with seven intent types and three difficulty tiers, yielding 57,750 records with field/subdomain/type/difficulty metadata. Across seven evaluated LLMs, the study reports a statistically significant three-tier performance structure, with frontier models forming the top tier. To address circularity concerns, the authors developed a convergent validation protocol that empirically bounds hallucination risk to 1.7%. The ERI benchmark is released with taxonomy specifications, validation scripts, and an evaluation harness, enabling reproducible comparisons and regression testing for instruction tuning and related workflows.
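As a sanity check on the taxonomy arithmetic, the Python sketch below reconstructs the cell count from the numbers given in the abstract. The field and intent names come from the paper; the 55 subdomains are not enumerated in the abstract, so only their count is used, and the even division into 50 records per (subdomain, intent, difficulty) cell is our inference rather than a stated design choice.

```python
# Taxonomy axes as named in the ERI abstract.
FIELDS = [
    "civil", "mechanical", "electrical", "chemical", "environmental",
    "aerospace", "materials", "fire", "industrial",
]
INTENT_TYPES = [
    "definition", "explanation", "calculation", "comparison",
    "design/synthesis", "troubleshooting", "code-related",
]
DIFFICULTY_TIERS = ["undergraduate", "graduate", "professional"]
N_SUBDOMAINS = 55  # count only; the abstract does not list them

assert len(FIELDS) == 9 and len(INTENT_TYPES) == 7

# Records are organized by (subdomain, intent, difficulty) cells.
cells = N_SUBDOMAINS * len(INTENT_TYPES) * len(DIFFICULTY_TIERS)
print(cells)             # 1155 cells
print(57_750 / cells)    # 50.0 -> consistent with 50 records per cell
```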
Key Points
- ▸ ERI is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable LLMs and agents.
- ▸ The dataset spans nine engineering fields, 55 subdomains, and seven intent types, yielding 57,750 records.
- ▸ Frontier models achieve mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibit progressively higher failure rates and steeper performance degradation on graduate-level questions.
Merits
Comprehensive Dataset Design
The ERI dataset is carefully designed to cover a broad range of engineering fields, subdomains, and intent types, providing a comprehensive evaluation framework for LLMs and agents.
Convergent Validation Protocol
The convergent validation protocol addresses the circularity inherent in LLM-judged benchmarks by combining cross-provider independence, multi-judge averaging, and frontier-model agreement analysis, empirically bounding hallucination risk to 1.7% and strengthening the reliability of the benchmark.
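A minimal sketch of the multi-judge averaging and agreement idea follows; the judge names and the one-point disagreement threshold are illustrative assumptions, not the paper's actual protocol.

```python
from statistics import mean

def multi_judge_score(judge_scores: dict[str, float]) -> float:
    """Average the 1-5 scores from several judge models for one response."""
    return mean(judge_scores.values())

def judges_agree(judge_scores: dict[str, float], tol: float = 1.0) -> bool:
    """True when the judge spread stays within `tol` points.

    Records where independent judges disagree can be routed to manual
    audit; the surviving disagreement rate is one way to empirically
    bound hallucination risk. The threshold here is illustrative.
    """
    scores = list(judge_scores.values())
    return max(scores) - min(scores) <= tol

# Hypothetical judges drawn from independent providers
# (cross-provider independence in the paper's terms).
scores = {"judge_a": 4.5, "judge_b": 4.0, "judge_c": 4.5}
print(round(multi_judge_score(scores), 2))  # 4.33
print(judges_agree(scores))                 # True -> no manual audit needed
```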
Reproducible Evaluations
The ERI benchmark is released with taxonomy specifications, validation scripts, and an evaluation harness, enabling reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows.
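As an illustration of the kind of regression testing the release enables, here is a hedged sketch that groups scored records by their taxonomy metadata and flags cells where a candidate model regresses against a baseline. The record keys (`field`, `type`, `difficulty`, `score`) mirror the metadata named in the abstract, but the released harness's actual API may differ, and the drop threshold is illustrative.

```python
from collections import defaultdict

def mean_by_cell(records):
    """Group scored records by (field, type, difficulty) and return per-cell means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        key = (r["field"], r["type"], r["difficulty"])
        sums[key] += r["score"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def regression_check(baseline, candidate, max_drop=0.25):
    """Return cells where the candidate's mean score drops by more than
    `max_drop` versus the baseline (threshold is illustrative)."""
    base, cand = mean_by_cell(baseline), mean_by_cell(candidate)
    return {k: (base[k], cand[k]) for k in base
            if k in cand and base[k] - cand[k] > max_drop}

# Toy usage with two hypothetical scored runs over the same record.
v1 = [{"field": "civil", "type": "calculation",
       "difficulty": "graduate", "score": 4.6}]
v2 = [{"field": "civil", "type": "calculation",
       "difficulty": "graduate", "score": 4.1}]
print(regression_check(v1, v2))  # flags the 0.5-point drop in this cell
```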
Demerits
Limited Evaluation Scope
The study focuses on evaluating LLMs and agents in engineering fields, and covers only seven models, which may limit the applicability of its findings to other domains and model families.
Dependence on Frontier Models
The study's results, and its validation protocol in particular, lean on frontier-model agreement; errors shared across frontier models could therefore evade detection, and the findings may not generalize to other models or domains.
Expert Commentary
The ERI benchmark is a significant contribution to natural language processing, providing a comprehensive framework for evaluating LLMs and agents in engineering fields. Its findings on frontier-model performance and its convergent validation protocol are particularly noteworthy. The study's limitations, including its reliance on frontier models for validation and its restricted evaluation scope, nonetheless point to the need for further research. Overall, ERI has the potential to significantly influence how engineering-capable LLMs and agents are developed and evaluated.
Recommendations
- ✓ Future studies should investigate the generalizability of the ERI benchmark to other domains and the applicability of the convergent validation protocol to other evaluation frameworks.
- ✓ Researchers should explore the development of more diverse and representative datasets, including those that cover a broader range of engineering fields and intent types.