Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
arXiv:2603.02239v1 Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
Executive Summary
The Engineering Reasoning and Instruction (ERI) benchmark is a comprehensive dataset for evaluating the capabilities of large language models (LLMs) and agents in engineering fields. It crosses nine engineering fields and 55 subdomains with seven intent types and three difficulty tiers, yielding 57,750 records with field/subdomain/type/difficulty metadata. Across seven evaluated LLMs, the study reports a statistically significant three-tier performance structure, with frontier models forming the top tier. To address circularity concerns, the authors developed a convergent validation protocol that empirically bounds hallucination risk to 1.7%. The ERI benchmark is released with taxonomy specifications, validation scripts, and an evaluation harness, enabling reproducible comparisons and regression testing for instruction tuning and related workflows.
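As a sanity check on the taxonomy arithmetic, the Python sketch below reconstructs the cell count from the numbers given in the abstract. The field and intent names come from the paper; the 55 subdomains are not enumerated in the abstract, so only their count is used, and the even division into 50 records per (subdomain, intent, difficulty) cell is our inference rather than a stated design choice.

```python
# Taxonomy axes as named in the ERI abstract.
FIELDS = [
    "civil", "mechanical", "electrical", "chemical", "environmental",
    "aerospace", "materials", "fire", "industrial",
]
INTENT_TYPES = [
    "definition", "explanation", "calculation", "comparison",
    "design/synthesis", "troubleshooting", "code-related",
]
DIFFICULTY_TIERS = ["undergraduate", "graduate", "professional"]
N_SUBDOMAINS = 55  # count only; the abstract does not list them

assert len(FIELDS) == 9 and len(INTENT_TYPES) == 7

# Records are organized by (subdomain, intent, difficulty) cells.
cells = N_SUBDOMAINS * len(INTENT_TYPES) * len(DIFFICULTY_TIERS)
print(cells)             # 1155 cells
print(57_750 / cells)    # 50.0 -> consistent with 50 records per cell
```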
Key Points
- ▸ ERI is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable LLMs and agents.
- ▸ The dataset spans nine engineering fields, 55 subdomains, and seven intent types, yielding 57,750 records.
- ▸ Frontier models achieve mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibit progressively higher failure rates and steeper performance degradation on graduate-level questions.
Merits
Comprehensive Dataset Design
The ERI dataset is carefully designed to cover a broad range of engineering fields, subdomains, and intent types, providing a comprehensive evaluation framework for LLMs and agents.
Convergent Validation Protocol
The convergent validation protocol addresses the circularity inherent in LLM-judged benchmarks by combining cross-provider independence, multi-judge averaging, and frontier-model agreement analysis, empirically bounding hallucination risk to 1.7% and strengthening the reliability of the benchmark.
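A minimal sketch of the multi-judge averaging and agreement idea follows; the judge names and the one-point disagreement threshold are illustrative assumptions, not the paper's actual protocol.

```python
from statistics import mean

def multi_judge_score(judge_scores: dict[str, float]) -> float:
    """Average the 1-5 scores from several judge models for one response."""
    return mean(judge_scores.values())

def judges_agree(judge_scores: dict[str, float], tol: float = 1.0) -> bool:
    """True when the judge spread stays within `tol` points.

    Records where independent judges disagree can be routed to manual
    audit; the surviving disagreement rate is one way to empirically
    bound hallucination risk. The threshold here is illustrative.
    """
    scores = list(judge_scores.values())
    return max(scores) - min(scores) <= tol

# Hypothetical judges drawn from independent providers
# (cross-provider independence in the paper's terms).
scores = {"judge_a": 4.5, "judge_b": 4.0, "judge_c": 4.5}
print(round(multi_judge_score(scores), 2))  # 4.33
print(judges_agree(scores))                 # True -> no manual audit needed
```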
Reproducible Evaluations
The ERI benchmark is released with taxonomy specifications, validation scripts, and an evaluation harness, enabling reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows.
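As an illustration of the kind of regression testing the release enables, here is a hedged sketch that groups scored records by their taxonomy metadata and flags cells where a candidate model regresses against a baseline. The record keys (`field`, `type`, `difficulty`, `score`) mirror the metadata named in the abstract, but the released harness's actual API may differ, and the drop threshold is illustrative.

```python
from collections import defaultdict

def mean_by_cell(records):
    """Group scored records by (field, type, difficulty) and return per-cell means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        key = (r["field"], r["type"], r["difficulty"])
        sums[key] += r["score"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def regression_check(baseline, candidate, max_drop=0.25):
    """Return cells where the candidate's mean score drops by more than
    `max_drop` versus the baseline (threshold is illustrative)."""
    base, cand = mean_by_cell(baseline), mean_by_cell(candidate)
    return {k: (base[k], cand[k]) for k in base
            if k in cand and base[k] - cand[k] > max_drop}

# Toy usage with two hypothetical scored runs over the same record.
v1 = [{"field": "civil", "type": "calculation",
       "difficulty": "graduate", "score": 4.6}]
v2 = [{"field": "civil", "type": "calculation",
       "difficulty": "graduate", "score": 4.1}]
print(regression_check(v1, v2))  # flags the 0.5-point drop in this cell
```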
Demerits
Limited Evaluation Scope
The study focuses on evaluating LLMs and agents in engineering fields, and covers only seven models, which may limit the applicability of its findings to other domains and model families.
Dependence on Frontier Models
The study's results, and its validation protocol in particular, lean on frontier-model agreement; errors shared across frontier models could therefore evade detection, and the findings may not generalize to other models or domains.
Expert Commentary
The ERI benchmark is a significant contribution to natural language processing, providing a comprehensive framework for evaluating LLMs and agents in engineering fields. Its findings on frontier-model performance and its convergent validation protocol are particularly noteworthy. The study's limitations, including its reliance on frontier models for validation and its restricted evaluation scope, nonetheless point to the need for further research. Overall, ERI has the potential to significantly influence how engineering-capable LLMs and agents are developed and evaluated.
Recommendations
- ✓ Future studies should investigate the generalizability of the ERI benchmark to other domains and the applicability of the convergent validation protocol to other evaluation frameworks.
- ✓ Researchers should explore the development of more diverse and representative datasets, including those that cover a broader range of engineering fields and intent types.