
AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems

arXiv:2603.09435v1 Abstract: The rapid rollout of AI across heterogeneous public and societal sectors has escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in this regulatory landscape. The development of solutions that assess AI systems' compliance with such standards is often limited by a lack of resources, hindering the semi-automated or automated evaluation of their performance. This creates a need for manual work, which is often error-prone, resource-intensive, or restricted to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method for creating a resource that facilitates the evaluation of NLP models, with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and question-answering for the EU AI Act. The dataset files are in a machine-readable format. To generate the files, we use domain knowledge as an exegetical basis, combined with the processing and reasoning power of large language models, to generate scenarios along with their respective tasks. Our methodology demonstrates a way to harness language models for grounded generation with high document relevancy. Moreover, we overcome limitations such as navigating the decision boundaries of risk levels that are not explicitly defined within the EU AI Act, such as limited- and minimal-risk cases. Finally, we demonstrate our dataset's effectiveness by evaluating a RAG-based solution that reaches F1-scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively.
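
The abstract says the dataset files are machine-readable but does not reproduce the schema here. As a rough illustration only, a record covering the four tasks might look like the following sketch; every field name and value is an assumption for exposition, not the paper's actual format.

```python
# Illustrative benchmark record covering the four tasks named in the abstract.
# All field names and values are assumptions, not the paper's schema.
import json

record = {
    "scenario": "A vendor deploys real-time remote biometric identification "
                "in publicly accessible spaces.",
    "risk_level": "prohibited",          # target for risk-level classification
    "relevant_articles": ["Article 5"],  # target for article retrieval
    "obligations": [                     # target for obligation generation
        "Such use is banned, save for narrowly defined law-enforcement exemptions."
    ],
    "qa": {                              # target for question-answering
        "question": "Is this system permissible under the EU AI Act?",
        "answer": "No; it falls under the prohibited practices of Article 5."
    },
}
print(json.dumps(record, indent=2))
```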

Executive Summary

This article presents an evaluation benchmark dataset for NLP and RAG systems, specifically designed to assess compliance with the EU AI Act. The dataset includes tasks such as risk-level classification, article retrieval, obligation generation, and question-answering. Combining domain knowledge with large language models, the authors demonstrate a method for generating scenarios and their respective tasks with high document relevancy. The dataset's effectiveness is demonstrated with a RAG-based solution achieving F1-scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively. The benchmark aims to enable the semi-automated or automated evaluation of AI systems' performance, addressing the need for compliance with regulatory standards and frameworks.
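
Since the evaluated system is retrieval-augmented, it may help to picture the loop being benchmarked. The sketch below is a generic, minimal RAG pipeline under our own assumptions (naive keyword-overlap retrieval, a caller-supplied `complete` function), not the authors' implementation.

```python
# Minimal sketch of a RAG loop of the kind the benchmark evaluates.
# Retrieval here is naive keyword overlap; real systems typically use
# dense embeddings. `complete` is any caller-supplied LLM completion function.
def retrieve(query: str, articles: dict[str, str], k: int = 2) -> list[str]:
    """Return the k article IDs whose text shares the most words with the query."""
    def overlap(article_id: str) -> int:
        return len(set(query.lower().split())
                   & set(articles[article_id].lower().split()))
    return sorted(articles, key=overlap, reverse=True)[:k]

def answer(query: str, articles: dict[str, str], complete) -> str:
    """Ground the answer in the retrieved EU AI Act articles."""
    context = "\n\n".join(articles[a] for a in retrieve(query, articles))
    return complete(f"Answer using only this context:\n{context}\n\nQ: {query}")
```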

Key Points

  • The article presents a novel evaluation benchmark dataset for NLP and RAG systems.
  • The dataset is specifically designed to assess compliance with the EU AI Act.
  • The authors utilize a combination of domain knowledge and large language models to generate scenarios and respective tasks.

Merits

Strength in Methodology

The article presents a well-structured methodology for generating scenarios and their respective tasks, leveraging the processing and reasoning power of large language models while grounding generation in the text of the regulation.
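
The paper's exact prompting protocol is not reproduced in this digest, so the following is only a plausible shape for grounded scenario generation: feed the model the article text itself so the output stays anchored to the source. The `complete` callback and the prompt wording are assumptions.

```python
# Hypothetical sketch of grounded scenario generation; not the authors' code.
from dataclasses import dataclass

@dataclass
class Scenario:
    article_id: str   # EU AI Act article the scenario is grounded in
    risk_level: str   # e.g. "prohibited", "high", "limited", "minimal"
    text: str         # generated scenario description

def generate_scenario(article_id: str, article_text: str,
                      risk_level: str, complete) -> Scenario:
    """Prompt an LLM with the article text itself so generation stays grounded."""
    prompt = (f"Using only the following EU AI Act article, write a realistic "
              f"AI-system scenario whose risk level is '{risk_level}'.\n\n"
              f"{article_text}")
    return Scenario(article_id, risk_level, complete(prompt))
```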

Effective Evaluation

The dataset's effectiveness is demonstrated through a RAG-based solution achieving F1-scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively.
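
For readers who want to reproduce this style of evaluation, per-class F1 of the kind reported can be computed with scikit-learn; the labels and predictions below are invented placeholders, not the paper's data.

```python
# Per-class F1, as reported for the prohibited and high-risk classes.
# y_true/y_pred are invented placeholders, not the paper's data.
from sklearn.metrics import f1_score

levels = ["prohibited", "high", "limited", "minimal"]
y_true = ["prohibited", "high", "high", "limited", "prohibited", "minimal"]
y_pred = ["prohibited", "high", "limited", "limited", "prohibited", "minimal"]

scores = f1_score(y_true, y_pred, labels=levels, average=None)  # one F1 per level
for label, s in zip(levels, scores):
    print(f"{label}: F1 = {s:.2f}")
```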

Demerits

Limitation in Generalizability

The dataset's design and development are heavily reliant on the EU AI Act, limiting its generalizability to other regulatory frameworks.

Overreliance on Large Language Models

The article's methodology relies heavily on the processing and reasoning power of large language models, which may introduce limitations and biases in the generated scenarios and tasks.

Expert Commentary

This article presents a significant contribution to the field of AI evaluation, particularly in the context of regulatory compliance. The authors' methodology for generating scenarios and their respective tasks demonstrates a novel approach to harnessing language models for grounded generation. While the dataset is tightly coupled to the EU AI Act, its practical and policy implications are substantial: evaluation benchmarks like this one can enable the semi-automated or automated assessment of AI systems' compliance, reducing manual work and the errors that come with it. However, the heavy reliance on large language models and the limited generalizability to other regulatory frameworks highlight the need for future research in these areas.

Recommendations

  • Future research should focus on developing evaluation benchmark datasets for various regulatory frameworks, ensuring the generalizability of these datasets.
  • The authors should explore alternative or complementary methodologies for generating scenarios and their respective tasks, reducing reliance on large language models and mitigating the biases they may introduce.

Sources

  • arXiv:2603.09435v1: https://arxiv.org/abs/2603.09435