Policy Compliance of User Requests in Natural Language for AI Systems
arXiv:2603.00369v1 Announce Type: new Abstract: Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring that such user requests comply with a list of diverse policies determined by the organization, with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests with diverse compliance outcomes with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLMs on policy compliance assessment under different solution methods. We analyze the differences in performance metrics across the models and solution methods, showcasing the challenging nature of our problem.
Executive Summary
This paper addresses a critical gap in AI governance by introducing the first benchmark for evaluating the policy compliance of user requests expressed in natural language. The study recognizes the growing reliance on AI systems in organizational contexts and the necessity of aligning user requests with institutional policies to ensure safety, reliability, and compliance. By annotating user requests across diverse compliance scenarios, the authors provide a novel resource for benchmarking large language models (LLMs) on their ability to assess compliance. The work bridges academic research and industrial application, offering empirical insights into model performance under varying compliance assessment methods, and its grounding in real-world technology sector use cases strengthens its practical relevance.
Key Points
- ▸ Introduction of the first annotated benchmark for policy compliance assessment
- ▸ Evaluation of LLM performance across compliance scenarios
- ▸ Application of findings to industrial technology sector contexts
Merits
Innovation
The creation of a novel, annotated benchmark fills a significant void in compliance-aware AI evaluation and provides a scalable framework for future research.
Empirical Contribution
The benchmark enables actionable comparisons between LLM models and compliance assessment strategies, offering empirical validation of model capabilities under real-world constraints.
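The paper does not publish its evaluation code, but the kind of comparison described here can be sketched as standard classification scoring: each annotated request has a gold compliance label, and each model-plus-method combination produces a predicted label. A minimal sketch, assuming a binary compliant/non-compliant labeling (the label names and example data below are hypothetical):

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over compliance labels."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Hypothetical benchmark slice: annotated gold labels vs one model's predictions.
gold = ["compliant", "non_compliant", "non_compliant", "compliant"]
pred = ["compliant", "non_compliant", "compliant", "compliant"]
labels = ["compliant", "non_compliant"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(gold, pred, labels):.2f}")
```

Macro-averaging matters here because compliance violations are often the minority class; plain accuracy can look high while the model misses most non-compliant requests.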
Demerits
Scope Limitation
The current benchmark is limited to the technology sector and may not generalize to other domains, potentially restricting applicability beyond the studied context.
Evaluation Constraint
The reliance on annotated data may introduce subjectivity in compliance interpretation, raising questions about the scalability and objectivity of compliance metrics.
Expert Commentary
This work represents a pivotal step toward operationalizing compliance in AI systems. The authors rightly identify that compliance is not merely a legal or ethical concern but an operational necessity, particularly as AI systems scale and interact with diverse user bases. The benchmark’s design—annotated, diverse, and industry-aligned—demonstrates thoughtful alignment with practical constraints. While the sector-specific limitation is valid, the paper’s contribution lies in proving feasibility: demonstrating that policy compliance can be quantified, benchmarked, and evaluated using algorithmic methods. Moreover, the empirical analysis of LLM performance reveals nuanced differences in model interpretability and sensitivity to contextual ambiguity, which are critical for future work on explainability and accountability. The authors have effectively positioned their work at the intersection of AI safety, governance, and machine learning evaluation, offering a template for future frameworks in compliance-aware AI.
Recommendations
- ✓ Expand the benchmark to include additional domains beyond technology to broaden applicability.
- ✓ Develop standardized annotation protocols for compliance interpretation to reduce subjectivity and improve reproducibility.
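One concrete way to make annotation subjectivity measurable, as the second recommendation suggests, is to have each request labeled by multiple annotators and report chance-corrected agreement. A minimal sketch using Cohen's kappa for two annotators (the label values and sample annotations are illustrative, not from the paper):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators beyond chance."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_exp = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
                for l in labels)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

# Illustrative compliance labels from two annotators ("c" = compliant, "n" = not).
ann1 = ["c", "n", "c", "c", "n", "c"]
ann2 = ["c", "n", "n", "c", "n", "c"]
print(round(cohens_kappa(ann1, ann2), 2))  # prints 0.67
```

Reporting such an agreement score alongside the benchmark would let future users judge how much of the observed model error is attributable to genuinely ambiguous compliance interpretations rather than model failure.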