
Policy Compliance of User Requests in Natural Language for AI Systems

arXiv:2603.00369v1 — Abstract: Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring that such user requests comply with a list of diverse policies determined by the organization, with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLMs on policy compliance assessment under different solution methods. We analyze the differences in performance metrics across the models and solution methods, showcasing the challenging nature of our problem.

Pedro Cisneros-Velarde


Executive Summary

This paper addresses a critical gap in AI governance by introducing the first benchmark for evaluating the policy compliance of user requests expressed in natural language. The study recognizes the growing reliance on AI systems in organizational contexts and the necessity of aligning user requests with institutional policies to ensure safe, reliable, and compliant use. By annotating user requests across diverse compliance scenarios, the authors provide a novel resource for benchmarking large language models (LLMs) on their ability to assess compliance. The work bridges academic research and industrial application, offering empirical insights into model performance under varying compliance assessment methods. Its grounding in real-world technology-sector applications strengthens its practical relevance.

Key Points

  • Introduction of the first annotated benchmark for policy compliance assessment
  • Evaluation of LLM performance across compliance scenarios
  • Application of findings to industrial technology sector contexts
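As a rough illustration of the task the benchmark targets (not the paper's actual method or prompts), a zero-shot compliance check can be framed as prompting a model with the policy list and the user request, then parsing a verdict. The prompt format, `llm` callable, and `toy_llm` stand-in below are assumptions for the sketch:

```python
from typing import Callable, List

def build_compliance_prompt(request: str, policies: List[str]) -> str:
    """Assemble a zero-shot prompt asking a model to judge a request against policies."""
    policy_list = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(policies))
    return (
        "Organizational policies:\n"
        f"{policy_list}\n\n"
        f"User request: {request}\n"
        "Answer COMPLIANT or NON-COMPLIANT."
    )

def assess_compliance(request: str, policies: List[str],
                      llm: Callable[[str], str]) -> bool:
    """Return True if the model judges the request compliant with all policies."""
    verdict = llm(build_compliance_prompt(request, policies))
    return verdict.strip().upper().startswith("COMPLIANT")

# Toy keyword-based stand-in for an LLM, purely to make the sketch runnable:
def toy_llm(prompt: str) -> str:
    return "NON-COMPLIANT" if "delete all" in prompt.lower() else "COMPLIANT"

policies = ["Requests must not ask for bulk data deletion."]
print(assess_compliance("Summarize last week's tickets", policies, toy_llm))  # True
print(assess_compliance("Delete all customer records", policies, toy_llm))    # False
```

In practice the `llm` callable would wrap an actual model API, and the verdict parsing would need to be more robust; the point is that compliance assessment reduces to a classification over (request, policy-list) pairs, which is what the benchmark annotates.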

Merits

Innovation

The creation of a novel, annotated benchmark fills a significant void in compliance-aware AI evaluation and provides a scalable framework for future research.

Empirical Contribution

The benchmark enables actionable comparisons between LLM models and compliance assessment strategies, offering empirical validation of model capabilities under real-world constraints.

Demerits

Scope Limitation

The current benchmark is limited to the technology sector and may not generalize to other domains, potentially restricting applicability beyond the studied context.

Evaluation Constraint

The reliance on annotated data may introduce subjectivity in compliance interpretation, raising questions about the scalability and objectivity of compliance metrics.

Expert Commentary

This work represents a pivotal step toward operationalizing compliance in AI systems. The authors rightly identify that compliance is not merely a legal or ethical concern but an operational necessity, particularly as AI systems scale and interact with diverse user bases. The benchmark’s design—annotated, diverse, and industry-aligned—demonstrates thoughtful alignment with practical constraints. While the sector-specific limitation is valid, the paper’s contribution lies in proving feasibility: demonstrating that policy compliance can be quantified, benchmarked, and evaluated using algorithmic methods. Moreover, the empirical analysis of LLM performance reveals nuanced differences in model interpretability and sensitivity to contextual ambiguity, which are critical for future work on explainability and accountability. The authors have effectively positioned their work at the intersection of AI safety, governance, and machine learning evaluation, offering a template for future frameworks in compliance-aware AI.

Recommendations

  • Expand the benchmark to include additional domains beyond technology to broaden applicability.
  • Develop standardized annotation protocols for compliance interpretation to reduce subjectivity and improve reproducibility.
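One lightweight way to operationalize such a protocol is to collect labels from several annotators per (request, policy) pair, aggregate by majority vote, and track raw agreement as a proxy for label reliability. This is a hypothetical sketch, not the paper's annotation scheme:

```python
from collections import Counter

def aggregate_labels(labels):
    """Majority-vote label over annotators, plus the raw agreement rate."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Three hypothetical annotators judging one (request, policy) pair:
label, agreement = aggregate_labels(["compliant", "compliant", "non_compliant"])
print(label, round(agreement, 2))  # compliant 0.67
```

Pairs with low agreement could be escalated to adjudication or flagged as genuinely ambiguous, which would directly address the subjectivity concern raised above.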
