Semantic Invariance in Agentic AI
arXiv:2603.13173v1 Announce Type: new
Abstract: Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
Executive Summary
The article discusses the concept of semantic invariance in Agentic AI: the ability of Large Language Models (LLMs) to maintain stable reasoning under semantically equivalent input variations. The authors propose a metamorphic testing framework for assessing LLM robustness, applying eight semantic-preserving transformations to seven foundation models across 19 multi-step reasoning problems. The results show that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability, while larger models exhibit greater fragility. The study highlights the importance of semantic invariance in consequential applications and provides insights for improving LLM reliability.
Key Points
- ▸ Semantic invariance is crucial for reliable LLM performance in consequential applications
- ▸ Standard benchmark evaluations fail to capture semantic invariance
- ▸ Model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91)
Merits
Comprehensive Evaluation Framework
The proposed metamorphic testing framework provides a systematic approach to assessing LLM robustness
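The core idea of such a framework can be sketched in a few lines: run the agent on a canonical prompt and on semantic-preserving variants, then count how often the answers stay within a similarity threshold. The sketch below is illustrative, not the paper's code; the stub agent, the transformations, and the lexical Jaccard similarity (standing in for the paper's semantic-similarity measure) are all assumptions.

```python
# Hypothetical sketch of a metamorphic invariance check (not the paper's
# implementation). A real setup would call an LLM and use an embedding-based
# semantic similarity; here a stub agent and token-set overlap keep it runnable.

from typing import Callable, Dict


def jaccard_similarity(a: str, b: str) -> float:
    """Crude lexical stand-in for semantic similarity: token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def invariance_rate(
    agent: Callable[[str], str],
    base_prompt: str,
    transformations: Dict[str, Callable[[str], str]],
    threshold: float = 0.9,
) -> float:
    """Fraction of semantic-preserving variants whose answer stays within
    `threshold` similarity of the answer to the canonical prompt."""
    baseline = agent(base_prompt)
    invariant = sum(
        1
        for transform in transformations.values()
        if jaccard_similarity(baseline, agent(transform(base_prompt))) >= threshold
    )
    return invariant / len(transformations)


# Toy usage with a deterministic stub agent and three illustrative transforms:
stub_agent = lambda prompt: "the answer is 42"
transforms = {
    "identity": lambda p: p,
    "paraphrase": lambda p: "Please solve the following: " + p,
    "reorder": lambda p: " ".join(reversed(p.split(". "))),
}
rate = invariance_rate(stub_agent, "What is 6 times 7?", transforms)
print(rate)  # the stub always answers identically, so the rate is 1.0
```

Swapping the stub for a real model client and the Jaccard measure for an embedding similarity would turn this into a usable harness in the spirit of the paper's framework.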
Demerits
Limited Model Scope
The evaluation is limited to seven foundation models, which may not be representative of the broader LLM landscape
Expert Commentary
The article contributes significantly to the ongoing discussion on AI reliability and safety. The findings on model scale and robustness are particularly noteworthy, as they challenge the common assumption that larger models are inherently more reliable. The proposed evaluation framework provides a valuable tool for assessing LLM robustness and highlights the need for further research on semantic invariance. As AI continues to permeate critical domains, ensuring the reliability and stability of LLMs is essential for maintaining public trust and preventing potential harms.
Recommendations
- ✓ Developers should incorporate semantic invariance testing into their LLM evaluation pipelines
- ✓ Further research should investigate the relationship between model architecture and semantic invariance
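The first recommendation can be made concrete as a simple gate in a CI-style evaluation pipeline: fail a model that falls below a minimum invariance rate. The function names, stubbed scores, and the 0.75 bar below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: gating an evaluation pipeline on semantic invariance.
# `evaluate_invariance` is a placeholder for running a metamorphic test
# suite; the scores and the 0.75 threshold are made-up examples.

def evaluate_invariance(model_name: str) -> float:
    """Placeholder: would run the metamorphic suite and return the
    fraction of invariant responses for `model_name`."""
    stubbed_results = {"model-a": 0.80, "model-b": 0.62}
    return stubbed_results[model_name]


def passes_invariance_gate(model_name: str, bar: float = 0.75) -> bool:
    """Pass/fail check suitable for an LLM evaluation pipeline."""
    return evaluate_invariance(model_name) >= bar


print(passes_invariance_gate("model-a"))  # True
print(passes_invariance_gate("model-b"))  # False
```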