Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
arXiv:2602.17316v1

Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
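To make the methodology concrete, here is a minimal sketch of what the two perturbation pipelines could look like. The paper's actual rules and filters are not reproduced here; this version assumes NLTK's WordNet for synonym substitution and spaCy's `en_core_web_sm` dependency parser for checking whether one example syntactic transformation (active-to-passive) applies to a sentence.

```python
# A minimal sketch (not the authors' code) of the two perturbation pipelines:
# WordNet-based synonym substitution and a dependency-parse applicability check.
# Setup: pip install spacy nltk
#        python -m spacy download en_core_web_sm
#        python -c "import nltk; nltk.download('wordnet')"
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

POS_MAP = {"NOUN": wn.NOUN, "VERB": wn.VERB, "ADJ": wn.ADJ, "ADV": wn.ADV}

def lexical_perturb(text: str) -> str:
    """Replace each content word with a same-POS WordNet synonym, if any.

    A real pipeline would also disambiguate word senses and re-inflect the
    replacement so the perturbation stays truth-conditionally equivalent.
    """
    out = []
    for tok in nlp(text):
        wn_pos = POS_MAP.get(tok.pos_)
        candidates = set()
        if wn_pos:
            for synset in wn.synsets(tok.lemma_, pos=wn_pos):
                for lemma in synset.lemmas():
                    name = lemma.name()
                    # Keep single-word lemmas that differ from the original.
                    if "_" not in name and name.lower() != tok.lemma_.lower():
                        candidates.add(name)
        # Deterministic choice keeps the perturbation reproducible.
        out.append(min(candidates) + tok.whitespace_ if candidates else tok.text_with_ws)
    return "".join(out)

def passivizable(text: str) -> bool:
    """Dependency-parse check for one example transformation: active-to-passive
    requires a verb that has both a nominal subject and a direct object."""
    for tok in nlp(text):
        if tok.pos_ == "VERB":
            deps = {child.dep_ for child in tok.children}
            if "nsubj" in deps and "dobj" in deps:
                return True
    return False

print(lexical_perturb("The committee approved the new budget."))
print(passivizable("The committee approved the new budget."))  # True
```

The sketch deliberately keeps only the skeleton: a production pipeline would need sense disambiguation, morphological re-inflection, and human or automated checks that each variant really preserves meaning.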
Executive Summary
This article examines the limitations of current Large Language Model (LLM) evaluation by measuring how controlled, meaning-preserving lexical and syntactic perturbations affect the performance of 23 models on MMLU, SQuAD, and AMEGA. Using two linguistically principled perturbation pipelines, the authors show that lexical perturbations induce consistent, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects and occasionally improve results. Both perturbation types destabilize model leaderboards on complex tasks, suggesting that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence. The findings challenge the current reliance on standardized benchmarks and underscore the need for robustness testing as a standard component of LLM evaluation, with direct implications for how AI models are developed, compared, and deployed.
Key Points
- ▸ Lexical perturbations consistently induce performance degradation across nearly all models and tasks.
- ▸ Syntactic perturbations have more heterogeneous effects, occasionally improving results.
- ▸ Model robustness does not consistently scale with model size, revealing strong task dependence.
- ▸ Both perturbation types destabilize model leaderboards on complex tasks (one way to quantify this is sketched below).
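Since "destabilize" is a claim about rank order rather than raw scores, a simple way to quantify it is a rank correlation between the original and perturbed leaderboards. The abstract does not specify the paper's ranking statistic, so the sketch below uses Kendall's tau purely as an illustrative measure: values near 1.0 mean the model ordering survives the perturbation, lower values mean the leaderboard shuffles.

```python
# Illustrative leaderboard-stability check (an assumption, not the paper's
# statistic), given per-model accuracy before and after perturbation.
from scipy.stats import kendalltau

def leaderboard_stability(scores_orig, scores_pert):
    """scores_*: dict mapping model name -> benchmark accuracy."""
    models = sorted(scores_orig)  # fixed model order for both score vectors
    tau, p_value = kendalltau(
        [scores_orig[m] for m in models],
        [scores_pert[m] for m in models],
    )
    return tau, p_value

# Toy usage: models B and C swap places under perturbation.
orig = {"A": 0.81, "B": 0.76, "C": 0.74}
pert = {"A": 0.72, "B": 0.61, "C": 0.66}
print(leaderboard_stability(orig, pert))  # tau ~ 0.33: rankings disagree
```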
Merits
Methodological rigor
The perturbation pipelines are linguistically principled: one performs synonym substitution for lexical changes, the other uses dependency parsing to determine which syntactic transformations apply, so perturbed inputs stay truth-conditionally equivalent to the originals. This supports the interpretation that the observed degradation reflects model brittleness rather than changes in task difficulty.
Relevance to current LLM evaluation methods
The study directly challenges the field's reliance on standardized benchmarks as the primary instrument for model comparison, showing that both absolute scores and leaderboard rankings can shift under meaning-preserving rewording.
Demerits
Limited scope
The study evaluates 23 models on three benchmarks (MMLU, SQuAD, and AMEGA), a sample that may not be representative of the broader LLM landscape or of other task families.
Lack of contextual understanding
The emphasis on surface-level lexical patterns may undersell the role of context: near-synonyms are rarely perfectly interchangeable, so part of the measured degradation could stem from subtle shifts in meaning or register rather than from a genuine lack of linguistic competence.
Expert Commentary
This study represents a significant contribution to the field of LLM evaluation, highlighting the need for a more nuanced understanding of AI models' language abilities. By demonstrating the limitations of current evaluation methods, the authors provide a compelling case for the importance of robustness testing and the need to move beyond surface-level lexical patterns. As the field continues to evolve, it is essential to consider the implications of this research for LLM development, evaluation, and deployment. The findings also raise important questions about the nature of linguistic competence in AI models, underscoring the need for further research in this area.
Recommendations
- ✓ Future research should focus on developing more comprehensive evaluation methods that account for both surface-level and abstract linguistic competence.
- ✓ LLM developers and policymakers should prioritize robustness testing and contextual understanding in AI model evaluation (a minimal harness is sketched below).
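As a concrete starting point for that recommendation, the sketch below shows a minimal robustness check: score a model on original and perturbed versions of the same items, then compare the paired per-item outcomes with an exact McNemar test. The `ask` callable is a hypothetical stand-in for whatever inference backend is being evaluated; nothing here is taken from the paper's implementation.

```python
# Minimal robustness-testing harness (a sketch, assuming only a callable that
# maps a prompt to a model answer). Paired per-item correctness on original
# vs. perturbed prompts is compared with an exact McNemar test, i.e. a
# two-sided binomial test on the discordant pairs.
from scipy.stats import binomtest

def evaluate_robustness(ask, items, perturb):
    """ask: prompt -> answer; items: [(prompt, gold)]; perturb: str -> str."""
    orig = [ask(p) == gold for p, gold in items]
    pert = [ask(perturb(p)) == gold for p, gold in items]

    acc_orig = sum(orig) / len(items)
    acc_pert = sum(pert) / len(items)

    # Only discordant pairs matter for McNemar:
    # b = correct -> wrong under perturbation, c = wrong -> correct.
    b = sum(o and not q for o, q in zip(orig, pert))
    c = sum(q and not o for o, q in zip(orig, pert))
    p_value = binomtest(b, b + c, 0.5).pvalue if b + c else 1.0
    return acc_orig, acc_pert, p_value

# Toy usage with a stub "model" that echoes the prompt's last word; an
# uppercasing perturbation stands in for a real meaning-preserving rewrite.
if __name__ == "__main__":
    ask = lambda prompt: prompt.split()[-1]
    items = [("answer is cat", "cat"), ("answer is dog", "dog")]
    print(evaluate_robustness(ask, items, perturb=str.upper))
```

With per-item outcomes recorded this way, the same harness extends naturally to rank-stability analyses across many models, which is where the paper's leaderboard findings become visible.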