Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
arXiv:2602.17316v1

Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
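To make the methodology concrete, here is a minimal sketch of what the two perturbation pipelines could look like. The paper's actual rules and filters are not reproduced here; this version assumes NLTK's WordNet for synonym substitution and spaCy's `en_core_web_sm` dependency parser for checking whether one example syntactic transformation (active-to-passive) applies to a sentence.

```python
# A minimal sketch (not the authors' code) of the two perturbation pipelines:
# WordNet-based synonym substitution and a dependency-parse applicability check.
# Setup: pip install spacy nltk
#        python -m spacy download en_core_web_sm
#        python -c "import nltk; nltk.download('wordnet')"
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

POS_MAP = {"NOUN": wn.NOUN, "VERB": wn.VERB, "ADJ": wn.ADJ, "ADV": wn.ADV}

def lexical_perturb(text: str) -> str:
    """Replace each content word with a same-POS WordNet synonym, if any.

    A real pipeline would also disambiguate word senses and re-inflect the
    replacement so the perturbation stays truth-conditionally equivalent.
    """
    out = []
    for tok in nlp(text):
        wn_pos = POS_MAP.get(tok.pos_)
        candidates = set()
        if wn_pos:
            for synset in wn.synsets(tok.lemma_, pos=wn_pos):
                for lemma in synset.lemmas():
                    name = lemma.name()
                    # Keep single-word lemmas that differ from the original.
                    if "_" not in name and name.lower() != tok.lemma_.lower():
                        candidates.add(name)
        # Deterministic choice keeps the perturbation reproducible.
        out.append(min(candidates) + tok.whitespace_ if candidates else tok.text_with_ws)
    return "".join(out)

def passivizable(text: str) -> bool:
    """Dependency-parse check for one example transformation: active-to-passive
    requires a verb that has both a nominal subject and a direct object."""
    for tok in nlp(text):
        if tok.pos_ == "VERB":
            deps = {child.dep_ for child in tok.children}
            if "nsubj" in deps and "dobj" in deps:
                return True
    return False

print(lexical_perturb("The committee approved the new budget."))
print(passivizable("The committee approved the new budget."))  # True
```

The sketch deliberately keeps only the skeleton: a production pipeline would need sense disambiguation, morphological re-inflection, and human or automated checks that each variant really preserves meaning.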
Executive Summary
This article examines the limitations of current Large Language Model (LLM) evaluation by measuring how controlled, meaning-preserving lexical and syntactic perturbations affect the performance of 23 models on MMLU, SQuAD, and AMEGA. Using two linguistically principled perturbation pipelines, the authors show that lexical perturbations induce consistent, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects and occasionally improve results. Both perturbation types destabilize model leaderboards on complex tasks, suggesting that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence. The findings challenge the current reliance on standardized benchmarks and underscore the need for robustness testing as a standard component of LLM evaluation, with direct implications for how AI models are developed, compared, and deployed.
Key Points
- ▸ Lexical perturbations consistently induce performance degradation across nearly all models and tasks.
- ▸ Syntactic perturbations have more heterogeneous effects, occasionally improving results.
- ▸ Model robustness does not consistently scale with model size, revealing strong task dependence.
- ▸ Both perturbation types destabilize model leaderboards on complex tasks (one way to quantify this is sketched below).
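Since "destabilize" is a claim about rank order rather than raw scores, a simple way to quantify it is a rank correlation between the original and perturbed leaderboards. The abstract does not specify the paper's ranking statistic, so the sketch below uses Kendall's tau purely as an illustrative measure: values near 1.0 mean the model ordering survives the perturbation, lower values mean the leaderboard shuffles.

```python
# Illustrative leaderboard-stability check (an assumption, not the paper's
# statistic), given per-model accuracy before and after perturbation.
from scipy.stats import kendalltau

def leaderboard_stability(scores_orig, scores_pert):
    """scores_*: dict mapping model name -> benchmark accuracy."""
    models = sorted(scores_orig)  # fixed model order for both score vectors
    tau, p_value = kendalltau(
        [scores_orig[m] for m in models],
        [scores_pert[m] for m in models],
    )
    return tau, p_value

# Toy usage: models B and C swap places under perturbation.
orig = {"A": 0.81, "B": 0.76, "C": 0.74}
pert = {"A": 0.72, "B": 0.61, "C": 0.66}
print(leaderboard_stability(orig, pert))  # tau ~ 0.33: rankings disagree
```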
Merits
Methodological rigor
The perturbation pipelines are linguistically principled: one performs synonym substitution for lexical changes, the other uses dependency parsing to determine which syntactic transformations apply, so perturbed inputs stay truth-conditionally equivalent to the originals. This supports the interpretation that the observed degradation reflects model brittleness rather than changes in task difficulty.
Relevance to current LLM evaluation methods
The study directly challenges the field's reliance on standardized benchmarks as the primary instrument for model comparison, showing that both absolute scores and leaderboard rankings can shift under meaning-preserving rewording.
Demerits
Limited scope
The study evaluates 23 models on three benchmarks (MMLU, SQuAD, and AMEGA), a sample that may not be representative of the broader LLM landscape or of other task families.
Lack of contextual understanding
The emphasis on surface-level lexical patterns may undersell the role of context: near-synonyms are rarely perfectly interchangeable, so part of the measured degradation could stem from subtle shifts in meaning or register rather than from a genuine lack of linguistic competence.
Expert Commentary
This study represents a significant contribution to the field of LLM evaluation, highlighting the need for a more nuanced understanding of AI models' language abilities. By demonstrating the limitations of current evaluation methods, the authors provide a compelling case for the importance of robustness testing and the need to move beyond surface-level lexical patterns. As the field continues to evolve, it is essential to consider the implications of this research for LLM development, evaluation, and deployment. The findings also raise important questions about the nature of linguistic competence in AI models, underscoring the need for further research in this area.
Recommendations
- ✓ Future research should focus on developing more comprehensive evaluation methods that account for both surface-level and abstract linguistic competence.
- ✓ LLM developers and policymakers should prioritize robustness testing and contextual understanding in AI model evaluation (a minimal harness is sketched below).
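As a concrete starting point for that recommendation, the sketch below shows a minimal robustness check: score a model on original and perturbed versions of the same items, then compare the paired per-item outcomes with an exact McNemar test. The `ask` callable is a hypothetical stand-in for whatever inference backend is being evaluated; nothing here is taken from the paper's implementation.

```python
# Minimal robustness-testing harness (a sketch, assuming only a callable that
# maps a prompt to a model answer). Paired per-item correctness on original
# vs. perturbed prompts is compared with an exact McNemar test, i.e. a
# two-sided binomial test on the discordant pairs.
from scipy.stats import binomtest

def evaluate_robustness(ask, items, perturb):
    """ask: prompt -> answer; items: [(prompt, gold)]; perturb: str -> str."""
    orig = [ask(p) == gold for p, gold in items]
    pert = [ask(perturb(p)) == gold for p, gold in items]

    acc_orig = sum(orig) / len(items)
    acc_pert = sum(pert) / len(items)

    # Only discordant pairs matter for McNemar:
    # b = correct -> wrong under perturbation, c = wrong -> correct.
    b = sum(o and not q for o, q in zip(orig, pert))
    c = sum(q and not o for o, q in zip(orig, pert))
    p_value = binomtest(b, b + c, 0.5).pvalue if b + c else 1.0
    return acc_orig, acc_pert, p_value

# Toy usage with a stub "model" that echoes the prompt's last word; an
# uppercasing perturbation stands in for a real meaning-preserving rewrite.
if __name__ == "__main__":
    ask = lambda prompt: prompt.split()[-1]
    items = [("answer is cat", "cat"), ("answer is dog", "dog")]
    print(evaluate_robustness(ask, items, perturb=str.upper))
```

With per-item outcomes recorded this way, the same harness extends naturally to rank-stability analyses across many models, which is where the paper's leaderboard findings become visible.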