Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
arXiv:2602.17316v1. Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. …
Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser