Brittlebench: Quantifying LLM robustness via prompt sensitivity
arXiv:2603.13285v1 Announce Type: new Abstract: Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the …
Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Okta, Sam Bell, Elia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams
10 views