Robust AI Evaluation through Maximal Lotteries
arXiv:2602.21297v1 Announce Type: new Abstract: The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of …
Hadi Khalaf, Serena L. Wang, Daniel Halpern, Itai Shapira, Flavio du Pin Calmon, Ariel D. Procaccia
3 views