Quantifying construct validity in large language model evaluations
arXiv:2602.15532v1. Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have …
Ryan Othniel Kearns