Position: Science of AI Evaluation Requires Item-level Benchmark Data
arXiv:2604.03244v1 Announce Type: new Abstract: AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often …