Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
arXiv:2603.06942v1 Announce Type: new Abstract: Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim …
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman
19 views