Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
arXiv:2603.03116v1. Abstract: Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was …
Hongliu Cao, Ilias Driouich, Eoin Thomas