When LLM Judge Scores Look Good but Best-of-N Decisions Fail
arXiv:2603.12520v1 Announce Type: cross Abstract: Large language models are often used as judges to score candidate responses, then validated with a single global metric such …
Eddie Landesberg