Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
arXiv:2603.04417v1. Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While …
Fiona Lau