When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
arXiv:2603.00314v1 Announce Type: new Abstract: This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in …
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
3 views