VERT: Reliable LLM Judges for Radiology Report Evaluation
arXiv:2604.03376v1 Announce Type: new Abstract: Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
Executive Summary
This paper presents VERT, a novel LLM-based metric for evaluating radiology reports across diverse imaging modalities and anatomies, addressing a critical gap in current literature focused predominantly on chest X-rays. Through comprehensive correlation analysis with expert radiologist ratings, the study benchmarks VERT against established LLM-as-a-judge metrics (RadFact, GREEN, FineRadScore) using open- and closed-source models of varying sizes. Key findings demonstrate VERT's superior performance, improving correlation by up to 11.7% over GREEN, while parameter-efficient fine-tuning of Qwen3 30B achieves gains of up to 25% with minimal training data (1,300 samples) and significant inference speed improvements (up to 37.2x). The research underscores the robustness and efficiency of LLM-based evaluation metrics in radiology, offering a scalable solution for clinical and research applications.
Key Points
- ▸ VERT outperforms existing LLM-based radiology report evaluation metrics in correlation with expert judgments, particularly across diverse modalities and anatomies beyond chest X-rays.
- ▸ Parameter-efficient fine-tuning of open models (e.g., Qwen3 30B) yields gains of up to 25% with only 1,300 training samples, while reducing inference time by up to 37.2x.
- ▸ Systematic error analysis identifies specific areas where LLM judges align with or diverge from expert evaluations, providing actionable insights for metric improvement and deployment.
Merits
Methodological Rigor
The study employs a robust, multi-faceted evaluation framework, including correlation analysis with expert ratings, few-shot learning, ensembling, and parameter-efficient fine-tuning, across two diverse datasets (RadEval and RaTE-Eval). This comprehensive approach ensures the findings are generalizable and not modality-specific.
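The core of such a framework is rank correlation between expert ratings and automated metric scores. The paper does not publish its analysis code, so the following is a minimal illustrative sketch with invented scores: a Spearman correlation computed from scratch with numpy (assuming no tied ranks), comparing a well-aligned judge against a weaker one.

```python
import numpy as np

def rankdata(a):
    # Assign ranks 1..n (this simple version assumes no ties)
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors
    rx, ry = rankdata(x), rankdata(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical expert ratings (1-5) and metric scores for five reports;
# numbers are illustrative, not taken from the paper.
expert   = [4, 2, 5, 3, 1]
metric_a = [0.82, 0.41, 0.90, 0.55, 0.20]  # monotone with expert order
metric_b = [0.60, 0.65, 0.70, 0.30, 0.50]  # weaker alignment

rho_a = spearman(expert, metric_a)
rho_b = spearman(expert, metric_b)
print(f"metric A rho = {rho_a:.2f}")  # perfect rank agreement -> 1.00
print(f"metric B rho = {rho_b:.2f}")
```

In practice, studies like this one typically also report Kendall's tau and bootstrap confidence intervals over the report-level scores; the same rank-based machinery applies.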
Novelty and Practical Impact
VERT introduces a scalable, high-performance LLM-based metric for radiology report evaluation that is not only more accurate than existing models but also computationally efficient. The demonstrated gains in inference speed and minimal data requirements make it particularly suitable for clinical workflows.
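The efficiency argument rests on parameter-efficient fine-tuning: instead of updating all weights of the judge model, only small low-rank factors are trained. The paper does not specify its exact adapter configuration, so this is a hedged numpy sketch of the generic LoRA idea (W plus a scaled low-rank product B @ A), with illustrative layer sizes, showing where the parameter savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 512, 512, 8          # illustrative layer size and LoRA rank
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight

# Only A and B are trainable; the effective weight is
# W + (alpha / r) * B @ A.  B starts at zero, so training begins
# exactly at the base model's behavior.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))
alpha = 16

def forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size                  # 512 * 512 = 262144
lora_params = A.size + B.size         # 8*512 + 512*8 = 8192
print(f"trainable: {lora_params} vs full fine-tuning: {full_params}")
print(f"parameter reduction: {full_params / lora_params:.0f}x")
```

At rank 8 this single layer trains 32x fewer parameters than full fine-tuning, which is why adaptation on the order of a thousand samples becomes feasible; the adapter can also be merged into W after training, so inference cost matches the base model.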
Insightful Error Analysis
The systematic error detection and categorization study provides a nuanced understanding of where LLM judges succeed or fail relative to expert judgments. This granular analysis is invaluable for refining metrics and guiding future research in LLM alignment.
Demerits
Dataset Limitations
The study relies on two expert-annotated datasets (RadEval and RaTE-Eval), which, while diverse, may not fully capture the variability in real-world radiology reports across all imaging modalities and clinical settings. Further validation on broader, multi-institutional datasets is warranted.
Model Bias and Generalization
The performance of LLM-based judges, including VERT, is contingent on the training data and model architecture. Potential biases in the training corpora (e.g., overrepresentation of certain anatomies or pathologies) could limit generalization, and this risk is not fully explored.
Interpretability Challenges
While the error analysis is thorough, the opacity of LLM decision-making processes remains a challenge. The lack of interpretability in how VERT and other metrics arrive at their judgments may hinder trust and adoption in high-stakes clinical environments.
Expert Commentary
This paper represents a significant advancement in the evaluation of radiology reports using LLM-based metrics, addressing a critical limitation in the field by demonstrating robustness across diverse imaging modalities and anatomies. The introduction of VERT and the demonstrated gains in performance and efficiency are particularly noteworthy, as they offer a practical solution to the scalability challenges inherent in radiology AI. The study's methodological rigor, including the comprehensive correlation analysis and systematic error detection, sets a new benchmark for evaluating LLM judges in medical imaging. However, the reliance on specific datasets and the potential for model bias warrant caution. Future work should focus on broader validation, interpretability enhancements, and the development of standardized evaluation protocols to ensure clinical reliability. The implications for clinical deployment are profound, with potential to streamline radiology workflows and improve patient care through consistent, automated quality assessment.
Recommendations
- ✓ Conduct further validation of VERT and similar metrics on larger, multi-institutional datasets encompassing a wider range of imaging modalities, pathologies, and clinical contexts to ensure generalizability and robustness in real-world settings.
- ✓ Develop interpretability tools and frameworks to enhance the transparency of LLM-based evaluation metrics like VERT, enabling clinicians and regulators to better understand and trust their outputs in high-stakes decision-making.
- ✓ Establish cross-disciplinary collaborations between radiologists, AI researchers, and policymakers to develop standardized validation protocols and governance frameworks for the deployment of AI-based radiology evaluation metrics in clinical practice.
- ✓ Explore the integration of VERT with existing radiology information systems (RIS) and picture archiving and communication systems (PACS) to facilitate seamless adoption and workflow integration in clinical environments.
- ✓ Investigate the potential of multimodal LLMs that combine textual radiology reports with imaging data to further enhance the accuracy and diagnostic relevance of evaluation metrics.
Sources
Original: arXiv - cs.AI