CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
arXiv:2603.06183v1
Abstract: We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
Executive Summary
The article introduces CRIMSON, a clinically grounded framework for evaluating generated chest X-ray reports on diagnostic correctness, contextual relevance, and patient safety. It incorporates full clinical context (patient age, indication, and guideline-based decision rules) and classifies errors as false findings, missing findings, or one of eight attribute-level errors. Each finding receives a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), so that clinically consequential mistakes weigh more than benign discrepancies and normal or insignificant findings do not dominate the score. CRIMSON aligns with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), agrees consistently with expert judgment on the RadJudge pass-fail suite, and achieves the strongest alignment with radiologist preferences on RadPref. The authors release the metric, both benchmarks, and a fine-tuned MedGemma model to support reproducible evaluation. The work has the potential to improve the quality and safety of AI-generated radiology reports and, through them, patient care.
Key Points
- CRIMSON is a clinically grounded, LLM-based framework for evaluating generated chest X-ray reports
- Its error taxonomy covers false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation)
- Severity-aware weighting, based on four clinical significance levels, prioritizes clinically consequential mistakes over benign discrepancies
- Scores align with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84)
- Results agree consistently with expert judgment on RadJudge and show the strongest alignment with radiologist preferences on RadPref
Merits
Impressive Validation Results
CRIMSON aligns with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84). It also agrees consistently with expert judgment on the RadJudge pass-fail scenarios and achieves the strongest alignment with radiologist preferences on RadPref, which together support its validity as an evaluation framework.
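The agreement statistics cited above (Kendall's tau and Pearson's r) can be illustrated with a minimal sketch. The data below are hypothetical, the function names are illustrative, and the choice of the tau-b variant is an assumption; the abstract does not say which variant the authors used or how scores were oriented relative to error counts.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either sequence."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + tied_x_only)
                 * (concordant + discordant + tied_y_only))
    return (concordant - discordant) / denom


# Hypothetical data, one entry per candidate report (not from the paper).
radiologist_error_counts = [0, 1, 1, 3, 4, 6]
# Hypothetical metric output expressed as a penalty (higher = worse report).
metric_penalties = [0.0, 0.5, 1.0, 2.5, 3.0, 6.5]

tau = kendall_tau_b(radiologist_error_counts, metric_penalties)
r = pearson_r(radiologist_error_counts, metric_penalties)
print(f"Kendall tau-b: {tau:.2f}")
print(f"Pearson r:     {r:.2f}")
```

Kendall's tau measures whether the metric ranks reports in the same order as the radiologists do, while Pearson's r measures linear agreement in magnitude, which is one reason evaluations of this kind often report both.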
Comprehensive Taxonomy of Errors
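CRIMSON's taxonomy separates false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). This shows not only how many errors a report contains but what kind they are, which supports targeted improvements to report generation models.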
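One way to picture this taxonomy is as a small data model. The sketch below is illustrative only: the class names, field names, and validation rule are assumptions, not the schema in the CRIMSON repository, and only the four attribute types named in the abstract are listed because the remaining four are not given there.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    FALSE_FINDING = "false_finding"      # reported but not present
    MISSING_FINDING = "missing_finding"  # present but not reported
    ATTRIBUTE_ERROR = "attribute_error"  # finding matched, attribute wrong


class Significance(Enum):
    URGENT = "urgent"
    ACTIONABLE_NON_URGENT = "actionable_non_urgent"
    NON_ACTIONABLE = "non_actionable"
    EXPECTED_BENIGN = "expected_benign"


# The abstract names four of the eight attribute-level error types;
# the other four are intentionally left out rather than guessed.
NAMED_ATTRIBUTES = (
    "location",
    "severity",
    "measurement",
    "diagnostic_overinterpretation",
)


@dataclass(frozen=True)
class ReportError:
    kind: ErrorKind
    finding: str
    significance: Significance
    attribute: Optional[str] = None  # set only for attribute-level errors

    def __post_init__(self):
        is_attr = self.kind is ErrorKind.ATTRIBUTE_ERROR
        if is_attr != (self.attribute is not None):
            raise ValueError("attribute must be set exactly for attribute errors")


# Hypothetical errors found in one generated report.
errors = [
    ReportError(ErrorKind.MISSING_FINDING, "pneumothorax", Significance.URGENT),
    ReportError(ErrorKind.ATTRIBUTE_ERROR, "pleural effusion",
                Significance.ACTIONABLE_NON_URGENT, attribute="location"),
    ReportError(ErrorKind.FALSE_FINDING, "degenerative spine changes",
                Significance.EXPECTED_BENIGN),
]

counts_by_kind = Counter(e.kind for e in errors)
print({kind.value: n for kind, n in counts_by_kind.items()})
```

Keeping the error kind and the clinical significance as separate fields lets the same finding be counted by type for diagnostics and by significance for scoring.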
Severity-Aware Weighting
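Each finding is assigned one of four clinical significance levels (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed with attending cardiothoracic radiologists. Errors on urgent findings therefore count for more than errors on benign ones, and normal or clinically insignificant findings cannot dominate the overall score.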
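The effect of severity-aware weighting can be shown with a toy comparison. The weights and the simple sum below are hypothetical stand-ins chosen for illustration; the abstract does not give CRIMSON's actual weights or scoring formula.

```python
# Hypothetical weight per clinical significance level (not CRIMSON's values).
SIGNIFICANCE_WEIGHTS = {
    "urgent": 1.0,
    "actionable_non_urgent": 0.5,
    "non_actionable": 0.25,
    "expected_benign": 0.0,
}


def weighted_penalty(error_significances):
    """Sum the weights of a report's errors; higher means a worse report."""
    return sum(SIGNIFICANCE_WEIGHTS[level] for level in error_significances)


# Report A: three errors, all on expected/benign findings.
report_a = ["expected_benign", "expected_benign", "expected_benign"]
# Report B: a single error, but on an urgent finding.
report_b = ["urgent"]

# A raw error count ranks A as worse; the weighted penalty ranks B as worse.
print("raw counts:        A =", len(report_a), " B =", len(report_b))
print("weighted penalties: A =", weighted_penalty(report_a),
      " B =", weighted_penalty(report_b))
```

Under an unweighted count, the report that misses an urgent finding would look better than the one with three harmless discrepancies, which is the failure mode that severity-aware weighting is meant to avoid.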
Demerits
Limited Generalizability
CRIMSON is designed for and validated on chest X-ray reports, so its results may not generalize to other imaging modalities, body regions, or types of medical reports. Further evaluation in diverse clinical contexts is needed.
Potential for Complexity
The comprehensive taxonomy of errors and severity-aware weighting mechanism may introduce complexity in the evaluation process, requiring significant expertise and resources to implement and interpret effectively.
Expert Commentary
CRIMSON is a meaningful advance in evaluating report generation, offering a more comprehensive and clinically relevant framework than prior metrics. By using full clinical context and severity-aware weighting, it addresses a key limitation of those metrics: normal or clinically insignificant findings could exert disproportionate influence on the score. The validation results are strong, but they come from chest X-ray benchmarks, and further work is needed to establish whether the approach carries over to other clinical contexts and report types. If it does, CRIMSON could help improve the quality and safety of AI-generated radiology reports and thereby benefit patient care.
Recommendations
- Further research and evaluation are needed to ensure the generalizability of CRIMSON to diverse clinical contexts and types of medical reports.
- Policymakers should establish clear guidelines and standards for the clinical validation and evaluation of AI-generated reports, ensuring that these reports are safe and effective for patient care.