RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings
arXiv:2603.22820v1. Abstract: Tracking findings in longitudinal radiology reports is crucial for accurately identifying disease progression, and the time-consuming process would benefit from automatic summarization. This work introduces a structured summarization task, where we frame longitudinal report summarization as a timeline generation task, with dated findings organized in columns and temporally related findings grouped in rows. This structured summarization format enables straightforward comparison of findings across time and facilitates fact-checking against the associated reports. The timeline is generated using a 3-step LLM process of extracting findings, generating group names, and using the names to group the findings. To evaluate such systems, we create RadTimeline, a timeline dataset focused on tracking lung-related radiologic findings in chest-related imaging reports. Experiments on RadTimeline show tradeoffs of different-sized LLMs and prompting strategies. Our results highlight that group name generation as an intermediate step is critical for effective finding grouping. The best configuration has some irrelevant findings but very good recall, and grouping performance is comparable to human annotators.
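The timeline layout described in the abstract (dates as columns, groups of related findings as rows) can be pictured with a small sketch. This is an illustration only: the function names `build_timeline` and `render`, the triple-based input, and the sample findings are all made up here and are not taken from the paper or the RadTimeline dataset.

```python
from collections import defaultdict

# Hypothetical input: (report_date, group_name, finding_text) triples.
# The findings below are invented for illustration, not dataset content.
findings = [
    ("2021-03-01", "right upper lobe nodule", "4 mm nodule in the right upper lobe"),
    ("2021-09-15", "right upper lobe nodule", "nodule now 6 mm"),
    ("2021-09-15", "pleural effusion", "small left pleural effusion"),
]

def build_timeline(findings):
    """Return (sorted dates, {group: {date: [finding, ...]}})."""
    rows = defaultdict(lambda: defaultdict(list))
    for date, group, text in findings:
        rows[group][date].append(text)
    # ISO-formatted dates sort chronologically as plain strings.
    dates = sorted({date for date, _, _ in findings})
    return dates, {group: dict(cols) for group, cols in rows.items()}

def render(dates, rows):
    """Render one line per group, one cell per date ('-' when empty)."""
    lines = ["group | " + " | ".join(dates)]
    for group, cols in rows.items():
        cells = ["; ".join(cols.get(d, [])) or "-" for d in dates]
        lines.append(group + " | " + " | ".join(cells))
    return "\n".join(lines)

dates, rows = build_timeline(findings)
print(render(dates, rows))
```

Reading across one row shows how a single finding evolves over time, and each cell can be traced back to the report with that date.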
Executive Summary
The article introduces a structured summarization task for longitudinal radiological lung findings, recasting the tracking of disease progression as timeline generation: dated findings are organized in columns, and temporally related findings are grouped in rows. The timeline is produced by a 3-step LLM process that extracts findings, generates group names, and uses those names to group the findings. The resulting format makes findings easy to compare across time and to fact-check against the source reports. Evaluated on RadTimeline, a newly created dataset of lung-related findings in chest-related imaging reports, the study reports tradeoffs among LLM sizes and prompting strategies, with group name generation emerging as a critical intermediate step. The best-performing configuration returns some irrelevant findings, but its recall is very good and its grouping performance is comparable to that of human annotators, which suggests potential for supporting clinical workflows. The work targets the time-consuming manual tracking of findings across reports and offers practical insights for AI-assisted radiology reporting.
Key Points
- ▸ Structured timeline format improves longitudinal finding comparability
- ▸ 3-step LLM process extracts findings, generates group names, and groups findings by name
- ▸ Group name generation is identified as a critical intermediary step
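The three steps named in the abstract can be sketched as three small functions. This is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompt strings, the `"ungrouped"` fallback, and the keyword-based `toy_llm` stand-in are all hypothetical, and the sample reports are invented.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]   # assumed interface: prompt text in, reply text out
Finding = Tuple[str, str]    # (report date, finding text)

def extract_findings(llm: LLM, reports: List[Tuple[str, str]]) -> List[Finding]:
    """Step 1: ask the model for one finding per line, for each dated report."""
    findings: List[Finding] = []
    for date, text in reports:
        reply = llm("EXTRACT\n" + text)
        findings += [(date, line.strip()) for line in reply.splitlines() if line.strip()]
    return findings

def generate_group_names(llm: LLM, findings: List[Finding]) -> List[str]:
    """Step 2: ask the model for group names covering all extracted findings."""
    reply = llm("NAMES\n" + "\n".join(text for _, text in findings))
    return [line.strip() for line in reply.splitlines() if line.strip()]

def group_findings(llm: LLM, findings: List[Finding], names: List[str]) -> Dict[str, List[Finding]]:
    """Step 3: assign each finding to one of the generated names."""
    groups: Dict[str, List[Finding]] = {name: [] for name in names}
    for date, text in findings:
        reply = llm("ASSIGN\nnames: " + ", ".join(names) + "\nfinding: " + text).strip()
        # Replies that are not one of the generated names go to a fallback bucket.
        key = reply if reply in names else "ungrouped"
        groups.setdefault(key, []).append((date, text))
    return groups

def toy_llm(prompt: str) -> str:
    """Keyword-based stand-in so the sketch runs without a real model."""
    task, _, body = prompt.partition("\n")
    if task == "EXTRACT":
        sentences = [s.strip() for s in body.split(".")]
        return "\n".join(s for s in sentences if "nodule" in s or "effusion" in s)
    if task == "NAMES":
        return "\n".join(k for k in ("nodule", "effusion") if k in body)
    finding = body.split("finding: ", 1)[1]
    return "nodule" if "nodule" in finding else "effusion"

# Invented sample reports, for illustration only.
reports = [
    ("2021-03-01", "There is a 4 mm nodule. Heart size is normal."),
    ("2021-09-15", "The nodule measures 6 mm. Small effusion is present."),
]
findings = extract_findings(toy_llm, reports)
names = generate_group_names(toy_llm, findings)
timeline = group_findings(toy_llm, findings, names)
```

Because step 3 only chooses among names produced in step 2, the grouping decision becomes a constrained assignment rather than open-ended clustering, which is one plausible reading of why the intermediate step matters.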
Merits
Strength in Novelty
The structured timeline format is a new framing of longitudinal report summarization: it lets readers compare findings across time and check them against the source reports.
Demerits
Limitation in Precision
Although grouping performance is comparable to that of human annotators, the best configuration still outputs some irrelevant findings, so extraction precision has room to improve.
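The pattern of some irrelevant findings alongside very good recall corresponds to high recall with lower precision. The sketch below shows the arithmetic on invented data. It assumes exact string matching between predicted and gold findings, which is a simplification chosen for illustration; the matching criterion used in the paper is not stated in the abstract.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted findings against gold findings.

    Assumes exact-match comparison, a simplification for illustration.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented example: every gold finding is recovered, plus one irrelevant one.
gold = ["4 mm nodule", "small pleural effusion", "left basal atelectasis"]
predicted = gold + ["mild cardiomegaly"]

print(precision_recall(predicted, gold))  # (0.75, 1.0)
```

Here recall is 3/3 because nothing is missed, while precision is 3/4 because one of the four predicted findings is irrelevant.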
Expert Commentary
This work is a meaningful advance at the intersection of natural language processing and radiology. The structured timeline format plausibly matches how clinicians review a patient's history, comparing findings date by date, which should make recall and verification easier. The comparison against human annotators adds credibility to the grouping results. However, the article could have elaborated further on the sources of the irrelevant findings, specifically whether they stem from semantic ambiguity in the LLM’s extraction phase or from labeling inconsistencies in the dataset. Future iterations could explore hybrid models that combine rule-based filtering with LLM-driven grouping, which may improve precision without sacrificing recall. Overall, RadTimeline shows that a well-designed intermediate step, here group name generation, can materially improve automated summarization, and the dataset offers a useful reference point for AI-assisted medical documentation.
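The hybrid idea suggested above could start with a simple rule-based pre-filter that runs before LLM grouping. The sketch below is an assumption-laden illustration rather than anything proposed in the paper: the keyword list `LUNG_TERMS` and the function `keep_lung_findings` are hypothetical, and a real system would need a clinically reviewed vocabulary.

```python
import re

# Hypothetical keyword list, for illustration only.
LUNG_TERMS = re.compile(
    r"\b(?:lungs?|lobes?|pulmonary|pleural|nodules?|opacit(?:y|ies)|"
    r"consolidation|atelectasis)\b",
    re.IGNORECASE,
)

def keep_lung_findings(findings):
    """Keep only findings that mention at least one lung-related term."""
    return [f for f in findings if LUNG_TERMS.search(f)]

# Invented example findings.
candidates = [
    "4 mm nodule in the right upper lobe",
    "Mild cardiomegaly",
    "Small pleural effusion",
]
print(keep_lung_findings(candidates))
```

A keyword filter like this trades recall for precision: it drops clearly off-topic findings but can also discard relevant ones that use terms outside the list, so its effect would need to be measured rather than assumed.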
Recommendations
- ✓ Develop hybrid filtering mechanisms to enhance specificity in grouped findings
- ✓ Expand the RadTimeline dataset to include other imaging modalities for broader applicability
Sources
Original: arXiv - cs.CL