CLiGNet: Clinical Label-Interaction Graph Network for Medical Specialty Classification from Clinical Transcriptions
arXiv:2603.22752v1 Announce Type: new Abstract: Automated classification of clinical transcriptions into medical specialties is essential for routing, coding, and clinical decision support, yet prior work on the widely used MTSamples benchmark suffers from severe data leakage caused by applying SMOTE oversampling before train/test splitting. We first document this methodological flaw and establish a leakage-free benchmark across 40 medical specialties (4,966 records), revealing that the true task difficulty is substantially higher than previously reported. We then introduce CLiGNet (Clinical Label-Interaction Graph Network), a neural architecture that combines a Bio-ClinicalBERT text encoder with a two-layer Graph Convolutional Network operating on a specialty label graph constructed from semantic similarity and ICD-10 chapter priors. Per-label attention gates fuse document and label-graph representations, trained with a focal binary cross-entropy loss to handle extreme class imbalance (181:1 ratio). Across seven baselines ranging from TF-IDF classifiers to Clinical-Longformer, CLiGNet without calibration achieves the highest macro-F1 of 0.279, with an ablation study confirming that the GCN label graph provides the single largest component gain (+0.066 macro-F1). Adding per-label Platt scaling calibration yields an expected calibration error of 0.007, demonstrating a principled trade-off between ranking performance and probability reliability. We provide comprehensive failure analysis covering pairwise specialty confusions, rare-class behaviour, document-length effects, and token-level Integrated Gradients attribution, offering actionable insights for clinical NLP system deployment.
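The leakage mechanism the abstract describes can be illustrated with a minimal, self-contained sketch. Naive duplication stands in for SMOTE's synthetic interpolation, and all record names are illustrative; the point is only the ordering of oversampling and splitting:

```python
import random

random.seed(0)

# Toy imbalanced dataset: 95 majority-class and 5 minority-class records.
data = [(f"maj_{i}", 0) for i in range(95)] + [(f"min_{i}", 1) for i in range(5)]

def oversample(records):
    """Naive stand-in for SMOTE: duplicate minority records until balanced."""
    minority = [r for r in records if r[1] == 1]
    majority = [r for r in records if r[1] == 0]
    if not minority:
        return records
    copies = [random.choice(minority)
              for _ in range(len(majority) - len(minority))]
    return majority + minority + copies

def split(records, test_frac=0.2):
    """Random train/test split."""
    shuffled = records[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * test_frac)
    return shuffled[cut:], shuffled[:cut]

# WRONG order (the flaw in prior MTSamples work): oversample, then split.
# Copies of the same minority record land on both sides of the split.
train_w, test_w = split(oversample(data))
leaked_wrong = {r[0] for r in train_w} & {r[0] for r in test_w}
print(f"record ids in both train and test: {len(leaked_wrong)}")  # > 0: leakage

# CORRECT order: split first, then oversample the training set only.
train_c, test_c = split(data)
train_c = oversample(train_c)
leaked_correct = {r[0] for r in train_c} & {r[0] for r in test_c}
print(f"overlap after splitting first: {len(leaked_correct)}")  # 0
```

Because the wrong ordering lets the model memorize near-copies of test records, reported scores inflate, which is why the leakage-free benchmark reveals a much harder task.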
Executive Summary
The CLiGNet paper addresses a critical gap in clinical NLP by identifying and correcting a persistent data-leakage flaw in prior MTSamples benchmark studies, thereby enabling a more accurate assessment of how difficult it is to classify clinical transcriptions into medical specialties. The proposed CLiGNet architecture integrates Bio-ClinicalBERT with a graph-based label-interaction network informed by semantic similarity and ICD-10 chapter priors, outperforming seven baselines without calibration (macro-F1 0.279) and, with per-label Platt scaling, delivering reliable probability estimates (ECE 0.007). The study’s rigorous benchmarking, ablation analysis, and failure diagnostics represent a significant methodological advance in clinical domain adaptation and classifier interpretability.
Key Points
- ▸ Identification and correction of SMOTE-induced data leakage in prior MTSamples benchmarking
- ▸ Introduction of CLiGNet, pairing a Bio-ClinicalBERT encoder with a GCN over a specialty label graph
- ▸ Highest macro-F1 (0.279) among seven baselines, with low calibration error (ECE 0.007) via per-label Platt scaling
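The focal binary cross-entropy objective mentioned in the abstract down-weights easy, well-classified examples so that rare positive labels dominate the gradient. A minimal sketch of the standard focal-loss form (the gamma and alpha values are illustrative defaults, not the paper's reported hyperparameters):

```python
import math

def focal_bce(p, y, gamma=2.0, alpha=0.25):
    """Focal binary cross-entropy for a single label.

    p: predicted probability, y: 0/1 target. The (1 - p_t)^gamma factor
    shrinks the loss of well-classified examples, so under extreme class
    imbalance (e.g. 181:1) rare positives still drive learning.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy negative contributes almost nothing; a confidently wrong
# prediction on a rare positive is penalized heavily.
easy = focal_bce(0.05, 0)   # confident and correct
hard = focal_bce(0.05, 1)   # confident and wrong
print(easy < hard)  # True
```

The same formula is applied independently per label in a multi-label setup, which matches the per-label framing used throughout the paper.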
Merits
Methodological Rigor
CLiGNet’s evaluation demonstrates a disciplined approach to identifying and mitigating systematic bias in clinical benchmarks, enhancing the reproducibility and generalizability of reported results.
Performance Innovation
The integration of a semantics-aware graph convolutional network with a clinical language model yields measurable macro-F1 gains, and per-label calibration can be layered on afterwards to restore probability reliability, offering a template for future domain-specific NLP systems.
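The two components this merit refers to can be sketched together in a few lines of NumPy: one symmetrically normalized GCN propagation step over a label adjacency matrix, followed by a per-label sigmoid gate that fuses a document vector with each label's graph embedding. All shapes, weights, and the gating form are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8  # toy sizes: 4 specialty labels, 8-dim embeddings

# Symmetric label graph (in the paper: semantic similarity + ICD-10 priors).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def gcn_layer(A, X, W):
    """One GCN step: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

X = rng.normal(size=(L, d))                       # initial label embeddings
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
labels = gcn_layer(A, gcn_layer(A, X, W1), W2)    # two-layer GCN, as in the paper

# Per-label attention gate: mix the document vector with each label vector.
doc = rng.normal(size=(d,))                       # e.g. a [CLS]-style encoding
gate = 1.0 / (1.0 + np.exp(-(labels @ doc)))      # sigmoid score per label, (L,)
fused = gate[:, None] * labels + (1 - gate)[:, None] * doc
scores = fused @ doc                              # one logit per label
print(scores.shape)  # (4,)
```

The per-label logits would then feed the focal BCE objective; in the real model `doc` comes from Bio-ClinicalBERT rather than a random vector.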
Demerits
Computational Complexity
The added GCN layers, label-graph machinery, and per-label attention gates may increase inference latency, potentially limiting scalability in real-time clinical deployment environments.
Expert Commentary
CLiGNet is a meaningful step forward in clinical NLP evaluation methodology. The detection of SMOTE-induced leakage is more than a technical correction: because oversampling before the train/test split places near-copies of the same minority records on both sides of the split, earlier reported scores partly measured memorization rather than generalization, and the leakage-free benchmark resets the field's sense of true task difficulty. The coupling of contextual embeddings with a semantic label graph is also well motivated, since it injects prior knowledge about how specialties relate directly into the classifier, and the ablation confirms this is where the largest gain comes from. The per-label Platt scaling matters for practice as well: calibrated probabilities let clinicians weigh diagnostic uncertainty instead of trusting overconfident raw scores. The paper improves a metric, but more importantly it raises the methodological bar for automated clinical classification.
Recommendations
- ✓ Adopt CLiGNet’s benchmarking framework as a standard for future medical specialty classification studies.
- ✓ Integrate per-label calibration mechanisms into clinical NLP deployment pipelines as a default best practice.
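The second recommendation can be made concrete with a minimal sketch of Platt scaling for one label plus the expected-calibration-error (ECE) metric the paper reports. The toy "overconfident classifier" (logits exactly twice the true logits) and the gradient-descent fit are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def platt_fit(s, y, lr=0.1, steps=2000):
    """Platt scaling for one label: fit sigmoid(a*s + b) to (score, label)
    pairs by gradient descent on binary cross-entropy."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        g = sigmoid(a * s + b) - y          # dBCE/dlogit
        a -= lr * np.mean(g * s)
        b -= lr * np.mean(g)
    return a, b

def ece(p, y, bins=10):
    """Expected calibration error: bin-weighted |accuracy - confidence|."""
    err = 0.0
    for i in range(bins):
        lo, hi = i / bins, (i + 1) / bins
        m = (p >= lo) & (p < hi) if i < bins - 1 else (p >= lo)
        if m.any():
            err += m.mean() * abs(y[m].mean() - p[m].mean())
    return err

# Toy overconfident classifier: its logits are twice the true logits.
p_true = np.clip(rng.random(5000), 0.01, 0.99)
y = (rng.random(5000) < p_true).astype(float)
s = 2.0 * np.log(p_true / (1.0 - p_true))   # raw model logits
raw = sigmoid(s)                            # overconfident probabilities

a, b = platt_fit(s, y)                      # fit on held-out data in practice
cal = sigmoid(a * s + b)
print(f"ECE raw={ece(raw, y):.3f}  calibrated={ece(cal, y):.3f}")
```

Run per label and fit only on held-out validation scores, this is the kind of lightweight post-hoc step that leaves the ranking (macro-F1) untouched while fixing probability reliability.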
Sources
Original: arXiv - cs.AI