A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients

Joakim Edin, Sedrah Butt Balaganeshan, Annike Kjølby Kristensen, Lars Maaløe, Ioannis Louloudis, Søren Brunak · March 4, 2026

#cs.LG

arXiv:2603.00221v1. Abstract: Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006–2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53–91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which 76–86% were confirmed valid upon manual review, suggesting systematic under-coding rather than model error. These findings suggest under-coding of secondary diagnoses in Eastern Denmark during this period, with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity. Similar time constraints and reimbursement structures in other healthcare systems suggest this may not be isolated to this dataset. The model can automate coding for approximately 50% of cases and provide accurate suggestions for most others, and may offer a practical solution to help capture missed secondary conditions.

Executive Summary

This article presents a language model trained on 5.8 million electronic health records from 1.8 million patients in Eastern Denmark (2006–2016). The model predicts ICD-10 codes from clinical notes, medications, and laboratory results. On 270,000 held-out patients it reaches a micro F1 of 71.8% and a top-10 recall of 95.5%, with F1 ranging from 53% to 91% across specialties and the highest scores in specialties with well-defined diagnostic criteria. For three codes that appear mostly as secondary diagnoses (suicide-related behaviors, weight disorders, and hypertension), the model flagged thousands of uncoded cases, and manual review confirmed 76–86% of them as valid, which points to systematic under-coding rather than model error. The finding matters for epidemiological research, public health surveillance, and the study of multimorbidity. According to the authors, the model can automate coding for approximately 50% of cases and provide accurate suggestions for most others. Validation outside Eastern Denmark is still needed before wider adoption.

Key Points

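▸ The model predicts ICD-10 codes with a micro F1 of 71.8% and a top-10 recall of 95.5% on 270,000 held-out patients.
▸ For suicide-related behaviors, weight disorders, and hypertension, the model flags thousands of uncoded cases, 76–86% of which were confirmed on manual review, suggesting under-coding.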
▸ The model has potential to automate coding for approximately 50% of cases and provide accurate suggestions for most others.
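
The abstract does not say how a case is judged safe to code automatically. One plausible reading, sketched below, is a confidence band: a case is auto-coded only when every code's predicted probability is clearly high or clearly low, and is otherwise sent to a human coder with a ranked top-k suggestion list. The function name `triage`, the `low` and `high` cut-offs, and the probability dictionary are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def triage(
    code_probs: Dict[str, float],
    low: float = 0.1,    # hypothetical "clearly absent" cut-off
    high: float = 0.9,   # hypothetical "clearly present" cut-off
    k: int = 10,
) -> Tuple[str, List[str]]:
    """Decide whether a record can be auto-coded or needs human review.

    Returns ("auto", confident_codes) when no probability falls strictly
    between `low` and `high`; otherwise ("review", top_k_suggestions).
    """
    # Rank codes from most to least probable.
    ranked = sorted(code_probs, key=code_probs.get, reverse=True)
    uncertain = [c for c in ranked if low < code_probs[c] < high]
    if not uncertain:
        return "auto", [c for c in ranked if code_probs[c] >= high]
    return "review", ranked[:k]
```

The share of records that land in the "auto" branch depends entirely on where the cut-offs are set, so a figure like 50% would be a property of the chosen thresholds and the error rate they are tuned to tolerate.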

Merits

Strength in Data Size and Diversity

Training on 5.8 million records from 1.8 million patients across nearly all specialties exposes the model to far more patient heterogeneity than the small datasets used in earlier automation efforts, which should help it generalize across clinical scenarios.

High Accuracy in Predicting ICD-10 Codes

The model reaches a micro F1 of 71.8% and a top-10 recall of 95.5% overall, with per-specialty F1 between 53% and 91% and the best results in specialties with well-defined diagnostic criteria. This makes it a useful tool for clinical documentation and billing.
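
For readers unfamiliar with the two headline metrics in the abstract, the following is a minimal sketch of how micro F1 and top-k recall are commonly computed for multi-label code prediction. The paper's exact evaluation protocol is not given in the abstract, so the definitions here (pooling counts over all records, and counting a gold code as recalled if it appears in the record's top-k ranked predictions) are standard conventions assumed for illustration. The function names are hypothetical.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all records, then compute F1."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_k_recall(gold: List[Set[str]], ranked: List[List[str]], k: int = 10) -> float:
    """Share of all gold codes found among each record's k highest-ranked predictions."""
    hits = sum(len(g & set(r[:k])) for g, r in zip(gold, ranked))
    total = sum(len(g) for g in gold)
    return hits / total if total else 0.0
```

Because micro averaging pools counts across records, frequent codes dominate the score, which is one reason a per-specialty breakdown can look very different from the overall figure.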

Potential to Address Under-Coding

The model's ability to identify uncoded cases for certain secondary diagnoses highlights the potential to address systematic under-coding, which has significant implications for epidemiological research, public health surveillance, and understanding of multimorbidity.
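
The under-coding analysis can be pictured as a two-step procedure: flag records where the model is confident about a code that the human coders did not record, then have reviewers check a sample of those flags. The sketch below illustrates that idea under stated assumptions. The 0.9 threshold, the function names, and the data layout (dictionaries keyed by record id) are hypothetical, and the abstract does not describe how the authors selected cases for review.

```python
from typing import Dict, List, Set


def flag_uncoded(
    recorded: Dict[str, Set[str]],            # record id -> codes assigned by coders
    model_probs: Dict[str, Dict[str, float]],  # record id -> {code: model probability}
    code: str,
    threshold: float = 0.9,                   # hypothetical confidence cut-off
) -> List[str]:
    """Return sorted ids of records where `code` is predicted confidently but not recorded."""
    return sorted(
        rid
        for rid, codes in recorded.items()
        if code not in codes and model_probs.get(rid, {}).get(code, 0.0) >= threshold
    )


def confirmation_rate(review_verdicts: List[bool]) -> float:
    """Fraction of manually reviewed flags that reviewers judged to be valid."""
    return sum(review_verdicts) / len(review_verdicts) if review_verdicts else 0.0
```

Under this framing, a high confirmation rate means the apparent false positives are mostly missing labels, which is the interpretation the abstract gives for its 76–86% figure.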

Demerits

Limitation in Generalizability

All training and evaluation data come from a single region (Eastern Denmark) and a single period (2006–2016), so the model's performance in other healthcare systems and populations remains untested.

Dependence on Clinical Data Quality

The accuracy of the model is highly dependent on the quality and consistency of clinical data, which may vary significantly across different healthcare systems and providers.

Potential for Biases in Training Data

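The model learns from the codes that were actually recorded, so it may inherit their biases. The under-coding of secondary diagnoses reported in the paper is one example: missing labels in the training and evaluation data can both teach the model to omit those codes and lower its measured F1 for them.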

Expert Commentary

The paper's main contribution is scale: a coding model trained and evaluated on a population-wide cohort rather than a small, curated dataset. Its most consequential finding is less about automation than about data quality. Low F1 on codes that appear mostly as secondary diagnoses turned out, on manual review, to reflect missing labels more often than model mistakes. If that pattern holds elsewhere, studies that rely on coded secondary diagnoses may underestimate conditions such as hypertension, weight disorders, and suicide-related behaviors. The abstract notes that similar time constraints and reimbursement structures exist in other healthcare systems, so the problem may not be isolated to this dataset. In practice, the model could support coders by automating routine cases and surfacing likely missed secondary codes, provided it is first validated outside Eastern Denmark.

Recommendations

✓ The model should be validated and refined further before it is deployed for routine coding.
✓ The model should be tested in diverse healthcare settings and populations to assess its generalizability and performance.
✓ The findings should inform policy changes that address under-coding of secondary diagnoses, which would in turn improve epidemiological research, public health surveillance, and the understanding of multimorbidity.

Sources

arXiv - cs.LG

A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients


