Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik · March 26, 2026

#cs.CL #cs.AI

arXiv:2603.23515v1

Abstract: Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.

Executive Summary

This article presents an approach to adapting a large language model (LLM) for medical coding using privacy-preserving synthetic clinical data. The authors fine-tune Llama 3-70B, an open-weight foundation model, on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. Exact-match F1 rises from 0.18 for the zero-shot baseline to above 0.70 after fine-tuning, an absolute gain of more than 0.52. The results indicate that synthetic, policy-aware data can efficiently teach a general-purpose LLM to support precise medical coding without exposing protected health information. This approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations. The findings are relevant to improving the accuracy and reliability of medical coding, which the authors link to reduced clinician burnout and support for revenue cycle processes.

Key Points

▸ The authors fine-tune an open-weight foundation model (Llama 3-70B) for medical coding using privacy-preserving synthetic clinical data.
▸ Exact-match F1 for ICD-10-CM and CPT code prediction rises from 0.18 (zero-shot) to above 0.70 after fine-tuning.
▸ Synthetic, policy-aware data can efficiently teach a general-purpose LLM to support precise medical coding without exposing protected health information.
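
To make the exact-code metric concrete, the following is a minimal sketch of how an exact-match F1 could be computed over per-note code sets. The abstract does not state how the authors aggregate the score, so micro-averaging is an assumption here, and the function name `exact_match_f1` and the sample notes are hypothetical.

```python
from typing import Iterable, Set, Tuple


def exact_match_f1(
    gold: Iterable[Set[str]], pred: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-note code sets.

    A predicted code counts as correct only if the identical code string
    appears in the gold set for the same note. There is no partial credit
    for a code that merely shares a category prefix.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # codes predicted and present in gold
        fp += len(p - g)  # codes predicted but absent from gold
        fn += len(g - p)  # gold codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two notes, one near-miss ICD-10-CM code (I11.9 vs I10).
gold = [{"E11.9", "I10"}, {"99213"}]
pred = [{"E11.9", "I11.9"}, {"99213"}]
precision, recall, f1 = exact_match_f1(gold, pred)
```

In this example the near-miss `I11.9` counts as both a false positive and a missed gold code, which is why exact-match scoring is stricter than category-level scoring.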

Merits

Strength in Adaptability

The authors report that performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and that the fine-tuned model retained its performance on medical comprehension tasks.

Strength in Scalability

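The use of synthetic, policy-aware data enables the efficient training of a general-purpose LLM and, because such data can be generated for specific tasks, supports iterative retraining. This may reduce reliance on large volumes of real-world records, although the abstract does not quantify data or cost savings.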

Strength in Protecting Patient Data

The authors' approach ensures that protected health information is not exposed during the training process, mitigating concerns around data privacy and security.
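
As an illustration of why template-based generation can avoid exposing patient records, here is a minimal sketch of producing note and gold-code pairs from a template. The abstract does not describe the authors' actual templates or coding policies, so everything here is an assumption: the `NoteTemplate` class, the `generate_pair` function, the template text, and the value ranges are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NoteTemplate:
    text: str                    # note text with {placeholders}
    gold_codes: Tuple[str, ...]  # codes the template is written to warrant


# Hypothetical template; not taken from the paper.
TEMPLATES = [
    NoteTemplate(
        text=(
            "{age}-year-old patient seen for follow-up of type 2 diabetes "
            "without complications. Blood pressure {sbp}/{dbp}; essential "
            "hypertension remains on current therapy."
        ),
        gold_codes=("E11.9", "I10"),
    ),
]


def generate_pair(rng: random.Random) -> dict:
    """Fill a template with sampled values and attach its gold codes.

    Values are drawn from a random generator and not copied from any
    record, so the resulting note contains no real patient data.
    """
    template = rng.choice(TEMPLATES)
    note = template.text.format(
        age=rng.randint(40, 90),
        sbp=rng.randint(120, 160),
        dbp=rng.randint(70, 100),
    )
    return {"note": note, "codes": sorted(template.gold_codes)}


# Seeded generators make the synthetic corpus reproducible.
pairs = [generate_pair(random.Random(seed)) for seed in range(3)]
```

Because the gold codes are fixed by the template and not assigned afterwards, each generated note comes with a label by construction. A real pipeline would need many templates and a policy layer to cover long-tail codes.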

Demerits

Limitation in Generalizability

The abstract reports results only for exact-code prediction of ICD-10-CM and CPT codes. Whether the gains carry over to other code systems, institutions, or documentation styles is not addressed there.

Limitation in Interpretability

The abstract does not describe any analysis of the LLM's decision-making process, making it difficult to assess the model's underlying reasoning and potential biases.

Expert Commentary

The authors' approach is a notable step for automated medical coding, showing that a general-purpose LLM can be adapted to an expert-level coding task through fine-tuning on synthetic data. However, further research is needed to address the limitations of the approach, particularly in terms of generalizability and interpretability. Additionally, because the synthetic data is derived from electronic health records, its use raises questions about data protection and security in healthcare AI applications that will require careful consideration and regulation.

Recommendations

✓ Future research should focus on developing more explainable and transparent AI models for medical coding, particularly in complex decision-making scenarios.
✓ Healthcare organizations should prioritize the use of synthetic, policy-aware data in AI-driven medical coding applications to mitigate concerns around data protection and security.

Sources

Original: arXiv - cs.CL
