Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
arXiv:2602.21374v1. Abstract: Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B--8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
Executive Summary
This study evaluates a two-step pipeline for extracting clinical information from medical transcripts in Persian, a low-resource language. The pipeline combines a Persian-to-English translation model with five small language models, achieving promising results for binary extraction of 13 clinical features. The best-performing model, Qwen2.5-7B-Instruct, achieved a median macro-F1 score of 0.899 and a Matthews Correlation Coefficient of 0.797. The study highlights the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
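The abstract does not disclose the paper's actual prompts, so the second (extraction) step of the pipeline can only be sketched. Below is a minimal illustration of what a few-shot binary-extraction prompt and its output parsing might look like; the feature names, prompt wording, and JSON answer format are all hypothetical, not taken from the paper.

```python
import json

# Hypothetical subset of the 13 binary clinical features; the paper's
# exact feature list is not given in the abstract.
FEATURES = ["pain", "nausea", "anxiety", "medication_request"]

def build_extraction_prompt(transcript_en: str) -> str:
    """Few-shot style prompt asking an SLM for binary feature flags as JSON."""
    keys = ", ".join(f'"{f}": 0 or 1' for f in FEATURES)
    return (
        "You are a clinical information extractor.\n"
        f"Given the call transcript below, answer with JSON: {{{keys}}}.\n\n"
        "Example transcript: 'Patient reports severe pain, asks for a morphine refill.'\n"
        'Example answer: {"pain": 1, "nausea": 0, "anxiety": 0, "medication_request": 1}\n\n'
        f"Transcript: '{transcript_en}'\nAnswer:"
    )

def parse_flags(model_output: str) -> dict:
    """Parse the model's JSON reply into booleans; missing keys count as absent (0)."""
    raw = json.loads(model_output)
    return {f: bool(raw.get(f, 0)) for f in FEATURES}

prompt = build_extraction_prompt("Caller mentions constant nausea after chemo.")
# A plausible (invented) model reply for the transcript above:
reply = '{"pain": 0, "nausea": 1, "anxiety": 0, "medication_request": 0}'
print(parse_flags(reply))
```

Treating missing keys as negative predictions matters here: the abstract notes that translation to English reduced "missing outputs", which suggests incomplete model replies were a real failure mode in the untranslated setting.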
Key Points
- ▸ The study proposes a two-step pipeline for clinical information extraction in low-resource languages
- ▸ The pipeline combines a Persian-to-English translation model with small language models
- ▸ The best-performing model achieved a median macro-F1 score of 0.899 and a Matthews Correlation Coefficient of 0.797
Merits
Effective Use of Small Language Models
The study demonstrates the potential of small language models for clinical information extraction in low-resource languages, which can be particularly useful in settings with limited infrastructure and annotation resources.
Improvement in Sensitivity and Robustness
A bilingual analysis with the Aya-expanse-8B translation model showed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance.
Demerits
Class Imbalance and Limited Feature Extraction
The study acknowledges the challenge of class imbalance and the unreliable extraction of certain clinical features, such as psychological complaints, administrative requests, and complex somatic features.
Trade-off between Sensitivity and Specificity
Translating transcripts to English improved sensitivity but slightly reduced specificity and precision, so the choice of input language involves a genuine trade-off that requires further optimization.
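The trade-off above is easiest to see by computing the reported metrics from a confusion matrix. The toy counts below are invented for illustration (they are not the paper's numbers); the formulas are the standard definitions of sensitivity, specificity, F1, and MCC used in the study.

```python
import math

def sensitivity(tp, fn):
    # Recall on the positive class: fraction of true positives recovered.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of true negatives correctly rejected.
    return tn / (tn + fp)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient: stays informative under class imbalance.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical confusion matrix for one clinical feature:
tp, tn, fp, fn = 40, 45, 5, 10
print(f"sensitivity={sensitivity(tp, fn):.3f}",
      f"specificity={specificity(tn, fp):.3f}",
      f"F1={f1(tp, fp, fn):.3f}",
      f"MCC={mcc(tp, tn, fp, fn):.3f}")
```

Shifting a model's decision threshold (or, here, its input language) typically moves counts between FN and FP, raising sensitivity while lowering specificity or precision; MCC summarizes all four cells, which is why the study pairs it with macro-F1 for imbalanced features.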
Expert Commentary
This study makes a significant contribution to the field of healthcare NLP, demonstrating the potential of small language models for clinical information extraction in low-resource languages. The use of a two-step pipeline and the evaluation of multiple models provide valuable insights into how model scale and input-language strategy jointly affect performance. However, the study also highlights the challenges of class imbalance and the unreliable extraction of certain clinical features, which require further research and development. Overall, the study provides a comprehensive and well-reasoned approach to addressing the complexities of clinical information extraction in multilingual settings.
Recommendations
- ✓ Further research is needed to address class imbalance and improve the extraction of challenging clinical features, such as psychological complaints and complex somatic features.
- ✓ Building more inclusive and equitable healthcare systems requires investment in healthcare NLP research, particularly for low-resource languages, to address healthcare disparities and improve health outcomes.