Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
arXiv:2602.17475v1 Announce Type: new Abstract: Large Language Models (LLMs) consistently excel at diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can perform medical tasks effectively while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pre-training). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines: our best configuration, based on Qwen3-1.7B, achieves an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all publicly available Italian medical NLP datasets, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pre-training.
Executive Summary
This study investigates the effectiveness of small language models (around one billion parameters) on medical Natural Language Processing (NLP) tasks in Italian. The authors systematically compare adaptation strategies at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pre-training). Fine-tuning emerges as the most effective approach, while combining few-shot prompting with constraint decoding offers a strong lower-resource alternative. Notably, the best configuration, based on Qwen3-1.7B, achieves an average score 9.2 points higher than Qwen3-32B. The study contributes to the field by releasing a comprehensive collection of Italian medical NLP datasets and its top-performing models, together with new corpora for continual pre-training: 126M words from the emergency department of an Italian hospital and 175M words from various other sources.
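As a minimal, hypothetical sketch of the inference-time setup (the helper name, prompt layout, and the Italian example are assumptions for illustration, not the authors' code), few-shot prompting amounts to prepending a handful of worked input/output pairs to the query before sending it to the model:

```python
def build_few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: task instruction, k worked examples, then the query.

    `examples` holds (input, expected output) pairs; the model is expected
    to continue the text after the final "Output:" marker.
    """
    parts = [instruction]
    for text, answer in examples:
        parts.append(f"Input: {text}\nOutput: {answer}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Invented clinical NER-style example in Italian ("extract DRUG entities").
prompt = build_few_shot_prompt(
    instruction="Estrai le entità di tipo FARMACO dal testo.",
    examples=[("Somministrato paracetamolo 500 mg.", "paracetamolo")],
    query="Il paziente assume ibuprofene al bisogno.",
)
```

The same template is reused across tasks; only the instruction and the example pairs change.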
Key Points
- ▸ Small language models can effectively perform medical NLP tasks with competitive accuracy.
- ▸ Fine-tuning emerges as the most effective adaptation strategy.
- ▸ Combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives.
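Constraint decoding, the other inference-time strategy listed above, can be sketched as masking the model's next-token scores so that only tokens from a task-specific output vocabulary can ever be emitted. This is a toy illustration, not the authors' implementation: `score_fn` stands in for an LLM's next-token logits, and token ids are arbitrary.

```python
def constrained_greedy_decode(score_fn, valid_ids, max_len, eos_id):
    """Greedy decoding restricted to a set of allowed token ids.

    score_fn(prefix) -> dict mapping token id -> score (a stand-in for a
    language model's next-token logits); valid_ids is the allowed output
    vocabulary, e.g. the tokens of legal entity labels or form-field values.
    """
    prefix = []
    for _ in range(max_len):
        scores = score_fn(prefix)
        # Mask: at every step, consider only allowed tokens (plus EOS),
        # even if a disallowed token has the highest raw score.
        allowed = {t: s for t, s in scores.items() if t in valid_ids or t == eos_id}
        next_id = max(allowed, key=allowed.get)
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix


# Toy model: token 0 has the highest score first, but is not a valid output,
# so the constraint forces token 1 instead; then EOS (id 9) ends decoding.
def score_fn(prefix):
    if not prefix:
        return {0: 0.9, 1: 0.5, 2: 0.1, 9: 0.0}
    return {0: 0.1, 1: 0.2, 2: 0.3, 9: 0.8}

output = constrained_greedy_decode(score_fn, valid_ids={1, 2}, max_len=5, eos_id=9)
```

In practice the mask is applied over the model's full logits at each decoding step, which guarantees that generations stay within the valid output space regardless of how small the model is.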
Merits
Methodological Rigor
The study employs a systematic and comprehensive approach, comparing multiple adaptation strategies across various medical NLP tasks.
Practical Implications
The findings have significant practical implications for the deployment of language models in real-world healthcare settings, where computational resources may be limited.
Contribution to the Field
The release of all publicly available Italian medical NLP datasets and the top-performing models, together with new continual pre-training corpora (126M words of emergency-department text from an Italian hospital and 175M words from various other sources), is a substantial resource contribution to the field.
Demerits
Limited Generalizability
The study focuses on Italian language and medical datasets, which may limit the generalizability of the findings to other languages and domains.
Dependence on Data Quality
The performance of the models may be sensitive to the quality of the training data, which can be a limitation in real-world settings where data may be noisy or incomplete.
Expert Commentary
This study makes a significant contribution to medical NLP by demonstrating that small language models can perform clinical NLP tasks competitively. The findings have practical implications for deploying language models in real-world healthcare settings, where computational resources are often limited. The systematic comparison of inference-time and training-time adaptation strategies gives practitioners concrete guidance on the trade-off between accuracy and resource cost. At the same time, clinical deployment will also demand transparency and explainability, so that the models' decision-making can be scrutinized. The released datasets and top-performing models should facilitate further research in this direction. However, the focus on Italian-language medical data may limit how well the findings transfer to other languages and domains.
Recommendations
- ✓ Future studies should investigate the effectiveness of adaptation strategies across multiple languages and domains to improve the generalizability of the findings.
- ✓ Developers of NLP models should prioritize the development of more transparent and explainable models that can provide insights into their decision-making processes.