Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
arXiv:2603.00917v1 Announce Type: new Abstract: Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
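The abstract reports consistency scores across prompt phrasings without publishing the exact formula. A common choice for this kind of metric is pairwise agreement over prompt variants, sketched below; the metric definition, function names, and example answers are assumptions for illustration, not the paper's published method.

```python
from itertools import combinations

def consistency_score(answers):
    """Fraction of prompt-style pairs that yield the same answer.

    `answers` maps prompt style -> the model's chosen option for one
    question. Pairwise agreement is an assumed metric; the paper's
    abstract does not specify its exact consistency formula.
    """
    pairs = list(combinations(answers.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# One multiple-choice question answered under the five prompt styles
# used in the study (hypothetical answers).
answers = {
    "original": "B", "formal": "B", "simplified": "B",
    "roleplay": "D", "direct": "B",
}
print(consistency_score(answers))  # 6 agreeing pairs of 10 -> 0.6
```

Under this definition a model that always picks the same wrong option scores a perfect 1.0, which is exactly the "reliably wrong" failure mode the authors warn about: consistency must be read alongside accuracy.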
Executive Summary
This study assesses five open-source large language models on clinical question answering (QA) tasks, focusing on how prompt phrasing affects reliability and accuracy. A central finding is that consistency and accuracy must be evaluated jointly, since high consistency does not imply correctness. Roleplay prompts significantly reduced accuracy across all models, and domain pretraining alone proved insufficient for structured clinical QA. Among the models tested, Llama 3.2 offered the best balance of accuracy and reliability for low-resource healthcare deployment, though the authors stress that safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
Key Points
- ▸ Prompt phrasing significantly affects the performance of large language models on clinical QA tasks.
- ▸ High consistency does not necessarily imply correctness, highlighting the importance of accuracy evaluation.
- ▸ Roleplay prompts can reduce accuracy and should be avoided in healthcare applications.
- ▸ Domain pretraining alone is insufficient for structured clinical QA.
- ▸ Llama 3.2 shows a strong balance of accuracy and reliability for low-resource deployment.
Merits
Methodological rigor
The study employs a comprehensive evaluation framework, including multiple prompt styles, datasets, and models, to assess the impact of prompt phrasing on model performance.
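The evaluation framework scores each model × prompt-style × dataset cell on accuracy and on instruction-following failure (the abstract's "UNKNOWN rate", e.g. Meditron-7B's 99.0% on PubMedQA). A minimal sketch of scoring one such cell follows; the parsing convention that unparseable responses are marked "UNKNOWN", and the function name, are assumptions, since the abstract reports only the resulting rates.

```python
def evaluate_cell(predictions, gold):
    """Score one model/prompt-style/dataset cell of the evaluation grid.

    `predictions` are parsed model answers, with "UNKNOWN" marking a
    response from which no option letter could be extracted (an
    instruction-following failure). Returns accuracy and UNKNOWN rate.
    """
    n = len(predictions)
    unknown = sum(p == "UNKNOWN" for p in predictions)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return {"accuracy": correct / n, "unknown_rate": unknown / n}

# Hypothetical four-question cell: one unparseable response.
print(evaluate_cell(["A", "UNKNOWN", "C", "C"], ["A", "B", "C", "D"]))
# {'accuracy': 0.5, 'unknown_rate': 0.25}
```

Tracking the UNKNOWN rate separately from accuracy is what exposes failures like Meditron-7B's: a model can have low accuracy either because it answers incorrectly or because it never produces a parseable answer at all, and the two demand different remedies.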
Relevance to low-resource healthcare
The study's focus on open-source models and low-resource healthcare settings highlights the practical implications of the research for real-world applications.
Demerits
Limited model variety
The study only evaluates five models, which may not be representative of the broader range of available models, limiting the generalizability of the findings.
Limited exploration of prompt styles
While the study evaluates five prompt styles, it does not explore the impact of other prompt attributes, such as length or complexity, on model performance.
Expert Commentary
This study provides a timely and comprehensive evaluation of open-source large language models on clinical QA tasks. Its most important contribution is showing that consistency and accuracy are largely independent: a model can be reliably wrong, which is a dangerous failure mode in clinical settings. The results also argue for more transparent and explainable models in clinical applications, and for stronger regulation and safety protocols governing clinical AI development and deployment. The methodological rigor and practical focus on low-resource settings make this a valuable contribution to the field.
Recommendations
- ✓ Developers of clinical AI models should prioritize the development of more transparent and explainable models that can provide clear insights into their decision-making processes.
- ✓ Regulatory bodies should establish standards for clinical AI safety and efficacy, including requirements for transparency, explainability, and evaluation protocols, to ensure safe and effective deployment.