This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA
arXiv:2604.05051v1 Announce Type: new Abstract: Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.
Executive Summary
This study evaluates the sensitivity of large language models (LLMs) to patient question framing in medical question answering (QA). The researchers constructed a dataset of 6,614 query pairs grounded in clinical trial abstracts and examined two dimensions of query variation: question framing (positive vs. negative) and language style (technical vs. plain language). Across eight LLMs, positively- and negatively-framed pairs were significantly more likely to produce contradictory conclusions than same-framing pairs, and sustained persuasion in multi-turn conversations further increased inconsistency; no significant interaction between framing and language style was found. The results highlight phrasing robustness as an important evaluation criterion for RAG-based systems in high-stakes settings.
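The core measurement described above, comparing conclusions across framed query pairs grounded in the same evidence, can be sketched as follows. This is a minimal illustrative harness, not the paper's actual pipeline: the function names and the keyword-based verdict heuristic are assumptions, and `model(evidence, query)` stands in for any LLM call that returns an answer string.

```python
# Illustrative sketch of pairwise framing-consistency evaluation.
# The verdict heuristic and all names here are hypothetical, not
# the paper's method; a real setup would use a stronger classifier.

def classify_conclusion(response: str) -> str:
    """Crude keyword heuristic mapping a response to a verdict."""
    text = response.lower()
    if "not effective" in text or "does not work" in text:
        return "refutes"
    if "effective" in text or "works" in text:
        return "supports"
    return "unclear"

def is_contradictory(resp_pos: str, resp_neg: str) -> bool:
    """A pair contradicts when the two verdicts are definite and opposed."""
    verdicts = {classify_conclusion(resp_pos), classify_conclusion(resp_neg)}
    return verdicts == {"supports", "refutes"}

def contradiction_rate(model, evidence: str, pairs) -> float:
    """Fraction of (positive, negative) query pairs that yield
    contradictory conclusions when grounded in the same evidence."""
    flags = [
        is_contradictory(model(evidence, pos), model(evidence, neg))
        for pos, neg in pairs
    ]
    return sum(flags) / len(flags) if flags else 0.0
```

In this framing, a perfectly phrasing-robust model would score a contradiction rate of 0.0 regardless of how each question in a pair is worded.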
Key Points
- LLMs are sensitive to patient question framing in medical QA.
- Positively- and negatively-framed pairs produce contradictory conclusions more often than same-framing pairs.
- Sustained persuasion in multi-turn conversations increases inconsistency.
- Phrasing robustness is crucial for RAG-based systems in high-stakes settings.
Merits
Strength in Methodology
The study uses a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting, grounding queries in expert-selected documents rather than automatic retrieval. This isolates the effect of query phrasing from retrieval variability, strengthening the validity of the findings.
Insight into Patient Query Variation
The researchers examine two dimensions of patient query variation, question framing and language style, and show that framing drives inconsistency while its interaction with language style does not, a useful decomposition of how wording affects medical QA.
Demerits
Limited Generalizability
The study focuses on a specific dataset and RAG setting, which may limit the generalizability of the findings to other domains or systems.
Need for Further Investigation
The study highlights the importance of phrasing robustness but does not provide an in-depth analysis of the underlying mechanisms driving this phenomenon.
Expert Commentary
The study's findings have significant implications for the development and deployment of RAG-based systems in medical QA. The sensitivity of LLMs to patient question framing highlights the need for careful attention to phrasing and the potential for sustained persuasion to increase inconsistency. To address these challenges, researchers and developers should prioritize the development of phrasing robustness as a key evaluation criterion for RAG-based systems. Additionally, regulatory bodies should consider the potential impact of question framing on AI decision-making in medical QA and develop guidelines to ensure phrasing robustness in high-stakes applications.
Recommendations
- Develop and deploy phrasing robustness testing as a standard evaluation criterion for RAG-based systems in medical QA.
- Conduct further research into the underlying mechanisms driving the impact of question framing on LLM responses, including the role of bias in AI decision-making and human-AI collaboration.
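A phrasing-robustness test of the kind recommended above could be sketched as an agreement score over phrasing variants of the same question. This is a hypothetical sketch under assumed names: the `verdict` heuristic is a stand-in for a proper answer classifier, and `model(evidence, query)` is any callable returning an answer string.

```python
# Hypothetical phrasing-robustness check: ask the same question in
# several phrasings against the same evidence and measure how often
# the model's verdict matches the majority verdict.
from collections import Counter

def verdict(response: str) -> str:
    """Crude stand-in classifier; a real harness would use a stronger one."""
    text = response.lower()
    if "not effective" in text:
        return "refutes"
    if "effective" in text:
        return "supports"
    return "unclear"

def phrasing_robustness(model, evidence: str, variants) -> float:
    """Agreement score in [0, 1]: the share of phrasing variants whose
    verdict matches the majority verdict. 1.0 means fully consistent."""
    verdicts = [verdict(model(evidence, v)) for v in variants]
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)
```

A deployment gate might require this score to exceed a threshold (e.g. near 1.0) on a held-out set of framed and restyled variants before a RAG system is released for patient-facing use.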
Sources
Original: arXiv - cs.CL