Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation

Saurabh K Singh

arXiv:2603.19249v1. Abstract: Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of spelling correction as a retrieval preprocessing step in healthcare QA using real consumer queries. We conduct an error census across two public datasets -- the TREC 2017 LiveQA Medical track (104 consumer health questions) and HealthSearchQA (4,436 health queries from Google autocomplete) -- finding that 61.5% of real medical queries contain at least one spelling error, with a token-level error rate of 11.0%. We evaluate four correction methods -- conservative edit distance, standard edit distance (Levenshtein), context-aware candidate ranking, and SymSpell -- across three experimental conditions: uncorrected queries against an uncorrected corpus (baseline), uncorrected queries against a corrected corpus, and fully corrected queries against a corrected corpus. Using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages with TREC relevance judgments, we find that query correction substantially improves retrieval -- edit distance and context-aware correction achieve MRR improvements of +9.2% and NDCG@10 improvements of +8.3% over the uncorrected baseline. Critically, correcting only the corpus without correcting queries yields minimal improvement (+0.5% MRR), confirming that query-side correction is the key intervention. We complement these results with a 100-sample error analysis categorising correction outcomes per method and provide evidence-based recommendations for practitioners.
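To make the "standard edit distance (Levenshtein)" method concrete, the following is a minimal sketch of dictionary-based correction. It is not the authors' implementation: the vocabulary, the `max_distance` threshold, and the function names `correct_token` and `correct_query` are assumptions chosen for illustration.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, vocabulary: set[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary word within max_distance, else the token unchanged."""
    if token in vocabulary:
        return token
    best, best_dist = token, max_distance + 1
    for word in sorted(vocabulary):  # sorted for deterministic tie-breaking
        dist = levenshtein(token, word)
        if dist < best_dist:
            best, best_dist = word, dist
    return best


def correct_query(query: str, vocabulary: set[str], max_distance: int = 2) -> str:
    """Correct each whitespace-separated token independently (no context)."""
    return " ".join(correct_token(t, vocabulary, max_distance)
                    for t in query.lower().split())


# Hypothetical toy vocabulary; a real system would derive it from the corpus.
VOCAB = {"what", "are", "the", "symptoms", "of", "diabetes", "treatment"}
print(correct_query("what are the symtoms of diabetis", VOCAB))
```

A lower `max_distance` makes the corrector more conservative: fewer misspellings are repaired, but fewer valid rare terms are changed by mistake. The linear scan over the vocabulary is fine for a sketch but would be slow for a large medical lexicon.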

Executive Summary

This article summarises a controlled study of spelling correction as a preprocessing step for retrieval in healthcare question-answering systems. An error census over two public datasets of real consumer queries found that 61.5% of queries contain at least one spelling error, with a token-level error rate of 11.0%. The authors evaluated four correction methods using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages. Query correction improved retrieval substantially: edit distance and context-aware correction raised MRR by 9.2% and NDCG@10 by 8.3% over the uncorrected baseline, whereas correcting only the corpus yielded just +0.5% MRR. The study concludes that query-side correction is the key intervention and offers evidence-based recommendations for practitioners.
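For readers unfamiliar with the two reported metrics, here is a small sketch of how MRR and NDCG@10 are commonly computed. The linear-gain form of DCG is assumed; the paper's exact formulation is not given in the abstract, which also does not say whether the reported gains are relative or absolute. All function names are illustrative.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance: dict mapping doc_id to a graded gain; unjudged docs count as 0."""
    gains = [relevance.get(d, 0) for d in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

MRR rewards placing the first relevant passage near the top, while NDCG@10 also accounts for graded relevance across the top ten results.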

Key Points

  • 61.5% of real medical queries contain at least one spelling error; the token-level error rate is 11.0%
  • Query correction substantially improves retrieval performance, while corpus-only correction yields minimal improvement (+0.5% MRR)
  • Edit distance and context-aware correction improve MRR by 9.2% and NDCG@10 by 8.3% over the uncorrected baseline
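The query-level and token-level figures measure different things, which a short sketch can make explicit. The function below is an illustration only: `is_misspelled` stands in for whatever error detector the study used (not described in the abstract), and the example treats any out-of-vocabulary token as an error, which is an assumption.

```python
def error_census(queries, is_misspelled):
    """Return (fraction of queries with >= 1 error, fraction of tokens in error)."""
    total_tokens = error_tokens = error_queries = 0
    for query in queries:
        flags = [is_misspelled(t) for t in query.lower().split()]
        total_tokens += len(flags)
        error_tokens += sum(flags)
        error_queries += any(flags)
    return error_queries / len(queries), error_tokens / total_tokens


# Hypothetical toy data: out-of-vocabulary tokens are counted as errors.
known = {"what", "causes", "asthma", "is", "flu", "contagious"}
sample = ["what causes asma", "is flu contagious"]
query_rate, token_rate = error_census(sample, lambda t: t not in known)
```

Because a single bad token marks the whole query as erroneous, the query-level rate is always at least as high as the token-level rate, which is consistent with the reported 61.5% versus 11.0%.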

Merits

Comprehensive Evaluation

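The study compares four correction methods (conservative edit distance, standard Levenshtein edit distance, context-aware candidate ranking, and SymSpell) under three experimental conditions and two retrieval models (BM25 and TF-IDF cosine), giving practitioners a like-for-like comparison.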

Real-World Data

The study uses real consumer queries from the TREC 2017 LiveQA Medical track and HealthSearchQA rather than synthetic misspellings, making the findings more applicable to real-world healthcare query-answer systems.

Evidence-Based Recommendations

The study provides evidence-based recommendations for practitioners, highlighting the importance of query-side correction in improving retrieval performance.

Demerits

Limited Datasets

The study uses only two public datasets, one of which contains just 104 questions, and the other is drawn from Google autocomplete. These may not be representative of the broader population of healthcare queries.

Small Error Analysis

The analysis of correction outcomes covers only 100 samples, which may be too few to characterise how often each method miscorrects valid terms in real-world use.

Limited Context-Aware Correction Analysis

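The abstract reports the gains for context-aware correction jointly with those for edit distance (+9.2% MRR, +8.3% NDCG@10), so the separate contribution of context is unclear from the summary alone and may be an important area for further research.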

Expert Commentary

This study offers a controlled comparison of spelling-correction methods for healthcare query-answer systems and shows that query-side correction, rather than corpus-side correction, drives the retrieval gains. The findings are directly useful to teams building such systems. The main limitations are the small number of datasets and judged questions and the 100-sample size of the error analysis. Further research on larger and more varied query collections would show how well the results generalise.

Recommendations

  • Apply spelling correction to user queries before retrieval; in this study, correcting the corpus alone yielded only +0.5% MRR.
  • Evaluate several correction methods on queries from the target domain to identify the most effective approach for each use case.
  • Prioritize query-side correction when building or improving healthcare query-answer systems, and measure its effect with retrieval metrics such as MRR and NDCG@10.
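The reason query-side correction matters for lexical retrievers can be shown with a small BM25 sketch: a misspelled query term matches nothing in a correctly spelled corpus, so it contributes no score. The passages below are made-up toy strings, not MedQuAD data, and the `k1` and `b` values are common defaults rather than the study's settings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # unmatched (e.g. misspelled) terms add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy passages for illustration only.
passages = [
    "diabetes symptoms include increased thirst and frequent urination",
    "asthma is treated with inhaled corticosteroids",
    "migraine headaches may respond to triptans",
]
docs = [p.split() for p in passages]

raw = bm25_scores("diabetis symtoms".split(), docs)     # misspelled query
fixed = bm25_scores("diabetes symptoms".split(), docs)  # corrected query
```

In this toy setting the misspelled query scores zero against every passage, while the corrected query ranks the diabetes passage first. Correcting the corpus cannot help here, because the corpus terms were already spelled correctly.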

Sources

Original: arXiv - cs.CL