EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal

arXiv:2603.00028v1 Announce Type: new Abstract: Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages sent through the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70B-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (<30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better on EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering

Executive Summary

This article presents EPPCMinerBen, a novel benchmark for evaluating the performance of large language models (LLMs) in analyzing electronic patient-provider communication (EPPC) data. The benchmark consists of three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction, and is evaluated on 1,933 expert-annotated sentences from 752 secure messages sent through the patient portal at Yale New Haven Hospital. The results show that large, instruction-tuned models generally perform better on EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. The authors argue that EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. The findings have significant implications for the development and evaluation of LLMs in healthcare and highlight the importance of fine-grained reasoning in EPPC analysis.
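The benchmark compares models under zero-shot and few-shot prompting, though the prompts themselves are not reproduced in the abstract. A minimal sketch of how such a classification prompt could be assembled; the label names and example sentences below are invented placeholders, not the paper's actual codebook:

```python
def build_prompt(sentence, labels, examples=()):
    """Assemble a zero-shot (no examples) or few-shot classification
    prompt for a single patient-portal sentence.

    labels:   candidate code names (placeholders here, not the paper's).
    examples: optional (sentence, gold label) demonstration pairs.
    """
    parts = [
        "Classify the communicative intent of the patient-portal sentence "
        "using exactly one of these labels: " + ", ".join(labels) + "."
    ]
    # Few-shot demonstrations precede the query sentence.
    for ex_sentence, ex_label in examples:
        parts.append(f"Sentence: {ex_sentence}\nLabel: {ex_label}")
    # The model is expected to complete the text after the final "Label:".
    parts.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(parts)
```

Dropping the `examples` argument yields the zero-shot variant of the same prompt, which is what makes the two settings directly comparable.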

Key Points

  • EPPCMinerBen is a novel benchmark for evaluating LLMs in EPPC analysis.
  • The benchmark consists of three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction.
  • Large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction.
  • Smaller models struggle with fine-grained reasoning in EPPC analysis.

Merits

Strength in EPPC analysis

EPPCMinerBen provides a comprehensive benchmark for evaluating LLMs in EPPC analysis, covering three critical sub-tasks. The benchmark's design enables the evaluation of LLMs in a real-world healthcare scenario, providing valuable insights for the development and improvement of these models.

Fine-grained reasoning evaluation

EPPCMinerBen evaluates LLMs' ability to perform fine-grained reasoning, which is essential for accurate EPPC analysis in healthcare. This evaluation provides a critical assessment of LLMs' capabilities in handling complex EPPC data.
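Evidence extraction is reported as F1 (e.g., Llama-3.1-70B's 82.84%). The abstract does not specify the matching scheme, but extracted spans are commonly scored with SQuAD-style token-overlap F1 against the annotated evidence; a minimal sketch under that assumption:

```python
from collections import Counter

def evidence_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between an extracted evidence span and the
    annotated reference span (SQuAD-style; an assumed scoring scheme,
    not necessarily the paper's exact protocol)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Edge case: two empty spans match; one empty span scores zero.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Per-example scores of this kind are then averaged over the evaluation set to give the corpus-level figure.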

Demerits

Limited dataset size

The benchmark draws on a relatively small dataset: 1,933 expert-annotated sentences from 752 secure messages, all from a single institution. A dataset of this size and provenance may not be representative of broader EPPC data, and results on it may not generalize to other healthcare settings.

Dependence on large models

The results show that large, instruction-tuned models perform best on EPPCMinerBen tasks. This reliance on large models may limit the practical deployment of EPPC analysis tools in clinical settings with constrained computational resources, where smaller models would be preferable.

Expert Commentary

The article presents a timely and important contribution to the field of EPPC analysis, highlighting the potential of large language models in this area. However, the results also underscore the need for careful evaluation and fine-tuning of these models to ensure accurate and reliable performance in real-world healthcare scenarios. As the field continues to evolve, it is essential to develop and evaluate EPPC analysis tools that can accurately capture the nuances of patient-provider communication. EPPCMinerBen provides a valuable benchmark for this purpose, and its results have significant implications for the development and deployment of LLMs in healthcare.

Recommendations

  • Future research should focus on developing and evaluating EPPC analysis tools that can accurately capture the nuances of patient-provider communication.
  • Policymakers and healthcare administrators should prioritize the development and implementation of accurate and reliable EPPC analysis tools, which can improve patient outcomes and enhance healthcare delivery.
