
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree · February 26, 2026

#cs.CL #cs.AI

arXiv:2602.21165v1

Abstract: Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.

Executive Summary

The article introduces PVminer, a domain-specific NLP framework designed to detect the patient voice (PV) in patient-generated data. PVminer integrates patient-specific BERT encoders, unsupervised topic modeling, and fine-tuned classifiers to achieve high performance in multi-label, multi-class prediction tasks. The framework outperforms existing biomedical and clinical pre-trained baselines, demonstrating the effectiveness of incorporating patient-specific context and thematic augmentation. The study highlights the importance of understanding patient communication behaviors and social determinants of health (SDoH) through scalable and accurate NLP tools.

Key Points

▸ PVminer is a domain-adapted NLP framework for detecting patient voice in secure patient-provider communication.
▸ The framework integrates patient-specific BERT encoders and unsupervised topic modeling for thematic augmentation.
▸ PVminer reports F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo), outperforming biomedical and clinical pre-trained baselines.
▸ Ablation studies show that author identity and topic-based augmentation contribute to performance gains.
▸ Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request.
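
The abstract says that author identity and topic representations enrich the model input, but it does not specify the input format. The following is a minimal sketch of one way such augmentation could look, assuming a plain-text prefix scheme; the function name `build_augmented_input`, the role tag, and the `[SEP]` separator are illustrative assumptions, not PVminer's actual implementation.

```python
from typing import Sequence


def build_augmented_input(
    message: str,
    author_role: str,
    topic_keywords: Sequence[str],
    sep_token: str = "[SEP]",
) -> str:
    """Prefix a message with an author tag and topic keywords.

    Hypothetical format: the paper summary does not describe how
    PVminer encodes author identity or topics.
    """
    role_tag = f"[{author_role.upper()}]"
    topic_part = " ".join(topic_keywords)
    # Skip the topic part when no keywords are supplied (ablation-style input).
    prefix = " ".join(part for part in (role_tag, topic_part) if part)
    return f"{prefix} {sep_token} {message.strip()}"


example = build_augmented_input(
    "I can't afford the refill this month.",
    author_role="patient",
    topic_keywords=["medication", "cost"],
)
print(example)
# prints: [PATIENT] medication cost [SEP] I can't afford the refill this month.
```

Passing an empty keyword list yields an input without topic information, which is roughly the kind of comparison an ablation over topic-based augmentation would make.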

Merits

Innovative Approach

PVminer treats patient-centered communication and SDoH detection as a single multi-label, multi-class task rather than as separate problems, and it combines patient-specific BERT encoders with unsupervised topic modeling to enrich the model's inputs.

High Performance

The framework reports F1 scores of 82.25% at the Code level, 80.14% at the Subcode level, and up to 77.87% at the Combo level, ahead of the biomedical and clinical pre-trained baselines it was compared against.
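
For readers less familiar with multi-label evaluation, the sketch below shows how an F1 score can be computed when each message may carry several labels. It assumes micro-averaging; the summary does not state which averaging scheme the authors used, and the label names here are placeholders rather than PVminer codes.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-message label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of messages")
    # Count label-level hits and misses across all messages.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder labels "A", "B", "C" stand in for real annotation codes.
gold_labels = [{"A", "B"}, {"C"}, {"A"}]
pred_labels = [{"A"}, {"C", "B"}, {"A"}]

score = micro_f1(gold_labels, pred_labels)
print(f"micro-F1 = {score:.2f}")
# prints: micro-F1 = 0.75
```

In this toy case there are three correct labels, one spurious label, and one missed label, so F1 = 2·3 / (2·3 + 1 + 1) = 0.75.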

Open-Source Contribution

The decision to release pre-trained models, source code, and documentation publicly fosters collaboration and further research in the field of patient-centered communication.

Demerits

Data Availability

While the article mentions that annotated datasets will be available upon request, the lack of immediate access to these datasets may limit the reproducibility and immediate adoption of the framework by other researchers.

Generalizability

The study's focus on secure patient-provider communication may limit the generalizability of PVminer to other types of patient-generated data, such as social media posts or public forums.

Computational Resources

The use of large BERT models and topic modeling techniques may require significant computational resources, which could be a barrier for smaller research teams or institutions with limited resources.
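
A rough back-of-the-envelope estimate gives a sense of scale. The sketch assumes PV-BERT-base and PV-BERT-large have parameter counts similar to the original BERT-base (about 110M) and BERT-large (about 340M), which the summary does not confirm, and it counts only weights, gradients, and Adam optimizer state in 32-bit floats. Activation memory, which depends on batch size and sequence length, is excluded.

```python
# Approximate parameter counts of the original BERT models (assumed to
# carry over to the PV-BERT variants; not confirmed by the summary).
PARAM_COUNTS = {"BERT-base": 110_000_000, "BERT-large": 340_000_000}

GIB = 2 ** 30


def weight_memory_gib(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed to hold the parameters alone, in GiB (fp32 by default)."""
    return num_params * bytes_per_param / GIB


def adam_training_memory_gib(num_params: int) -> float:
    """Weights + gradients + two Adam moment buffers, all fp32 (16 bytes/param)."""
    return weight_memory_gib(num_params, bytes_per_param=16)


for name, count in PARAM_COUNTS.items():
    print(
        f"{name}: weights ~{weight_memory_gib(count):.2f} GiB, "
        f"Adam fine-tuning state ~{adam_training_memory_gib(count):.2f} GiB"
    )
```

Under these assumptions inference fits comfortably on commodity hardware, while full fine-tuning of the large model is where memory pressure starts to matter.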

Expert Commentary

The article presents a useful advance in healthcare NLP, particularly in detecting patient voice and social determinants of health within a single framework. Combining patient-specific BERT encoders with topic modeling is a reasonable way to capture the nuances of patient communication, and the reported F1 scores suggest PVminer could support patient-centered communication research at scale. However, the study's focus on secure patient-provider communication may limit its generalizability to other types of patient-generated data, and the computational resources required may pose a challenge for smaller research teams. Despite these limitations, the planned public release of pre-trained models and source code is commendable and should support further research and collaboration in this area.

Recommendations

✓ Future research should explore the generalizability of PVminer to other types of patient-generated data, such as social media posts and public forums, to ensure its applicability across diverse healthcare settings.
✓ Efforts should be made to optimize the computational efficiency of PVminer to make it more accessible to smaller research teams and institutions with limited resources.

Sources

arXiv - cs.CL

