PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry · March 24, 2026

#cs.CL

arXiv:2603.20494v1

Abstract: The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.

Executive Summary

The paper introduces PARHAF, a large open-source corpus of French clinical documents, built to work around the privacy restrictions that prevent real medical records from being shared. Medical residents wrote and peer-reviewed clinical reports for fictitious patient cases, following predefined scenarios and document templates informed by epidemiological guidance from the French National Health Data System (SNDS) to ensure broad clinical coverage. The corpus comprises 7,394 reports covering 5,009 patient cases and is intended for training and evaluating French clinical language models without exposing patient data. The authors present the methodology as replicable in other languages and health systems.

Key Points

▸ PARHAF is a large, open-source corpus of clinical documents in French, built to sidestep the restrictions on sharing real medical records.
▸ The corpus comprises 7,394 reports covering 5,009 patient cases across a wide range of medical and surgical specialties.
▸ PARHAF is designed to train and evaluate French clinical language models while preserving patient privacy.
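
The headline counts can be turned into rough per-case and per-specialty averages. The sketch below is only illustrative arithmetic on the figures quoted in the abstract; the helper name `mean_ratio` is hypothetical, and the abstract does not say how reports or authors are actually distributed, so these are averages and nothing more.

```python
# Figures quoted in the abstract.
TOTAL_REPORTS = 7394
TOTAL_CASES = 5009
RESIDENT_AUTHORS = 104
SPECIALTIES = 18


def mean_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Averages only: the per-case and per-specialty distributions are not given.
reports_per_case = mean_ratio(TOTAL_REPORTS, TOTAL_CASES)
residents_per_specialty = mean_ratio(RESIDENT_AUTHORS, SPECIALTIES)

print(f"Mean reports per patient case: {reports_per_case:.2f}")
print(f"Mean resident authors per specialty: {residents_per_specialty:.2f}")
```

An average above one report per case suggests that some patient cases are documented by more than one report, although the abstract does not describe the breakdown.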

Merits

Strength in Addressing Data Sharing Restrictions

Because every patient case is fictitious, PARHAF is anonymous by design and can be shared freely. This lets clinical NLP systems be developed and evaluated without access to real, privacy-restricted medical records.

Replicable Methodology

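The corpus was built with a structured protocol: predefined clinical scenarios, document templates, epidemiological guidance, and peer review among the authoring residents. The same protocol could be reused to build shareable clinical corpora in other languages and health systems.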

Demerits

Limited Generalizability to Non-French Contexts

The corpus is written in French and its coverage is guided by French national health data, so models trained on it may not transfer directly to other languages or health systems.

Potential for Biased or Incomplete Representations

Fictitious cases written from predefined scenarios and templates may not capture the full variability, noise, and rare presentations of real clinical records, which could affect the accuracy and reliability of models trained only on PARHAF.

Expert Commentary

PARHAF addresses a long-standing obstacle in clinical NLP: the lack of shareable clinical text. A large, openly licensed corpus of French clinical documents lets researchers and developers train and evaluate French clinical language models without handling real patient data, and the protocol behind it offers a template for similar efforts elsewhere. Its limitations should still be kept in mind, in particular its focus on the French context and the possibility that fictitious cases represent real-world practice only partially.

Recommendations

✓ Future research should focus on adapting the PARHAF methodology to other languages and health systems, increasing the corpus's generalizability and applicability.
✓ Developers and researchers should consider incorporating diverse linguistic and cultural perspectives to mitigate potential biases in the fictitious patient cases.

Sources

Original: arXiv - cs.CL
