PARHAF, a human-authored corpus of clinical reports for fictitious patients in French
arXiv:2603.20494v1. Abstract: The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
Executive Summary
The paper introduces PARHAF, a large open-source corpus of French clinical documents built to work around the privacy regulations that restrict sharing of real medical records. Medical residents wrote and peer-reviewed reports for realistic but entirely fictitious patients, following a structured protocol that used epidemiological guidance from the French National Health Data System (SNDS) to ensure broad clinical coverage. The corpus comprises 7,394 reports covering 5,009 patient cases and can be used to train and evaluate French clinical language models without exposing any real patient data. The authors present the methodology as replicable in other languages and health systems.
Key Points
- PARHAF is a large, open-source corpus of clinical documents in French, created because privacy regulations restrict the sharing of real medical records.
- The corpus comprises 7,394 reports covering 5,009 patient cases across a wide range of medical and surgical specialties.
- PARHAF is designed for training and evaluating French clinical language models while preserving patient privacy.
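The counts quoted above imply a few simple ratios. The short Python sketch below derives them from the figures stated in the abstract (7,394 reports, 5,009 cases, 104 residents). The variable names are illustrative, and the per-resident average assumes the writing was spread evenly across residents, which the abstract does not state.

```python
# Counts as stated in the abstract.
N_REPORTS = 7394
N_CASES = 5009
N_RESIDENTS = 104

# Average number of reports written per fictitious patient case.
reports_per_case = N_REPORTS / N_CASES

# Average per resident; assumes an even split, which the source does not confirm.
reports_per_resident = N_REPORTS / N_RESIDENTS

print(f"Reports per patient case: {reports_per_case:.2f}")
print(f"Reports per resident (if evenly split): {reports_per_resident:.1f}")
```

A ratio above 1 means that at least some patient cases are documented by more than one report; the abstract does not give the distribution of reports per case.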
Merits
Strength in Addressing Data Sharing Restrictions
Because every patient case is fictitious, the corpus is anonymous and freely shareable by design, so clinical NLP systems can be developed without access to real medical records.
Replicable Methodology
The structured protocol (predefined clinical scenarios, document templates, peer review, and epidemiological guidance) could be reapplied to build shareable clinical corpora in other languages and health systems.
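One way to picture how epidemiological guidance could shape such a corpus is proportional quota allocation: given how often each category appears in reference data, decide how many fictitious cases to write per category. The Python sketch below is only an illustration of that idea using largest-remainder rounding. The function name `allocate_quotas`, the category labels, and the weights are all hypothetical, and nothing here should be read as the authors' actual procedure or as SNDS figures.

```python
def allocate_quotas(weights: dict[str, int], total_cases: int) -> dict[str, int]:
    """Split total_cases across categories in proportion to integer weights,
    using largest-remainder rounding so the quotas sum exactly to total_cases."""
    total_weight = sum(weights.values())
    quotas: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for name, weight in weights.items():
        # Integer division keeps the arithmetic exact (no floating-point drift).
        quotas[name], remainders[name] = divmod(total_cases * weight, total_weight)

    # Give the leftover cases to the categories with the largest remainders.
    leftover = total_cases - sum(quotas.values())
    ranked = sorted(remainders, key=lambda name: remainders[name], reverse=True)
    for name in ranked[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical weights, invented for illustration only (not SNDS data).
example_weights = {"specialty_a": 3, "specialty_b": 2, "specialty_c": 2}
quotas = allocate_quotas(example_weights, total_cases=10)
print(quotas)
```

Largest-remainder rounding is chosen here because it guarantees that the per-category quotas add up to the requested total, which plain rounding does not.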
Demerits
Limited Generalizability to Non-French Contexts
The corpus is written in French and guided by French national health data, so models trained on it may not transfer directly to other languages or health systems.
Potential for Biased or Incomplete Representations
The use of fictitious patient cases may lead to biased or incomplete representations of real-world clinical scenarios, potentially affecting the accuracy and reliability of models trained on PARHAF.
Expert Commentary
PARHAF addresses a long-standing obstacle in clinical NLP: the lack of shareable data for training and evaluation. Its expert-authored, fictitious reports let researchers and developers work on French clinical language models without handling real patient records, and its protocol offers a template for the broader clinical NLP community. These benefits should be weighed against the limitations noted above: the corpus is specific to French, and fictitious cases may not capture the full variability of real clinical documentation.
Recommendations
- Future research should adapt the PARHAF methodology to other languages and health systems, extending the approach beyond French.
- Developers and researchers should consider incorporating diverse linguistic and cultural perspectives to mitigate potential biases in the fictitious patient cases.
Sources
Original: arXiv - cs.CL