MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

arXiv:2603.06905v1 Announce Type: new

Abstract: Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.

Executive Summary

This article introduces MedInjection-FR, a large-scale French biomedical instruction dataset of 571K instruction-response pairs drawn from native, synthetic, and translated sources. The authors investigate how data provenance affects instruction tuning for large language models in the medical domain, fine-tuning Qwen-4B-Instruct across seven configurations that combine the three sources. Results show that native data yield the strongest performance, while mixed setups, particularly native plus translated data, provide complementary benefits. Synthetic data alone are less effective but contribute positively when balanced with native supervision. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and suggest that heterogeneous supervision can mitigate the scarcity of native French medical instructions. The evaluation framework combines automatic metrics, LLM-as-a-judge assessment, and human expert review to assess the instruction-tuned models.
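The abstract does not enumerate the seven fine-tuning configurations, but three data sources naturally yield 2^3 - 1 = 7 non-empty combinations, which matches the reported count. A minimal sketch of that enumeration, assuming this is indeed the experimental design:

```python
from itertools import combinations

# The three data sources named in the paper.
SOURCES = ["native", "synthetic", "translated"]

# Every non-empty subset of the sources: 2^3 - 1 = 7 configurations.
configs = [
    subset
    for r in range(1, len(SOURCES) + 1)
    for subset in combinations(SOURCES, r)
]

for c in configs:
    print("+".join(c))
```

Running this prints the seven mixtures, from single-source runs (`native`, `synthetic`, `translated`) up to the full three-way mix, mirroring how the paper contrasts single-source against mixed supervision.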

Key Points

  • MedInjection-FR is a large-scale French biomedical instruction dataset comprising native, synthetic, and translated data.
  • Native data yield the strongest performance in instruction tuning for large language models.
  • Mixed setups, particularly native and translated data, provide complementary benefits in instruction tuning.
  • Synthetic data alone is less effective but contributes positively when balanced with native supervision.

Merits

Strength in Data Diversity

The study's inclusion of native, synthetic, and translated data provides a comprehensive understanding of the role of data provenance in instruction tuning.

Robust Evaluation Framework

The authors' combination of automatic metrics, LLM-as-a-judge assessment, and human expert review ensures a robust evaluation of the instruction-tuned models, with LLM-based judgments correlating best with human ratings.
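The abstract reports that LLM-based judgments correlate best with human ratings. As an illustration only (the paper's actual metric and data are not given here), agreement between judge scores and human ratings is commonly measured with Spearman rank correlation, sketched below using only the standard library and hypothetical scores:

```python
def ranks(xs):
    """Rank values from 1..n; ties receive the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical example: human expert ratings vs. LLM-judge scores
# for five model answers (these numbers are invented for illustration).
human = [4, 2, 5, 3, 1]
judge = [3.8, 2.1, 4.9, 3.2, 1.5]
print(round(spearman(human, judge), 3))  # -> 1.0 (judge ranks answers identically)
```

Because Spearman's rho compares rankings rather than raw scores, it tolerates scale differences between judges; note, however, that rank correlation alone cannot detect the verbosity bias the abstract attributes to LLM judges.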

Demerits

Limited Generalizability

The study's focus on the medical domain in French may limit the generalizability of its findings to other languages and domains.

Synthetic Data Limitations

The study's findings suggest that synthetic data alone is less effective for instruction tuning, warranting caution in applications that rely solely on synthetically generated instructions.

Expert Commentary

The study makes a significant contribution to instruction tuning research by systematically characterizing the role of data provenance. The large-scale dataset and robust evaluation framework support the validity of the findings, though the reported sensitivity of LLM-based judgments to verbosity deserves attention wherever such judges are reused. The focus on French medical data may limit generalizability to other languages and domains. Nevertheless, the findings have important implications for developing instruction-tuned models in specialized domains and may inform policy decisions regarding data collection and curation in AI development.

Recommendations

  • Future studies should investigate the generalizability of the study's findings to other languages and domains.
  • Researchers should explore the use of synthetic data in conjunction with other data sources to improve instruction tuning performance.
