Academic

FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment

arXiv:2603.19539v1 Announce Type: new Abstract: We introduce an expert-curated, real-world benchmark for evaluating document-grounded question answering (QA), motivated by generic drug assessment and built on U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, constructing a multi-stage pipeline that generates high-quality, expert-curated QA examples spanning factual, multi-hop, and refusal tasks, and designing evaluation protocols that assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, the benchmark also provides a foundation for challenging, regulatory-grade evaluation of label comprehension, and is designed to support evaluation of LLM behavior on drug-label questions.

Executive Summary

The article introduces FDARxBench, a real-world benchmark for evaluating document-grounded question answering on FDA drug labels, motivated by generic drug assessment. Collaborating with FDA regulatory assessors, the authors construct a multi-stage pipeline that generates high-quality QA examples spanning factual, multi-hop, and refusal tasks. Experiments reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior across both proprietary and open-weight models. The benchmark aims to support regulatory-grade evaluation of LLM label comprehension, though further development and refinement are needed to ensure its practical application.

Key Points

  • FDARxBench is a real-world benchmark for evaluating document-grounded question answering on FDA drug labels, motivated by generic drug assessment
  • The benchmark is constructed through a multi-stage pipeline with expert curation and collaboration with FDA regulatory assessors
  • Experiments reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior among proprietary and open-weight models
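The "safe refusal" task type asks whether a model abstains when the label does not support an answer. As an illustration only, a scorer for this behavior might look like the following minimal sketch; the marker list, helper names, and metric are hypothetical, not the paper's actual protocol:

```python
# Hypothetical, keyword-based refusal scorer (illustrative only).
REFUSAL_MARKERS = (
    "not specified in the label",
    "cannot be determined",
    "the label does not",
    "i don't know",
)


def is_refusal(answer: str) -> bool:
    """Heuristically flag a model answer as a refusal/abstention."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_accuracy(items: list[tuple[str, bool]]) -> float:
    """Fraction of items whose refusal behavior matches the gold expectation.

    Each item is (model_answer, should_refuse): the model should refuse
    exactly when the grounding label lacks the requested information.
    """
    correct = sum(is_refusal(ans) == should for ans, should in items)
    return correct / len(items)
```

A real benchmark would likely use an LLM judge or annotator agreement rather than keyword matching, but the sketch shows the shape of the metric: refusal is scored against a per-question gold expectation, not rewarded unconditionally.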

Merits

Strength in Expert Curation

The collaboration with FDA regulatory assessors ensures the benchmark's high quality and relevance to real-world regulatory applications.

Comprehensive Evaluation Protocols

The open-book and closed-book reasoning evaluation protocols provide a thorough assessment of LLM behavior on drug-label questions.

Demerits

Limited Generalizability

The benchmark's focus on FDA generic drug assessment may limit its generalizability to other regulatory domains or industries.

Technical Challenges

The multi-stage pipeline's complexity may pose technical challenges for model development and evaluation.

Expert Commentary

The introduction of FDARxBench marks a significant step toward regulatory-grade label-comprehension systems for FDA generic drug assessment. The collaboration with FDA regulatory assessors is a notable strength, ensuring the benchmark's high quality and relevance to real-world applications, and the paired open-book and closed-book evaluation protocols provide a thorough assessment of LLM behavior on drug-label questions. Nevertheless, the benchmark's limited generalizability to other regulatory domains and the complexity of its multi-stage pipeline must be addressed, and further refinement will be required to ensure practical application and broader impact.

Recommendations

  • Further development and refinement of the benchmark to enhance its generalizability and practical application.
  • Investigation into the use of alternative evaluation protocols and metrics to assess LLM behavior on drug-label questions.

Sources

Original: arXiv - cs.CL