
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

arXiv:2602.23603v1

Abstract: Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.

Executive Summary

The article presents LFQA-HP-1M, a human preference dataset for long-form question answering comprising 1.3M pairwise preference annotations. The authors propose nine rubrics for answer quality and show that simple linear models over these rubric features perform comparably to state-of-the-art LLM evaluators. They also examine transitivity consistency, positional bias, and verbosity bias in LLM evaluators, and demonstrate that these evaluators are vulnerable to adversarial perturbations. The result is one of the largest public LFQA preference resources and a rubric-driven framework for transparent, reliable evaluation, supporting the development of more accurate and robust LFQA systems.

Key Points

  • LFQA-HP-1M is a large-scale human preference dataset for long-form question answering
  • The dataset comprises 1.3M human pairwise preference annotations
  • Simple linear models based on proposed rubrics perform comparably to state-of-the-art LLM evaluators
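To make the rubric-based approach concrete, the sketch below fits a simple Bradley-Terry-style logistic model on differences of rubric scores to predict pairwise preferences. The rubric names, the model formulation, and all data are illustrative assumptions, not the paper's actual setup (the abstract does not enumerate the nine rubrics).

```python
import math
import random

# Hypothetical rubric dimensions (illustrative only; the paper's nine
# rubrics are not enumerated in the abstract).
RUBRICS = ["coherence", "relevance", "factuality", "completeness",
           "conciseness", "clarity", "grounding", "structure", "helpfulness"]

def predict(w, x):
    """Probability that answer A is preferred over answer B,
    where x = rubric_scores(A) - rubric_scores(B)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, lr=0.5, epochs=200):
    """Fit a logistic (Bradley-Terry-style) model on score differences
    with plain stochastic gradient descent."""
    w = [0.0] * len(RUBRICS)
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            p = predict(w, x)
            for i in range(len(w)):
                w[i] += lr * (y - p) * x[i]
    return w

# Synthetic demo: a ground-truth judge that weights some rubrics
# (e.g. factuality) more heavily than others.
random.seed(0)
true_w = [1.5, 1.0, 2.0, 0.5, 0.2, 0.8, 1.2, 0.3, 1.0]
pairs, labels = [], []
for _ in range(500):
    diff = [random.uniform(-1, 1) for _ in RUBRICS]
    pairs.append(diff)
    labels.append(1 if sum(w * d for w, d in zip(true_w, diff)) > 0 else 0)

w = train(pairs, labels)
acc = sum((predict(w, x) > 0.5) == bool(y)
          for x, y in zip(pairs, labels)) / len(pairs)
```

Because the synthetic labels are a noise-free linear function of the rubric-score differences, the fitted model recovers them almost perfectly; the appeal of such a model is that its learned weights are directly interpretable as per-rubric importances.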

Merits

Strength in Comprehensive Evaluation

The proposed rubrics provide a comprehensive framework for evaluating answer quality in long-form question answering, covering various aspects such as coherence, relevance, and factuality.

Significant Contribution to the Field

The LFQA-HP-1M dataset is one of the largest public datasets for long-form question answering, providing a valuable resource for researchers and developers in the field.

Demerits

Limitation in Generalizability

The study focuses on a specific task and dataset, and it is unclear whether the proposed rubrics and models can be generalized to other tasks and domains.

Vulnerability to Adversarial Perturbations

The study highlights the vulnerability of LLM evaluators to adversarial perturbations, which may limit their practical applications in real-world scenarios.
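The biases examined in the paper can be probed with simple black-box checks. The sketch below tests a judge for position consistency (does swapping answer order flip the verdict?) and transitivity (do pairwise verdicts avoid cycles?); the `biased_judge` is a deliberately flawed toy, and none of this reflects the paper's actual protocol.

```python
from itertools import permutations

def position_consistent(judge, q, a, b):
    """A judge is position-consistent on (a, b) if swapping the order
    of the two answers flips its verdict accordingly."""
    first = judge(q, a, b)    # True -> the first-listed answer is preferred
    second = judge(q, b, a)
    return first != second

def transitive(judge, q, answers):
    """Check that pairwise verdicts over all answer triples induce no
    cycle (a > b, b > c, yet c > a)."""
    for a, b, c in permutations(answers, 3):
        if judge(q, a, b) and judge(q, b, c) and judge(q, c, a):
            return False
    return True

# Toy judge with a verbosity bias: it prefers the longer answer, and a
# positional bias: the first-listed answer wins ties within 2 characters.
def biased_judge(q, first, second):
    return len(first) + 2 >= len(second)

q = "Why is the sky blue?"
a, b = "It is blue.", "Scattering."   # equal length: the tie zone applies
print(position_consistent(biased_judge, q, a, b))  # → False (inconsistent)
```

Note that the same tie zone also breaks transitivity for answers of nearly equal length, so one underlying flaw surfaces in both diagnostics.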

Expert Commentary

The study makes a significant contribution to long-form question answering, pairing a large-scale human preference dataset with a rubric-driven framework for evaluating answer quality. That simple linear models over the nine rubric features match state-of-the-art LLM evaluators is a notable finding, suggesting that transparent, interpretable evaluators need not sacrifice accuracy. At the same time, the demonstrated vulnerability of LLM evaluators to adversarial perturbations is a critical limitation that future work must address. Overall, the dataset and framework are a valuable resource for researchers and developers, with direct implications for how AI systems are evaluated before deployment.

Recommendations

  • Future research should focus on developing more robust LLM evaluators that resist adversarial perturbations and remain accurate and reliable across domains.
  • The proposed rubrics and models should be further validated and evaluated on various tasks and datasets to ensure their generalizability and effectiveness.
