ABCD: All Biases Come Disguised

Mateusz Nowak, Xavier Cadet, Peter Chin

arXiv:2602.17445v1. Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model uses the answer position, the label in front of the answer, the distribution of correct answers in the few-shot prompt, or a combination of all three to answer each MCQ. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to produce the whole answer text. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in mean model performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than standard evaluation protocols.

Executive Summary

The article 'ABCD: All Biases Come Disguised' investigates the presence of biases in large language models (LLMs) when evaluated using multiple-choice question (MCQ) benchmarks. The authors identify several biases, including label-position bias, label bias, and few-shot-prompt bias, which can significantly affect the models' performance. They propose a bias-reduced evaluation protocol that replaces labels with uniform, unordered labels and prompts the LLM to use the entire answer. This protocol demonstrates improved robustness and lower standard deviation between different permutations of answers with minimal performance drop. The study highlights the importance of reducing evaluation artifacts to better assess LLMs' capabilities.
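The protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine similarity below stands in for the sentence-embedding model the authors actually use, and the prompt wording is invented for the example.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors -- a simple stand-in
    for the sentence-embedding similarity model used in the paper."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def format_question(question: str, options: list[str]) -> str:
    """Present every option with the same neutral marker instead of
    'A.'/'B.'/'C.' labels, so neither the label nor a distinctive
    position can identify an answer."""
    lines = [question] + [f"- {opt}" for opt in options]
    lines.append("Answer with the full text of one option.")
    return "\n".join(lines)

def score_response(response: str, options: list[str], correct: str) -> bool:
    """Map the model's free-form answer back to the most similar option."""
    best = max(options, key=lambda opt: cosine_sim(response, opt))
    return best == correct
```

For example, `score_response("It is the nucleus of the cell", ["the mitochondria", "the nucleus", "the ribosome"], "the nucleus")` maps the free-form answer back to the correct option even though no label was ever emitted.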

Key Points

  • Identification of various biases in LLMs when evaluated with MCQ benchmarks.
  • Proposal of a bias-reduced evaluation protocol to mitigate these biases.
  • Demonstration of improved robustness and lower variance in model performance.
  • Minimal decrease in mean model performance with the new protocol.
  • Ablation studies showing the robustness of the method across different embedding models and similarity functions.
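The robustness claim in the key points is measured by re-evaluating the same questions under every ordering of the answer options and looking at the spread of accuracies. A toy sketch of that metric, under the assumption that `model` is any callable mapping a question and its (reordered) options to a chosen option:

```python
from itertools import permutations
from statistics import mean, pvariance

def accuracy_per_permutation(model, questions):
    """One accuracy score per ordering of the answer options.
    `model(text, options) -> chosen option` is a hypothetical interface;
    every question is assumed to have the same number of options."""
    n = len(questions[0]["options"])
    scores = []
    for order in permutations(range(n)):
        correct = sum(
            model(q["text"], [q["options"][i] for i in order]) == q["answer"]
            for q in questions
        )
        scores.append(correct / len(questions))
    return scores

# A maximally position-biased model: it always picks the first option.
first_picker = lambda text, opts: opts[0]
qs = [
    {"text": "2+2?", "options": ["4", "5", "6"], "answer": "4"},
    {"text": "3+3?", "options": ["6", "7", "8"], "answer": "6"},
]
scores = accuracy_per_permutation(first_picker, qs)
print(f"mean={mean(scores):.3f}, variance={pvariance(scores):.3f}")
```

The position-biased model scores 1.0 whenever the correct answer happens to come first and 0.0 otherwise, so its accuracy variance across permutations is large; the paper's protocol is judged by how much it shrinks exactly this variance.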

Merits

Comprehensive Analysis

The article provides a thorough analysis of biases in LLMs, identifying specific types of biases that can affect model performance. This comprehensive approach is crucial for understanding and addressing the limitations of current evaluation methods.

Practical Solution

The proposed bias-reduced evaluation protocol offers a practical solution to mitigate biases in LLM evaluations. The protocol is simple yet effective, demonstrating significant improvements in robustness and performance consistency.

Robust Methodology

The study employs a robust methodology, including ablation studies on various embedding models and similarity functions, to validate the effectiveness of the proposed protocol. This rigorous approach enhances the credibility of the findings.
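An ablation over similarity functions of the kind mentioned here can be pictured as swapping the scoring rule used to map a response embedding back to an option. The functions and toy-vector interface below are illustrative, not the paper's actual candidates:

```python
import math

def dot(u, v):
    """Unnormalised dot product."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalised by vector lengths."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def neg_l2(u, v):
    """Negative Euclidean distance, so larger means more similar."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

SIMILARITIES = {"cosine": cosine, "dot": dot, "neg_l2": neg_l2}

def pick_option(response_vec, option_vecs, sim):
    """Index of the option whose embedding best matches the response."""
    return max(range(len(option_vecs)), key=lambda i: sim(response_vec, option_vecs[i]))
```

Running the full evaluation once per entry in `SIMILARITIES` (and per embedding model) yields the kind of ablation grid the study uses to show its results are not an artifact of one particular scoring rule.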

Demerits

Limited Scope

The study focuses primarily on MCQ benchmarks, which may not fully capture the breadth of biases present in other types of evaluations or real-world applications. Expanding the scope to include diverse evaluation methods could provide a more holistic understanding of LLM biases.

Unexplored Performance Trade-off

While the protocol is reported to cause only a minimal performance drop, the extent of that drop and its implications for real-world applications are not thoroughly explored. Further investigation into the trade-off between bias reduction and performance is warranted.

Generalizability

The study primarily focuses on specific LLMs and benchmarks. The generalizability of the findings to other models and evaluation contexts remains to be seen. Additional studies across a broader range of LLMs and benchmarks would strengthen the conclusions.

Expert Commentary

The article 'ABCD: All Biases Come Disguised' makes a significant contribution to the evaluation of large language models by identifying and addressing biases in MCQ benchmarks. Pinpointing label-position, label, and few-shot-prompt biases is a critical step toward understanding the limitations of current evaluation methods, and the proposed bias-reduced protocol is a practical remedy that improves the robustness and consistency of measured model performance. The rigorous methodology, including ablation studies over embedding models and similarity functions, strengthens the credibility of the findings. The study's scope is nevertheless limited to MCQ benchmarks and a specific set of LLMs; extending it to other evaluation formats and a broader range of models would give a more comprehensive picture of LLM biases. The small accuracy drop observed under the new protocol also deserves closer analysis to clarify the trade-off between bias reduction and performance. Overall, the article offers valuable insights and a practical tool for improving LLM evaluation, contributing to the broader discussion of bias and robustness in machine learning.

Recommendations

  • Expand the scope of the study to include diverse evaluation methods and a broader range of LLMs to assess the generalizability of the findings.
  • Conduct further investigations into the trade-offs between bias reduction and performance to fully understand the implications of the proposed evaluation protocol.
