Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

arXiv:2603.21016v1 Announce Type: new Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
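The abstract's first ingredient is the permutation group: several variants of the same instance that differ only in option order. As a rough illustration (the paper's exact sampling scheme is not given in the abstract, so the sampling strategy and helper below are assumptions), one could take the identity ordering plus a few random orderings and re-map the gold answer for each:

```python
import itertools
import random

def permutation_group(question, options, gold_index, k=4, seed=0):
    """Build k permuted variants of one multiple-choice instance.

    Hypothetical helper: samples the identity ordering plus k-1 random
    distinct orderings, re-mapping the gold answer's position each time.
    """
    rng = random.Random(seed)
    identity = tuple(range(len(options)))
    others = [p for p in itertools.permutations(identity) if p != identity]
    rng.shuffle(others)
    group = []
    for perm in [identity] + others[: k - 1]:
        group.append({
            "question": question,
            # position j of the permuted list holds original option perm[j]
            "options": [options[i] for i in perm],
            "gold": perm.index(gold_index),  # new position of the answer
        })
    return group
```

A model free of selection bias should answer every variant in such a group with the same underlying option, which is exactly the property the two training mechanisms below reward.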

Executive Summary

The paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), a training-time method for mitigating selection bias in large language models (LLMs) used for multiple-choice and pairwise evaluation tasks. Rather than debiasing at inference time, which is costly and can harm reasoning, PA-GRPO constructs a permutation group for each training instance and optimizes the model with two complementary mechanisms: a cross-permutation advantage and a consistency-aware reward. Across seven benchmarks, PA-GRPO outperforms strong baselines, substantially reducing selection bias while maintaining high overall performance. By enforcing permutation-consistent semantic reasoning, PA-GRPO offers a promising alternative to existing debiasing approaches.

Key Points

  • PA-GRPO constructs a permutation group for each instance to mitigate selection bias
  • PA-GRPO uses two mechanisms: cross-permutation advantage and consistency-aware reward
  • Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks
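The cross-permutation advantage, per the abstract, is computed relative to the mean reward over all permutations of the same instance. A minimal sketch, assuming the standard GRPO-style normalization (mean-centering and dividing by the group standard deviation; the abstract does not state PA-GRPO's exact normalization):

```python
from statistics import mean, pstdev

def cross_permutation_advantages(rewards, eps=1e-8):
    """Advantages of one instance's permutations, relative to the group.

    `rewards` holds one scalar reward per sampled permutation of the
    same instance. Permutations that score above the group mean get a
    positive advantage; those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the mean over permutations of the *same* question, a permutation on which the model flips its answer is penalized relative to its siblings, which is what discourages position-dependent behavior.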

Merits

Strength in Addressing Selection Bias

PA-GRPO addresses selection bias at training time by enforcing permutation-consistent semantic reasoning, avoiding the per-query overhead that existing inference-time debiasing methods impose.
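Permutation consistency is also what the second mechanism, the consistency-aware reward, measures. The abstract gives no formula, so the following is only a plausible shaping term: map each predicted position back to the original option index and reward majority agreement across the group.

```python
from collections import Counter

def consistency_reward(decisions, perms):
    """Fraction of permutations agreeing on the same underlying option.

    Hypothetical shaping term. `decisions[i]` is the predicted position
    under permutation `perms[i]`, where position j of that permutation
    holds original option perms[i][j].
    """
    canonical = [perm[d] for d, perm in zip(decisions, perms)]
    majority = Counter(canonical).most_common(1)[0][1]
    return majority / len(canonical)
```

Under this sketch, a model that picks the same underlying option regardless of ordering earns 1.0, while a model that tracks positions rather than content earns less.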

Improved Performance

PA-GRPO maintains high overall performance while substantially reducing selection bias, making it a promising solution for real-world applications.

Flexibility and Scalability

Because PA-GRPO is a training-time objective built on GRPO, it can in principle be applied to any LLM trained with reinforcement learning, making it a scalable route to mitigating selection bias in large-scale applications.

Demerits

Computational Cost

PA-GRPO incurs additional training cost, since each instance expands into a group of permutations that must all be sampled and scored; this may limit its use in resource-constrained settings.

Hyperparameter Tuning

PA-GRPO requires careful hyperparameter tuning to achieve optimal performance, which can be time-consuming and may require significant expertise.

Expert Commentary

The proposed method, PA-GRPO, demonstrates a significant improvement over existing approaches to selection bias in LLMs. Combining permutation groups with a group-relative advantage and a consistency reward makes it a promising way to reduce selection bias while maintaining high overall performance. However, the added training cost and hyperparameter tuning requirements may be real limitations for resource-constrained settings. Nevertheless, PA-GRPO is a valuable contribution to the field, with clear relevance to real-world multiple-choice evaluation and LLM-as-judge pipelines.

Recommendations

  • Further research is needed to investigate the applicability of PA-GRPO to other types of LLMs and tasks.
  • The proposed method should be compared against a broader range of existing debiasing methods to establish its relative strengths and robustness.

Sources

Original: arXiv - cs.CL