Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

arXiv:2603.21016v1 Announce Type: new Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
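The abstract's first ingredient is the permutation group: several variants of the same instance that differ only in option order. As a rough illustration (the paper's exact sampling scheme is not given in the abstract, so the sampling strategy and helper below are assumptions), one could take the identity ordering plus a few random orderings and re-map the gold answer for each:

```python
import itertools
import random

def permutation_group(question, options, gold_index, k=4, seed=0):
    """Build k permuted variants of one multiple-choice instance.

    Hypothetical helper: samples the identity ordering plus k-1 random
    distinct orderings, re-mapping the gold answer's position each time.
    """
    rng = random.Random(seed)
    identity = tuple(range(len(options)))
    others = [p for p in itertools.permutations(identity) if p != identity]
    rng.shuffle(others)
    group = []
    for perm in [identity] + others[: k - 1]:
        group.append({
            "question": question,
            # position j of the permuted list holds original option perm[j]
            "options": [options[i] for i in perm],
            "gold": perm.index(gold_index),  # new position of the answer
        })
    return group
```

A model free of selection bias should answer every variant in such a group with the same underlying option, which is exactly the property the two training mechanisms below reward.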

Executive Summary

The paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), a training-time method for mitigating selection bias in large language models (LLMs) used for multiple-choice and pairwise evaluation tasks. Rather than debiasing at inference time, which is costly and can harm reasoning, PA-GRPO constructs a permutation group for each training instance and optimizes the model with two complementary mechanisms: a cross-permutation advantage and a consistency-aware reward. Across seven benchmarks, PA-GRPO outperforms strong baselines, substantially reducing selection bias while maintaining high overall performance. By enforcing permutation-consistent semantic reasoning, PA-GRPO offers a promising alternative to existing debiasing approaches.

Key Points

  • PA-GRPO constructs a permutation group for each instance to mitigate selection bias
  • PA-GRPO uses two mechanisms: cross-permutation advantage and consistency-aware reward
  • Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks
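The cross-permutation advantage, per the abstract, is computed relative to the mean reward over all permutations of the same instance. A minimal sketch, assuming the standard GRPO-style normalization (mean-centering and dividing by the group standard deviation; the abstract does not state PA-GRPO's exact normalization):

```python
from statistics import mean, pstdev

def cross_permutation_advantages(rewards, eps=1e-8):
    """Advantages of one instance's permutations, relative to the group.

    `rewards` holds one scalar reward per sampled permutation of the
    same instance. Permutations that score above the group mean get a
    positive advantage; those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the mean over permutations of the *same* question, a permutation on which the model flips its answer is penalized relative to its siblings, which is what discourages position-dependent behavior.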

Merits

Strength in Addressing Selection Bias

PA-GRPO addresses selection bias at training time by enforcing permutation-consistent semantic reasoning, avoiding the per-query overhead that existing inference-time debiasing methods impose.
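Permutation consistency is also what the second mechanism, the consistency-aware reward, measures. The abstract gives no formula, so the following is only a plausible shaping term: map each predicted position back to the original option index and reward majority agreement across the group.

```python
from collections import Counter

def consistency_reward(decisions, perms):
    """Fraction of permutations agreeing on the same underlying option.

    Hypothetical shaping term. `decisions[i]` is the predicted position
    under permutation `perms[i]`, where position j of that permutation
    holds original option perms[i][j].
    """
    canonical = [perm[d] for d, perm in zip(decisions, perms)]
    majority = Counter(canonical).most_common(1)[0][1]
    return majority / len(canonical)
```

Under this sketch, a model that picks the same underlying option regardless of ordering earns 1.0, while a model that tracks positions rather than content earns less.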

Improved Performance

PA-GRPO maintains high overall performance while substantially reducing selection bias, making it a promising solution for real-world applications.

Flexibility and Scalability

Because PA-GRPO is a training-time objective built on GRPO, it can in principle be applied to any LLM trained with reinforcement learning, making it a scalable route to mitigating selection bias in large-scale applications.

Demerits

Computational Cost

PA-GRPO incurs additional training cost, since each instance expands into a group of permutations that must all be sampled and scored; this may limit its use in resource-constrained settings.

Hyperparameter Tuning

PA-GRPO requires careful hyperparameter tuning to achieve optimal performance, which can be time-consuming and may require significant expertise.

Expert Commentary

The proposed method, PA-GRPO, demonstrates a significant improvement over existing approaches to selection bias in LLMs. Combining permutation groups with a group-relative advantage and a consistency reward makes it a promising way to reduce selection bias while maintaining high overall performance. However, the added training cost and hyperparameter tuning requirements may be real limitations for resource-constrained settings. Nevertheless, PA-GRPO is a valuable contribution to the field, with clear relevance to real-world multiple-choice evaluation and LLM-as-judge pipelines.

Recommendations

  • Further research is needed to investigate the applicability of PA-GRPO to other types of LLMs and tasks.
  • The proposed method should be compared against a broader range of existing debiasing methods to establish its relative strengths and robustness.

Sources

Original: arXiv - cs.CL