When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
arXiv:2603.04968v1 Announce Type: new Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
Executive Summary
This paper proposes using a weak LLM as an annotator for preference alignment of large language models (LLMs). The authors find that training on only the weak LLM's most confident preference labels can beat training on full human annotations, and build on this with Confidence-Weighted Preference Optimization (CW-PO), a framework that re-weights training samples by the weak LLM's confidence and can be combined with different preference optimization objectives. Notably, a model aligned by CW-PO using just 20% of human annotations outperforms a model trained with 100% of annotations under standard DPO, pointing to a substantial reduction in the cost of preference alignment.
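The abstract does not spell out the exact CW-PO objective, so the following is a minimal sketch under the assumption that the confidence score acts as a per-sample multiplicative weight on a DPO-style loss; the function name, argument layout, and the `beta` default are illustrative, not the paper's.

```python
# Hedged sketch: confidence-weighted DPO-style loss (assumed form of CW-PO).
# Inputs are 1-D tensors of per-sample summed log-probabilities from the
# policy and the frozen reference model; `confidence` holds the weak LLM
# annotator's confidence for each preference pair.
import torch
import torch.nn.functional as F

def cw_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                confidence, beta=0.1):
    # Standard DPO logits: beta * (policy log-ratio - reference log-ratio)
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    per_sample_loss = -F.logsigmoid(logits)       # vanilla DPO term
    weighted_loss = confidence * per_sample_loss  # assumed CW re-weighting
    return weighted_loss.mean()
```

Because the weight multiplies the per-sample loss rather than changing its form, the same device could in principle wrap other preference objectives, which matches the abstract's claim that CW-PO applies across different preference optimization objectives; hard filtering (keeping only pairs above a confidence threshold), as in the paper's initial subset experiment, is the special case where the weight is 0 or 1.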
Key Points
- ▸ Weak LLMs can act as effective annotators for preference alignment
- ▸ Confidence-Weighted Preference Optimization (CW-PO) framework improves performance
- ▸ CW-PO with only 20% of human annotations outperforms standard DPO trained on 100%
Merits
Cost-Effectiveness
The proposed approach reduces the cost of preference alignment by leveraging weak LLMs and limited human annotations
Improved Performance
The CW-PO framework outperforms standard DPO trained on full human annotations while using only a fraction of the labels
Demerits
Limited Generalizability
The approach may not generalize to all types of LLMs or preference alignment tasks
Dependence on Confidence
The performance of CW-PO relies on the accuracy of the weak LLM's confidence estimates
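How that confidence is obtained is not specified in the abstract; one common and purely illustrative recipe (an assumption here, not the paper's method) is to ask the weak LLM to pick between two candidate responses labeled "A" and "B" and read the confidence off the probability mass it puts on the winning label token.

```python
# Hedged sketch: derive a confidence score from a weak-LLM pairwise judge
# by comparing the logits it assigns to the option tokens "A" and "B".
import torch

def judge_confidence(logits_last_token, token_id_a, token_id_b):
    # logits_last_token: the weak LLM's logits at the decision position
    # token_id_a / token_id_b: vocabulary ids of the option labels
    pair_logits = torch.stack([logits_last_token[token_id_a],
                               logits_last_token[token_id_b]])
    probs = torch.softmax(pair_logits, dim=0)
    confidence, choice = probs.max(dim=0)  # winning probability and index (0 = A, 1 = B)
    return confidence.item(), int(choice.item())
```

If the weak model is poorly calibrated, such scores may over- or under-weight noisy pairs, which is exactly the risk noted above.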
Expert Commentary
The approach has meaningful implications for building more efficient preference alignment pipelines. By using a weak LLM's confidence to select and weight training pairs, CW-PO reduces annotation cost while improving performance over fully human-labeled baselines. However, further research is needed to understand the limitations and potential biases of relying on a weak annotator's confidence, and to establish how well CW-PO scales and generalizes across model families and alignment tasks.
Recommendations
- ✓ Further research on the limitations and potential biases of CW-PO
- ✓ Investigation into the scalability and generalizability of CW-PO to different types of LLMs and preference alignment tasks