
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger


Amirabbas Afzali, Myeongho Jeon, Maria Brbic

arXiv:2603.04968v1 Announce Type: new Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.

Executive Summary

The paper proposes a novel approach to preference alignment in large language models (LLMs) that uses a weak LLM as an annotator. Its Confidence-Weighted Preference Optimization (CW-PO) framework re-weights training samples by the weak LLM's confidence and achieves better performance than training on full human annotations. Notably, a model aligned by CW-PO with just 20% of human annotations outperforms a model trained with 100% of annotations under standard DPO, suggesting a significant reduction in the cost of preference alignment.
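The core idea can be sketched as a per-sample weighting of the standard DPO objective. The abstract does not specify CW-PO's exact weighting function, so the following is an illustrative sketch in which each preference pair's DPO loss is simply scaled by the weak annotator's confidence:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def cw_dpo_loss(pi_chosen: float, pi_rejected: float,
                ref_chosen: float, ref_rejected: float,
                confidence: float, beta: float = 0.1) -> float:
    """Confidence-weighted DPO loss for a single preference pair.

    The four log-probabilities are summed over response tokens under the
    policy (pi_*) and reference (ref_*) models. `confidence` in [0, 1] is
    the weak-LLM annotator's confidence; multiplying the loss by it is an
    assumed weighting scheme, not the paper's verified formulation.
    """
    # Standard DPO implicit-reward margin between chosen and rejected.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Down-weight the pair's contribution by the annotator's confidence.
    return -confidence * math.log(sigmoid(margin))
```

Pairs the weak LLM labels confidently contribute fully to the gradient, while near-indifferent pairs are scaled toward zero, which is consistent with the paper's finding that high-confidence subsets are the most valuable.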

Key Points

  • Weak LLMs can act as effective annotators for preference alignment
  • Confidence-Weighted Preference Optimization (CW-PO) framework improves performance
  • CW-PO achieves better results with limited human annotations

Merits

Cost-Effectiveness

The proposed approach reduces the cost of preference alignment by leveraging weak LLMs and only a small fraction of human annotations.

Improved Performance

The CW-PO framework achieves better performance than training on full human annotations.

Demerits

Limited Generalizability

The approach may not generalize to all types of LLMs or preference alignment tasks.

Dependence on Confidence

The performance of CW-PO relies on the accuracy of the weak LLM's confidence estimates; miscalibrated confidence could up-weight noisy labels.
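This dependence is concrete: the confidence score itself must be derived from the weak LLM's output. A simple, hypothetical estimator maps the annotator's preference probability to a confidence score and filters out near-indifferent pairs (the paper's actual estimator is not described in the abstract):

```python
def annotator_confidence(p_prefer_chosen: float) -> float:
    """Map the weak LLM's probability of preferring the chosen response
    into a [0, 1] confidence score. Illustrative choice: rescaled
    distance from the 0.5 indifference point."""
    return abs(p_prefer_chosen - 0.5) * 2.0

def select_confident(pairs, threshold: float = 0.8):
    """Keep only pairs the weak annotator labels confidently.

    `pairs` is a list of (prompt, chosen, rejected, p_prefer_chosen)
    tuples; the subset-selection step from the paper is approximated
    here by a hard confidence threshold.
    """
    return [p for p in pairs if annotator_confidence(p[3]) >= threshold]
```

If the weak model's probabilities are poorly calibrated, this filter keeps the wrong subset, which is exactly the failure mode noted above.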

Expert Commentary

The proposed approach has significant implications for the development of more efficient and effective preference alignment methods in LLMs. By leveraging weak LLMs and confidence weighting, CW-PO can reduce the cost and improve the performance of preference alignment. However, further research is needed to fully understand the limitations and potential biases of this approach. Additionally, the scalability and generalizability of CW-PO to different types of LLMs and preference alignment tasks require further investigation.

Recommendations

  • Further research on the limitations and potential biases of CW-PO
  • Investigation into the scalability and generalizability of CW-PO to different types of LLMs and preference alignment tasks
