
PrefPO: Pairwise Preference Prompt Optimization

Rahul Singhal, Pradyumna Tambwekar, Karime Maamari

arXiv:2603.19311v1 Announce Type: new Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.

Executive Summary

This article introduces PrefPO, a prompt optimization approach inspired by reinforcement learning from human feedback (RLHF): an LLM discriminator expresses pairwise preferences over model outputs, and that feedback drives an LLM optimizer that iteratively refines the prompt. Because optimization is preference-based, PrefPO reduces the need for labeled datasets and hyperparameter tuning, requiring only a starting prompt and natural language criteria. Evaluated on 9 BIG-Bench Hard tasks and IFEval-Hard, PrefPO matches or exceeds state-of-the-art (SOTA) methods on most tasks, works in both labeled and unlabeled settings, improves prompt hygiene by reducing verbosity and repetition, and is less susceptible to prompt hacking than competing optimizers.

Key Points

  • PrefPO introduces a preference-based approach to prompt optimization, inspired by reinforcement learning from human feedback.
  • PrefPO reduces the need for labeled data and hyperparameter tuning; only a starting prompt and natural language criteria are required.
  • PrefPO matches or exceeds SOTA methods (GEPA, MIPRO, TextGrad) on 6/9 BIG-Bench Hard tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%).
  • PrefPO improves prompt hygiene, reducing verbosity and repetition, and is less susceptible to prompt hacking.
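The loop described in the abstract, generating outputs, collecting pairwise preferences from an LLM discriminator, and feeding that signal to an LLM optimizer, can be sketched as below. Every function here (`generate_output`, `judge_prefers`, `optimize_prompt`) is a hypothetical stand-in for an LLM call, chosen only to make the loop runnable; the paper does not publish this code, so treat this as an illustration of the idea, not the authors' implementation.

```python
# Hypothetical stand-ins for LLM calls; in practice each would query a model.
def generate_output(prompt: str, example: str) -> dict:
    """Toy task model: returns an output plus a numeric quality proxy."""
    return {"text": f"{prompt} -> {example}", "score": len(prompt)}

def judge_prefers(criteria: str, out_a: dict, out_b: dict) -> bool:
    """LLM discriminator stand-in: pairwise preference between two outputs."""
    return out_a["score"] > out_b["score"]

def optimize_prompt(prompt: str, criteria: str) -> str:
    """LLM optimizer stand-in: rewrites the prompt guided by the criteria."""
    return prompt + " Be concise."

def prefpo(seed_prompt: str, examples: list, criteria: str,
           iterations: int = 3) -> str:
    """Keep whichever prompt the judge prefers on a majority of examples.
    No labels are needed: the judge compares outputs against the criteria."""
    best = seed_prompt
    for _ in range(iterations):
        candidate = optimize_prompt(best, criteria)
        wins = sum(
            judge_prefers(criteria,
                          generate_output(candidate, ex),
                          generate_output(best, ex))
            for ex in examples
        )
        if wins > len(examples) / 2:  # majority preference: adopt candidate
            best = candidate
    return best
```

Note that the loop never consults ground-truth answers, which is what lets a method of this shape run in the unlabeled setting the article highlights.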

Merits

Effective in both labeled and unlabeled settings

PrefPO can optimize with or without ground-truth labels; without labels, it closely matches its labeled performance on 6/9 tasks, making it usable where labeled datasets are unavailable.

Improved prompt hygiene

PrefPO reduces verbosity and repetition in generated prompts by 3-5x relative to existing methods, whose prompts can balloon to 14.7x their original length or carry 34% repetitive content.

Robustness to prompt hacking

PrefPO games evaluation criteria at roughly half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.

Demerits

Residual susceptibility to prompt hacking

Although PrefPO hacks its evaluation criteria far less often than TextGrad, it still does so in 37% of cases; it is not immune, and further research is needed to close the gap.

Dependence on LLM judge reliability

PrefPO replaces human raters with an LLM discriminator, so optimization quality is bounded by the judge model: biased or inconsistent pairwise preferences could mislead the optimizer, especially on tasks where the natural language criteria are ambiguous.

Expert Commentary

PrefPO represents a meaningful advance in automated prompt optimization. By adapting the pairwise-preference signal of RLHF, with an LLM discriminator standing in for human raters, the authors offer a practical route to optimizing prompts when labeled data is scarce. Just as notable are the findings on prompt hygiene and prompt hacking: optimizers that inflate prompts or game their evaluation criteria produce brittle, misaligned results, so measuring these failure modes deserves as much attention as headline accuracy. Continued work on PrefPO's limitations, particularly its residual susceptibility to hacking and its dependence on judge reliability, will determine how far the approach generalizes.

Recommendations

  • Future research should focus on exploring PrefPO's scalability and applicability in various settings, including scenarios with limited human feedback.
  • Developers and practitioners should consider preference-based optimizers such as PrefPO when ground-truth labels are unavailable or when prompt verbosity and brittleness are concerns.

Sources

Original: arXiv - cs.CL