TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
arXiv:2603.00025v1 Announce Type: new Abstract: Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens, balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence spans from patient-provider messages. TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
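To make the abstract's objective concrete, the following is a minimal NumPy sketch of a TAB-PO-style loss. The exact functional form is not given in the abstract, so the specific combination below (a DPO-style logistic term over token-weighted, reference-adjusted advantages, plus an NLL barrier applied only to under-confident chosen tokens) is an assumption; the function name `tabpo_loss` and all hyperparameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tabpo_loss(logp_c, logp_c_ref, logp_r, logp_r_ref, w_c, w_r,
               beta=0.1, tau=np.log(0.5), lam=1.0):
    """Hypothetical TAB-PO-style objective (form assumed, not from the paper).

    logp_c, logp_r         : per-token log-probs under the policy
                             (chosen / rejected completion)
    logp_c_ref, logp_r_ref : per-token log-probs under the frozen reference
    w_c, w_r               : per-token importance weights (e.g. high on label
                             and span tokens, low on JSON scaffolding)
    beta                   : DPO temperature
    tau                    : confidence threshold for the conditional barrier
    lam                    : barrier strength
    """
    # Token-weighted, reference-adjusted advantages: instead of DPO's uniform
    # sum over the sequence, each token's log-ratio is scaled by its weight,
    # concentrating signal on the rare semantic tokens.
    adv_c = np.sum(w_c * (logp_c - logp_c_ref))
    adv_r = np.sum(w_r * (logp_r - logp_r_ref))
    pref = -np.log(sigmoid(beta * (adv_c - adv_r)))

    # Conditional token-level barrier: an SFT-like NLL penalty that fires only
    # on chosen tokens whose likelihood has dropped below tau, anchoring
    # absolute likelihood against the "likelihood squeezing" failure mode.
    under = logp_c < tau
    barrier = -np.sum(w_c[under] * logp_c[under])

    return pref + lam * barrier
```

Under this sketch, when all chosen tokens stay confident the barrier vanishes and the loss reduces to a token-weighted DPO term; when the optimizer squeezes a chosen label token below the threshold, the barrier pulls its likelihood back up.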
Executive Summary
The article introduces Token-Adaptive Barrier Preference Optimization (TAB-PO), a novel method to address the limitations of Direct Preference Optimization (DPO) in token-critical structured prediction settings. TAB-PO incorporates token-weighted, reference-adjusted advantages to prioritize high-value semantic tokens and a conditional token-level barrier to regularize under-confident tokens. Evaluated on medical communication annotation, TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines. The method targets the challenges that arise for DPO in this regime: low-separation preference pairs, token-importance skew, margin collapse, likelihood squeezing, and gradient dilution. The results demonstrate the effectiveness of TAB-PO in token-critical settings where standard DPO struggles.
Key Points
- ▸ Direct Preference Optimization (DPO) suffers from limitations in token-critical structured prediction settings
- ▸ Token-Adaptive Barrier Preference Optimization (TAB-PO) addresses these limitations by incorporating token-weighted, reference-adjusted advantages
- ▸ TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines
Merits
Strength
TAB-PO effectively addresses the limitations of DPO in token-critical settings, leading to improved performance
Robustness
TAB-PO's token-weighted, reference-adjusted advantages and conditional token-level barrier provide robustness to low-separation preference pairs and token-importance skew
Demerits
Limitation
The evaluation of TAB-PO is limited to a single task (medical communication annotation), and its performance on other tasks is unclear
Complexity
TAB-PO introduces additional complexity compared to DPO: per-token importance weighting and the conditional barrier add extra hyperparameters (token weights, confidence threshold, barrier strength) whose tuning may require significant computational resources and expertise
Expert Commentary
The article presents a significant contribution to the field of preference optimization, particularly in the context of token-critical structured prediction. TAB-PO's approach addresses the limitations of DPO and achieves improved performance on a challenging task. However, the evaluation is limited to a single task, and the complexity of TAB-PO may be a barrier to adoption. Future work should focus on evaluating TAB-PO on a broader range of tasks and exploring ways to simplify the method without sacrificing its effectiveness.
Recommendations
- ✓ Recommendation 1: Further evaluation of TAB-PO on a diverse range of tasks to assess its generalizability
- ✓ Recommendation 2: Exploration of methods to simplify TAB-PO while maintaining its effectiveness