Discounted Beta--Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards
arXiv:2603.18444v1. Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta--Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
Executive Summary
This article proposes a novel approach to reinforcement learning with verifiable rewards (RLVR) by reformulating it as a statistical estimation problem. The Discounted Beta--Bernoulli (DBB) reward estimation method leverages historical reward statistics to estimate the reward distribution from finite data, reducing variance and achieving lower mean squared error than standard point estimation. Experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate the effectiveness of GRPO with DBB, outperforming naive GRPO without additional computational cost or memory usage. This work has significant implications for improving the reasoning capabilities of large language models and addresses a critical challenge in RLVR, namely sample inefficiency.
Key Points
- ▸ The article reformulates RLVR as a statistical estimation problem
- ▸ DBB reward estimation leverages historical reward statistics to reduce variance
- ▸ Experiments show GRPO with DBB outperforming naive GRPO across six in-distribution and three out-of-distribution reasoning benchmarks
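To make the second point concrete, the core idea can be sketched as a discounted Beta--Bernoulli posterior over each prompt's success rate: old pseudo-counts are decayed to track the non-stationary policy, new binary rollout rewards are folded in, and the smoothed estimate replaces the per-group sample mean and standard deviation in GRPO-style advantage normalization. This is an illustrative sketch under assumed hyperparameters (prior counts, discount factor) and an assumed per-prompt bookkeeping scheme, not the authors' exact implementation.

```python
import math


class DBBRewardEstimator:
    """Illustrative sketch of Discounted Beta--Bernoulli reward estimation.

    Keeps per-prompt Beta pseudo-counts (alpha, beta) that are exponentially
    discounted on each update, so older rollouts count for less under a
    drifting policy. Hyperparameter values here are assumptions.
    """

    def __init__(self, alpha0=1.0, beta0=1.0, discount=0.9):
        self.alpha0 = alpha0      # prior pseudo-count for successes
        self.beta0 = beta0        # prior pseudo-count for failures
        self.discount = discount  # per-update decay of historical counts
        self.counts = {}          # prompt_id -> (alpha, beta)

    def update(self, prompt_id, rewards):
        """Fold a group of binary rollout rewards into the discounted posterior."""
        alpha, beta = self.counts.get(prompt_id, (self.alpha0, self.beta0))
        successes = sum(rewards)
        failures = len(rewards) - successes
        # Discount history first, then add the fresh evidence.
        alpha = self.discount * alpha + successes
        beta = self.discount * beta + failures
        self.counts[prompt_id] = (alpha, beta)

    def mean_and_std(self, prompt_id):
        """Posterior mean success rate and the Bernoulli std at that mean."""
        alpha, beta = self.counts.get(prompt_id, (self.alpha0, self.beta0))
        p = alpha / (alpha + beta)
        # Positive prior counts keep p strictly inside (0, 1), so the
        # estimated std never collapses to zero, even for all-0/all-1 groups.
        return p, math.sqrt(p * (1.0 - p))

    def advantages(self, prompt_id, rewards):
        """GRPO-style normalized advantages using the smoothed estimates."""
        p, std = self.mean_and_std(prompt_id)
        return [(r - p) / std for r in rewards]


est = DBBRewardEstimator()
est.update("q1", [1, 1, 1, 1])            # an all-correct rollout group
advs = est.advantages("q1", [1, 1, 1, 1])  # finite, no division by zero
```

Note the contrast with naive group normalization: for an all-correct (or all-wrong) group the sample standard deviation is zero and the advantage is undefined, whereas the discounted posterior keeps the estimate strictly inside (0, 1), which is the variance-collapse avoidance the abstract refers to.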
Merits
Strength in Addressing Sample Inefficiency
The work effectively addresses a critical challenge in RLVR, namely sample inefficiency, by replacing high-variance per-group point estimates of the reward with a discounted Bayesian estimator that pools evidence across rollouts.
Improved Estimation Accuracy
DBB reward estimation achieves lower mean squared error than standard point estimation, indicating improved estimation accuracy.
Practical Significance
The work has significant practical implications for improving the reasoning capabilities of large language models.
Demerits
Limited Domain Generalization
The work primarily focuses on in-distribution and out-of-distribution reasoning benchmarks and may not generalize well to other domains.
Assumptions and Simplifications
The method models rewards as samples from a non-stationary, policy-induced Bernoulli distribution, an assumption that may not hold in all scenarios, particularly for non-binary or continuous reward signals.
Expert Commentary
The article presents a well-structured and well-motivated approach to addressing sample inefficiency in RLVR. The proposed DBB reward estimation method is theoretically sound and demonstrates improved estimation accuracy in experiments. However, the work assumes a policy-induced distribution and a non-stationary distribution, which may not hold in all scenarios. Furthermore, the limited domain generalization of the work may restrict its applicability to other domains. Nevertheless, the work has significant practical implications for improving the reasoning capabilities of large language models and addresses a critical challenge in RLVR.
Recommendations
- ✓ Future work should investigate the applicability of DBB reward estimation to other domains and scenarios.
- ✓ The authors should explore relaxing the method's distributional assumptions to improve the robustness of the approach.