1.7 VAPO: The Return and Repair of the Value-Based Approach

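Section summary

VAPO combines PPO's value-based approach with the improvements introduced by GRPO and DAPO. With three fixes to the critic, it reaches a score of 60.4 on AIME 2024 in long chain-of-thought (CoT) reasoning.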

Paper: VAPO

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks (ByteDance Seed, 2025)
arXiv: 2504.05118
Status: cross-branch hybrid algorithm (PPO + GRPO + DAPO); scores 60.4 on AIME 2024

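Core claim

"Value-free methods (such as GRPO) are good enough for short tasks, but in long CoT reasoning the value-based approach has advantages that cannot be replaced. The critic just needs to be fixed."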

Core formulas

VAPO total loss:

\[ \mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta) + \mu \cdot \mathcal{L}_{\text{NLL}}(\theta) \quad (\mu = 0.1) \]

The PPO term uses Clip-Higher, a token-level loss, and GAE advantages (rather than group-relative advantages).
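To make the Clip-Higher and token-level pieces concrete, here is a minimal pure-Python sketch of a clipped PPO loss with asymmetric clip bounds, averaged over all tokens in the batch. The function name and the default bounds `eps_low=0.2` and `eps_high=0.28` are illustrative assumptions, not values stated in this section, and the advantages are taken as given inputs.

```python
import math

def ppo_clip_higher_token_loss(logp_new, logp_old, advantages,
                               eps_low=0.2, eps_high=0.28):
    """Clipped PPO loss with asymmetric bounds, averaged per token.

    Each argument is a list of responses; each response is a list of
    per-token floats. eps_low / eps_high are assumed, illustrative values.
    Assumes the batch contains at least one token.
    """
    total, n_tokens = 0.0, 0
    for seq_new, seq_old, seq_adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old, adv in zip(seq_new, seq_old, seq_adv):
            ratio = math.exp(lp_new - lp_old)
            # Clip-Higher: the upper bound is looser than the lower bound.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            # PPO maximizes min(ratio * A, clipped * A); the loss is its negative.
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level loss: one average over every token in the batch,
    # so long responses are not down-weighted by a per-response mean.
    return total / n_tokens
```

With a positive advantage, a token whose probability ratio has grown past `1 + eps_high` contributes only the clipped value, so widening `eps_high` alone gives such tokens more room to increase.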

Positive-example imitation loss (Positive-Example LM Loss):

\[ \mathcal{L}_{\text{NLL}}(\theta) = -\frac{1}{\sum_{o_i \in \mathcal{T}} |o_i|} \sum_{o_i \in \mathcal{T}} \sum_{t=1}^{|o_i|} \log \pi_\theta(a_t \mid s_t) \]

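where \(\mathcal{T}\) is the set of correct responses in the current batch. This is a form of self-imitation learning: the log-likelihood of correct responses is maximized directly. Removing this loss lowers the AIME 2024 score from 60 to 54 (−6).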
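The two formulas above can be sketched in a few lines of pure Python. This is only an illustration: the function names are hypothetical, per-token log-probabilities are assumed to be precomputed, and returning 0 when no response in the batch is correct is an assumption of this sketch rather than something the section specifies.

```python
def positive_example_nll(logps, is_correct):
    """Mean negative log-likelihood over the tokens of correct responses.

    logps: list of responses, each a list of per-token log pi(a_t | s_t).
    is_correct: list of booleans, one per response (membership in T).
    """
    total, n_tokens = 0.0, 0
    for seq_logps, ok in zip(logps, is_correct):
        if ok:
            total += sum(seq_logps)
            n_tokens += len(seq_logps)
    if n_tokens == 0:
        return 0.0  # assumption: no correct response means no NLL term
    return -total / n_tokens

def vapo_total_loss(ppo_loss, nll_loss, mu=0.1):
    """L = L_PPO + mu * L_NLL, with mu = 0.1 as in the formula above."""
    return ppo_loss + mu * nll_loss
```

For example, with `logps = [[-1.0, -3.0], [-10.0], [-2.0]]` and `is_correct = [True, False, True]`, only the first and third responses count, and the normalization is over their three tokens combined rather than per response.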

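Three critic fixes

Technique	What it does	AIME 2024 score when removed
Value Pretraining	Before RL starts, pretrain the critic on Monte Carlo (MC) returns for about 50 steps	11 (−49, near collapse)
Decoupled GAE	The critic uses \(\lambda=1.0\) (unbiased); the policy uses \(\lambda=0.95\) (lower variance)	33 (−27)
Length-Adaptive GAE	\(\lambda_{\text{policy}} = 1 - \frac{1}{\alpha \cdot l}\), where \(l\) is the response length, so that the total weight on TD errors, \(\sum_{t \ge 0} \lambda_{\text{policy}}^{t} = \frac{1}{1-\lambda_{\text{policy}}} = \alpha \cdot l\), grows in proportion to the sequence length	45 (−15)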
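The Decoupled and Length-Adaptive GAE rows can be illustrated with a small pure-Python sketch. It assumes \(\gamma = 1\), a value of 0 after the final token, and an illustrative \(\alpha = 0.05\); all function names are hypothetical. Clamping \(\lambda\) at 0 for very short responses (where \(\alpha \cdot l < 1\)) is an assumption of this sketch, not part of the formula in the table.

```python
def gae(rewards, values, lam, gamma=1.0):
    """Generalized Advantage Estimation for one response.

    rewards[t] and values[t] are per-token; the value after the last
    token is taken to be 0.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def length_adaptive_lambda(length, alpha=0.05):
    """lambda_policy = 1 - 1 / (alpha * l); alpha is an assumed value.

    Clamped at 0 for short responses (an assumption of this sketch).
    """
    return max(0.0, 1.0 - 1.0 / (alpha * length))

def decoupled_advantages(rewards, values, alpha=0.05):
    """Return (policy advantages, critic regression targets)."""
    # Critic side: lambda = 1.0, so advantage + value is the MC return.
    critic_adv = gae(rewards, values, lam=1.0)
    value_targets = [a + v for a, v in zip(critic_adv, values)]
    # Policy side: lambda depends on the response length.
    lam_policy = length_adaptive_lambda(len(rewards), alpha)
    policy_adv = gae(rewards, values, lam=lam_policy)
    return policy_adv, value_targets
```

Under these assumptions, a 400-token response gets \(\lambda_{\text{policy}} = 1 - 1/20 = 0.95\), while a 100-token response gets \(0.8\), so longer responses propagate the final reward further back through the sequence.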

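📖 Beginner's note: why is Value Pretraining so critical?

Initializing the critic from the reward model introduces a positive bias: the RM is trained to score at the <EOS> position and assigns lower scores to earlier tokens, whereas the critic must give an accurate estimate of the expected cumulative return at every token. Without pretraining, the critic's initial estimates are unreliable, the advantages computed by GAE are mostly noise, and RL heads in the wrong direction from the very first steps.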
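As a rough illustration of what "pretraining the critic on MC returns" could look like, the sketch below builds per-token regression targets from a response's rewards and measures the critic's squared error against them. It assumes a sparse reward on the final token and \(\gamma = 1\); the function names are hypothetical and the actual training loop (optimizer, batching, number of steps) is left out.

```python
def mc_return_targets(rewards, gamma=1.0):
    """Return-to-go from each token position (Monte Carlo target)."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    targets.reverse()
    return targets

def critic_mse(values, targets):
    """Mean squared error between critic predictions and MC targets."""
    return sum((v - g) ** 2 for v, g in zip(values, targets)) / len(values)
```

With a single terminal reward and no discounting, every token of a correct response gets the same target (the final reward), which is exactly the kind of signal an RM-initialized critic does not produce for early tokens.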

Where each technique comes from

Technique	Source	Inherited or new
Value Model + GAE	PPO	Inherited
Group Sampling	GRPO	Inherited
Clip-Higher, Token-Level Loss, Dynamic Sampling	DAPO	Inherited
Value Pretraining, Decoupled GAE, Length-Adaptive GAE, Positive-Example LM Loss	VAPO (original)	New

Ablation summary

Component removed	AIME 2024	Drop (relative to the rounded score of 60)
None (full VAPO)	60.4	n/a
Value Pretraining	11	−49
Decoupled GAE	33	−27
Length-Adaptive GAE	45	−15
Clip-Higher	46	−14
Token-Level Loss	53	−7
Positive-Example LM Loss	54	−6
All of the above (vanilla PPO, no improvements)	5	−55