1.11 Quick-Reference Table of All Algorithm Formulas
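Usage notes
This table summarizes, for each core post-training algorithm, the objective function, the advantage estimator, the importance-sampling (IS) granularity, and the clipping scheme. The objective column gives a simplified form of each algorithm's policy-gradient objective; see the corresponding sections for the full derivations.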
| Algorithm | Objective (core term) | Advantage estimator | IS granularity | Clipping / gating | Paradigm |
|---|---|---|---|---|---|
| PPO | \(\min(r_t A_t, \text{clip}(r_t) A_t)\) | GAE + Critic | Token | Symmetric hard clipping | RLHF |
| GRPO | \(\min(r_{i,t} \hat{A}_i, \text{clip}(r_{i,t}) \hat{A}_i)\) | Group Relative | Token | Symmetric hard clipping | RLVR |
| DAPO | \(\min(r_{i,t} \hat{A}_i, \text{clip}_{\text{asym}}(r_{i,t}) \hat{A}_i)\) | Group Relative | Token | Asymmetric hard clipping | RLVR |
| VAPO | Same as DAPO, plus \(\mu \cdot \mathcal{L}_{\text{NLL}}\) | GAE + Adaptive Critic | Token | Asymmetric hard clipping | RLVR |
| CISPO | \(\text{sg}(\text{clip}(r_{i,t})) \cdot \hat{A}_i \cdot \log\pi\) | Group Relative | Token | Clipped IS weight | RLVR |
| GSPO | \(\min(s_i \hat{A}_i, \text{clip}(s_i) \hat{A}_i)\) | Group Relative | Sequence | Hard clipping (very small \(\varepsilon\)) | RLVR |
| SAPO | \(\frac{4}{\tau}\sigma(\tau(r_{i,t}-1)) \cdot \hat{A}_i\) | Group Relative | Token → Seq | Soft gating | RLVR |
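
To make the table's entries concrete, the following is a minimal pure-Python sketch of four of its building blocks: the clipped surrogate shared by PPO, GRPO, and DAPO, a group-relative advantage, a sequence-level ratio of the kind GSPO uses, and the SAPO soft gate. All function names and default values (`eps_low`, `eps_high`, `tau`, the `1e-8` stabilizer) are illustrative assumptions, not taken from any reference implementation. The sketch also assumes that the group-relative advantage standardizes rewards by the group mean and standard deviation, and that the sequence ratio is the length-normalized (geometric-mean) token ratio; the per-algorithm sections give the exact definitions.

```python
import math


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Per-token term min(r * A, clip(r) * A).

    eps_low == eps_high gives the symmetric clipping of the PPO/GRPO rows;
    eps_high > eps_low gives the asymmetric clipping of the DAPO row.
    The default values are illustrative assumptions.
    """
    clipped = clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)


def group_relative_advantages(rewards):
    """Assumed form: standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def sequence_ratio(logp_new, logp_old):
    """Assumed form of s_i: geometric mean of the per-token probability ratios."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)


def soft_gate(ratio, tau=1.0):
    """SAPO-style gate from the table: (4 / tau) * sigmoid(tau * (ratio - 1))."""
    return (4.0 / tau) / (1.0 + math.exp(-tau * (ratio - 1.0)))


if __name__ == "__main__":
    advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    print("advantages:", advs)
    print("symmetric clip :", clipped_surrogate(1.5, advs[0]))
    print("asymmetric clip:", clipped_surrogate(1.25, advs[0], 0.2, 0.28))
    print("sequence ratio :", sequence_ratio([-1.0, -2.0], [-1.5, -2.5]))
    print("soft gate      :", soft_gate(1.1, tau=2.0))
```

One detail worth noting about the soft gate: its derivative with respect to the ratio is \(4\,\sigma(\cdot)(1-\sigma(\cdot))\), which equals 1 at \(r = 1\) for any \(\tau\). Near the on-policy point it therefore weights gradients like the unclipped term \(r \hat{A}\), and it decays smoothly as \(r\) moves away from 1 instead of cutting off abruptly.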