Zeyu Zheng's Blog

Sustain the climb

Background

The recent MAI-Thinking-1 technical report [1] introduces a "self-distillation" mechanism (Section 3.1.4): the authors periodically pause RL, run SFT on a clean checkpoint using the collected rollouts, and then resume training. This matches our approach in BFS-Prover-V2 [2]. Despite very different scales and domains, the training curves show the same pattern.


MAI-Thinking-1, Fig. 15 (Section 3.1.4). ⋆ = soft reset (self-distillation). Different colors = base upgrade (new pre/mid-trained model).

BFS-Prover-V2, Fig. 4 (Section 2.2). "Retrain" = soft reset (periodic retraining). "Scale up" = base upgrade (larger model).
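
The soft-reset schedule described above can be sketched as a simple loop. This is only a toy illustration: `Policy`, `rl_step`, `sft`, and `reset_every` are hypothetical names, and treating the base model as the "clean checkpoint" and emptying the rollout buffer after each reset are assumptions, not details taken from either paper.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    steps: int = 0  # RL steps taken since this policy was created


def rl_step(policy, buffer):
    """Placeholder for one RL update; also records a rollout in the buffer."""
    policy.steps += 1
    buffer.append(f"rollout from {policy.name} at step {policy.steps}")


def sft(clean_checkpoint, rollouts, round_idx):
    """Placeholder for SFT: returns a fresh policy built from the clean checkpoint."""
    assert rollouts, "SFT needs collected rollouts"
    return Policy(name=f"{clean_checkpoint.name}+sft{round_idx}")


def train_with_soft_resets(clean_checkpoint, total_steps, reset_every):
    policy = Policy(clean_checkpoint.name)
    buffer, resets = [], 0
    for step in range(1, total_steps + 1):
        rl_step(policy, buffer)
        # Pause RL, SFT the clean checkpoint on collected rollouts, resume.
        if step % reset_every == 0 and step < total_steps:
            resets += 1
            policy = sft(clean_checkpoint, buffer, resets)
            buffer = []  # assumption: each round starts a new buffer
    return policy, resets


final_policy, num_resets = train_with_soft_resets(Policy("base"), 10, 4)
print(final_policy.name, num_resets)
```

The point of the sketch is only the control flow: RL continues from the SFT'd policy, not from the drifted one.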

The train-inference numerics gap

In most modern RL systems, training and inference use different kernels and parallelism strategies for different optimization priorities, which causes them to produce slightly different logits for the same input. These discrepancies compound across long rollouts and destabilize the importance-sampling correction ([1], Section 3.6.4). As a concrete example, consider the GRPO objective used in [1]:

$$J(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^G |y_i|} \sum_{i=1}^G \sum_{t=1}^{|y_i|} \min\left(r_{i,t}(\theta)\, A_i,\; \text{clip}(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon)\, A_i\right)\right]$$

where $r_{i,t}(\theta) = \pi_\theta^{\text{train}}(y_{i,t} \mid q, y_{i,{<}t}) \,/\, \pi_{\text{old}}^{\text{inference}}(y_{i,t} \mid q, y_{i,{<}t})$. The numerator and denominator are computed by different systems, so

$$\log r_{i,t} \;=\; \log \frac{\pi_\theta^{\text{train}}}{\pi_{\text{old}}^{\text{inference}}} \;=\; \log \frac{\pi_\theta^{\text{inference}}}{\pi_{\text{old}}^{\text{inference}}} + \log \frac{\pi_\theta^{\text{train}}}{\pi_\theta^{\text{inference}}} \;=\; \underbrace{\log \frac{\pi_\theta^{\text{inference}}}{\pi_{\text{old}}^{\text{inference}}}}_{\text{policy change}} + \underbrace{\epsilon_t}_{\text{mismatch}}$$

where $\epsilon_t = \log \pi_\theta^{\text{train}} - \log \pi_\theta^{\text{inference}}$.
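
The decomposition can be checked numerically, and the same sketch shows how the min/clip term of the objective treats an inflated ratio. The log-probabilities, the clip range `clip_eps=0.2`, and the function names below are hypothetical values chosen for illustration only.

```python
def log_ratio_terms(logp_train_new, logp_infer_new, logp_infer_old):
    """Split log r into (policy change, train-inference mismatch)."""
    policy_change = logp_infer_new - logp_infer_old
    mismatch = logp_train_new - logp_infer_new
    return policy_change, mismatch


def surrogate_term(ratio, advantage, clip_eps=0.2):
    """Per-token min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical per-token log-probabilities for one sampled token.
logp_infer_old = -1.20  # inference engine, old policy
logp_infer_new = -1.10  # inference engine, current policy
logp_train_new = -1.07  # training engine, current policy

policy_change, mismatch = log_ratio_terms(
    logp_train_new, logp_infer_new, logp_infer_old
)
log_r = logp_train_new - logp_infer_old
assert abs(log_r - (policy_change + mismatch)) < 1e-12

# With a positive advantage, an inflated ratio is capped at 1 + eps.
capped = surrogate_term(3.0, advantage=1.0)
# With a negative advantage, the same ratio passes through unclipped.
uncapped = surrogate_term(3.0, advantage=-1.0)
print(policy_change, mismatch, capped, uncapped)
```

In the second case the term scales linearly with the ratio, so any mismatch that inflates the ratio also inflates the gradient.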

Each individual $\epsilon_t$ is small, but they accumulate over the sequence and can significantly corrupt the ratio. Normally the clip in the objective bounds the ratio to $[1-\epsilon, 1+\epsilon]$ (here $\epsilon$ is the clip range, not the mismatch $\epsilon_t$), but the GRPO objective intentionally leaves certain branches unclipped (when $A_i < 0$ and $r > 1$, or $A_i > 0$ and $r < 1$), allowing the policy to correct itself freely. In these branches, inflated ratios pass through unbounded and produce gradient-norm spikes. To address this, MAI

But even with all of these, drift can still accumulate over long runs ([1], Section 3.6.4). A soft reset addresses this by starting from a fresh checkpoint.

Model merging is another technique that can address this and is increasingly adopted by AI labs in their post-training pipelines. Different runs from the same base accumulate drift on different tokens, so averaging their parameters cancels out run-specific corruptions.
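
As a minimal sketch of what "averaging their parameters" can look like, the snippet below takes a uniform average over checkpoints stored as plain dictionaries of float lists. Uniform weighting, the dictionary layout, and the name `average_parameters` are assumptions for illustration; labs may use weighted or more elaborate merging schemes.

```python
def average_parameters(state_dicts):
    """Uniformly average checkpoints that share the same parameter names and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        columns = zip(*(sd[key] for sd in state_dicts))
        merged[key] = [sum(values) / n for values in columns]
    return merged


# Hypothetical runs from the same base: each one drifts on a different entry.
run_a = {"w": [1.5, 1.0, 1.0]}
run_b = {"w": [1.0, 1.0, 0.5]}
merged = average_parameters([run_a, run_b])
print(merged)
```

In this toy case each run-specific deviation is halved in the merged checkpoint, while the entries both runs agree on are unchanged.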

Beyond numerical precision

BFS-Prover-V2 uses expert iteration, an offline RL algorithm that does not compute importance-sampling ratios and is therefore not affected by train-inference numerical mismatch. Yet soft resets still help. The more general failure mode is the policy settling into a local optimum in parameter space, often accompanied by entropy collapse. In BFS-Prover-V2, each soft reset causes only a small performance drop but significantly restores entropy, allowing the model to resume exploration. In addition to soft resets, MAI-Thinking-1 also controls entropy during training with an adaptive mechanism ([1], Section 3.1.1), estimating policy entropy at each step:

$$\hat{H}(\pi_\theta) = \frac{1}{|T|} \sum_{(i,t) \in T} -\log \pi_\theta(y_{i,t} \mid \cdot) \cdot r_{i,t}(\theta)$$

and adjusting the upper clip bound $k$ to maintain target entropy $H^\star = 0.3$:

$$k \leftarrow \text{clip}\left(k + \delta \cdot \text{sign}(H^\star - \hat{H}(\pi_\theta)),\; 0,\; k_{\max}\right)$$
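
The two formulas above translate almost directly into code. In this sketch only the target $H^\star = 0.3$ comes from the report; the step size `delta`, the cap `k_max`, and the function names are hypothetical.

```python
def estimate_entropy(logps, ratios):
    """Importance-weighted estimate: mean over tokens of -log pi * r."""
    if len(logps) != len(ratios) or not logps:
        raise ValueError("need matching, non-empty token lists")
    return sum(-lp * r for lp, r in zip(logps, ratios)) / len(logps)


def update_upper_clip(k, entropy_hat, target=0.3, delta=0.01, k_max=0.5):
    """k <- clip(k + delta * sign(target - entropy_hat), 0, k_max)."""
    diff = target - entropy_hat
    sign = (diff > 0) - (diff < 0)
    return min(max(k + delta * sign, 0.0), k_max)


# Hypothetical token log-probs and ratios for one batch.
h_hat = estimate_entropy([-0.1, -0.3], [1.0, 1.0])
k_raised = update_upper_clip(0.20, h_hat)   # entropy below target: widen the clip
k_lowered = update_upper_clip(0.20, 0.5)    # entropy above target: tighten the clip
k_capped = update_upper_clip(0.50, h_hat)   # already at k_max: stays put
print(h_hat, k_raised, k_lowered, k_capped)
```

Because the update uses only the sign of the entropy error, the bound moves by a fixed step each time regardless of how far entropy is from the target.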

This helps with entropy, but parameters can still remain trapped in a basin that small gradient steps cannot escape. A soft reset jumps to a different point in parameter space entirely, which is why performance dips briefly then climbs past the previous ceiling. Model merging can also escape local optima, since independent runs settle into different basins and their average lands outside any single one.

Refresh your data

What data should the resetting SFT train on? Both MAI and BFS-Prover-V2 converge on three principles.


MAI-Thinking-1 (Section 3.1.3). Two-stage problem sampling with pass-rate filter $\rho \in [0.05, 0.8]$ then $\rho \in [0.1, 0.8]$.

BFS-Prover-V2, Fig. 2 (Section 2.2.1). Tactic-level perplexity filtering. Low and high tails filtered out.
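
Both filters can be sketched in a few lines. The pass-rate bands are the ones quoted in the caption above; everything else is an assumption for illustration: that the second stage re-estimates pass rates on the survivors of the first, the perplexity cutoffs `low_cut` and `high_cut`, and all function and field names.

```python
import math


def pass_rate_filter(pass_rates, low, high):
    """Keep problems whose pass rate lies inside [low, high]."""
    return {pid: r for pid, r in pass_rates.items() if low <= r <= high}


def tactic_perplexity(token_logprobs):
    """Perplexity of one tactic: exp of the mean negative token log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def drop_perplexity_tails(tactics, low_cut, high_cut):
    """Discard tactics in the low and high perplexity tails."""
    return [
        t for t in tactics
        if low_cut <= tactic_perplexity(t["logprobs"]) <= high_cut
    ]


# Stage 1: hypothetical pass rates from a first round of rollouts.
stage1 = {"p1": 0.0, "p2": 0.06, "p3": 0.5, "p4": 0.9}
kept1 = pass_rate_filter(stage1, 0.05, 0.8)

# Stage 2: hypothetical re-estimated pass rates for the survivors.
stage2 = {"p2": 0.08, "p3": 0.45}
kept2 = pass_rate_filter({p: stage2[p] for p in kept1}, 0.1, 0.8)

# Tactic-level filtering with hypothetical cutoffs.
tactics = [
    {"name": "trivial", "logprobs": [-0.01, -0.01]},
    {"name": "useful", "logprobs": [-0.5, -1.0]},
    {"name": "noisy", "logprobs": [-4.0, -4.0]},
]
kept_tactics = drop_perplexity_tails(tactics, low_cut=1.05, high_cut=20.0)
print(sorted(kept2), [t["name"] for t in kept_tactics])
```

The two filters work at different granularities: the pass-rate band selects whole problems, while the perplexity band selects individual tactics.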

For details on our approach, see the BFS-Prover-V2 paper.

References

[1] The Microsoft AI Team, MAI-Thinking-1: Building a Hill-Climbing Machine, Technical Report, 2026.

[2] R. Xin*, Z. Zheng*, Y. Nie*, K. Yuan, and X. Xiao, BFS-Prover-V2: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers, ICML, 2026.