§ IThe terrain
Training a modern language model happens in three acts. Understanding what each act does — and what it leaves unfinished — is the prerequisite for understanding why reinforcement learning is needed at all.
Act one: pretraining. A neural network with on the order of $10^{10}$–$10^{12}$ parameters is shown a substantial fraction of the public internet — perhaps $10^{13}$ tokens of text — and trained to predict the next token given the previous ones. The loss is cross-entropy. The result is a network that has internalized statistical regularities of human language: grammar, factual associations, style, code syntax, mathematical notation. But the model has no notion of being helpful. It is a vast conditional probability distribution $p(\text{token} \mid \text{context})$ that happens to be a remarkable mimic.
Act two: supervised fine-tuning (SFT). A few thousand to a few hundred thousand high-quality demonstrations of "instruction → response" are curated. The model is fine-tuned on these via the same next-token cross-entropy loss. It now knows the format of being an assistant. It will answer questions, follow directives, and produce structured output. But its responses, while plausible, may still be wrong, evasive, sycophantic, unsafe, or simply poor compared to what a thoughtful human would say.
Act three: reinforcement learning. This is where the model is taught not just how to respond but which responses are better than which. The signal is no longer "imitate this token" but "this whole response was good; that one was bad." Conveying that signal — and using it to actually improve the model — turns out to require a substantial mathematical apparatus. That apparatus is the subject of this essay.
Pretraining fits the model to a vast empirical distribution. SFT projects it onto the manifold of "assistant-like" responses. RL deforms that manifold — gently, with careful constraints — to peak around responses that humans (or verifiers) judge as good. The deformation has to be gentle because the reward signal is noisy and the manifold is high-dimensional, so an over-aggressive update destroys the very capabilities pretraining gave the model.
§ IIThe fundamental objective
Strip everything down and there is just one thing we want. Let $\pi_\theta(g \mid p)$ be the language model's probability of producing response $g$ given prompt $p$, with parameters $\theta$. Let $R(g, p)$ be some scalar measure of how good response $g$ is. Then we want
"Find parameters that, on prompts drawn from our distribution, produce responses with high expected reward." That is the entire goal. Every algorithm we will discuss — PPO, GRPO, DPO, KTO, SimPO — is a different attempt to estimate or approximate the gradient of this objective. The differences come from three thorny questions.
Question 1 — Where does $R$ come from?
Humans cannot grade a billion responses. Two solutions dominate. Either we train a separate reward model $R_\phi$ on pairs of responses ranked by humans (this is RLHF, used in PPO, GRPO), or we sidestep the reward entirely and work directly with the preference data (DPO and descendants). For reasoning tasks, a third option has emerged: verifiable rewards, where $R$ is a hard-coded function like "does this math answer match the ground truth" or "does this code pass the unit tests."
Question 2 — How do we differentiate through sampling?
The variable $g$ is sampled stochastically from $\pi_\theta$, and sampling is not a differentiable operation. The classical trick is the policy gradient theorem, which uses the log-derivative identity to convert a gradient of an expectation into an expectation of a gradient. We will derive it.
Question 3 — How do we keep the model from breaking itself?
The reward $R$ is imperfect. Push too hard on it and the model exploits its flaws — generating bizarre, incoherent, but high-scoring nonsense. This is reward hacking, and every modern algorithm includes a regularizer that anchors the trained policy near a trusted reference (the SFT model, $\pi_0$). The form of this anchor — KL penalty, clipping, sequence-level constraints — is one of the main axes along which methods differ.
Every post-training RL method navigates the same triangle: signal source, gradient estimation, drift control. The methods we will study are different points in this triangle, each emphasizing a different trade-off.
§ IIIPPO — the workhorse
Proximal Policy Optimization is the algorithm that powered InstructGPT, ChatGPT, GPT-4, Llama 2's RLHF stage, and most of the alignment work from 2022 through 2024. It is more complex than it needs to be, and that complexity is the price of being the first method that actually worked reliably at the scale of large language models. Let us derive it from scratch.
Step 1. The log-derivative trick
We want $\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{g \sim \pi_\theta}[R(g)]$. Writing the expectation as a sum and pushing the gradient inside,
The identity $\nabla \pi = \pi \nabla \log \pi$ is what makes this work. The gradient is now an expectation, which we can estimate by Monte Carlo: sample a response, compute its reward, multiply by the gradient of its log-probability. This is called REINFORCE, and it is the most basic policy-gradient algorithm.
This is the same trick that appears in statistical mechanics whenever you differentiate a partition function: $\nabla \log Z = \langle \nabla H \rangle / Z$, swapping a derivative of a sum for an expectation. The log-derivative trick is just the same identity wearing different clothes.
Step 2. Variance reduction via a baseline
REINFORCE is unbiased but its variance is catastrophic. If $R(g) > 0$ for every $g$, every sample pushes up the log-probability of its response — the gradient never says "this was bad, do less of it." We just push everything up, harder for some than others. Distinguishing "more good" from "less good" by averaging Monte Carlo samples requires enormous batches.
The fix exploits a delightful identity. For any function $b(p)$ that depends only on the prompt and not the response,
The score function has mean zero. So we can subtract any prompt-dependent baseline $b(p)$ without biasing the gradient, and the variance-minimizing choice is approximately the value function $V(p) = \mathbb{E}_{g \sim \pi_\theta}[R(g \mid p)]$ — the average reward we expect from prompt $p$ under the current policy. This gives the advantage:
Now the gradient signal is "this response was better than typical for this prompt" or "this response was worse." A response with $A > 0$ gets its probability pushed up; $A < 0$ pushes down. The variance plummets.
We do not know $V$, so we train a second neural network — the critic, $V_\phi$ — by regression against observed returns. It is updated alongside the policy.
Step 3. Per-token credit assignment
So far the response $g$ has been treated as one atomic object. But a response is a sequence of tokens $g = (a_1, a_2, \ldots, a_T)$, and $\pi_\theta(g \mid p) = \prod_t \pi_\theta(a_t \mid s_t)$, where $s_t = (p, a_1, \ldots, a_{t-1})$ is the prefix. So
The sequence-level advantage $A(g)$ multiplies every token's gradient equally. If the response was good because of tokens 5–10 and tokens 30–40 were just filler, this is wasteful. We want to assign credit per token.
The key observation: token $a_t$ can only influence rewards at time $t$ or later. So define a per-token reward $r_t$ (we will see how shortly), and let $\hat{A}_t$ be a per-token advantage. The clean form is the TD error:
In words: "How much better did things look one step later than my value function predicted?" If $V_\phi$ predicted things would be $V_\phi(s_t)$ from here, but after token $a_t$ I observed a small reward $r_t$ and the value of the new state is $V_\phi(s_{t+1})$, then the discrepancy $\delta_t$ is a single-step estimate of the advantage of having taken action $a_t$.
A pure single-step TD estimate is biased (it relies entirely on $V_\phi$, which is imperfect). The Monte Carlo alternative — sum all future rewards — is unbiased but high-variance. The compromise is Generalized Advantage Estimation (GAE), which interpolates:
The hyperparameter $\lambda \in [0,1]$ controls the bias–variance trade. At $\lambda = 0$ we recover the one-step TD; at $\lambda = 1$ we get the Monte Carlo sum. In practice, $\lambda \approx 0.95$. The computation is one backward pass through the trajectory.
Step 4. Where do the per-token rewards $r_t$ come from?
The reward model gives one number for the whole response: $R(g, p)$. The standard convention is to place this reward at the final token and zero elsewhere, then add a per-token KL penalty to the reference policy $\pi_0$:
The KL term is the anchor we discussed earlier. It penalizes every token whose probability under the updated policy deviates from the original SFT policy $\pi_0$. The coefficient $\beta$ controls how tightly the model is tied to its origins.
Step 5. Off-policy reuse and the clipped surrogate
Sampling responses from a large language model is expensive. We would like to take many gradient steps per sampled batch. But after the first step, $\theta$ has moved, and the samples are no longer drawn from the current $\pi_\theta$. Importance sampling rescues us:
Define the per-token probability ratio
and the surrogate objective
At $\theta = \theta_{\text{old}}$ this equals the standard policy gradient. As $\theta$ drifts, the importance ratios get unreliable. PPO's specific innovation is the clipped surrogate:
with $\epsilon \approx 0.2$. The clipping caps the importance ratio at $1 \pm \epsilon$, so no single gradient step can swing the probability of a token by more than 20%. This is the "proximal" in PPO: each update stays close to the previous policy.
The $\min$ inside the objective is asymmetric on purpose. When $\hat{A}_t > 0$ (good token), it caps how aggressively you can push the probability up. When $\hat{A}_t < 0$ (bad token), it caps how aggressively you can push it down. In both cases, the surrogate is a lower bound on the true objective — gradient ascent on it is provably gradient ascent on something close to the truth.
Putting it together
Roll out
Sample a batch of prompts. For each, generate a response from $\pi_\theta$ token by token. Record every $\log \pi_\theta(a_t)$ and $\log \pi_0(a_t)$.
Score
Pass each response through the reward model to get $R(g, p)$. Compute per-token rewards $r_t$ by combining the KL penalty (every token) with $R$ placed on the final token.
Compute advantages
Use the frozen value network $V_\phi^{\text{old}}$ to compute TD errors $\delta_t$, then GAE-combine them backward through the trajectory into $\hat{A}_t$.
Update
Freeze the batch. For several epochs of minibatches, maximize the clipped surrogate $\mathcal{L}^{\text{PPO}}$ via gradient ascent on $\theta$, simultaneously training $V_\phi$ via regression against $\hat{G}_t = V_\phi^{\text{old}}(s_t) + \hat{A}_t$.
Discard, repeat
Throw out the batch. Sample fresh trajectories from the updated $\pi_\theta$. Iterate.
The bill
PPO works, but it is heavy. You maintain four networks during training: the policy $\pi_\theta$, the reference $\pi_0$ (frozen), the reward model $R_\phi$ (frozen), and the value critic $V_\phi$ (trained). Each is roughly the size of the language model itself. For a 70B-parameter model, this is on the order of $4 \times 70 = 280$ billion parameters held in memory, plus optimizer states, plus gradients. The engineering is real.
And the value critic is genuinely difficult to train. It has to predict expected future reward from arbitrary prefixes — a hard task, especially when rewards are sparse (one number at the end of a 2,000-token response). When $V_\phi$ is inaccurate, the advantages are biased, and the policy update is misdirected. This pain motivated the next generation of methods.
§ IVDPO — collapsing RL into supervised learning
In 2023, Rafailov and colleagues at Stanford published a paper titled "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." The title was provocative; the math was beautiful. They showed that the entire PPO pipeline — reward model training, policy gradient, value critic, KL penalty — could be collapsed into a single supervised loss function, with no RL at all.
Here is how. The argument has three moves.
Move 1. Solve the KL-regularized RL problem in closed form
Suppose we have a reward $r(g, p)$ and we want to find the policy that maximizes
This is exactly the RLHF objective. Treat it as a constrained optimization in the space of probability distributions over responses. With a Lagrangian or just direct calculus of variations, the optimum has a closed form:
where $Z(p) = \sum_{g'} \pi_{\text{ref}}(g' \mid p) \exp(r(g', p) / \beta)$ is the partition function. This is exact, not an approximation. The optimal policy is the reference policy tilted exponentially by the reward, with temperature $\beta$.
A physicist will recognize this immediately: $\pi^\star \propto \pi_{\text{ref}} \cdot e^{r/\beta}$ is precisely the form of a Boltzmann distribution at "temperature" $\beta$, with the reference policy playing the role of the density of states and the reward playing the role of $-E$. RLHF is the Gibbs measure of an LLM.
Move 2. Invert the closed form
The trouble with that closed form is the partition function $Z(p)$ — it requires summing over all possible responses, which is intractable. But here is the trick: rearrange the equation to express $r$ in terms of $\pi^\star$:
So the reward is determined (up to a prompt-dependent constant) by the optimal policy. Any policy implicitly defines a reward function it would be optimal for.
Move 3. Use preference data, not rewards
Now bring in the Bradley-Terry model of preferences. Given two responses $g_w$ ("winner") and $g_l$ ("loser") for the same prompt, the probability that a human prefers $g_w$ is
where $\sigma$ is the logistic function. The reward differences are what matter — and crucially, the partition function $Z(p)$ cancels because it appears in both rewards with the same sign. Substituting the expression for $r$:
Now we have the preference probability written entirely in terms of $\pi^\star$ and $\pi_{\text{ref}}$. No reward model. No partition function. We can fit $\pi_\theta$ to predict observed human preferences by maximizing the log-likelihood:
That is the entire DPO loss. It is a supervised binary classification loss: "given a preferred and rejected response, push the relative log-probability of the preferred one up." There is no reward model, no critic, no rollouts, no policy gradient. Just a forward pass and a cross-entropy gradient.
What was traded
DPO is gorgeous, and it works. But it makes specific trade-offs.
It is offline. The preference data is fixed; the model is fine-tuned on it like a regular dataset. There is no fresh sampling from the current policy. This is computationally cheap but loses the ability to discover new high-reward regions of response space — the model can only re-rank what was already in the training set.
It is sequence-level. DPO treats $\pi_\theta(g \mid p)$ as one object: the joint probability of a full response. There is no per-token credit assignment. This makes the loss simple but blurs the signal — every token in $g_w$ gets nudged up, every token in $g_l$ gets nudged down, even tokens that were identical between them.
It assumes Bradley-Terry. The derivation depends on the BT model of human preferences, which assumes preferences are transitive and the probability of preference is a logistic function of reward difference. If preferences are non-transitive or noisy in ways BT cannot capture, DPO inherits the misspecification.
It can overfit aggressively. Because the loss can drive the log-ratio $\log[\pi_\theta(g_w) / \pi_{\text{ref}}(g_w)]$ arbitrarily high, the model may "memorize" preferred responses in a way that reduces overall quality. This motivated IPO (Identity Preference Optimization), which adds a regularizer to cap the log-ratio magnitude.
The optimal policy under KL-constrained reward maximization has closed form. If you know the policy, you can recover the reward. If you can recover the reward, you can fit the policy to preference data directly. Three lines of algebra collapse an entire RL pipeline.
§ VGRPO — DeepSeek's group-relative trick
In 2024, DeepSeek published the GRPO algorithm in their DeepSeekMath paper. The motivation was practical: PPO's value critic is expensive, hard to train, and biased. Can we get the per-token credit assignment of PPO without the critic?
GRPO's answer is elegant. Instead of estimating the baseline with a learned network, estimate it empirically from multiple samples of the same prompt. For each prompt $p$, sample $G$ responses $\{g_1, \ldots, g_G\}$ from the current policy. Score them all with the reward model. The group-normalized advantage for response $i$ is
That is the entire baseline construction. It is a z-score of the reward within the group. Positive advantage = better than typical for this prompt; negative = worse. No value critic, no learned baseline — just empirical statistics of the group.
Why this works
The PPO value function $V_\phi(p)$ was supposed to estimate $\mathbb{E}_{g \sim \pi_\theta}[R(g \mid p)]$. With $G$ samples, the group mean $\mu_G(p)$ is an unbiased Monte Carlo estimate of exactly that quantity:
For $G \geq 4$ or so, this estimate is sharper than a neural network trained on noisy bootstrap targets. The division by $\sigma_G(p)$ further normalizes the scale, which improves numerical stability when reward magnitudes vary across prompts.
This is essentially a paired-trial design. Rather than estimating a population mean ($V^\pi$) once and using it for every trial, you estimate it freshly for each prompt using $G$ siblings. The variance of the advantage is dominated by within-prompt variation, not between-prompt variation — exactly the variance you want to keep, because that is the signal.
The full GRPO objective
The rest of GRPO is structurally identical to PPO. Same clipped importance ratio, same KL penalty, just with the group-normalized advantage in place of the critic-based one:
One important detail: the same group-normalized advantage $\hat{A}_i$ is assigned to every token in response $i$. This is the price of giving up the critic — without $V_\phi(s_t)$ providing per-state estimates, GRPO cannot do PPO's fine-grained per-token credit assignment. The credit is response-level, smeared uniformly across tokens.
Why GRPO became famous
GRPO was the algorithm behind DeepSeek-R1 and DeepSeek-R1-Zero — the open-weight reasoning models that demonstrated, in early 2025, that you could elicit chain-of-thought reasoning from a base model using only RL with verifiable rewards, no SFT for reasoning at all. The combination of GRPO's efficiency (no critic, halving the memory footprint relative to PPO) with verifiable rewards (no reward model, since correctness can be programmatically checked) made it the dominant method for training reasoning models throughout 2025.
The drawbacks
GRPO has known biases. The division by $\sigma_G$ implicitly down-weights prompts with low reward variance — exactly the prompts the model has nearly mastered. This can slow progress on the harder edges of capability. The length normalization (dividing by $|g_i|$) interacts strangely with reward sparsity: a wrong-but-long response and a wrong-but-short response get different per-token penalties. A 2025 follow-up called Dr. GRPO ("done right") removes the standard deviation normalization and the sequence-length normalization, and reports better calibration and shorter responses at equal quality.
Another follow-up, DAPO, allows the upper clip to be looser than the lower clip (asymmetric clipping), preserving exploration into low-probability but high-reward regions while still preventing collapse on the downside. These are refinements rather than fundamental departures — the group-relative idea has settled in as the standard.
| Method | Baseline | Critic? | Per-token credit? |
|---|---|---|---|
| PPO | Learned $V_\phi(s_t)$, per-state | Yes | Yes (via GAE) |
| GRPO | Empirical group mean $\mu_G(p)$ | No | No (response-level) |
| Dr. GRPO | Group mean only, no $\sigma_G$, no length norm | No | No |
| DAPO | Group mean with asymmetric clipping | No | No |
§ VIThe zoo
After DPO's success, a small explosion of preference-based methods followed. Each tweaks one assumption. Here is a brief tour of the major ones; the goal is to give you the conceptual shape of each, not a full derivation.
KTO — Kahneman-Tversky Optimization
DPO requires paired preferences: response $A$ vs. response $B$ for the same prompt. In practice, much human feedback is unpaired — thumbs-up or thumbs-down on a single response, with no comparison partner. KTO (Ethayarajh et al., 2024) reformulates the loss using prospect theory, the behavioral-economics framework of Kahneman and Tversky. The objective treats gains and losses asymmetrically — losses loom larger than gains — and operates on each labeled sample independently. Practically, KTO uses thumbs-up/thumbs-down feedback, which is far easier to collect than paired comparisons.
IPO — Identity Preference Optimization
The Bradley-Terry assumption in DPO can be violated by noisy or deterministic preferences. IPO (Azar et al., 2024) replaces the logistic preference model with a generic identity mapping: it bounds the squared log-ratio rather than passing it through a sigmoid. This caps how large the implicit reward can become and improves robustness when preferences are nearly deterministic. The cure for DPO's tendency to overconfidently memorize preferred responses.
SimPO — Simple Preference Optimization
SimPO (Meng et al., 2024) is DPO without the reference model. It replaces $\beta \log[\pi_\theta(g) / \pi_{\text{ref}}(g)]$ with a length-normalized log-probability $(\beta / |g|) \log \pi_\theta(g)$. This removes the need to maintain $\pi_{\text{ref}}$ in memory at all, cutting compute roughly in half, and adds an explicit length normalization that addresses DPO's tendency to drift toward overly long responses. Reported gains of 5–7 points on AlpacaEval relative to DPO.
ORPO — Odds Ratio Preference Optimization
ORPO (Hong et al., 2024) goes further still: it folds preference optimization directly into the SFT loss. Instead of a separate alignment stage, ORPO adds an odds-ratio penalty to the standard cross-entropy fine-tuning loss, simultaneously teaching the model to imitate good responses and to disprefer bad ones. One training stage, one loss, no reference model. The trade-off is that you cannot decouple format learning from preference learning, which can hurt when those signals point in different directions.
RLVR — Reinforcement Learning from Verifiable Rewards
Not an algorithm but a paradigm. For tasks where correctness can be programmatically checked — math problems with numerical answers, code that either passes tests or does not, formal proofs — the reward is exact and free of model artifacts. This sidesteps the reward-hacking problem that haunts RLHF: a verifiable reward cannot be hacked by producing nonsense; it is either right or wrong. RLVR is what made reasoning models work, and GRPO is the typical optimizer paired with it.
Rejection Sampling — the simplest method that works
One method that deserves its own treatment, and that Llama 2 used extensively, is rejection sampling (also called best-of-$K$ fine-tuning). It is not really a gradient method at all — it is supervised fine-tuning with a clever choice of targets.
The procedure is brutally simple. For each prompt $p$, sample $K$ candidate responses from the current policy. Score them all with the reward model. Keep only the best:
Then do standard supervised fine-tuning on these $\{(p, g^\star)\}$ pairs:
That's it. No advantages, no critic, no clipping, no importance sampling. Just cross-entropy loss on the best-of-$K$ samples from the current policy.
The intuition is that selecting the argmax over $K$ samples tilts the effective sampling distribution toward high-reward responses. SFT on those targets nudges $\pi_\theta$ to put more mass on the high-reward region. As $K$ grows, the expected reward of $g^\star$ grows roughly like $\sqrt{\log K}$ (the gap between max and median of $K$ iid sub-Gaussian draws), so doubling $K$ gives diminishing returns. Llama 2 used $K$ up to 100.
Llama 2's strategy was iterative: do four rounds of rejection sampling to cheaply lift the bulk of the policy distribution into a higher-reward region, then switch to PPO for the fine-grained final pass. The two methods are complementary — rejection sampling cheaply amplifies the good; PPO additionally suppresses the bad tail through its explicit negative gradients.
Best-of-$K$ selection has a precise statistical interpretation: the resulting target distribution is the current policy $\pi_\theta$ tilted by the $(K-1)$th power of the reward CDF. As $K \to \infty$, this approaches a hard $\arg\max$ — and a softer version (replacing $\arg\max$ with a reward-weighted softmax) is essentially reinforcement learning. Rejection sampling is, in a sense, the $K \to \infty$ limit of soft policy improvement. Crude but principled.
A taxonomy
| Method | Data needed | Online or offline | Defining feature |
|---|---|---|---|
| PPO | Reward model + prompts | Online | Clipped surrogate + value critic |
| Rejection Sampling | Reward model + prompts | Semi-online | SFT on best-of-K samples |
| GRPO | Reward model (or verifier) + prompts | Online | Group-normalized advantage, no critic |
| DPO | Paired preferences | Offline | Closed-form reward inversion |
| KTO | Binary thumbs-up/down | Offline | Prospect-theoretic loss, no pairs |
| IPO | Paired preferences | Offline | BT-free regularization |
| SimPO | Paired preferences | Offline | No reference model, length-normalized |
| ORPO | Paired preferences + SFT data | Offline | SFT and alignment in one loss |
| RLVR + GRPO | Verifiable problems | Online | Programmatic reward, no human labels |
The pattern is clear: the field has moved toward methods that are simpler, cheaper, and demand less human labor. The modern post-training stack typically combines several methods. A representative pipeline in 2026:
- SFT on curated instruction-following data to establish format.
- DPO or SimPO on preference pairs to teach taste, tone, and refusal behavior.
- GRPO with verifiable rewards on math, code, and logic tasks to teach reasoning.
- A final round of preference optimization to clean up regressions from the RL stage.
§ VIIWhere thinking comes from
In late 2024, OpenAI released o1, the first widely-deployed model that visibly "thought before answering." A few months later, DeepSeek released R1 — and crucially, an open paper describing exactly how it was trained. R1 made the recipe public: take a base model, apply GRPO with verifiable rewards on math and code, and watch reasoning emerge.
The mechanism is genuinely surprising, so let us walk through it carefully.
The setup
Start with a pretrained language model that has not been fine-tuned for instruction-following — just a base model. Define a simple prompt template:
System: A conversation between a user and an assistant. The assistant
first thinks about the reasoning process and then provides the answer.
The reasoning process is enclosed in <think></think> tags, and the
answer is enclosed in <answer></answer> tags.
User: [a math problem with a known numerical answer]
Define the reward as purely outcome-based:
- Accuracy reward: +1 if the final numerical answer matches the ground truth; 0 otherwise.
- Format reward: a small bonus if the response correctly uses the
<think>and<answer>tags.
Critically, there is no reward for the content of the reasoning. The model gets no signal about whether its intermediate steps are correct, well-organized, or even relevant. It only gets a signal about whether the final answer comes out right.
Run GRPO. Sample $G = 16$ or so responses per problem. Compute group-normalized advantages. Update. Repeat for thousands of iterations.
What happens — qualitatively
DeepSeek tracked two metrics across training: accuracy on math benchmarks, and average response length. The accuracy curve does what you would expect: a smooth climb from near-zero to near-state-of-the-art. The length curve does something extraordinary. It climbs too. Not linearly — it has phase transitions. The model's responses, which started at a few hundred tokens, grow to thousands of tokens, then tens of thousands, and continue rising.
And the content of those longer responses, when inspected, contains things nobody put there. The model starts using phrases like "wait, let me reconsider" and "actually, that's wrong, let me try a different approach." It backtracks. It checks its work. It explores multiple solution paths and abandons unpromising ones. The DeepSeek team called this the "aha moment" — a specific point in training where a model produces a chain like "Wait, wait. Wait. That's an aha moment I can flag here. Let me reevaluate this step by step…" before finding the right answer.
None of this was trained explicitly. The model invented it.
Outcome-based reward, applied at scale to a sufficiently capable base model, induces self-verification, backtracking, and multi-strategy exploration as instrumental behaviors. The model discovers that thinking longer raises the probability of being right.
Why does this happen — mechanistically
The standard explanation has three layers.
Layer 1: the base model already contains the building blocks. Pretraining on internet text exposes the model to billions of examples of human reasoning, including math derivations, philosophical arguments, debugging logs, and forum threads where someone catches their own mistake. The "ability" to backtrack, to say "actually, I was wrong," is already in the network's probability distribution — but in pretraining it was just one of many possible continuations, not a privileged one. The model could produce reasoning behavior, but it had no reason to.
Layer 2: RL with outcome reward changes the marginal calculus. Now consider what GRPO's gradient is actually doing. For each problem, it samples 16 attempts. Some end up correct, some incorrect. The correct ones receive positive advantage; the incorrect ones receive negative. The policy gradient pushes up the probability of every token in correct attempts, and down for incorrect ones.
Here is the key: when a model produces a chain of thought with a self-correction in it ("wait, that's wrong, let me redo this"), and that chain happens to land on the right answer, the self-correction itself gets reinforced, even though no human labeled it as good. It is reinforced because it was part of a correct trajectory. Over thousands of iterations, the model's probability of producing self-corrections — conditional on facing a hard problem — climbs steadily.
Layer 3: the model discovers compute-as-payoff. The most important insight is that response length is, indirectly, a knob the model can turn for accuracy. Longer responses = more intermediate computations = more chances to catch errors = higher probability of being right. The model has no explicit incentive to think longer — there is no length reward — but the implicit gradient says: trajectories that happen to be longer and that explore alternatives have higher win rates. So the policy probability mass migrates toward longer, more exploratory responses.
This is why reasoning models think for so long. Their "thought process" is not an artifact of a clever prompt; it is the model expending compute that genuinely improves its answers. Cut the thinking short and accuracy drops; let the model think more and accuracy rises — up to a point.
Subsequent research (Bandyopadhyay et al., "DeepSeek-R1 Thoughtology," 2025) found that reasoning models have a non-monotone relationship between thinking length and accuracy. Past a certain point, additional thinking hurts. The model starts second-guessing correct answers, rambling, or getting lost in irrelevant subproblems. There is an optimal thinking budget per problem, and the model does not always find it. This is now an active area of research: how to teach a model to know when to stop.
Why R1-Zero specifically was the punch line
DeepSeek trained two versions: R1 and R1-Zero. R1 followed a multi-stage pipeline (SFT → RL → SFT → RL) with curated reasoning data. R1-Zero skipped the SFT stages entirely — it went straight from the base model into pure RL. R1-Zero was less polished than R1 (it sometimes mixed languages, had formatting issues), but it demonstrated something philosophically important: reasoning behavior emerges from RL with verifiable rewards alone. You do not need to teach the model how to reason. You only need to reward it for getting the answer right and let it figure out the rest.
This was a surprise. The conventional wisdom had been that complex behaviors required curated demonstrations — that you had to show the model what good reasoning looks like via SFT before any RL could refine it. R1-Zero contradicted that. It said: the base model already contains all the reasoning machinery latent in its pretraining; you just need the right gradient to surface it.
§ VIIIMechanics of the thinking mode
So that is where thinking comes from. Now let us look at how it works in deployment — what is actually happening when you send a prompt to o1, R1, Claude with extended thinking, or Gemini Deep Thinking.
The structural change
At the architectural level, almost nothing changes. The model is still a transformer, still trained on next-token prediction, still sampling autoregressively at inference. What changes is the distribution of what it produces in response to a prompt.
A non-thinking model, given "What is the integral of $x^2 e^{-x}$ from 0 to infinity?", might directly produce: "The integral equals 2." (Possibly correct, possibly hallucinated.)
A thinking model, given the same prompt, produces something like:
<think>
I need to compute ∫₀^∞ x² e^(-x) dx. This looks like a Gamma function.
The Gamma function is Γ(n) = ∫₀^∞ x^(n-1) e^(-x) dx, and Γ(n) = (n-1)!
for positive integers n. So my integral with x² corresponds to n-1 = 2,
meaning n = 3, giving Γ(3) = 2! = 2.
Let me double-check by integration by parts. ∫ x² e^(-x) dx, let u = x²,
dv = e^(-x) dx, so du = 2x dx, v = -e^(-x). Then ∫ x² e^(-x) dx
= -x² e^(-x) + 2∫ x e^(-x) dx. The second integral: again by parts,
u = x, dv = e^(-x) dx, giving -x e^(-x) + ∫ e^(-x) dx = -x e^(-x) - e^(-x).
So ∫ x² e^(-x) dx = -x² e^(-x) - 2x e^(-x) - 2 e^(-x) + C.
Evaluating from 0 to ∞: at ∞ everything dies (exponential dominates).
At 0: -(0) - (0) - 2(1) = -2. So the integral = 0 - (-2) = 2. ✓
</think>
<answer>The integral equals 2.</answer>
The model produced thousands of tokens to answer a question that, formally, requires one number. Those extra tokens are not for show — they are computation. Each token in the <think> block is a forward pass through the model that conditions all subsequent tokens. The "reasoning" is the model maintaining and refining a working state through its sequential generation.
What is the "thinking" actually doing — computationally?
A modern LLM's forward pass over a single token uses a fixed amount of compute (proportional to its parameter count and the context length). It cannot do more "thinking" on a single token than its architecture allows. So how does generating more tokens get you more computation?
The answer: each generated token becomes context for the next. The model can offload intermediate state into the visible token stream — write down a partial result, then condition on it. This is the same trick humans use with a piece of paper: you cannot solve a complex problem entirely in working memory, but if you write down intermediate results, you can build the answer step by step.
Formally, you can think of the chain of thought as expanding the effective depth of the network. Without thinking, the model has access to whatever computation can fit in its $L$ layers. With $T$ thinking tokens, it has $L \times T$ layers of effective sequential computation, with each "layer" being a full forward pass that can read the entire history. This is a fundamental capability increase, not a stylistic flourish.
There is a literature (Merrill & Sabharwal, "The Expressive Power of Transformers with Chain of Thought," 2024) formalizing this intuition. A transformer of fixed depth has expressive limits — it cannot, in one forward pass, solve problems requiring more than constant-depth circuits. But a transformer that can generate intermediate tokens and condition on them transcends those limits: with enough thinking tokens, it can solve any problem in polynomial time. Chain of thought is a complexity-class promotion.
How is "thinking" controlled in practice?
Different deployed models handle the thinking mode differently:
OpenAI o1 and o3: The thinking trace is hidden from the user — only the final answer is shown. The model decides for itself how much to think; the user has limited control. Pricing reflects both visible and hidden tokens.
DeepSeek-R1, QwQ: Thinking is fully visible to the user, wrapped in <think></think> tags. The user sees the full chain of thought, which is useful for debugging, transparency, and pedagogical purposes.
Claude with extended thinking: The model supports a thinking budget controlled by the developer. The thinking is visible (in a separate block) and the depth can be tuned per request.
Gemini Deep Thinking: Similar — a thinking budget can be set, and the model dynamically expands or contracts based on the difficulty of the prompt.
The underlying mechanism in all cases is the same: a model trained via RL with outcome rewards has learned that generating a thinking trace before answering improves accuracy, and at inference it does so. The tag structure is a convenience for separating user-visible from internal computation; the capability is the result of training.
What does the gradient see during RL?
Let me close this section with a concrete illustration of what GRPO actually does to a base model during the reasoning-training phase. Consider one prompt: "Solve $x^2 - 5x + 6 = 0$."
The model produces $G = 4$ rollouts:
- "$x = 2$ or $x = 3$." → correct, $R = 1.0$
- "Factor: $(x-2)(x-3) = 0$, so $x = 2, 3$." → correct, $R = 1.0$
- "$x = 5/2 \pm \sqrt{25/4 - 6} = 5/2 \pm 1/2$, so $x = 2, 3$." → correct, $R = 1.0$
- "$x = -2$ or $x = -3$." → wrong, $R = 0.0$
Group mean: $\mu_G = 0.75$. Group std: $\sigma_G \approx 0.43$.
Advantages: $\hat{A}_1 = \hat{A}_2 = \hat{A}_3 = +0.58$, $\hat{A}_4 = -1.74$.
The gradient now pushes up the probability of every token in responses 1–3 (with the same weight, since GRPO is response-level), and pushes down on every token in response 4. Notice: response 3 contains an explicit derivation ("Factor: $(x-2)(x-3) = 0$"), and that derivation gets reinforced even though it was not the shortest path to the answer. Response 1, the terse one, gets the same positive advantage but it has fewer tokens to reinforce.
Over millions of such updates, across millions of problems of varying difficulty, the model gradually learns that:
- Showing work is correlated with being right on hard problems (because the alternative — pattern-matching to a guess — fails more often).
- Verifying answers before committing reduces wrong submissions on borderline cases.
- Exploring multiple approaches before settling on one improves accuracy on problems where the first approach is flawed.
None of these meta-strategies were explicitly programmed. They emerge as policies that win the credit-assignment game. The model is not "trying" to think; the model has learned that responses with thinking-like structure tend to receive positive advantage, and so it generates them.
§ IXSynthesis & what remains
Let me close by stepping back. The arc of this field, compressed into one paragraph:
In 2017, OpenAI published PPO as a general RL algorithm. In 2022, they used it as the engine of InstructGPT, demonstrating that human-preference signals could shape a base language model into a usable assistant. From 2022 to 2024, PPO-with-RLHF was the standard recipe — used by GPT-4, Claude, Llama 2, and most major models. In 2023, DPO collapsed the whole pipeline into a supervised loss, removing the need for the value critic, the reward model, and the rollouts. In 2024, GRPO replaced the critic with a group-relative baseline, restoring the online sampling of PPO while keeping DPO's simplicity. In late 2024 and early 2025, OpenAI's o1 and DeepSeek's R1 demonstrated that GRPO with verifiable rewards could elicit emergent reasoning from base models without any reasoning demonstrations. By 2026, the standard post-training stack is modular: SFT for format, DPO or SimPO for taste, and GRPO with verifiable rewards for reasoning.
| PPO | Rejection Sampling | DPO | GRPO | |
|---|---|---|---|---|
| In one sentence | Clipped policy gradient with a learned value baseline and KL anchor. | SFT on the best-of-K samples from the current policy. | Closed-form RLHF: fit the policy directly to preference data. | PPO with the value critic replaced by an empirical group mean. |
| Origin | Schulman et al., 2017 (OpenAI). Scaled to LLMs by InstructGPT, 2022. | Touvron et al., 2023 (Llama 2). Used for 4 rounds before PPO. | Rafailov et al., 2023 (Stanford). | Shao et al., 2024 (DeepSeek). Powered DeepSeek-R1. |
| Paradigm | ONLINE samples fresh from current $\pi_\theta$ each iteration | SEMI samples per iteration, then SFT on frozen targets | OFFLINE fixed preference dataset, no fresh sampling | ONLINE samples $G$ responses per prompt each iteration |
| What it optimizes | $\mathbb{E}_{g \sim \pi_\theta}[R(g)] - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_0)$ via clipped surrogate | $-\mathbb{E}_p[\log \pi_\theta(g^\star(p) \mid p)]$ where $g^\star$ is best-of-$K$ | $-\log \sigma\!\big(\beta \Delta \log \pi_\theta - \beta \Delta \log \pi_{\text{ref}}\big)$ | Same as PPO, but $\hat{A}_i = (R_i - \mu_G)/\sigma_G$ replaces critic-based advantage |
| Networks held in memory | Four: policy, reference (frozen), reward model (frozen), value critic. | Two: policy, reward model (frozen). | Two: policy, reference (frozen). No reward model. | Three: policy, reference (frozen), reward model (frozen). No critic. |
| Data needed | Prompts + a trained reward model (from preference pairs) | Prompts + a trained reward model | Preference pairs $(p, g_w, g_l)$ — that's it | Prompts + reward model OR verifier (programmatic check) |
| Credit assignment | Per-token via GAE on a learned value function $V_\phi(s_t)$ | Sequence-level (only best-of-$K$ is kept, all tokens weighted equally) | Sequence-level (joint log-prob of full response) | Sequence-level (same group advantage smeared across all tokens of response) |
| How it controls drift | Explicit KL penalty $-\beta D_{\text{KL}}$ + clipped importance ratio $\text{clip}(r_t, 1\pm\epsilon)$ | Implicit: SFT loss naturally stays near data manifold; no explicit anchor | Implicit KL: the $\log \pi_{\text{ref}}$ term in the loss is mathematically equivalent to a KL anchor | Explicit KL penalty + clipped surrogate, just like PPO |
| Mechanism for biasing toward high R | Soft reweighting: all samples kept, gradients weighted by $\hat{A}_t$ | Hard selection: only $\arg\max$ over $K$ is kept; bad samples discarded | Pairwise contrast: push log-ratio of preferred over rejected | Soft reweighting within group: group mean acts as the per-prompt baseline |
| Suppression of bad responses | Direct: tokens with $\hat{A}_t < 0$ get explicit negative gradient | Indirect: probability mass for bad samples is implicitly stolen by SFT on $g^\star$ | Direct: $g_l$ tokens are explicitly pushed down by the loss | Direct: $\hat{A}_i < 0$ for below-mean responses → explicit negative gradient |
| Compute per iteration | High: rollouts + 4 forward passes (policy, ref, RM, critic) + critic training | Moderate: $K$ rollouts per prompt + 2 forward passes + standard SFT | Low: just 2 forward passes (policy + ref), no sampling, no critic, no RM | Moderate: $G$ rollouts per prompt + 3 forward passes (policy, ref, RM) |
| Training stability | Notoriously finicky — many hyperparameters, value critic hard to train | Very stable — it's just SFT with a clever data filter | Stable but can overfit aggressively, memorizing preferred responses | More stable than PPO (no critic to mistrain), but new biases from group normalization |
| Best at | Fine-grained shaping when reward model is reliable and compute is not the bottleneck | Cheap heavy-lifting in early RL rounds; lifting the bulk of the policy | Aligning tone, style, and refusal behavior from static preference data | Reasoning tasks with verifiable rewards; lower compute than PPO at scale |
| Famously used by | InstructGPT, ChatGPT, GPT-4, Claude (earlier versions), Llama 2 (final stage) | Llama 2 (4 rounds before PPO), WebGPT, various coding models | Zephyr, Tulu, Llama 3 alignment, countless 2024 open-weight finetunes | DeepSeek-Math, DeepSeek-R1, DeepSeek-R1-Zero, Qwen 2.5-Math, Kimi K1.5 |
| Achilles heel | Value critic biases; reward hacking; engineering complexity | No mechanism to suppress the bad tail beyond amplifying the good | No fresh sampling — can't discover new high-reward regions | Sequence-level credit only; biases from $\sigma_G$ normalization (fixed by Dr. GRPO) |
Reading the table
A few things stand out when you put these methods side by side.
PPO is the most expressive but the most expensive. Four networks in memory, per-token credit assignment via a learned critic, explicit KL anchoring. When everything works, PPO can shape behavior with surgical precision. When it fails, it fails in subtle ways — usually the critic going off the rails. This is why teams that have the engineering budget still use PPO, and why teams that don't have moved to alternatives.
Rejection sampling is shockingly competitive for its simplicity. It uses the same reward model as PPO but discards the gradient machinery entirely. It cannot suppress the bad tail through explicit negative gradients — only by reallocating SFT probability mass — but in practice, when the base policy is good enough that the median sample is not terrible, this works fine. Llama 2's strategy of "four rounds of rejection sampling, then PPO" was about using each method where it shines: rejection sampling for the cheap bulk-shifting, PPO for the fine work on the tails.
DPO trades sampling for simplicity. The closed-form derivation eliminates the entire RL apparatus — no sampling, no critic, no reward model — at the cost of being unable to discover new high-reward regions. The model can only re-rank what was in the preference dataset. For taste/style/refusal alignment this is fine, because the relevant behaviors are in-distribution. For reasoning, where the goal is to find new solution strategies, DPO's offline nature is a serious limitation.
GRPO is PPO minus the critic plus a clever statistical trick. Drop the value network. Replace its function with the empirical mean of $G$ samples from the same prompt. The advantage estimate is now empirical rather than learned — which is more accurate when $G \geq 4$, free of critic biases, and uses about half the memory. The cost: sequence-level credit only, since you no longer have a per-state value function. For reasoning tasks with verifiable rewards, where the reward signal is exact, this trade is worth it. GRPO is the algorithm of choice for reasoning models in 2025–2026.
Every method is solving the same problem — estimate the policy gradient of expected reward, regularized to a reference — with different choices about which expensive piece to approximate. PPO learns everything. Rejection sampling avoids the gradient. DPO avoids the sampling. GRPO avoids the critic. There is no free lunch, only differently priced ones.
What is solved
Several problems that seemed open in 2022 are now essentially closed. How do you make a base model follow instructions? SFT. How do you align tone, helpfulness, and refusal behavior? Preference optimization (DPO, SimPO). How do you teach a model to reason through hard problems? RLVR with GRPO. Each of these is now reproducible, well-understood, and supported by multiple open implementations.
What remains genuinely hard
Several questions remain stubbornly open.
Reward hacking at the limit. Even with KL penalties, group normalization, and verifiable rewards, models exploit subtle quirks. A model trained on math problems can learn to game the answer extractor. A model trained on code can write tests that pass without solving the problem. The arms race between reward designers and policy exploiters is permanent.
Generalization of reasoning. A model trained via GRPO on math and code generalizes its reasoning behavior to other domains — sometimes. The mechanisms are not well understood. Why does training on competition math improve performance on legal reasoning? Why does it sometimes hurt performance on creative writing? These transfer dynamics are an active research area.
Knowing when to think. Reasoning models often think too much on easy problems and too little on hard ones. Training a model to allocate thinking effort proportional to problem difficulty — without just memorizing surface features — is unsolved. The "sweet spot" varies per problem and per model, and there is no clean theory.
Multi-step agency. Single-response RL — even with long chains of thought — does not straightforwardly extend to multi-turn agentic settings where the model must take many actions, observe environments, and recover from errors over thousands of steps. Credit assignment over such horizons is genuinely difficult; the noise in long-horizon rewards is severe; and existing methods are brittle. This is where most current research is heading.
The remarkable thing
What is striking, after all this machinery, is how much of the result depends on a few simple ideas applied very carefully:
- The log-derivative trick to make sampling differentiable.
- A baseline (learned or empirical) to reduce variance.
- A KL penalty to keep the policy near its starting point.
- Verifiable rewards where they exist, preference data where they do not.
That is essentially everything. The rest is implementation detail and careful engineering. The fact that this is enough to produce models that can solve olympiad mathematics, write substantial codebases, and reason about open-ended problems is — well, it is one of the most surprising empirical results of the decade. The math is elegant but modest; the emergent behavior is not.
For a physicist, perhaps the right analogy is the Ising model. The microscopic rule is trivial — spins prefer to align with their neighbors. The macroscopic behavior — phase transitions, spontaneous magnetization, universality classes — is vastly richer than the rule that generated it. Modern LLM training looks similar: simple losses, simple gradients, simple baselines. But scale the system up, expose it to enough data, run the dynamics for long enough, and structure emerges that no one wrote down.
What we are witnessing, in some sense, is statistical mechanics for cognition. We do not yet have a theory of why it works as well as it does. But the empirical phenomenology is rich, the engineering is tractable, and the gradient — that humble, ancient gradient — is once again earning its keep.
REFERENCES: Schulman et al. 2017 · Ouyang et al. 2022 · Rafailov et al. 2023 · Shao et al. 2024 · DeepSeek-AI 2025