The Label Budget Was Fine. The Pairing Strategy Was Not.

TL;DR for operators

Preference labels are expensive. Model completions are comparatively cheap. The usual workflow responds to this imbalance in the least imaginative way possible: generate a small number of completions, compare whatever pairs happen to be available, and hope the post-training objective sorts out the mess. Hope is not a procurement strategy, though it does have the virtue of requiring no dashboard.

Han, Goyal, and Ma’s paper asks a sharper question: before sending comparisons to annotators, which response pairs are actually worth labeling?¹ The answer is not “the most likely responses,” not “the first two completions,” and not even “the obviously good versus obviously bad pair” in any simple sense. The value of a comparison depends on whether it reveals information about the parameter directions that matter for the final DPO-trained policy.

The paper’s mechanism is clean. Pair selection defines a sampling design. That design induces an information matrix, $\Sigma_D(\theta)$. The final RLHF/DPO performance gap is governed by a trace term:

$$ \operatorname{tr}\left(I(\theta^{\ast}) \Sigma_D(\theta^{\ast})^{\dagger}\right). $$

Here, $I(\theta^{\ast})$ describes which parameter directions matter for downstream policy performance, while $\Sigma_D(\theta^{\ast})$ describes which directions the selected comparison pairs actually identify. Bad curation means buying labels that repeatedly illuminate the wrong geometry. Very enterprise.

The practical implication is straightforward: generate a larger candidate pool, compute approximate design weights over possible comparisons, and spend the same annotation budget on pairs that improve information coverage. The paper validates this through synthetic experiments, an IMDb DPO setup with GPT-2-large and a sentiment proxy reward, and an Anthropic-HH setup using Pythia-2.8B with GPT-4.1 judging.

The boundary is equally important. This is not a magic recipe for all alignment data. The theory assumes a well-specified preference model, regularity, coverage, offline randomized designs, and a DPO-style post-training setting. In real systems, the method is best read as a disciplined label-allocation principle: stop treating comparison selection as clerical work.

The real unit of preference-data cost is the pair

The paper starts from a business fact most alignment workflows quietly know but rarely formalize: human preference labels are expensive, while generating additional completions is often cheap enough to expand the candidate pool.

That asymmetry changes the design problem. If the expensive thing is the label, the right question is not merely “how many prompts should we annotate?” It is: “given many possible within-prompt response pairs, which comparisons should receive scarce human judgment?”

The standard workflow treats pair construction as a side effect of generation. Generate two completions and compare them. Generate a few and compare all pairs. Or select pairs uniformly because uniform selection feels neutral, which is how many mediocre systems disguise the absence of judgment.

Han, Goyal, and Ma replace that habit with a sampling-design formulation. For each prompt, a generated candidate pool defines a comparison graph: completions are vertices, possible within-prompt comparisons are edges. A design $D$ is a probability distribution over those edges. Given a label budget $n$, the learner samples $n$ edges from $D$, obtains preference labels, and trains a DPO policy.
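The formulation can be sketched in a few lines. This is a minimal illustration, not the paper's code: the names `build_edges` and `sample_comparisons` are hypothetical, and drawing edges i.i.d. with replacement is an assumption made here for simplicity.

```python
import itertools
import random

def build_edges(pool):
    """Enumerate all within-prompt pairs.

    pool maps a prompt to its list of candidate completions.
    Each edge is (prompt, index_a, index_b) with index_a < index_b.
    """
    edges = []
    for prompt, completions in pool.items():
        for a, b in itertools.combinations(range(len(completions)), 2):
            edges.append((prompt, a, b))
    return edges

def sample_comparisons(edges, design, n, seed=0):
    """Draw n edges from the design, given as one probability per edge.

    Assumption: i.i.d. draws with replacement.
    """
    rng = random.Random(seed)
    return rng.choices(edges, weights=design, k=n)

# Toy candidate pool: two prompts with three and two completions.
pool = {"p1": ["a", "b", "c"], "p2": ["d", "e"]}
edges = build_edges(pool)
uniform = [1.0 / len(edges)] * len(edges)
batch = sample_comparisons(edges, uniform, n=5)
```

With eight candidates per prompt, the same enumeration gives 28 possible pairs for that prompt, which is why the choice of design starts to matter quickly.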

That is a small conceptual shift with a large operational consequence. The data asset is no longer “a dataset of prompts and labels.” It is a portfolio of measurements chosen to improve a downstream policy objective.

The mechanism: pair choice changes the information matrix

The paper’s core mechanism has three steps:

1. A comparison pair has a sensitivity vector.
2. A sampling design averages those sensitivities into an information object.
3. That object controls how DPO estimation error becomes downstream RLHF suboptimality.

A comparison edge $e=(x,y^+,y^-)$ enters the DPO loss through a scalar logit difference. The paper defines a pairwise sensitivity vector:

$$ g(e;\theta) = \nabla_{\theta}u_{\theta}(e). $$

This vector tells us how the comparison reacts to small parameter changes. In a linear contextual setting, it is essentially the feature difference between the two completions being compared. In a tabular setting, it behaves like the signed incidence vector of the compared pair. Translation: every labeled comparison points in some direction in parameter space. Some directions are useful. Some are decorative.
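A small sketch of the two special cases, assuming the simplest possible parameterizations. The function names are hypothetical, and any constant scale factor on the logit (such as the DPO temperature) is ignored here.

```python
import numpy as np

def linear_sensitivity(phi_plus, phi_minus):
    """Linear contextual case: feature difference of the two completions.

    Assumption: the logit is linear in theta, so its gradient is the
    feature difference (up to a constant scale factor).
    """
    return np.asarray(phi_plus, float) - np.asarray(phi_minus, float)

def tabular_sensitivity(num_items, i, j):
    """Tabular case: signed incidence vector of the compared pair (i, j)."""
    g = np.zeros(num_items)
    g[i] = 1.0
    g[j] = -1.0
    return g

g_lin = linear_sensitivity([1.0, 2.0], [0.5, 2.0])   # differs only in feature 0
g_tab = tabular_sensitivity(4, 0, 2)                  # compares item 0 with item 2
```

In the linear example the two completions share the second feature, so the comparison carries no information about that coordinate, whatever the label turns out to be.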

Given a sampling design $D$, the paper defines the design covariance matrix:

$$ \Sigma_D(\theta)=\mathbb{E}_{e \sim D}\left[g(e;\theta)g(e;\theta)^\top\right]. $$

This is the information geometry of the comparisons you choose to label. If $D$ repeatedly selects similar or uninformative pairs, $\Sigma_D$ is weak in some directions. If it covers the relevant directions well, DPO has a better chance of estimating the policy-improving parameters.
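For a finite set of candidate edges the expectation is a weighted sum of outer products. The sketch below assumes the sensitivity vectors are already stacked as rows of a matrix `G` (a hypothetical name) and shows how a design concentrated on similar pairs leaves a direction uncovered.

```python
import numpy as np

def design_covariance(G, w):
    """Sigma_D = sum_e w_e * g_e g_e^T.

    G: (m, d) array, one sensitivity vector per candidate edge.
    w: (m,) design probabilities over those edges.
    """
    G = np.asarray(G, float)
    w = np.asarray(w, float)
    return (G * w[:, None]).T @ G

# Three candidate edges in a 2-dimensional parameter space.
G = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

narrow = design_covariance(G, [0.5, 0.5, 0.0])    # only pairs along axis 0
spread = design_covariance(G, [1/3, 1/3, 1/3])    # every edge gets weight

rank_narrow = np.linalg.matrix_rank(narrow)
rank_spread = np.linalg.matrix_rank(spread)
```

The narrow design has rank 1: no number of labels drawn from it identifies the second coordinate.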

The downstream policy objective is the standard KL-regularized RLHF objective:

$$ J(\pi)=\mathbb{E}_x\left[ \mathbb{E}_{y\sim\pi(\cdot\mid x)}[r^{\ast}(x,y)] -\beta\,\mathrm{KL}\left(\pi(\cdot\mid x)\,\|\,\pi_0(\cdot\mid x)\right) \right]. $$

The target is not “learn preferences well” in the abstract. The target is a policy with small optimality gap under this objective. That distinction matters. A comparison can be useful for ranking items in a local sense while still being poorly aligned with the parameter directions that govern the final policy’s value.
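To make the objective concrete, here is a toy evaluation for a single prompt with a finite response set. This is an illustrative sketch, not the paper's setup: the numbers are made up, and the closed-form maximizer $\pi^{\ast}\propto\pi_0\exp(r^{\ast}/\beta)$ is the standard result for KL-regularized objectives of this form.

```python
import numpy as np

def kl_regularized_value(pi, pi0, r, beta):
    """J(pi) for one prompt: expected reward minus beta * KL(pi || pi0).

    Assumes all probabilities are strictly positive.
    """
    pi, pi0, r = (np.asarray(a, float) for a in (pi, pi0, r))
    kl = np.sum(pi * (np.log(pi) - np.log(pi0)))
    return float(pi @ r - beta * kl)

def optimal_policy(pi0, r, beta):
    """Closed-form maximizer: pi*(y) proportional to pi0(y) * exp(r(y) / beta)."""
    pi0, r = np.asarray(pi0, float), np.asarray(r, float)
    logits = np.log(pi0) + r / beta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical three-response example.
pi0 = np.array([0.7, 0.2, 0.1])
r = np.array([0.0, 0.5, 1.0])
beta = 0.5

pi_star = optimal_policy(pi0, r, beta)
gap_of_reference = kl_regularized_value(pi_star, pi0, r, beta) - kl_regularized_value(pi0, pi0, r, beta)
```

The quantity `gap_of_reference` is the optimality gap of simply keeping the reference policy; the bounds discussed later concern the gap of the DPO-trained policy instead.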

This is where the paper’s trace term becomes the operating principle:

$$ \operatorname{tr}\left(I(\theta^{\ast})\Sigma_D(\theta^{\ast})^{\dagger}\right). $$

The Fisher matrix $I(\theta^{\ast})$ weights directions by how much they matter for the optimal policy. The pseudoinverse of $\Sigma_D(\theta^{\ast})$ penalizes directions the selected comparisons fail to identify. If the design misses an important direction, the penalty grows. The matrix does not care that the label came from a premium annotator, a fancy UI, or a procurement-approved vendor. It only cares whether the comparison carried the right information.
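A numerical sketch of how the penalty behaves. Both matrices are toy inputs, and returning infinity when a weighted direction lies outside the range of $\Sigma_D$ is a convention chosen here for illustration (the pseudoinverse alone would silently ignore that direction).

```python
import numpy as np

def trace_criterion(I, Sigma):
    """tr(I Sigma^+), or inf if I weights a direction Sigma cannot identify."""
    I = np.asarray(I, float)
    Sigma = np.asarray(Sigma, float)
    Sigma_pinv = np.linalg.pinv(Sigma, hermitian=True)
    P = Sigma @ Sigma_pinv                      # projector onto range(Sigma)
    if not np.allclose(P @ I @ P, I, atol=1e-8):
        return float("inf")                     # illustrative convention
    return float(np.trace(I @ Sigma_pinv))

I = np.eye(2)                                   # both directions matter equally
balanced = np.diag([0.5, 0.5])
skewed = np.diag([0.99, 0.01])                  # second direction barely covered
missing = np.diag([1.0, 0.0])                   # second direction never measured

t_balanced = trace_criterion(I, balanced)
t_skewed = trace_criterion(I, skewed)
t_missing = trace_criterion(I, missing)
```

The skewed design spends nearly all its mass on one direction and pays roughly 25 times the penalty of the balanced one; the design that skips a direction entirely is unbounded under this convention.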

Rude, but fair.

High-probability responses are not automatically high-value comparisons

A natural reader misconception is that preference data should focus on the most likely completions under the reference model. After all, if the model produces those responses often, surely comparing them should matter most.

The paper shows why this intuition is incomplete. High reference probability does not equal high information. If the likely completions are clustered in a narrow part of feature space, comparing them can repeatedly measure the same local direction while leaving other policy-relevant directions underidentified. The result is a label budget that looks busy but learns little.

The synthetic experiments make this point cleanly. In the tabular setting, the authors generate a non-uniform reference policy and weak reward differences. They compare four designs: the oracle trace design $D^{\ast}(\theta^{\ast})$, the implementable plug-in design $D^{\ast}(\theta_0)$, uniform sampling, and a heuristic that samples pairs by drawing two items according to the reference policy $\pi_0$. The oracle and plug-in designs achieve nearly identical and consistently small RLHF gaps. Uniform improves with budget but is less sample-efficient. The $\pi_0$-weighted heuristic performs poorly and shows little improvement.

The linear contextual experiment repeats the lesson under a richer feature model. Again, the plug-in design closely tracks the oracle design. Uniform sampling lags at small budgets. The $\pi_0$-weighted heuristic performs poorly and with larger variance.

This is not an argument against comparing common model outputs. It is an argument against confusing frequency with diagnostic value. A frequently produced response can be important for deployment exposure, but that does not make every comparison involving it informative for DPO training. The right question is: what direction does this comparison identify, and does that direction matter for the downstream policy?

The theorem says this is not just a better heuristic

The theoretical result matters because it does more than introduce a plausible curation score. It links comparison design to downstream performance through matching upper and lower bounds.

Informally, the paper proves that for a DPO estimator trained on $n$ comparisons sampled from design $D$, the RLHF optimality gap is upper bounded by a term of the form:

$$ J(\pi^{\ast})-J(\hat{\pi}_n) \lesssim \frac{1}{n} \operatorname{tr}\left(I(\theta^{\ast})\Sigma_D(\theta^{\ast})^{\dagger}\right). $$

It also proves an information-theoretic lower bound showing that any estimator must incur, up to constants, the same kind of trace-controlled difficulty:

$$ \mathbb{E}\left[J(\pi^{\ast})-J(\tilde{\pi}_n)\right] \gtrsim \frac{1}{n} \mathbb{E}\left[ \operatorname{tr}\left(I(\theta^{\ast})\Sigma_D(\theta^{\ast})^{\dagger}\right) \right]. $$

The two bounds are not interchangeable operational recipes: each holds only under the paper’s assumptions and up to constant factors. But their shared trace structure is the point. The same design-dependent quantity controls both what DPO can achieve and what no estimator can avoid.
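One back-of-envelope way to read the shared $1/n$ structure, ignoring constants and assuming both designs satisfy the theorem’s conditions. The designs $D_A$, $D_B$ and the factor of three below are hypothetical, chosen only to show the arithmetic. Writing $T_D$ for the trace term of design $D$, equalizing the two bounds gives:

$$ \frac{T_{D_A}}{n_A}=\frac{T_{D_B}}{n_B} \quad\Longrightarrow\quad n_B = n_A\,\frac{T_{D_B}}{T_{D_A}}, \qquad T_D := \operatorname{tr}\left(I(\theta^{\ast})\Sigma_D(\theta^{\ast})^{\dagger}\right). $$

If $T_{D_B}=3\,T_{D_A}$, then 1,000 labels under $D_A$ correspond, at the level of the bound, to roughly 3,000 labels under $D_B$. The trace ratio acts as an exchange rate between design quality and label count.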

That makes the trace criterion a defensible target for pair selection. It is not merely “we tried a clever weighting scheme and the curves looked nicer.” It says the data-curation decision has a mathematically identifiable bottleneck: whether the selected comparisons provide information in the directions the final policy objective cares about.

The proof roadmap is also instructive for operators:

| Step in the paper | What it establishes | Operational translation |
| --- | --- | --- |
| RLHF gap as weighted parameter error | Downstream policy loss is locally equivalent to parameter error weighted by $I(\theta^{\ast})$ | Not all parameter errors matter equally; prioritize directions that move policy value |
| DPO upper bound | DPO estimation error is controlled by the design covariance $\Sigma_D$ and sample budget $n$ | Label selection shapes what DPO can learn from the same number of labels |
| Information-theoretic lower bound | Any estimator faces the same trace-style difficulty | This is not only a DPO optimization artifact; poor comparison geometry is intrinsically costly |

The elegance here is that the paper translates annotation design into a measurable geometry problem. The less elegant part is that production systems then have to estimate this geometry from imperfect features. Welcome to engineering.

The plug-in design is the operational move

The oracle design depends on $\theta^{\ast}$, the unknown parameter corresponding to the RLHF-optimal policy. That is inconvenient, because if you already knew the optimal policy, you would not be hiring annotators to help you find it. Minor detail.

The paper handles this by using the reference parameter $\theta_0$ as a proxy. Since KL-regularized RLHF starts from the reference policy and keeps the post-trained policy close to it, $\theta_0$ can be used to construct an implementable plug-in trace design.

The plug-in design solves the approximate objective:

$$ D_{\theta_0}\in \arg\min_{D\in\Delta(E)} \operatorname{tr}\left(I(\theta_0)\Sigma_D(\theta_0)^{\dagger}\right). $$

Theorem 2 shows that, under the paper’s assumptions, the plug-in trace criterion is within a controlled factor of the oracle criterion. The factor depends on the distance between $\theta_0$ and $\theta^{\ast}$. This is exactly the kind of condition one should expect. If the reference policy is already reasonably close to the target, its geometry is useful. If the reference is wildly misaligned, the proxy becomes less trustworthy. No theorem rescues a bad starting point by complimenting it.

In the LLM experiments, the authors implement this idea approximately. They represent prompt-response pairs using hidden representations from the SFT model, form feature differences for candidate comparisons, estimate a Fisher-style matrix from reference-model behavior, and solve a regularized trace-design problem using Frank-Wolfe optimization. Then they sample comparisons without replacement according to the optimized weights.
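The optimization step can be sketched as follows. This is a generic Frank-Wolfe loop over the probability simplex for the ridge-regularized objective $\operatorname{tr}\left(I\,(\Sigma_D+\lambda \mathrm{Id})^{-1}\right)$, not the authors' implementation: the step-size rule, the best-iterate tracking, the value of `lam`, and the random toy features are all assumptions.

```python
import numpy as np

def trace_objective(G, I, w, lam):
    """tr(I (Sigma_w + lam * Id)^-1) with Sigma_w = sum_e w_e g_e g_e^T."""
    M = (G * w[:, None]).T @ G + lam * np.eye(G.shape[1])
    return float(np.trace(I @ np.linalg.inv(M)))

def frank_wolfe_trace_design(G, I, lam=1e-3, iters=300):
    """Approximate design weights over the m candidate edges (rows of G)."""
    G, I = np.asarray(G, float), np.asarray(I, float)
    m, d = G.shape
    w = np.full(m, 1.0 / m)                          # start from uniform
    best_w, best_val = w.copy(), trace_objective(G, I, w, lam)
    for k in range(1, iters + 1):
        Minv = np.linalg.inv((G * w[:, None]).T @ G + lam * np.eye(d))
        B = Minv @ I @ Minv
        # Gradient w.r.t. w_e is -g_e^T B g_e for every edge e.
        grad = -np.einsum("ed,dk,ek->e", G, B, G)
        s = int(np.argmin(grad))                     # best simplex vertex
        gamma = 2.0 / (k + 2.0)                      # standard step size
        w = (1.0 - gamma) * w
        w[s] += gamma
        val = trace_objective(G, I, w, lam)
        if val < best_val:                           # keep the best iterate
            best_w, best_val = w.copy(), val
    return best_w

def sample_pairs(w, n, seed=0):
    """Sample n distinct edge indices according to the design weights."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(w), size=n, replace=False, p=w / w.sum())

# Toy problem: 30 candidate edges with 4-dimensional feature differences.
rng = np.random.default_rng(0)
G = rng.normal(size=(30, 4))
I = np.eye(4)                                        # placeholder for the Fisher-style matrix
w = frank_wolfe_trace_design(G, I)
chosen = sample_pairs(w, n=10)
```

Each iteration needs one $d\times d$ inverse and one pass over the candidate edges, which is why working in a reduced feature space keeps the design step cheap relative to annotation.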

This is important: the production-relevant method is not “compute exact Fisher matrices for a giant neural policy and solve the Platonic optimal design.” It is “use model features to approximate comparison sensitivity and allocate labels more intelligently than first-pair or uniform heuristics.”

The experiments validate different parts of the claim

The experiments are best read as a staged evidence stack, not as four interchangeable benchmark wins.

| Experiment | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Synthetic tabular setting | Main mechanism check under the theory’s cleanest assumptions | The oracle and plug-in trace designs can dramatically reduce RLHF optimality gap versus uniform and $\pi_0$-weighted sampling | That large neural policies satisfy the same assumptions exactly |
| Synthetic linear contextual setting | Main mechanism check with features and prompt-conditioned candidates | The plug-in design tracks the oracle and targets informative feature directions | That the feature proxy used in real LLMs is always faithful |
| IMDb with GPT-2-large | Real LLM implementation test with a scalar proxy reward | Trace-curated data improves the reward–KL frontier in prompt and response selection tasks | That sentiment reward equals human preference or that gains generalize to all instruction-following tasks |
| Anthropic-HH with Pythia-2.8B | Exploratory extension without an explicit scalar reward | Same-budget curated data outperforms the benchmark across reported budgets and sampling temperatures under GPT-4.1 judging | That GPT-4.1 is a perfect evaluator or that offline curation dominates adaptive methods |

The IMDb experiment is especially useful because it introduces real DPO fine-tuning while retaining a measurable reward–KL frontier. The authors fine-tune GPT-2-large on IMDb, use the resulting SFT model as the reference policy, and compare DPO datasets curated by the trace method against benchmark selection rules. They run two curation tasks: selecting which prompts to annotate, and selecting which response pair to compare within a prompt. In prompt selection, they use a candidate pool of 1,000 prompts and select $N=175$ comparisons. In response selection, they generate eight candidate responses per prompt and select one pair per prompt.

The result is not merely “higher reward.” The paper evaluates a reward–KL tradeoff, sweeping DPO regularization values and plotting final checkpoints. That matters because post-training is always a tradeoff between reward gain and deviation from the reference model. A method that buys reward by blowing up KL is not alignment; it is just taking the scenic route to distribution shift.

The Anthropic-HH experiment removes the convenience of a scalar reward. The authors use the HH train/test splits, train a Pythia-2.8B SFT model on chosen responses, and train DPO models using either the trace-curated dataset or a benchmark rule. The candidate pool contains 160,800 preference comparisons. The paper reports representative budgets of 80,400 and 96,480 comparisons, equal to 50% and 60% of the candidate pool. Evaluation uses 500 HH test prompts per method-temperature pair, sampling temperatures 0.25, 0.7, and 1.0, and GPT-4.1 as an automatic judge. Ties count as 0.5.
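The scoring rule and the budget arithmetic are simple enough to write down. The function name and the outcome labels are hypothetical; only the tie weight of 0.5 and the pool and budget sizes come from the description above.

```python
def win_rate(outcomes):
    """Average score over judged prompts, counting a tie as half a win."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[o] for o in outcomes) / len(outcomes)

# Made-up judgments for four prompts, for illustration only.
example = win_rate(["win", "tie", "loss", "win"])

# Budgets as fractions of the 160,800-comparison candidate pool.
pool_size = 160_800
budget_50 = pool_size * 50 // 100
budget_60 = pool_size * 60 // 100
```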

Across the reported budgets and temperatures, the trace-curated datasets outperform the benchmark. That is encouraging, but it should be interpreted as a realistic implementation signal, not a theorem wearing a benchmark costume. The judge is automatic. The candidate pool consists of existing HH preference pairs. The feature construction is an approximation. Still, the evidence points in the same direction as the theory: better pair allocation can buy better post-training outcomes at fixed label count.

What the paper directly shows, and what Cognaptus infers

The distinction matters, because this is exactly where research findings get promoted into corporate folklore.

| Layer | Claim | Status |
| --- | --- | --- |
| Paper directly shows | In the analyzed DPO setting, comparison design affects downstream RLHF optimality gap through $\Sigma_D$ and the trace criterion | Theoretical result under stated assumptions |
| Paper directly shows | A plug-in design using the reference parameter can approximate the oracle design within a controlled factor | Theoretical result under stated assumptions |
| Paper directly shows | Synthetic, IMDb, and Anthropic-HH experiments favor trace-curated data over common heuristics | Empirical evidence across increasing realism |
| Cognaptus infers | Preference-data pipelines should treat pair selection as budget allocation, not as dataset formatting | Practical interpretation |
| Cognaptus infers | Teams should generate larger response pools when generation is cheap and annotation is scarce | Operational implication, contingent on candidate diversity and compute cost |
| Still uncertain | How robust the method is under large-scale production annotation, changing policies, adaptive data collection, noisy human raters, and richer alignment objectives | Open deployment boundary |

This is the correct posture: use the paper to change the workflow, not to declare the workflow solved.

A practical preference-label pipeline would look different

For a team running DPO-style post-training, the paper suggests a different operating model.

First, generate a larger candidate pool than the number of comparisons you plan to label. The extra completions are not waste; they are the option set from which information-rich comparisons can be selected.

Second, represent candidate responses in a feature space. The paper’s experiments use hidden representations from the reference/SFT model and construct feature differences between candidate responses. That is a proxy for pairwise sensitivity.

Third, estimate which parameter directions matter. In the paper’s implementation, this involves approximating a Fisher-style matrix using the reference model and centered log-probability structure.

Fourth, solve a regularized trace-design problem and sample comparisons according to the resulting weights. In production, this would sit between generation and annotation: a curation service that decides which pairs deserve human judgment.

Fifth, preserve auditability. Each selected pair should carry metadata: prompt, candidate responses, design score, feature-distance summary, reference probabilities, and selection rationale. Otherwise, the organization will rediscover six months later that it bought labels according to an algorithm nobody can explain. This is traditional enterprise archaeology.
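One possible shape for that audit record, as a sketch. The class name, field names, and example values are all hypothetical; the fields simply mirror the metadata listed above.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SelectedPair:
    """Audit record for one comparison sent to annotators (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    design_weight: float      # weight assigned by the design step
    feature_distance: float   # norm of the feature difference
    ref_logprob_a: float      # reference-model log-probability of response_a
    ref_logprob_b: float      # reference-model log-probability of response_b
    rationale: str            # which selection rule and version chose the pair

record = SelectedPair(
    prompt="Summarize the refund policy.",
    response_a="Refunds are issued within 30 days.",
    response_b="We do not discuss refunds.",
    design_weight=0.012,
    feature_distance=3.4,
    ref_logprob_a=-21.7,
    ref_logprob_b=-35.2,
    rationale="plug-in trace design, v1",
)

# One JSON line per selected pair, appended to the annotation batch log.
line = json.dumps(asdict(record))
```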

The ROI logic is not “labels become cheap.” The ROI logic is that each paid label carries more downstream training signal. In annotation-heavy workflows, that can mean better policy quality at fixed budget, similar policy quality with fewer labels, or earlier detection that the candidate pool itself is too narrow to support the desired post-training move.

The appendix is implementation detail, not a second thesis

The appendix matters because it shows how the method was made concrete. It does not overturn the main argument.

For IMDb, the authors use GPT-2-large SFT as the reference model, full-parameter DPO, RMSprop, a reward–KL frontier over $\beta \in \{0.05,0.1,0.2,0.5,1,2,5\}$, and the siebert/sentiment-roberta-large-english classifier as a proxy reward. They estimate KL as a sequence-level log-probability difference between the trained policy and the reference policy. They use last-token hidden representations, feature differences, ridge regression to approximate reference behavior, and a Frank-Wolfe procedure to solve the design objective.
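A sequence-level KL estimate of this kind can be sketched as a Monte Carlo average. The sketch assumes the sequences were sampled from the trained policy and that per-token log-probabilities under both models are already available; the function name and the toy numbers are hypothetical.

```python
def sequence_kl_estimate(policy_token_logps, ref_token_logps):
    """Mean over sequences of log pi(y|x) - log pi_0(y|x).

    Each argument is a list of sequences, each a list of per-token
    log-probabilities for the same sampled tokens.
    Assumption: sequences were sampled from the trained policy.
    """
    diffs = []
    for lp, lr in zip(policy_token_logps, ref_token_logps):
        diffs.append(sum(lp) - sum(lr))
    return sum(diffs) / len(diffs)

# Two toy sequences of length 2 and 1.
policy_lp = [[-1.0, -2.0], [-0.5]]
ref_lp = [[-1.5, -2.5], [-1.0]]
kl_hat = sequence_kl_estimate(policy_lp, ref_lp)
```

Individual differences can be negative; only the average over many samples estimates the KL term on the frontier's horizontal axis.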

For Anthropic-HH, they use Pythia-2.8B, PCA to reduce hidden representations to 128 dimensions, a ridge parameter $\lambda=10^{-3}$, DPO with $\beta=0.1$, and the same training configuration for the curated and benchmark datasets.
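The two generic building blocks named here, PCA and ridge regression, can be sketched with plain NumPy. This shows the techniques only, not the authors' pipeline: the random matrix stands in for hidden states, the regression target is hypothetical, and the toy dimensions are far smaller than 128.

```python
import numpy as np

def pca_reduce(H, k):
    """Project centered rows of H onto their top-k principal directions."""
    H = np.asarray(H, float)
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:k].T

def ridge_fit(X, y, lam=1e-3):
    """Closed-form ridge solution (X^T X + lam * Id)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stand-in for hidden representations: 50 examples, 10 dimensions.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 10))
Z = pca_reduce(H, k=3)

# Hypothetical regression target, linear in the reduced features.
true_coef = np.array([1.0, -2.0, 0.5])
coef = ridge_fit(Z, Z @ true_coef, lam=1e-8)
```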

The likely purpose of these appendix details is implementation credibility and reproducibility. They show the trace design can be approximated in realistic LLM pipelines without solving an impossible exact design problem. They do not prove that last-token hidden states are universally the right representation, that PCA dimension 128 is optimal, or that GPT-4.1 judging should substitute for human evaluation in high-stakes deployment.

That distinction keeps the paper useful without turning every appendix choice into scripture.

Where this result does not travel cleanly

The limitations are not decorative; they define where the mechanism can be trusted.

The theory relies on realizability: the optimal policy is assumed to be representable within the chosen parametric policy class, and the preference-label model is well specified. Production LLMs often violate this politely, then repeatedly.

The analysis also requires regularity, smoothness, feature separation, coverage, and a unique population minimizer. These assumptions are not absurd, but they are stronger than “we have a transformer and some labels.” In particular, feature separation matters operationally. If the candidate pool contains near-duplicates, no design can extract information that is not there. Better pair selection cannot redeem a stale generation process.
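A cheap pre-check for that failure mode is to flag candidate pairs whose feature difference is nearly zero before any design weights are computed. This is an illustrative sketch: the function name and the distance threshold are assumptions, and Euclidean distance in feature space is only a proxy for the paper's feature-separation condition.

```python
import numpy as np

def near_duplicate_pairs(features, min_dist=1e-2):
    """Return index pairs (i, j) whose feature vectors are closer than min_dist."""
    F = np.asarray(features, float)
    flagged = []
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            if np.linalg.norm(F[i] - F[j]) < min_dist:
                flagged.append((i, j))
    return flagged

# Three candidates for one prompt; the first two are almost identical.
features = [[0.0, 0.0], [0.0, 0.001], [1.0, 1.0]]
flagged = near_duplicate_pairs(features)
```

If most pairs in a pool are flagged, the fix is upstream in generation (for example, more diverse sampling), not in the design step.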

The design is offline and randomized. Candidate completions are generated before labeling, and the design does not adapt to observed feedback. That fits batch annotation workflows, which are common in industry, but it is not the same as online active learning or iterative RLHF where model behavior and label strategy co-evolve.

The experiments also use approximations. IMDb relies on a sentiment classifier reward. Anthropic-HH relies on GPT-4.1 as an automatic judge. The HH curation experiment selects from existing preference pairs rather than generating all possible within-prompt response alternatives. These are reasonable experimental choices, not universal guarantees.

Finally, the plug-in design depends on the reference model being a useful proxy for the target geometry. If the reference model is badly miscalibrated, dangerously narrow, or systematically blind to the behaviors the organization wants to improve, the design can inherit that blindness. The paper’s method allocates labels intelligently within the geometry it can see. It does not supply moral vision.

The business value is label leverage, not alignment magic

The paper’s business relevance is not that companies can now “solve RLHF.” That phrase should be retired and perhaps made to apologize.

The business relevance is narrower and more useful: organizations can treat preference labeling as an information-allocation problem. Given a fixed annotation budget, the goal is to select comparisons that reduce downstream policy uncertainty in the directions that matter. That is a better operating model than buying labels on whatever pairs the generation script happened to produce.

For AI teams, this changes the questions asked before annotation starts:

| Old question | Better question |
| --- | --- |
| How many preference labels can we afford? | Which candidate comparisons make each label informative? |
| Should we label all generated pairs? | Which pairs improve information coverage under the downstream objective? |
| Should we focus on likely outputs? | Do likely outputs identify policy-relevant directions, or just repeat the reference model’s habits? |
| Did the trained model improve? | Did curation improve the reward–KL frontier or win rate at the same label budget? |
| Is the dataset large enough? | Is the comparison graph well covered enough to learn the intended policy shift? |

This is the part procurement departments should like: the result does not demand infinite new labels. It asks for better routing of the labels already being bought. The inconvenience is that the annotation pipeline must become more instrumented. Someone has to compute candidate features, estimate design weights, and maintain the selection logic. Apparently, “send random pairs to annotators” was not the final form of data strategy.

The bottom line

This paper’s useful contribution is not that it invents a new post-training objective. It leaves the DPO/RLHF objective largely intact and asks a more operationally neglected question: which comparisons deserve labels in the first place?

The answer is mechanism-first. Pair selection changes the information matrix. The information matrix changes parameter-estimation error. That error changes downstream policy value. Once that chain is visible, the naive habit of comparing arbitrary or high-reference-probability pairs looks less like simplicity and more like unmanaged spend.

The paper does not eliminate the hard parts of alignment. It does not resolve preference ambiguity, evaluator reliability, policy generalization, or the politics of whose preferences count. It does something more modest and more immediately useful: it shows that comparison curation is a first-class control surface for DPO-style post-training.

In other words, the label budget may not be the problem. The pairs you are buying with it might be.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiangze Han, Vineet Goyal, and Will Ma, “Which Pairs to Compare for LLM Post-Training?” arXiv:2606.19607, 2026, https://arxiv.org/abs/2606.19607.
