RLHF | Cognaptus

Stale Rollouts, Fresh Trouble: The Two Speed Limits of Asynchronous RLHF

TL;DR for operators: Asynchronous RLHF buys throughput by allowing rollout workers to continue generating completions while the learner updates the policy. The invoice arrives later: some rollouts were generated by a policy that the learner has already left behind. The paper’s useful contribution is not merely the familiar observation that stale data can destabilize training. It identifies two different speed limits. ...

The Label Budget Was Fine. The Pairing Strategy Was Not.

TL;DR for operators: Preference labels are expensive. Model completions are comparatively cheap. The usual workflow responds to this imbalance in the least imaginative way possible: generate a small number of completions, compare whatever pairs happen to be available, and hope the post-training objective sorts out the mess. Hope is not a procurement strategy, though it does have the virtue of requiring no dashboard. ...

The Reward Model Was Confident. That Was the Bug.

TL;DR for operators: Reward models should not be treated as little oracles that hand down one clean number from the alignment heavens. In the paper’s diagnosis, the problem is more mundane and therefore more dangerous: a reward model can be wrong, uncertain, and numerically confident-looking at the same time. GRPO then standardizes those rewards inside a rollout group, giving extreme scores large influence even when the reward model is least reliable. Excellent. The pipeline has discovered a way to launder uncertainty into policy updates. ...

Fine-Tuned, Fine Print: Why Post-Training Teaches Models What to Trust

Enterprise AI has entered its “sure, but can it use the evidence?” phase. That is progress, technically. It is also where many deployment stories begin to get expensive. The first generation of business LLM adoption was satisfied if a model could produce a fluent answer. The next generation asks something more demanding: can the model use retrieved documents, compliance policies, tool outputs, customer records, analyst notes, and human feedback in the right way? ...

Preference Laundering: How RLHF Can Turn Better Answers Into Bigger Biases

Feedback sounds clean. A user tries two model answers. One is more helpful, safer, more complete, and less obviously stupid. The other is worse. The annotator picks the better one. The reward model learns from that preference. The policy is optimized. Everyone goes home believing that the system has become more aligned. ...

Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing

Thumbs-up feedback looks efficient. It is clean, cheap, easy to store, and friendly to dashboards. One output wins, another output loses, and the reward model learns what humans supposedly want. A tidy little morality market, with all the nuance of a vending machine. ...

Think Less, Align Better: The New Economics of AI Reasoning

Opening: Why this matters now. Enterprise AI is entering its mildly awkward teenage phase: everyone wants intelligence, nobody wants the invoice. For the last two years, much of the AI conversation has revolved around more: more context, more reasoning tokens, more chain-of-thought, more human feedback, more evaluators, more synthetic data, more agents, more dashboards to explain why the agents broke the dashboards. The operating assumption was simple enough: if the model thinks more, explains more, or trains on more feedback, it should perform better. ...

The Reward Is in the Room: Why AI Automation Needs Better Judgment, Not Just Bigger Models

Opening: Why this matters now. AI adoption has entered its second, less glamorous phase. The first phase was easy to explain: make the model generate things. Emails, reports, code, dashboards, summaries, customer replies, compliance drafts, market notes, training content. Give the machine a prompt, admire the fluent output, and pretend the future has arrived because the paragraphs are well-spaced. ...

Alignment Isn’t Free: When Safety Objectives Start Competing

Customer support is where alignment theories go to become invoices. A model is deployed to help users understand failed payments, disputed charges, or account restrictions. Product wants it to be useful. Legal wants it to avoid regulated advice. Trust and safety wants it to refuse suspicious requests. Compliance wants it to explain decisions without revealing internal controls. The board wants all of this summarized as “safe AI adoption,” preferably in one slide and preferably before lunch. ...

Active Minds, Efficient Machines: The Bayesian Shortcut in RLHF

TL;DR for operators: Labels are the awkward invoice behind modern alignment. RLHF looks elegant in diagrams: generate outputs, ask humans which one is better, train a reward model, optimise the policy, repeat until everyone pretends the reward model is civilisation. In practice, most preference comparisons are not equally useful. Some are obvious. Some are redundant. Some teach the model almost nothing except that annotator budgets have a sense of humour. ...