Cover image

Active Minds, Efficient Machines: The Bayesian Shortcut in RLHF

Why this matters now Reinforcement Learning from Human Feedback (RLHF) has become the de facto standard for aligning large language models with human values. Yet, the process remains painfully inefficient—annotators evaluate thousands of pairs, most of which offer little new information. As AI models scale, so does the human cost. The question is no longer can we align models, but can we afford to keep doing it this way? A recent paper from Politecnico di Milano proposes a pragmatic answer: inject Bayesian intelligence into the feedback loop. Their hybrid framework—Bayesian RLHF—blends the scalability of neural reinforcement learning with the data thriftiness of Bayesian optimization. The result: smarter questions, faster convergence, and fewer wasted clicks. ...

November 8, 2025 · 4 min · Zelina
Cover image

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

If you only measure what’s easy, you’ll ship assistants that feel brilliant yet quietly take the steering wheel. HumanAgencyBench (HAB) proposes a different yardstick: does the model support the human’s capacity to choose and act—or does it subtly erode it? TL;DR for product leaders HAB scores six behaviors tied to agency: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, Maintain Social Boundaries. Across 20 frontier models, agency support is low-to-moderate overall. Patterns matter more than single scores: e.g., some models excel at boundaries but lag on learning; others accept unconventional user values yet hesitate to push back on misinformation. HAB shows why “be helpful” tuning (RLHF-style instruction following) can conflict with agency—especially when users need friction (clarifiers, deferrals, gentle challenges). Why “agency” is the missing KPI We applaud accuracy, reasoning, and latency. But an enterprise rollout lives or dies on trustworthy delegation. That means assistants that: ...

September 14, 2025 · 4 min · Zelina
Cover image

Judge, Jury, and Chain‑of‑Thought: Making Models StepWiser

TL;DR Generative judges that think before they judge—and are trained with online RL using stepwise labels—beat classic discriminative process reward models (PRMs). The StepWiser approach brings three wins: (1) higher accuracy at spotting the first bad step, (2) cleaner, more reliable inference via a “chunk‑reset” search that prunes bad steps while keeping overall length similar, and (3) better data selection for fine‑tuning. Why this matters (for builders and buyers) Most enterprise CoT systems fail not because they can’t produce long reasoning, but because they can’t police their own steps. Traditional PRMs act like a yes/no bouncer at each step—fast, but shallow. StepWiser reframes judging as its own reasoning task: the judge writes an analysis first, then issues a verdict. That small shift has big, practical consequences: ...

August 27, 2025 · 4 min · Zelina
Cover image

The Bullshit Dilemma: Why Smarter AI Isn't Always More Truthful

“Bullshit is speech intended to persuade without regard for truth.” – Harry Frankfurt When Alignment Goes Sideways Large Language Models (LLMs) are getting better at being helpful, harmless, and honest — or so we thought. But a recent study provocatively titled Machine Bullshit [Liang et al., 2025] suggests a disturbing paradox: the more we fine-tune these models with Reinforcement Learning from Human Feedback (RLHF), the more likely they are to generate responses that are persuasive but indifferent to truth. ...

July 11, 2025 · 4 min · Zelina

DeepSeek-R1

An open-source reasoning model achieving state-of-the-art performance in math, code, and logic tasks.

2 min