Reasoning-Models

Reason, Reveal, Resist: The Persuasion Duality in Multi‑Agent AI

Meetings are already persuasive systems. Someone speaks first, someone sounds confident, someone produces a spreadsheet with just enough decimal places to look holy, and suddenly the room has moved. Multi-agent AI systems are not so different. They are becoming small artificial committees: one agent retrieves, another proposes, another critiques, another decides. The optimistic version says this gives us productive disagreement. The less adorable version says we have built a machine for circulating influence, and we are only now asking what makes one agent cave to another. ...

Answer, Then Audit: How 'ReSA' Turns Jailbreak Defense Into a Two‑Step Reasoning Game

The dangerous part is often clearer after the model starts answering. Moderation usually begins with the user’s prompt. That sounds sensible. Read the request, classify the risk, block the bad thing, let the good thing through. A tidy little border checkpoint, complete with imaginary clipboard. The problem is that jailbreaks are not polite enough to declare themselves at the border. ...

Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR for operators: Reasoning models are not expensive because they are philosophical. They are expensive because they can keep thinking long after the business value has stopped arriving. The Hermes 4 Technical Report is easiest to misread as another open-weight leaderboard announcement. That is the least useful reading. The more useful reading is that Hermes 4 is a build manual for making open reasoning models behave like deployable systems: generate diverse synthetic data, verify what can be verified, preserve general instruction-following, control runaway reasoning length, and evaluate with enough logging to know whether the model failed or the benchmark harness sneezed.¹ ...

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

TL;DR for operators: UR² is a useful paper because it attacks the part of RAG that most demos politely ignore: search can make a model worse when it is used badly.¹ The framework trains smaller language models to coordinate retrieval and reasoning, rather than bolting a search box onto a chatbot and hoping the context window will behave itself. Hope, regrettably, is not a retrieval strategy. ...

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

TL;DR for operators: R-Zero is a self-evolving training framework for reasoning LLMs that starts with one base model, splits it into two roles, and lets them co-train: a Challenger generates difficult questions, while a Solver learns to answer them.¹ The useful business takeaway is not “models no longer need data.” That is the sort of sentence that should be handled with tongs. R-Zero removes the need for external task datasets and human labels in its training loop, but it still depends on engineered reward signals, majority-vote pseudo-labels, answer-format discipline, filtering, and objective correctness checks. “Zero data” here means zero external tasks and labels, not zero structure. ...
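The majority-vote pseudo-labels and filtering mentioned in this teaser can be illustrated with a small sketch. It is not R-Zero's actual code: the function names, the sample answers, and the agreement band of 0.3 to 0.8 are all hypothetical, chosen only to show how a label and a difficulty signal can come from the Solver's own samples without a human in the loop.

```python
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, agreement) for a list of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def filter_questions(samples, low=0.3, high=0.8):
    """Keep questions whose agreement falls inside [low, high].

    The band is an assumed heuristic: very high agreement suggests the
    question is too easy, very low agreement suggests the pseudo-label
    is unreliable.
    """
    kept = {}
    for question, answers in samples.items():
        label, agreement = majority_vote(answers)
        if low <= agreement <= high:
            kept[question] = label
    return kept


# Hypothetical Solver samples: four sampled answers per question.
SAMPLES = {
    "q_easy": ["7", "7", "7", "7"],      # agreement 1.00, dropped
    "q_useful": ["4", "4", "5", "4"],    # agreement 0.75, kept
    "q_noisy": ["1", "2", "3", "4"],     # agreement 0.25, dropped
}

TRAINING_SET = filter_questions(SAMPLES)
```

Only `q_useful` survives here, which is the sense in which “zero data” still leaves plenty of structure: the voting rule and the band are engineering decisions someone had to make.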

Tools of Thought: Why Reasoning Isn’t an Illusion After All

TL;DR for operators: The useful question is not whether reasoning models “really think”. That debate is charming, mostly because it lets everyone pretend a benchmark table is a metaphysics seminar. The operational question is simpler: when you give a reasoning model the same tools as a non-reasoning model, does it use them better? ...

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

TL;DR for operators: Fine-tuning on curated examples is usually sold as the boring, stable cousin of reinforcement learning. The paper behind this article says that is too neat. When a team filters examples into “good” and “not good,” it has already created a sparse reward function. Standard supervised fine-tuning on the surviving examples is therefore not outside reinforcement learning; it is optimising a lower bound on an RL objective, only without admitting it at the meeting. ...

Reasoning at Scale: How DeepSeek Redefines the LLM Playbook

TL;DR for operators: DeepSeek-R1 is not a story about one model suddenly becoming clever because someone found the secret lever labelled “reason harder”. It is a systems story: take a strong base model, reward it on problems where correctness can be checked, let longer reasoning traces emerge, repair the ugly parts with cold-start data and alignment, then distil the resulting behaviour into smaller models where deployment economics actually matter.¹ ...

Beyond the Pareto Frontier: Pricing LLM Mistakes in the Real World

TL;DR for operators: Most model-selection dashboards still ask the wrong question. They ask which LLM gives the best accuracy for the lowest inference cost. Zellinger and Thomson’s paper asks a more operationally honest one: how much does a wrong answer, a slow answer, or no answer cost in this specific workflow?¹ The paper’s useful move is to convert competing performance metrics into a single expected dollar reward. Inference cost stays in dollars. Latency gets priced in dollars per second or minute. Errors get priced by their business consequence. Abstention gets priced by the cost of failing to answer or escalating to a human. Once everything is in the same unit, the “best model” is no longer the one that looks attractive on a Pareto plot. It is the model with the highest expected reward under the actual economics of the task. ...
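A minimal sketch of that pricing idea follows. It is not the paper's formulation, only one plausible reading of the summary above: every class name, field, and number is hypothetical, and the reward is simply the negative of the four priced costs added together.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    cost_per_query: float   # inference cost in dollars
    latency_s: float        # average seconds per answer
    error_rate: float       # probability of a wrong answer
    abstain_rate: float     # probability of declining to answer


@dataclass
class TaskEconomics:
    price_per_second: float      # dollars lost per second of waiting
    cost_per_error: float        # dollars lost per wrong answer
    cost_per_abstention: float   # dollars to escalate to a human


def expected_reward(model, task):
    """Expected dollar reward per query (always <= 0 in this sketch)."""
    return -(
        model.cost_per_query
        + model.latency_s * task.price_per_second
        + model.error_rate * task.cost_per_error
        + model.abstain_rate * task.cost_per_abstention
    )


def best_model(models, task):
    """Pick the model with the highest expected reward for this task."""
    return max(models, key=lambda m: expected_reward(m, task))


# Hypothetical profiles and prices, for illustration only.
MODELS = [
    ModelProfile("cheap", 0.001, 1.0, 0.10, 0.0),
    ModelProfile("strong", 0.050, 5.0, 0.02, 0.0),
]
LOW_STAKES = TaskEconomics(0.001, 0.10, 0.05)
HIGH_STAKES = TaskEconomics(0.001, 50.0, 5.0)

LOW_PICK = best_model(MODELS, LOW_STAKES).name
HIGH_PICK = best_model(MODELS, HIGH_STAKES).name
```

With these made-up numbers the cheap model wins when an error costs ten cents and the strong model wins when an error costs fifty dollars. The models did not change; only the price of being wrong did.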

Brains with Gradients: Why Energy-Based Transformers Might Be the Future of Thinking Machines

TL;DR for operators: Energy-Based Transformers are not another prompt trick, reasoning wrapper, or RL-flavoured attempt to make a chatbot show more homework. They change the model’s job. Instead of directly predicting the next token, frame, or image patch in one forward pass, an EBT learns a scalar energy function that scores whether a candidate prediction is compatible with its context. Lower energy means “this fits better.” Inference then becomes optimisation: start with a rough or random candidate, compute the gradient of the energy with respect to that candidate, and iteratively move toward a lower-energy prediction. ...
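The inference loop described here can be shown with a toy stand-in. Nothing below is an actual Energy-Based Transformer: the energy is a hand-written quadratic with an analytic gradient, where a real EBT would use a learned network and automatic differentiation. The weights, step size, and step count are all assumptions made for illustration.

```python
import random


def target_from_context(context, weights):
    """Toy stand-in for what a trained model considers compatible."""
    return [sum(w * c for w, c in zip(row, context)) for row in weights]


def energy(candidate, target):
    """Scalar energy: squared distance from the compatible prediction."""
    return sum((y - t) ** 2 for y, t in zip(candidate, target))


def energy_gradient(candidate, target):
    """Analytic gradient of the energy with respect to the candidate."""
    return [2.0 * (y - t) for y, t in zip(candidate, target)]


def refine(context, weights, steps=50, step_size=0.1, seed=0):
    """Start from a random candidate and descend the energy surface."""
    rng = random.Random(seed)
    target = target_from_context(context, weights)
    candidate = [rng.gauss(0.0, 1.0) for _ in target]
    energies = [energy(candidate, target)]
    for _ in range(steps):
        grad = energy_gradient(candidate, target)
        candidate = [y - step_size * g for y, g in zip(candidate, grad)]
        energies.append(energy(candidate, target))
    return candidate, energies


# Hypothetical context and weights.
CONTEXT = [1.0, 2.0]
WEIGHTS = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]

PREDICTION, TRACE = refine(CONTEXT, WEIGHTS)
```

The `steps` argument is the operational lever: more steps buy a lower-energy prediction at the cost of more compute per answer, which is the trade the teaser calls inference as optimisation.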