Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR for operators: Reasoning models are not expensive because they are philosophical. They are expensive because they can keep thinking long after the business value has stopped arriving. The Hermes 4 Technical Report is easiest to misread as another open-weight leaderboard announcement. That is the least useful reading. The more useful reading is that Hermes 4 is a build manual for making open reasoning models behave like deployable systems: generate diverse synthetic data, verify what can be verified, preserve general instruction-following, control runaway reasoning length, and evaluate with enough logging to know whether the model failed or the benchmark harness sneezed. ...

August 26, 2025 · 18 min · Zelina

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

TL;DR for operators: UR² is a useful paper because it attacks the part of RAG that most demos politely ignore: search can make a model worse when it is used badly. The framework trains smaller language models to coordinate retrieval and reasoning, rather than bolting a search box onto a chatbot and hoping the context window will behave itself. Hope, regrettably, is not a retrieval strategy. ...

August 11, 2025 · 19 min · Zelina

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

TL;DR for operators: R-Zero is a self-evolving training framework for reasoning LLMs that starts with one base model, splits it into two roles, and lets them co-train: a Challenger generates difficult questions, while a Solver learns to answer them. The useful business takeaway is not “models no longer need data.” That is the sort of sentence that should be handled with tongs. R-Zero removes the need for external task datasets and human labels in its training loop, but it still depends on engineered reward signals, majority-vote pseudo-labels, answer-format discipline, filtering, and objective correctness checks. “Zero data” here means zero external tasks and labels, not zero structure. ...

August 8, 2025 · 15 min · Zelina
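
For readers who want the "majority-vote pseudo-labels" phrase in the R-Zero teaser above made concrete, here is a minimal sketch. It assumes the Solver is sampled several times on one question and the most common answer is treated as the label. The function name `majority_vote_label` and the `min_agreement` filter are hypothetical illustrations, not the paper's implementation.

```python
from collections import Counter
from typing import Optional

def majority_vote_label(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)

def pseudo_label(answers: list[str], min_agreement: float = 0.5) -> Optional[str]:
    """Keep the majority answer only if enough samples agree (assumed filter rule)."""
    label, agreement = majority_vote_label(answers)
    return label if agreement >= min_agreement else None

# Four hypothetical Solver samples for one generated question.
samples = ["42", "42", "41", "42"]
label, agreement = majority_vote_label(samples)
```

The agreement fraction is the interesting part for operators: a question where the samples split evenly gives a weak label, which is why some form of filtering sits next to the voting step.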

Tools of Thought: Why Reasoning Isn’t an Illusion After All

TL;DR for operators: The useful question is not whether reasoning models “really think”. That debate is charming, mostly because it lets everyone pretend a benchmark table is a metaphysics seminar. The operational question is simpler: when you give a reasoning model the same tools as a non-reasoning model, does it use them better? ...

July 24, 2025 · 14 min · Zelina

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

TL;DR for operators: Fine-tuning on curated examples is usually sold as the boring, stable cousin of reinforcement learning. The paper behind this article says that is too neat. When a team filters examples into “good” and “not good,” it has already created a sparse reward function. Standard supervised fine-tuning on the surviving examples is therefore not outside reinforcement learning; it is optimising a lower bound on an RL objective, only without admitting it at the meeting. ...

July 18, 2025 · 18 min · Zelina

Reasoning at Scale: How DeepSeek Redefines the LLM Playbook

TL;DR for operators: DeepSeek-R1 is not a story about one model suddenly becoming clever because someone found the secret lever labelled “reason harder”. It is a systems story: take a strong base model, reward it on problems where correctness can be checked, let longer reasoning traces emerge, repair the ugly parts with cold-start data and alignment, then distil the resulting behaviour into smaller models where deployment economics actually matter. ...

July 15, 2025 · 14 min · Zelina

Beyond the Pareto Frontier: Pricing LLM Mistakes in the Real World

TL;DR for operators: Most model-selection dashboards still ask the wrong question. They ask which LLM gives the best accuracy for the lowest inference cost. Zellinger and Thomson’s paper asks a more operationally honest one: how much does a wrong answer, a slow answer, or no answer cost in this specific workflow? The paper’s useful move is to convert competing performance metrics into a single expected dollar reward. Inference cost stays in dollars. Latency gets priced in dollars per second or minute. Errors get priced by their business consequence. Abstention gets priced by the cost of failing to answer or escalating to a human. Once everything is in the same unit, the “best model” is no longer the one that looks attractive on a Pareto plot. It is the model with the highest expected reward under the actual economics of the task. ...

July 8, 2025 · 19 min · Zelina
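
The pricing idea in the teaser above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's formula: reward is taken to be the negative of expected cost per query, and every name and number here (`ModelProfile`, `TaskEconomics`, the two example models and their prices) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    p_correct: float       # probability of a correct answer
    p_abstain: float       # probability of declining to answer
    cost_per_query: float  # inference cost in dollars
    latency_s: float       # response time in seconds

@dataclass
class TaskEconomics:
    error_cost: float          # dollars lost per wrong answer
    abstain_cost: float        # dollars per escalation or non-answer
    latency_cost_per_s: float  # dollars per second of waiting

def expected_reward(m: ModelProfile, e: TaskEconomics) -> float:
    """Expected dollar reward per query, modelled here as negative expected cost."""
    p_error = 1.0 - m.p_correct - m.p_abstain
    expected_cost = (
        m.cost_per_query
        + m.latency_s * e.latency_cost_per_s
        + p_error * e.error_cost
        + m.p_abstain * e.abstain_cost
    )
    return -expected_cost

def best_model(models: list[ModelProfile], e: TaskEconomics) -> ModelProfile:
    """Pick the model with the highest expected reward for this workflow."""
    return max(models, key=lambda m: expected_reward(m, e))

# Made-up profiles: a cheap fast model and a slower, pricier, more accurate one.
models = [
    ModelProfile("small", p_correct=0.90, p_abstain=0.0, cost_per_query=0.01, latency_s=2.0),
    ModelProfile("large", p_correct=0.95, p_abstain=0.0, cost_per_query=0.10, latency_s=10.0),
]
cheap_errors = TaskEconomics(error_cost=1.0, abstain_cost=0.5, latency_cost_per_s=0.001)
costly_errors = TaskEconomics(error_cost=100.0, abstain_cost=5.0, latency_cost_per_s=0.001)
```

With these invented prices, `best_model(models, cheap_errors)` selects the small model and `best_model(models, costly_errors)` selects the large one. Neither model changed; only the cost of being wrong did, which is the point the teaser makes about Pareto plots.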

Brains with Gradients: Why Energy-Based Transformers Might Be the Future of Thinking Machines

TL;DR for operators: Energy-Based Transformers are not another prompt trick, reasoning wrapper, or RL-flavoured attempt to make a chatbot show more homework. They change the model’s job. Instead of directly predicting the next token, frame, or image patch in one forward pass, an EBT learns a scalar energy function that scores whether a candidate prediction is compatible with its context. Lower energy means “this fits better.” Inference then becomes optimisation: start with a rough or random candidate, compute the gradient of the energy with respect to that candidate, and iteratively move toward a lower-energy prediction. ...

July 4, 2025 · 16 min · Zelina
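
The "inference as optimisation" loop in the teaser above looks roughly like this in miniature. It is a toy sketch only: the energy here is a hand-written quadratic standing in for a learned network, the gradient is a finite-difference estimate rather than backpropagation, and all names and step sizes are assumptions chosen for illustration.

```python
def energy(candidate: float, context: float) -> float:
    """Toy stand-in for a learned energy: lower means the candidate fits the context better."""
    return (candidate - context) ** 2

def energy_gradient(candidate: float, context: float, h: float = 1e-5) -> float:
    """Central finite-difference estimate of dE/d(candidate)."""
    return (energy(candidate + h, context) - energy(candidate - h, context)) / (2 * h)

def refine(context: float, start: float, steps: int = 50, step_size: float = 0.1) -> float:
    """Start from a rough candidate and repeatedly step downhill on the energy."""
    candidate = start
    for _ in range(steps):
        candidate -= step_size * energy_gradient(candidate, context)
    return candidate

context = 3.0
start = 0.0  # deliberately poor initial guess
prediction = refine(context, start)
```

The number of refinement steps is the knob that matters operationally: more steps means more compute per prediction, which is how this style of model can spend extra effort on a harder input.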

Anchored Thinking: Mapping the Inner Compass of Reasoning LLMs

TL;DR for operators: The paper’s useful claim is not simply that some chain-of-thought sentences matter more than others. That would be true, mildly interesting, and about as operationally helpful as saying some meetings should have been emails. The sharper claim is that the sentences that steer reasoning are often not the visible calculations. They are planning moves, re-checks, uncertainty statements, and backtracking moments: the places where the model chooses a route, notices a contradiction, or decides to verify a previous result. Bogdan, Macar, Nanda, and Conmy call these pivotal sentences thought anchors. ...

June 25, 2025 · 19 min · Zelina

Proofs and Consequences: How Math Reveals What AI Still Doesn’t Know

TL;DR for operators: Mathematical proof is a nasty evaluation setting for AI systems because it leaves fewer hiding places. A model cannot merely land on a final number; it has to preserve the truth of each step. That is precisely why Guo et al.’s RFMDataset is useful: it tests whether advanced reasoning models can construct complete natural-language proofs, then classifies how they fail when they cannot. ...

June 23, 2025 · 18 min · Zelina