Veto Later, Repair First

TL;DR for operators: Most decision systems treat hard constraints like a trapdoor. Candidate violates requirement, candidate disappears. Efficient, clean, and occasionally absurd. The paper behind Repair-Augmented Constraint Learning, or RACL, argues that this is the wrong semantics for systems that already know how to modify an option before showing it to the user.[1] A flight missing a checked bag, a hotel missing breakfast, a product bundle missing an accessory, or a schedule slot needing a resource adjustment may not be a bad option. It may be a good option one repair away from being acceptable. ...
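
To make the trapdoor-versus-repair contrast concrete, here is a minimal Python sketch. It is not RACL itself and contains no learning step; it only illustrates the difference between vetoing an option outright and repairing it first. The `Flight` class, the `BAG_FEE` value, and all function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, replace

# Hypothetical option type: a flight that may or may not include a checked bag.
@dataclass(frozen=True)
class Flight:
    name: str
    price: float
    checked_bag: bool

BAG_FEE = 30.0  # assumed cost of adding a checked bag (illustrative only)

def hard_filter(options):
    """Trapdoor semantics: any option violating the constraint disappears."""
    return [o for o in options if o.checked_bag]

def repair(option):
    """Add the missing bag at a cost instead of discarding the option."""
    if option.checked_bag:
        return option
    return replace(option, price=option.price + BAG_FEE, checked_bag=True)

def repair_then_filter(options):
    """Repair first, then apply the same constraint and rank by final price."""
    repaired = [repair(o) for o in options]
    return sorted((o for o in repaired if o.checked_bag), key=lambda o: o.price)

options = [Flight("A", 120.0, False), Flight("B", 180.0, True)]
vetoed = hard_filter(options)            # only B survives
repaired = repair_then_filter(options)   # A survives too, at its repaired price
```

Under these assumed numbers, the hard filter drops flight A even though A with a bag added is still cheaper than B, which is the kind of outcome the teaser calls occasionally absurd.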

June 26, 2026 · 20 min · Zelina

Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators: AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.[1][2] ...

June 15, 2026 · 14 min · Zelina

Trust Me, I’m Benchmarked: Why Enterprise AI Needs Two Audits

Enterprise AI has developed two favorite comfort blankets: the model’s confident explanation and the benchmark score. The first says, “Relax, I reasoned through this.” The second says, “Relax, I scored well on a public test.” Both are useful. Neither is a warranty. And when business teams treat either as proof of reliability, the result is not governance. It is theatre with better typography. ...

June 10, 2026 · 14 min · Zelina

Prompt and Circumstance: Why One Accuracy Number Is Not a Reliability Audit

Opening: Why this matters now. The AI market has learned to worship benchmark tables with the solemnity once reserved for quarterly earnings. One model is up two points on MMLU, another is slightly better at reasoning, a third is cheaper, smaller, faster, and therefore apparently ready to run your compliance workflow by Tuesday. ...

May 7, 2026 · 14 min · Zelina

Forecasting the Forecast: Why Agentic AI Is Learning to Doubt Itself

Forecasting is where executive optimism goes to be measured. A sales team says the pipeline is healthy. A policy team says the election risk is manageable. A trading desk says the market has mostly priced in the event. Everyone has a probability. Few people have a disciplined process for updating it. That is also the problem with many AI forecasters. They can produce a number quickly, sometimes impressively, sometimes with the emotional stability of a quarterly sales forecast. But the harder question is not whether an AI can answer, “What is the probability?” The harder question is whether it can revise that probability as evidence arrives, remember why it changed its mind, and avoid turning a confidence score into decorative typography. ...

April 23, 2026 · 18 min · Zelina

Batch of Thought, Not Chain of Thought: Why LLMs Reason Better Together

Fraud review is not a solo sport. A risk analyst looking at one suspicious seller can notice a strange product description, a vague company name, or a price range that feels wrong. But the real signal often appears only when several sellers are placed side by side. One shop looks unusual. Ten shops with the same naming pattern, same product mismatch, and same pricing behavior start to look less like noise and more like a system. ...

January 7, 2026 · 17 min · Zelina

Confidence, Not Confidence Tricks: Statistical Guardrails for Generative AI

A product team launches an AI assistant. The demo works. The benchmark looks respectable. The model even says “I’m confident” with the serene authority of a consultant who has never owned a pager. Then the real users arrive. Some ask ambiguous questions. Some ask adversarial questions. Some ask perfectly normal questions that happen to sit outside the model’s competence. The assistant still answers. Sometimes it refuses too often. Sometimes it refuses too late. Sometimes its confidence score is less a forecast and more a decorative sticker. ...

September 13, 2025 · 14 min · Zelina

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

RAG systems usually fail in a very business-like way: not with drama, but with confident paperwork. The retriever finds something. The generator writes something. The user sees an answer that looks plausible, well formatted, and sufficiently certain to be dangerous. Then someone asks the dull but expensive question: did the answer actually follow from the source? ...

September 13, 2025 · 11 min · Zelina

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

TL;DR for operators: Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation? ...

September 1, 2025 · 19 min · Zelina

Signed, Sealed, Delivered: A Rough Path to Better Volatility Models

TL;DR for operators: Options calibration has a familiar operational problem: the model that is fast enough to run every day is usually the model that assumes the market is behaving politely. The market, naturally, has other hobbies. This paper compares two ways of calibrating implied volatility surfaces. The first is the classical route: use model-specific analytical approximations for Heston and rough Bergomi. The second is the rough-path route: represent volatility as a linear functional of the truncated signature of a primary stochastic process.[1] ...
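
The rough-path idea in the last sentence can be written down schematically. The notation below is an assumption made for illustration, not taken from the paper: $X$ is the primary process, $\mathbb{X}^{\le N}_{0,t}$ is its signature over $[0,t]$ truncated at level $N$, $w$ ranges over words of length at most $N$, and $\ell_w$ are the coefficients to be calibrated.

```latex
\sigma_t \;=\; \big\langle \ell,\ \mathbb{X}^{\le N}_{0,t} \big\rangle
\;=\; \sum_{|w| \le N} \ell_w \, \mathbb{X}^{w}_{0,t}
```

In this form the model is linear in the coefficients $\ell_w$, so calibration amounts to choosing a finite coefficient vector.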

August 3, 2025 · 15 min · Zelina