
Follow the Heads, Not the Hype: How LLMs Route Deductive Reasoning

Policy rules are boring until a chatbot applies the wrong one. A customer asks whether they qualify for a refund. The rule says refunds require purchase within 30 days, unused condition, and no prior replacement claim. The model answers confidently. It even writes a neat step-by-step explanation. Wonderful. The explanation looks like reasoning. It may even be correct. ...

June 1, 2026 · 16 min · Zelina

Score and Disorder: Why LLM Reasoning Needs More Than Accuracy

A model review often begins with a spreadsheet. One column says accuracy. Another says cost. A third says latency. Someone asks whether the model is “good enough.” Someone else points at the benchmark score. A decision is made. Procurement smiles. Compliance does not, but compliance rarely smiles anyway. The problem is not that accuracy is useless. The problem is that accuracy is too small a container for the thing businesses actually want from reasoning systems. A final answer can be correct while the route to that answer is unstable, unnecessarily expensive, locally contradictory, or impossible to reproduce under a harmless rewording of the question. That is not a philosophical inconvenience. It is an operational failure mode waiting politely inside a dashboard. ...

June 1, 2026 · 16 min · Zelina

Blame the Blueprint: Why AI Risk Starts in the Architecture

AI risk reviews still tend to begin with comforting questions. Who is the responsible developer? What policy applies? What did the model output? Was the user allowed to ask that? Did the compliance team approve the deployment checklist? Useful questions, certainly. Also slightly late. Two recent arXiv papers point to a less convenient lesson: some AI risks are not merely produced by bad prompts, careless users, malicious deployment, or weak legal controls. They are produced by architecture. One paper shows this at the model-training layer, where Batch Normalization can amplify memorization of atypical samples and increase privacy leakage.¹ The other shows it at the ecosystem layer, where decentralized AI can dissolve the very addressee that conventional governance assumes, forcing governance to move from policy instructions to protocol-level constraints.² ...

May 31, 2026 · 16 min · Zelina

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

A spreadsheet error rarely announces itself with dramatic music. It usually arrives politely. A pricing model gives a clean answer. A compliance calculator writes a confident explanation. A financial assistant produces a neat derivation with enough intermediate steps to look reassuring. The result is formatted, fluent, and possibly wrong. That is the uncomfortable business lesson behind Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges, a 2026 survey of roughly 120 studies on LLM mathematical reasoning.¹ The paper is not introducing one new benchmark, one heroic model, or one more leaderboard trophy to place on the already overcrowded mantelpiece. Its useful contribution is more structural: it connects datasets, representations, training methods, tool use, verifiers, and evaluation metrics into one reasoning pipeline. ...

May 31, 2026 · 14 min · Zelina

High Entropy, Low Drama: The Internal Fingerprint of LLM Reasoning

Debugging a reasoning model usually starts at the wrong end. A model gives a wrong mathematical answer, so we inspect the final output. Then we inspect the chain-of-thought. Then we compare benchmark scores, sample more answers, compute pass rates, and hope the model’s visible reasoning trace tells us what happened inside. This is convenient. It is also a little like diagnosing a factory by reading only the shipping label. ...

May 31, 2026 · 15 min · Zelina

Don’t Just Guard the Door: Jailbreak Safety Needs Checkpoints

A single prompt classifier is an attractive idea because it is simple, cheap, and easy to draw in a system diagram. The user sends a prompt. The guard says safe or unsafe. The model either answers or refuses. Very tidy. Also, increasingly incomplete. ...

May 30, 2026 · 15 min · Zelina

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

May 30, 2026 · 17 min · Zelina

RAG’s Receipt Problem: When Correct Answers Don’t Prove Retrieval

Retrieval-augmented generation has become the respectable outfit enterprise AI wears when it wants to look grounded. Add a document store, retrieve a few passages, attach citations, and the answer suddenly appears more disciplined than a free-floating chatbot. That appearance is useful. It is not proof. ...

May 30, 2026 · 16 min · Zelina

Context Is Not a Costume: Why Strong Agents Still Fail on Contact

The agent looks ready. Then reality answers back. The current AI-agent story is conveniently simple. Take a powerful foundation model, wrap it in tools, give it a workflow, add a polite system prompt, and call the result “ready for deployment.” Reality, as usual, has poor manners. Two recent arXiv papers examine very different agent settings. One studies whether multimodal AI agents can align their behavior with the cognitive age of child users. The other studies whether behavior foundation models for imitation learning can remain robust when the physical dynamics of an environment shift after training. They do not share a benchmark, a model class, or even the same deployment domain. That is precisely why they are useful together. ...

May 29, 2026 · 14 min · Zelina

Think Less, Align Better: The New Economics of AI Reasoning

Enterprise AI is entering its mildly awkward teenage phase: everyone wants intelligence, nobody wants the invoice. For the last two years, much of the AI conversation has revolved around more: more context, more reasoning tokens, more chain-of-thought, more human feedback, more evaluators, more synthetic data, more agents, more dashboards to explain why the agents broke the dashboards. The operating assumption was simple enough: if the model thinks more, explains more, or trains on more feedback, it should perform better. ...

May 9, 2026 · 19 min · Zelina