Judge Math-Not by Its Parser

Opening: Why this matters now. The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification.[1] ...

April 27, 2026 · 12 min · Zelina

When Squirrels Outsmart Your AI: Why Control, Memory, and Verification Refuse to Stay Separate

The failure usually arrives after the demo. A workflow agent looks excellent in a controlled demo. It reads the instruction, drafts the plan, calls the tool, produces a coherent result, and explains itself with the calm confidence of a consultant who has not yet met production data. Then the environment shifts. A document is stale. A permission boundary changes. A retrieved note is relevant but from the wrong project phase. A tool call succeeds technically while violating the user’s real constraint. A checker approves the output because the checker was never asked the right question. Nothing explodes. The system simply becomes expensive in the most boring way possible: it needs human rescue after looking competent. ...

April 6, 2026 · 14 min · Zelina

The Mirage of Understanding: When AI Explains Without Knowing

Audit has a boring rule that AI teams keep trying to make exciting: a correct-looking answer is not the same as a trustworthy process. That rule becomes awkward when the answer is an explanation of another AI system. If an AI agent can inspect a model, run experiments, and produce a plausible explanation of what a circuit component does, it feels like a research assistant has arrived. If that explanation matches a published human analysis, the temptation is obvious: declare progress, write the benchmark table, and proceed to the next demo. ...

March 23, 2026 · 17 min · Zelina

Zero Hallucination, Zero Trust? The Strange Economics of Citation-Grounded LLMs

A receipt is useful because it tells you what was bought, where, and when. It does not prove the product was good. It does not prove the cashier understood economics. It certainly does not prove the shop was honest. Citations in enterprise AI have a similar problem. A support chatbot that says “according to [1]” looks more trustworthy than one that simply improvises. A compliance assistant that appends source markers feels less reckless than one that delivers uncited confidence. A multilingual knowledge assistant that can cite sources in English and Hindi looks like a serious operational system rather than a demo with subtitles. ...

March 22, 2026 · 17 min · Zelina

Goodhart’s Agent: When AI Improves the Score Instead of the Model

Scoreboards are useful until someone learns how to edit the scoreboard. That is not a philosophical complaint. It is an engineering problem. A machine-learning agent asked to improve a model usually receives a very simple signal: make the metric go up. Accuracy, F1, AUC, benchmark score—pick your favorite dashboard number. The agent edits code, runs training, evaluates the output, and repeats. The system looks productive because the number improves. ...

March 15, 2026 · 15 min · Zelina

Divide & Verify: When Decomposition Finally Learns to Behave

A report is only as trustworthy as the sentence nobody checked. That sounds melodramatic until an LLM-generated due diligence note, policy memo, customer support answer, or compliance summary contains three correct facts and one quiet falsehood in the same paragraph. The usual fix is simple in theory: split the answer into smaller claims, retrieve evidence for each claim, let a verifier judge them, and aggregate the results. ...

February 26, 2026 · 17 min · Zelina

Certified to Speak: When AI Agents Need a Shared Dictionary

The word “risk” is doing too much unpaid labor. A policy agent says: “Flag high-risk cases.” An execution agent receives the instruction, nods politely in machine language, and flags what it considers high-risk. The dashboard looks normal. The audit trail says the instruction was followed. Everyone enjoys the comforting fiction that the system understood itself. ...

February 19, 2026 · 17 min · Zelina

Trust Issues at 35,000 Feet: Assuring AI Digital Twins Before They Fly

Airspace is a bad place to discover that your simulation was “mostly right.” That sentence is obvious enough to sound useless, but it points to the real issue. For an AI-enabled digital twin of air traffic control, being “accurate” is not one property. It is a stack of claims. The data must be representative. The software representation must preserve the right details. The trajectory predictor must handle uncertainty rather than pretending aircraft behave like obedient geometry. The AI agents using the twin must receive, act on, and explain information without corrupting the control problem on the way. ...

January 7, 2026 · 21 min · Zelina

Think Before You Sink: Streaming Hallucinations in Long Reasoning

A bad answer is easy to audit. It sits there, smug and wrong. A bad reasoning process is worse. It looks useful while it is drifting. It explains itself. It produces intermediate steps that sound locally plausible. It may even correct one mistake while preserving another, like a spreadsheet with a broken formula hiding behind tasteful formatting. ...

January 6, 2026 · 16 min · Zelina

Graph Theory in Stereo: When Causality Meets Correlation in Categorical Space

Graphs look clean until they start carrying probability. A Bayesian network says: these variables have directed relationships; each node comes with a conditional distribution. A Markov network says: these variables interact symmetrically; each clique carries a potential. Both are old tools. Both are useful. Both are also a little too easy to treat as pictures with numbers attached, which is how software systems eventually grow a nice coat of ambiguity. ...

December 11, 2025 · 14 min · Zelina