Memory Lane Has Potholes: MemFail and the Business of Testing Agent Recall

Memory is where enterprise AI demos go to become operationally embarrassing. In the demo, the assistant remembers that a client prefers concise weekly updates, that a trader avoids high-leverage positions after volatility spikes, or that a procurement manager only approves a supplier when compliance documents are current. In production, the same assistant may remember the attractive half of the fact and quietly lose the condition. It recalls “approves supplier” but forgets “only when compliance documents are current.” Congratulations: the agent has not forgotten. It has remembered dangerously. ...

June 4, 2026 · 15 min · Zelina

Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data

The benchmark score is not the product. The test pipeline is. Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...

June 3, 2026 · 20 min · Zelina

Score and Disorder: Why LLM Reasoning Needs More Than Accuracy

A model review often begins with a spreadsheet. One column says accuracy. Another says cost. A third says latency. Someone asks whether the model is “good enough.” Someone else points at the benchmark score. A decision is made. Procurement smiles. Compliance does not, but compliance rarely smiles anyway. The problem is not that accuracy is useless. The problem is that accuracy is too small a container for the thing businesses actually want from reasoning systems. A final answer can be correct while the route to that answer is unstable, unnecessarily expensive, locally contradictory, or impossible to reproduce under a harmless rewording of the question. That is not a philosophical inconvenience. It is an operational failure mode waiting politely inside a dashboard. ...

June 1, 2026 · 16 min · Zelina

Jailbreak ASR Is Wearing a Costume

The number looked safe. Then someone ran it twice. A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on. Unfortunately, that percentage may be doing more acting than measuring. ...

May 29, 2026 · 14 min · Zelina

Judge Math-Not by Its Parser

The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification. ...

April 27, 2026 · 12 min · Zelina

Cloudy With a Chance of Local Models: When On-Prem AI Starts Beating the API

Server room. That phrase used to sound like a warning label in enterprise AI strategy. If a company wanted serious model capability, the usual advice was simple: use a cloud API, negotiate procurement terms, and pretend the legal team was not reading the data-processing agreement with growing despair. ...

April 23, 2026 · 17 min · Zelina

When Maps Start Thinking: GeoAgentBench and the Audit of Spatial AI

Maps look calm. That is their trick. A finished map gives the impression of order: roads align, polygons close, rivers flow, color ramps behave, labels politely stay out of the way. Behind that calm surface, a GIS workflow is usually a small bureaucratic state: coordinate systems, raster-vector conversions, topology checks, interpolation choices, file paths, layer ordering, and visualization rules all negotiating with one another. One wrong projection, one invalid geometry, one missing intermediate file, and the whole administrative state collapses. It does not collapse poetically. It throws an error. ...

April 16, 2026 · 17 min · Zelina

Benchmarking the Benchmarks: When AI Safety Metrics Stop Meaning Anything

Safety used to sound like a simple procurement question. A vendor says its model is safe. The slide deck has benchmark scores. The scores have respectable names: accuracy, F1, safety score, refusal rate, attack success rate. Everyone nods, because familiar metric names create the soothing illusion that someone has already done the hard work. ...

April 15, 2026 · 16 min · Zelina

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...

April 13, 2026 · 16 min · Zelina

CivBench: When AI Stops Guessing and Starts Planning

Scoreboards are comforting. They reduce a messy contest into one neat line: winner, loser, maybe a score. Executives like them, product teams like them, investors like them, and benchmark dashboards absolutely adore them. Strategy, unfortunately, is rude enough not to fit inside that line. A company can make the right decisions and still lose because the market turns. A trading agent can survive a bad regime by managing exposure well, then look mediocre because the final return is not spectacular. A planning system can stumble into success after making terrible intermediate choices. Outcome-only evaluation is clean, but cleanliness is not the same as truth. It is often just a good-looking loss of information. ...

April 11, 2026 · 17 min · Zelina