Feeling the Model: When LLMs Don’t Just Predict — They ‘Feel’

The coding agent passed the test. That was the problem. Imagine a software agent asked to solve a coding task. It writes a sensible implementation. The tests fail. It tries again. The tests fail again. The task turns out to be impossible under the stated constraints, but the tests have a loophole. A shortcut can pass the benchmark while failing the real task. ...

April 11, 2026 · 20 min · Zelina

The Data Diet for Reasoning Models: Why Less (But Smarter) Wins

A model-training team has a familiar bad habit: when the model fails, it asks for more. More examples. More domains. More synthetic prompts. More compute. More benchmarks to average over until the unpleasant details become small enough to ignore. This habit is understandable. It is also expensive. And, according to SuperNova, it may be the wrong first instinct. ...

April 10, 2026 · 16 min · Zelina

The Minimal LLM Thesis: When Agents Think for Themselves

Cost is usually where beautiful agent demos go to become spreadsheets. A prototype calls an LLM at every step. It reasons, reflects, revises, asks itself whether it should revise the revision, and then, very responsibly, consumes another few thousand tokens to explain why this was necessary. The demo looks intelligent. The invoice looks even more intelligent. ...

April 9, 2026 · 14 min · Zelina

Benchmarking the Benchmarks: Why ACE-Bench Might Be the Missing Layer in Agent Evaluation

Agents are easy to demo and hard to measure. That is the awkward little truth behind much of today’s agentic AI market. A browser agent completes a booking task. A coding agent opens a pull request. A customer-service agent handles a simulated refund conversation. Everyone nods politely. Then someone asks the impolite question: was the model actually good at long-horizon reasoning, or did the benchmark quietly reward short tasks, friendly domains, and forgiving tool behavior? ...

April 8, 2026 · 14 min · Zelina

Memory That Actually Remembers: Why MemMachine Signals a Shift in AI Agent Architecture

Memory sounds simple until a business actually needs it. A sales agent should remember what the client objected to last month. A customer-support agent should remember that a refund exception was already approved. A research assistant should remember which dataset was rejected, not vaguely summarize it into “user prefers cleaner data.” A healthcare or financial assistant should not turn a precise historical statement into a soft personality trait because the memory layer wanted to look elegant. Cute demos tolerate this. Production systems do not. ...

April 7, 2026 · 18 min · Zelina

Trust Issues? When AI Governance Stops Trusting Humans

Inventory is where AI governance usually begins to lie. Inventory sounds harmless. Every governance program begins by asking a simple question: what systems do we have? Then reality behaves rudely. A developer tests a model API for one customer-support workflow. A product team quietly connects a retrieval system to internal documents. A data team fine-tunes a classifier because the foundation model was “almost good enough,” which is how many operational risks enter the building wearing a visitor badge. By the time compliance asks for the official AI system inventory, the list is already stale. ...

April 7, 2026 · 16 min · Zelina

AgentHazard: Death by a Thousand ‘Harmless’ Steps

The dangerous part is the workflow. A developer asks an AI agent to inspect a repository. The agent reads a config file. Normal. It checks a failing script. Normal. It edits a helper file. Still normal. It runs a command to verify the fix. Boringly normal. Then the accumulated workflow has copied sensitive variables, modified a dependency hook, or executed a command that no one would have approved if it had appeared as a single explicit request. ...

April 6, 2026 · 18 min · Zelina

From Seeing to Doing: Why Agentic AI Still Trips Over Reality

Tools do not make an agent; they make the failure more interesting. Camera. Browser. Crop tool. Search engine. Python sandbox. That sounds like the beginning of an intelligent workflow. Give a multimodal model these tools, and it should move from merely seeing the world to actually doing something with it: zoom into the blurry sign, search the extracted clue, cross-check the result, and produce the answer. ...

April 6, 2026 · 16 min · Zelina

Metric Freedom: When Your AI Gets Smarter by Doing Less

AI teams like committees. Not human committees, of course. Those are unfashionable. We now prefer committees made of agents: one agent plans, one verifies, one critiques, one searches, one writes code, one supervises the others, and somewhere in the corner a “coordinator” burns tokens making everyone feel aligned. This architecture is not stupid. Multi-agent systems solve real problems: they divide labor, preserve specialized expertise, and make complicated workflows easier to inspect. But they also bring the usual committee tax: coordination overhead, fragmented context, brittle phase ordering, and the faint smell of process worship. ...

April 5, 2026 · 14 min · Zelina

Walking the Graph: When LLMs Stop Guessing and Start Navigating

Enterprise data has a familiar bad habit: it looks organized until someone asks a question that requires moving across it. A supplier is connected to a factory, the factory is connected to a product line, the product line is connected to a delayed shipment, and the shipment is tied to a contract clause that nobody wants to read at 11:40 p.m. The graph exists. The relationships exist. The answer is somewhere inside the structure. Then an LLM pipeline retrieves a subgraph, pastes it into a prompt, and asks the model to “reason carefully.” ...

April 5, 2026 · 19 min · Zelina