
Blinded by Design: When AI Stops Thinking and Starts Remembering

Opening — Why this matters now For the past year, the conversation around AI has quietly shifted. We’re no longer debating whether models are powerful—we’re asking whether they are trustworthy operators inside real workflows. And here lies an uncomfortable truth: when an LLM gives you an answer, you cannot tell whether it came from your data… or from something it remembers. ...

April 8, 2026 · 4 min · Zelina

Claw-Eval — When Agents Game the System, the System Needs Claws

Opening — Why this matters now AI agents have quietly crossed a threshold. They no longer just answer questions—they act. They send emails, call APIs, modify files, orchestrate workflows. In other words, they’ve moved from generating text to generating consequences. And yet, most evaluation methods still behave as if we’re grading essays. That mismatch is no longer academic. It’s operational risk. ...

April 8, 2026 · 5 min · Zelina

From Spreadsheets to Swarms: How Agentic AI Rewrites the Retail Supply Chain

Opening — Why this matters now Retail supply chains are not broken. They are simply overwhelmed. For decades, supermarket operations have scaled by adding more dashboards, more analysts, and more coordination layers. The result is a system that works—until it doesn’t. Demand spikes, supplier delays, or perishable inventory mismatches expose a structural limitation: human coordination does not scale linearly with operational complexity. ...

April 8, 2026 · 5 min · Zelina

Skill Issue or System Design? How LLMs Actually Follow Instructions

Opening — Why this matters now Instruction-following is the quiet backbone of modern AI products. From copilots to autonomous agents, everything hinges on whether a model can do exactly what it was told—not approximately, not creatively, but precisely. And yet, anyone who has deployed LLMs in production knows the uncomfortable truth: they don’t “follow instructions” in any consistent, reliable sense. ...

April 8, 2026 · 5 min · Zelina

The Mirage of Intelligence: When LLMs Learn to Memorize Instead of Think

Opening — Why this matters now The current AI narrative is intoxicated by benchmarks. Models score higher, leaderboards update faster, and each new release claims a marginal gain in “reasoning.” But beneath this steady upward curve lies a quieter, less flattering reality: much of what we call intelligence may simply be structured recall. The paper at hand dissects this illusion with uncomfortable precision. It introduces a mechanism by which large language models (LLMs) appear to improve—not by reasoning better, but by memorizing more efficiently. For businesses deploying AI systems into decision pipelines, this distinction is not academic. It is existential. ...

April 8, 2026 · 3 min · Zelina

When Data Decides What Matters: The Quiet Economics of LLM Data Selection

Opening — Why this matters now The AI industry is currently obsessed with scale — more tokens, larger models, bigger compute budgets. But quietly, a more consequential question is emerging beneath the surface: What if performance is no longer constrained by how much data you have, but by which data you choose? As training costs climb into the hundreds of millions, brute-force scaling is starting to look less like a strategy and more like a tax. The paper challenges this assumption by reframing training not as a data accumulation problem, but as a data allocation problem. ...

April 8, 2026 · 4 min · Zelina

Memory That Actually Remembers: Why MemMachine Signals a Shift in AI Agent Architecture

Opening — Why this matters now Everyone agrees AI agents need memory. Few agree on what kind. The industry’s default answer has been compression: summarize conversations, extract key facts, store structured knowledge, and hope nothing important was lost in translation. It works—until it doesn’t. The moment an agent misremembers a detail, fabricates continuity, or loses temporal context, the illusion of intelligence collapses. ...

April 7, 2026 · 5 min · Zelina

Protocol Over Prompts: Why ANX Rewrites the Rules of AI Agent Interaction

Opening — Why this matters now The industry is currently obsessed with what agents can do. The more uncomfortable question is how they do it. Most AI agent systems today operate like clever improvisers—stringing together prompts, APIs, and UI hacks into something that works most of the time. That’s acceptable for demos. It is not acceptable for production systems handling money, identity, or operations. ...

April 7, 2026 · 5 min · Zelina

QED-Nano: Small Models, Big Proof Energy

Opening — Why this matters now For the past two years, AI progress in mathematics has followed a familiar script: bigger models, better results, less transparency. Proprietary frontier systems quietly crossed into Olympiad-level reasoning, while the rest of the field was left reverse-engineering shadows. Then comes QED-Nano—a 4B-parameter model that politely disrupts that narrative. ...

April 7, 2026 · 4 min · Zelina

The Proof Is in the Instance: Why AI Safety Can’t Be Fully Verified

Opening — Why this matters now AI safety has quietly shifted from a performance problem to a guarantee problem. It’s no longer enough that systems work most of the time; in safety-critical domains, they must work correctly every time. Naturally, the industry response has been to scale verification: more rules, more constraints, more formal checks. If something slips through, the instinct is simple—expand the verifier. ...

April 7, 2026 · 5 min · Zelina