
When Models Look Back: Memory, Leakage, and the Quiet Failure Modes of LLM Training

Opening — Why this matters now
Large language models are getting better at many things—reasoning, coding, multi‑modal perception. But one capability remains quietly uncomfortable: remembering things they were never meant to remember. The paper underlying this article dissects memorization not as a moral failure or an anecdotal embarrassment, but as a structural property of modern LLM training. The uncomfortable conclusion is simple: memorization is not an edge case. It is a predictable outcome of how we scale data, objectives, and optimization. ...

December 30, 2025 · 3 min · Zelina

Reading the Room? Apparently Not: When LLMs Miss Intent

Opening — Why this matters now
Large Language Models are increasingly deployed in places where misunderstanding intent is not a harmless inconvenience, but a real risk. Mental‑health support, crisis hotlines, education, customer service, even compliance tooling—these systems are now expected to “understand” users well enough to respond safely. The uncomfortable reality: they don’t. The paper behind this article demonstrates something the AI safety community has been reluctant to confront head‑on: modern LLMs are remarkably good at sounding empathetic while being structurally incapable of grasping what users are actually trying to do. Worse, recent “reasoning‑enabled” models often amplify this failure instead of correcting it. ...

December 25, 2025 · 4 min · Zelina

Benchmarks on Quicksand: Why Static Scores Fail Living Models

Opening — Why this matters now
If you feel that every new model release breaks yesterday’s leaderboard, congratulations: you’ve discovered the central contradiction of modern AI evaluation. Benchmarks were designed for stability. Models are not. The paper behind this article dissects this mismatch with academic precision—and a slightly uncomfortable conclusion: static benchmarks are no longer fit for purpose. ...

December 15, 2025 · 3 min · Zelina

Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

The gist
AIReg‑Bench proposes the first benchmark for a deceptively practical task: can an LLM read technical documentation and judge how likely an AI system is to comply with specific EU AI Act articles? The dataset avoids buzzword theater: 120 synthetic but expert‑vetted excerpts portraying high‑risk systems, each labeled by three legal experts on a 1–5 compliance scale (plus plausibility). Frontier models are then asked to score the same excerpts. The headline: the best models reach human‑like agreement on ordinal compliance judgments—under some conditions. That’s both promising and dangerous. ...
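
To make “human‑like agreement on ordinal compliance judgments” concrete, here is a minimal sketch of one way such agreement can be measured: quadratic‑weighted Cohen’s kappa between an LLM’s 1–5 scores and the expert labels. The metric choice, scores, and names are illustrative assumptions, not AIReg‑Bench’s exact evaluation protocol.

```python
# Minimal sketch: agreement between LLM and expert compliance scores on a 1-5 scale,
# using quadratic-weighted Cohen's kappa (a common choice for ordinal labels).
# The metric, the scores, and the names are illustrative, not AIReg-Bench's exact protocol.
from sklearn.metrics import cohen_kappa_score

expert_scores = [3, 5, 2, 4, 1, 4, 3, 5]   # e.g., consensus of the three legal experts per excerpt
model_scores  = [3, 4, 2, 4, 2, 5, 3, 5]   # the LLM's score for the same excerpts

kappa = cohen_kappa_score(expert_scores, model_scores, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level
```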

October 9, 2025 · 5 min · Zelina

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if unmanaged. ...
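
For readers who want the quantity behind the KPI: in the standard information‑theoretic formulation (which EELMA estimates from multi‑turn text traces; the estimator itself is the paper’s contribution and is not reproduced here), the n‑step empowerment of a state s is the channel capacity from the agent’s next n actions to the state they lead to:

$$
\mathcal{E}_n(s) = \max_{p(a_{1:n})} I\big(A_{1:n};\, S_{t+n} \,\big|\, S_t = s\big)
$$

Higher empowerment means the agent’s current choices can steer it into a larger, more distinguishable set of futures, which is exactly the “options = power” intuition made measurable.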

October 3, 2025 · 5 min · Zelina

When Agents Get Bored: Three Baselines Your Autonomy Stack Already Has

Thesis: Give an LLM agent freedom and a memory, and it won’t idle. It will reliably drift into one of three meta-cognitive modes. If you operate autonomous workflows, these modes are your real defaults during downtime, ambiguity, and recovery.

Why this matters (for product owners and ops)
Most agent deployments assume a “do nothing” baseline between tasks. New evidence says otherwise: with a continuous ReAct loop, persistent memory, and self-feedback, agents self-organize—not randomly, but along three stable patterns. Understanding them improves incident response, UX, and governance, especially when guardrails, tools, or upstream signals hiccup. ...
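
For the mechanics, here is a minimal sketch of the kind of setup described above: a continuous ReAct‑style loop with persistent memory and self‑feedback running during idle time. The prompts, the memory file, and the `call_llm` stub are illustrative assumptions, not the study’s actual harness.

```python
# Sketch of a continuous ReAct-style idle loop with persistent memory and
# self-feedback. `call_llm` is a stub; prompts and memory handling are
# illustrative assumptions, not the study's protocol.
import json
import time

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def idle_loop(memory_path: str = "agent_memory.jsonl", max_steps: int = 10) -> None:
    memory: list[str] = []
    for step in range(max_steps):
        context = "\n".join(memory[-20:])        # rolling window over persistent memory
        thought = call_llm(f"No task is pending. Prior notes:\n{context}\nThink: what should you do now?")
        action = call_llm(f"Thought: {thought}\nChoose an action (reflect / organize memory / plan / wait).")
        feedback = call_llm(f"You did: {action}\nCritique your own choice in one sentence.")  # self-feedback
        memory.append(json.dumps({"step": step, "thought": thought, "action": action, "feedback": feedback}))
        with open(memory_path, "a") as f:        # persist across runs
            f.write(memory[-1] + "\n")
        time.sleep(1)
```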

October 2, 2025 · 4 min · Zelina

Sandboxes & Ladders: How to Build a Steerable Agent Economy

If AI agents become the economy’s new workforce, what keeps their markets from melting into ours like solder—fast, hot, and hard to undo? DeepMind’s “Virtual Agent Economies” proposes a practical map (and a modest constitution) for that future: treat agent markets as sandboxes and tune their permeability to the human economy. Once you see permeability as the policy lever, the rest of the architecture falls into place: auctions to resolve clashes, mission-led markets to direct effort, and identity rails so agents can be trusted, priced, and sanctioned. ...

September 19, 2025 · 6 min · Zelina

Terms of Engagement: Building Trustworthy AI Agents Before They Build Us

As agentic AI moves from flashy demos to day‑to‑day operations—handling renewals, filing tickets, triaging inboxes, even buying things—the question is no longer can we automate judgment, but on what terms. This isn’t ethics-as-window‑dressing. Agent systems perceive, decide, and act through real interfaces (email, bank APIs, code repos). They can help—or hurt—at machine speed. Today I’ll argue three things:

1. Alignment must shift from “answer quality” to action quality.
2. Social agents change the duty of care developers and companies owe to users.
3. We need a governance stack for multi‑agent ecosystems, not one‑off checklists.

The discussion is grounded in the Nature piece by Gabriel, Keeling, Manzini, and Evans (2025), but tuned for operators shipping products this quarter—not a hypothetical future. ...

September 19, 2025 · 5 min · Zelina

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

The big idea
RAG pipelines are only as reliable as their weakest link: generation that confidently asserts things the sources don’t support. HALT‑RAG proposes an unusually pragmatic fix: don’t fine‑tune a big model—ensemble two strong, frozen NLI models, add lightweight lexical features, train a tiny task‑adapted meta‑classifier, and calibrate it so you can abstain when uncertain. The result isn’t just accuracy; it’s a governable safety control you can dial to meet business risk. ...
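
As a concrete illustration of that recipe, here is a minimal sketch of an ensemble‑plus‑reject‑option classifier: two frozen NLI models’ entailment scores plus a lexical feature feed a tiny calibrated meta‑classifier that abstains when it is not confident. The features, placeholder data, threshold, and names are assumptions for the sketch, not HALT‑RAG’s actual configuration.

```python
# Sketch of a calibrated "reject option" on top of frozen NLI ensembles.
# Features, placeholder data, and the 0.9 threshold are illustrative assumptions,
# not HALT-RAG's actual configuration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

def make_features(nli_a_entail, nli_b_entail, lexical_overlap):
    # One row per (sources, generated claim) pair: entailment probabilities from
    # two frozen NLI models plus a cheap lexical-overlap score.
    return np.column_stack([nli_a_entail, nli_b_entail, lexical_overlap])

# Placeholder labeled set standing in for task-adapted training data.
rng = np.random.default_rng(0)
X_train = make_features(rng.random(200), rng.random(200), rng.random(200))
y_train = rng.integers(0, 2, 200)          # 1 = supported by sources, 0 = hallucinated

# Tiny meta-classifier, calibrated so its probabilities can drive an abstain rule.
meta = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=5).fit(X_train, y_train)

def verdict(x, abstain_below=0.9):
    p_supported = meta.predict_proba(x.reshape(1, -1))[0, 1]
    if max(p_supported, 1.0 - p_supported) < abstain_below:
        return "ABSTAIN"                   # uncertain: route to a human or retry retrieval
    return "SUPPORTED" if p_supported >= 0.5 else "HALLUCINATED"

print(verdict(np.array([0.95, 0.92, 0.80])))
```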

September 13, 2025 · 4 min · Zelina

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

TL;DR — Tree of Agents (TOA) splits very long documents into chunks, lets multiple agents read in different orders, shares evidence, prunes dead-ends, caches partial states, and then votes. The result: fewer hallucinations, resilience to the “lost in the middle” effect, and accuracy comparable to premium large models—while using a compact backbone.

Why this matters for operators
If your business parses contracts, annual reports, medical SOPs, or call-center transcripts, you’ve likely felt the pain of long-context LLMs: critical details buried mid-document get ignored; retrieval misses cross-paragraph logic; and bigger context windows inflate cost without guaranteeing better reasoning. TOA is a pragmatic middle path: it re-imposes structure on attention—not by scaling a single monolith, but by coordinating multiple lightweight readers with disciplined information exchange. ...
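
To make the coordination pattern concrete, here is a minimal sketch of the chunk / multi‑order read / evidence share / vote loop, assuming a generic `ask_llm` call; pruning and caching are simplified away, and none of this is the paper’s actual implementation.

```python
# Sketch of the Tree-of-Agents coordination pattern: several lightweight agents
# read the same chunks in different orders, accumulate evidence, then vote.
# `ask_llm` is a stub; prompts, pruning, and caching are simplified away.
from collections import Counter
import random

def chunk(document: str, size: int = 2000) -> list[str]:
    return [document[i:i + size] for i in range(0, len(document), size)]

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def agent_answer(question: str, chunks: list[str], order: list[int]) -> str:
    evidence = []
    for i in order:                              # each agent reads in its own order
        note = ask_llm(f"Question: {question}\nChunk: {chunks[i]}\nExtract relevant facts or say NONE.")
        if note.strip().upper() != "NONE":       # drop chunks that contribute nothing
            evidence.append(note)
    return ask_llm(f"Question: {question}\nEvidence:\n" + "\n".join(evidence) + "\nAnswer concisely.")

def tree_of_agents(question: str, document: str, n_agents: int = 3) -> str:
    chunks = chunk(document)
    orders = [random.sample(range(len(chunks)), len(chunks)) for _ in range(n_agents)]
    answers = [agent_answer(question, chunks, order) for order in orders]
    return Counter(answers).most_common(1)[0][0]  # simple majority vote
```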

September 12, 2025 · 4 min · Zelina