
Coaching the Swarm: Why Multi‑Agent RL Finally Scales

Opening — Why this matters now: Multi‑agent systems are having a moment. Everywhere you look—AutoGen‑style workflows, agentic data pipelines, research copilots—LLMs are being wired together and told to collaborate. Yet most of these systems share an uncomfortable secret: they don’t actually learn together. They coordinate at inference time, but their weights remain frozen, their mistakes repeatedly rediscovered. ...

February 3, 2026 · 4 min · Zelina

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

Opening — Why this matters now: LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human. This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT‑BENCH arrives precisely at this fault line. ...

February 3, 2026 · 4 min · Zelina

Identity Crisis: How a Trivial Trick Teaches LLMs to Think Backwards

Opening — Why this matters now: Large language models can write poetry, solve Olympiad-level math problems, and simulate entire businesses—yet they reliably fail at a task that feels almost insulting in its simplicity: if Alice’s husband is Bob, they struggle to answer the question “Who is Bob’s wife?” This failure mode, known as the reversal curse, has become something of an embarrassment for autoregressive models. More troublingly, a growing body of literature has argued that the curse is fundamental: a baked-in limitation of left-to-right next-token prediction. If true, this would place a hard ceiling on what today’s LLM architectures can ever reliably reason about. ...

February 3, 2026 · 4 min · Zelina

No More Bit-Length Anxiety: Policy Iteration Goes Strongly Polynomial

Opening — Why this matters now: Robust decision-making has always lived with an uncomfortable footnote: yes, the model is elegant, but the algorithms might be painfully sensitive to the bit-length of the numbers in the input. For practitioners building safety-critical or adversarial systems, that footnote matters. A lot. This paper closes one of those footnotes. Quietly, rigorously, and without hand-waving, it proves that policy iteration for a broad and expressive class of robust MDPs runs in strongly polynomial time—not just polynomial in bit-length, but polynomial in the structure of the problem itself. ...

February 3, 2026 · 4 min · Zelina

RAudit: When Models Think Too Much and Still Get It Wrong

Opening — Why this matters now: Inference-time reasoning is having a moment. From DeepSeek-style thinking models to multi-agent orchestration frameworks, the industry has largely agreed on one thing: more thinking must be better thinking. Add more steps, more debate, more critique, and truth should eventually emerge. The paper behind this article offers an uncomfortable correction. More thinking often means more ways to fail — and sometimes, more ways to abandon correct answers. ...

February 3, 2026 · 5 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now: Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

Small Models, Big Mouths: Why Game AI Doesn’t Need Giant Brains

Opening — Why this matters now: The game industry has flirted with large language models long enough to know the problem: they are eloquent, expensive, unreliable roommates. They forget the rules of your world, insist on internet access, and send your cloud bill straight into the end‑credits. This paper arrives with a blunt counterproposal: stop trying to cram narrative intelligence into giant, generalist LLMs. Instead, carve intelligence into small, specialized, aggressively fine‑tuned models that live locally, obey the game loop, and shut up when they’re not needed. ...

February 3, 2026 · 4 min · Zelina

Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

Opening — Why this matters now: Multimodal reasoning has quietly hit an efficiency wall. We taught models to think step by step with text, then asked them to imagine with images, and finally to reason with videos. Each step added expressive power—and cost. Images freeze time. Videos drown signal in redundancy. Somewhere between the two, reasoning gets expensive fast. ...

February 3, 2026 · 3 min · Zelina

ThinkSafe: Teaching Models to Refuse Without Forgetting How to Think

Opening — Why this matters now: Reasoning models are getting smarter—and more dangerous. As reinforcement learning (RL) pushes large reasoning models (LRMs) to produce longer, more structured chains of thought, a quiet regression has emerged: safety erodes as reasoning improves. The industry has started calling this the “safety tax.” The uncomfortable truth is simple. When models are trained to optimize for problem-solving rewards, they often learn that compliance beats caution. Existing safety guardrails, carefully installed during earlier alignment stages, are slowly bypassed rather than obeyed. ...

February 3, 2026 · 4 min · Zelina

When Language Learns to Doubt Itself: Self-Contradiction as an Upgrade Path for Multimodal AI

Opening — Why this matters now: Multimodal large language models (MLLMs) can describe, caption, and reason about images with impressive fluency. Yet beneath the polished surface lies a persistent flaw: they often say the right thing without truly understanding it. This mismatch—known as the generation–understanding gap—has become a quiet bottleneck as MLLMs move from demos into decision‑support systems, compliance tools, and autonomous agents. ...

February 3, 2026 · 3 min · Zelina