
Coaching the Swarm: Why Multi‑Agent RL Finally Scales

Opening — Why this matters now
Multi‑agent systems are having a moment. Everywhere you look—AutoGen‑style workflows, agentic data pipelines, research copilots—LLMs are being wired together and told to collaborate. Yet most of these systems share an uncomfortable secret: they don’t actually learn together. They coordinate at inference time, but their weights remain frozen and their mistakes are repeatedly rediscovered. ...

February 3, 2026 · 4 min · Zelina

RAudit: When Models Think Too Much and Still Get It Wrong

Opening — Why this matters now
Inference-time reasoning is having a moment. From DeepSeek-style thinking models to multi-agent orchestration frameworks, the industry has largely agreed on one thing: more thinking must be better thinking. Add more steps, more debate, more critique, and truth should eventually emerge. The paper behind this article offers an uncomfortable correction. More thinking often means more ways to fail — and sometimes, more ways to abandon correct answers. ...

February 3, 2026 · 5 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Opening — Why this matters now
Large Language Models are increasingly marketed as general problem solvers. They summarize earnings calls, reason about code, and explain economic trends with alarming confidence. But when confronted with time—real, numeric, structured temporal data—that confidence starts to wobble. The TSAQA benchmark arrives at exactly the right moment, not to celebrate LLM progress, but to measure how far they still have to go. ...

February 3, 2026 · 3 min · Zelina

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Opening — Why this matters now
Emotional support from AI has quietly moved from novelty to expectation. People vent to chatbots after work, during grief, and in moments of burnout—not to solve equations, but to feel understood. Yet something subtle keeps breaking trust. The responses sound caring, but they are often wrong in small, revealing ways: the time is off, the location is imagined, the suggestion doesn’t fit reality. Empathy without grounding turns into polite hallucination. ...

February 1, 2026 · 4 min · Zelina

SokoBench: When Reasoning Models Lose the Plot

Opening — Why this matters now
The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption. By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today’s large reasoning models fail not because problems are complex, but because they are long. ...

January 31, 2026 · 3 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Opening — Why this matters now
LLMs have learned to speak fluently. They can reason passably. Some can even plan. Yet most of them remain trapped in an oddly artificial condition: they think, but they cannot act. The latest wave of agent frameworks tries to fix this with tools, APIs, and carefully curated workflows. But a quieter idea is emerging underneath the hype—one that looks less like prompt engineering and more like infrastructure. ...

January 24, 2026 · 4 min · Zelina

When Benchmarks Break: Why Bigger Models Keep Winning (and What That Costs You)

Opening — Why this matters now
Every few months, a new paper reassures us that bigger is better. Higher scores, broader capabilities, smoother demos. Yet operators quietly notice something else: rising inference bills, brittle behavior off-benchmark, and evaluation metrics that feel increasingly ceremonial. This paper arrives right on schedule—technically rigorous, empirically dense, and unintentionally revealing about where the industry’s incentives now point. ...

January 21, 2026 · 3 min · Zelina

Knowing Is Not Doing: When LLM Agents Pass the Task but Fail the World

Opening — Why this matters now
LLM agents are getting disturbingly good at finishing tasks. They click the right buttons, traverse web pages, solve text-based games, and close tickets. Benchmarks applaud. Dashboards glow green. Yet something feels off. Change the environment slightly, rotate the layout, tweak the constraints — and suddenly the same agent behaves like it woke up in a stranger’s apartment. The problem isn’t execution. It’s comprehension. ...

January 15, 2026 · 4 min · Zelina

TowerMind: When Language Models Learn That Towers Have Consequences

Opening — Why this matters now
Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses. ...

January 12, 2026 · 4 min · Zelina