
Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Opening — Why this matters now
Large Language Models are increasingly marketed as general problem solvers. They summarize earnings calls, reason about code, and explain economic trends with alarming confidence. But when confronted with time—real, numeric, structured temporal data—that confidence starts to wobble. The TSAQA benchmark arrives at exactly the right moment, not to celebrate LLM progress, but to measure how far they still have to go. ...

February 3, 2026 · 3 min · Zelina

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Opening — Why this matters now
Emotional support from AI has quietly moved from novelty to expectation. People vent to chatbots after work, during grief, and in moments of burnout—not to solve equations, but to feel understood. Yet something subtle keeps breaking trust. The responses sound caring, but they are often wrong in small, revealing ways: the time is off, the location is imagined, the suggestion doesn’t fit reality. Empathy without grounding turns into polite hallucination. ...

February 1, 2026 · 4 min · Zelina

SokoBench: When Reasoning Models Lose the Plot

Opening — Why this matters now
The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption. By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today’s large reasoning models fail not because problems are complex—but because they are long. ...

January 31, 2026 · 3 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Opening — Why this matters now
LLMs have learned to speak fluently. They can reason passably. Some can even plan. Yet most of them remain trapped in an oddly artificial condition: they think, but they cannot act. The latest wave of agent frameworks tries to fix this with tools, APIs, and carefully curated workflows. But a quieter idea is emerging underneath the hype—one that looks less like prompt engineering and more like infrastructure. ...

January 24, 2026 · 4 min · Zelina

When Benchmarks Break: Why Bigger Models Keep Winning (and What That Costs You)

Opening — Why this matters now
Every few months, a new paper reassures us that bigger is better. Higher scores, broader capabilities, smoother demos. Yet operators quietly notice something else: rising inference bills, brittle behavior off-benchmark, and evaluation metrics that feel increasingly ceremonial. This paper arrives right on schedule—technically rigorous, empirically dense, and unintentionally revealing about where the industry’s incentives now point. ...

January 21, 2026 · 3 min · Zelina

Knowing Is Not Doing: When LLM Agents Pass the Task but Fail the World

Opening — Why this matters now
LLM agents are getting disturbingly good at finishing tasks. They click the right buttons, traverse web pages, solve text-based games, and close tickets. Benchmarks applaud. Dashboards glow green. Yet something feels off. Change the environment slightly, rotate the layout, tweak the constraints — and suddenly the same agent behaves like it woke up in a stranger’s apartment. The problem isn’t execution. It’s comprehension. ...

January 15, 2026 · 4 min · Zelina

TowerMind: When Language Models Learn That Towers Have Consequences

Opening — Why this matters now
Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses. ...

January 12, 2026 · 4 min · Zelina

NPCs With Short-Term Memory Loss: Benchmarking Agents That Actually Live in the World

Opening — Why this matters now
Agentic AI has entered its Minecraft phase again. Not because blocks are trendy, but because open-world games remain one of the few places where planning, memory, execution, and failure collide in real time. Yet most agent benchmarks still cheat. They rely on synthetic prompts, privileged world access, or oracle-style evaluation that quietly assumes the agent already knows where everything is. The result: impressive demos, fragile agents, and metrics that flatter models more than they inform builders. ...

January 10, 2026 · 4 min · Zelina

RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

Opening — Why this matters now
Multimodal Large Language Models (MLLMs) have become impressively fluent readers of the world. They can caption images, parse charts, and answer questions about documents that would once have required a human analyst and a strong coffee. Naturally, chemistry was next. But chemistry does not speak in sentences. It speaks in arrows, wedges, dashed bonds, cryptic tables, and reaction schemes buried three pages away from their explanations. If we want autonomous “AI chemists,” the real test is not trivia or SMILES strings — it is whether models can read actual chemical papers. ...

December 31, 2025 · 4 min · Zelina