
First Proofs, No Training Wheels

AI models are now fluent in contest math, symbolic manipulation, and polished explanations. That’s the easy part. The harder question—the one that actually matters for science—is whether these systems can do research when the answer is not already in the training set. The paper First Proof arrives as a deliberately uncomfortable experiment: ten genuine research-level mathematics questions, all solved by humans, none previously public, and all temporarily withheld from the internet. ...

February 7, 2026 · 3 min · Zelina

FIRE-BENCH: Playing Back the Tape of Scientific Discovery

Agentic AI has entered its confident phase. Papers, demos, and product pitches increasingly imply that large language model (LLM)–powered agents can already “do research”: formulate hypotheses, run experiments, and even write papers end to end. The uncomfortable question is not whether they look busy, but whether they actually rediscover truth. ...

February 5, 2026 · 4 min · Zelina

When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

Benchmarks were supposed to be neutral referees. Instead, they’ve become unreliable narrators. Over the past two years, the gap between benchmark leadership and real-world usefulness has widened into something awkwardly visible. Models that dominate leaderboards frequently underperform in deployment. Smaller, specialized models sometimes beat generalist giants where it actually counts. Yet our evaluation rituals have barely changed. ...

February 5, 2026 · 4 min · Zelina

Coaching the Swarm: Why Multi‑Agent RL Finally Scales

Multi‑agent systems are having a moment. Everywhere you look—AutoGen‑style workflows, agentic data pipelines, research copilots—LLMs are being wired together and told to collaborate. Yet most of these systems share an uncomfortable secret: they don’t actually learn together. They coordinate at inference time, but their weights remain frozen, their mistakes repeatedly rediscovered. ...

February 3, 2026 · 4 min · Zelina

RAudit: When Models Think Too Much and Still Get It Wrong

Inference-time reasoning is having a moment. From DeepSeek-style thinking models to multi-agent orchestration frameworks, the industry has largely agreed on one thing: more thinking must be better thinking. Add more steps, more debate, more critique, and truth should eventually emerge. The paper behind this article offers an uncomfortable correction. More thinking often means more ways to fail, and sometimes more ways to abandon correct answers. ...

February 3, 2026 · 5 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Large Language Models are increasingly marketed as general problem solvers. They summarize earnings calls, reason about code, and explain economic trends with alarming confidence. But when confronted with time—real, numeric, structured temporal data—that confidence starts to wobble. The TSAQA benchmark arrives at exactly the right moment, not to celebrate LLM progress, but to measure how far they still have to go. ...

February 3, 2026 · 3 min · Zelina

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Emotional support from AI has quietly moved from novelty to expectation. People vent to chatbots after work, during grief, and in moments of burnout—not to solve equations, but to feel understood. Yet something subtle keeps breaking trust. The responses sound caring, but they are often wrong in small, revealing ways: the time is off, the location is imagined, the suggestion doesn’t fit reality. Empathy without grounding turns into polite hallucination. ...

February 1, 2026 · 4 min · Zelina

SokoBench: When Reasoning Models Lose the Plot

The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption. By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today’s large reasoning models fail not because problems are complex, but because they are long. ...

January 31, 2026 · 3 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

LLMs have learned to speak fluently. They can reason passably. Some can even plan. Yet most of them remain trapped in an oddly artificial condition: they think, but they cannot act. The latest wave of agent frameworks tries to fix this with tools, APIs, and carefully curated workflows. But a quieter idea is emerging underneath the hype—one that looks less like prompt engineering and more like infrastructure. ...

January 24, 2026 · 4 min · Zelina