
AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

Opening — Why this matters now
For years, AI progress has been narrated through a familiar ritual: introduce a new benchmark, top it with a new model, declare victory, repeat. But as large language models graduate from single-shot answers to multi-step agentic workflows, that ritual is starting to crack. If AI systems are now expected to design experiments, debug failures, iterate on ideas, and judge their own results, then accuracy on static datasets is no longer the right yardstick. ...

February 9, 2026 · 3 min · Zelina

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

Opening — Why this matters now
Embodied AI is having its deployment moment. Robots are promised for homes, agents for physical spaces, and multimodal models are marketed as finally “understanding” the real world. Yet most of these claims rest on benchmarks designed far away from kitchens, hallways, mirrors, and cluttered tables. This paper makes an uncomfortable point: if you evaluate agents inside the environments they will actually operate in, much of that apparent intelligence collapses. ...

February 7, 2026 · 4 min · Zelina

First Proofs, No Training Wheels

Opening — Why this matters now
AI models are now fluent in contest math, symbolic manipulation, and polished explanations. That’s the easy part. The harder question—the one that actually matters for science—is whether these systems can do research when the answer is not already in the training set. The paper First Proof arrives as a deliberately uncomfortable experiment: ten genuine research-level mathematics questions, all solved by humans, none previously public, and all temporarily withheld from the internet. ...

February 7, 2026 · 3 min · Zelina

AgenticPay: When LLMs Start Haggling for a Living

Opening — Why this matters now
Agentic AI has moved beyond polite conversation. Increasingly, we expect language models to act: negotiate contracts, procure services, choose suppliers, and close deals on our behalf. This shift quietly transforms LLMs from passive tools into economic actors. Yet here’s the uncomfortable truth: most evaluations of LLM agents still resemble logic puzzles or toy auctions. They test reasoning, not commerce. Real markets are messy—private constraints, asymmetric incentives, multi-round bargaining, and strategic patience all matter. The paper behind AgenticPay steps directly into this gap. ...

February 6, 2026 · 4 min · Zelina

When Papers Learn to Draw: AutoFigure and the End of Ugly Science Diagrams

Opening — Why this matters now
AI can already write papers, review papers, and in some cases get papers accepted. Yet one stubborn artifact has remained conspicuously human: the scientific figure. Diagrams, pipelines, conceptual schematics—these are still hand-crafted, visually inconsistent, and painfully slow to produce. For AI-driven research agents, this isn’t cosmetic. It’s a structural failure. ...

February 4, 2026 · 4 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task-reliable way. ...

February 3, 2026 · 4 min · Zelina

When Benchmarks Forget What They Learned

Opening — Why this matters now
Large language models are getting better at everything — or at least that’s what the leaderboards suggest. Yet beneath the glossy scores lies a quiet distortion: many benchmarks are no longer measuring learning, but recall. The paper behind this post dissects the issue with surgical precision, showing how memorization creeps into evaluation pipelines and quietly inflates our confidence in model capability. ...

February 2, 2026 · 3 min · Zelina

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

Opening — Why this matters now
The last year has been crowded with so-called deep research agents. Everyone parallelizes. Everyone fans out queries. Everyone promises doctoral-level synthesis at web speed. And yet, the leaderboard keeps telling an inconvenient story: throwing more parallel agents at a problem does not reliably buy depth. The paper “Deep Researcher with Sequential Plan Reflection and Candidates Crossover” enters this debate with a pointed thesis: research is not a map-reduce problem. If you want insight, you need memory, reflection, and the ability to change your mind mid-flight. ...

January 31, 2026 · 4 min · Zelina

SokoBench: When Reasoning Models Lose the Plot

Opening — Why this matters now
The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption. By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today’s large reasoning models fail not because problems are complex, but because they are long. ...

January 31, 2026 · 3 min · Zelina

Your Agent Remembers—But Can It Forget?

Opening — Why this matters now
As reinforcement learning (RL) systems inch closer to real-world deployment—robotics, autonomous navigation, decision automation—a quiet assumption keeps slipping through the cracks: that remembering is enough. Store the past, replay it when needed, act accordingly. Clean. Efficient. Wrong. The paper Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning dismantles this assumption with surgical precision. Its core claim is blunt: agents that merely retain information fail catastrophically once the world changes. Intelligence, it turns out, depends less on what you remember than on what you are able to forget. ...

January 22, 2026 · 4 min · Zelina