
AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

Opening — Why this matters now
For years, AI progress has been narrated through a familiar ritual: introduce a new benchmark, top it with a new model, declare victory, repeat. But as large language models graduate from single-shot answers to multi-step agentic workflows, that ritual is starting to crack. If AI systems are now expected to design experiments, debug failures, iterate on ideas, and judge their own results, then accuracy on static datasets is no longer the right yardstick. ...

February 9, 2026 · 3 min · Zelina

DeltaEvolve: When Evolution Learns Its Own Momentum

Opening — Why this matters now
LLM-driven discovery systems have crossed an uncomfortable threshold. They no longer fail because models cannot generate ideas, but because they cannot remember the right things. AlphaEvolve, FunSearch, and their successors proved that iterative code evolution works. What they also revealed is a structural bottleneck: context windows are finite, expensive, and poorly used. ...

February 5, 2026 · 4 min · Zelina

Search-R2: When Retrieval Learns to Admit It Was Wrong

Opening — Why this matters now
Search-integrated LLMs were supposed to be the antidote to hallucination. Give the model tools, give it the web, let it reason step by step—problem solved. Except it wasn’t. What we actually built were agents that search confidently, reason eloquently, and fail quietly. One bad query early on, one misleading paragraph retrieved at the wrong moment, and the whole reasoning chain collapses—yet reinforcement learning still rewards it if the final answer happens to be right. ...

February 4, 2026 · 4 min · Zelina

Coaching the Swarm: Why Multi‑Agent RL Finally Scales

Opening — Why this matters now
Multi‑agent systems are having a moment. Everywhere you look—AutoGen‑style workflows, agentic data pipelines, research copilots—LLMs are being wired together and told to collaborate. Yet most of these systems share an uncomfortable secret: they don’t actually learn together. They coordinate at inference time, but their weights remain frozen and their mistakes are repeatedly rediscovered. ...

February 3, 2026 · 4 min · Zelina

FadeMem: When AI Learns to Forget on Purpose

Opening — Why this matters now
The race to build smarter AI agents has mostly followed one instinct: remember more. Bigger context windows. Larger vector stores. Ever-growing retrieval pipelines. Yet as agents move from demos to long-running systems—handling days or weeks of interaction—this instinct is starting to crack. More memory does not automatically mean better reasoning. In practice, it often means clutter, contradictions, and degraded performance. Humans solved this problem long ago, not by remembering everything, but by forgetting strategically. ...

February 1, 2026 · 4 min · Zelina

MemCtrl: Teaching Small Models What *Not* to Remember

Opening — Why this matters now
Embodied AI is hitting a very human bottleneck: memory. Not storage capacity, not retrieval speed—but judgment. Modern multimodal large language models (MLLMs) can see, reason, and act, yet when deployed as embodied agents they tend to remember too much, too indiscriminately. Every frame, every reflection, every redundant angle piles into context until the agent drowns in its own experience. ...

January 31, 2026 · 4 min · Zelina

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

Opening — Why this matters now
The last year has been crowded with so-called deep research agents. Everyone parallelizes. Everyone fans out queries. Everyone promises doctoral-level synthesis at web speed. And yet, the leaderboard keeps telling an inconvenient story: throwing more parallel agents at a problem does not reliably buy depth. The paper “Deep Researcher with Sequential Plan Reflection and Candidates Crossover” enters this debate with a pointed thesis: research is not a map-reduce problem. If you want insight, you need memory, reflection, and the ability to change your mind mid-flight. ...

January 31, 2026 · 4 min · Zelina

Optimizing Agentic Workflows: When Agents Learn to Stop Thinking So Much

Opening — Why this matters now
Agentic AI is finally escaping the demo phase and entering production. And like most things that grow up too fast, it’s discovering an uncomfortable truth: thinking is expensive. Every planning step, every tool call, every reflective pause inside an LLM agent adds latency, cost, and failure surface. When agents are deployed across customer support, internal ops, finance tooling, or web automation, these inefficiencies stop being academic. They show up directly on the cloud bill—and sometimes in the form of agents confidently doing the wrong thing. ...

January 30, 2026 · 4 min · Zelina

World Models Meet the Office From Hell

Opening — Why this matters now
Enterprise AI has entered an awkward phase. On paper, frontier LLMs can reason, plan, call tools, and even complete multi-step tasks. In practice, they quietly break things. Not loudly. Not catastrophically. Just enough to violate a policy, invalidate a downstream record, or trigger a workflow no one notices until audit season. ...

January 30, 2026 · 4 min · Zelina

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Opening — Why this matters now
LLMs have learned to speak fluently. They can reason passably. Some can even plan. Yet most of them remain trapped in an oddly artificial condition: they think, but they cannot act. The latest wave of agent frameworks tries to fix this with tools, APIs, and carefully curated workflows. But a quieter idea is emerging underneath the hype—one that looks less like prompt engineering and more like infrastructure. ...

January 24, 2026 · 4 min · Zelina