FIRE-BENCH: Playing Back the Tape of Scientific Discovery

Why frontier research agents can write code, run experiments, and still fail at the part of science that actually matters: designing the right evidence and drawing the right conclusion.

February 5, 2026 · 14 min · Zelina

Perspective Without Rewards: When AI Develops a Point of View

A mechanism-first reading of how a reward-free AI agent can develop a slow, history-shaped internal stance—and why the business value is observability, not consciousness theater.

February 5, 2026 · 14 min · Zelina

Thinking Isn’t Free: Why Chain-of-Thought Hits a Hard Wall

A new BAPO-CoT paper shows why some reasoning tasks cannot be compressed below linear token growth, and why enterprise AI systems need routing, tools, and architecture—not just shorter prompts.

February 5, 2026 · 15 min · Zelina

When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

A new benchmark-alignment paper shows how public LLM leaderboards can be reweighted toward downstream preferences—and why that is useful only when the benchmark already contains the right signal.

February 5, 2026 · 16 min · Zelina

When LLMs Lose the Plot: Diagnosing Reasoning Instability at Inference Time

A paper on inference-time instability shows how token probability logs can reveal when an LLM’s reasoning trajectory is beginning to unravel.

February 5, 2026 · 12 min · Zelina

Conducting the Agents: Why AOrchestra Treats Sub-Agents as Recipes, Not Roles

AOrchestra shows that the practical edge in multi-agent systems may come less from adding agents than from dynamically composing the right instruction, context, tools, and model for each subtask.

February 4, 2026 · 14 min · Zelina

Conformal Thinking: Teaching LLMs When to Stop Thinking

A mechanism-first reading of Conformal Thinking, showing how risk-controlled early stopping turns reasoning budgets from guesswork into an operational error-budget decision.

February 4, 2026 · 17 min · Zelina

More Isn’t Smarter: Why Agent Diversity Beats Agent Count

A mechanism-first reading of why multi-agent LLM systems saturate when agents repeat each other, and why useful diversity beats raw agent count.

February 4, 2026 · 16 min · Zelina

Search-R2: When Retrieval Learns to Admit It Was Wrong

Search-R2 shows why reliable retrieval agents need local error repair, not just more search calls or larger rollout budgets.

February 4, 2026 · 16 min · Zelina

When Agents Stop Talking to the Wrong People

TodyComm shows why multi-agent AI systems need learned communication governance, not just more agents talking more often.

February 4, 2026 · 15 min · Zelina