FIRE-BENCH: Playing Back the Tape of Scientific Discovery
Why frontier research agents can write code, run experiments, and still fail at the part of science that actually matters: designing the right evidence and drawing the right conclusion.
A mechanism-first reading of how a reward-free AI agent can develop a slow, history-shaped internal stance—and why the business value is observability, not consciousness theater.
A new BAPO-CoT paper shows why some reasoning tasks cannot be compressed below linear token growth, and why enterprise AI systems need routing, tools, and architecture—not just shorter prompts.
A new benchmark-alignment paper shows how public LLM leaderboards can be reweighted toward downstream preferences—and why that is useful only when the benchmark already contains the right signal.
A paper on inference-time instability shows how token probability logs can reveal when an LLM’s reasoning trajectory is beginning to unravel.
AOrchestra shows that the practical edge in multi-agent systems may come less from adding more agents and more from dynamically composing the right instruction, context, tools, and model for each subtask.
A mechanism-first reading of Conformal Thinking, showing how risk-controlled early stopping turns reasoning budgets from guesswork into an operational error-budget decision.
A mechanism-first reading of why multi-agent LLM systems saturate when agents repeat each other, and why useful diversity beats raw agent count.
Search-R2 shows why reliable retrieval agents need local error repair, not just more search calls or larger rollout budgets.
TodyComm shows why multi-agent AI systems need learned communication governance, not just more agents talking more often.