Autonomous Agents

When LLMs Lose the Plot: Diagnosing Reasoning Instability at Inference Time

Opening — Why this matters now If you work with large language models long enough, you start noticing a familiar failure mode. The model doesn’t just answer incorrectly—it loses the thread. Halfway through a chain-of-thought, something snaps. The reasoning drifts, doubles back, contradicts itself, and eventually lands somewhere implausible. Traditional evaluation misses this. Accuracy checks only look at the final answer, long after the damage is done. Confidence scores are static and blunt. Multi-sample techniques are expensive and retrospective. What’s missing is a process-level diagnostic—a way to tell, during inference, whether reasoning is stabilizing or quietly unraveling. ...

Conducting the Agents: Why AORCHESTRA Treats Sub-Agents as Recipes, Not Roles

Opening — Why this matters now Agentic systems are quietly hitting a ceiling. As tasks stretch across longer horizons—debugging real codebases, navigating terminals, or stitching together multi-hop web reasoning—the dominant design patterns start to fray. Fixed workflows ossify. Multi-agent chats drown in coordination overhead. Context windows bloat, then rot. AORCHESTRA enters this moment with a subtle but decisive shift: stop treating sub-agents as identities, and start treating them as configurations. ...

Conformal Thinking: Teaching LLMs When to Stop Thinking

Opening — Why this matters now Reasoning models have learned how to think longer. Unfortunately, they have not learned when to stop. Test-time scaling has become the industry’s favorite blunt instrument: allocate more tokens, get better answers—on average. But averages are a luxury in deployment. In production systems, every additional token is a cost, and every premature stop is a risk. The uncomfortable truth is that “adaptive reasoning” merely replaces one opaque knob (token limits) with another (confidence thresholds), without offering a principled way to tune either. ...

More Isn’t Smarter: Why Agent Diversity Beats Agent Count

Opening — Why this matters now Multi-agent LLM systems have quietly become the industry’s favorite way to brute-force intelligence. When one model struggles, the instinct is simple: add more agents. Vote harder. Debate longer. Spend more tokens. And yet, performance curves keep telling the same unflattering story: early gains, fast saturation, wasted compute. This paper asks the uncomfortable question most agent frameworks politely ignore: why does scaling stall so quickly—and what actually moves the needle once it does? The answer, it turns out, has less to do with how many agents you run, and more to do with how different they truly are. ...

When Agents Stop Talking to the Wrong People

Opening — Why this matters now Multi-agent LLM systems are no longer a novelty. They debate, plan, critique, simulate markets, and increasingly make decisions that look uncomfortably close to judgment. Yet as these systems scale, something quietly fragile sits underneath them: who talks to whom, and when. Most multi-agent frameworks still assume that communication is cheap, static, and benign. In practice, it is none of those. Agents drift, hallucinate, fatigue, or—worse—become adversarial while sounding perfectly reasonable. When that happens, fixed communication graphs turn from coordination tools into liability multipliers. ...

When Your Agent Starts Copying Itself: Breaking Conversational Inertia

Opening — Why this matters now Multi-turn agents are supposed to get better with experience. More context, more feedback, more opportunities to adapt. Yet in practice, the opposite often happens. Agents loop. They fixate. They repeat themselves with growing confidence and shrinking effectiveness. This paper puts a name—and a mechanism—on that failure mode: conversational inertia. And more importantly, it shows that the problem is not a lack of information, but too much of the wrong kind. ...

Click Like a Human: Why Avenir-Web Is a Quiet Breakthrough in Web Agents

Opening — Why this matters now For years, autonomous web agents have promised to automate the internet: booking flights, scraping dashboards, configuring enterprise tools, or simply clicking buttons so humans don’t have to. And yet, anyone who has actually tried to deploy one knows the truth—these agents fail in embarrassingly human ways. They get lost. They click the wrong thing. They forget what they were doing halfway through. ...

Click with Confidence: Teaching GUI Agents When Not to Click

Opening — Why this matters now Autonomous GUI agents are finally leaving demos and entering production. They book meetings, fill forms, manage dashboards—and occasionally approve payments they should not. The uncomfortable truth is that one mis-click can be irreversible. Yet most GUI grounding models behave with absolute confidence, even when they are guessing. The paper “SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration” tackles this exact failure mode. Its core argument is simple but sharp: progress in GUI agents is no longer bottlenecked by accuracy alone, but by the absence of calibrated doubt. ...

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

Opening — Why this matters now LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human. This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT-BENCH arrives precisely at this fault line. ...

Identity Crisis: How a Trivial Trick Teaches LLMs to Think Backwards

Opening — Why this matters now Large language models can write poetry, solve Olympiad-level math problems, and simulate entire businesses, yet they reliably fail at a task that feels almost insulting in its simplicity: told that Alice’s husband is Bob, they struggle to answer “Who is Bob’s wife?” This failure mode, known as the reversal curse, has become something of an embarrassment for autoregressive models. More troublingly, a growing body of literature has argued that the curse is fundamental: a baked-in limitation of left-to-right next-token prediction. If true, this would place a hard ceiling on what today’s LLM architectures can ever reliably reason about. ...