AI Agents

Agents Assemble: When Multi-Agent LLMs Stop Hallucinating and Start Doing Science

A scientist does not usually fail because they cannot ask the right question. More often, they fail because the useful answer is buried behind five separate systems: a biomedical knowledge graph, a disease-module algorithm, a drug-prioritization method, a literature database, and a visualization tool that looks innocent until someone has to configure it. ...

Counterfactuals Unchained: How Causality Escapes Its Own Models

A loan is rejected. Now explain why. A borrower is rejected by an automated lending system. The compliance team asks a simple question: What caused the rejection? A naïve answer points to a variable: low income, high debt ratio, thin credit history, missing documentation, or some equally respectable-looking field in the model. A better answer asks what would have happened if that variable had changed. A still better answer asks which surrounding facts must be held fixed while we imagine that change. ...

Mind the Markov Gap: How a Lightweight Agent Outsmarts Heavy LLMs in Open-Vocabulary Vision

A camera on a factory line does not need to write an essay before deciding whether a part is cracked. That sounds obvious. Yet a surprising amount of recent AI architecture quietly assumes the opposite: when vision systems become uncertain, bring in a large language model, ask it to generate richer descriptions, then run the detector again. Sometimes this works. It also turns a detection problem into a small committee meeting, and committee meetings are rarely known for real-time throughput. ...

Storm-Chasing Agents: How EWE Turns Extreme Weather into Actionable Intelligence

Storms are easy to see after they arrive. The harder question is what actually made them happen. That distinction sounds academic until money enters the room. An insurer wants to know whether an event belongs to a changing regional risk pattern. A grid operator wants to understand whether a heatwave was driven by persistent blocking, moisture transport, or local feedback. A government agency wants a report fast enough to support preparedness, not just a polished explanation three months later. The weather event is visible. The mechanism is expensive. ...

Tile by Tile: Why LLMs Still Can't Plan Their Way Out of a 3×3 Box

A board game should not embarrass a frontier model. That is the uncomfortable charm of the 8-puzzle. It has no hidden information, no vague user intent, no messy database schema, no ambiguous policy exception, and no client saying “just make it pop.” It is a 3×3 grid with eight tiles and one blank space. Slide adjacent tiles into the blank. Reach the goal state. Done. ...

Prints Charming: How Reward Models Finally Got Serious About Long-Horizon Reasoning

Search looks simple until it becomes a workflow. A human analyst can open ten tabs, notice which source contradicts which, remember that one earlier search result changed the meaning of the question, and decide whether the next move should be another search, a calculation, or a final answer. An LLM agent can also open tabs, call tools, browse pages, run code, and produce a final answer. The difference is that the agent often does all of this with the discipline of a caffeinated intern who has been told that “more context” is the same thing as “better memory.” ...

Hierarchy, Not Hype: Why Domain Logic Beats Agent Chaos

Workflow is where agent demos go to die. A user asks for something that sounds simple: “Assess flood damage in this coastal district after the typhoon.” The agent smiles, metaphorically, and begins its little ritual. It searches, summarizes, calls a tool, thinks again, calls another tool, corrects itself, forgets one preprocessing step, invents a plausible shortcut, then produces a confident final answer that looks fine until someone who actually understands geospatial analysis asks an inconvenient question: where did the corrected satellite imagery come from? ...

Mind Over Matter: How a BDI Ontology Gives AI Agents an Actual Inner Life

Workflow agents are easy to admire until someone asks a rude but necessary question: why did the agent do that? Not “what prompt did we send?” Not “which tool did it call?” Not “can we replay the logs and hope the compliance team loses interest?” The real question is sharper: what did the agent believe, what did it want, what did it commit to doing, which plan did that commitment specify, and what evidence justified the transition from one step to the next? ...

Practice Makes Agents: How DPPO Turns Failure into Embodied Intelligence

Robots do not fail gracefully. They misread the scene, choose the wrong object, skip a physical constraint, hallucinate a plan, or produce a confident answer that would make a warehouse supervisor quietly unplug something expensive. The usual response is more data. More robot trajectories. More simulation. More web video. More carefully labelled examples. More of the industrial-scale data plumbing that makes everyone feel productive until the model still cannot decide whether a cup should be placed inside the tray or beside it. ...

Diversity Pays: Why AI Research Agents Need More Than One Good Idea

Budget has a way of making AI agents less magical. On a slide, an AI research agent looks like a neat loop: read the task, propose an idea, write code, run an experiment, improve, repeat. In production, it looks more like a slightly caffeinated junior researcher with terminal access: sometimes brilliant, sometimes stubborn, and occasionally determined to spend four hours failing at the same doomed approach because the first idea sounded respectable. ...