AI Agents

When Agents Talk Back: Why AI Collectives Need a Social Theory

Teams are easy to draw and hard to govern. Put five AI agents in a workflow diagram and everything looks reassuringly corporate: one planner, one researcher, one coder, one critic, one manager. Give them arrows. Add a dashboard. Call it orchestration. Investors relax. Engineers nod. Consultants quietly increase the font size on the word “autonomous.” ...

When Goals Collide: Synthesizing the Best Possible Outcome

A robot does not always get the luxury of a clean task list. Reach the loading bay. Avoid blocked corridors. Preserve battery. Pick up two packages. Respect a safety boundary. Finish before the door closes. Then the environment, as environments enjoy doing, changes the rules halfway through. A corridor shuts. A resource disappears. One goal now interferes with another. ...

EvoFSM: Teaching AI Agents to Evolve Without Losing Their Minds

Workflow is the unglamorous part of agentic AI. Which is precisely why it matters. A research agent can have a strong language model, a decent search tool, and an impressive ability to produce paragraphs that sound like a McKinsey intern who drank too much espresso. Yet when the task becomes long, ambiguous, and evidence-heavy, the same agent often fails for a boring reason: it takes the right actions in the wrong order, repeats the same weak search, summarizes too early, forgets to verify a source, or changes its own instructions so enthusiastically that it becomes a different employee halfway through the job. ...

When Agents Learn Without Learning: Test-Time Reinforcement Comes of Age

A team meeting usually ends with someone saying, “Let’s remember this for next time.” Human teams sometimes do. Agent teams usually do not. A group of LLM agents can debate, critique, revise, and produce a final answer. Then the whole episode often disappears into the landfill of inference logs: useful comments, bad guesses, decisive objections, elegant checks, all flattened into “the model answered correctly” or “the model failed.” Very modern. Very wasteful. ...

When Control Towers Learn to Think: Agentic AI Enters the Supply Chain

Control towers are good at showing managers what the company already knows. That is useful. It is also the problem. Most supply-chain control towers watch direct suppliers, shipments, inventory levels, and predefined thresholds. They are strongest when the relevant data has already been structured and admitted into the system. But many serious disruptions begin elsewhere: a Tier-3 materials supplier, a Tier-4 regional dependency, a geopolitical event buried in a news article, or a supplier relationship nobody remembered until the factory schedule started looking nervous. ...

When Interfaces Guess Back: Implicit Intent Is the New GUI Bottleneck

The problem starts with a very ordinary sentence: “Order my usual lunch.” For a human assistant, this sentence is not empty. It carries history. It points to an app, a restaurant, a branch, a meal, maybe a delivery address, maybe a payment method. For a conventional GUI agent, it is a trap wearing casual clothes. ...

Click, Fail, Learn: Why BEPA Might Be the First GUI Agent That Actually Improves

Clicking is easy. Clicking correctly, after the screen has changed, after a pop-up appears, after the previous attempt failed, and after the agent has only fifteen steps before the evaluator gives up: that is where GUI automation stops looking like a demo and starts looking like work. This is the problem behind BEPA, short for Bi-Level Expert-to-Policy Assimilation, introduced in the arXiv paper “From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation” [1]. The paper is about training end-to-end GUI agents, but its practical message is broader: expert workflows are not automatically useful training data. They have to be translated into something the learner can actually perform. ...

TowerMind: When Language Models Learn That Towers Have Consequences

Tower placement is a small decision until it is wrong. In a tower-defense game, a bad tower is not merely an inelegant plan. It is money spent, coverage lost, enemies leaked, and time wasted. The game does not care that the explanation sounded strategic. It only asks whether the tower actually touches the road. ...

When Debate Stops Being a Vote: DynaDebate and the Engineering of Reasoning Diversity

Meeting. Anyone who has sat through a corporate “alignment session” knows the ritual. Three people say nearly the same thing, one person says it more confidently, and the room calls it consensus. The decision looks collaborative. It is often just synchronized hesitation wearing a blazer. Multi-agent debate in AI can fail in a similar way. Add several LLM agents, ask them to debate, and the system may look more robust than a single model. But if all agents begin from nearly the same reasoning path, they may simply repeat the same mistake in different wording. The output becomes a vote over correlated errors. Democracy, but with clones. ...

NPCs With Short-Term Memory Loss: Benchmarking Agents That Actually Live in the World

Minecraft is not the point. That may sound rude to the blocks, but it is the cleanest way to read “MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents” [1]. The paper does use Minecraft. It does study an AI companion agent inside a live game world. It does report that a GPT-4o-powered setup failed on 71 of 216 attempted subtasks, or roughly one-third of them. ...