Agentic AI

Cities That Think: Reasoning AI for the Urban Century

Opening: Why this matters now
By 2050, nearly seven out of ten people will live in cities. Yet most urban planning tools today still operate as statistical mirrors, learning from yesterday’s data to predict tomorrow’s congestion. Predictive models can forecast traffic or emissions, but they don’t reason about why or whether those outcomes should occur. The next leap, as argued by Sijie Yang and colleagues in Reasoning Is All You Need for Urban Planning AI, is not more prediction but more thinking. ...

Who Really Runs the Workflow? Ranking Agent Influence in Multi-Agent AI Systems

Opening: Why this matters now
Multi-agent systems, the so-called Agentic AI Workflows, are rapidly becoming the skeleton of enterprise-grade automation. They promise autonomy, composability, and scalability. But beneath this elegant choreography lies a governance nightmare: we often have no idea which agent is actually in charge. Imagine a digital factory of LLMs: one drafts code, another critiques it, a third summarizes results, and a fourth audits everything. When something goes wrong (toxic content, hallucinated outputs, or runaway costs), who do you blame? More importantly, which agent do you fix? ...

The Missing Metric: Measuring Agentic Potential Before It’s Too Late

In the modern AI landscape, models are not just talkers; they are becoming doers. They code, browse, research, and act within complex environments. Yet, while we’ve become adept at measuring what models know, we still lack a clear way to measure what they can become. APTBench, proposed by Tencent Youtu Lab and Shanghai Jiao Tong University, fills that gap: it’s the first benchmark designed to quantify a model’s agentic potential during pre-training, before costly fine-tuning or instruction stages even begin. ...

When Agents Learn to Test Themselves: TDFlow and the Future of Software Engineering

From Coding to Testing: The Shift in Focus
TDFlow, developed by researchers at Carnegie Mellon, UC San Diego, and Johns Hopkins, presents a provocative twist on how we think about AI-driven software engineering. Instead of treating the large language model (LLM) as a creative coder, TDFlow frames the entire process as a test-resolution problem, where the agent’s goal is not to write elegant code, but simply to make the tests pass. ...

Agents, Automata, and the Memory of Thought

If you strip away the rhetoric about “thinking” machines and “cognitive” agents, most of today’s agentic AIs still boil down to something familiar from the 1950s: automata. That’s the thesis of Are Agents Just Automata? by Koohestani et al. (2025), a paper that reinterprets modern agentic AI through the lens of the Chomsky hierarchy—the foundational classification of computational systems by their memory architectures. It’s an argument that connects LLM-based agents not to psychology, but to formal language theory. And it’s surprisingly clarifying. ...

Agents in a Sandbox: Securing the Next Layer of AI Autonomy

The rise of AI agents—large language models (LLMs) equipped with tool use, file access, and code execution—has been breathtaking. But with that power has come a blind spot: security. If a model can read your local files, fetch data online, and run code, what prevents it from being hijacked? Until now, not much. A new paper, Securing AI Agent Execution (Bühler et al., 2025), introduces AgentBound, a framework designed to give AI agents what every other computing platform already has—permissions, isolation, and accountability. Think of it as the Android permission model for the Model Context Protocol (MCP), the standard interface that allows agents to interact with external servers, APIs, and data. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new: it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

Blueprints of Agency: Compositional Machines and the New Architecture of Intelligence

When the term agentic AI is used today, it often conjures images of individual, autonomous systems making plans, taking actions, and learning from feedback loops. But what if intelligence, like biology, doesn’t scale by perfecting one organism — but by building composable ecosystems of specialized agents that interact, synchronize, and co‑evolve? That’s the thesis behind Agentic Design of Compositional Machines — a sprawling, 75‑page manifesto that reframes AI architecture as a modular society of minds, not a monolithic brain. Drawing inspiration from software engineering, systems biology, and embodied cognition, the paper argues that the next generation of LLM‑based agents will need to evolve toward compositionality — where reasoning, perception, and action emerge not from larger models, but from better‑coordinated parts. ...

When the Lab Thinks Back: How LabOS Turns AI Into a True Co-Scientist

When we talk about AI in science, most imaginations stop at the screen: algorithms simulating molecules, predicting reactions, or summarizing literature. But in LabOS, AI finally steps off the screen and into the lab. It doesn’t just compute hypotheses; it helps perform them.
The Missing Half of Scientific Intelligence
For decades, computation and experimentation have formed two halves of discovery: theory and touch, model and pipette. AI has supercharged the former, giving us AlphaFold and generative chemistry, but the physical laboratory has remained stubbornly analog. Robotic automation can execute predefined tasks, yet it lacks situational awareness: it can’t see contamination, notice a wrong reagent, or adapt when a human makes an unscripted move. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR: Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent, a controllable, tool-using version of GSM8K, top models reach only ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: how often an agent deliberately returns to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...