AI Agents

Forget Me Not: How IterResearch Rebuilt Long-Horizon Thinking for AI Agents

Opening — Why this matters now The AI world has become obsessed with “long-horizon” reasoning—the ability for agents to sustain coherent thought over hundreds or even thousands of interactions. Yet most large language model (LLM) agents, despite their size, collapse under their own memory. The context window fills, noise piles up, and coherence suffocates. Alibaba’s IterResearch tackles this problem not by extending memory—but by redesigning it. ...

Touch Intelligence: How DigiData Trains Agents to Think with Their Fingers

Opening — Why this matters now In 2025, AI agents are no longer confined to text boxes. They’re moving across screens—scrolling, tapping, and swiping their way through the digital world. Yet the dream of a truly general-purpose mobile control agent—an AI that can use your phone like you do—has remained out of reach. The problem isn’t just teaching machines to see buttons; it’s teaching them to understand intent. ...

Thinking Fast and Flowing Slow: Real-Time Reasoning for Autonomous Agents

Opening — Why this matters now AI agents are getting smarter—but not faster. Most large language model (LLM) systems still behave like cautious philosophers in a chess match: the world patiently waits while they deliberate. In the real world, however, traffic lights don’t freeze for an AI car mid-thought, and market prices don’t pause while a trading agent reasons about “the optimal hedge.” The new study Real-Time Reasoning Agents in Evolving Environments by Wen et al. (2025) calls this out as a fundamental flaw in current agent design—and offers a solution that blends human-like intuition with deliberative reasoning. ...

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

Opening — Why this matters now AI agents can code, search, analyze data, and even plan holidays. But when the clock starts ticking, they often stumble. The latest benchmark from Shanghai Jiao Tong University — TPS-Bench (Tool Planning and Scheduling Benchmark) — measures whether large language model (LLM) agents can not only choose the right tools, but also use them efficiently in multi-step, real-world scenarios. The results? Let’s just say most of our AI “assistants” are better at thinking than managing their calendars. ...

When the Sandbox Thinks Back: Training AI Agents in Simulated Realities

Opening — Why this matters now The AI industry has a curious paradox: we can train models to reason at Olympiad level, but they still fumble at booking flights or handling a spreadsheet. The problem isn’t intelligence—it’s context. Agents are trained in narrow sandboxes that don’t scale, breaking the moment the environment changes. Microsoft and the University of Washington’s Simia framework tackles this bottleneck with a provocative idea: what if the agent could simulate its own world? ...

The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Opening — Why this matters now The AI world is obsessed with benchmarks. From math reasoning to coding, each new test claims to measure progress. Yet, none truly capture what businesses need from an agent — a system that doesn’t just talk, but actually gets things done. Enter Toolathlon, the new “decathlon” for AI agents, designed to expose the difference between clever text generation and real operational competence. In a world where large language models (LLMs) are being marketed as digital employees, Toolathlon arrives as the first test that treats them like one. Can your AI check emails, update a Notion board, grade homework, and send follow-up messages — all without breaking the workflow? Spoiler: almost none can. ...

From Prototype to Profit: How IBM's CUGA Redefines Enterprise Agents

When AI agents first emerged as academic curiosities, they promised a future of autonomous systems capable of navigating apps, websites, and APIs as deftly as humans. Yet most of these experiments never left the lab. The jump from benchmark to boardroom—the point where AI must meet service-level agreements, governance rules, and cost-performance constraints—remained elusive. IBM’s recent paper, From Benchmarks to Business Impact, finally brings data to that missing bridge. The Benchmark Trap Generalist agents such as AutoGen, LangGraph, and Operator have dazzled the research community with their ability to orchestrate tasks across multiple tools. But academic triumphs often hide operational fragility. Benchmarks like AppWorld or WebArena measure intelligence; enterprises measure ROI. They need systems that are reproducible, auditable, and policy-compliant—not just clever. ...

The Esperanto of AI Agents: How the Agent Data Protocol Unifies a Fragmented Ecosystem

The Problem of Fragmented Agent Intelligence Building large language model (LLM) agents has long been haunted by a quiet paradox. Despite a growing number of agent datasets—from web navigation to software engineering—researchers rarely fine-tune their models across these diverse sources. The reason is not a shortage of data, but a lack of coherence: every dataset speaks its own dialect. One uses HTML trees; another records API calls; a third logs terminal sessions. Converting them all for fine-tuning an agent is a nightmare of custom scripts, mismatched schemas, and endless validation. ...

Fast but Flawed: What Happens When AI Agents Try to Work Like Humans

AI’s impact on the workforce is no longer a speculative question—it’s unfolding in real time. But how do AI agents actually perform human work? A new study from Carnegie Mellon and Stanford, “How Do AI Agents Do Human Work?”, offers the first large-scale comparison of how humans and AI complete the same tasks across five essential skill domains: data analysis, engineering, computation, writing, and design. The findings are both promising and unsettling, painting a nuanced picture of a workforce in transition. ...

Promptfolios: When Buffett Becomes a System Prompt

TL;DR A fresh study builds five prompt‑guided LLM agents—each emulating a legendary investor (Buffett, Graham, Greenblatt, Piotroski, Altman)—and backtests them on NASDAQ‑100 stocks from Q4 2023 to Q2 2025. Each agent follows a deterministic pipeline: collect metrics → score → construct a weighted portfolio. The Buffett agent tops the pack with ~42% CAGR, beating the NASDAQ‑100 and S&P 500 benchmarks in the window tested. The result isn’t “LLMs discovered alpha,” but rather: prompts can reliably translate qualitative philosophies into reproducible, quantitative rules. The real opportunity for practitioners is governed agent design—measurable, auditable prompts tied to tools—plus robust validation far beyond a single bullish regime. ...