LLM Agents

Agents of Disruption: How LLMs Became Adversarial Testers for Autonomous Driving

The promise of fully autonomous vehicles hinges on their ability to handle not just the average drive—but the unexpected. Yet, creating rare, safety-critical scenarios for testing autonomous driving (AD) systems has long been a bottleneck. Manual scene creation doesn’t scale. Generative models often drift away from real-world distributions. And collecting edge cases on the road? Too dangerous, too slow. Enter AGENTS-LLM, a deceptively simple yet powerful framework that uses Large Language Models (LLMs) not to solve traffic scenes, but to break them. The twist? These aren’t just static prompts or synthetic scripts. AGENTS-LLM organizes LLMs into a multi-agent, modular system that modifies real traffic scenarios with surgical precision—making them trickier, nastier, and far more useful for evaluating planning systems. ...

Personas with Purpose: How TinyTroupe Reimagines Multiagent Simulation

If you’ve ever tried to simulate user behavior using LLMs, you’ve probably noticed the same frustrating pattern: the agents are too polite, too helpful, and too similar. They lack the kind of quirks, inconsistencies, and contextually grounded views that make real people interesting—and unpredictable. Enter TinyTroupe, Microsoft’s new open-source toolkit that flips the script on LLM-agent design. Instead of building yet another task-oriented assistant or collaborative workflow bot, TinyTroupe takes the form of a behavioral simulation laboratory. It invites us to think of agents not as obedient coworkers, but as idiosyncratic personas—each with their own backstories, beliefs, and sometimes maddening biases. ...

The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench introduces a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...

Threading the Needle: How GRAFT Reinvents Document Translation with DAGs and LLM Agents

Document-level machine translation (DocMT) has long been riddled with a paradox: while LLMs can translate fluent paragraphs and even simulate discourse, they often falter at stitching meaning across paragraphs. Pronouns go adrift, tenses waver, and terminology mutates like a broken telephone game. The new paper GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation proposes an ambitious fix: treat a document not as a sequence, but as a graph — and deploy a team of LLM agents to navigate it. ...

Secret Handshakes at Scale: How LLM Agents Learn to Collude

As large language models (LLMs) evolve from passive tools into autonomous market participants, a critical question emerges: can they secretly coordinate in ways that harm fair competition? A recent paper titled Evaluating LLM Agent Collusion in Double Auctions explores this unsettling frontier, and its findings deserve attention from both AI developers and policy makers. The study simulates a continuous double auction (CDA), where multiple buyer and seller agents submit bids and asks in real-time. Each agent is an LLM-powered negotiator, operating on behalf of a hypothetical industrial firm. Sellers value each item at $80, buyers at $100, and trades execute when bids meet asks. The fair equilibrium price should hover around $90. ...

From ETL to Orchestral Intelligence: The Rise of the Data Agent

Enterprise data workflows have long been a patchwork of scripts, schedulers, human-in-the-loop dashboards, and brittle integrations. Enter the “Data Agent”: an AI-native abstraction designed not just to automate, but to reason over, adapt to, and orchestrate complex Data+AI ecosystems. In their paper, “Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems”, Zhaoyan Sun et al. from Tsinghua University propose a new agentic blueprint for data orchestration—one that moves far beyond traditional ETL. ...

Chains of Causality, Not Just Thought

Large language models (LLMs) have graduated from being glorified autocomplete engines to becoming fully-fledged agents. They write code, control mobile devices, execute multi-step plans. But with this newfound autonomy comes a fundamental problem: they act—and actions have consequences. Recent research from KAIST introduces Causal Influence Prompting (CIP), a method that doesn’t just nudge LLMs toward safety through general heuristics or fuzzy ethical reminders. Instead, it formalizes decision-making by embedding causal influence diagrams (CIDs) into the prompt pipeline. The result? A structured, explainable safety layer that turns abstract AI alignment talk into something operational. ...

Chatbot at the Table: Rethinking Group Recommendations with GenAI

For over two decades, group recommender systems (GRS) have been a curiosity in academic circles, promising collective decisions through algorithmic aggregation. Yet despite dozens of papers and prototype systems, they’ve failed to find traction in the real world. Netflix doesn’t use them. Spotify doesn’t bother. Most of us still hash out group decisions in a group chat—awkwardly, inefficiently, and without algorithmic help. The authors of a recent perspective paper argue it’s time for a fundamental reorientation: stop building tools that compute what the group should want, and start designing agents that help the group decide. With the rise of generative AI and agentic LLMs, the timing couldn’t be better. ...

Agents Under Siege: How LLM Workflows Invite a New Breed of Cyber Threats

Agents Under Siege: How LLM Workflows Invite a New Breed of Cyber Threats From humble prompt-followers to autonomous agents capable of multi-step tool use, LLM-powered systems have evolved rapidly in just two years. But with this newfound capability comes a vulnerability surface unlike anything we’ve seen before. The recent survey paper From Prompt Injections to Protocol Exploits presents the first end-to-end threat model of these systems, and it reads like a cybersecurity nightmare. ...

Catalysts of Thought: How LLM Agents are Reinventing Chemical Process Optimization

In the world of chemical engineering, optimization is both a science and an art. But when operating conditions are ambiguous or constraints are missing, even the most robust solvers stumble. Enter the next-gen solution: a team of LLM agents that not only understand the problem but define it. When Optimization Meets Ambiguity Traditional solvers like IPOPT or grid search work well—if you already know the boundaries. In real-world industrial setups, however, engineers often have to guess the feasible ranges based on heuristics and fragmented documentation. This paper from Carnegie Mellon University breaks the mold by deploying AutoGen-based multi-agent LLMs that generate constraints, propose solutions, validate them, and run simulations—all with minimal human input. ...