LLM Agents

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

The one-sentence take A new live benchmark, FutureX, swaps lab-style trivia for rolling, real-world future events, forcing agentic LLMs to search, reason, and hedge under uncertainty that actually moves—and the results expose where today’s “agents” are still brittle. Why FutureX matters now Enterprise teams are deploying agents to answer questions whose truth changes by the hour—markets, elections, sports, product launches. Static leaderboards don’t measure that. FutureX runs as a cron job on reality: it collects new events every day, has agents make predictions, and grades them after events resolve. That turns evaluation from a screenshot into a time series and makes overfitting to benchmark quirks a lot harder. ...

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

Agentic LLMs are graduating from chat to control rooms—taking actions, maintaining memory, and optimizing business processes. Inventory is a natural proving ground: a clean cocktail of uncertainty, economics, and coordination. AIM-Bench arrives precisely here, testing LLM agents across newsvendor, multi-period replenishment, the Beer Game, two-level warehouses, and a small supply network—each with explicit uncertainty sources (stochastic demand, variable lead times, and partner behavior). ...

Consent, Coaxing, and Countermoves: Simulating Privacy Attacks on LLM Agents

When organizations deploy LLM-based agents to email, message, and collaborate on our behalf, privacy threats stop being static. The attacker is now another agent able to converse, probe, and adapt. Today’s paper proposes a simulation-plus-search framework that discovers these evolving risks—and the countermeasures that survive them. The result is a rare, actionable playbook: how attacks escalate in multi-turn dialogues, and how defenses must graduate from rules to identity-verified state machines. ...

Three’s Company: When LLMs Argue Their Way to Alpha

TL;DR A role‑based, debate‑driven LLM system—AlphaAgents—coordinates three specialist agents (fundamental, sentiment, valuation) to screen equities, reach consensus, and build a simple, equal‑weight portfolio. In a four‑month backtest starting 2024‑02‑01 on 15 tech names, the risk‑neutral multi‑agent portfolio outperformed the benchmark and single‑agent baselines; risk‑averse variants underperformed in a bull run (as expected). The real innovation isn’t the short backtest—it’s the explainable process: constrained tools per role, structured debate, and explicit risk‑tolerance prompts. ...

Confounder Hunters: How LLM Agents are Rewriting the Rules of Causal Inference

When Hidden Variables Become Hidden Costs In causal inference, confounders are the uninvited guests at your data party — variables that influence both treatment and outcome, quietly skewing results. In healthcare, failing to adjust for them can turn life-saving insights into misleading noise. Traditionally, finding these culprits has been the realm of domain experts, a slow and costly process that doesn’t scale well. The paper from National Sun Yat-Sen University proposes a radical alternative: put Large Language Model (LLM)-based agents into the causal inference loop. These agents don’t just crunch numbers — they reason, retrieve domain knowledge, and iteratively refine estimates, effectively acting as tireless, always-available junior experts. ...

Meta-Game Theory: What a Pokémon League Taught Us About LLM Strategy

When language models battle, their strategies talk back. In a controlled Pokémon tournament, eight LLMs drafted teams, chose moves, and logged natural‑language rationales every turn. Beyond win–loss records, those explanations exposed how models reason about uncertainty, risk, and resource management—exactly the traits we want in enterprise decision agents. Why Pokémon is a serious benchmark (yes, really) Pokémon delivers the trifecta we rarely get in classic AI games: Structured complexity: 18 interacting types, clear multipliers, and crisp rules. Uncertainty that matters: imperfect information, status effects, and accuracy trade‑offs. Resource management: limited switches, finite HP, role specialization. Crucially, the action space is compact enough for language-first agents to reason step‑by‑step without search trees—so we can see the strategy, not just the score. ...

Forecast First, Ask Later: How DCATS Makes Time Series Smarter with LLMs

When it comes to forecasting traffic patterns, weather, or financial activity, the prevailing wisdom in machine learning has long been: better models mean better predictions. But a new approach flips this assumption on its head. Instead of chasing ever-more complex architectures, the DCATS framework (Data-Centric Agent for Time Series), developed by researchers at Visa, suggests we should first get our data in order—and let a language model do it. The Agentic Turn in AutoML DCATS builds on the trend of integrating Large Language Model (LLM) agents into AutoML pipelines, but with a twist. While prior systems like AIDE focus on automating model design and hyperparameter tuning, DCATS delegates a more fundamental task to its LLM agent: curating the right data. ...

The Forest Within: How Galaxy Reinvents LLM Agents with Self-Evolving Cognition

In a field where many agents act like well-trained dogs, obediently waiting for commands, Galaxy offers something more radical: a system that watches, thinks, adapts, and evolves—without needing to be told. It’s not just an intelligent personal assistant (IPA); it’s an architecture that redefines what intelligence means for LLM-based agents. Let’s dive into why Galaxy is a leap beyond chatty interfaces and into cognition-driven autonomy. 🌳 Beyond Pipelines: The Cognition Forest At the heart of Galaxy lies the Cognition Forest, a structured semantic space that fuses cognitive modeling and system design. Each subtree represents a facet of agent understanding: ...

Forkcast: How Pro2Guard Predicts and Prevents LLM Agent Failures

If your AI agent is putting a metal fork in the microwave, would you rather stop it after the sparks fly—or before? That’s the question Pro2Guard was designed to answer. In a world where Large Language Model (LLM) agents are increasingly deployed in safety-critical domains—from household robots to autonomous vehicles—most existing safety frameworks still behave like overly cautious chaperones: reacting only when danger is about to occur, or worse, when it already has. This reactive posture, embodied in rule-based systems like AgentSpec, is too little, too late in many real-world scenarios. ...

From Autocomplete to Autonomy: How LLM Code Agents are Rewriting the SDLC

The idea of software that writes software has long hovered at the edge of science fiction. But with the rise of LLM-based code agents, it’s no longer fiction, and it’s certainly not just autocomplete. A recent survey by Dong et al. provides the most thorough map yet of this new terrain, tracing how code generation agents are shifting from narrow helpers to autonomous systems capable of driving the entire software development lifecycle (SDLC). ...