LLM Agents

Quants With a Plan: Agentic Workflows That Outtrade AutoML

TL;DR for operators: A quant team does not need a chatbot that “has ideas” about markets. It needs a workflow that can select a sensible model, change one thing at a time, run the experiment, keep the better version, reject the worse one, and leave a paper trail that a human can inspect without requiring divination. ...

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

TL;DR for operators: FutureX is less interesting as a leaderboard and more interesting as an operating model for evaluating AI agents that claim to forecast the future. The benchmark runs a live loop: collect future-facing questions from curated web sources, ask agents to predict before the answer exists, wait for resolution, crawl the answer, and score the prior prediction. That matters because most “forecasting” evaluations are either historical backtests with leakage risk or static datasets quietly ageing into trivia. ...

Consent, Coaxing, and Countermoves: Simulating Privacy Attacks on LLM Agents

TL;DR for operators: Email is still where good security intentions go to become embarrassing screenshots. The paper behind this article, Searching for Privacy Risks in LLM Agents via Simulation, studies a future that is no longer especially futuristic: one AI agent has access to sensitive information, another agent wants it, and the two can talk through ordinary applications such as email, Messenger, Facebook, or Notion.[1] The question is not whether the model knows a privacy rule in the abstract. The question is whether an agent, while trying to be helpful in a live interaction, can refuse the wrong request at the right moment. ...

Skip or Split? How LLMs Can Make Old-School Planners Run Circles Around Complexity

TL;DR for operators: When an AI system has to execute a multi-step operational plan, the tempting move is to ask the LLM for the plan. This paper argues for a less glamorous and more useful pattern: let the LLM help shrink the search problem, then let a classical planner verify and compose the actual action sequence.[1] ...

Therapy, Explained: How Multi‑Agent LLMs Turn DSM‑5 Screens into Auditable Logic

TL;DR for operators: DSM5AgentFlow is not a paper about an AI therapist replacing a clinician. That would be the loud interpretation, and therefore the least useful one. The paper introduces a three-agent workflow that turns DSM-5 Level-1 screening into a structured conversation, then converts the transcript into a provisional diagnosis with evidence-linked reasoning.[1] ...

Confounder Hunters: How LLM Agents are Rewriting the Rules of Causal Inference

TL;DR for operators: Clinical analytics teams already know the unpleasant truth: observational data is cheap, rich, and biased in ways that do not politely announce themselves. The paper behind this article proposes a way to make that bias-hunting process less artisanal. Instead of asking experts to manually inspect every causal-tree rule, the framework lets causal trees segment patients, asks medical LLM agents to suggest plausible confounders using decomposed prompting plus retrieval, sends those suggestions through expert validation, then recursively focuses on samples whose treatment-effect estimates still have wide confidence intervals.[1] ...

Meta-Game Theory: What a Pokémon League Taught Us About LLM Strategy

TL;DR for operators: A Pokémon tournament sounds unserious until you notice what it does better than many enterprise AI pilots: it forces models to make constrained, sequential, adversarial decisions, then records not only what they did but why they said they did it. The paper behind this article introduces LLM Pokémon League, a benchmark where eight models from the GPT, Claude, and Gemini families act as Pokémon trainers. Each model selects a six-member team, then makes turn-by-turn battle decisions in a zero-shot setting. The framework captures team-building rationales, move choices, switching decisions, and explanations throughout the tournament.[1] ...

Forecast First, Ask Later: How DCATS Makes Time Series Smarter with LLMs

TL;DR for operators: Forecasting teams usually ask the same question first: which model should we use? DCATS suggests a more operationally useful question: which related histories should this model learn from? The paper introduces DCATS, a Data-Centric Agent for Time Series, an LLM-agent framework that improves forecasting by selecting auxiliary time series for fine-tuning rather than by designing a new forecasting architecture.[1] In the authors’ traffic forecasting study, GPT-4 Turbo reads metadata about nearby or similar California traffic sensors, proposes candidate neighbour sets, lets lightweight forecasting models test those proposals, and then refines the next round using validation error. ...

The Forest Within: How Galaxy Reinvents LLM Agents with Self-Evolving Cognition

TL;DR for operators: Galaxy is best read as a design argument, not merely a new agent benchmark entry. The paper says personal agents cannot become genuinely useful by stacking tools under a chat window. They need a structured internal map of the user, their own capabilities, available environments, and the system logic behind those capabilities.[1] ...

Forkcast: How Pro2Guard Predicts and Prevents LLM Agent Failures

TL;DR for operators: ProbGuard[1] is a runtime safety monitor that tries to answer a more useful question than “Has the agent broken a rule?” It asks: “Given where the agent is now, how likely is it to end up breaking a rule soon?” That shift matters. Many agent failures are not single bad actions. They are bad trajectories: the robot chooses the wrong object, the car carries too much speed into a risky scene, the workflow skips a confirmation step three moves before data is exposed. A conventional rule-based guardrail often detects the problem when the violation is already visible. ProbGuard tries to detect the probability mass moving toward the violation earlier. ...