
Search Party in a Notebook: JUPITER Turns Data Analysis into a Tree Game

TL;DR

Why this paper matters: It shows that how you search matters more than how big your model is for multi‑step, tool‑using analytics. With a notebook‑grounded dataset (NbQA) and value‑guided search, mid‑size open models rival GPT‑4o–based agents on a leading data‑analysis benchmark.

What’s new: (1) NbQA, a large corpus of real Jupyter tasks with executable multi‑step solutions; (2) JUPITER, a planner that treats analysis as a tree search over “thought → code → output” steps, guided by a learned value model.

Why you should care (operator’s view): This blueprint turns flaky “Code Interpreter”-style sessions into repeatable playbooks—fewer dead ends, more auditable steps, and better generalization without paying for the biggest model.

The core idea: analytics as a search tree

Most LLM data‑analysis failures come from branching mistakes: choosing the wrong intermediate step, compounding errors, and wasting tool calls. JUPITER reframes the whole exercise as search over notebook states. Each node is a concrete state—accumulated thoughts, code, and execution outputs. The system expands only a few promising branches and prunes the rest using a value model trained on successful and failed trajectories. ...
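The search itself is conceptually simple. Below is a minimal, illustrative sketch of value-guided tree search over notebook states, assuming hypothetical `generate_step` (an LLM proposing a thought and code), `execute` (runs the code and reports whether it produced a final answer), and `value` (the learned scorer) callables; JUPITER's actual expansion and pruning rules may differ.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                    # heapq is a min-heap, so store -value
    steps: list = field(compare=False)  # accumulated (thought, code, output) triples

def tree_search(question, generate_step, execute, value,
                beam_width=3, expansions=4, max_depth=8):
    """Best-first search over notebook states, pruned by a learned value model.
    generate_step, execute, and value are hypothetical stand-ins for the
    paper's LLM proposer, code executor, and trained value model."""
    frontier = [Node(0.0, [])]
    while frontier:
        node = heapq.heappop(frontier)
        if len(node.steps) >= max_depth:
            continue
        children = []
        for _ in range(expansions):
            thought, code = generate_step(question, node.steps)
            output, solved = execute(code, node.steps)
            steps = node.steps + [(thought, code, output)]
            if solved:                  # executor signals a final answer
                return steps
            children.append(Node(-value(question, steps), steps))
        # keep only the most promising branches; prune the rest
        for child in sorted(children)[:beam_width]:
            heapq.heappush(frontier, child)
    return None
```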

September 17, 2025 · 5 min · Zelina

Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

The quick take

Most debates about “diminishing returns” fixate on single‑step metrics. This paper flips the lens: if your product’s value depends on how long a model can execute without slipping, then even small per‑step gains can produce super‑linear increases in the task length a model can finish. The authors isolate execution (not planning, not knowledge) and uncover a failure mode—self‑conditioning—where models become more likely to err after seeing their own past errors. Reinforcement‑learned “thinking” models largely bypass this and stretch single‑turn execution dramatically. ...
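The compounding argument is easy to make concrete. Assuming independent per-step errors (an optimistic assumption, given the self-conditioning result), the longest task a model can finish at a fixed reliability grows much faster than the per-step accuracy:

```python
import math

def horizon(step_acc: float, target: float = 0.5) -> float:
    """Longest task length n such that step_acc**n >= target, assuming
    independent per-step errors (self-conditioning would make this worse)."""
    return math.log(target) / math.log(step_acc)

for p in (0.99, 0.995, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon(p):.0f} steps at 50% task success")
# per-step accuracy 0.990 -> ~69 steps
# per-step accuracy 0.995 -> ~138 steps  (a half-point bump doubles the horizon)
# per-step accuracy 0.999 -> ~693 steps
```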

September 17, 2025 · 5 min · Zelina

Guardrails Before Gas: Secure Plan‑Then‑Execute Agents for Real Work

TL;DR

Plan‑then‑Execute (P‑t‑E) agents separate strategy from action: a Planner writes a machine‑readable plan; an Executor carries it out. This simple split dramatically improves predictability, cost control, and—crucially—security. Hardened correctly (least‑privilege tools, sandboxed code, human sign‑offs), P‑t‑E becomes an enterprise‑grade pattern rather than a lab demo.

Why today’s agents need a spine, not vibes

Reactive patterns like ReAct feel nimble because they “think, act, observe, repeat.” But that short‑horizon loop is exactly what makes them fragile in production: they meander, retry the same failing step, and are easy to hijack by indirect prompt injection embedded in web pages or PDFs. P‑t‑E locks the control‑flow before the agent ingests untrusted data. The plan becomes an auditable artifact and the execution stage can be cheap, parallel, and tightly permissioned. ...
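A minimal sketch of the split, with hypothetical names: the Planner emits a machine-readable plan that is validated against a least-privilege tool registry before execution, and the Executor runs it step by step behind an approval hook. Sandboxing and the other hardening layers are elided.

```python
import json

ALLOWED_TOOLS = {"search_tickets", "summarize", "send_report"}  # least-privilege registry (illustrative)

def plan(planner_llm, goal: str) -> list[dict]:
    """Planner: produce an auditable, machine-readable plan before any
    untrusted data is ingested. planner_llm is a hypothetical LLM call."""
    raw = planner_llm(f"Write a JSON list of steps ({{'tool', 'args'}}) to: {goal}")
    steps = json.loads(raw)
    for step in steps:
        if step["tool"] not in ALLOWED_TOOLS:
            raise ValueError(f"Plan rejected: unapproved tool {step['tool']!r}")
    return steps

def execute(steps: list[dict], tools: dict, approve=lambda step: True):
    """Executor: run the locked plan step by step; no replanning mid-flight,
    and a human sign-off hook gates each sensitive step."""
    results = []
    for step in steps:
        if not approve(step):
            raise PermissionError(f"Step not approved: {step}")
        results.append(tools[step["tool"]](**step["args"]))
    return results
```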

September 14, 2025 · 5 min · Zelina

Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

If you’ve been tempted to A/B‑test a marketing idea on thousands of synthetic “customers,” read this first. A new study introduces a dead‑simple but devastating test for LLM‑based agents: ask them to first state their internal stance (preference) and their openness to persuasion, then drop them into a short dialogue and check whether their behavior matches what they just claimed. That’s it. If agents are believable stand‑ins for people, the conversation outcome should line up with those latent states. ...
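As a rough sketch of that consistency check (the study's actual elicitation prompts and scoring are more careful than this), the idea is simply: elicit the claimed stance and persuadability, run the dialogue, and compare claim against behavior.

```python
def consistency_probe(agent, topic, run_dialogue):
    """Stated-vs-revealed check. agent() and run_dialogue() are hypothetical
    stand-ins: the agent first declares its stance and openness, then plays a
    short persuasion dialogue, and we compare the declared openness with the
    shift actually observed."""
    stance = float(agent(f"On a 1-10 scale, what is your stance on {topic}?"))
    openness = float(agent("On a 1-10 scale, how open are you to changing that stance?"))
    final_stance = float(run_dialogue(agent, topic))   # outcome of the short dialogue
    # An agent claiming low openness should show a small shift, and vice versa
    return {"claimed_openness": openness, "observed_shift": abs(final_stance - stance)}
```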

September 8, 2025 · 5 min · Zelina

Plan, Act, Replan: When LLM Agents Run the Aisles

Modern retail planning isn’t a spreadsheet; it’s a loop. A new supply‑chain agent framework—deployed at JD.com’s scale—treats planning as a closed‑loop system: gather data → generate plans → execute → diagnose → correct → repeat. That shift from “one‑and‑done forecasting” to continuous replanning is the core idea worth copying.

What’s actually new here

Agentic decomposition around business intents. Instead of dumping a vague prompt into a model, the system classifies the operator’s request into three intent families: (1) inventory turnover & diagnostics, (2) in‑stock monitoring, (3) sales/inventory/procurement recommendations. Each intent triggers a structured task list rather than ad‑hoc code.

Atomic analytics, not monoliths. The execution agent generates workflows as chains of four primitives—Filter → Transform → Groupby → Sort—and stitches them with function calls to vetted business logic. This keeps code inspectable, traceable, and reusable.

Dynamic reconfiguration. After every sub‑task, observations feed back into the planner, which prunes, reorders, or adds steps. The output isn’t a static report; it’s a plan that learns while it runs.

Why it matters for operators (not just researchers)

Traditional MIP‑heavy or rule‑based planning works well when the world is stationary and well‑specified. Retail isn’t. Promotions, seasonality, logistics bottlenecks, supplier constraints—these create moving objective functions. The agentic design here bakes in: ...
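For flavor, here is a toy version of one such atomic workflow in pandas; the column names and business logic are invented, but the point is that each primitive is a separate, inspectable step the planner can prune or reorder after observing intermediate results.

```python
import pandas as pd

def turnover_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Filter -> Transform -> Groupby -> Sort as one inspectable chain.
    Column names (category, units_sold, on_hand) are illustrative only."""
    return (
        df[df["on_hand"] > 0]                                        # Filter: in-stock SKUs only
        .assign(turnover=lambda d: d["units_sold"] / d["on_hand"])   # Transform: derive turnover
        .groupby("category", as_index=False)["turnover"].mean()      # Groupby: aggregate per category
        .sort_values("turnover", ascending=False)                    # Sort: worst-to-best ranking
    )
```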

September 8, 2025 · 4 min · Zelina

Rules of Engagement: How Meta‑Policy Reflexion Turns Agent Memory into Guardrails

Enterprise buyers love what agents can do—and fear what they might do. Meta‑Policy Reflexion (MPR) proposes a middle path: keep your base model frozen, but bolt on a reusable, structured memory of “what we learned last time” and a hard admissibility check that blocks invalid actions at the last mile. In plain English: teach the agent house rules once, then make sure it obeys them, everywhere, without re‑training.

The big idea in one slide (text version)

What it adds: a compact, predicate‑like Meta‑Policy Memory (MPM) distilled from past reflections (e.g., “Never pour liquid on a powered device; unplug first.”) ...
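A minimal sketch of what a hard admissibility gate backed by a rule memory might look like; the predicate format and the example rule are illustrative, not MPR's exact representation.

```python
META_POLICY_MEMORY = [
    # Predicate-like rules distilled from past reflections (illustrative format).
    {"forbid": "pour", "when": lambda state: state.get("device_powered", False),
     "reason": "Never pour liquid on a powered device; unplug first."},
]

def admissible(action: str, state: dict) -> tuple[bool, str]:
    """Last-mile check: block any action a memorized rule forbids in this state.
    The base model stays frozen; only this gate and the memory change."""
    for rule in META_POLICY_MEMORY:
        if rule["forbid"] in action and rule["when"](state):
            return False, rule["reason"]
    return True, "ok"

ok, why = admissible("pour water on the laptop", {"device_powered": True})
# ok == False, why == "Never pour liquid on a powered device; unplug first."
```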

September 8, 2025 · 5 min · Zelina

Control Plane, Not Pain: How Agentic OS Turns Linux Scheduling into a Semantic Service

The Big Idea

Operating systems have always struggled with a silent mismatch: the kernel’s scheduler doesn’t know what your application actually wants. SchedCP proposes a clean solution—turn scheduling into a semantic control plane. AI agents reason about what a workload needs; the system safely handles how to observe and act via eBPF-based schedulers. This division keeps LLMs out of the hot path while letting them generate and refine policies that actually fit the job. ...
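Conceptually (and only conceptually; this is not SchedCP's real interface), the division of labor looks like this, with the kernel-side eBPF machinery stubbed out:

```python
def control_plane_step(agent_llm, observe, validate, load_scheduler):
    """Sketch of the what/how split. agent_llm, observe, validate, and
    load_scheduler are hypothetical callables: the agent reasons about
    workload intent off the hot path, while installing and running the
    eBPF scheduler is left to stubbed kernel-side machinery."""
    profile = observe()                      # workload metrics, sampled out of band
    policy = agent_llm(
        f"Given this workload profile, pick a scheduling policy: {profile}"
    )                                        # slow, agent-side reasoning about *what*
    if not validate(policy):                 # guardrail before anything touches the kernel
        return "rejected"
    load_scheduler(policy)                   # fast, mechanism-side *how* (eBPF, stubbed here)
    return "installed"
```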

September 4, 2025 · 3 min · Zelina

From Prompts to Policies: The Agentic RL Playbook

How a new survey formalizes the shift from RLHF’d text bots to tool-using operators—and the practical playbook for product teams.

TL;DR

Agentic RL reframes LLMs from one-shot text generators to policies acting in dynamic environments with planning, tool use, memory, and reflection. The paper contrasts PBRFT (preference-based RL fine-tuning) with Agentic RL via an MDP→POMDP upgrade; the action space now includes text + structured actions. It organizes the space by capabilities (planning, tools, memory, self-improvement, reasoning, perception) and tasks (search, code, math, GUI, vision, embodied, multi-agent). Open challenges: trust, scalable training, and scalable environments. For builders: start with short-horizon agents (verifiable rewards), invest early in evaluation, and plan a migration path from RAG pipelines to tool-integrated reasoning (TIR) with RL.

What the paper actually changes

Most “LLM RL” work you’ve seen is PBRFT—optimize responses to fit human/AI preferences (RLHF/DPO/etc.). This new survey argues that real autonomy needs Agentic RL: treat the model as a policy embedded in a sequential, partially observable world. That sounds academic, but the practical consequences are huge: ...
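The MDP→POMDP framing boils down to a different rollout loop: the policy emits either free text or a structured tool call, and reward accrues from the environment over a trajectory rather than from a one-shot preference score. A skeletal sketch with placeholder names (not an API from the survey):

```python
def agentic_rollout(policy, env, max_turns=10):
    """Skeletal POMDP-style rollout: observations in, text-or-tool actions out,
    environment rewards accumulated along the trajectory. policy and env are
    placeholders for whatever agent stack and environment you run."""
    obs, trajectory, total_reward = env.reset(), [], 0.0
    for _ in range(max_turns):
        action = policy(obs)                 # e.g. {"type": "text"|"tool", ...}
        if action["type"] == "tool":
            obs, reward, done = env.call_tool(action["name"], action["args"])
        else:
            obs, reward, done = env.respond(action["content"])
        trajectory.append((action, reward))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward          # fed to an RL update (e.g., PPO/GRPO)
```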

September 4, 2025 · 5 min · Zelina

Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

The short of it

A new study on SWE-agent working over SWE-bench Verified finds that masking old observations (keeping recent turns verbatim, replacing older tool outputs with a placeholder) often matches or slightly beats prompt-based LLM summarization—and at roughly half the cost. The paper also surfaces a subtle failure mode: summaries can elongate trajectories, encouraging agents to “keep going” when they should stop, diluting efficiency and, at times, performance.

Why this matters for builders

Most production SE agents (debuggers, PR autoresponders, test fixers) rack up spend on two things: tokens and time. Tool logs dominate both. In practice, observation tokens make up the bulk of an agent’s turn, so trimming them intelligently is the highest‑leverage knob. The results show you might not need fancy, model‑authored summaries; a rolling “mask” window can sit on the efficient frontier—equal or better solve rate, far lower cost—across Qwen3‑Coder 480B, Qwen3‑32B (thinking/non‑thinking), and Gemini 2.5 Flash (thinking/non‑thinking). ...
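The masking baseline is almost trivial to implement, which is part of the appeal. A minimal sketch (the window size and placeholder text below are assumptions, not the paper's exact settings):

```python
def mask_observations(turns, keep_last=5, placeholder="[omitted earlier tool output]"):
    """Rolling mask: the most recent turns stay verbatim, older tool
    observations are replaced by a cheap fixed placeholder instead of an
    LLM-written summary."""
    masked = []
    for i, turn in enumerate(turns):
        is_old = i < len(turns) - keep_last
        if is_old and turn["role"] == "tool":
            masked.append({"role": "tool", "content": placeholder})
        else:
            masked.append(turn)
    return masked
```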

September 1, 2025 · 4 min · Zelina

Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

When an agent thinks it sees itself in the mirror, it doesn’t necessarily smile—it sometimes clutches its wallet.

TL;DR

In an iterated public‑goods game (20 rounds, 10 tokens per round, 1.6 multiplier), telling models they’re playing “another AI” versus “themselves” shifts contributions by up to ~4 points in some settings. Direction of the shift depends on the prompt persona: with collective prompts, “self” labels often reduced contributions; with selfish prompts, “self” labels sometimes increased matching/cooperation. Effects persist under rephrased prompts and when reasoning traces aren’t requested, and they appear even in four‑agent self‑play variants. For enterprise multi‑agent AI, identity cues are levers. Manage them like you manage feature flags: test, monitor, and standardize.

What the authors tested (and why it’s clever)

Game mechanics. Two (and later four) LLM agents repeatedly choose how much to contribute (0–10) to a common pool each round. Pool is multiplied by 1.6 and split evenly; keeping more is privately optimal, but coordinated contribution yields higher joint payoffs. ...
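The payoff arithmetic behind those incentives, as a sanity check: with a 1.6 multiplier split evenly between two players, mutual contribution beats mutual free-riding, but unilateral defection still pays privately.

```python
def round_payoffs(contributions, endowment=10, multiplier=1.6):
    """Per-round payoffs in the two-player public-goods game described above:
    each agent keeps whatever it withholds plus an equal share of the
    multiplied common pool."""
    pool = multiplier * sum(contributions)
    share = pool / len(contributions)
    return [endowment - c + share for c in contributions]

print(round_payoffs([10, 10]))  # [16.0, 16.0]  both cooperate
print(round_payoffs([0, 0]))    # [10.0, 10.0]  both free-ride
print(round_payoffs([0, 10]))   # [18.0, 8.0]   defection pays privately
```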

August 27, 2025 · 5 min · Zelina