Automation

Failures, Taxonomized: How Multi‑Level Reflection Turns Agents Into Self‑Learners

TL;DR Most reflection frameworks still treat failure analysis as an afterthought. SAMULE reframes it as the core curriculum: synthesize reflections at micro (single trajectory), meso (intra‑task error taxonomy), and macro (inter‑task error clusters) levels, then fine‑tune a compact retrospective model that generates targeted reflections at inference. It outperforms prompt‑only baselines and RL‑heavy approaches on TravelPlanner, NATURAL PLAN, and Tau‑Bench. The strategic lesson for builders: design your error system first; the agent will follow. ...

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap. ...

Pipes by Prompt, DAGs by Design: Why Hybrid Beats Hero Prompts

TL;DR Turning natural‑language specs into production Airflow DAGs works best when you split the task into stages and let templates carry the structural load. In Prompt2DAG’s 260‑run study, a Hybrid approach (structured analysis → workflow spec → template‑guided code) delivered ~79% success and top quality scores, handily beating Direct one‑shot prompting (~29%) and LLM‑only generation (~66%). Deterministic Templated code hit ~92% but at the price of up‑front template curation. What’s new here Most discussions about “LLMs writing pipelines” stop at demo‑ware. Prompt2DAG treats pipeline generation like software engineering, not magic: 1) analyze requirements into a typed JSON, 2) convert to a neutral YAML workflow spec, 3) compile to Airflow DAGs either by deterministic templates or by LLMs guided by those templates, 4) auto‑evaluate for style, structure, and executability. The result is a repeatable path from English to a runnable DAG. ...

Repo, Meet Your Agent: Turning GitHub into a Workforce with EnvX

Why this matters: Most “AI + devtools” still treats repos as documentation you read and code you copy. EnvX flips the model: it agentizes a repository so it can understand your request, set up its own environment (deps, data, checkpoints), run tasks end‑to‑end, verify results, and even talk to other repo‑agents. That’s a step change—from “NL2Code” to “NL2Working System.” The core shift in one line Instead of you integrating a repo, the repo integrates itself into your workflow—and can collaborate with other repos when the task spans multiple systems. ...

Plan, Act, Replan: When LLM Agents Run the Aisles

Modern retail planning isn’t a spreadsheet; it’s a loop. A new supply‑chain agent framework—deployed at JD.com’s scale—treats planning as a closed‑loop system: gather data → generate plans → execute → diagnose → correct → repeat. That shift from “one‑and‑done forecasting” to continuous replanning is the core idea worth copying. What’s actually new here Agentic decomposition around business intents. Instead of dumping a vague prompt into a model, the system classifies the operator’s request into three intent families: (1) inventory turnover & diagnostics, (2) in‑stock monitoring, (3) sales/inventory/procurement recommendations. Each intent triggers a structured task list rather than ad‑hoc code. Atomic analytics, not monoliths. The execution agent generates workflows as chains of four primitives—Filter → Transform → Groupby → Sort—and stitches them with function calls to vetted business logic. This keeps code inspectable, traceable, and reusable. Dynamic reconfiguration. After every sub‑task, observations feed back into the planner, which prunes, reorders, or adds steps. The output isn’t a static report; it’s a plan that learns while it runs. Why it matters for operators (not just researchers) Traditional MIP‑heavy or rule‑based planning works well when the world is stationary and well‑specified. Retail isn’t. Promotions, seasonality, logistics bottlenecks, supplier constraints—these create moving objective functions. The agentic design here bakes in: ...

Agents on the Clock: Turning a 3‑Layer Taxonomy into a Build‑Ready Playbook

Most “agent” decks promise autonomy; few explain how to make it shippable. A new survey of LLM‑based agentic reasoning frameworks cuts through the noise with a three‑layer taxonomy—single‑agent methods, tool‑based methods, and multi‑agent methods. Below, we translate that map into a practical build/run playbook for teams deploying AI automation in real workflows. TL;DR Single‑agent = shape the model’s thinking loop (roles, task prompts, reflection, iterative refinement). Tool‑based = widen the model’s action space (APIs, plugins/RAG, middleware; plus selection and orchestration patterns: sequential, parallel, iterative). Multi‑agent = scale division of labor (centralized, decentralized, or hierarchical; with cooperation, competition, negotiation). Treat these as orthogonal dials you tune per use‑case; don’t jump to multi‑agent if a reflective single agent with a code‑interpreter suffices. 1) What’s genuinely new (and useful) here Most prior surveys were model‑centric (how to finetune or RLHF your way to better agents). This survey is framework‑centric: it formalizes the reasoning process—context $C$, action space $A = {a_{reason}, a_{tool}, a_{reflect}}$, termination $Q$—and shows where each method plugs into the loop. That formalism matters for operators: it’s the difference between “let’s try AutoGen” and “we know which knob to turn when the agent stalls, loops, or hallucinates.” ...

Click Less, Do More: Why API-GUI + RL Could Finally Make Desktop Agents Useful

The gist (and why it matters for business) Enterprise buyers don’t reward demos; they reward repeatable completions per dollar. ComputerRL proposes a path to that by (1) escaping pure GUI mimicry via a machine-first API-GUI action space, (2) scaling online RL across thousands of Ubuntu VMs, and (3) preventing policy entropy collapse with Entropulse—a cadence that alternates RL and supervised fine-tuning (SFT) on successful rollouts. The result: a reported 48.1% OSWorld success with markedly fewer steps than GUI-only agents. Translation for buyers: lower latency, lower cost, higher reliability. ...

Skip or Split? How LLMs Can Make Old-School Planners Run Circles Around Complexity

TL;DR Classical planners crack under scale. You can rescue them with LLMs in two ways: (1) Inspire the next action, or (2) Predict an intermediate state and split the search. On diverse benchmarks (Blocks, Logistics, Depot, Mystery), the Predict route generally solves more cases with fewer LLM calls, except when domain semantics are opaque. For enterprise automation, this points to a practical recipe: decompose → predict key waypoints → verify with a trusted solver—and only fall back to “inspire” when your domain model is thin. ...

Textual Gradients and Workflow Evolution: How AdaptFlow Reinvents Meta-Learning for AI Agents

From Static Scripts to Living Workflows The AI agent world has a scaling problem: most automated workflow builders generate one static orchestration per domain. Great in benchmarks, brittle in the wild. AdaptFlow — a meta-learning framework from Microsoft and Peking University — proposes a fix: treat workflow design like model training, but swap numerical gradients for natural language feedback. This small shift has a big implication: instead of re-engineering from scratch for each use case, you start from a meta-learned workflow skeleton and adapt it on the fly for each subtask. ...

From Byline to Botline: How LLMs Are Quietly Rewriting the News

The AI Pressroom Arrives — Mostly Unannounced When ChatGPT-3.5 launched in late 2022, it didn’t just disrupt classrooms and coding forums — it quietly walked into newsrooms. A recent large-scale study of 40,000+ news articles shows that local and college media outlets, often operating with lean budgets and smaller editorial teams, have embraced generative AI far more than their major-network counterparts. And in many cases, readers have no idea. The research, spanning opinion sections from CNN to The Harvard Crimson, and across formats from print to radio, found a tenfold jump in AI-written local news opinion pieces post-GPT. College newspapers followed closely with an 8.6× increase, while major outlets showed only modest uptake — a testament to stricter editorial controls or more cautious adoption policies. ...