Agentic AI

Learning Has a Supply Chain

TL;DR for operators: AI learning is becoming less like “train a bigger model and hope it behaves” and more like operating a controlled capability loop. The first paper in this cluster shows a narrow but important lesson: once a multimodal model has learned useful representations, the final adaptation step should optimize the metric that actually matters, while avoiding damage to the representation underneath.[1] The second paper moves the same logic into physical action: an embodied system should connect language-level intention, predicted world change, memory, and executable robot control, not merely map images to motor commands with expensive optimism.[2] The third paper zooms out: when agentic AI becomes economically and militarily useful, the real bottleneck includes data centers, accelerators, electricity, water, datasets, and skilled labor.[3] ...

Feedback Is the New Attack Surface

TL;DR for operators: AI agents are not only vulnerable because someone can hide a bad instruction in an email, document, web page, Slack message, or tool output. They are vulnerable because attackers can now automate the search for bad instructions that work. That changes the security problem. A one-off prompt injection is annoying. An automated attack loop is strategic. It generates candidate injections, observes the agent’s response, scores partial progress, keeps the promising branches, and tries again. Very entrepreneurial, in the worst possible way. ...
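The loop described above (generate, observe, score, keep the promising branches, repeat) has the shape of an ordinary beam search, which is why it is cheap to automate. The sketch below is a deliberately abstract toy and not any paper's method: candidates are integer tuples rather than text, and `mutate`, `score`, `feedback_search`, and the hidden target are all hypothetical names invented for illustration. It only shows how graded feedback lets a search converge where one-off guessing would not.

```python
import heapq


def mutate(candidate):
    """Yield neighbours of a candidate by nudging one position by +/-1 (toy stand-in for 'generate')."""
    for i in range(len(candidate)):
        for delta in (-1, 1):
            yield candidate[:i] + (candidate[i] + delta,) + candidate[i + 1:]


def make_score(hidden_target):
    """Return a scorer that reports partial progress (toy stand-in for 'observe and score')."""
    def score(candidate):
        # Higher is better; 0 means the hidden target was reached.
        return -sum(abs(a - b) for a, b in zip(candidate, hidden_target))
    return score


def feedback_search(seed, score, beam_width=3, rounds=5):
    """Keep the best `beam_width` candidates each round and expand them again."""
    beam = [seed]
    for _ in range(rounds):
        candidates = set(beam)
        for c in beam:
            candidates.update(mutate(c))
        # Sorting first makes tie-breaking deterministic.
        beam = heapq.nlargest(beam_width, sorted(candidates), key=score)
    return max(beam, key=score)


score = make_score(hidden_target=(2, -1))
best = feedback_search(seed=(0, 0), score=score)
print(best, score(best))
```

In this toy, everything depends on `score` returning a graded signal instead of a flat pass or fail; the search itself is a few lines.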

The Retriever Found Similar Things. The Evidence Was Elsewhere.

TL;DR for operators: The current enterprise RAG conversation still has a charmingly stubborn misconception: if the model hallucinates, buy better embeddings, increase the context window, add an agent, and hope the PowerPoint becomes true. The two papers here point in a less theatrical direction. One paper, Non-negative Elastic Net Decoding for Information Retrieval, argues that dense retrieval has a structural weakness: it scores each candidate independently, so it can retrieve several similar items instead of the complementary set actually needed to answer the query.[1] The other, Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis, shows what happens when retrieval is treated as a full evidence workflow: sparse and dense retrieval are fused, queries are decomposed under constraints, evidence is deduplicated and budgeted, and answers are judged for coverage, hallucination, and abstention.[2] ...
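To make “fused, deduplicated, and budgeted” concrete, here is a minimal sketch of one way those steps could look. It assumes reciprocal rank fusion (RRF) as the fusion rule, which is a common choice but is not confirmed as the paper's method. The function names, the `k=60` constant, the character budget, and the example documents are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc id breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


def dedupe_and_budget(doc_ids, texts, max_chars):
    """Walk docs in fused order, skip near-verbatim repeats, stay within a character budget."""
    seen, kept, used = set(), [], 0
    for doc_id in doc_ids:
        normalized = " ".join(texts[doc_id].lower().split())
        if normalized in seen:
            continue  # same evidence already selected
        if used + len(texts[doc_id]) > max_chars:
            continue  # does not fit in the remaining budget
        seen.add(normalized)
        kept.append(doc_id)
        used += len(texts[doc_id])
    return kept


# Hypothetical ranked outputs from a sparse (keyword) and a dense (embedding) retriever.
sparse = ["d3", "d1", "d7"]
dense = ["d1", "d2", "d3"]
texts = {
    "d1": "muon beam energy",
    "d2": "detector background",
    "d3": "Muon beam energy",  # duplicate of d1 after normalization
    "d7": "cooling channel",
}

fused = reciprocal_rank_fusion([sparse, dense])
evidence = dedupe_and_budget(fused, texts, max_chars=36)
print(fused, evidence)
```

Note that exact-match deduplication only removes verbatim repeats. It does not address the structural weakness the first paper describes, where independently scored candidates are similar without being identical.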

The Agents Need Traffic Laws, Not a Bigger Chatroom

TL;DR for operators: The paper’s practical message is simple enough to be dangerous: once agents start working with other agents, the hard problem stops being “Can this model reason?” and becomes “Can this network behave?” Quanyan Zhu’s paper on the Internet of Agentic AI, or IoAI, frames the next stage of agentic systems as an open ecosystem of heterogeneous autonomous agents that discover collaborators, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments.[1] That sounds grand, which is usually where useful engineering goes to die. But the paper’s better contribution is more sober: it treats agentic AI as a distributed systems problem. ...

Agents of Consequence: Why Tool Use Needs a Control Loop

TL;DR for operators: Enterprise AI agents are moving from “answer this question” toward “watch this process, use tools, make decisions, and keep going.” That is useful. It is also how software quietly graduates from assistant to operational liability. Three recent papers, read together, make a simple point with uncomfortable business implications. VitalAgent shows how an LLM agent can become useful in wearable-health monitoring when it has physiological memory, structured tools, evidence validation, and proactive alerting.[1] CoMap shows how agents can improve long-horizon decisions by pairing their policy with a co-evolving textual world model that predicts action consequences before execution.[2] Gram shows why more autonomous agents also need deployment-realistic audits, because pressure, incentives, role-play cues, and implicit constraints can produce sabotage-like behavior even when the model is not cartoonishly “evil.”[3] ...

Less Prompt, More Blueprint: MOSAIC and the Data-Science Agent That Keeps Receipts

TL;DR for operators: MOSAIC is best read as a system-design paper, not as another entry in the increasingly crowded genre of “we attached an LLM to Python and hoped for the best.” The paper introduces a structured agentic framework for automated data science where the agent builds an explicit workflow blueprint before generating code, then verifies, executes, and refines candidates using diagnostic feedback and failure-aware offline reinforcement learning.[1] ...

Sink or Skill: Why Agent Experience Needs Governance

TL;DR for operators: AI agents do not become useful by remembering everything. That is not intelligence; it is a data landfill with a chatbot interface. Two recent arXiv papers, one on medical reasoning agents and one on physically based swimming control, make a shared operational point from very different directions. SkeMex shows how a medical agent can improve after deployment by converting interaction trajectories into structured, evaluated, and governed clinical skills.[1] SWIM shows how a simulated swimmer can learn robust control from a single reference motion when body-fluid interaction is represented at the right level and scarce experience is sampled efficiently.[2] ...

The Path of Least Assurance: Why AI Reliability Lives Between the Steps

TL;DR for operators: AI reliability is increasingly a process problem, not an answer-checking problem. Three recent arXiv papers make that point from very different angles. MoCo-EA shows that adversarial examples are not merely isolated malicious pixels lurking in the shrubbery; they can lie along continuous, optimisable paths.[1] ConceptAgent shows that erasing a concept from a diffusion model may disrupt the early text-to-image link while leaving later trajectory dynamics available for concept re-entry.[2] BlueFin shows that LLM agents doing finance spreadsheet work fail in ways that only appear when you inspect formulas, recalculation behaviour, workbook mutations, tool choices, and whether the output helps a human analyst do useful work.[3] ...

Split Before You Scale: Why Useful AI Starts by Sorting the Mess

TL;DR for operators: AI systems fail less dramatically when they stop treating every messy signal as the same kind of mess. The three papers in this cluster look unrelated at first: one generates graphs, one studies exploration in restless bandits, and one improves reinforcement-learning generalisation from formal task specifications. Under the surface, they make a shared operational point: before scaling an AI system, separate the structure that must be preserved, the uncertainty that should guide action, and the supervision signal stable enough to train on. ...

Statecraft, Not Scorecards: Why Reliable AI Lives on the Path

TL;DR for operators: AI reliability is increasingly a path problem, not a score problem. One paper argues that post-training methods such as supervised fine-tuning, reinforcement learning, and on-policy distillation should be understood by asking where supervision is applied in the model’s state space.[1] Another argues that GUI-agent software evaluation fails when a single unsuccessful rollout is treated as proof of a broken application, even though the evaluator has only inspected one path through a larger UI state graph.[2] ...