LLM Agents

When 'Check the AC' Becomes the Hard Part

TL;DR for operators: Smart-home assistants do not fail only when users are vague. They fail when users become efficient. The PEC-Home paper studies a familiar pattern: after repeated interaction, people stop saying the whole thing. “Please turn on the air conditioner in the bedroom and set it to 26 degrees at 10 PM” eventually becomes “check the AC” or “handle that thing.” Humans manage this because shared context, identity, place, and prior routines do the missing work. Current LLM assistants are much less charming under that burden. ...

Feedback Is the New Attack Surface

TL;DR for operators: AI agents are not only vulnerable because someone can hide a bad instruction in an email, document, web page, Slack message, or tool output. They are vulnerable because attackers can now automate the search for bad instructions that work. That changes the security problem. A one-off prompt injection is annoying. An automated attack loop is strategic. It generates candidate injections, observes the agent’s response, scores partial progress, keeps the promising branches, and tries again. Very entrepreneurial, in the worst possible way. ...

Ground Control to Synthetic Data: Why Enterprise LLMs Need a Source of Truth

TL;DR for operators: Synthetic data is having its predictable enterprise moment: everyone wants more of it, faster, cheaper, and preferably without involving humans who ask inconvenient questions like “is this correct?” The two papers here are useful because they push against that lazy version of the story. StateGen, from PayPal AI, focuses on generating multi-turn training conversations for tool-augmented LLM agents, using an authoritative world-state object, tool simulation, persona variation, and multi-axis judging.[1] CYQUARK focuses on generating Text-To-Cypher fine-tuning data from a target property graph and schema, expanding query expressivity while filtering natural-language paraphrases for logical fidelity.[2] ...

Less Prompt, More Blueprint: MOSAIC and the Data-Science Agent That Keeps Receipts

TL;DR for operators: MOSAIC is best read as a system-design paper, not as another entry in the increasingly crowded genre of “we attached an LLM to Python and hoped for the best.” The paper introduces a structured agentic framework for automated data science where the agent builds an explicit workflow blueprint before generating code, then verifies, executes, and refines candidates using diagnostic feedback and failure-aware offline reinforcement learning.[1] ...

Logs Are Not Lineage: The Accountability Layer AI Agents Are Missing

TL;DR for operators: The paper argues that trustworthy AI agents need more than accurate final answers. Once an agent can retrieve documents, call APIs, write memory, modify databases, send messages, or coordinate with other agents, trust depends on whether the organisation can reconstruct how the output or action happened. The useful mechanism is: ...

The Solver Isn’t the Strategy: FrontierOR’s Reality Check for AI Optimisation Agents

Scheduling a factory, routing a fleet, pricing airline seats, allocating scarce capacity: these are not “write me a Python script” problems with nicer stationery. In real operations research, the useful answer is not merely a correct mathematical model. It is a method that stays feasible, keeps solution quality high, and finishes before the business context has expired. ...

Commit Issues: Why Multi-Agent AI Needs Typed Finality, Not Another Vote

Vote counts are cheap; finality is expensive
Vote. That is the comfortable answer whenever multiple AI agents disagree. Ask ten agents, collect ten outputs, pick the majority, maybe weight by confidence, then call the result “robust.” It has the pleasant managerial smell of a committee decision. Everyone participated, something won, a spreadsheet can be made. ...

Prompt and Order: Why LLM Trading Needs a Factory, Not a Fortune Teller

Orders are where trading systems stop sounding intelligent and start spending money. A model can narrate the market beautifully. It can explain momentum, liquidity, volatility regimes, inventory pressure, and the great moral tragedy of being early. None of that matters if the final system places the wrong limit order, sizes too aggressively, fills only in a fantasy simulator, or wins a backtest because it tried enough variants to accidentally find one that looked divine. ...

Search, Critique, Repeat: Critic-R Turns RAG Complaints into Retriever Training

Search failure is boring until it becomes expensive. A research agent asks for evidence. The retriever returns documents. The reasoning model reads them, continues writing, and eventually produces a confident answer. Somewhere in the middle, the evidence was slightly wrong: not irrelevant enough to trigger an obvious failure, not useful enough to support the next reasoning step. The agent proceeds anyway, because that is what agents do when we dress up uncertainty as workflow automation. ...

Scaffold and Ladder: Why AI Agents Need Meta-Reasoning, Not Longer Monologues

Workflow is where AI agents usually stop looking magical. Ask one to summarize a short memo, and it behaves like a competent intern with suspiciously fast typing. Ask it to investigate a compliance question across policies, contract clauses, ticket histories, and messy attachments, and the illusion starts to wobble. The agent searches once, reads too much at once, jumps to a plausible answer, and then politely explains the wrong conclusion with the confidence of a junior consultant who has discovered formatting. ...