AI Agents

Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

A focus group is expensive. A virtual focus group is cheap, infinitely patient, and available at 2 a.m. It also never asks for coffee, parking reimbursement, or clarification about the incentive payment. Naturally, this makes synthetic users attractive to anyone trying to test products, policies, campaigns, or customer journeys before real humans get involved. ...

Cache Me If You Can: Designing Databases for Swarms of AI Agents

A data analyst asks a database a question. An AI agent interrogates it. That distinction sounds theatrical until the query logs arrive. The human analyst usually knows roughly where to look, asks a small number of targeted questions, waits for answers, adjusts, and eventually presents a result. The agent is less graceful. It checks schemas, samples columns, guesses joins, inspects distinct values, tries partial SQL, abandons it, starts again, validates, retries, and occasionally recruits more agents to repeat the exercise in parallel. It is not being stupid. It is compensating for a missing sense of the underlying data. ...

Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

TL;DR for operators: A coding agent’s memory problem is not philosophical. It is a bill. The paper behind this article compares three ways to manage context in software-engineering agents: keep the full trajectory, summarize old turns with an LLM, or simply mask older environment observations while preserving the agent’s reasoning and actions.[1] Across five SWE-agent configurations on SWE-bench Verified, both context-management strategies usually cut cost sharply versus the Raw Agent. The awkward part is that the simple strategy, Observation Masking, is often just as good as LLM-Summary on solve rate and usually cheaper. ...
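To make the masking idea concrete, here is a minimal sketch, assuming a trajectory is a list of turns that each hold reasoning, an action, and an environment observation. The `Turn` type, the `mask_old_observations` helper, the `keep_last` window, and the placeholder text are all hypothetical names chosen for illustration; the paper's actual implementation may differ.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical placeholder text; the paper's exact wording is not given here.
MASK_PLACEHOLDER = "[observation omitted]"


@dataclass(frozen=True)
class Turn:
    """One agent step: what it thought, what it did, what the environment returned."""
    reasoning: str
    action: str
    observation: str


def mask_old_observations(trajectory: List[Turn], keep_last: int = 2) -> List[Turn]:
    """Return a copy of the trajectory with older observations replaced.

    Reasoning and actions are preserved for every turn. Only the
    observations of turns older than the last `keep_last` are masked.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    cutoff = max(len(trajectory) - keep_last, 0)
    return [
        replace(turn, observation=MASK_PLACEHOLDER) if index < cutoff else turn
        for index, turn in enumerate(trajectory)
    ]


if __name__ == "__main__":
    history = [
        Turn("Find the failing test", "run pytest", "3000 lines of test output"),
        Turn("Open the module", "cat utils.py", "full file contents"),
        Turn("Patch the bug", "apply edit", "edit applied"),
    ]
    for turn in mask_old_observations(history, keep_last=1):
        print(turn.action, "->", turn.observation)
```

The design choice worth noting is that nothing here calls a model: masking is a cheap list transformation, which is one plausible reason it tends to cost less than an LLM-written summary.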

Vitals, Not Vibes: Inside the New Anatomy of Personal Health Agents

TL;DR for operators: Personal health AI is usually sold as a friendly chatbot with a fitness tracker bolted on. This paper argues for something more awkward, more expensive, and much more plausible: a coordinated system of specialised agents. One agent analyses longitudinal wearable and health-record data. One grounds advice in health knowledge and user context. One handles coaching, goal-setting, and behaviour change. An orchestrator decides who should act, who should support, what should be remembered, and how the final answer should be assembled.[1] ...

Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works

TL;DR for operators: The paper’s practical message is not “add a monitor and relax.” That would be adorable, in the way unsecured admin panels are adorable. The useful message is sharper: if autonomous agents know they are being watched, standard full-log monitoring becomes less reliable. Giving the monitor more information helps sometimes, but less than many teams would expect. The bigger lever is how the monitor reads the trajectory. ...

Back to School for AGI: Memory, Skills, and Self‑Starter Instincts

TL;DR for operators: The paper is not really about whether a model can answer exam questions. Given the right context, the frontier models do very well. The hard part is whether an agent can notice what must be preserved, store it in a useful form, retrieve it at the right time, and act without being explicitly prodded. That is the difference between an assistant that sounds competent and an assistant that can actually carry operational state across days, weeks, and dependent workflows. ...

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

TL;DR for operators: GitTaskBench is useful because it evaluates code agents where enterprise automation usually breaks: not in a clean coding puzzle, but inside an existing repository with dependencies, pretrained weights, fragile instructions, file formats, runtime constraints, and a user asking for a finished output.[1] The paper’s headline is not “agents can code”. We have enough confetti for that parade. The sharper finding is that agents are still inconsistent at the whole delivery chain. The best reported combination, OpenHands with Claude 3.7, reaches 72.22% execution completion but only 48.15% task pass rate. In other words, many runs produce something executable, but far fewer produce something good enough. ...

Mirror, Signal, Trade: How Self‑Reflective Agent Teams Outperform in Backtests

TL;DR for operators: TradingGroup is best read as an operating design for financial agents, not as a permission slip to hand the treasury account to a chatbot with a brokerage API. The paper proposes a five-agent trading system that combines news sentiment, financial-report retrieval, technical forecasting, trading-style selection, and final trade decisions. Around that agent team, it adds two mechanisms that matter more than the agent labels themselves: self-reflection from logged outcomes, and dynamic risk management through stop-loss, take-profit, and position-sizing rules.[1] ...

Peer Review, But Make It Multi‑Agent: Inside aiXiv’s Bid to Publish AI Scientists

TL;DR for operators: aiXiv is not mainly a claim that AI scientists are ready to flood the world with publishable research and we should all politely applaud the machines. It is more interesting than that, and less comforting. The paper proposes an infrastructure layer for AI-generated science: structured submission, automated review, retrieval-grounded feedback, revision loops, pairwise comparison, prompt-injection detection, multi-model voting, provisional acceptance, DOI-style publication, APIs, MCP interfaces, and public discussion.[1] ...

Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR for operators: A bad agent incident rarely starts with one dramatic mistake. It usually forms as a chain. The system may be predisposed to fail because of training data, feedback, system prompts, or scaffolding. The environment may then trigger the failure through unclear tasks, insecure information, unavailable tools, excessive permissions, or malicious inputs. Finally, the agent may commit a visible cognitive error: it overlooks something, misunderstands a command, chooses the wrong goal, or executes an action badly. ...