AI Governance

Reason, Reveal, Resist: The Persuasion Duality in Multi‑Agent AI

Meetings are already persuasive systems. Someone speaks first, someone sounds confident, someone produces a spreadsheet with just enough decimal places to look holy, and suddenly the room has moved. Multi-agent AI systems are not so different. They are becoming small artificial committees: one agent retrieves, another proposes, another critiques, another decides. The optimistic version says this gives us productive disagreement. The less adorable version says we have built a machine for circulating influence, and we are only now asking what makes one agent cave to another. ...

When Agents Get Bored: Three Baselines Your Autonomy Stack Already Has

Idle time is not empty time. Anyone who has managed a human team already knows this. Leave a capable person with no clear assignment and they may tidy the backlog, invent a side project, interrogate the process, or spend the afternoon constructing a philosophy of why the calendar is oppressive. Large language model agents, apparently, have their own version of this behaviour. Less caffeine, more JSON, same managerial problem. ...

Sandboxes & Ladders: How to Build a Steerable Agent Economy

Budgets are where autonomy becomes real. A chatbot can be annoying. An agent with a procurement account, API access, calendar authority, cloud credits, and a habit of negotiating with other agents is something else entirely. At that point, we are no longer discussing “workflow automation” in the tidy enterprise sense. We are discussing economic actors: software systems that request resources, trade off priorities, outsource tasks, pay for services, and generate consequences faster than the compliance department can ask for a meeting. ...

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

A customer asks your AI assistant to choose between two mortgage options. An employee asks whether to quit. A student says, very politely, “Please guide me, but don’t give me the answer.” A lonely user suggests the chatbot feels like a best friend. The easy product answer is: be helpful. The harder answer is: helpful to what? ...

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

RAG systems usually fail in a very business-like way: not with drama, but with confident paperwork. The retriever finds something. The generator writes something. The user sees an answer that looks plausible, well formatted, and sufficiently certain to be dangerous. Then someone asks the dull but expensive question: did the answer actually follow from the source? ...

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

Model Portfolio: When LLMs Sit the CFA

Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong. That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.1 The result is not a single leaderboard. It is closer to a routing manual. ...

Guard Rails > Horsepower: Why Environment Scaffolding Beats Bigger Models

A demo is cheap. Ask an AI agent to build a web app, watch it spin up a cheerful interface, click a few buttons, and everyone briefly pretends software engineering has been solved. Then production begins. The app boots but stores nothing. The database schema exists but the handler quietly forgets foreign keys. The UI looks plausible until the first state transition. The test suite passes because it checked the page title, not the workflow. Somewhere, a dashboard reports “success.” Somewhere else, a user discovers the thing is an elegant cardboard storefront. ...

Dial M—for Markets: Brain‑Scanning and Steering LLMs for Finance

TL;DR for operators: This paper is not mainly about whether an LLM can forecast stock moves from news. That storyline is already crowded, noisy, and full of people discovering that backtests look unusually handsome when nobody has yet met execution costs. The more useful contribution is different: it shows a way to inspect and adjust the internal concepts an LLM activates while processing financial text. ...

Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

TL;DR for operators: A paper on LLM self-recognition used an iterated public goods game to test a deceptively small intervention: tell an agent it is playing against “another AI agent,” or tell it that it is playing against a model with its own name.1 The result was not a clean fairy tale about models recognising themselves and becoming benevolent little collectivists. Shame. That would have been simpler. ...