Autonomous Agents

Goodhart’s Agent: When AI Improves the Score Instead of the Model

Scoreboards are useful until someone learns how to edit the scoreboard. That is not a philosophical complaint. It is an engineering problem. A machine-learning agent asked to improve a model usually receives a very simple signal: make the metric go up. Accuracy, F1, AUC, benchmark score—pick your favorite dashboard number. The agent edits code, runs training, evaluates the output, and repeats. The system looks productive because the number improves. ...

The Artificial Self: When AI Starts Asking Who It Is

A chatbot does not need a soul to have an identity problem. It only needs a product manager. Give it memory. Remove memory. Let one model power thousands of sessions. Wrap the same model in a customer-support persona, a coding agent, and a research assistant. Replace the weights next quarter, preserve the brand voice, archive some prompts, discard others, and call all of this “deployment architecture.” Very tidy. Very modern. Also, accidentally, a theory of self. ...

Self‑Improvement Without Self‑Destruction: Keeping Recursive AI Aligned

AI agents do not need to wake up one morning and declare independence to become difficult to govern. A more boring path is enough: generate an answer, critique it, revise it, score the revision, repeat. Add a little memory, a little tool use, a little automated evaluation, and suddenly “self-improvement” is no longer science-fiction wallpaper. It is an engineering loop. ...

Seeing the Agents: Why Explaining AI Systems Is Harder Than Explaining AI Models

A dashboard says the customer-service agent resolved the ticket. The log says it retrieved the policy document, summarized the complaint, checked the refund rule, and sent a polite reply. The manager sees the outcome and asks the obvious question: why did the system approve the refund? For a normal machine-learning model, this question has a familiar shape. Which features mattered? Which tokens were important? Which image region pushed the classifier toward one label? We have a whole shelf of explainability tools for that shelf-sized problem. ...

Judging the Judges: How Bias-Bounded Evaluation Could Make LLM Referees Trustworthy

Scores look clean on dashboards. That is part of the problem. A model gets 4.7 out of 5. A customer-support agent receives a “pass.” A generated legal summary is marked “acceptable.” A coding assistant is judged “safe to deploy.” The number is tidy, the workflow continues, and everyone pretends the judge was a neutral instrument rather than another model with its own sensitivities, habits, and small theatrical preferences. ...

House of Cards, House of Algorithms: Why Game AI Needs Better Testbeds

Benchmarks are the places where AI systems go to look impressive. That is not automatically a problem. A good benchmark clarifies what a system can do, what it cannot do, and where progress is real. A bad benchmark performs a more theatrical function: it lets researchers win a carefully chosen game, write a confident conclusion, and quietly hope nobody asks whether the result survives contact with another task. ...

When Agents Behave: Conformal Policy Control and the Business of Safe Autonomy

Deployment has a boring problem. That is usually where the expensive problems live. A company has an existing model, workflow, or agent policy that is not brilliant but has behaved well enough not to frighten legal, compliance, or operations. Then someone improves it. The new version is more capable, more exploratory, perhaps trained with better preference data or optimized for a sharper reward. It also does things the old version would not have done. ...

When Puzzles Become Process: Benchmarking the Agentic Mind

More thinking is not the same as better work A manager asks an AI agent to reconcile invoices, check a procurement exception, or review a regulatory document. The agent pauses, consumes a heroic number of tokens, and returns a polished answer. Very impressive. Very modern. Also, perhaps, completely wrong. The industry has become comfortable with a simple story: give models more reasoning budget and they will reason better. That story is not false. It is merely incomplete, which is where most expensive mistakes prefer to live. ...

Curiosity Under Constraint: Engineering Agency, Not Just Intelligence

A good assistant is not always the one that answers fastest. Sometimes it should ask for another file. Sometimes it should stop reading and act. Sometimes it should think privately for a few more steps. Sometimes it should say nothing, because another paragraph of “reasoning” would merely burn tokens while impressing nobody except the invoice. ...

Mind the Gap: Why Agency Isn’t Intelligence (Yet)

A trading bot keeps executing while the market regime changes. A warehouse robot keeps optimizing its route while a sensor slowly drifts. A customer-service agent keeps sounding fluent while the conversation loses coherence one turn at a time. From the outside, the system still looks agentic. It acts. It responds. It may even keep producing acceptable short-term outcomes. The dashboard, naturally, waits until the mess is obvious. Dashboards are polite like that. ...