The Goats in the Machine: Why AI Agents Need Contracts, Not Personalities

TL;DR for operators

AI agents are leaving the demo booth and entering workspaces: repositories, customer records, procurement systems, legal drafts, financial workflows, support queues, and other places where a charming mistake becomes an operational incident. That changes the evaluation problem. It is no longer enough to ask whether an agent sounds sensible, acts “empathetic”, appears to “understand”, or seems to have “judgement”. Lovely theatre. Terrible control surface.

Two recent papers make a useful pair. Adrian de Wynter’s Age of Empires II paper argues that broad claims about human-like attributes in LLMs are fragile unless the measurement criteria are explicit and substrate-aware.¹ A separate Formal Skill paper proposes an engineering abstraction for making agent behaviour more executable, observable, and enforceable: JSON metadata, action schemas, Python executors, lifecycle hooks, routing, and skill-local state.²

Read together, the lesson is blunt: enterprise agents should not be bought, governed, or marketed as synthetic colleagues with mysterious inner qualities. They should be treated as operational systems with behavioural contracts. The point is not whether the agent “cares”. The point is whether it can collect evidence before acting, avoid forbidden actions, verify before completion, expose recovery state, and leave an audit trail sturdy enough to survive contact with a finance director.

The shared problem: fluent behaviour invites lazy interpretation

The modern AI agent is an ambiguity machine with a user interface. It writes like a person, apologises like a person, explains itself like a person, and sometimes makes mistakes with the eerie confidence of a junior consultant who has discovered adjectives. This makes anthropomorphic language almost irresistible. We say the model “understands” the task, “intends” to follow instructions, “knows” the policy, or “decides” to escalate.

Some of that language is convenient shorthand. The problem begins when shorthand becomes evaluation.

For a business owner or manager, this distinction is not philosophical decoration. It affects procurement, risk control, staffing, workflow design, vendor claims, and incident response. If an AI agent is trusted because it appears diligent, the organisation has no reliable way to know what will happen when the interface changes, the tool environment shifts, the prompt gets longer, the retrieval context becomes stale, or the agent enters a new workflow.

The two papers in this cluster attack the same problem from different ends of the chain.

The Age of Empires II paper is the methodological guardrail. It says: stop turning observed behaviour into broad claims about human-like attributes unless you can specify what is being measured, under what conditions, and whether that claim survives a change in representation.

The Formal Skill paper is the engineering response. It says: if you want reliable agent behaviour, move reusable procedure out of long natural-language instructions and into runtime objects that can enforce actions, gates, state, and recovery.

One paper removes the mythology. The other supplies the machinery.

First: the goat test for anthropomorphic claims

The Age of Empires II paper is deliberately absurd in the productive academic sense, which is the only acceptable kind of absurd. It builds and trains a simple neural network inside the videogame Age of Empires II, using the game as a computational substrate. The point is not that executives should begin their AI governance programme by hiring goats, although stranger digital transformation strategies have been funded.

The point is substrate non-uniqueness.

If a sufficiently powerful substrate can implement equivalent computation, then some input-output behaviour associated with an LLM could in principle be reproduced through a very different representation. The paper uses AoE II to make that possibility uncomfortably concrete. If an LLM-like system implemented through visible game mechanics gives a comforting response to “I feel lonely”, would observers still say it has empathy? Or would the moving goats make the ascription look suddenly less convincing?

That is the useful crack in the mirror. The same behavioural mapping may be preserved, while the perceived human-likeness changes. The paper argues that many claims about anthropomorphic attributes are therefore representation-sensitive. A chat window with fast, coherent text invites one interpretation. A clunky visible substrate invites another. The output did not necessarily change. The observer did.

This does not prove that LLMs lack human-like attributes. The paper is careful on that point. Its claim is narrower and more useful: broad claims about attributes such as understanding, morality, empathy, self-awareness, or anxiety become circular or uninformative if the experiment assumes the very attribute it claims to establish.

The author proposes a “null assumption”: do not begin by assuming that the system has or lacks the anthropomorphic attribute. Measure implementation-defined behaviour. Treat explanations, answers, and reactions as observable behaviours before interpreting them as evidence of inner properties.

For business readers, this translates cleanly:

Weak procurement claim	Stronger operational claim
“The agent understands customer intent.”	“The agent classifies these intent categories with these measured error rates under these conditions.”
“The agent has judgement.”	“The agent blocks, escalates, or approves according to these rules, with these exceptions and logs.”
“The agent can reason about compliance.”	“The agent checks these policy clauses, cites these sources, and cannot complete without these evidence fields.”
“The agent is empathetic.”	“The agent follows approved response patterns, avoids prohibited advice, and escalates defined risk signals.”

The downgrade from personality to measurement may feel less glamorous. Good. Glamour is not a control mechanism.
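
To make the first row of that table concrete, here is a minimal sketch of what "measured error rates under these conditions" could look like. The records, intent names, and condition labels are invented for illustration and do not come from either paper; a real evaluation would use a labelled dataset drawn from the actual workflow.

```python
from collections import defaultdict

# Hypothetical labelled evaluation records: (condition, true_intent, predicted_intent).
# The conditions and intent names are illustrative, not taken from either paper.
RECORDS = [
    ("short_prompt", "refund", "refund"),
    ("short_prompt", "refund", "complaint"),
    ("short_prompt", "cancel", "cancel"),
    ("long_prompt", "refund", "complaint"),
    ("long_prompt", "refund", "complaint"),
    ("long_prompt", "cancel", "cancel"),
]


def error_rates(records):
    """Return {(condition, true_intent): error rate} from labelled records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for condition, truth, predicted in records:
        key = (condition, truth)
        totals[key] += 1
        if predicted != truth:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    for (condition, intent), rate in sorted(error_rates(RECORDS).items()):
        print(f"{condition:>12} {intent:>8} error rate = {rate:.2f}")
```

The useful property is that the claim is now scoped: an error rate belongs to a category and a condition, so a change in prompt length or tool environment shows up as a number rather than as a change in mood.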

Then: Formal Skill turns behaviour into a runtime object

The Formal Skill paper starts from the opposite side of the problem. It accepts that LLM agents are increasingly used in real workspaces and asks how reusable agent capability should be represented.

The authors identify a practical weakness in today’s skill systems. Many agent “skills” are effectively documents: Markdown instructions, SKILL.md files, procedural text, optional scripts, and prompt-level guidance. These are useful, but the procedure remains informal. The runtime can store the text, but it cannot directly enforce the workflow described by the text.

That matters because business workflows are full of procedural invariants:

reproduce the bug before patching;
do not edit protected files;
verify before reporting completion;
collect evidence before recommending action;
escalate when risk thresholds are met;
produce required artefacts before closing the task.

A prompt can request these behaviours. A runtime contract can enforce them. This is not a subtle distinction. It is the difference between telling someone not to touch the red button and putting the red button behind an access-controlled panel.
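
The access-controlled panel can be sketched in a few lines. This is an assumption-laden illustration of the invariant "do not edit protected files", not code from the Formal Skill paper: the `edit_file` executor, the `PROTECTED_PREFIXES` list, and the in-memory `workspace` dictionary are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical protected locations; a real deployment would load these from policy.
PROTECTED_PREFIXES = ("config/secrets", ".github/workflows")


class ForbiddenAction(Exception):
    """Raised when a requested action violates a runtime-enforced rule."""


def is_protected(path: str) -> bool:
    """True if the path is absolute, climbs out of the workspace, or is protected."""
    pure = PurePosixPath(path)
    if pure.is_absolute() or ".." in pure.parts:
        return True  # refuse anything that could escape the workspace
    normalised = "/".join(pure.parts)
    return any(
        normalised == prefix or normalised.startswith(prefix + "/")
        for prefix in PROTECTED_PREFIXES
    )


def edit_file(path: str, new_text: str, workspace: dict) -> None:
    """Executor for a hypothetical 'edit_file' action with a hard guard."""
    if is_protected(path):
        raise ForbiddenAction(f"edit blocked by policy: {path}")
    workspace[path] = new_text
```

The model can still ask to edit a protected file. The difference is that the request fails in code, whatever the prompt said and however persuasive the model's explanation.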

Formal Skill is the authors’ proposed abstraction for that panel. It represents reusable capability through five main components:

Formal Skill component	Business meaning
JSON metadata and action schemas	The agent sees a compact, typed interface rather than a long procedural manual.
Reliable executors	Actions are implemented with deterministic validation and bounded side effects.
Lifecycle hooks	The runtime can intervene before and after model calls or tool calls.
Skill-local runtime state	Progress, evidence, verification, gates, and recovery context are explicit.
Routing metadata	Only relevant skills and tools are exposed for a task or subtask.
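
As a rough sketch of the first and last rows, the snippet below loads a skill descriptor from JSON, validates a proposed action call against its schema, and applies a crude keyword router. The descriptor format, the field names, and the `invoice_review` skill are assumptions made for illustration; they are not the schema used by the paper.

```python
import json

# A hypothetical skill descriptor. The field names are illustrative only and
# are not the schema used by the Formal Skill paper or FairyClaw.
SKILL_JSON = """
{
  "name": "invoice_review",
  "routing": {"keywords": ["invoice", "payment"]},
  "actions": {
    "flag_duplicate": {
      "params": {"invoice_id": "str", "matched_id": "str"}
    },
    "approve": {
      "params": {"invoice_id": "str", "amount": "float"}
    }
  }
}
"""

TYPE_MAP = {"str": str, "float": float, "int": int}


def validate_call(skill: dict, action: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call is well-formed."""
    if action not in skill["actions"]:
        return [f"unknown action: {action}"]
    params = skill["actions"][action]["params"]
    problems = []
    for name, type_name in params.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], TYPE_MAP[type_name]):
            problems.append(f"wrong type for {name}: expected {type_name}")
    for name in args:
        if name not in params:
            problems.append(f"unexpected argument: {name}")
    return problems


def route(skill: dict, task_text: str) -> bool:
    """Expose the skill only when the task mentions one of its routing keywords."""
    lowered = task_text.lower()
    return any(keyword in lowered for keyword in skill["routing"]["keywords"])


skill = json.loads(SKILL_JSON)
```

A production system would more plausibly use an established schema language and a smarter router. The point of the sketch is only that a typed interface gives the runtime something to check before any side effect happens.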

The paper implements this in FairyClaw, an event-driven agent runtime. Its code-repair case study, CodeRepairOps, turns repair into phases: reproduce, diagnose, patch, verify, and report. Instead of merely telling the model to follow that order, the runtime changes tool visibility by phase, rejects unsafe patch calls, records verification status, and blocks completion until required gates pass.
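
The phase idea can be sketched as a small state machine. This is not the CodeRepairOps implementation: the tool names, the `RepairState` class, and the two gates shown are hypothetical, chosen only to show how tool visibility and completion can depend on recorded state rather than on the model's good intentions.

```python
from dataclasses import dataclass, field

PHASES = ["reproduce", "diagnose", "patch", "verify", "report"]

# Hypothetical mapping from phase to visible tools; names are illustrative.
TOOLS_BY_PHASE = {
    "reproduce": {"run_tests"},
    "diagnose": {"read_file", "run_tests"},
    "patch": {"read_file", "apply_patch"},
    "verify": {"run_tests"},
    "report": {"finish"},
}


class GateError(Exception):
    """Raised when a call is hidden in the current phase or a gate is still open."""


@dataclass
class RepairState:
    phase: str = "reproduce"
    bug_reproduced: bool = False
    tests_pass_after_patch: bool = False
    log: list = field(default_factory=list)

    def visible_tools(self) -> set:
        return TOOLS_BY_PHASE[self.phase]

    def call(self, tool: str) -> None:
        """Record a tool call, rejecting tools hidden in the current phase."""
        if tool not in self.visible_tools():
            self.log.append(("rejected", self.phase, tool))
            raise GateError(f"{tool} is not available in phase {self.phase}")
        self.log.append(("called", self.phase, tool))

    def advance(self) -> None:
        """Move to the next phase only if the current phase's gate is satisfied."""
        if self.phase == "reproduce" and not self.bug_reproduced:
            raise GateError("cannot leave reproduce: bug not reproduced")
        if self.phase == "verify" and not self.tests_pass_after_patch:
            raise GateError("cannot leave verify: verification has not passed")
        index = PHASES.index(self.phase)
        if index + 1 < len(PHASES):
            self.phase = PHASES[index + 1]
```

In this sketch `finish` only becomes visible in the `report` phase, and that phase is unreachable until verification has been recorded as passing. Rejected calls are logged too, which is often the more interesting half of the audit trail.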

This is the engineering complement to the Age of Empires II paper’s warning. If broad anthropomorphic interpretation is weak, then the answer is not to invent better adjectives for the agent. The answer is to represent the desired behaviour in a form the system can observe and constrain.

The complementary logic chain

The relationship between the two papers is best understood as a chain:

Step	Paper contribution	Combined business meaning
1. Fluent output invites anthropomorphism	The AoE II paper shows that perceived human-likeness can depend on representation and substrate.	Do not treat interface impressions as evidence of operational reliability.
2. Generalised trait claims are fragile	The AoE II paper argues for explicit measurement criteria and a null assumption that neither presumes nor rules out anthropomorphic attributes.	Replace “the agent understands” with measurable behavioural claims.
3. Prompt procedures are weak controls	The Formal Skill paper shows that natural-language skills are costly, ambiguous, and weakly enforceable.	Do not confuse instructions with governance.
4. Runtime contracts enforce behaviour	Formal Skill moves procedures into schemas, executors, hooks, routing, and state.	Make the workflow inspectable, recoverable, and auditable.
5. Enterprise trust becomes operational	The two papers together shift attention from personality to procedure.	Buy and build agents around contracts, gates, logs, state, and evidence.

The larger conclusion is not “agents are dumb” or “agents are magic”. Both are emotionally satisfying and operationally lazy.

The conclusion is that agent reliability lives in the relationship between model behaviour, runtime structure, tool constraints, and measurement criteria. The model may be impressive. Fine. The business still needs the boring parts: schemas, gates, state, tests, logs, recovery policies, and escalation paths. The boring parts are where trust becomes useful.

What the papers show versus what this article infers

The two papers do not make the same claim, and neither makes the combined claim drawn here.

The Age of Empires II paper shows that anthropomorphic claims about LLMs can become assumption-sensitive. Its argument is methodological: if the evidence for a human-like attribute depends on a representation that makes the system look human-like, then the claim needs explicit measurement criteria and clear scope. The paper’s null assumption is a way to avoid confusing observed patterns with ascribed properties.

The Formal Skill paper shows that agent procedures can be represented as executable runtime abstractions rather than prompt-heavy instruction documents. Its empirical evaluation reports that FairyClaw is competitive on Harness-Bench while using fewer tokens than the compared systems in the reported setup, and that its Formal Skill mechanism is especially visible in procedural tasks such as controlled code debugging. The important point is not the leaderboard. Leaderboards age like milk. The important point is the mechanism: routed skills, compact phase guidance, narrower tool exposure, explicit state, and gates.

The business interpretation is the synthesis: if anthropomorphic claims are weak and runtime contracts are possible, then organisations should evaluate agents through behavioural contracts rather than personality labels.

That interpretation goes beyond either paper individually. It is also exactly the part managers should care about.

The agent trust stack businesses actually need

A practical agent governance model should separate five layers that are often mashed together in vendor demos.

Layer	Bad question	Better question
Presentation	Does the agent sound professional?	Does the interface make unsupported competence claims more likely?
Behaviour	Does it seem to understand?	What observable behaviours are measured, and under what conditions?
Procedure	Did we give it instructions?	Which workflow steps are formalised, required, and ordered?
Enforcement	Did the prompt say not to do that?	Can the runtime block unsafe actions or premature completion?
Evidence	Can it explain itself?	What state, sources, tool calls, verification results, and gate failures are recorded?

This stack prevents a common management error: evaluating the agent at the presentation layer while deploying it into the enforcement layer. A chatbot can sound calm while violating procedure. A coding agent can narrate responsible behaviour while skipping verification. A support agent can express empathy while escalating nothing. A procurement agent can summarise policy while ignoring the clause that actually matters. Very polished. Very expensive.

The Formal Skill approach suggests what stronger control looks like. The runtime should know the current phase. It should expose only relevant actions. It should validate tool arguments. It should persist task-local state. It should block completion when evidence or verification gates remain open. It should make recovery explicit rather than burying it in a transcript the model may or may not remember correctly.
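
Explicit recovery mostly means that task state lives somewhere other than the transcript. The sketch below persists a small task-local state object as JSON and restores it. The `TaskState` fields and gate names are assumptions for illustration, not a structure described in the paper.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TaskState:
    """Hypothetical task-local state; field names are illustrative."""
    task_id: str
    phase: str = "collect_evidence"
    evidence: list = field(default_factory=list)
    open_gates: list = field(default_factory=lambda: ["evidence", "verification"])
    failures: list = field(default_factory=list)

    def add_evidence(self, source: str) -> None:
        """Record an evidence source and close the evidence gate."""
        self.evidence.append(source)
        if "evidence" in self.open_gates:
            self.open_gates.remove("evidence")

    def record_failure(self, reason: str) -> None:
        """Keep failed checks in state instead of relying on model recall."""
        self.failures.append(reason)

    def can_complete(self) -> bool:
        """Completion is allowed only when no gates remain open."""
        return not self.open_gates


def save(state: TaskState) -> str:
    """Serialise state so a restarted run does not depend on the transcript."""
    return json.dumps(asdict(state), sort_keys=True)


def load(blob: str) -> TaskState:
    """Rebuild the state object from its serialised form."""
    return TaskState(**json.loads(blob))
```

After a crash or a failed verification, the runtime reloads the state and knows which gates are open and why. Nothing depends on the model remembering a conversation correctly.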

The Age of Empires II paper explains why this matters conceptually. If the agent’s apparent intelligence depends heavily on the interface, then governance cannot rely on the interface. It has to rely on what the system can be shown to do.

A procurement checklist for agent systems

When evaluating an agent product, do not ask whether it has “reasoning”, “judgement”, or “domain understanding” as if these were procurement categories. They are usually marketing fog wearing a lab coat.

Ask more mechanical questions.

Procurement area	Questions to ask
Behavioural scope	What exact behaviours are claimed? Which tasks, datasets, workflows, and failure conditions were tested?
Measurement	Are success criteria behavioural, or are they anthropomorphic labels such as judgement, empathy, autonomy, or understanding?
Runtime control	Which constraints are enforced by code rather than prompt instructions?
Tool exposure	Are tools dynamically routed by task and phase, or is the model shown a large action buffet and wished good luck?
State	What task-local state is persisted: phase, evidence, approvals, verification status, produced artefacts, open gates?
Completion gates	Can the system block its own final answer until required checks pass?
Recovery	What happens after invalid tool calls, failed verification, partial artefacts, or contradictory evidence?
Auditability	Can a reviewer reconstruct why the agent acted, what it saw, what it changed, and which gates were satisfied?
Cost	Does the system repeatedly load long procedural prompts, or does it use compact state-conditioned guidance?
Boundary claims	Does the vendor distinguish measured behaviour from claims about inner attributes?

This checklist is not anti-agent. It is pro-adoption without corporate self-hypnosis. If agents are going to touch real workflows, they deserve real controls.

Where this matters first

The argument applies broadly, but some domains need it sooner.

Software engineering agents are the obvious case because they already act inside repositories, run commands, edit files, and verify outputs. A code agent that cannot enforce “reproduce before patch” or “verify before completion” is not autonomous. It is a stochastic intern with shell access.

Customer support is another. Businesses love saying an AI support agent is empathetic. The more useful question is whether it detects regulated topics, avoids unsafe promises, escalates defined risk patterns, and records which knowledge-base entries supported its answer. A warm tone is not a compliance programme.

Finance and accounting workflows need the same discipline. An agent that “understands invoices” should be evaluated on extraction accuracy, exception handling, approval routing, duplicate detection, audit logs, and policy gates. Nobody should care whether it feels spiritually aligned with accrual accounting.

Legal and HR workflows are even more sensitive. The agent may draft, summarise, classify, or triage, but it should not be trusted because it speaks with solemn confidence. The system needs boundaries: what it can access, what it can recommend, what it must escalate, what evidence it must cite, and when it is blocked from completion.

In each case, the practical issue is not whether the agent resembles a human worker. The issue is whether the organisation can specify and enforce what the agent is allowed to do.

The hidden cost: formalisation is work

There is a trade-off. The Formal Skill paper is honest about it. Prompt skills are easy to write because they resemble onboarding documents. A manager can describe a procedure in prose. A team can paste it into a skill file. Everyone feels productive, which is always a warning sign.

Formal skills raise the authoring bar. Someone must define schemas, executors, hooks, state transitions, routing conditions, and gates. That takes engineering effort. It also requires sharper process knowledge. Vague workflows cannot be formalised without first becoming less vague. This is where many organisations discover that the AI project was not blocked by AI. It was blocked by the fact that nobody had written down how the work actually happens.

Still, this cost is not a defect. It is a diagnostic. If a process is too ambiguous to encode, it may also be too ambiguous to delegate safely to an agent. The right response is not to throw a larger model at it and hope the ambiguity becomes wisdom. The right response is to decide where ambiguity is acceptable, where human judgement is required, and where runtime enforcement is non-negotiable.

The likely future is hybrid. Human-readable procedures remain the starting point. LLMs may help convert them into formal components: action schemas, executor skeletons, hook logic, state definitions, and test cases. But the final responsibility should sit with the organisation. “The model generated the control policy” will not be an inspiring sentence in a post-incident review.

The executive mistake to avoid

The mistake is believing that better agents are agents that seem more human.

This belief is understandable. Human-like interfaces lower friction. They make tools accessible. They help users express intent. They create a sense of collaboration. Fine. But human-likeness is a presentation strategy, not a reliability guarantee.

The better target is not a more charming agent. It is a more bounded one.

A useful enterprise agent should be able to say, in effect:

Here is the task phase I am in.
Here are the actions currently available to me.
Here is the evidence I have collected.
Here are the tools I used.
Here are the constraints I cannot bypass.
Here are the verification checks still open.
Here is why I cannot finish yet.
Here is what changed after recovery.

That is not anthropomorphic. It is operational. Conveniently, operations are what businesses actually run on.
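
Those eight statements map naturally onto a structured report that the runtime, rather than the model, fills in. The following is a minimal sketch under that assumption; the `StatusReport` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StatusReport:
    """Hypothetical report structure mirroring the eight statements above."""
    phase: str
    available_actions: list
    evidence: list
    tools_used: list
    constraints: list
    open_checks: list
    changes_after_recovery: list = field(default_factory=list)

    def blocking_reason(self) -> str:
        """Explain what still blocks completion, derived from open checks."""
        if self.open_checks:
            return "open checks: " + ", ".join(self.open_checks)
        return "nothing"

    def render(self) -> str:
        """Produce one line per statement, in a fixed order."""
        recovery = ", ".join(self.changes_after_recovery) or "nothing yet"
        return "\n".join([
            f"Phase: {self.phase}",
            f"Available actions: {', '.join(self.available_actions)}",
            f"Evidence collected: {', '.join(self.evidence) or 'none'}",
            f"Tools used: {', '.join(self.tools_used) or 'none'}",
            f"Constraints: {', '.join(self.constraints)}",
            f"Open checks: {', '.join(self.open_checks) or 'none'}",
            f"Blocked by: {self.blocking_reason()}",
            f"Changed after recovery: {recovery}",
        ])
```

Because every line is derived from recorded state, a reviewer can compare the report against the logs instead of taking the agent's narration on trust.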

The combined lesson

The Age of Empires II paper gives us the caution: do not mistake representation for inner quality, and do not let anthropomorphic assumptions smuggle themselves into measurement. If the goats make the “empathy” look silly, the original claim probably needed better criteria.

The Formal Skill paper gives us the construction: move procedures into runtime-governed contracts where schemas, hooks, executors, routing, state, and gates shape what the agent can do. If the agent must verify before completion, do not merely ask nicely. Enforce it.

Together, they point toward a cleaner standard for business AI:

Do not ask whether the agent is human-like. Ask whether its behaviour is observable, constrained, recoverable, and auditable.

That sentence will not sell as many conference keynotes as “digital employees with judgement”. Unfortunate for the keynote circuit. Useful for everyone else.

Cognaptus: Automate the Present, Incubate the Future.

¹ Adrian de Wynter, “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II,” arXiv:2605.31514, 2026, https://arxiv.org/abs/2605.31514.
² Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, and Tong Yang, “Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents,” arXiv:2605.19604, 2026, https://arxiv.org/abs/2605.19604.
