Harnessing the Harness: When AI Stops Being a Model Problem

Glue is not glamorous.

In most AI product discussions, the model gets the spotlight. The harness—the scripts, prompts, validators, retry rules, state files, tool adapters, and stopping criteria around the model—gets treated as plumbing. Necessary, slightly annoying, and best ignored until it leaks.

That habit is becoming expensive.

The paper Natural-Language Agent Harnesses argues that the surrounding execution system is no longer a secondary implementation detail. It is often the actual unit of agent performance, reliability, and portability.¹ The paper’s useful claim is not that “natural language replaces code.” That would be a lovely fantasy for people who have not debugged parsers, sandboxes, or file permissions lately. The sharper claim is that part of the harness can become an editable natural-language policy object, while exact execution remains in code.

That distinction matters. A natural-language harness is not just a longer prompt. It is also not a magic spell that turns governance into prose. It is closer to an operational constitution for an agent run: who acts, what state must survive, which evidence counts, when verification happens, how failures are retried, and what condition permits the system to stop.

The paper introduces two constructs. Natural-Language Agent Harnesses (NLAHs) are readable documents that describe run-level harness policy. Intelligent Harness Runtime (IHR) is the shared runtime that interprets those documents into child-agent calls, state updates, validation gates, artifact contracts, and stopping behavior. The authors then test whether this separation can work across coding, terminal-use, and computer-use benchmarks.

The result is not a simple “natural language wins” story. It is more useful than that. The comparison shows where natural language is strong, where code remains non-negotiable, and where agent teams quietly waste money by adding structure that does not actually bring the system closer to acceptance.

The real comparison is code policy, prompted policy, and runtime-executed policy

A normal summary would say: the authors propose NLAH and IHR, then show benchmark results. That is technically correct and editorially lazy.

The paper is built around a three-way comparison:

Control medium	What it means	Strength	Main weakness
Code harness	The original controller implementation: scripts, framework defaults, adapters, state machines, validators, and prompts mixed together	Deterministic control and exact operations	Policy is buried inside implementation details
Prompted NLAH	The same natural-language harness content placed into a normal agent prompt	Easy to try, no special runtime	Natural language remains mostly advisory
IHR-executed NLAH	The harness document is interpreted by a runtime with explicit semantics for child calls, state, contracts, and stopping	Readable policy becomes executable behavior	Extra overhead, weaker handoff, interpretation uncertainty

This is the right comparison because the business question is not whether prose can beat code in a knife fight. It cannot, and should not be asked to. The real question is whether the reusable policy layer of an agent system can be separated from the deterministic mechanism layer.

In production terms: can the workflow logic become something a team can read, review, test, transfer, and ablate—without pretending that validators and security boundaries should be written as vibes?

The paper’s answer is: partially yes.

What moves into language, and what must stay in code

The paper is careful about the boundary, and this is where many casual readings will go wrong.

NLAHs carry policy. They can say: create a state file before delegation; ask a verifier to inspect a candidate patch; preserve evidence before finalizing; retry only after a classified failure; keep independent branches in clean contexts; stop only when the artifact contract is satisfied.

Code still carries exact mechanism. The authors explicitly keep tests, parsers, sandboxing, benchmark adapters, artifact validators, tool execution, logging, and other precision-sensitive operations in code. Good. Nobody needs a natural-language regex engine. We already have enough ways to suffer.

The paper’s appendix formalizes this division across five layers:

Layer	Carrier	Responsibility
Base runtime code	Code	Model routing, tool schemas, bash execution, timeouts, event streams, run state, sandbox limits
Runtime policy	Fixed natural language	Shared IHR semantics: parent-child boundaries, artifact rules, completion gates, audit discipline
NLAH	Replaceable natural language	Task-family roles, stages, validation policy, recovery policy, state strategy, module composition
Scripts and adapters	Code hooks	Tests, parsers, validators, benchmark wrappers, artifact processing
Model internals	Outside NLAH	Sampling, decoding, model-internal reasoning, provider-level mechanisms

This is the core design contribution. Natural language becomes a policy layer, not an execution substrate. The runtime and scripts turn that policy into observable behavior.

That makes NLAH more serious than prompt engineering, but less grandiose than “language as software.” It is not software. It is executable policy under a runtime that still depends on software.

The main evidence says NLAHs are viable, not automatically superior

The paper tests NLAH+IHR on three benchmark families:

SWE-bench Verified, for repository-grounded software issue resolution.
Terminal-Bench 2.0, for long-horizon Linux command-line tasks.
OSWorld, for computer-use tasks in desktop environments.

The main evidence for the paper's first research question (RQ1) compares the code harness, prompted NLAH, and IHR-executed NLAH. Each score below is the primary task metric reported by that benchmark.

Benchmark / harness family	Code harness	Prompted NLAH	IHR-executed NLAH	How to read it
SWE Verified / Live-SWE	67.0	77.0	73.0	NLAH is competitive and above the code realization, though not the best in this run
Terminal-Bench 2.0 / MHTBA	36.0	57.3	53.9	Natural-language realizations outperform the transplanted code artifact under the common GPT setting
OSWorld / SeeAct	47.1	47.9	46.3	All three are essentially in the same regime

The paper directly shows that IHR-executed NLAHs can drive real multi-step agent runs without collapsing performance. That is the feasibility result.

It does not show that IHR execution always beats plain prompting. On SWE Verified and Terminal-Bench 2.0 (TB2), prompted NLAH scores higher than IHR-executed NLAH (77.0 versus 73.0, and 57.3 versus 53.9). On OSWorld, prompted NLAH is also slightly higher (47.9 versus 46.3). The more precise reading is that IHR gives the natural-language harness stronger execution semantics and auditability, while the current prototype pays extra overhead and suffers handoff loss.

That difference is important for business users. If the only metric is short-term task score, plain prompting may sometimes look better. If the metric includes inspectability, modular testing, transferability, and governance, prompted NLAH is weaker because it leaves much of the policy as instruction text rather than operational structure.

In other words, IHR is not buying free accuracy. It is buying a more inspectable control layer. Accuracy is allowed to come along, but it is not the only guest at the table.
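The gaps discussed in this section can be recomputed from the RQ1 table. The snippet below is only arithmetic on the scores quoted above; the dictionary keys and the `gaps` helper are names chosen for this example.

```python
# Scores copied from the RQ1 table above (primary task metric per benchmark).
SCORES = {
    "SWE Verified / Live-SWE": {"code": 67.0, "prompted": 77.0, "ihr": 73.0},
    "Terminal-Bench 2.0 / MHTBA": {"code": 36.0, "prompted": 57.3, "ihr": 53.9},
    "OSWorld / SeeAct": {"code": 47.1, "prompted": 47.9, "ihr": 46.3},
}


def gaps(row: dict) -> dict:
    """Score difference of IHR-executed NLAH against the other two conditions."""
    return {
        "ihr_minus_code": round(row["ihr"] - row["code"], 1),
        "ihr_minus_prompted": round(row["ihr"] - row["prompted"], 1),
    }


for name, row in SCORES.items():
    print(name, gaps(row))
```

IHR-executed NLAH sits 6.0 and 17.9 points above the code harness on the first two rows and 0.8 below it on OSWorld, while trailing prompted NLAH by 4.0, 3.4, and 1.6 points respectively.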

The compactness result is the quiet business result

The paper’s strongest representation-level result is not the headline benchmark score. It is the reduction in static harness materials.

Harness family	Code materials	NLAH materials	Interpretation
Live-SWE	60.1k tokens across 68 files	2.9k tokens across 3 files	Large controller bundle becomes a readable policy document
MHTBA	10.5k tokens across 3 files	0.8k tokens across 1 file	The harness pattern becomes compact enough to inspect and move
SeeAct	47.5k tokens across 5 files	1.4k tokens across 1 file	GUI-agent policy is separated from execution machinery

This does not mean the system becomes smaller overall. Runtime prompts, generated logs, tool outputs, and scripts still exist. The point is narrower: the reusable policy layer becomes much shorter and easier to inspect.
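To put the reduction in proportion, the ratios can be worked out from the table above. This is plain arithmetic on the reported token and file counts; the variable names are chosen for this example.

```python
# (code tokens in thousands, code files, NLAH tokens in thousands, NLAH files)
MATERIALS = {
    "Live-SWE": (60.1, 68, 2.9, 3),
    "MHTBA": (10.5, 3, 0.8, 1),
    "SeeAct": (47.5, 5, 1.4, 1),
}


def token_ratio(code_k: float, nlah_k: float) -> float:
    """How many times larger the code materials are than the NLAH materials."""
    return round(code_k / nlah_k, 1)


for family, (code_k, code_files, nlah_k, nlah_files) in MATERIALS.items():
    print(f"{family}: {token_ratio(code_k, nlah_k)}x fewer tokens, "
          f"{code_files} files -> {nlah_files}")
```

The static policy materials shrink by roughly 21x, 13x, and 34x in tokens. As the paragraph above says, this measures the policy layer only, not the whole running system.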

That is where the business relevance begins.

A company building agent workflows rarely wants to read sixty files to answer a basic question: “What exactly does this agent believe counts as done?” If the answer is scattered across controller code, implicit framework defaults, tool wrappers, and final-output parsers, then debugging becomes folklore. The senior engineer remembers why the retry rule exists. The new engineer changes it. The agent begins confidently eating its own tail. A normal Tuesday.

NLAH makes the policy object explicit. That does not guarantee correctness. But it gives teams something they can review, diff, reuse, and test as a distinct artifact.

The mechanism audit asks whether the harness leaves traces

The paper wisely does not stop at outcome scores. If NLAH+IHR is genuinely executing harness policy, the run should leave measurable traces: workflow structure, contracts, tool use, recovery behavior, state boundaries, and handoff behavior.

This is the purpose of RQ2. It is not a second benchmark leaderboard. It is a mechanism audit.

Evidence type	Likely purpose	What it supports	What it does not prove
Pattern-preservation metrics	Mechanism evidence	NLAH runs preserve recognizable prompt contracts, tool surfaces, workflow stages, and ordered workflow	That preserved structure is always optimal
Artifact contract, tool success, failed-tool continuation	Mechanism evidence	IHR turns harness clauses into observable artifacts, tool-mediated actions, and recovery behavior	That all failures are handled well
Orchestration reliability and handoff recall	Runtime bottleneck diagnosis	Parent-child execution creates real control boundaries but loses information across them	That handoff weakness is fundamental rather than engineering-related
MHTBA timeout diagnostics in the appendix	Implementation/portability diagnosis	A code artifact can encode model-specific stopping assumptions that transfer poorly	That code harnesses are generally worse than NLAHs

The numbers are instructive. On Live-SWE, IHR-executed NLAH reaches 1.000 artifact contract compliance, 0.933 tool call success, and 0.992 continuation after failed tool calls. On MHTBA, the corresponding values are 0.955, 0.928, and 0.995.

These are the good signs: contracts, tools, and recovery are being materialized.

The weak sign is handoff. Information Handoff Recall drops to 0.322 on Live-SWE and 0.553 on MHTBA under parent-child execution. Orchestration reliability is also lower for IHR-executed NLAH than for prompted execution. This is not surprising. IHR creates explicit call boundaries, and boundaries are where information goes to retire early.

This should shape how operators read the paper. The runtime makes policy auditable, but the current implementation still needs better handoff discipline. In a business workflow, that translates into a simple rule: if you split work across agents, do not trust “the previous agent probably passed the important context.” Require state files, evidence paths, and explicit acceptance records.
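A minimal sketch of that rule, assuming a JSON file as the handoff carrier. The field names in `REQUIRED_KEYS`, the file naming scheme, and both helper functions are hypothetical; the paper does not prescribe this format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical minimum a child agent must receive before it starts work.
REQUIRED_KEYS = ("task_id", "state_file", "evidence_paths", "acceptance")


def _missing(record: dict) -> list[str]:
    return [key for key in REQUIRED_KEYS if key not in record]


def write_handoff(directory: Path, record: dict) -> Path:
    """Parent side: refuse to delegate with an incomplete handoff record."""
    if _missing(record):
        raise ValueError(f"handoff record is missing: {_missing(record)}")
    path = directory / f"handoff_{record['task_id']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def read_handoff(path: Path) -> dict:
    """Child side: reopen the record from disk instead of trusting chat context."""
    record = json.loads(path.read_text(encoding="utf-8"))
    if _missing(record):
        raise ValueError(f"handoff record is missing: {_missing(record)}")
    return record


with tempfile.TemporaryDirectory() as tmp:
    handoff = write_handoff(Path(tmp), {
        "task_id": "demo-1",
        "state_file": "state/demo-1.json",
        "evidence_paths": ["logs/test_run.txt"],
        "acceptance": "all listed tests pass",
    })
    print(read_handoff(handoff)["acceptance"])
```

The point of the design is that a missing field fails loudly at the boundary, rather than surfacing later as a child agent that quietly lacks context.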

The appendix shows why code can fail by being too specific

The Terminal-Bench result is easy to misread. The code harness score for MHTBA is low under the paper’s common GPT setting: 36.0 versus 57.3 for prompted NLAH and 53.9 for IHR-executed NLAH. A careless reading would say natural language beats code.

The appendix says something more precise.

The MHTBA code artifact was originally associated with a different native setting: Claude Opus 4.6 with multiple attempts. In the paper’s controlled experiment, the same released artifact is run under a common GPT setting with one attempt. The appendix diagnoses the portability problem.

Out of 89 TB2 samples, 66 end with AgentTimeoutError. Among those timeout runs, 21 already have verifier reward 1.0, meaning the task state had satisfied the verifier but the controller failed to stop cleanly. In disagreement cases where the code harness fails but prompted NLAH or IHR-executed NLAH succeeds, the code artifact often spends hundreds of episodes, consumes tens of millions of input tokens, and accumulates many warnings that the previous response contained no tool calls.
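For scale, the counts above translate into the following shares. This is arithmetic on the three reported numbers only; the `pct` helper is a name chosen for this example.

```python
total_samples = 89             # TB2 samples in the controlled run
timeouts = 66                  # runs ending in AgentTimeoutError
timeouts_already_passing = 21  # timeout runs with verifier reward 1.0


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("timeouts as share of all samples:", pct(timeouts, total_samples))
print("already-passing share of timeouts:", pct(timeouts_already_passing, timeouts))
print("already-passing timeouts as share of all samples:",
      pct(timeouts_already_passing, total_samples))
```

About 74.2% of samples time out, and about 31.8% of those timeouts (23.6% of all samples) had already satisfied the verifier when the run was cut off.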

The mechanism is almost painfully plausible. The controller expects a particular two-step completion protocol. Under GPT, many trajectories answer the confirmation with text such as DONE but no tool call. The controller then warns about the missing tool call, the model issues a harmless no-op, and the run loops back into the completion gate until timeout.
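The loop can be reproduced in a toy simulation. Everything here is a simplified assumption: the reply format, the ten-episode budget, and both gate functions are invented for illustration and do not reproduce the released MHTBA controller. The tolerant gate is also this example's own contrast case, not a fix proposed in the paper.

```python
# Simulated model behavior: a text-only "DONE", then a harmless no-op tool call,
# repeated. Neither reply matches what the brittle gate is waiting for.
REPLIES = [
    {"text": "DONE", "tool_call": None},
    {"text": "", "tool_call": "noop"},
] * 5


def brittle_gate(replies: list, max_episodes: int = 10) -> tuple:
    """Stops only if the confirmation reply itself carries a tool call."""
    window = replies[:max_episodes]
    for episode, reply in enumerate(window, start=1):
        if reply["text"] == "DONE" and reply["tool_call"] is not None:
            return ("stopped", episode)
    return ("timeout", len(window))


def tolerant_gate(replies: list, verifier_reward: float, max_episodes: int = 10) -> tuple:
    """Stops when the model signals completion and the verifier already agrees."""
    window = replies[:max_episodes]
    for episode, reply in enumerate(window, start=1):
        if reply["text"] == "DONE" and verifier_reward >= 1.0:
            return ("stopped", episode)
    return ("timeout", len(window))


print(brittle_gate(REPLIES))
print(tolerant_gate(REPLIES, verifier_reward=1.0))
```

With the same replies, the brittle gate exhausts its budget while the tolerant gate stops on the first episode. The difference is entirely in what the controller treats as a valid completion signal.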

That is not a domain failure. It is a model-harness adaptation failure.

This appendix is one of the paper’s most practically useful sections. It shows that code harnesses can be brittle not because code is bad, but because code makes hidden behavioral assumptions precise. If those assumptions are tuned to one model’s tool-calling habits, stopping behavior, or caching protocol, portability can suffer.

Natural-language policy can sometimes preserve the high-level harness idea while avoiding an overly specific state machine. That is not a license to replace controllers with prose. It is a reminder that exactness is only valuable when it is exact about the right thing.

The ablations say state and evidence beat decorative complexity

RQ3 turns explicit NLAH modules into interventions. This is where the paper becomes especially relevant for agent product teams, because it asks which harness modules actually help.

The authors compare a Basic condition against NLAH modules such as file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, dynamic orchestration, context compression, and markdown memory. These are ablations: module-level tests under a shared runtime, not separate full-system claims.

Module	SWE Verified effect	OSWorld effect	Interpretation
File-backed state	73.0 → 75.6	44.4 → 58.3	Strongest cross-benchmark state discipline result
Evidence-backed answering	73.0 → 75.8	44.4 → 47.2	Consistently positive, modest, acceptance-aligned
Verifier	73.0 → 73.2	44.4 → 52.8	Helpful when the verifier remains close to the benchmark gate
Self-evolution	73.0 → 78.8	44.4 → 52.8	Strong positive solve-loop result
Multi-candidate search	73.0 → 71.4	44.4 → 47.2	More branching does not guarantee better control
Dynamic orchestration	73.0 → 74.6	44.4 → 47.2	Real but modest gains
Context compression	73.0 → 72.0	44.4 → 36.1	Compression can lose action-critical details
Markdown memory	73.0 → 70.2	44.4 → 50.0	Mixed; free-form memory is not the same as durable state

The pattern is clear: modules help when they shorten the path from intermediate work to final acceptance. File-backed state preserves task-relevant facts. Evidence-backed answering forces the agent to justify completion using artifacts. Self-evolution sharpens retry and update behavior. Verifier modules help when their judgment is close to the evaluator’s own acceptance condition.

By contrast, multi-candidate search adds agent calls without reliably improving performance. On SWE, agent calls jump from 1.1 to 5.7 while performance drops from 73.0 to 71.4. That is a lovely example of the enterprise automation disease: if one workflow is unreliable, create five workflows and a committee.

Context compression is also revealing. It hurts both benchmarks, especially OSWorld. The likely reason is not that summarization is evil. The reason is that compressed context can drift away from action-critical details. Durable, path-addressable state is safer than an elegant summary that forgets the one file path, validation result, or UI state that actually matters.

For product teams, the takeaway is almost boring, which is usually a sign that it is useful: preserve state, preserve evidence, align verification with acceptance, and be suspicious of extra orchestration whose main achievement is looking sophisticated in an architecture diagram.
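The ablation table above can be reduced to per-module deltas against the Basic condition. The scores are copied from the table; the `classify` labels are a reading aid invented for this example, not categories used by the paper.

```python
BASIC = (73.0, 44.4)  # (SWE Verified, OSWorld) under the Basic condition

# Module -> (SWE Verified score, OSWorld score) with the module enabled.
MODULES = {
    "file-backed state": (75.6, 58.3),
    "evidence-backed answering": (75.8, 47.2),
    "verifier": (73.2, 52.8),
    "self-evolution": (78.8, 52.8),
    "multi-candidate search": (71.4, 47.2),
    "dynamic orchestration": (74.6, 47.2),
    "context compression": (72.0, 36.1),
    "markdown memory": (70.2, 50.0),
}


def deltas(module: str) -> tuple:
    """Change versus Basic on (SWE Verified, OSWorld), rounded to one decimal."""
    swe, osworld = MODULES[module]
    return (round(swe - BASIC[0], 1), round(osworld - BASIC[1], 1))


def classify(module: str) -> str:
    d_swe, d_os = deltas(module)
    if d_swe > 0 and d_os > 0:
        return "helps both"
    if d_swe < 0 and d_os < 0:
        return "hurts both"
    return "mixed"


for name in MODULES:
    print(f"{name}: {deltas(name)} -> {classify(name)}")
```

On these numbers, five modules help on both benchmarks, context compression hurts on both, and multi-candidate search and markdown memory are mixed.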

What Cognaptus would infer for business use

The paper directly shows that NLAH+IHR can execute compact natural-language harness policies with competitive outcomes and auditable process traces across benchmarked prototypes.

Cognaptus would infer three practical design principles from this, with boundaries attached.

1. Treat workflow policy as a reviewable asset

Agent workflows should not live only in controller code and scattered prompt fragments. A production automation system benefits when the policy layer is readable:

roles and responsibilities;
task contracts;
state carriers;
validation gates;
retry rules;
evidence requirements;
stopping conditions.

This does not require adopting the exact IHR architecture. The broader lesson is representational: if a workflow policy is important enough to affect outcomes, it should be visible enough to review.

The boundary is that readable policy is not proof of executed behavior. Teams still need logs, trace replay, validators, and run-level audits.
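The checklist above can be sketched as a small reviewable object. `HarnessPolicy` and `missing_sections` are hypothetical names for this example, and the seven fields simply mirror the list; this is not the NLAH format from the paper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class HarnessPolicy:
    """One field per item in the checklist; each holds prose clauses."""
    roles: list[str] = field(default_factory=list)
    task_contracts: list[str] = field(default_factory=list)
    state_carriers: list[str] = field(default_factory=list)
    validation_gates: list[str] = field(default_factory=list)
    retry_rules: list[str] = field(default_factory=list)
    evidence_requirements: list[str] = field(default_factory=list)
    stopping_conditions: list[str] = field(default_factory=list)


def missing_sections(policy: HarnessPolicy) -> list[str]:
    """Names of sections a reviewer would find empty."""
    return [f.name for f in fields(policy) if not getattr(policy, f.name)]


draft = HarnessPolicy(
    roles=["solver edits code", "verifier inspects the candidate patch"],
    stopping_conditions=["stop only when the artifact contract is satisfied"],
)
print(missing_sections(draft))
```

A check like this only confirms that the policy says something in each section. Whether a run actually followed it still has to come from logs and validators.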

2. Use natural language where editability matters, not where precision matters

Natural language is useful for describing strategy, roles, gates, and operational intent. Code is still better for parsers, permissions, tests, sandboxes, credentials, tool schemas, and deterministic validators.

A practical split looks like this:

Put in natural-language harness policy	Keep in code or controlled configuration
When to delegate	How child processes are launched securely
What evidence must be preserved	How evidence files are parsed and validated
When a retry is allowed	Timeout, budget, and permission enforcement
Which verifier role should inspect output	The validator itself
What final contract must be satisfied	The exact acceptance checker
What state later agents must reopen	State serialization and access controls

This split is less fashionable than saying “everything becomes language.” It is also less likely to set the building on fire.
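As a minimal sketch of one row of that split, take retries: the policy decides which failures deserve another attempt, while the budget is enforced in code regardless of what the prose says. The failure class names, the cap of two retries, and `should_retry` are assumptions made for this example.

```python
# Policy side (hypothetical): failure classes the harness text allows retrying.
# In a real system this set would be derived from the reviewed policy document.
RETRYABLE_BY_POLICY = {"tool_timeout", "flaky_test"}

# Mechanism side: a hard cap that no policy wording can raise.
MAX_RETRIES = 2


def should_retry(failure_class: str, attempts_so_far: int) -> bool:
    """Retry only if the policy allows this failure class and budget remains."""
    policy_allows = failure_class in RETRYABLE_BY_POLICY
    budget_allows = attempts_so_far < MAX_RETRIES
    return policy_allows and budget_allows


print(should_retry("tool_timeout", attempts_so_far=0))
print(should_retry("tool_timeout", attempts_so_far=2))
print(should_retry("wrong_answer", attempts_so_far=0))
```

Editing the policy set changes what the agent may try again. It cannot change how many times, which is the kind of limit the right-hand column of the table is meant to protect.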

3. Evaluate modules by acceptance alignment, not architectural elegance

The ablations suggest that not all structure is useful. File-backed state and evidence-backed answering improve because they protect the acceptance path. Multi-candidate search and compression can fail because they add process without preserving the right evidence.

For business automation, the question should not be: “Can we add a planner, verifier, critic, memory module, and self-improvement loop?”

The question should be: “Which part of the current failure mode prevents acceptance, and does this module reduce that failure?”

That is the difference between engineering and ritual.

The limitations are not decorative; they define the deployment boundary

The paper’s limitations are not generic academic modesty. They are operational constraints.

First, natural language is imprecise. An NLAH clause can be under-specified, paraphrased poorly, or interpreted differently by different models. That is why the authors keep exact mechanisms in code and judge behavior through runs rather than through the text alone.

Second, IHR introduces overhead. The RQ1 process metrics show more model calls, tool calls, and token use in many NLAH settings. Some wall-clock times are still competitive because rigid code controllers may loop inefficiently, but the prototype runtime is not cost-free. Any business deployment has to measure the cost of orchestration against the value of auditability and failure recovery.

Third, handoff remains weak. The parent-child runtime makes boundaries inspectable, but it can lose information across those boundaries. This is exactly the kind of problem that becomes invisible when one long prompt carries everything. Cleaner architecture can expose dirtier interfaces.

Fourth, portable harness logic can spread risky workflows. Harnesses mediate tools, files, permissions, and delegation. Externalizing them improves reuse, but also creates surfaces for prompt injection, malicious tool grafting, and supply-chain contamination. A reusable harness library without provenance, review, sandboxing, and permission control is not a productivity platform. It is a buffet for future incident reports.

The business value is cheaper diagnosis, not just higher scores

The article this replaces had the right instinct: AI performance is shifting from model-only thinking toward system-level design. The paper makes that claim sharper.

The value of NLAH is not that it magically improves benchmark accuracy. The paper’s own numbers are more nuanced. Prompted NLAH sometimes scores better. IHR-executed NLAH sometimes costs more. Code harnesses can still be the right place for exact behavior.

The value is that harness policy becomes a thing.

A readable thing. A testable thing. A transferable thing. A module that can be ablated instead of a rumor embedded in controller code.

That matters because real AI automation failures are rarely caused by “the model is slightly less intelligent than expected.” More often, they come from dull operational mistakes: the agent forgot the evidence, misread the completion condition, retried the wrong failure, compressed away the crucial state, used a verifier misaligned with the final judge, or got stuck in a stopping protocol designed for another model.

Better models will reduce some of these failures. They will not eliminate the need for harness policy. In fact, stronger models may make harness design more important because they can act farther, touch more tools, and fail in more expensive ways.

So the paper’s practical message is not: stop improving models.

It is: stop pretending the model is the whole system.

The next serious layer of AI automation will not be only model selection, prompt style, or context stuffing. It will be harness representation: how workflows are specified, executed, inspected, transferred, and governed.

Glue is still not glamorous. But once the system depends on it, glamour is beside the point.

Cognaptus: Automate the Present, Incubate the Future.

¹ Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng, “Natural-Language Agent Harnesses,” arXiv:2603.25723v2, 18 May 2026, https://arxiv.org/html/2603.25723.
