Glue is not glamorous.
In most AI product discussions, the model gets the spotlight. The harness—the scripts, prompts, validators, retry rules, state files, tool adapters, and stopping criteria around the model—gets treated as plumbing. Necessary, slightly annoying, and best ignored until it leaks.
That habit is becoming expensive.
The paper Natural-Language Agent Harnesses argues that the surrounding execution system is no longer a secondary implementation detail. It is often the actual unit of agent performance, reliability, and portability.1 The paper’s useful claim is not that “natural language replaces code.” That would be a lovely fantasy for people who have not debugged parsers, sandboxes, or file permissions lately. The sharper claim is that part of the harness can become an editable natural-language policy object, while exact execution remains in code.
That distinction matters. A natural-language harness is not just a longer prompt. It is also not a magic spell that turns governance into prose. It is closer to an operational constitution for an agent run: who acts, what state must survive, which evidence counts, when verification happens, how failures are retried, and what condition permits the system to stop.
The paper introduces two constructs. Natural-Language Agent Harnesses (NLAHs) are readable documents that describe run-level harness policy. Intelligent Harness Runtime (IHR) is the shared runtime that interprets those documents into child-agent calls, state updates, validation gates, artifact contracts, and stopping behavior. The authors then test whether this separation can work across coding, terminal-use, and computer-use benchmarks.
The result is not a simple “natural language wins” story. It is more useful than that. The comparison shows where natural language is strong, where code remains non-negotiable, and where agent teams quietly waste money by adding structure that does not actually bring the system closer to acceptance.
The real comparison is code policy, prompted policy, and runtime-executed policy
A normal summary would say: the authors propose NLAH and IHR, then show benchmark results. That is technically correct and editorially lazy.
The paper is built around a three-way comparison:
| Control medium | What it means | Strength | Main weakness |
|---|---|---|---|
| Code harness | The original controller implementation: scripts, framework defaults, adapters, state machines, validators, and prompts mixed together | Deterministic control and exact operations | Policy is buried inside implementation details |
| Prompted NLAH | The same natural-language harness content placed into a normal agent prompt | Easy to try, no special runtime | Natural language remains mostly advisory |
| IHR-executed NLAH | The harness document is interpreted by a runtime with explicit semantics for child calls, state, contracts, and stopping | Readable policy becomes executable behavior | Extra overhead, weaker handoff, interpretation uncertainty |
This is the right comparison because the business question is not whether prose can beat code in a knife fight. It cannot, and should not be asked to. The real question is whether the reusable policy layer of an agent system can be separated from the deterministic mechanism layer.
In production terms: can the workflow logic become something a team can read, review, test, transfer, and ablate—without pretending that validators and security boundaries should be written as vibes?
The paper’s answer is: partially yes.
What moves into language, and what must stay in code
The paper is careful about the boundary, and this is where many casual readings will go wrong.
NLAHs carry policy. They can say: create a state file before delegation; ask a verifier to inspect a candidate patch; preserve evidence before finalizing; retry only after a classified failure; keep independent branches in clean contexts; stop only when the artifact contract is satisfied.
Code still carries exact mechanism. The authors explicitly keep tests, parsers, sandboxing, benchmark adapters, artifact validators, tool execution, logging, and other precision-sensitive operations in code. Good. Nobody needs a natural-language regex engine. We already have enough ways to suffer.
The paper’s appendix formalizes this division across five layers:
| Layer | Carrier | Responsibility |
|---|---|---|
| Base runtime code | Code | Model routing, tool schemas, bash execution, timeouts, event streams, run state, sandbox limits |
| Runtime policy | Fixed natural language | Shared IHR semantics: parent-child boundaries, artifact rules, completion gates, audit discipline |
| NLAH | Replaceable natural language | Task-family roles, stages, validation policy, recovery policy, state strategy, module composition |
| Scripts and adapters | Code hooks | Tests, parsers, validators, benchmark wrappers, artifact processing |
| Model internals | Outside NLAH | Sampling, decoding, model-internal reasoning, provider-level mechanisms |
This is the core design contribution. Natural language becomes a policy layer, not an execution substrate. The runtime and scripts turn that policy into observable behavior.
That makes NLAH more serious than prompt engineering, but less grandiose than “language as software.” It is not software. It is executable policy under a runtime that still depends on software.
The main evidence says NLAHs are viable, not automatically superior
The paper tests NLAH+IHR on three benchmark families:
- SWE-bench Verified, for repository-grounded software issue resolution.
- Terminal-Bench 2.0, for long-horizon Linux command-line tasks.
- OSWorld, for computer-use tasks in desktop environments.
The main RQ1 evidence compares the code harness, prompted NLAH, and IHR-executed NLAH. The benchmark score is the primary task metric reported by each benchmark.
| Benchmark / harness family | Code harness | Prompted NLAH | IHR-executed NLAH | How to read it |
|---|---|---|---|---|
| SWE Verified / Live-SWE | 67.0 | 77.0 | 73.0 | NLAH is competitive and above the code realization, though not the best in this run |
| Terminal-Bench 2.0 / MHTBA | 36.0 | 57.3 | 53.9 | Natural-language realizations outperform the transplanted code artifact under the common GPT setting |
| OSWorld / SeeAct | 47.1 | 47.9 | 46.3 | All three are essentially in the same regime |
The paper directly shows that IHR-executed NLAHs can drive real multi-step agent runs without collapsing performance. That is the feasibility result.
It does not show that NLAH always beats prompting. On SWE and TB2, prompted NLAH scores higher than IHR-executed NLAH. On OSWorld, prompted NLAH is also slightly higher. The more precise reading is that IHR gives the natural-language harness stronger execution semantics and auditability, while the current prototype pays extra overhead and suffers handoff loss.
That difference is important for business users. If the only metric is short-term task score, plain prompting may sometimes look better. If the metric includes inspectability, modular testing, transferability, and governance, prompted NLAH is weaker because it leaves much of the policy as instruction text rather than operational structure.
In other words, IHR is not buying free accuracy. It is buying a more inspectable control layer. Accuracy is allowed to come along, but it is not the only guest at the table.
The compactness result is the quiet business result
The paper’s strongest representation-level result is not the headline benchmark score. It is the reduction in static harness materials.
| Harness family | Code materials | NLAH materials | Interpretation |
|---|---|---|---|
| Live-SWE | 60.1k tokens across 68 files | 2.9k tokens across 3 files | Large controller bundle becomes a readable policy document |
| MHTBA | 10.5k tokens across 3 files | 0.8k tokens across 1 file | The harness pattern becomes compact enough to inspect and move |
| SeeAct | 47.5k tokens across 5 files | 1.4k tokens across 1 file | GUI-agent policy is separated from execution machinery |
This does not mean the system becomes smaller overall. Runtime prompts, generated logs, tool outputs, and scripts still exist. The point is narrower: the reusable policy layer becomes much shorter and easier to inspect.
That is where the business relevance begins.
A company building agent workflows rarely wants to read sixty files to answer a basic question: “What exactly does this agent believe counts as done?” If the answer is scattered across controller code, implicit framework defaults, tool wrappers, and final-output parsers, then debugging becomes folklore. The senior engineer remembers why the retry rule exists. The new engineer changes it. The agent begins confidently eating its own tail. A normal Tuesday.
NLAH makes the policy object explicit. That does not guarantee correctness. But it gives teams something they can review, diff, reuse, and test as a distinct artifact.
The mechanism audit asks whether the harness leaves traces
The paper wisely does not stop at outcome scores. If NLAH+IHR is genuinely executing harness policy, the run should leave measurable traces: workflow structure, contracts, tool use, recovery behavior, state boundaries, and handoff behavior.
This is the purpose of RQ2. It is not a second benchmark leaderboard. It is a mechanism audit.
| Evidence type | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Pattern-preservation metrics | Mechanism evidence | NLAH runs preserve recognizable prompt contracts, tool surfaces, workflow stages, and ordered workflow | That preserved structure is always optimal |
| Artifact contract, tool success, failed-tool continuation | Mechanism evidence | IHR turns harness clauses into observable artifacts, tool-mediated actions, and recovery behavior | That all failures are handled well |
| Orchestration reliability and handoff recall | Runtime bottleneck diagnosis | Parent-child execution creates real control boundaries but loses information across them | That handoff weakness is fundamental rather than engineering-related |
| MHTBA timeout diagnostics in the appendix | Implementation/portability diagnosis | A code artifact can encode model-specific stopping assumptions that transfer poorly | That code harnesses are generally worse than NLAHs |
The numbers are instructive. On Live-SWE, IHR-executed NLAH reaches 1.000 artifact contract compliance, 0.933 tool call success, and 0.992 continuation after failed tool calls. On MHTBA, the corresponding values are 0.955, 0.928, and 0.995.
These are the good signs: contracts, tools, and recovery are being materialized.
The weak sign is handoff. Information Handoff Recall drops to 0.322 on Live-SWE and 0.553 on MHTBA under parent-child execution. Orchestration reliability is also lower for NLAH than for prompted execution. This is not surprising. IHR creates explicit call boundaries, and boundaries are where information goes to retire early.
This should shape how operators read the paper. The runtime makes policy auditable, but the current implementation still needs better handoff discipline. In a business workflow, that translates into a simple rule: if you split work across agents, do not trust “the previous agent probably passed the important context.” Require state files, evidence paths, and explicit acceptance records.
The appendix shows why code can fail by being too specific
The Terminal-Bench result is easy to misread. The code harness score for MHTBA is low under the paper’s common GPT setting: 36.0 versus 57.3 for prompted NLAH and 53.9 for IHR-executed NLAH. A careless reading would say natural language beats code.
The appendix says something more precise.
The MHTBA code artifact was originally associated with a different native setting: Claude Opus 4.6 with multiple attempts. In the paper’s controlled experiment, the same released artifact is run under a common GPT setting with one attempt. The appendix diagnoses the portability problem.
Out of 89 TB2 samples, 66 end with AgentTimeoutError. Among those timeout runs, 21 already have verifier reward 1.0, meaning the task state had satisfied the verifier but the controller failed to stop cleanly. In disagreement cases where code fails but Prompt or NLAH succeeds, the code artifact often spends hundreds of episodes, consumes tens of millions of input tokens, and accumulates many warnings that the previous response contained no tool calls.
The mechanism is almost painfully plausible. The controller expects a particular two-step completion protocol. Under GPT, many trajectories answer the confirmation with text such as DONE but no tool call. The controller then warns about the missing tool call, the model issues a harmless no-op, and the run loops back into the completion gate until timeout.
That is not a domain failure. It is a model-harness adaptation failure.
This appendix is one of the paper’s most practically useful sections. It shows that code harnesses can be brittle not because code is bad, but because code makes hidden behavioral assumptions precise. If those assumptions are tuned to one model’s tool-calling habits, stopping behavior, or caching protocol, portability can suffer.
Natural-language policy can sometimes preserve the high-level harness idea while avoiding an overly specific state machine. That is not a license to replace controllers with prose. It is a reminder that exactness is only valuable when it is exact about the right thing.
The ablations say state and evidence beat decorative complexity
RQ3 turns explicit NLAH modules into interventions. This is where the paper becomes especially relevant for agent product teams, because it asks which harness modules actually help.
The authors compare a Basic condition against NLAH modules such as file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, dynamic orchestration, context compression, and markdown memory. These are ablations: module-level tests under a shared runtime, not separate full-system claims.
| Module | SWE Verified effect | OSWorld effect | Interpretation |
|---|---|---|---|
| File-backed state | 73.0 → 75.6 | 44.4 → 58.3 | Strongest cross-benchmark state discipline result |
| Evidence-backed answering | 73.0 → 75.8 | 44.4 → 47.2 | Consistently positive, modest, acceptance-aligned |
| Verifier | 73.0 → 73.2 | 44.4 → 52.8 | Helpful when the verifier remains close to the benchmark gate |
| Self-evolution | 73.0 → 78.8 | 44.4 → 52.8 | Strong positive solve-loop result |
| Multi-candidate search | 73.0 → 71.4 | 44.4 → 47.2 | More branching does not guarantee better control |
| Dynamic orchestration | 73.0 → 74.6 | 44.4 → 47.2 | Real but modest gains |
| Context compression | 73.0 → 72.0 | 44.4 → 36.1 | Compression can lose action-critical details |
| Markdown memory | 73.0 → 70.2 | 44.4 → 50.0 | Mixed; free-form memory is not the same as durable state |
The pattern is clear: modules help when they shorten the path from intermediate work to final acceptance. File-backed state preserves task-relevant facts. Evidence-backed answering forces the agent to justify completion using artifacts. Self-evolution sharpens retry and update behavior. Verifier modules help when their judgment is close to the evaluator’s own acceptance condition.
By contrast, multi-candidate search adds agent calls without reliably improving performance. On SWE, agent calls jump from 1.1 to 5.7 while performance drops from 73.0 to 71.4. That is a lovely example of the enterprise automation disease: if one workflow is unreliable, create five workflows and a committee.
Context compression is also revealing. It hurts both benchmarks, especially OSWorld. The likely reason is not that summarization is evil. The reason is that compressed context can drift away from action-critical details. Durable, path-addressable state is safer than an elegant summary that forgets the one file path, validation result, or UI state that actually matters.
For product teams, the takeaway is almost boring, which is usually a sign that it is useful: preserve state, preserve evidence, align verification with acceptance, and be suspicious of extra orchestration whose main achievement is looking sophisticated in an architecture diagram.
What Cognaptus would infer for business use
The paper directly shows that NLAH+IHR can execute compact natural-language harness policies with competitive outcomes and auditable process traces across benchmarked prototypes.
Cognaptus would infer three practical design principles from this, with boundaries attached.
1. Treat workflow policy as a reviewable asset
Agent workflows should not live only in controller code and scattered prompt fragments. A production automation system benefits when the policy layer is readable:
- roles and responsibilities;
- task contracts;
- state carriers;
- validation gates;
- retry rules;
- evidence requirements;
- stopping conditions.
This does not require adopting the exact IHR architecture. The broader lesson is representational: if a workflow policy is important enough to affect outcomes, it should be visible enough to review.
The boundary is that readable policy is not proof of executed behavior. Teams still need logs, trace replay, validators, and run-level audits.
2. Use natural language where editability matters, not where precision matters
Natural language is useful for describing strategy, roles, gates, and operational intent. Code is still better for parsers, permissions, tests, sandboxes, credentials, tool schemas, and deterministic validators.
A practical split looks like this:
| Put in natural-language harness policy | Keep in code or controlled configuration |
|---|---|
| When to delegate | How child processes are launched securely |
| What evidence must be preserved | How evidence files are parsed and validated |
| When a retry is allowed | Timeout, budget, and permission enforcement |
| Which verifier role should inspect output | The validator itself |
| What final contract must be satisfied | The exact acceptance checker |
| What state later agents must reopen | State serialization and access controls |
This split is less fashionable than saying “everything becomes language.” It is also less likely to set the building on fire.
3. Evaluate modules by acceptance alignment, not architectural elegance
The ablations suggest that not all structure is useful. File-backed state and evidence-backed answering improve because they protect the acceptance path. Multi-candidate search and compression can fail because they add process without preserving the right evidence.
For business automation, the question should not be: “Can we add a planner, verifier, critic, memory module, and self-improvement loop?”
The question should be: “Which part of the current failure mode prevents acceptance, and does this module reduce that failure?”
That is the difference between engineering and ritual.
The limitations are not decorative; they define the deployment boundary
The paper’s limitations are not generic academic modesty. They are operational constraints.
First, natural language is imprecise. An NLAH clause can be under-specified, paraphrased poorly, or interpreted differently by different models. That is why the authors keep exact mechanisms in code and judge behavior through runs rather than through the text alone.
Second, IHR introduces overhead. The RQ1 process metrics show more model calls, tool calls, and token use in many NLAH settings. Some wall-clock times are still competitive because rigid code controllers may loop inefficiently, but the prototype runtime is not cost-free. Any business deployment has to measure the cost of orchestration against the value of auditability and failure recovery.
Third, handoff remains weak. The parent-child runtime makes boundaries inspectable, but it can lose information across those boundaries. This is exactly the kind of problem that becomes invisible when one long prompt carries everything. Cleaner architecture can expose dirtier interfaces.
Fourth, portable harness logic can spread risky workflows. Harnesses mediate tools, files, permissions, and delegation. Externalizing them improves reuse, but also creates surfaces for prompt injection, malicious tool grafting, and supply-chain contamination. A reusable harness library without provenance, review, sandboxing, and permission control is not a productivity platform. It is a buffet for future incident reports.
The business value is cheaper diagnosis, not just higher scores
The article this replaces had the right instinct: AI performance is shifting from model-only thinking toward system-level design. The paper makes that claim sharper.
The value of NLAH is not that it magically improves benchmark accuracy. The paper’s own numbers are more nuanced. Prompted NLAH sometimes scores better. IHR-executed NLAH sometimes costs more. Code harnesses can still be the right place for exact behavior.
The value is that harness policy becomes a thing.
A readable thing. A testable thing. A transferable thing. A module that can be ablated instead of a rumor embedded in controller code.
That matters because real AI automation failures are rarely caused by “the model is slightly less intelligent than expected.” More often, they come from dull operational mistakes: the agent forgot the evidence, misread the completion condition, retried the wrong failure, compressed away the crucial state, used a verifier misaligned with the final judge, or got stuck in a stopping protocol designed for another model.
Better models will reduce some of these failures. They will not eliminate the need for harness policy. In fact, stronger models may make harness design more important because they can act farther, touch more tools, and fail in more expensive ways.
So the paper’s practical message is not: stop improving models.
It is: stop pretending the model is the whole system.
The next serious layer of AI automation will not be only model selection, prompt style, or context stuffing. It will be harness representation: how workflows are specified, executed, inspected, transferred, and governed.
Glue is still not glamorous. But once the system depends on it, glamour is beside the point.
Cognaptus: Automate the Present, Incubate the Future.
-
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng, “Natural-Language Agent Harnesses,” arXiv:2603.25723v2, 18 May 2026, https://arxiv.org/html/2603.25723. ↩︎