Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point.

A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best.

Then the build breaks.

This is not a surprising story anymore. It is the standard operating texture of LLM-based software automation: impressive partial competence surrounded by small, expensive lies. The industry’s reflex has been equally familiar: use a bigger model, provide more context, extend the prompt, run more samples, ask the model to reason harder, and hope the hallucination politely gets bored.

The paper behind today’s article proposes a less theatrical answer: stop treating the LLM as the controller. Treat it as an unreliable generator placed inside a deterministic execution system.¹

That sounds modest. It is not. It changes where responsibility lives.

Instead of asking a stochastic model to both produce code and decide whether the code is good, the paper formalizes a Dual-State Action Pair, or DSAP: every generative action is paired with a deterministic post-condition guard. The LLM generates. The guard checks. The workflow advances only when the guard says the artifact satisfies the required condition.

This is the boring engineering move that agent demos tend to skip. Unfortunately for demos, boring is where reliability usually lives.

The paper’s central move is to take control out of the model

The useful distinction in the paper is not “small model versus large model.” It is “generator versus controller.”

Current LLM agent systems often blur these roles. A model reasons about a task, generates an artifact, evaluates its own progress, decides what to do next, and explains why everything is fine. That is elegant in the way a single-person accounting department is elegant: fewer handoffs, fewer forms, and absolutely wonderful opportunities for fraud.

The DSAP framework separates the roles.

The system has two kinds of state, plus the guard feedback that links them:

State type	What it tracks	Why it matters
Workflow state	Finite, observable guard statuses	Keeps execution control deterministic and auditable
Environment state	Generated artifacts, context, feedback history	Stores the messy stochastic material without letting it silently drive control
Guard feedback	Rejection reasons and diagnostic messages	Feeds retries without pretending the model has verified itself

The LLM lives on the environment side. It produces artifacts: code, tests, analyses, patches, or other intermediate outputs. The workflow controller lives on the deterministic side. It sees guard outcomes and decides whether to advance, retry, backtrack, or escalate.

The paper’s guard is not merely a filter. A filter blocks something. A guard senses whether an output has satisfied a post-condition and converts an opaque generated artifact into an observable workflow signal.

That signal has three states:

Guard verdict	Operational meaning	Workflow consequence
Satisfied	The artifact passes the required check	Advance to the next step
Unsatisfied but recoverable	The artifact failed in a way that may be fixed with feedback	Retry with refined context
Guard-determined unrecoverable failure	The artifact failed in a way the system should not keep retrying	Escalate

This tri-state design matters because not all failures deserve the same response. A syntax error may deserve a retry. A policy violation may deserve escalation. A repeated test failure may indicate that the current step is not the root cause. A model that keeps inventing nonexistent file paths is not asking for encouragement. It is asking for adult supervision.
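
The tri-state verdict can be sketched as a small enum plus a guard that classifies failures. This is a minimal illustration, not the paper's implementation: the names `Verdict`, `GuardResult`, and `python_syntax_guard` are hypothetical, and the rule that a banned call counts as unrecoverable is an assumption made for the example.

```python
import ast
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SATISFIED = "advance"
    RECOVERABLE = "retry"
    UNRECOVERABLE = "escalate"


@dataclass(frozen=True)
class GuardResult:
    verdict: Verdict
    feedback: str = ""


# Hypothetical policy: generated code must never call these builtins.
BANNED_CALLS = {"eval", "exec"}


def python_syntax_guard(source: str) -> GuardResult:
    """Deterministic post-condition: source parses and breaks no policy."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        # A syntax error is worth a retry, with the parser message as feedback.
        return GuardResult(Verdict.RECOVERABLE, f"line {err.lineno}: {err.msg}")
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            # A policy violation is escalated instead of retried.
            return GuardResult(Verdict.UNRECOVERABLE, f"banned call: {node.func.id}")
    return GuardResult(Verdict.SATISFIED)
```

The controller never reads the artifact itself. It reads `verdict`, and the value of that enum member is the workflow consequence.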

DSAP is not a prompt trick; it is a transaction boundary

The atomic unit of the framework is simple:

specification/context
        ↓
LLM generator
        ↓
artifact
        ↓
deterministic guard
        ↓
advance / retry / escalate

The important word is “atomic.” The paper treats generation and verification as an indivisible execution pair. A workflow step is not complete when the model emits text. It is complete when the emitted artifact passes its guard.

That design prevents a common agent failure: invalid intermediate outputs quietly contaminating later steps. In an unguarded workflow, a wrong analysis can produce a wrong test, which produces a wrong patch, which produces a beautiful explanation of why the wrong patch is correct. The system becomes a confidence laundering machine.

DSAP interrupts that laundering process. Failed artifacts do not advance the workflow state. They remain in the environment state as feedback and history.
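
A rough sketch of that boundary, under assumptions: `EnvironmentState`, `WorkflowState`, and `run_action_pair` are hypothetical names, the generator is a stand-in function instead of a real LLM call, and the guard is reduced to "returns a rejection reason or `None`".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class EnvironmentState:
    """Stochastic side: every artifact and every rejection reason."""
    artifacts: list = field(default_factory=list)
    feedback: list = field(default_factory=list)


@dataclass
class WorkflowState:
    """Deterministic side: only which guards have been satisfied."""
    satisfied: dict = field(default_factory=dict)


def run_action_pair(
    step: str,
    generate: Callable[[str, list], str],
    guard: Callable[[str], Optional[str]],  # None on pass, else a reason
    spec: str,
    workflow: WorkflowState,
    env: EnvironmentState,
    max_attempts: int = 3,
) -> bool:
    """Generate-then-check as one unit; the step completes only on a pass."""
    for _ in range(max_attempts):
        artifact = generate(spec, env.feedback)
        env.artifacts.append(artifact)  # history is kept either way
        reason = guard(artifact)
        if reason is None:
            workflow.satisfied[step] = artifact  # only now does control advance
            return True
        env.feedback.append(reason)  # the failure feeds the next attempt
    return False


def fake_generate(spec: str, feedback: list) -> str:
    # Stand-in for an LLM: it fixes its output once it has seen feedback.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b) return a + b"


def compiles(artifact: str) -> Optional[str]:
    try:
        compile(artifact, "<artifact>", "exec")
        return None
    except SyntaxError as err:
        return f"syntax error: {err.msg}"


workflow, env = WorkflowState(), EnvironmentState()
ok = run_action_pair("patch", fake_generate, compiles, "add two numbers", workflow, env)
```

After the run, `env` holds both attempts and the one rejection reason, while `workflow` holds only the artifact that passed.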

This is also where the paper’s formalism becomes practically useful. The authors assume a generator with nonzero task capability: roughly, if the model has some probability of producing a valid artifact under a valid specification, then repeated guarded attempts can drive failure probability down. That does not mean retries magically create competence. It means guards can exploit competence when it exists.

This is a crucial boundary. A guard can amplify a weak but real capability. It cannot summon one from the void. The paper’s experiments make that distinction very visible.
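
The intuition can be made concrete under a simplifying assumption that is mine, not the paper's: treat each guarded attempt as an independent trial that passes with probability p. Then all k attempts are rejected with probability (1 - p)^k. Real retries carry feedback and are not independent, so this is only a rough picture.

```python
def failure_probability(p: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent tries is rejected."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return (1.0 - p) ** attempts


# Compare no capability, weak capability, and coin-flip capability.
for p in (0.0, 0.1, 0.5):
    print(p, [round(failure_probability(p, k), 3) for k in (1, 3, 5)])
```

With p = 0.5, five attempts leave about a 3% chance of total failure. With p = 0, the failure probability stays at 1 no matter how many attempts are bought.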

The recovery hierarchy is a retry policy with a spine

Naïve retries are cheap to describe and expensive to run. If an LLM fails, try again. If it fails again, try again harder. If the workflow has multiple steps, retry everything. Congratulations: you have invented compute waste with a user interface.

The paper’s recovery architecture is more disciplined. It defines three levels:

Recovery level	What happens	What problem it addresses	Business translation
Level 1: context refinement	Retry the same step with guard feedback	Local, fixable generation failure	Cheap automatic correction
Level 2: informed backtracking	Detect stagnation, invalidate dependent steps, inject downstream failure context upstream	Root cause may sit in an earlier step	Structured recovery instead of blind retry
Level 3: human escalation	Stop automation when budgets are exhausted or the guard marks failure unrecoverable	Automation should not keep pretending	Controlled handoff with diagnostic history

The key mechanism is stagnation detection. If recent failures are repetitive or highly similar, the system treats the local step as stuck. It can then backtrack to an upstream step, invalidate downstream artifacts, and inject a failure summary into the upstream context.
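
One plausible way to detect "repetitive or highly similar" failures is to compare recent guard messages pairwise. The sketch below uses the standard library's `difflib.SequenceMatcher`; the function name, the window of three, and the 0.9 threshold are all assumptions, since the article does not state how the paper measures similarity.

```python
from difflib import SequenceMatcher


def is_stagnant(failures: list[str], window: int = 3, threshold: float = 0.9) -> bool:
    """True when the last `window` failure messages are near-duplicates."""
    recent = failures[-window:]
    if len(recent) < window:
        return False  # not enough history to call the step stuck
    return all(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in zip(recent, recent[1:])
    )
```

A detector like this fires only on repetition. A step that fails differently each time looks healthy to it.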

For example, suppose patch generation keeps failing because the earlier analysis selected the wrong file. Retrying patch generation alone may only produce different wrong patches. Level 2 recovery says: go back, revise the analysis, then regenerate dependent outputs.
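
In code, that move might look like the sketch below. It assumes a linear three-step workflow in which each step depends on everything before it; `STEP_ORDER` and `backtrack` are hypothetical names, and the routing target is passed in by the caller because the workflow designer, not the model, chooses it.

```python
# Hypothetical linear workflow; later steps depend on all earlier ones.
STEP_ORDER = ["analysis", "test_generation", "patch_generation"]


def backtrack(
    outputs: dict[str, str],
    context: dict[str, list[str]],
    stuck_step: str,
    target_step: str,
    failure_summary: str,
) -> list[str]:
    """Invalidate `target_step` and everything after it, then tell the
    upstream step why the downstream step was stuck."""
    start = STEP_ORDER.index(target_step)
    if start > STEP_ORDER.index(stuck_step):
        raise ValueError("can only backtrack to an upstream step")
    invalidated = STEP_ORDER[start:]
    for step in invalidated:
        outputs.pop(step, None)  # accepted artifacts are no longer trusted
    context.setdefault(target_step, []).append(
        f"{stuck_step} kept failing: {failure_summary}"
    )
    return invalidated  # steps that must be regenerated, in order
```

Calling `backtrack(outputs, context, "patch_generation", "analysis", "search string not found in utils.py")` clears the accepted analysis and tests and leaves one failure summary waiting in the analysis step's context.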

This is sensible. It is also bounded. The paper explicitly avoids infinite agentic wandering by assigning retry and escalation budgets. When those budgets are exhausted, the workflow stops. This is not defeatism. This is the difference between an automation system and a chatbot with a loop.

The diagnostic probes show guard value, not general intelligence

The paper’s first experimental block is the cleanest evidence for DSAP. It uses three diagnostic coding probes designed to isolate different failure modes:

Probe	What it tests	Likely purpose in the paper
LRU Cache	Standard implementation with high prior familiarity	Main evidence for drift prevention
Template Engine	Conceptually familiar but structurally novel implementation	Main evidence for structural feedback
Password Validator	Rule-heavy task with prime-number calculation	Main evidence for calculation-gap correction

The setup evaluates 13 models from 1.3B to 15B parameters, with 50 independent trials per model-task pair. The comparison is between a baseline single attempt and a guarded configuration with retries.

The strongest reported improvements are large:

Task	Model	Baseline success	Guarded success	Gain	Average retries
Password Validator	StarCoder2 15B	0%	66%	+66 pp	0.84
Password Validator	DeepSeek-Coder 6.7B	50%	96%	+46 pp	0.72
Password Validator	Granite-Code 3B	36%	80%	+44 pp	1.46
LRU Cache	DeepSeek-Coder 6.7B	48%	98%	+50 pp	0.76
LRU Cache	Granite-Code 8B	60%	98%	+38 pp	0.52
LRU Cache	Yi-Coder 1.5B	62%	98%	+36 pp	0.76
Template Engine	Yi-Coder 9B	56%	98%	+42 pp	0.92
Template Engine	StarCoder2 15B	60%	100%	+40 pp	0.32
Template Engine	Qwen2.5-Coder 3B	8%	42%	+34 pp	2.72

The headline temptation is obvious: “guards make small models powerful.” That is close enough to be useful and vague enough to be dangerous.

A better interpretation is this: guards help when the model already has partial capability and the failure is observable. In the LRU Cache task, many models know the pattern but drift stochastically. Guards close the gap by rejecting incorrect attempts. In the Password Validator task, guards provide concrete counterexamples for calculation errors. In the Template Engine task, guards supply structural feedback as the model tries to synthesize parser logic.

The paper also reports that the framework converges at roughly 1.2–2.1 times baseline cost for qualified models, compared with a fixed cost of five attempts for naïve Pass@5 sampling. That matters operationally. A guarded system is not just “try five times and pick one.” It is “try when needed, stop when satisfied, and use failure feedback.”
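
The cost gap between "stop when satisfied" and "always sample five" is easy to see in a toy model. The assumption here is mine, not the paper's accounting: each attempt passes independently with probability p, and one attempt costs one unit. Attempt k + 1 runs only if the first k all failed, so the expected number of attempts under a budget of n is the sum of (1 - p)^k for k from 0 to n - 1.

```python
def expected_attempts(p: float, budget: int) -> float:
    """Mean number of generations when stopping at the first pass.

    Attempt k + 1 only runs if the first k attempts all failed, which
    happens with probability (1 - p) ** k under the independence assumption.
    """
    return sum((1.0 - p) ** k for k in range(budget))


PASS_AT_5_COST = 5  # naive sampling always pays for all five generations

for p in (0.5, 0.8):
    print(p, round(expected_attempts(p, budget=5), 2), "vs", PASS_AT_5_COST)
```

In this toy model a generator that passes half the time needs just under two attempts on average, and one that passes 80% of the time needs about 1.25. Fixed five-sample decoding pays for five either way.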

The phrase “qualified models” carries a lot of weight. The paper excludes from gain analysis any model that fails to produce parsable output at a minimal threshold rate. That is not a statistical footnote. It is a deployment rule.

Before a business chooses a model for an agentic workflow, it should not ask only, “How large is this model?” It should ask, “For this specific step, does the model have enough base capability for guards to amplify?”
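
A qualification gate of that kind could be as plain as the sketch below. Everything named here is hypothetical: `parse_rate` and `is_qualified` are illustrative helpers, "parsable" is narrowed to "parses as Python", and the 0.1 threshold is a placeholder because the article does not give the paper's value.

```python
import ast
from typing import Callable


def parse_rate(generate: Callable[[str], str], spec: str, trials: int) -> float:
    """Fraction of sampled outputs that at least parse as Python."""
    parsed = 0
    for _ in range(trials):
        try:
            ast.parse(generate(spec))
            parsed += 1
        except SyntaxError:
            pass
    return parsed / trials


def is_qualified(
    generate: Callable[[str], str],
    spec: str,
    trials: int = 50,
    min_parse_rate: float = 0.1,  # placeholder threshold, not the paper's
) -> bool:
    """Gate a model for one specific step before wiring it into a workflow."""
    return parse_rate(generate, spec, trials) >= min_parse_rate
```

The gate is per step and per task. A model that qualifies for test generation has proved nothing about patch generation.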

That is a much better procurement question. Less glamorous, naturally.

The SWE-Bench experiment is where the paper becomes more honest

The second experimental block is more interesting because it is less flattering.

The paper applies the recovery architecture to 99 SWE-Bench Pro instance-arm pairs using Qwen3-Coder-Next. The workflow variants follow a test-driven structure: analysis, test generation, and patch generation. The paper is careful about what this experiment is meant to show. It is not a benchmark competition, and Docker-based SWE-Bench evaluation was not performed. The purpose is to evaluate recovery mechanism behavior: escalation patterns, context injection, stagnation detection, and the boundary between execution control and planning.

That distinction should not be skipped. In business terms, this is not evidence that the system can autonomously fix real GitHub issues. It is evidence about how the recovery machinery behaves when placed inside a harder software repair setting.

The reported metrics are revealing:

Metric	Result	Interpretation
Instance-arm pairs attempted	99	Recovery behavior sample
Steps resolved at Level 1	56 / 278, or 20.1%	Some failures are locally recoverable
Stagnation events	71	Level 2 triggering was exercised
Total escalation cycles	190	Recovery was active, not theoretical
Mean cascade depth	1.7 steps	Backtracking usually affected more than one step
Context injection events	71	Downstream failure summaries were injected upstream
Upstream output changed after injection	71 / 71, or 100%	Injection mechanically changed generation
Test generation recovery after backtracking	6 / 16, or 37.5%	Escalation helped a tractable intermediate step
Patch generation recovery after backtracking	0 / 55, or 0%	Escalation did not solve the hard step
All attempted steps passed guards	4 / 99, or 4.0%	Limited workflow completion
Viable final patches	0	No autonomous repair success

The most important number is not the 100% context injection effectiveness. That figure means only that the upstream generator changed its output after receiving injected failure summaries. It shows the mechanism is not inert.

But “changed” is not “correct.” This is the cruel little word that ruins many agent demos.

The decisive finding is the asymmetry: backtracking recovered test generation 37.5% of the time, but recovered patch generation 0% of the time. The recovery system could help when the task was tractable enough for revised upstream context to matter. It could not bridge the gap when the step required a successful functional patch.

The paper reports that no instance produced a viable patch. It also notes that all four workflow completions were achieved at Level 1 only, and two of those completions only reached test generation rather than patch generation. So even the completion metric must be read carefully: it reflects guard satisfaction on attempted steps, not successful end-to-end software repair.

This is not a weakness of the paper. It is one of its more useful contributions. The experiment gives us a clean boundary: execution recovery is necessary, but not sufficient, for autonomous software engineering.

The failure modes say the missing piece is planning

The paper identifies several failure modes in the SWE-Bench experiment. They point in the same direction.

Patch generation often failed because the model hallucinated search strings that did not exist in the target file. Sometimes backtracking regenerated the analysis, but the new analysis did not prevent the patch generator from choosing another wrong edit path. In other cases, the escalation axis was wrong: the problem was not localization but fix strategy. The workflow could rerun analysis, but it could not choose a different repair decomposition.
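
The hallucinated-search-string failure is exactly the kind of thing a deterministic guard can observe. The sketch below assumes patches are expressed as search-and-replace edits; the function names are hypothetical, and the extra check for ambiguous matches is my addition, not something the article attributes to the paper.

```python
from typing import Optional


def search_replace_guard(file_text: str, search: str) -> Optional[str]:
    """Return None if the edit can be applied unambiguously, else a reason."""
    if not search:
        return "search string is empty"
    count = file_text.count(search)
    if count == 0:
        return "search string not found in target file"
    if count > 1:
        return f"search string is ambiguous: {count} matches"
    return None


def apply_edit(file_text: str, search: str, replace: str) -> str:
    """Apply the edit only after the guard has accepted it."""
    reason = search_replace_guard(file_text, search)
    if reason is not None:
        raise ValueError(reason)
    return file_text.replace(search, replace, 1)
```

The guard catches the invented string every time. What it cannot do is tell the model which real string it should have picked.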

There were also context-management failures. Accumulated feedback and injected summaries can exceed the model’s effective context window, making later generations worse rather than better. Some failures never triggered stagnation because the model produced varied failures, not repetitive ones. From the stagnation detector’s perspective, the system was not stuck. From the business owner’s perspective, the invoice was still growing.

These are not random blemishes. They expose the same architectural limit.

The workflow topology was human-authored and static. The system could retry within that topology. It could backtrack within that topology. It could inject failure feedback within that topology. But it could not decide that the topology itself was wrong.

That is planning.

For a software repair task, planning means selecting the right decomposition for the specific issue: whether to inspect files first, generate regression tests, patch configuration, update API calls, modify dependency assumptions, change fixtures, or route to a specialized framework strategy. DSAP controls execution after a workflow exists. It does not synthesize the workflow.

This distinction is the paper’s best antidote to agent hype. Guarded execution makes agents behave better. It does not make them understand the task decomposition problem. The model may be less chaotic inside the lane, but someone still has to choose the lane.

What this means for business automation

For business use, the paper’s lesson is not “replace engineers with guarded coding agents.” Please do not do that and then act surprised when the guardrails become incident reports.

The practical lesson is more specific: agentic automation should be designed around observable post-conditions, bounded recovery, and escalation paths.

A useful implementation checklist looks like this:

Design question	DSAP-informed answer	Business consequence
What does each agent step produce?	A concrete artifact, not a vibe	Output becomes testable
How is success verified?	Deterministic post-condition guard	Control moves outside the LLM
What happens after failure?	Retry with guard feedback, within budget	Compute cost becomes bounded
When is local retry insufficient?	Stagnation or repeated failure pattern	Root-cause correction becomes possible
What upstream steps can be revisited?	Explicit escalation routing	Recovery is designed, not improvised
When does automation stop?	Budget exhaustion or unrecoverable guard verdict	Human review receives diagnostics
Which model should run the step?	Task-specific qualification	Model choice becomes empirical
What data is collected?	Artifact plus guard result	Failures become training and monitoring data

This framework is especially relevant for coding assistants, data pipeline agents, report-generation systems, compliance drafting workflows, and internal business automation where outputs can be validated against deterministic or semi-deterministic checks.

Examples include:

code must compile;
generated SQL must parse and pass row-count constraints;
a financial report must reconcile totals;
a legal template must include required clauses;
a customer-support response must avoid prohibited claims;
a data-cleaning agent must preserve schema invariants;
a research assistant must attach traceable evidence for every factual claim.
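
Three of these post-conditions are small enough to sketch directly. The function names and input shapes are assumptions made for illustration: "compile" is narrowed to Python syntax, report amounts are assumed to arrive as decimal strings, and clause checking is reduced to a case-insensitive heading match.

```python
from decimal import Decimal


def compiles(source: str) -> bool:
    """Post-condition for 'code must compile' (Python source, syntax only)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


def totals_reconcile(line_items: list[str], reported_total: str) -> bool:
    """Post-condition for 'a financial report must reconcile totals'.

    Amounts are summed as Decimal so that 0.10 + 0.20 equals 0.30 exactly.
    """
    return sum(Decimal(x) for x in line_items) == Decimal(reported_total)


def missing_clauses(document: str, required: list[str]) -> list[str]:
    """Post-condition for 'a legal template must include required clauses'.

    Returns the missing clause headings; an empty list means the guard passes.
    """
    text = document.lower()
    return [clause for clause in required if clause.lower() not in text]
```

None of these guards judges quality. Each one answers a single yes-or-no question, which is what makes it usable as a control signal.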

The common feature is not that these tasks are easy. It is that the workflow can define post-conditions. Without post-conditions, “agent reliability” becomes a matter of taste. Taste is not a control system.

The ROI is cheaper diagnosis, not just cheaper generation

A common business argument for smaller models is cost. Smaller models are cheaper to run, easier to deploy locally, and more attractive for privacy-sensitive environments. The paper supports part of that argument, but only under a condition: the model must already be qualified for the task.

The deeper ROI is diagnostic.

A guarded workflow tells you where failure occurs. Did the model fail syntax? Did it fail tests? Did it pass tests but violate a business rule? Did retries help? Did backtracking change upstream output? Did the same guard fail repeatedly? Did the failure require human escalation?

That information changes management.

Without guard-level telemetry, an AI automation failure is a blob: “the agent didn’t work.” With DSAP-style execution, failure becomes a taxonomy. The organization can decide whether to improve prompts, swap models, add context retrieval, rewrite guards, split a workflow step, introduce a planner, or remove automation from that task entirely.
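
Turning the blob into a taxonomy takes very little machinery. In the sketch below, `GuardEvent` and `failure_taxonomy` are hypothetical names, and the sample events are invented for illustration, not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardEvent:
    step: str      # e.g. "patch_generation"
    guard: str     # e.g. "syntax", "tests", "search_string"
    passed: bool
    attempt: int   # 1 for the first try, 2 for the first retry, ...


def failure_taxonomy(events: list[GuardEvent]) -> Counter:
    """Count rejections per (step, guard) pair."""
    return Counter((e.step, e.guard) for e in events if not e.passed)


# Invented sample log: one recovered step, one step stuck on the same guard.
events = [
    GuardEvent("test_generation", "syntax", False, 1),
    GuardEvent("test_generation", "syntax", True, 2),
    GuardEvent("patch_generation", "search_string", False, 1),
    GuardEvent("patch_generation", "search_string", False, 2),
    GuardEvent("patch_generation", "search_string", False, 3),
]

taxonomy = failure_taxonomy(events)
worst_pair, worst_count = taxonomy.most_common(1)[0]
```

Here `worst_pair` points at the search-string guard on patch generation, which is a far more actionable report than "the agent didn’t work."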

This matters because the real cost of failed agents is not only compute. It is the time humans spend interpreting messy failure. A system that fails with structured evidence is much cheaper to improve than a system that fails with a confident paragraph.

The boundary: guardrails do not replace planning

The paper’s limitations should be stated once, clearly.

First, the strongest diagnostic results come from controlled probes, not full production software engineering. They show that guards can improve reliability in tasks with observable checks and partially capable models. They do not prove general autonomous coding competence.

Second, the SWE-Bench experiment was designed to evaluate recovery behavior, not to compete on benchmark resolution. The paper explicitly notes that Docker-based SWE-Bench evaluation was not performed. The result is still useful, but the claim should be kept in its lane.

Third, recovery routing is human-configured. The system can decide when a step is stuck, but the workflow designer specifies which upstream step to revisit. That is acceptable for structured business automation. It is insufficient for broad autonomous problem solving.

Fourth, guards require verifiable post-conditions. Many business tasks involve ambiguity: strategy, negotiation, taste, judgment, political sensitivity, or incomplete information. These tasks may still use guardrails, but the guards will be weaker, more human-in-the-loop, or more policy-oriented.

Finally, context injection is not a cure. The SWE-Bench results show that injected feedback changed upstream outputs every time, yet patch generation still had 0% recovery. More context can change behavior without improving correctness. Anyone who has watched an LLM confidently revise a wrong answer into a different wrong answer has already met this phenomenon in the wild.

The better architecture is less magical and more accountable

The older story of coding agents was simple: give the model a goal, a repository, and some tools; let it reason; wait for a pull request. It is an appealing story because it compresses a messy engineering organization into a single animated assistant. It is also a wonderful way to rediscover why engineering organizations have review gates, tests, issue triage, ownership boundaries, rollback plans, and senior engineers.

This paper argues for a less magical architecture. The LLM generates artifacts. Deterministic guards verify post-conditions. Failure feedback refines context. Repetition triggers backtracking. Budgets prevent runaway retries. Humans receive escalations when the system reaches its designed boundary.

That is not as cinematic as an autonomous software engineer. It is much closer to a usable automation system.

The title of this article says “guardrails over gigabytes,” but the point is not that model scale no longer matters. Scale matters. Capability matters. Context matters. The paper’s sharper point is that capability without control is operationally fragile. A larger stochastic generator is still stochastic. It may produce better candidates, but it should not be allowed to certify them.

For businesses, the immediate takeaway is straightforward: build coding and workflow agents as controlled execution systems, not as clever monologues with tool access.

Make every step produce an artifact. Make every artifact face a guard. Make every retry spend a budget. Make every escalation carry evidence. Then, and only then, decide whether the model is good enough.

The future of useful agents may not begin with a bigger brain. It may begin with a smaller permission slip.

Cognaptus: Automate the Present, Incubate the Future.

¹ Matthew Thompson, “The Dual-State Architecture for Reliable LLM Agents,” arXiv:2512.20660v2, 2026.
