Opening — Why this matters now
AI coding agents are everywhere—and still, maddeningly unreliable. They pass unit tests they shouldn’t. They hallucinate imports. They invent APIs with confidence that would be admirable if it weren’t so destructive. The industry response has been predictable: bigger models, longer prompts, more retries.
This paper proposes something less glamorous and far more effective: stop asking stochastic models to behave like deterministic software engineers.
Instead, treat them like what they are—unpredictable generators—and wrap them in the same kind of control frameworks software engineering has used for decades to manage unreliable components. The result is not a new model, but a new architecture. And the empirical gains are not subtle.
Background — Context and prior art
Software engineering has long accepted that some components are inherently unreliable. Configuration management systems like CFEngine, CI/CD pipelines, and Test-Driven Development do not assume correctness; they assume eventual convergence through validation.
Modern LLM-based agents, however, blur a critical boundary. Techniques such as Chain-of-Thought or ReAct embed decision-making inside the same probabilistic process that generates text. When reasoning and generation share the same stochastic substrate, failure modes compound rather than correct.
The theoretical lens here is refreshingly classical, drawing from:
- Promise Theory: unreliable actors make promises; consumers verify.
- Agent–Environment separation (Sutton & Barto): only what you can modify belongs to the agent.
- Bounded rationality: satisficing beats optimizing when costs matter.
Under this view, the LLM is not the agent. It is part of the environment.
Analysis — What the paper actually does
1. Move the control boundary
The core move is architectural: relocate decision-making outside the LLM. The agent is a deterministic controller. The LLM is a stochastic oracle.
This single distinction unlocks everything else.
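To make the boundary concrete, here is a minimal sketch (the type names are mine, not the paper's): the LLM is an opaque callable that lives in the environment, while everything the agent owns is ordinary deterministic code that only ever branches on verifier booleans.

```python
from typing import Callable

# Illustrative type aliases only; the names are hypothetical.
Oracle = Callable[[str], str]   # stochastic oracle: prompt -> generated artifact (environment side)
Guard = Callable[[str], bool]   # deterministic guard: artifact -> pass/fail (agent side)

# The controller is plain deterministic code. It may invoke the Oracle,
# but it never branches on raw model text, only on Guard results.
```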
2. Dual-State Architecture
The system state is split cleanly:
| State | Role | Properties |
|---|---|---|
| Workflow State | Control | Finite, deterministic, auditable |
| Environment State | Generation | Stochastic, append-only, opaque |
The workflow state tracks truth assignments to validation guards—nothing more. The environment state stores artifacts, history, and error traces without polluting control logic.
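A minimal sketch of the split (field and class names are mine): the workflow state holds nothing but guard truth assignments, and the environment state is an append-only log the planner never reads directly.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    """Control side: finite, deterministic, auditable."""
    guards: dict[str, bool] = field(default_factory=dict)  # e.g. {"syntax_ok": True, "tests_pass": False}

    def satisfied(self, *names: str) -> bool:
        """The planner's only question: are these guards true?"""
        return all(self.guards.get(n, False) for n in names)

@dataclass
class EnvironmentState:
    """Generation side: stochastic, append-only, opaque to the planner."""
    artifacts: list[str] = field(default_factory=list)   # every candidate the LLM produced
    errors: list[str] = field(default_factory=list)      # guard error traces, kept out of control logic

    def record(self, artifact: str, error: str | None = None) -> None:
        self.artifacts.append(artifact)
        if error is not None:
            self.errors.append(error)
```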
3. Atomic Action Pairs
Every meaningful step is an indivisible transaction:
- Generate an artifact with the LLM
- Verify it immediately with a deterministic guard
If verification fails, the workflow state does not advance. Only the context is refined.
This eliminates an entire class of failure where invalid outputs silently corrupt downstream steps.
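A hedged sketch of one such transaction (generate and guard are placeholders for the model call and a deterministic check): the control state only flips when verification passes; a failure lands in the history and refines the next prompt, nothing more.

```python
from typing import Callable

def atomic_step(
    name: str,                                 # which guard this step tries to satisfy
    prompt: str,
    generate: Callable[[str], str],            # stochastic LLM call (placeholder)
    guard: Callable[[str], tuple[bool, str]],  # deterministic check -> (passed, error trace)
    workflow: dict[str, bool],                 # control state: guard truth assignments only
    history: list[dict],                       # environment state: append-only record
    max_attempts: int = 3,
) -> str | None:
    """Generate and verify as one indivisible transaction."""
    context = prompt
    for attempt in range(1, max_attempts + 1):
        artifact = generate(context)
        passed, trace = guard(artifact)
        history.append({"step": name, "attempt": attempt, "artifact": artifact, "trace": trace})
        if passed:
            workflow[name] = True              # only now does the control state advance
            return artifact
        context = f"{prompt}\n\nPrevious attempt failed verification:\n{trace}"
    workflow[name] = False                     # failure is explicit and bounded, never silent
    return None
```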
4. Guards as sensors, not filters
Guards do more than block bad outputs. They sense reality and project probabilistic generation into a binary, observable control state.
Syntax checks, unit tests, architectural rules, even human review—all are modeled uniformly as guard functions.
The planner never sees probabilities. It sees pass or fail.
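As an illustration of that uniformity (function names are mine, and the test guard assumes pytest is installed), three very different checks collapse into the same artifact-in, boolean-out shape:

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path

def syntax_guard(code: str) -> bool:
    """Static check: does the artifact even parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def unit_test_guard(code: str, test_code: str) -> bool:
    """Dynamic check: run the artifact against fixed tests in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(test_code)  # tests import from solution.py
        result = subprocess.run([sys.executable, "-m", "pytest", "-q", tmp], capture_output=True)
        return result.returncode == 0

def human_review_guard(code: str) -> bool:
    """Even manual review fits the same shape: the planner only ever sees True or False."""
    print(code)
    return input("Approve this artifact? [y/N] ").strip().lower() == "y"
```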
Findings — Results with visualization
Across three diagnostic coding tasks (LRU Cache, Template Engine, Password Validator), the framework was tested on 13 models ranging from 1.3B to 15B parameters.
Reliability gains
| Task | Max Reliability Gain |
|---|---|
| Password Validator | +66 percentage points |
| Template Engine | +42 percentage points |
| LRU Cache | +50 percentage points |
Crucially, these gains were achieved at only 1.2–2.1× the baseline compute cost, dramatically cheaper than naïve best-of-N sampling.
The qualification insight
Not all models benefit. The framework amplifies capability—it does not create it.
Models with effectively zero probability of following instructions (ϵ ≈ 0) remain unusable. But once a minimal capability threshold is crossed, architectural constraints dominate parameter count.
A 6.7B model with guards can outperform a 15B model without them.
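A back-of-the-envelope illustration of why (the numbers below are mine, not the paper's): with a per-attempt success probability ϵ and guard-gated retries, the chance of passing within k attempts is 1 − (1 − ϵ)^k, so a model at ϵ ≈ 0 gains nothing while even a modest ϵ compounds quickly at low expected cost.

```python
def success_within(eps: float, k: int) -> float:
    """Probability that at least one of k verified attempts passes the guard."""
    return 1 - (1 - eps) ** k

def expected_attempts(eps: float, k: int) -> float:
    """Expected attempts when we stop at the first pass (capped at k attempts)."""
    return sum((1 - eps) ** (i - 1) for i in range(1, k + 1))

for eps in (0.0, 0.05, 0.3, 0.6):
    print(f"eps={eps}: p(success, k=3)={success_within(eps, 3):.2f}, "
          f"expected attempts={expected_attempts(eps, 3):.2f}")
# eps=0.0 stays at 0.00: guards amplify capability, they cannot create it.
# eps=0.3 already reaches 0.66 within three attempts at about 2.2 expected attempts,
# the same ballpark as the reported 1.2–2.1x compute overhead.
```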
Multi-step workflows (TDD)
In a test-driven development pipeline—tests first, implementation second—the same pattern holds. Reliability scales with model size, but failure modes are now explainable.
When things break, they break because specifications are wrong, not because the agent “got confused.” That distinction matters.
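A sketch of how the two stages chain (placeholder names; in practice each stage would wrap the retry loop shown earlier): the implementation step is only reachable once the test artifact has itself passed a guard, so a later failure points at a concrete, already-verified specification rather than a vague agent state.

```python
from typing import Callable

def tdd_pipeline(
    spec: str,
    generate: Callable[[str], str],          # stochastic LLM call (placeholder)
    syntax_ok: Callable[[str], bool],        # deterministic guard on the test artifact
    tests_pass: Callable[[str, str], bool],  # deterministic guard: (impl, tests) -> bool
) -> dict[str, bool]:
    workflow = {"tests_ready": False, "impl_ready": False}

    # Stage 1: tests first. The specification must pass its own guard before anything else runs.
    tests = generate(f"Write pytest tests for: {spec}")
    if not syntax_ok(tests):
        return workflow                      # stop: the specification artifact itself is invalid
    workflow["tests_ready"] = True

    # Stage 2: implementation second, verified against the now-frozen tests.
    impl = generate(f"Implement code that passes these tests:\n{tests}")
    workflow["impl_ready"] = tests_pass(impl, tests)

    # If impl_ready is False, the failure is attributable: either this generation attempt
    # failed, or the (already syntax-checked) tests encode a wrong specification.
    return workflow
```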
Implications — What this means in practice
Smaller models, local control
This framework makes sub-15B models viable for serious software engineering tasks. That has direct implications for:
- On-prem deployments
- IP-sensitive codebases
- Regulated environments
Safety becomes systemic
Instead of trying to train safety into model weights, safety emerges from workflow structure. The LLM remains creative—and dangerous—inside a sandbox of deterministic constraints.
Credit assignment finally makes sense
Immediate verification collapses the reward horizon. Every failure is attributed to the last generation attempt. This turns retries from waste into labeled training data.
In other words: the architecture quietly solves a reinforcement learning problem most agent frameworks ignore.
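A minimal sketch of what that buys in practice (format and field names are mine): because verification is immediate, every attempt leaves the loop already labeled by its guard verdict.

```python
import json

def log_attempt(path: str, prompt: str, artifact: str, passed: bool, trace: str) -> None:
    """Append one (prompt, artifact, label) record; the guard verdict is the label."""
    record = {"prompt": prompt, "artifact": artifact, "passed": passed, "trace": trace}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Failures become negatives with an error trace attached, successes become positives, and each label belongs unambiguously to the single generation that produced it.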
Conclusion — A quieter kind of progress
This paper does not introduce a new loss function or a clever prompt trick. It does something rarer: it formalizes what good engineers already know.
Unreliable components should not be trusted. They should be constrained.
By treating LLMs as stochastic environments rather than decision-makers, and by enforcing atomic generate–verify loops with explicit guards, we get systems that are both imaginative and dependable.
No hype. No mysticism. Just architecture doing its job.
Cognaptus: Automate the Present, Incubate the Future.