The agent keeps looking in the wrong place
An incident happens. A service slows down. A pod restarts. A dashboard turns the tasteful shade of operational panic.
The enterprise AI agent is asked to help. It reads logs, calls tools, inspects metrics, follows traces, and produces a plausible chain of reasoning. Sometimes it finds the root cause. Sometimes it wanders through the topology graph like a consultant discovering Kubernetes for the first time.
The usual answer is to improve the prompt. Add a better instruction. Add more retrieved context. Add another tool. Add a stern reminder not to hallucinate. The system improves a little, then stalls again.
The paper behind this article makes a more interesting move. It treats the agent’s behavior not merely as text generation, but as a sequential decision process that can be abstracted, scored, and guided without changing the underlying LLM weights.[1] In other words, the authors do not ask: “How do we write a better prompt?” They ask: “Can we learn a policy over the agent’s reasoning path, then use that policy to engineer the context at the moment the agent is about to make a decision?”
That is the useful reframing. Context engineering, in this paper, is not prompt decoration. It is policy-guided intervention.
The real bottleneck is not intelligence; it is feedback
Enterprise AI agents live in a less convenient world than benchmark agents.
A coding benchmark may provide unit tests. A math benchmark may provide exact answers. A game environment may allow cheap self-play. Enterprise operations rarely offer such luxuries. The authors highlight four conditions that make ordinary fine-tuning or online reinforcement learning difficult: limited and proprietary data, complex real-world reasoning, restricted self-play, and scarce verifiable feedback.
This is not just a data quantity problem. It is a feedback geometry problem.
An SRE diagnosis agent can take many turns before reaching a conclusion. The final answer may be judged correct or incorrect, but the intermediate steps are harder to evaluate. Did the agent inspect the right entity too late? Did it waste turns on a harmless service? Did it miss a cascading failure because the topology path looked unimportant? These are process-level questions. They are exactly the questions that matter operationally, and exactly the questions that are expensive to label by hand.
Fine-tuning asks for many examples. Process reward modeling asks for dense supervision. Online RL asks for safe exploration. Enterprise teams often have none of these in sufficient quality. What they do have, however, is historical trajectory data: logs of what agents did, what they inspected, and whether the final diagnosis was good enough.
The paper’s framework, DT-MDP-CE, is built around squeezing more structure out of those traces.
DT-MDP compresses free-form reasoning into a finite decision model
The first move is abstraction.
A language agent produces free-form thoughts and actions. In formal terms, that is an unpleasantly large state-action space. The authors frame an LLM agent as operating in a partially observable decision process: the agent sees observations, emits thoughts or action-like text, and updates its behavior over multiple turns. Directly optimizing that space is not realistic when data are limited.
So the paper builds what it calls a Digital-Twin Markov Decision Process, or DT-MDP. This is not a full simulator of the enterprise environment. That distinction matters. A full simulator would try to model everything: infrastructure, incidents, telemetry, hidden states, and causal propagation. The DT-MDP is narrower. It is a finite abstraction of the agent’s reasoning behavior.
The raw trajectory is converted into an abstract trajectory:
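In illustrative notation (assumed here, not taken from the paper), let $o_t$ be the observation and $u_t$ the agent's free-form thought or action text at turn $t$, and let $\phi$ and $\psi$ be abstraction maps into a finite state set $\mathcal{S}$ and a finite action set $\mathcal{A}$. The conversion then looks roughly like this:

$$
\tau = (o_1, u_1, \dots, o_T, u_T)
\;\longmapsto\;
\tilde{\tau} = (s_1, a_1, \dots, s_T, a_T),
\qquad
s_t = \phi(o_{1:t}, u_{1:t-1}) \in \mathcal{S},
\quad
a_t = \psi(u_t) \in \mathcal{A}.
$$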
Here, observations and thoughts are mapped into finite states and actions. The abstraction is intentionally lossy. It gives up some semantic richness so that offline reinforcement learning becomes tractable.
For the SRE case study, the authors test three representation families:
| DT-MDP representation | What it encodes | Operational meaning | Boundary |
|---|---|---|---|
| Name-based | Entity names as state/action elements | Learns which named components are worth exploring | Does not transfer well when entity sets change |
| Name-type-based | Entity-name and entity-type pairs | Adds structure such as pod, service, or other entity type | Higher dimensionality; still tied to explicit entities |
| Topology-based | Distances and structural relationships in the dependency graph | Learns search behavior over incident propagation structure | Depends on having a meaningful topology graph |
This table is the core mechanism. The agent’s reasoning is no longer treated as a stream of text to be admired, blamed, or reprompted. It becomes a sequence of decisions over abstract entities.
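To make the three families concrete, here is a minimal sketch of how one candidate entity might be encoded under each. Everything in it is an assumption for illustration: the toy graph, the entity names and types, and the choice of "hop distance from the alerting entity plus node degree" as the topology features. The paper's actual encodings may differ.

```python
from collections import deque

# Hypothetical dependency graph (undirected for simplicity) and entity types.
TOPOLOGY = {
    "frontend": ["checkout", "cart"],
    "checkout": ["frontend", "payments"],
    "cart": ["frontend", "redis"],
    "payments": ["checkout"],
    "redis": ["cart"],
}
ENTITY_TYPE = {
    "frontend": "service", "checkout": "service", "cart": "service",
    "payments": "pod", "redis": "pod",
}


def hop_distance(graph, source, target):
    """Breadth-first search hop count between two entities; None if unreachable."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return None


def name_based(entity):
    """Abstract action is the literal entity name."""
    return entity


def name_type_based(entity):
    """Abstract action is the (name, type) pair."""
    return (entity, ENTITY_TYPE[entity])


def topology_based(entity, alert_entity):
    """Abstract action keeps only structure: hops from the alert and node degree."""
    return (hop_distance(TOPOLOGY, alert_entity, entity), len(TOPOLOGY[entity]))


for candidate in ("payments", "redis"):
    print(name_based(candidate), name_type_based(candidate),
          topology_based(candidate, alert_entity="frontend"))
```

In this toy graph, `payments` and `redis` have different names but the same topology encoding. That is the intuition behind why a structural abstraction can carry over when the entity set changes, while a name-based one cannot.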
The mildly uncomfortable part is that representation design does a lot of work. The authors are honest about this: constructing the DT-MDP requires approximation, heuristics, and domain knowledge. There is no magic “enterprise digital twin” button, despite what certain pitch decks would no doubt prefer.
But the bargain is attractive. A small loss in fidelity buys a large gain in learnability.
Contrastive IRL turns imperfect traces into a reward signal
The second move is reward learning.
A common trap in enterprise RL is trying to define the correct reward manually. That sounds disciplined until the team discovers that the real objective is a swamp: diagnose the root cause, avoid irrelevant branches, preserve useful evidence, minimize time and tokens, and do not accidentally optimize for confident nonsense. Elegant reward functions go to this swamp to retire.
The paper avoids direct reward design by using contrastive inverse reinforcement learning, specifically a T-REX-style approach. Instead of assuming that all demonstrations are expert demonstrations, it learns from ranked trajectories. A better trajectory should receive a higher cumulative learned reward than a worse one.
The loss has the familiar preference-learning shape:
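In the standard T-REX formulation, which the paper's approach is described as following in style, the loss can be written as below. The symbols are assumptions for illustration: $\tilde{\tau}$ is an abstract trajectory, $\hat{r}_\theta$ is the learned reward over abstract state-action pairs, and $\tilde{\tau}_i \prec \tilde{\tau}_j$ means trajectory $j$ is ranked above trajectory $i$.

$$
\mathcal{L}(\theta) =
- \sum_{\tilde{\tau}_i \prec \tilde{\tau}_j}
\log
\frac{\exp \sum_{(s,a) \in \tilde{\tau}_j} \hat{r}_\theta(s,a)}
     {\exp \sum_{(s,a) \in \tilde{\tau}_i} \hat{r}_\theta(s,a)
      + \exp \sum_{(s,a) \in \tilde{\tau}_j} \hat{r}_\theta(s,a)}
$$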
The practical idea is simple: the system does not need a perfect numeric score for every step. It needs enough ranking signal to infer which state-action patterns tend to appear in better diagnostic trajectories.
In the SRE experiments, trajectory rankings are derived from correctness signals such as root-cause and fault-propagation quality, judged against ground truth. This learned reward is then used to train policies with offline RL methods such as Conservative Q-Learning. Candidate policies are filtered through off-policy evaluation before online testing.
That sequence matters:
- Abstract the trajectory into a finite decision model.
- Learn a reward from mixed-quality trajectory rankings.
- Train candidate policies offline.
- Use off-policy evaluation to choose policies worth testing.
- Apply the selected policy through context engineering during live agent execution.
The paper is not claiming that noisy LLM-as-judge scores become truth by changing acronyms. The actual claim is more modest and more useful: pairwise ranking can be enough to learn process guidance when dense labels are unavailable.
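The pairwise-ranking idea can be sketched in a few lines. This is a toy, tabular version of a T-REX-style (Bradley-Terry) objective, not the paper's implementation: the reward is a plain dictionary over abstract (state, action) pairs, training is simple gradient descent, and the two trajectories are invented for illustration.

```python
import math
from collections import defaultdict


def trajectory_return(reward, trajectory):
    """Sum of learned per-step rewards along one abstract trajectory."""
    return sum(reward[step] for step in trajectory)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(reward, worse, better):
    """-log P(better ranked above worse) under a Bradley-Terry model."""
    diff = trajectory_return(reward, better) - trajectory_return(reward, worse)
    return softplus(-diff)


def fit_reward(ranked_pairs, epochs=200, learning_rate=0.1):
    """Gradient descent on the pairwise loss with a tabular reward."""
    reward = defaultdict(float)
    for _ in range(epochs):
        for worse, better in ranked_pairs:
            diff = trajectory_return(reward, better) - trajectory_return(reward, worse)
            p_better = 1.0 / (1.0 + math.exp(-diff))
            step_size = learning_rate * (1.0 - p_better)
            for step in better:   # push up steps seen in the better trajectory
                reward[step] += step_size
            for step in worse:    # push down steps seen in the worse trajectory
                reward[step] -= step_size
    return reward


# Hypothetical abstract trajectories: lists of (state, action) pairs.
better = [("alert_frontend", "inspect_checkout"), ("checkout_errors", "inspect_payments")]
worse = [("alert_frontend", "inspect_redis"), ("redis_healthy", "inspect_cart")]

reward = fit_reward([(worse, better)])
print(pair_loss(reward, worse, better))
```

Note that no step was ever labeled good or bad on its own. The per-step rewards fall out of which trajectory was ranked higher, which is the property that makes this attractive when dense labels are scarce.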
Context engineering becomes a control layer, not a prettier prompt
Now comes the part that should interest enterprise teams most.
The learned RL policy does not fine-tune the LLM. It does not replace the agent. It intervenes in the context around the agent at decision time.
For the EoG agent, which investigates suspicious entities in a topology-guided SRE workflow, the paper tests three context engineering strategies:
| Strategy | Intervention | Plain-English version | Business interpretation |
|---|---|---|---|
| I. Suggesting via prompts | Add high-probability entities as suggestions | “These entities are often relevant; lean toward them if evidence supports it.” | Soft guidance with low disruption |
| II. Pruning explorations | Remove low-probability entities from the queue | “Do not waste turns on these candidates.” | Token and time control |
| III. Prioritizing explorations | Reorder candidates by policy probability | “Look here first.” | Faster path to useful evidence |
This is a different mental model from ordinary prompt engineering. The prompt is no longer just a static instruction. It becomes an actuator. The policy decides what should be suggested, removed, or prioritized; the prompt and queue logic are merely the interface through which that decision reaches the agent.
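As a rough sketch of what "the prompt becomes an actuator" could look like in code, the three strategies reduce to three small functions over a policy's action probabilities. The function names, the probability values, the fixed `min_prob` threshold, and the wording of the suggestion are all assumptions for illustration. The article notes elsewhere that the paper's experiments vary percentile cutoffs, and a fixed threshold is used here only for brevity.

```python
def suggest(policy_probs, top_k=2):
    """Strategy I: build a soft-guidance snippet naming the top-k entities."""
    top = sorted(policy_probs, key=policy_probs.get, reverse=True)[:top_k]
    return ("Entities often relevant in similar states: " + ", ".join(top)
            + ". Prefer them only if the evidence supports it.")


def prune(queue, policy_probs, min_prob=0.05):
    """Strategy II: drop low-probability candidates, keeping queue order.

    Entities the policy has never scored are kept rather than silently removed.
    """
    return [e for e in queue if policy_probs.get(e, 1.0) >= min_prob]


def prioritize(queue, policy_probs):
    """Strategy III: reorder candidates by descending policy probability."""
    return sorted(queue, key=lambda e: policy_probs.get(e, 0.0), reverse=True)


# Hypothetical policy output for the current abstract state.
policy_probs = {"checkout": 0.60, "payments": 0.30, "cart": 0.08, "redis": 0.02}
queue = ["redis", "cart", "payments", "checkout"]

print(suggest(policy_probs))
print(prune(queue, policy_probs))
print(prioritize(queue, policy_probs))
```

The design choice worth noticing is that none of these functions touches the model. They only change what the agent sees or the order in which it sees it.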
For ReAct, the same three ideas are adapted to a Thought-Action-Observation loop: suggest likely entities before the next thought-action pair, skip actions whose abstract probability is too low, or generate multiple candidates and execute the one the policy ranks highest.
That adaptation is important because it supports the paper’s claim of model-agnosticism at the framework level. The intervention is not hard-coded to one agent architecture. The exact hook changes, but the control principle is the same: infer the abstract state, evaluate candidate abstract actions, and modify context so the agent is nudged toward the learned policy.
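For the ReAct-style hook, the "generate several candidates, execute the one the policy ranks highest" step might look like the sketch below. The abstraction here is deliberately crude, a regular expression over the `Action:` line, and the tool names `get_logs` and `get_traces` are hypothetical. A real system would presumably use the structured extraction prompts the paper describes.

```python
import re


def abstract_action(candidate_text, known_entities):
    """Toy abstraction: first known entity mentioned on the 'Action:' line."""
    match = re.search(r"Action:\s*(.+)", candidate_text)
    if not match:
        return None
    action_line = match.group(1)
    for entity in known_entities:
        if entity in action_line:
            return entity
    return None


def pick_candidate(candidates, policy_probs, known_entities):
    """Return the candidate whose abstract action has the highest probability.

    Candidates that cannot be abstracted score 0.0; ties go to the earliest one.
    """
    scored = [
        (policy_probs.get(abstract_action(text, known_entities), 0.0), -index)
        for index, text in enumerate(candidates)
    ]
    _, negative_index = max(scored)
    return candidates[-negative_index]


# Hypothetical thought-action candidates sampled from the agent.
candidates = [
    "Thought: the cache looks slow.\nAction: get_logs(redis)",
    "Thought: errors start at checkout.\nAction: get_traces(checkout)",
]
policy_probs = {"checkout": 0.60, "redis": 0.02}

print(pick_candidate(candidates, policy_probs, ["checkout", "redis"]))
```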
The main experiment tests whether the mechanism helps diagnosis
The primary evaluation uses ITBench SRE diagnosis scenarios. The training data include 819 trajectories and 12,079 turns collected from 12 SRE diagnosis scenarios. The test set contains six unseen ITBench scenarios, and each diagnosis process is repeated 15 times. The reported metrics are Pass@3 recall and Pass@3 F1, meaning a scenario counts as successful if at least one of three sampled trials reaches the correct diagnosis.
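How the 15 repeats are folded into a Pass@3 number is not spelled out here. One standard option, shown as an assumption rather than as the paper's procedure, is the unbiased pass@k estimator: the probability that a random draw of k runs out of n contains at least one success. The sketch treats success as binary, which simplifies away the recall and F1 scoring.

```python
from math import comb


def pass_at_k(n_runs, n_success, k):
    """Probability that k runs drawn without replacement include a success."""
    if n_runs - n_success < k:
        return 1.0  # too few failures to fill a draw of k runs
    return 1.0 - comb(n_runs - n_success, k) / comb(n_runs, k)


# Hypothetical scenario: 5 of 15 repeated runs reached the correct diagnosis.
print(round(pass_at_k(n_runs=15, n_success=5, k=3), 3))
```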
The main result is straightforward: DT-MDP-CE improves the EoG baseline across the evaluated context engineering strategies and DT-MDP variants. The authors report that Name-type and Topology-based configurations show statistically significant improvements after Bonferroni correction, while Name-based variants show numerical gains but not corrected significance.
That distinction is worth keeping. The paper does not show that every abstraction is equally strong. It shows that richer abstractions, especially those adding entity type or topology structure, are more reliable than names alone.
Table 1 in the paper also makes the cost story less simplistic. Some strategies add tokens or time; pruning can reduce them. The baseline uses 439K input tokens, 4.7K output tokens, and 841 seconds. Strategy II under the Topology variant uses 305K input tokens, 3.2K output tokens, and 778 seconds. That is not merely “more reasoning for better accuracy.” In that case, the policy-guided intervention also cuts waste.
The cost evidence should not be oversold. These are experimental measurements in one SRE setup, not a universal law that pruning will always reduce cost. Still, it points to a practical possibility: the value of policy-guided context engineering may be better search, not longer prompts.
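The size of that saving is easy to check from the figures quoted above. The snippet below only does the arithmetic on the reported baseline and Topology Strategy II numbers; the dictionary keys are naming choices made here.

```python
# Figures as quoted from Table 1: thousands of tokens, and seconds.
baseline = {"input_tokens_k": 439.0, "output_tokens_k": 4.7, "seconds": 841.0}
topology_pruning = {"input_tokens_k": 305.0, "output_tokens_k": 3.2, "seconds": 778.0}


def relative_saving(before, after):
    """Fractional reduction per metric, relative to the 'before' value."""
    return {metric: (before[metric] - after[metric]) / before[metric] for metric in before}


savings = relative_saving(baseline, topology_pruning)
for metric, fraction in savings.items():
    print(f"{metric}: {fraction:.1%} lower")
```

That works out to roughly 30% fewer input tokens and 32% fewer output tokens, but only about 7% less wall-clock time. The token saving is substantial; the latency saving is modest.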
The ablations explain why reward learning matters
The most important ablation compares three ways of deriving behavior:
| Method | Likely purpose of test | What it supports | What it does not prove |
|---|---|---|---|
| RL-IRL | Test policies trained on contrastive IRL-derived intermediate rewards | Intermediate reward learning improves policy induction | That the learned reward is universally valid outside similar workflows |
| RL-Sparse | Test policies trained mainly on final/outcome reward | Sparse final signals are weaker for multi-turn guidance | That sparse rewards are useless in all settings |
| Behavior Cloning | Test imitation of observed behavior | Imitation is not enough when traces are mixed quality | That behavior cloning cannot help with cleaner expert data |
| Baseline | Test the original agent without RL-guided CE | Policy-guided CE adds value over the unmodified agent | That the base agent itself is weak in all domains |
The Critical Difference analysis places RL-IRL in the best-ranking group, while sparse-reward RL, behavior cloning, and baseline tend to cluster lower. That supports the paper’s mechanism rather than merely its outcome.
The point is not “RL beats prompting,” which would be a lazy headline and therefore probably popular. The point is more specific: when the agent’s task is multi-turn and process quality matters, learned intermediate rewards can produce better policy guidance than either final sparse scores or imitation of historical behavior.
Behavior cloning is especially revealing. If historical trajectories are mixed quality, copying them faithfully is a questionable ambition. Enterprise data often contain exactly that problem: human operators, scripts, and agents all leave trails of partial success, local hacks, and procedural drift. A method that learns from relative quality rather than blindly imitating demonstrations is better aligned with the reality of enterprise logs.
The generalization tests are promising, but not a passport to every workflow
The paper then asks whether the framework travels.
First, it applies DT-MDP-CE to a ReAct agent in the same SRE diagnosis domain. RL-based context engineering again improves performance over the original ReAct baseline and a behavior-cloning variant. Strategy III performs best on Pass@3 F1 in this setting, plausibly because selecting the highest-probability candidate combines prioritization with a pruning-like effect.
Second, the authors test transfer from SRE incidents to software engineering scenarios from the same application environment. Policies learned from SRE incidents are applied without retraining to code-related failure mechanisms. The DT-MDP variants outperform the baseline, and performance improves from Name-based to Name-type and Topology-based abstractions, with Topology strongest on average.
This is useful evidence, but it should be read carefully. It is a transfer test within a related operational universe, not proof that the same learned policy can jump from Kubernetes incidents to loan underwriting, procurement approvals, or medical triage. The business inference is that structural abstractions may transfer better than literal entity names when the workflow shares a similar search logic.
Third, the paper tests different model sizes and families across Mistral and Llama. RL-based context engineering improves performance across both families, with the largest gains at medium scale. Smaller models benefit less, likely because they cannot exploit the guidance as well. Larger models gain less because their baseline performance is already stronger.
That result is commercially interesting. It suggests a middle zone where a competent but not frontier-scale model, guided by a learned policy layer, may deliver attractive performance. Not every enterprise workflow needs the largest model if the system architecture reduces wasted exploration. Shocking news for anyone selling token volume as a personality trait.
The robustness tests are mostly about sensitivity, not a second thesis
The later experiments should not be read as separate grand claims. They are mostly robustness and sensitivity checks.
Feature enrichment adds Hubs features and HMM-derived hidden states to the Topology representation. Accuracy appears to improve, though variance also increases. The sensible reading is that richer structural features may help, but representation engineering remains domain-dependent.
The expert-trajectory test varies the number of successful trajectories used for training. Behavior cloning appears more sensitive to the number of expert trajectories, while RL, especially RL-IRL, remains more robust and achieves the best initial-value scores in off-policy evaluation. This supports the paper’s premise that contrastive reward learning is useful when successful trajectories are limited.
The threshold tests vary the percentile cutoffs used for suggestion and pruning. Results remain broadly stable over the tested ranges, with small changes relative to variance. This matters because a method that collapses when a threshold moves slightly is not a method; it is a spreadsheet tantrum. Here, the threshold sensitivity looks manageable.
The implementation appendix also matters more than it first appears. The prompts used to extract abstract states and actions are conservative: they instruct the model to use only allowed names and types, avoid hallucinated entities, and return structured JSON or exact tuples. That is not glamorous, but it is exactly the kind of glue work that determines whether an offline policy layer has usable input.
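A minimal sketch of that glue work, assuming the extraction prompt asks for a JSON object with `name` and `type` keys (a hypothetical schema, not the paper's): anything malformed, or anything naming an entity outside the allowed lists, is rejected rather than passed to the policy.

```python
import json


def parse_abstract_action(raw_output, allowed_names, allowed_types):
    """Return (name, type) if the output is valid and whitelisted, else None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    name, entity_type = data.get("name"), data.get("type")
    if not isinstance(name, str) or not isinstance(entity_type, str):
        return None
    if name not in allowed_names or entity_type not in allowed_types:
        return None  # likely a hallucinated entity or type
    return (name, entity_type)


# Hypothetical whitelists taken from the environment's topology.
allowed_names = {"frontend", "checkout", "payments"}
allowed_types = {"service", "pod"}

print(parse_abstract_action('{"name": "checkout", "type": "service"}',
                            allowed_names, allowed_types))
print(parse_abstract_action('{"name": "billing-v2", "type": "service"}',
                            allowed_names, allowed_types))
```

Returning `None` instead of guessing keeps the failure visible: the caller can skip the intervention for that turn rather than act on an abstract state that does not exist.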
The business value is cheaper control, not magical autonomy
The business relevance of this paper is not that every company should build a digital twin of everything. That would be expensive, slow, and a reliable way to create governance meetings with no natural end.
The more practical reading is narrower:
| Paper result | Directly shown | Cognaptus inference | Uncertainty boundary |
|---|---|---|---|
| DT-MDP abstraction improves SRE diagnosis agents | EoG and ReAct improve under RL-guided CE on ITBench-style tasks | Structured workflows can benefit from finite decision abstractions | Open-ended knowledge work may not compress cleanly |
| Contrastive IRL beats sparse reward and behavior cloning in the tested setup | RL-IRL achieves stronger ranks in CD analysis | Mixed-quality enterprise traces can become useful training material | Requires meaningful trajectory ranking signals |
| Topology and Name-type abstractions are stronger than Name-only | Significant gains appear for Name-type and Topology variants | Structural features transfer better than literal entity labels | Requires valid domain structure, such as topology or process graph |
| Pruning can reduce token/time cost | Topology Strategy II uses fewer tokens and less time than baseline in the reported table | Policy layers may improve ROI by reducing wasted exploration | Cost effects depend on agent design and intervention point |
| Medium models gain most in the model-size test | Largest gains appear at medium scale across tested families | Policy-guided CE may help optimize model-cost tradeoffs | Not a replacement for capability when base model is too weak |
For enterprise teams, the attractive part is that the framework avoids modifying model weights. That reduces deployment friction. It can sit around an existing agent as a learned guidance layer, provided the organization can capture trajectories, abstract states/actions, rank outcomes, and safely intervene in the agent’s context or queue logic.
The likely early use cases are not generic chatbots. They are structured, repeated, multi-step workflows where the environment has a graph, queue, case state, checklist, dependency map, or procedural topology. ITOps diagnosis is a natural example. Customer support escalation, compliance investigation, claims handling, security alert triage, and workflow repair may also fit if their action spaces can be abstracted.
The ROI logic is also different from classic automation. The gain may come from three sources: higher diagnosis accuracy, fewer wasted tool calls, and more predictable agent behavior. The last one matters. In production, controllability is not a philosophical luxury. It is what keeps automation from becoming theater with API bills.
Where the framework still depends on human design
The paper’s strongest idea is also its main dependency: abstraction.
A DT-MDP is only as useful as its state-action representation. In SRE, topology gives the method something real to hold. Entities have relationships. Failures propagate. Exploration order matters. A policy over candidate entities is operationally meaningful.
In another domain, the abstraction may be much harder. If the task is open-ended strategic writing, negotiation, or ambiguous advisory work, reducing reasoning to finite states and actions may erase the very context that determines quality. If trajectory rankings are noisy in the wrong way, contrastive IRL may learn organizational bias rather than better behavior. If interventions are too weak, the policy will not move the agent. If interventions are too strong, the system may suppress useful exploration.
There is also an evaluation issue. The paper relies on LLM-as-a-judge protocols for aspects of trajectory quality and diagnosis evaluation. That is reasonable in this research setting, especially where ground-truth fault propagation is available for comparison. But in business deployment, judge design becomes part of the control system. A bad judge quietly becomes a bad reward function wearing a lab coat.
Finally, the framework is offline, but not free. It requires trajectory logging, abstraction prompts or parsers, representation design, policy training, off-policy evaluation, and integration hooks inside the agent runtime. This is lighter than full fine-tuning, but heavier than prompt tinkering. The right comparison is not “free prompt vs expensive RL.” It is “local prompt patches vs reusable process-control layer.”
From prompt craft to process control
The paper’s contribution is not that prompts are obsolete. Prompts remain one of the places where the intervention is applied. The contribution is that prompts are no longer the whole control strategy.
DT-MDP-CE offers a mechanism chain:
Historical agent traces
↓
Finite DT-MDP abstraction
↓
Contrastive IRL reward learning
↓
Offline RL policy induction and selection
↓
Policy-guided context intervention
↓
Better agent search behavior
That chain is why the paper deserves attention. It gives enterprise AI teams a way to think beyond two tired options: endlessly patch prompts, or fine-tune a model with data they do not really have.
The larger lesson is that agent improvement may increasingly happen outside the model. Around the LLM, we will see more policy layers, memory managers, context routers, tool controllers, evaluators, and workflow-specific digital twins. The model remains the reasoning engine, but the system around it decides what the model sees, what choices are available, and which path is worth trying first.
That is less glamorous than “fully autonomous AI.” It is also more likely to survive contact with an enterprise system at 3:00 a.m.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xi Yang, Aurélie Lozano, Naoki Abe, Bhavya, Saurabh Jha, Noah Zheutlin, Rohan R. Arora, Yu Deng, and Daby M. Sow, “A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP,” arXiv:2603.22083v1, 23 March 2026, https://arxiv.org/abs/2603.22083.