Memory is easy to sell and hard to govern.
Every enterprise AI demo eventually reaches the same theatrical moment: the agent remembers something. A prior customer preference. A workflow exception. A formatting habit. A failed action that should not be repeated. Everyone nods. Someone says “continuous learning.” A roadmap slide appears. The slide is almost certainly too optimistic.
The uncomfortable part is that memory is not automatically intelligence. A folder full of trajectories is not experience. A vector database full of fragments is not judgment. And an agent that remembers everything may simply become better at retrieving irrelevant advice with confidence, which is a very modern form of bureaucracy.
That is the useful entry point into Dynamic Dual-Granularity Skill Bank for Agentic RL, the paper introducing D2Skill [1]. The paper is not merely another “add memory to agents” proposal. Its more interesting claim is narrower and sharper: reusable experience becomes valuable only when it is organized at the right granularity, tested against counterfactual behavior, retrieved with utility awareness, and pruned before the memory bank turns into a landfill with embeddings.
That makes D2Skill worth reading mechanism-first. The benchmark numbers matter, but the numbers are not the main business lesson. The main lesson is how the system decides what should be remembered, when it should be used, and when it deserves to be forgotten.
The real problem is not that agents forget; it is that they remember badly
Agentic reinforcement learning operates in awkward territory. The agent is not solving a neat classification problem. It interacts with an environment over many steps, receives partial observations, chooses text-based or tool-like actions, and often gets sparse feedback only at the end.
In classical terms, the environment may have a state, but the model does not see that full state. It sees a task description, the latest observation, and a limited interaction history. The prompt window is therefore doing the work of memory, state estimation, and attention management at the same time. That is brave. Also slightly unfair.
The obvious response is to add external memory. Store previous trajectories. Retrieve relevant examples. Let the agent learn from its past. This sounds reasonable until one asks what exactly is being stored.
A complete trajectory may contain a useful high-level strategy, several irrelevant actions, one lucky mistake, and a local correction that matters only under a specific observation. Treating the whole trajectory as one reusable “experience” is like telling a new employee: “Here is a 40-page project archive. Somewhere inside is the reason the invoice failed.” Technically helpful. Operationally rude.
D2Skill starts from a more disciplined distinction. It separates experience into two skill types:
| Skill type | What it guides | When it matters | Business analogy |
|---|---|---|---|
| Task skill | The overall task strategy | Before or across a trajectory | A playbook or standard operating procedure |
| Step skill | A local action correction | At a specific interaction step | A troubleshooting note or exception rule |
This distinction is not cosmetic. A task skill can say, in effect, “for this kind of shopping task, compare product constraints before selecting.” A step skill can say, “when the page shows no matching item, reformulate the search rather than selecting the first approximate result.” One helps with direction; the other repairs local failure.
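To make the distinction concrete, here is a minimal sketch of how the two skill types might be represented. The class name, field names, and validation rules are assumptions for illustration; the paper's actual schema is not described in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Skill:
    """One entry in a hypothetical skill bank (all field names are illustrative)."""
    text: str                               # guidance injected into the prompt
    granularity: str                        # "task" or "step"
    task_key: str                           # the kind of task the skill applies to
    observation_key: Optional[str] = None   # only step skills are tied to an observation
    utility: float = 0.0                    # running estimate of how much the skill helped
    uses: int = 0                           # how many times it has been retrieved

    def __post_init__(self) -> None:
        if self.granularity not in ("task", "step"):
            raise ValueError("granularity must be 'task' or 'step'")
        if self.granularity == "step" and self.observation_key is None:
            raise ValueError("a step skill needs an observation_key")


# The two examples from the text, expressed as bank entries.
task_skill = Skill(
    text="Compare product constraints before selecting.",
    granularity="task",
    task_key="shopping",
)
step_skill = Skill(
    text="Reformulate the search rather than selecting the first approximate result.",
    granularity="step",
    task_key="shopping",
    observation_key="page shows no matching item",
)
```

The design point is that a step skill carries an extra key, the observation it responds to, while a task skill is indexed by the task alone.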
Most business workflows need both. A customer-support agent needs the refund policy and the local exception for a damaged item photo. A procurement agent needs the vendor-selection process and the specific correction when a supplier page hides minimum-order quantities. A research agent needs the literature-search plan and the step-level warning that one database returns conference abstracts without full papers.
The misconception D2Skill pushes against is therefore simple: more memory is not necessarily better memory. The relevant unit is not “everything the agent has seen.” It is reusable guidance whose usefulness has been tested.
D2Skill treats memory as a skill bank, not a diary
The D2Skill framework has three connected parts: skill-injected RL training, reflection-driven skill generation, and skill-bank retrieval plus management. The connection among the three matters more than any single component.
First, the policy interacts with the environment under reinforcement learning. Some trajectories are run with retrieved skills injected into the context. Others are run without skills. Because both groups use the same underlying policy, the difference between their outcomes becomes an estimate of whether the skill bank helped.
Second, when performance is poor, an external reflector model analyzes representative trajectories and generates new skills. The paper uses closed-source models such as Gemini-3-Flash or O3 as reflectors in different settings, but the important architectural point is that the reflector is not used simply as the acting agent. It is used as a critic and abstraction engine.
Third, generated skills are retrieved and managed dynamically. Retrieval is not just semantic similarity. D2Skill combines similarity, utility, and an exploration bonus. Pruning is also not random cleanup. Skills with weak utility are eventually evicted, while newly created skills receive a grace period before being judged.
A compact version of the mechanism looks like this:
Failed or weak task performance
↓
Reflection on failed and successful trajectories
↓
Generate task skills + step skills
↓
Retrieve skills during future rollouts
↓
Compare skill-injected rollouts with baseline rollouts
↓
Update skill utility and policy rewards
↓
Prune low-utility skills from the bank
That loop is the paper’s actual contribution. The skill bank is not a static knowledge base bolted onto an agent. It co-evolves with the policy.
The paired rollout design turns “helpful memory” into something measurable
The clever part of D2Skill is not merely that it retrieves skills. Many systems retrieve things. Some retrieve magnificently irrelevant things. The clever part is that D2Skill tries to estimate whether retrieval actually improved behavior.
For each task group, D2Skill divides rollouts into two groups:
- a baseline group, where the policy acts without skill injection;
- a skill group, where the policy acts with retrieved skills injected into the context.
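The performance gap becomes a hindsight signal. In simplified language, the signal is the average outcome of the skill group minus the average outcome of the baseline group on the same task.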
If the skill-injected group performs better, the retrieved task skills receive positive utility updates. Step skills receive credit based on the trajectories in which they appeared. Utility is updated using an exponential moving average, so one lucky rollout does not immediately turn a mediocre skill into company policy. Sensible. Many organizations could learn from that sentence.
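The arithmetic behind that update can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the smoothing rate of 0.1, and the rollout outcomes are all invented here, and the paper's exact update rule may differ.

```python
from statistics import mean


def performance_gap(skill_outcomes, baseline_outcomes):
    """Mean success of skill-injected rollouts minus mean success of baseline rollouts."""
    return mean(skill_outcomes) - mean(baseline_outcomes)


def ema_update(old_utility, signal, rate=0.1):
    """Move the stored utility a fraction `rate` of the way toward the new signal.

    `rate` is an assumed hyperparameter, not a value taken from the paper.
    """
    return (1.0 - rate) * old_utility + rate * signal


# One task group: 1 = success, 0 = failure (made-up outcomes).
skill_group = [1, 1, 0, 1]
baseline_group = [0, 1, 0, 0]

gap = performance_gap(skill_group, baseline_group)   # 0.75 - 0.25 = 0.5
utility = ema_update(old_utility=0.0, signal=gap)    # 0.05 with rate=0.1
```

With a small rate, a single favorable group nudges the utility from 0.0 to 0.05 rather than jumping straight to 0.5, which is the damping behavior the paragraph above describes.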
This paired design matters because agent memory otherwise suffers from attribution confusion. If an agent succeeds after seeing a retrieved memory, did the memory help, or was the task already easy? If it fails, was the skill bad, or did the policy ignore it? D2Skill does not solve attribution perfectly, but it creates a cleaner comparison than simply observing outcomes after retrieval.
This is also where the system begins to resemble operational A/B testing. The question is not “does this memory sound relevant?” The question is closer to: “under comparable conditions, did injecting this guidance improve task completion?”
For business readers, that distinction is the difference between a knowledge base and a governed learning system.
Retrieval is a ranking problem, not a vibes problem
Most retrieval-augmented systems begin with semantic similarity. That is useful but insufficient. Similarity answers the question “does this stored item look related to the current context?” It does not answer “has this item historically helped?”
D2Skill’s retrieval has two stages. The first stage filters candidate skills by embedding similarity between the current query key and stored retrieval keys. Task-level retrieval uses the task identifier as the query context. Step-level retrieval uses both the task identifier and the current observation.
The second stage ranks candidates using a score that combines:
| Ranking signal | Function | Why it matters |
|---|---|---|
| Semantic similarity | Finds contextually relevant skills | Prevents obviously mismatched advice |
| Skill utility | Favors skills that have helped before | Converts memory from storage into evaluated knowledge |
| Exploration bonus | Gives under-tested skills some chance | Avoids freezing the bank too early around initially lucky skills |
This design recognizes a boring but important truth: relevance and usefulness are different properties. A policy document may be relevant to a customer complaint and still useless for resolving it. A prior failure note may look less semantically elegant but prevent exactly the wrong next action.
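A toy version of the two-stage retrieval shows why the second stage can overturn the first. Everything specific here is an assumption: the additive score, the weights, the similarity threshold, and the count-based exploration bonus are illustrative choices, not the paper's formula.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredSkill:
    name: str
    embedding: list   # retrieval-key embedding (toy 2-d vectors here)
    utility: float    # running usefulness estimate
    uses: int         # times retrieved so far


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, bank, top_k=1, min_sim=0.5, w_utility=1.0, w_explore=0.2):
    # Stage 1: keep only skills whose key is similar enough to the query.
    candidates = [(cosine(query, s.embedding), s) for s in bank]
    candidates = [(sim, s) for sim, s in candidates if sim >= min_sim]

    # Stage 2: rank survivors by similarity + utility + exploration bonus.
    total_uses = sum(s.uses for s in bank)

    def score(item):
        sim, s = item
        # Assumed bonus: larger for rarely used skills, shrinking with use.
        bonus = math.sqrt(math.log(1 + total_uses) / (1 + s.uses))
        return sim + w_utility * s.utility + w_explore * bonus

    ranked = sorted(candidates, key=score, reverse=True)
    return [s.name for _, s in ranked[:top_k]]


bank = [
    StoredSkill("looks_relevant", [1.0, 0.0], utility=-0.2, uses=20),
    StoredSkill("has_helped", [0.8, 0.6], utility=0.4, uses=20),
    StoredSkill("off_topic", [0.0, 1.0], utility=0.9, uses=20),
]
query = [1.0, 0.0]
best = retrieve(query, bank)   # ["has_helped"]
```

The most similar skill loses to a slightly less similar one with a better track record, while the high-utility but off-topic skill never reaches the ranking stage.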
The paper’s utility-aware retrieval therefore turns the skill bank into something closer to a portfolio. A skill has an expected value, an uncertainty profile, and an opportunity cost because it occupies limited prompt and retrieval capacity. Suddenly, “memory management” sounds less like a feature and more like capital allocation. Good. That is where the adult supervision begins.
Pruning is not housekeeping; it is part of learning
The paper is unusually clear that an expanding skill bank can become harmful. This is one of its most business-relevant points.
In many enterprise AI discussions, memory growth is treated as a natural sign of maturity. More cases handled. More exceptions stored. More customer-specific context. More institutional knowledge. This sounds attractive until retrieval quality declines and the agent starts surfacing stale, redundant, or low-quality guidance.
D2Skill imposes capacity limits on skill pools and periodically prunes low-scoring skills. The eviction score incorporates utility and usage-related information, while newly created skills are temporarily protected so they have time to be evaluated. This avoids two bad extremes: never forgetting anything, and killing new skills before they have enough evidence.
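A small sketch makes the capacity-plus-grace-period idea tangible. The eviction score, the grace window of 10 steps, and the usage weight are assumptions chosen for illustration; the article only says that the score uses utility and usage-related information.

```python
from dataclasses import dataclass


@dataclass
class BankEntry:
    name: str
    utility: float
    uses: int
    created_at: int   # training step at which the skill was added


def prune(bank, now, capacity, grace_steps=10, usage_weight=0.01):
    """Trim the bank toward `capacity` without evicting skills still in their grace period.

    If protected skills alone exceed `capacity`, this sketch keeps them all.
    """
    def keep_score(entry):
        # Assumed score: utility plus a small credit for being used.
        return entry.utility + usage_weight * entry.uses

    protected = [e for e in bank if now - e.created_at < grace_steps]
    judged = [e for e in bank if now - e.created_at >= grace_steps]
    slots = max(capacity - len(protected), 0)
    survivors = sorted(judged, key=keep_score, reverse=True)[:slots]
    return protected + survivors


bank = [
    BankEntry("proven", utility=0.5, uses=40, created_at=0),
    BankEntry("stale", utility=-0.1, uses=3, created_at=5),
    BankEntry("mediocre", utility=0.05, uses=10, created_at=20),
    BankEntry("brand_new", utility=0.0, uses=0, created_at=95),
]
kept = [e.name for e in prune(bank, now=100, capacity=3)]
# kept: brand_new (protected), then proven and mediocre; stale is evicted
```

Note that the brand-new skill survives with zero evidence, while the old skill with negative utility does not.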
In operational terms, D2Skill’s pruning mechanism says: memory has carrying costs. It competes for attention, retrieval bandwidth, prompt space, and policy influence. If a stored skill cannot earn its place, it should not stay just because it was once generated by a very confident reflector model.
That is not a minor engineering detail. In the ablation results, removing skill management damages performance more than many readers might expect. The system without skill management keeps accumulating skills, but validation success falls from 72.7 to 57.8. The lesson is blunt: unmanaged memory can be worse than smaller memory.
What the experiments are actually testing
The paper evaluates D2Skill on ALFWorld and WebShop. ALFWorld tests embodied household-style tasks through text interactions. WebShop tests web shopping tasks where the agent must search, compare, and select products. These are not live enterprise workflows, but they are reasonable stress tests for long-horizon, partially observed agent behavior.
The experiments are doing several different jobs. Treating every table as “proof that D2Skill is better” would be lazy, and this is not that kind of establishment.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main performance table | Main evidence | D2Skill improves success rates over GRPO and competitive memory or skill baselines in the tested settings | That the same gains will transfer unchanged to enterprise workflows |
| Ablation table | Component attribution | Dual granularity, utility mechanisms, and skill management each matter | That the exact component weights are universal across tasks |
| Skill-bank dynamics figures | Mechanism analysis | Utility-aware management improves the quality of stored and retrieved skills | That utility estimates are perfectly causal |
| Evaluation with different skills | Transferability analysis | Some benefits are internalized into the policy, and learned skills remain useful across evaluation variants | That external skill banks can always be swapped safely |
| Training-cost table | Implementation and efficiency analysis | D2Skill adds modest overhead over GRPO and takes roughly half the training time of SkillRL in the reported setup | That production serving costs are similarly modest |
This separation matters because the paper’s strongest claim is not simply “D2Skill wins.” It is that the win is plausibly tied to a coherent memory-management mechanism.
The main results show large gains, with an important caveat about setup
For Qwen2.5-7B-Instruct, D2Skill reaches 90.6% overall success on ALFWorld with Gemini-3-Flash-generated skills, compared with 75.0% for GRPO and 89.1% for SkillRL. On WebShop, the best D2Skill variants reach 91.1 average score and 84.4% success rate, compared with 86.0 score and 72.6% success for GRPO and 85.2 score and 72.7% success for SkillRL.
| Setting | GRPO | Best D2Skill result | Gain |
|---|---|---|---|
| Qwen2.5, ALFWorld overall success | 75.0 | 90.6 | +15.6 points |
| Qwen2.5, WebShop success | 72.6 | 84.4 | +11.8 points |
| Qwen3-4B base, ALFWorld overall success | 53.9 | 72.7 | +18.8 points |
| Qwen3-4B + SFT, ALFWorld at 120 steps | 92.9 | 95.3 | +2.4 points |
| Qwen3-4B + SFT, WebShop success at 120 steps | 79.9 | 81.3 | +1.4 points |
The pattern is worth interpreting carefully. D2Skill’s gains are largest when the base policy has more room to improve. In the stronger SFT-initialized setting, final gains are smaller, but training efficiency remains meaningful: after 40 steps, D2Skill reaches 92.2% on ALFWorld, nearly matching GRPO’s 92.9% after 120 steps.
That is a useful signal. The skill bank may matter most when the agent is competent enough to use guidance but not already saturated by supervised initialization and RL. Very weak agents may not use skills reliably; very strong agents may need less external scaffolding. The paper does not frame it exactly this way, but for deployment planning, that middle zone is where the architecture becomes most interesting.
There is also a subtle comparison with SkillRL. The authors note that SkillRL constructs skills from validation trajectories, which gives it privileged information. D2Skill builds its bank only from training-time experience and still performs competitively or better in the reported comparisons. That matters because enterprise systems usually cannot assume a clean validation oracle quietly handing over reusable skills from the future. Very inconsiderate of reality, but there we are.
The ablations say memory governance matters as much as memory content
The ablation study is where the paper becomes more than a benchmark report. It tests variants of D2Skill on ALFWorld with Qwen3-4B-Instruct-2507 and reports validation success.
| Variant | Validation success | Interpretation |
|---|---|---|
| Full D2Skill | 72.7 | Reference point for the ablation study |
| Without task skills | 62.7 | High-level guidance matters |
| Without step skills | 60.2 | Local correction matters, slightly more in this setting |
| Without skill management | 57.8 | Keeping all accumulated skills degrades memory quality |
| Without baseline group | 68.8 | Paired comparison improves credit assignment |
| Without utility retrieval | 64.8 | Similarity-only retrieval is weaker |
| Without utility module | 62.5 | Utility estimation supports both retrieval and optimization |
| Without skills / GRPO | 53.9 | Skill-free RL is the lower reference point |
Three readings are especially useful.
First, both task skills and step skills contribute. Removing either hurts. This supports the dual-granularity thesis rather than merely showing that “some memory” helps.
Second, removing skill management drops validation success to 57.8, worse than removing only task skills or only step skills. That is the governance result. Memory quality is not just a function of what gets added; it is also a function of what gets removed.
Third, removing the baseline group or utility mechanisms reduces performance but does not collapse the system entirely. This suggests that skills themselves provide a major part of the benefit, while utility estimation and paired rollouts improve how the system values and deploys those skills. In business terms: the knowledge base helps, but measurement discipline makes it usable at scale.
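The three readings follow directly from the drops relative to the full system. The short script below recomputes them from the ablation table above; only the variable names are invented here.

```python
# Validation success values copied from the ablation table.
full = 72.7
ablations = {
    "without task skills": 62.7,
    "without step skills": 60.2,
    "without skill management": 57.8,
    "without baseline group": 68.8,
    "without utility retrieval": 64.8,
    "without utility module": 62.5,
    "without skills / GRPO": 53.9,
}

# Drop in percentage points when each component is removed.
drops = {name: round(full - score, 1) for name, score in ablations.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name:28s} -{drop:.1f} points")
```

After the skill-free reference point (18.8 points), the largest loss comes from removing skill management (14.9 points), and the smallest from removing the baseline group (3.9 points).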
The reflector models are better critics than frontline workers
One of the more interesting findings is almost hidden in plain sight. Closed-source models used as reflectors are not necessarily strong standalone agents in the environments. In Table 1, Gemini-3-Flash and O3 do not dominate as direct agents. Yet when used to critique trajectories and extract reusable skills, they improve the trained policy.
That distinction has real architectural implications.
A high-end model does not always need to be the actor. It may be more valuable as a reviewer, diagnostician, and skill extractor. The lower-cost policy performs the repeated environment interactions; the stronger model intervenes when reflection is needed. This division of labor is economically attractive because reflection can be triggered selectively, especially when performance falls below a threshold.
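The selective trigger is simple enough to sketch. This is a hypothetical illustration: the threshold value, the trajectory format, and the stub standing in for the stronger model are all assumptions, and no real model API is called.

```python
def maybe_reflect(trajectories, reflector, threshold=0.6):
    """Call the (expensive) reflector only when the group's success rate is below threshold."""
    if not trajectories:
        return []
    success_rate = sum(1 for t in trajectories if t["success"]) / len(trajectories)
    if success_rate >= threshold:
        return []                     # the actor is doing fine; spend nothing
    return reflector(trajectories)    # the reflector returns new skill texts


calls = []


def stub_reflector(trajectories):
    # Stand-in for a call to a stronger model; it only records that it was invoked.
    calls.append(len(trajectories))
    return ["hypothetical skill distilled from the failures"]


easy_group = [{"success": True}, {"success": True}, {"success": True}, {"success": False}]
hard_group = [{"success": False}, {"success": True}, {"success": False}, {"success": False}]

skipped = maybe_reflect(easy_group, stub_reflector)      # 0.75 >= 0.6, no reflector call
new_skills = maybe_reflect(hard_group, stub_reflector)   # 0.25 < 0.6, reflector runs once
```

Only the struggling group costs a reflector call, which is where the economic argument for the actor/critic split comes from.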
For enterprise agent design, this suggests a practical architecture:
| Role | Model or system component | Main job |
|---|---|---|
| Actor | Efficient task policy | Execute repeated workflow steps |
| Critic / reflector | Stronger reasoning model | Diagnose failures and abstract reusable guidance |
| Skill bank | Managed external memory | Store task and step skills with utility scores |
| Retrieval manager | Ranking and pruning layer | Decide what guidance enters the agent context |
This is not the same as “use the biggest model everywhere.” That strategy is easy to explain and expensive to regret. D2Skill points toward a more modular system: spend reasoning budget where diagnosis and abstraction matter, not necessarily at every action step.
The cost result is modest overhead, not free lunch
The training-cost table reports ALFWorld wall-clock training time on 8 H100 GPUs. GRPO takes 20.8 hours. SkillRL takes 49.2 hours. D2Skill takes 25.6 hours.
| Method | Training hours | Relative cost |
|---|---|---|
| GRPO | 20.8 | 1.0× |
| D2Skill | 25.6 | 1.2× |
| SkillRL | 49.2 | 2.4× |
The authors attribute D2Skill’s relatively low overhead to efficient retrieval: batched embedding queries and incremental updates to skill embeddings, so newly added skills are encoded without repeatedly reprocessing the entire bank.
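The incremental-update idea can be illustrated with a small cache that encodes only skills it has not seen before. The class, its method names, and the toy encoder are assumptions for illustration; a real system would plug in an actual embedding model.

```python
class EmbeddingCache:
    """Keeps one embedding per skill text and encodes only the new ones."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch   # function: list of texts -> list of vectors
        self._vectors = {}                # skill text -> embedding
        self.encoded = 0                  # total texts ever sent to the encoder

    def sync(self, skill_texts):
        # Encode, in a single batch, only the texts not already cached.
        new = list(dict.fromkeys(t for t in skill_texts if t not in self._vectors))
        if new:
            for text, vector in zip(new, self._embed_batch(new)):
                self._vectors[text] = vector
            self.encoded += len(new)
        # Drop embeddings of skills that were pruned from the bank.
        keep = set(skill_texts)
        for text in list(self._vectors):
            if text not in keep:
                del self._vectors[text]
        return [self._vectors[t] for t in skill_texts]


def toy_encoder(texts):
    # Placeholder encoder: a one-dimensional "embedding" based on text length.
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(toy_encoder)
cache.sync(["check constraints", "reformulate search"])
cache.sync(["check constraints", "reformulate search", "open drawer first"])
total_encoded = cache.encoded   # 3: the second sync encoded only the new skill
```

Re-encoding the whole bank on every update would have cost five encoder calls' worth of work here instead of three, and the gap widens as the bank grows.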
This is an implementation detail, but it matters. Many memory systems look elegant until the retrieval and update pipeline becomes the real bottleneck. D2Skill’s overhead is not zero, but the reported cost profile supports the claim that the skill bank is not merely buying accuracy with uncontrolled compute.
Still, one should read this as training overhead in a benchmark setup, not as a full production cost model. Enterprise systems would also need monitoring, human review for sensitive workflows, access-control logic, audit trails, and rollback mechanisms for bad skills. The paper does not test those. It should not be blamed for not solving every enterprise compliance problem before breakfast.
What Cognaptus would infer for business systems
The paper directly shows that D2Skill improves performance on ALFWorld and WebShop under the reported model and training setups. It also shows through ablations that dual-granularity skills, utility-aware retrieval, and skill management contribute to the gains.
The business inference is broader but should remain disciplined.
D2Skill suggests that workflow agents should not treat interaction logs as a passive archive. Logs should be processed into reusable operational knowledge at multiple levels:
| Paper mechanism | Business translation | Practical design implication |
|---|---|---|
| Task skills | Workflow playbooks | Extract recurring strategies from successful and failed cases |
| Step skills | Local exception rules | Capture specific corrections tied to observations or states |
| Paired rollout comparison | Controlled usefulness testing | Evaluate whether guidance improves outcomes versus no guidance |
| Utility-aware retrieval | Performance-weighted knowledge access | Rank knowledge by relevance and historical usefulness |
| Skill pruning | Memory governance | Remove stale, redundant, or low-value guidance |
| Reflector model | Post-hoc diagnosis engine | Use stronger models selectively to convert failures into reusable rules |
The ROI pathway is not “memory makes agents smarter.” That is too vague to be useful. The stronger pathway is: structured memory reduces repeated failure, shortens future task completion, improves transfer across similar tasks, and lowers the cost of diagnosing agent mistakes.
In a customer service setting, this might mean the agent learns not only the refund policy but also the step-specific exception that prevents an escalation loop. In procurement, it might learn the sourcing playbook and the page-level correction for hidden supplier constraints. In internal research, it might learn both the search strategy and the local warning that a specific database returns misleading metadata.
The paper’s mechanism maps cleanly to these cases. The empirical evidence does not yet prove them. That distinction is small enough to be ignored in a sales deck and large enough to matter in implementation.
Where the paper’s evidence stops
D2Skill is tested on two representative benchmarks, not on live enterprise workflows. ALFWorld and WebShop are useful because they require long-horizon interaction, partial observation, and action correction. But they are still benchmarks. They do not include messy organizational incentives, regulatory constraints, adversarial users, confidential documents, or humans changing the process halfway through because someone attended a strategy offsite.
The system also relies on an external reflector model for skill generation. That may be acceptable, even desirable, in many architectures. But it creates dependency questions: reflector quality, cost, latency, privacy, and consistency. If the reflector generates a flawed step skill, the utility mechanism may eventually suppress it, but “eventually” is not always good enough in high-risk workflows.
There is also a measurement boundary. The paired rollout design gives a practical estimate of skill utility, but it is not a perfect causal instrument. Skill-injected and baseline groups share the same policy, which helps, but trajectory stochasticity and environment variation still complicate attribution. The system measures usefulness well enough to guide training in the tested environments. It does not certify that a skill is universally valid.
Finally, the strongest business claims would require testing under production-like conditions: persistent user-specific context, changing task distributions, permission boundaries, audit requirements, and human correction loops. D2Skill gives a promising design pattern. It is not an enterprise deployment manual wearing a lab coat.
The strategic lesson is memory discipline
The most useful idea in D2Skill is not that agents need memory. That is now obvious enough to be uninteresting. The useful idea is that memory must be structured, scored, retrieved, and pruned.
Task skills without step skills are too blunt. Step skills without task skills are too local. Similarity without utility is too trusting. Utility without exploration is too conservative. Memory without pruning is just hoarding with a cosine metric.
D2Skill’s contribution is to put these pieces into one reinforcement-learning loop. The policy improves, the skill bank evolves, and the system learns not only from outcomes but from the measured difference between acting with guidance and acting without it.
For businesses building agentic systems, this points to a practical shift. The competitive asset is not merely the model, nor even the raw interaction data. It is the managed conversion of repeated experience into reusable operational skill.
That is less glamorous than a giant model announcement. It is also closer to how durable process advantage usually works.
Cognaptus: Automate the Present, Incubate the Future.
[1] Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao, “Dynamic Dual-Granularity Skill Bank for Agentic RL,” arXiv:2603.28716v1, 30 March 2026, https://arxiv.org/abs/2603.28716.