Automation has a boring failure mode: the moment the world becomes slightly more complicated than the workflow diagram, the system starts asking for a human.
That is not because the model lacks vocabulary. It is because the automation system does not know how to grow its own capabilities. Most AI agents are still built around a fixed menu of actions, fixed task definitions, and fixed reward signals. They can optimize, but they rarely expand the set of things they know how to optimize for. Very impressive, in the way a microwave is impressive until you ask it to cook without buttons.
The paper CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs attacks this problem from a useful angle: instead of asking a foundation model to act inside the environment at every step, it asks the model to write executable reward programs that define new skills, connect them into a prerequisite graph, and use those programs to train a reinforcement-learning agent from scratch.[1]
That distinction matters. CODE-SHARP is not another “LLM controls the agent” architecture. The foundation model is not sitting inside the runtime loop, improvising every movement like an overpaid intern with terminal access. It is used offline to discover, implement, judge, and mutate skill programs. At runtime, the learned policy and the generated Python reward programs do the work.
The paper’s real contribution is therefore not simply that CODE-SHARP performs well on Craftax and XLand. The more interesting point is mechanical: it shows a plausible route from hand-designed automation logic toward self-expanding skill libraries, where new capabilities are written as code, reused as prerequisites, and improved through evaluation.
That is the part worth understanding.
The bottleneck is not reward design; it is reward extension
Traditional reinforcement learning starts with a defined task and a reward function. That is fine when the designer knows the task in advance. It becomes painful when the desired behavior is open-ended.
A reward function can teach an agent to complete a task. It does not automatically teach the system what the next meaningful task should be. For open-ended agents, this creates a practical bottleneck: someone must keep naming goals, defining success conditions, and decomposing long-horizon behavior into trainable parts.
Recent foundation-model-driven systems try to reduce that burden. Some use language models as planners. Some use them to suggest goals. Some use them to provide reward feedback. But many still depend on heavy human engineering: hand-built APIs, state captioners, translation functions, or demonstrations. The paper cites the Mineflayer API used in Minecraft work as roughly 13,000 lines of JavaScript, and MineDojo as trained on 33 years of annotated human gameplay. That is not “general agent intelligence.” That is a very large adapter bill.
CODE-SHARP tries to remove part of that bridge. It gives the foundation model access to environment source code, strips out reward and achievement logic, and asks it to generate Python programs that encode skills. These programs become the reward basis for training a goal-conditioned reinforcement-learning agent.
The shift is subtle but important:
| Older pattern | CODE-SHARP pattern |
|---|---|
| Human defines task rewards | Foundation model writes executable skill rewards |
| Agent learns one task or a fixed set of tasks | Agent trains on a growing archive of skills |
| Long-horizon behavior is hand-decomposed | Skills point to prerequisite skills |
| Foundation model may act or plan repeatedly at runtime | Foundation model mainly builds and edits the skill archive offline |
| Transfer requires environment-specific glue code | Transfer still requires source-code access, but less hand-written task glue |
This is why the paper’s “reward program” framing is more interesting than the usual “agent discovers skills” headline. Skills are not just labels. They are executable, testable artifacts.
A SHARP is a skill written as a small reward program
The paper calls each generated skill a SHARP, short for Skill as a Hierarchical Reward Program. A SHARP is a Python program that contains two essential pieces.
First, it has a success condition. This is the local definition of what it means for the skill to be completed. For example, a crafting skill might check whether the agent has a stone pickaxe in inventory.
Second, it has an ordered set of prerequisite conditions. Each condition is paired with another SHARP that should be invoked when that condition is not yet satisfied. For instance, “craft stone pickaxe” may require that the agent already has wood, stone, and access to a crafting table. If stone is missing, the system routes toward the “mine stone” SHARP. If mining stone requires a wooden pickaxe, routing continues downward.
In plain terms, each SHARP says:
- Here is what success looks like.
- Here are the immediate prerequisites.
- If a prerequisite is missing, use an older skill to get it.
One word in that list, immediate, carries much of the paper. CODE-SHARP does not ask the foundation model to write a complete long-horizon recipe from scratch every time. It asks for the marginal new behavior plus the nearest dependencies. The hierarchy does the rest.
This creates a directed acyclic graph of skills. Early skills are simple: find a tree, find water, mine wood. Later skills compose them: craft a wooden pickaxe, craft a stone pickaxe, craft torches, descend into dungeons, explore dark mines.
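To make the two pieces concrete, here is a minimal sketch of what a SHARP-like object could look like. It is not the paper's implementation: the `Sharp` class, the `has` helper, the dictionary-based state, and the item names are all hypothetical, and the crafting-table prerequisite is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical state: inventory counts keyed by item name.
State = Dict[str, int]
Condition = Callable[[State], bool]


@dataclass
class Sharp:
    """Illustrative skill: a success check plus ordered (condition, fallback skill) pairs."""
    name: str
    success: Condition
    prerequisites: List[Tuple[Condition, "Sharp"]] = field(default_factory=list)


def has(item: str, count: int = 1) -> Condition:
    """Build a condition that holds when the state contains at least `count` of `item`."""
    return lambda state: state.get(item, 0) >= count


# Each skill names only its immediate prerequisites; deeper chains live in the older skills.
mine_wood = Sharp("mine_wood", success=has("wood"))
craft_wooden_pickaxe = Sharp(
    "craft_wooden_pickaxe",
    success=has("wooden_pickaxe"),
    prerequisites=[(has("wood"), mine_wood)],
)
mine_stone = Sharp(
    "mine_stone",
    success=has("stone"),
    prerequisites=[(has("wooden_pickaxe"), craft_wooden_pickaxe)],
)
craft_stone_pickaxe = Sharp(
    "craft_stone_pickaxe",
    success=has("stone_pickaxe"),
    prerequisites=[(has("wood"), mine_wood), (has("stone"), mine_stone)],
)


def reward(skill: Sharp, state: State) -> float:
    """Sparse reward (an assumption): 1.0 when the skill's own success condition holds."""
    return 1.0 if skill.success(state) else 0.0
```

Note that `craft_stone_pickaxe` never mentions the wooden pickaxe. That dependency belongs to `mine_stone`, which is the point of asking only for immediate prerequisites.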
The business analogy is not “an AI writes its own dreams.” Cute, but too theatrical. The better analogy is a workflow library that can generate new validation functions and dependency links as it learns what operations are possible.
A primitive automation says, “Run step A, then B, then C.”
A SHARP-like system says, “To perform C, check whether B is satisfied; if not, invoke the skill that makes B true; if that requires A, route there first.”
That is a different kind of operational memory.
The recursive router is the mechanism that makes the hierarchy useful
A graph of skills is not automatically useful. Many systems have libraries, taxonomies, and dependency maps. They sit there, looking organized, quietly doing nothing.
CODE-SHARP’s key runtime mechanism is a recursive transition operator. At each environment step, the system starts from a target SHARP and descends through unmet prerequisites until it reaches the most immediately actionable skill. That active SHARP conditions the policy and provides the reward.
So if the target is “craft stone pickaxe,” the runtime system does not blindly reward only the final stone-pickaxe state. It asks what is missing now. If the agent lacks stone, it routes to mining stone. If mining stone requires a wooden pickaxe and the agent lacks one, it routes again. Once the lower-level condition is satisfied, the system can move back up.
This has two practical effects.
The first is learning efficiency. A new skill does not require the agent to rediscover the entire chain leading to it. The hierarchy guides the agent into states where only the new marginal behavior remains to be learned.
The second is robustness. Because routing is recalculated from the current state, the system can adapt when the environment changes or when prerequisites are already satisfied. Flat plans are brittle because they assume the sequence remains valid. Recursive prerequisite routing is more like operational diagnosis: inspect the current state, identify the nearest missing condition, dispatch to the right subroutine.
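The routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's operator: the `SKILLS` table, the `has` helper, and the inventory-style state are hypothetical, and the sketch always follows the first unmet prerequisite in the listed order.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Check = Callable[[State], bool]


def has(item: str) -> Check:
    """Condition that holds when the (hypothetical) inventory contains the item."""
    return lambda state: state.get(item, 0) > 0


# skill name -> (success check, ordered list of (prerequisite check, fallback skill name))
SKILLS: Dict[str, Tuple[Check, List[Tuple[Check, str]]]] = {
    "mine_wood": (has("wood"), []),
    "craft_wooden_pickaxe": (has("wooden_pickaxe"), [(has("wood"), "mine_wood")]),
    "mine_stone": (has("stone"), [(has("wooden_pickaxe"), "craft_wooden_pickaxe")]),
    "craft_stone_pickaxe": (
        has("stone_pickaxe"),
        [(has("wood"), "mine_wood"), (has("stone"), "mine_stone")],
    ),
}


def active_skill(target: str, state: State) -> str:
    """Descend through the first unmet prerequisite until a skill is directly actionable."""
    _, prerequisites = SKILLS[target]
    for condition, fallback in prerequisites:
        if not condition(state):
            return active_skill(fallback, state)
    return target  # every immediate prerequisite holds, so this skill is active


# Recomputed at every step from the current state, so the route adapts as the state changes.
print(active_skill("craft_stone_pickaxe", {}))                        # mine_wood
print(active_skill("craft_stone_pickaxe", {"wood": 1}))               # craft_wooden_pickaxe
print(active_skill("craft_stone_pickaxe", {"wood": 1, "stone": 1}))   # craft_stone_pickaxe
```

The last call shows the robustness point: if stone is already in the inventory, the router never asks for a wooden pickaxe, because nothing in the current state needs one.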
This is why the paper’s flat reward program ablation matters. The authors compare CODE-SHARP with CODE-FRP, where reward programs are flat rather than recursively hierarchical. The flat version must enumerate the full sequence of prerequisite skills. That increases the specification burden on the foundation model and removes step-level adaptation.
The result is not flattering for flatness. In Craftax-Classic, CODE-SHARP reaches a median success rate of 94.8% across 22 achievements, while CODE-FRP reaches 38.9%, ELLM reaches 8.1%, and OMNI reaches 15.8%. The average success rates are closer but still favor CODE-SHARP: 67.2% versus 44.2% for CODE-FRP, 39.8% for ELLM, and 39.2% for OMNI.
The median result is especially revealing. CODE-SHARP does not merely raise performance on a few easy tasks. It shifts the distribution of achieved skills upward.
The foundation model is a skill librarian, not a joystick
One likely misunderstanding is that CODE-SHARP works because the foundation model is acting as an intelligent controller at every step. It does not.
The foundation model appears in two archive-building loops.
The first loop discovers new SHARPs. It proposes candidate skills, translates them into Python code, and uses a foundation-model judge to evaluate correctness, feasibility, and novelty. Candidates that compile and show learning progress are added to the archive.
The second loop mutates existing SHARPs. It samples weaker skills more often, asks the model to propose repairs, and evaluates whether the mutation improves success. Useful mutations replace earlier versions.
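A rough sketch of the mutation loop's two decisions, which skill to revisit and whether to keep the revision, might look like this. The failure-rate weighting and the strict-improvement rule are assumptions chosen for illustration; the paper's exact sampling and acceptance criteria may differ, and `sample_for_mutation` and `maybe_promote` are hypothetical names.

```python
import random
from typing import Dict


def sample_for_mutation(success_rates: Dict[str, float], rng: random.Random) -> str:
    """Pick a skill to mutate, weighting by failure rate so weaker skills come up more often."""
    names = list(success_rates)
    # Small epsilon keeps every weight positive even for skills with 100% success.
    weights = [1.0 - success_rates[name] + 1e-6 for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def maybe_promote(archive: Dict[str, float], name: str, candidate_rate: float) -> bool:
    """Replace the stored version only when the mutated skill scores strictly higher."""
    if candidate_rate > archive[name]:
        archive[name] = candidate_rate
        return True
    return False


# Hypothetical archive mapping skill names to measured success rates.
archive = {"mine_wood": 0.95, "kill_orc_warrior": 0.10}
chosen = sample_for_mutation(archive, random.Random(0))
promoted = maybe_promote(archive, "kill_orc_warrior", candidate_rate=0.35)
```

In a fuller version the archive would store the skill's code alongside its measured success rate, so that a promoted mutation replaces both.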
At runtime, the foundation model is absent. The policy is trained with PPO using rewards generated by the SHARP archive. The agent is conditioned on the active SHARP name embedding and the environment state. The reward comes from the executable SHARP code.
That design choice is commercially interesting because runtime foundation-model calls are expensive, slow, and operationally fragile. A system that uses a large model to build skill definitions offline, then executes compiled reward logic and learned policies online, has a different cost profile from one that asks a language model to reason through every action.
The paper’s XLand experiment makes this distinction sharper. XLand introduces stochastic object interaction rules and lower-level physical navigation. CODE-SHARP achieves a median success rate of 79.2% and an average of 61.6% across 20 evaluation targets. ELLM reaches 29.7% median and 45.4% average; OMNI reaches 2.3% median and 19.3% average; CODE-FRP reaches 6.1% median and 22.8% average.
The reason is not mystical intelligence. It is that SHARPs can route through different prerequisite skills depending on the current state. In one example, the red-key pickup achievement involves an active crafting rule sampled from 18 possible rules. CODE-SHARP succeeds in over 37% of episodes; ELLM succeeds in 0.6%.
That is not a general proof of enterprise readiness. It is a clean demonstration of why state-dependent prerequisite routing beats fixed target-state sequences when the environment varies.
Archive evolution is not decoration; it repairs the model’s first draft
The mutation loop deserves more attention than a simple summary usually gives it.
Foundation models write imperfect code and imperfect dependencies. In CODE-SHARP, that is assumed rather than hidden. The archive is not treated as a sacred collection of first drafts. It is continuously edited.
In XLand, only 11% of achievements were matched to mutated SHARPs, but those mutations produced an average relative performance increase of 130.7% over the base versions. The paper attributes the gain mainly to adding prerequisites that were missing in the original specification.
In the long-run Craftax-Extended experiment, the authors describe mutations that fix structural inefficiencies such as condition ordering and poorly chosen prerequisites. One example is a KillOrcWarrior SHARP whose base version required descending to the dungeon before crafting a stone sword. The refined version reorders the steps, improving success.
This matters because it changes how we should interpret the framework. CODE-SHARP is not claiming that a foundation model can perfectly design a curriculum in one pass. It is closer to a generated-code system with evaluation-driven repair. The model proposes. The environment tests. The archive keeps the better version.
For business automation, that is a more credible pattern than “the AI understands everything.” It suggests a pipeline where generated workflow checks and prerequisite maps are sandboxed, measured, and promoted only when they improve execution. Less magical. More useful.
The long-run result tests open-ended growth, not just task performance
The comparative experiments test whether CODE-SHARP beats prior open-ended skill discovery methods in Craftax-Classic and XLand. The long-run experiment asks a different question: can the archive continue to grow into a useful general skill base?
For this test, the authors scale to Craftax-Extended, which adds NetHack-like dungeon-crawling dynamics and eight new levels. CODE-SHARP runs for 100 proposal iterations and 85 evolution iterations, discovering an average of 90 SHARPs. In one reported run, the appendix lists 93 discovered SHARPs.
The evaluation is deliberately harder than “did the agent complete the same rewards it trained on?” The agent is trained only on CODE-SHARP-discovered rewards, not on the original environment rewards or benchmark rewards. Then an FM-based policy planner composes discovered SHARPs into policies-in-code for four challenging benchmark tasks: Crafting, Dungeon, Navigation, and Mines.
This means benchmark performance acts as a joint test of four things:
| What is being tested | Why it matters |
|---|---|
| Archive breadth | Did CODE-SHARP discover enough useful skills to cover the environment? |
| Skill fidelity | Did the agent actually learn the behavior each SHARP intended? |
| Composability | Can separate skills be assembled into longer benchmark policies? |
| Hierarchical value | Does SHARP routing outperform flat reward programs under the same planner? |
The headline result is strong. CODE-SHARP achieves an overall benchmark score of 50.5. That is on par with, and slightly ahead of, the fine-tuned agent, which is trained on the original Craftax-Extended rewards with additional benchmark fine-tuning and scores 47.9. CODE-SHARP also outperforms the task-specific agent at 12.2, the ReAct-style LLM agent at 6.7, and CODE-FRP at 14.1.
The detailed breakdown is more informative than the average. CODE-SHARP beats the fine-tuned agent on Crafting, 66.5 versus 58.1, and Dungeon, 87.4 versus 77.1. The fine-tuned agent remains ahead on Navigation, 17.3 versus 12.7, and Mines, 39.2 versus 35.6. The paper suggests that ground-truth environment rewards may explicitly incentivize deeper exploration, which helps in those latter tasks.
That boundary is important. CODE-SHARP does not dominate everywhere. It performs best where discovered hierarchical skills cover the needed behavior and can be composed effectively. It is weaker where the benchmark depends on deeper exploration patterns that the discovered reward archive may not sufficiently incentivize.
The appendix makes this visible. In the Mines benchmark, CODE-SHARP performs strongly through earlier milestones such as crafting tools, torches, descending to the dungeon, and killing dungeon monsters. Performance drops on late milestones such as drinking water and finding diamonds. That is not failure hidden in the footnotes; it is the shape of the frontier.
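The aggregate figures quoted above can be reproduced from the per-task scores. The short check below assumes the overall score is the unweighted mean of the four benchmark tasks, which is consistent with the reported numbers; the dictionary layout is just a convenience.

```python
# Per-task benchmark scores as quoted in the text above.
scores = {
    "CODE-SHARP": {"Crafting": 66.5, "Dungeon": 87.4, "Navigation": 12.7, "Mines": 35.6},
    "Fine-tuned": {"Crafting": 58.1, "Dungeon": 77.1, "Navigation": 17.3, "Mines": 39.2},
}

# Assumption: the overall score is a plain average across the four tasks.
overall = {agent: sum(tasks.values()) / len(tasks) for agent, tasks in scores.items()}

# Per-task gap, positive where CODE-SHARP is ahead.
gaps = {
    task: round(scores["CODE-SHARP"][task] - scores["Fine-tuned"][task], 1)
    for task in scores["CODE-SHARP"]
}

for agent, value in overall.items():
    print(f"{agent}: {value:.2f}")
print(gaps)
```

The averages come out at roughly 50.55 and 47.93. The gaps show where the aggregate lead comes from: Crafting and Dungeon contribute it, while Navigation and Mines pull the other way.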
The ablations identify the training mechanics, not just the architecture
The paper’s appendix includes an ablation study over three training components: open-ended training, adaptive reward scaling, and opportunistic sampling.
These are easy to treat as implementation details. They are not.
Open-ended training allows the agent to keep moving through targets rather than treating each episode as a single isolated attempt. Adaptive reward scaling increases the reward weight for skills with low success rates. Opportunistic sampling biases target selection toward SHARPs whose prerequisite conditions are already satisfied in the current state even though the prerequisite skills themselves have low success rates, exploiting rare states where the agent can practice difficult skills.
The average benchmark scores degrade monotonically as these components are removed:
| Condition | Average score | Interpretation |
|---|---|---|
| Full CODE-SHARP | 50.55 | Complete system |
| No opportunistic sampling | 31.93 | Major drop; frontier practice matters |
| No reward scaling or opportunistic sampling | 21.20 | Hard skills receive weaker learning pressure |
| No open-ended training, reward scaling, or opportunistic sampling | 13.50 | Episodic uniform training fails to use the archive effectively |
The likely purpose of this ablation is not to prove that every possible implementation must use these exact formulas. It shows that the hierarchy alone is not enough. Once the archive grows, the training distribution becomes a strategic object. The agent must spend enough time at the frontier of its capability graph, not merely sample uniformly from a menu of old and new skills.
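To give a feel for what adaptive reward scaling and opportunistic sampling do to the training distribution, here is a toy sketch. The formulas are invented for illustration and are not the paper's: `reward_scale`, `target_weights`, `max_scale`, and `bonus` are hypothetical names and parameters.

```python
from typing import Dict


def reward_scale(success_rate: float, max_scale: float = 5.0) -> float:
    """Hypothetical linear scaling: 1.0 at 100% success, `max_scale` at 0% success."""
    return 1.0 + (max_scale - 1.0) * (1.0 - success_rate)


def target_weights(
    ready: Dict[str, bool],
    prereq_success: Dict[str, float],
    bonus: float = 4.0,
) -> Dict[str, float]:
    """Sampling weights that favor targets that are ready now but rarely reachable.

    `ready[name]` says whether the target's prerequisite conditions hold in the current
    state; `prereq_success[name]` is how often the agent manages to reach such a state.
    """
    weights = {}
    for name, is_ready in ready.items():
        rarity = 1.0 - prereq_success[name]
        weights[name] = 1.0 + (bonus * rarity if is_ready else 0.0)
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# The agent has stumbled into a state where a rarely reachable skill can be practiced.
probs = target_weights(
    ready={"find_diamond": True, "mine_wood": True, "kill_orc": False},
    prereq_success={"find_diamond": 0.05, "mine_wood": 0.95, "kill_orc": 0.05},
)
```

In this toy setup `find_diamond` receives most of the sampling mass, because it is both ready and hard to reach. `kill_orc` is equally hard to reach but not ready, so it gets no boost.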
For enterprise readers, the analogy is straightforward. A company may have a knowledge base, workflow library, or process map. That does not mean employees or agents practice the right frontier tasks. Capability growth depends on what gets sampled, reinforced, and repaired.
Smaller models expose the real dependency: skill novelty judgment
The paper also tests a smaller foundation model, Qwen3-30B-A3B-Thinking-2507, against the larger Qwen3-235B-A22B-Thinking-2507 used in the main experiments. This is best read as a sensitivity test, not a second thesis.
The smaller model performs competitively on lower-complexity skills. It even scores higher on Crafting in one comparison. But it underperforms on longer-horizon tasks such as Dungeon, Mines, and Navigation. The authors report that the smaller model struggles with judging novelty, producing duplicate skills that stall discovery around minor variations instead of pushing toward more complex skills.
That detail matters more than the raw model-size comparison. CODE-SHARP depends not only on code generation but also on curriculum judgment: what is new, what is feasible, what builds on the archive, and what should be repaired. A cheaper model that can write syntactically valid code may still fail to grow the archive in the right direction.
The business interpretation is not “use the largest model forever.” That would be the easy and expensive lesson. The better lesson is that different parts of the pipeline have different model-quality requirements:
| Pipeline role | Main requirement | Failure mode |
|---|---|---|
| Skill proposal | Understand environment affordances | Trivial or impossible skills |
| Skill implementation | Write executable code | Compile errors or wrong success checks |
| Skill judging | Assess novelty and feasibility | Duplicate skills, shallow archive growth |
| Mutation | Diagnose structural flaws | Wrong prerequisites or worse variants |
| Policy planning | Compose useful skills | Missing auxiliary objectives |
The paper suggests that novelty judgment and long-horizon composition may be the expensive parts. That is useful for anyone designing a production version. Spend model budget where weak judgment damages the archive, not where a cheap code translator is already sufficient.
What CODE-SHARP directly shows
A disciplined reading separates paper evidence from business inference.
The paper directly shows that, in game-like embodied environments with accessible source code, CODE-SHARP can autonomously generate a growing archive of executable hierarchical reward programs. Those programs can train a goal-conditioned RL agent from scratch without hand-written achievement rewards, curated demonstrations, or runtime foundation-model control.
It also directly shows that hierarchical routing is central to performance. CODE-FRP shares much of the setup but removes recursive SHARP routing. Its performance collapses relative to CODE-SHARP in both comparative and long-run evaluations.
Finally, the paper shows that archive growth can translate into downstream zero-shot benchmark performance. The Craftax-Extended result is the strongest evidence here: the agent trained only on generated SHARP rewards can match a fine-tuned ground-truth-reward baseline on the aggregate benchmark score, while significantly outperforming ReAct and task-specific baselines.
That is enough to be interesting. No need to inflate it into general artificial intelligence discovering its destiny before breakfast.
What Cognaptus infers for business automation
The business relevance is not that companies should immediately train RL agents inside their ERP systems. Please do not release a dungeon-crawling reward-program agent into procurement and call it transformation.
The useful inference is architectural.
Many business processes contain implicit prerequisite structures. A sales operation needs lead qualification before proposal drafting. A finance automation needs document completeness before reconciliation. A compliance workflow needs evidence collection before risk classification. A customer support agent needs diagnosis before resolution.
Today, these structures are often hardcoded as brittle workflow trees. CODE-SHARP points to a different pattern: generated, executable success checks plus dependency routing plus evaluation-driven archive repair.
In a business setting, the analog of a SHARP might be a small program that verifies whether a task state is complete:
- customer identity verified;
- invoice fields reconciled;
- missing document requested;
- support issue classified;
- data pipeline freshness confirmed;
- exception escalated with required evidence.
The analog of prerequisite routing would be the system identifying which lower-level condition is missing and invoking the relevant subroutine. The analog of mutation would be revising the success condition or prerequisite map when operational evidence shows the first version was wrong.
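As a purely hypothetical illustration of that analogy, the sketch below expresses three workflow checks as small functions and returns the chain of dispatches from a target task down to the nearest actionable step. The workflow names, the `flag` helper, and the boolean case record are invented for the example and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, bool]
Check = Callable[[Case], bool]


def flag(key: str) -> Check:
    """Completion check that reads a boolean field from a (hypothetical) case record."""
    return lambda case: case.get(key, False)


# subroutine -> (completion check, ordered list of (prerequisite check, subroutine to dispatch))
WORKFLOW: Dict[str, Tuple[Check, List[Tuple[Check, str]]]] = {
    "request_missing_document": (flag("documents_complete"), []),
    "verify_customer_identity": (
        flag("identity_verified"),
        [(flag("documents_complete"), "request_missing_document")],
    ),
    "reconcile_invoice_fields": (
        flag("invoice_reconciled"),
        [(flag("identity_verified"), "verify_customer_identity")],
    ),
}


def dispatch_trail(target: str, case: Case) -> List[str]:
    """Return the chain of subroutines from the target down to the nearest actionable step."""
    trail = [target]
    current = target
    while True:
        _, prerequisites = WORKFLOW[current]
        # First prerequisite whose check fails decides where to route next.
        missing = next((sub for check, sub in prerequisites if not check(case)), None)
        if missing is None:
            return trail
        trail.append(missing)
        current = missing


print(dispatch_trail("reconcile_invoice_fields", {}))
print(dispatch_trail("reconcile_invoice_fields", {"documents_complete": True}))
```

Returning the whole trail rather than only the final step is a deliberate choice here: the trail doubles as an audit record of why a given subroutine was invoked.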
That is the practical path from this paper to enterprise AI:
| Paper mechanism | Business analog | Potential value |
|---|---|---|
| SHARP success condition | Executable workflow completion check | Less vague task status |
| SHARP prerequisite graph | Dependency map among business subroutines | Better routing and reuse |
| Recursive transition operator | Diagnose nearest missing condition | Less brittle process execution |
| Proposal and judge loop | Generate and screen new workflow skills | Lower process-engineering cost |
| Mutation loop | Repair flawed task definitions | Continuous process improvement |
| Opportunistic sampling | Practice or test frontier cases | Better handling of rare exceptions |
The ROI story is not “agents become autonomous.” That phrase has been abused enough. The ROI story is cheaper diagnosis, reusable subroutines, and reduced manual workflow engineering in domains where success conditions can be programmatically checked.
Where the paper does not yet travel
The boundaries are not minor.
First, CODE-SHARP requires access to environment source code. That is reasonable in simulated worlds and some software environments. It is harder in messy business settings where the “environment” is a mix of SaaS tools, human approvals, informal policies, PDFs, and incomplete logs.
Second, the experiments are in game-like embodied environments. They are complex enough to test long-horizon skill discovery, but they are still clean compared with real enterprise processes. Business workflows contain ambiguous objectives, legal constraints, adversarial incentives, and social context. A reward program can check whether a field is complete. It cannot, by itself, decide whether a client relationship is politically sensitive.
Third, the framework assumes that candidate skills can be evaluated through environment interaction. In production, unsafe or costly actions cannot simply be tried. A business version would need sandboxing, audit logs, approval gates, and simulation layers before any generated skill touches live operations.
Fourth, the foundation model’s quality matters. The smaller-model ablation shows that weaker novelty judgment can stall the archive. If a business system keeps generating near-duplicate “skills,” the result is not intelligence. It is process clutter with a nicer name.
Finally, CODE-SHARP trains RL agents with nontrivial compute. The comparative experiments use GPU resources and multi-day runs. The long-run experiment requires roughly three days. That may be acceptable for research and some high-value automation domains. It is not a casual plug-in for ordinary office workflows.
The limitation is not that the paper is weak. It is that its strongest version assumes a world where behavior can be simulated, state can be inspected, and success can be encoded. Enterprises should read it as an architecture for controlled environments first, not a universal recipe for office autonomy.
The real shift: from prompts to capability archives
Most agent discussions still orbit around prompts, tools, and memory. CODE-SHARP shifts attention to a more durable object: the skill archive.
A prompt tells the system what to do now. A tool lets it act. A memory stores past information. A skill archive stores executable definitions of what capabilities exist, how they depend on one another, and how the agent should be rewarded for acquiring them.
That is a more scalable unit of agent development. It is also easier to govern. A generated skill can be inspected. A prerequisite graph can be audited. A mutation can be compared against its parent. A success condition can be tested. None of this makes the system safe by default, but it gives governance something concrete to hold onto.
The existing agent hype cycle likes to imagine systems that “just figure it out.” CODE-SHARP is more interesting because it shows what “figuring it out” might need to look like mechanically: write a local success condition, connect it to prerequisites, train the policy, evaluate progress, repair the archive, and repeat.
That is less romantic than an agent with ambitions.
It is also more useful.
Conclusion: agents do not need bigger dreams; they need better prerequisites
CODE-SHARP’s strongest idea is not that an agent can write its own ambitions. It is that a system can turn possible ambitions into executable, hierarchical reward programs.
The hierarchy is the point. Without it, long-horizon tasks become brittle sequences. With it, new skills can be defined by marginal behavior and routed through already learned prerequisites. That is why the flat reward program ablation matters. That is why the mutation loop matters. That is why the long-run archive matters.
For business AI, the lesson is not to copy Craftax into the enterprise. The lesson is to stop treating automation as a pile of isolated tasks. Real capability grows through reusable subroutines, explicit success checks, dependency-aware routing, and disciplined repair.
CODE-SHARP does not solve enterprise automation. It offers a clean research prototype of a deeper pattern: agents become more useful when their capabilities are not just prompted, but organized, tested, and evolved as code.
The ambition is not the miracle.
The archive is.
Cognaptus: Automate the Present, Incubate the Future.
[1] Richard Bornemann, Pierluigi Vito Amadori, and Antoine Cully, “CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs,” arXiv:2602.10085v3, 21 May 2026, https://arxiv.org/html/2602.10085.