Opening — Why This Matters Now
Everyone wants “agentic AI.” Few are willing to admit that most agents today are glorified interns with a checklist.
Reinforcement learning (RL) systems remain powerful—but painfully narrow. They master what we explicitly reward. Nothing more. The real bottleneck isn’t compute. It isn’t model size. It’s imagination—specifically, how rewards are defined.
The paper CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs tackles a structural weakness in modern AI systems: if you don’t know the future tasks, how do you design the rewards today?
The answer, in short: let the system write its own reward programs—hierarchically, incrementally, and in executable code.
Not prompt engineering. Not reward tuning.
Programmatic ambition.
Background — The Reward Function Trap
Traditional RL requires:
- A defined task.
- A hand-designed reward function.
- Iterative optimization.
This works beautifully—until you want open-ended capability growth.
Recent approaches attempt to automate reward refinement using foundation models (FMs). But most still assume a fixed task structure. They improve scoring rules; they do not expand the universe of goals.
The constraint can be summarized as:
$$ \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ R_{\text{fixed}} \right] $$
If $R$ is fixed, intelligence plateaus at the boundary of your imagination.
CODE-SHARP reframes the problem.
Instead of optimizing policy under a static reward, it continuously discovers new reward programs—structured hierarchically and stored in a skill archive.
This is not reward tuning.
It is reward evolution.
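Schematically, using illustrative notation rather than the paper's formalism: the policy still optimizes whichever reward program it is conditioned on, but the set of reward programs itself keeps growing.

$$ \pi_t = \arg\max_{\pi} \, \mathbb{E}_{\pi}[\,r\,], \quad r \in \mathcal{R}_t, \qquad \mathcal{R}_{t+1} = \mathcal{R}_t \cup \Delta\mathcal{R}_t $$

Here $\Delta\mathcal{R}_t$ stands for the batch of reward programs the foundation model proposes at step $t$. The optimization target is no longer a fixed point; it is a moving, expanding set.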
What CODE-SHARP Actually Does
At its core, CODE-SHARP builds a hierarchical directed graph of executable reward programs, called SHARP skills.
Each node represents:
- A skill
- Its reward function written in code
- Dependencies on lower-level skills
The system uses a Foundation Model to:
- Propose new reward programs
- Evaluate their novelty and utility
- Insert them into a growing skill graph
The result is a continuously expanding capability archive.
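To make the node structure concrete, here is a minimal sketch of what one entry in such a skill graph might look like in Python. The class name `SharpSkill`, its fields, and the `compile` helper are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative sketch only; names and fields are assumptions, not the paper's API.
@dataclass
class SharpSkill:
    name: str                                  # human-readable skill identifier
    reward_code: str                           # FM-generated reward program, stored as source code
    dependencies: List[str] = field(default_factory=list)  # lower-level skills this one builds on

    def compile(self) -> Callable[[dict], float]:
        """Turn the stored reward program into an executable reward function.

        The program is expected to define `reward(obs) -> float`.
        A real system would run this in a sandbox, not a bare exec();
        see the governance section below.
        """
        namespace: Dict[str, object] = {}
        exec(self.reward_code, namespace)
        return namespace["reward"]

# Example: an atomic "collect wood" skill with an FM-written reward program.
collect_wood = SharpSkill(
    name="collect_wood",
    reward_code="def reward(obs):\n    return 1.0 if obs.get('wood', 0) > 0 else 0.0",
)
reward_fn = collect_wood.compile()
print(reward_fn({"wood": 3}))  # 1.0
```

The point is that the reward is data the system can generate, store, and re-execute, rather than a function a human wired in at design time.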
Structural Architecture
| Layer | Function | Role |
|---|---|---|
| Foundation Model | Generates reward code | Creative expansion |
| Skill Archive (Graph) | Stores hierarchical reward programs | Memory + structure |
| Goal-Conditioned Agent | Trains exclusively on discovered rewards | Execution engine |
| High-Level FM Planner | Composes skills | Strategic coordination |
The crucial idea: the agent trains only on rewards discovered by the system itself.
No handcrafted long-horizon objectives.
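Read as a sketch, the loop across these layers might look like the following. Every class and method name here (`StubFM`, `SkillArchive`, `propose_reward_program`, and so on) is a placeholder standing in for components the paper describes, not its actual interface:

```python
# Hedged sketch of the discovery loop; all names are illustrative placeholders.

class StubFM:
    """Stands in for the foundation model that writes reward programs."""
    def __init__(self):
        self.ideas = iter(["collect_wood", "collect_stone", "craft_pickaxe"])
    def propose_reward_program(self, graph_summary):
        name = next(self.ideas, None)
        return None if name is None else {"name": name, "code": f"# reward program for {name}"}

class SkillArchive:
    """Stores discovered reward programs as a growing graph (flattened to a dict for brevity)."""
    def __init__(self):
        self.skills = {}
    def graph_summary(self):
        return list(self.skills)
    def is_novel(self, candidate):
        return candidate is not None and candidate["name"] not in self.skills
    def insert(self, candidate):
        self.skills[candidate["name"]] = candidate

class StubAgent:
    """Goal-conditioned agent: trains only on rewards that exist in the archive."""
    def train_on(self, skill):
        print(f"training on discovered reward: {skill['name']}")

fm, archive, agent = StubFM(), SkillArchive(), StubAgent()
for _ in range(5):                                   # the discovery loop
    candidate = fm.propose_reward_program(archive.graph_summary())
    if archive.is_novel(candidate):                  # novelty/utility check before insertion
        archive.insert(candidate)
        agent.train_on(candidate)
print(archive.graph_summary())                       # ['collect_wood', 'collect_stone', 'craft_pickaxe']
```

The high-level FM planner would then compose entries of the archive into multi-step plans; it is omitted here to keep the sketch short.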
Hierarchical Reward Programs — Why It Works
Flat skill libraries fail at scale. Composition becomes brittle.
CODE-SHARP enforces structure:
- Low-level atomic behaviors
- Mid-level compositional skills
- High-level multi-stage objectives
Graph-based organization allows skill reuse and refinement.
Mathematically, if $S$ represents discovered skills and $G$ the skill graph:
$$ S_{t+1} = S_t \cup \text{FM\_generate}(G_t) $$
The space of achievable policies expands as the reward basis expands.
This transforms RL from optimization into curriculum self-generation.
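To make the hierarchy concrete, here is a toy composition under the same illustrative conventions as the earlier sketch: a mid-level reward program that reuses low-level rewards instead of re-specifying their logic. The specific skills and weights are invented for illustration:

```python
# Toy illustration of hierarchical reuse; skills and weights are invented, not from the paper.

low_level = {
    "collect_wood":  lambda obs: 1.0 if obs.get("wood", 0) > 0 else 0.0,
    "collect_stone": lambda obs: 1.0 if obs.get("stone", 0) > 0 else 0.0,
}

def make_craft_pickaxe_reward(subskills):
    """Mid-level reward defined in terms of low-level rewards, not raw observations."""
    def reward(obs):
        # Partial credit for satisfied prerequisites, full credit for the crafted item.
        prereq = sum(subskills[name](obs) for name in ("collect_wood", "collect_stone"))
        crafted = 1.0 if obs.get("pickaxe", 0) > 0 else 0.0
        return 0.25 * prereq + crafted
    return reward

craft_pickaxe = make_craft_pickaxe_reward(low_level)
print(craft_pickaxe({"wood": 2, "stone": 1, "pickaxe": 0}))  # 0.5: prerequisites met, not yet crafted
print(craft_pickaxe({"wood": 2, "stone": 1, "pickaxe": 1}))  # 1.5: crafted
```

Because the mid-level program calls the low-level ones, refining `collect_wood` automatically propagates to every skill that depends on it. That is the reuse the graph structure buys.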
Empirical Results — Craftax as a Stress Test
The system was evaluated in the Craftax environment, a long-horizon crafting world where completing a task requires satisfying chains of sequential dependencies.
Key findings:
| Metric | Baseline Pretrained Agents | Task-Specific Experts | CODE-SHARP + Planner |
|---|---|---|---|
| Long-Horizon Success Rate | Moderate | High (task-bound) | Highest |
| Generalization Across Goals | Low | Low | High |
| Average Performance Gain | — | — | +134% |
Two insights matter most:
- A single goal-conditioned agent trained only on discovered rewards can solve increasingly long-horizon tasks.
- When composed by a high-level planner, discovered skills outperform expert policies by over 134% on average.
That is not marginal.
That is structural leverage.
What This Means for Business AI
Open-ended capability growth is not just an academic ambition. It has direct operational implications.
1. Automation at the Frontier
Most enterprise automation fails because workflows evolve.
Hardcoded reward systems = brittle bots.
Hierarchical reward programs suggest a pathway toward systems that:
- Expand internal capability graphs
- Discover new subroutines
- Adapt without full retraining
2. Reduced Human Reward Engineering
Designing robust incentive systems is expensive.
If foundation models can propose reward structures in code, the bottleneck shifts from designing objectives to governing expansion.
Which raises governance questions.
Governance & Control — The Quiet Risk
An agent that writes its own reward programs is, effectively, writing its own ambitions.
The paper focuses on performance. Businesses must focus on containment.
Key oversight questions:
| Risk Dimension | Governance Challenge |
|---|---|
| Reward Drift | Are new skills aligned with enterprise goals? |
| Skill Graph Complexity | Can we audit hierarchical dependencies? |
| Planner Autonomy | Who validates high-level compositions? |
| Execution Safety | Can discovered skills trigger unsafe actions? |
Open-ended growth is powerful. Unbounded growth is dangerous.
Enterprise adoption will require:
- Skill approval pipelines
- Reward sandboxing
- Auditable skill graphs
- Planner-level constraint systems
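As a thought experiment, the first two items could reduce to a gate in front of the archive's insert step. The check below is a deliberately crude sketch; the paper proposes no such mechanism, and a real sandbox would be far stricter:

```python
# Illustrative governance gate; a sketch of "skill approval + reward sandboxing", not the paper's design.

BANNED_TOKENS = ("import", "open(", "exec(", "eval(", "__")   # crude static screen, illustrative only

def approve_skill(candidate_code, reviewers):
    """Return True only if an FM-proposed reward program passes screening, sandboxing, and sign-off."""
    # 1. Static screen: reject programs that reach beyond the observation dict.
    if any(token in candidate_code for token in BANNED_TOKENS):
        return False
    # 2. Sandbox check: the program must define reward(obs) and run without raising.
    try:
        namespace = {}
        exec(candidate_code, {"__builtins__": {}}, namespace)  # cheap stand-in for a real sandbox
        namespace["reward"]({"wood": 0})
    except Exception:
        return False
    # 3. Explicit sign-off: no reviewers means no approval, keeping the audit trail mandatory.
    return bool(reviewers) and all(review(candidate_code) for review in reviewers)

def human_reviewer(code):
    """Stand-in for a human or policy-engine reviewer."""
    return "def reward(" in code

print(approve_skill("def reward(obs):\n    return 1.0 if obs.get('wood', 0) > 0 else 0.0",
                    [human_reviewer]))  # True
```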
Broader Implications — Beyond Crafting Worlds
The conceptual leap is larger than the experiment.
CODE-SHARP suggests that:
- Foundation models can generate executable incentives
- Skill graphs can become internal knowledge economies
- RL agents can scale capability without fixed task sets
This architecture resembles:
- Organizational learning systems
- API-based microservice ecosystems
- Modular capability marketplaces
The frontier shifts from model training to capability ecosystem design.
And that is a different business entirely.
Conclusion — From Training to Evolution
Most AI systems today are optimized.
CODE-SHARP proposes systems that evolve.
The shift from fixed rewards to hierarchical reward programs may prove as important as the shift from small models to foundation models.
Because once an agent can program its own incentives, it stops waiting for instructions.
It starts building its own ladder.
The real question is not whether it can climb.
It’s whether we designed the walls wisely.
Cognaptus: Automate the Present, Incubate the Future.