Opening — Why This Matters Now
Everyone wants “agentic AI.” Few are willing to admit that most agents today are glorified interns with a checklist.
Reinforcement learning (RL) systems remain powerful—but painfully narrow. They master what we explicitly reward. Nothing more. The real bottleneck isn’t compute. It isn’t model size. It’s imagination—specifically, how rewards are defined.
The paper CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs tackles a structural weakness in modern AI systems: if you don’t know the future tasks, how do you design the rewards today?
The answer, in short: let the system write its own reward programs—hierarchically, incrementally, and in executable code.
Not prompt engineering. Not reward tuning.
Programmatic ambition.
Background — The Reward Function Trap
Traditional RL requires:
- A defined task.
- A hand-designed reward function.
- Iterative optimization.
This works beautifully—until you want open-ended capability growth.
Recent approaches attempt to automate reward refinement using foundation models (FMs). But most still assume a fixed task structure. They improve scoring rules; they do not expand the universe of goals.
The constraint can be summarized as:
$$ \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ R_{\text{fixed}} \right] $$
If $R$ is fixed, intelligence plateaus at the boundary of your imagination.
CODE-SHARP reframes the problem.
Instead of optimizing policy under a static reward, it continuously discovers new reward programs—structured hierarchically and stored in a skill archive.
This is not reward tuning.
It is reward evolution.
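Schematically, using illustrative notation rather than the paper's formalism: the policy still optimizes whichever reward program it is conditioned on, but the set of reward programs itself keeps growing.

$$ \pi_t = \arg\max_{\pi} \, \mathbb{E}_{\pi}[\,r\,], \quad r \in \mathcal{R}_t, \qquad \mathcal{R}_{t+1} = \mathcal{R}_t \cup \Delta\mathcal{R}_t $$

Here $\Delta\mathcal{R}_t$ stands for the batch of reward programs the foundation model proposes at step $t$. The optimization target is no longer a fixed point; it is a moving, expanding set.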
What CODE-SHARP Actually Does
At its core, CODE-SHARP builds a hierarchical directed graph of executable reward programs, called SHARP skills.
Each node represents:
- A skill
- Its reward function written in code
- Dependencies on lower-level skills
The system uses a Foundation Model to:
- Propose new reward programs
- Evaluate their novelty and utility
- Insert them into a growing skill graph
The result is a continuously expanding capability archive.
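To make the node structure concrete, here is a minimal sketch of what one entry in such a skill graph might look like in Python. The class name `SharpSkill`, its fields, and the `compile` helper are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative sketch only; names and fields are assumptions, not the paper's API.
@dataclass
class SharpSkill:
    name: str                                  # human-readable skill identifier
    reward_code: str                           # FM-generated reward program, stored as source code
    dependencies: List[str] = field(default_factory=list)  # lower-level skills this one builds on

    def compile(self) -> Callable[[dict], float]:
        """Turn the stored reward program into an executable reward function.

        The program is expected to define `reward(obs) -> float`.
        A real system would run this in a sandbox, not a bare exec();
        see the governance section below.
        """
        namespace: Dict[str, object] = {}
        exec(self.reward_code, namespace)
        return namespace["reward"]

# Example: an atomic "collect wood" skill with an FM-written reward program.
collect_wood = SharpSkill(
    name="collect_wood",
    reward_code="def reward(obs):\n    return 1.0 if obs.get('wood', 0) > 0 else 0.0",
)
reward_fn = collect_wood.compile()
print(reward_fn({"wood": 3}))  # 1.0
```

The point is that the reward is data the system can generate, store, and re-execute, rather than a function a human wired in at design time.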
Structural Architecture
| Layer | Function | Role |
|---|---|---|
| Foundation Model | Generates reward code | Creative expansion |
| Skill Archive (Graph) | Stores hierarchical reward programs | Memory + structure |
| Goal-Conditioned Agent | Trains exclusively on discovered rewards | Execution engine |
| High-Level FM Planner | Composes skills | Strategic coordination |
The crucial idea: the agent trains only on rewards discovered by the system itself.
No handcrafted long-horizon objectives.
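Read as a sketch, the loop across these layers might look like the following. Every class and method name here (`StubFM`, `SkillArchive`, `propose_reward_program`, and so on) is a placeholder standing in for components the paper describes, not its actual interface:

```python
# Hedged sketch of the discovery loop; all names are illustrative placeholders.

class StubFM:
    """Stands in for the foundation model that writes reward programs."""
    def __init__(self):
        self.ideas = iter(["collect_wood", "collect_stone", "craft_pickaxe"])
    def propose_reward_program(self, graph_summary):
        name = next(self.ideas, None)
        return None if name is None else {"name": name, "code": f"# reward program for {name}"}

class SkillArchive:
    """Stores discovered reward programs as a growing graph (flattened to a dict for brevity)."""
    def __init__(self):
        self.skills = {}
    def graph_summary(self):
        return list(self.skills)
    def is_novel(self, candidate):
        return candidate is not None and candidate["name"] not in self.skills
    def insert(self, candidate):
        self.skills[candidate["name"]] = candidate

class StubAgent:
    """Goal-conditioned agent: trains only on rewards that exist in the archive."""
    def train_on(self, skill):
        print(f"training on discovered reward: {skill['name']}")

fm, archive, agent = StubFM(), SkillArchive(), StubAgent()
for _ in range(5):                                   # the discovery loop
    candidate = fm.propose_reward_program(archive.graph_summary())
    if archive.is_novel(candidate):                  # novelty/utility check before insertion
        archive.insert(candidate)
        agent.train_on(candidate)
print(archive.graph_summary())                       # ['collect_wood', 'collect_stone', 'craft_pickaxe']
```

The high-level FM planner would then compose entries of the archive into multi-step plans; it is omitted here to keep the sketch short.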
Hierarchical Reward Programs — Why It Works
Flat skill libraries fail at scale. Composition becomes brittle.
CODE-SHARP enforces structure:
- Low-level atomic behaviors
- Mid-level compositional skills
- High-level multi-stage objectives
Graph-based organization allows skill reuse and refinement.
Mathematically, if $S$ represents discovered skills and $G$ the skill graph:
$$ S_{t+1} = S_t \cup \text{FM\_generate}(G_t) $$
The space of achievable policies expands as the reward basis expands.
This transforms RL from optimization into curriculum self-generation.
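To make the hierarchy concrete, here is a toy composition under the same illustrative conventions as the earlier sketch: a mid-level reward program that reuses low-level rewards instead of re-specifying their logic. The specific skills and weights are invented for illustration:

```python
# Toy illustration of hierarchical reuse; skills and weights are invented, not from the paper.

low_level = {
    "collect_wood":  lambda obs: 1.0 if obs.get("wood", 0) > 0 else 0.0,
    "collect_stone": lambda obs: 1.0 if obs.get("stone", 0) > 0 else 0.0,
}

def make_craft_pickaxe_reward(subskills):
    """Mid-level reward defined in terms of low-level rewards, not raw observations."""
    def reward(obs):
        # Partial credit for satisfied prerequisites, full credit for the crafted item.
        prereq = sum(subskills[name](obs) for name in ("collect_wood", "collect_stone"))
        crafted = 1.0 if obs.get("pickaxe", 0) > 0 else 0.0
        return 0.25 * prereq + crafted
    return reward

craft_pickaxe = make_craft_pickaxe_reward(low_level)
print(craft_pickaxe({"wood": 2, "stone": 1, "pickaxe": 0}))  # 0.5: prerequisites met, not yet crafted
print(craft_pickaxe({"wood": 2, "stone": 1, "pickaxe": 1}))  # 1.5: crafted
```

Because the mid-level program calls the low-level ones, refining `collect_wood` automatically propagates to every skill that depends on it. That is the reuse the graph structure buys.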
Empirical Results — Craftax as a Stress Test
The system was evaluated in the Craftax environment, a long-horizon crafting world where completing a task requires satisfying chains of sequential dependencies.
Key findings:
| Metric | Baseline Pretrained Agents | Task-Specific Experts | CODE-SHARP + Planner |
|---|---|---|---|
| Long-Horizon Success Rate | Moderate | High (task-bound) | Highest |
| Generalization Across Goals | Low | Low | High |
| Average Performance Gain | — | — | +134% |
Two insights matter most:
- A single goal-conditioned agent trained only on discovered rewards can solve increasingly long-horizon tasks.
- When composed by a high-level planner, discovered skills outperform expert policies by over 134% on average.
That is not marginal.
That is structural leverage.
What This Means for Business AI
Open-ended capability growth is not just an academic ambition. It has direct operational implications.
1. Automation at the Frontier
Most enterprise automation fails because workflows evolve.
Hardcoded reward systems = brittle bots.
Hierarchical reward programs suggest a pathway toward systems that:
- Expand internal capability graphs
- Discover new subroutines
- Adapt without full retraining
2. Reduced Human Reward Engineering
Designing robust incentive systems is expensive.
If foundation models can propose reward structures in code, the bottleneck shifts from designing objectives to governing expansion.
Which raises governance questions.
Governance & Control — The Quiet Risk
An agent that writes its own reward programs is, effectively, writing its own ambitions.
The paper focuses on performance. Businesses must focus on containment.
Key oversight questions:
| Risk Dimension | Governance Challenge |
|---|---|
| Reward Drift | Are new skills aligned with enterprise goals? |
| Skill Graph Complexity | Can we audit hierarchical dependencies? |
| Planner Autonomy | Who validates high-level compositions? |
| Execution Safety | Can discovered skills trigger unsafe actions? |
Open-ended growth is powerful. Unbounded growth is dangerous.
Enterprise adoption will require:
- Skill approval pipelines
- Reward sandboxing
- Auditable skill graphs
- Planner-level constraint systems
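As a thought experiment, the first two items could reduce to a gate in front of the archive's insert step. The check below is a deliberately crude sketch; the paper proposes no such mechanism, and a real sandbox would be far stricter:

```python
# Illustrative governance gate; a sketch of "skill approval + reward sandboxing", not the paper's design.

BANNED_TOKENS = ("import", "open(", "exec(", "eval(", "__")   # crude static screen, illustrative only

def approve_skill(candidate_code, reviewers):
    """Return True only if an FM-proposed reward program passes screening, sandboxing, and sign-off."""
    # 1. Static screen: reject programs that reach beyond the observation dict.
    if any(token in candidate_code for token in BANNED_TOKENS):
        return False
    # 2. Sandbox check: the program must define reward(obs) and run without raising.
    try:
        namespace = {}
        exec(candidate_code, {"__builtins__": {}}, namespace)  # cheap stand-in for a real sandbox
        namespace["reward"]({"wood": 0})
    except Exception:
        return False
    # 3. Explicit sign-off: no reviewers means no approval, keeping the audit trail mandatory.
    return bool(reviewers) and all(review(candidate_code) for review in reviewers)

def human_reviewer(code):
    """Stand-in for a human or policy-engine reviewer."""
    return "def reward(" in code

print(approve_skill("def reward(obs):\n    return 1.0 if obs.get('wood', 0) > 0 else 0.0",
                    [human_reviewer]))  # True
```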
Broader Implications — Beyond Crafting Worlds
The conceptual leap is larger than the experiment.
CODE-SHARP suggests that:
- Foundation models can generate executable incentives
- Skill graphs can become internal knowledge economies
- RL agents can scale capability without fixed task sets
This architecture resembles:
- Organizational learning systems
- API-based microservice ecosystems
- Modular capability marketplaces
The frontier shifts from model training to capability ecosystem design.
And that is a different business entirely.
Conclusion — From Training to Evolution
Most AI systems today are optimized.
CODE-SHARP proposes systems that evolve.
The shift from fixed rewards to hierarchical reward programs may prove as important as the shift from small models to foundation models.
Because once an agent can program its own incentives, it stops waiting for instructions.
It starts building its own ladder.
The real question is not whether it can climb.
It’s whether we designed the walls wisely.
Cognaptus: Automate the Present, Incubate the Future.