Opening — Why this matters now
AI agents are graduating from chat windows into worlds.
Robots assemble parts. Digital assistants browse the web. Game agents mine diamonds in Minecraft with suspiciously human determination. Yet as soon as these agents face long-horizon tasks—problems that require dozens or hundreds of coordinated actions—they tend to stumble, repeating mistakes they have already made.
Not because they cannot reason.
But because they cannot learn from experience in a structured way.
A recent research paper introduces Steve‑Evolving, a framework designed to address precisely this bottleneck. Instead of relying on model retraining or massive trajectory logs, the system continuously converts experience into structured knowledge—skills and constraints—that directly guides future planning.
In short: the agent learns the way professionals do.
Not by remembering every event, but by extracting the rules that matter.
Background — The limits of memory‑based agents
Many modern LLM agents already operate in complex environments. Systems such as Voyager, Jarvis‑1, and Optimus‑1 attempt to handle long‑horizon tasks by storing trajectories or building knowledge graphs.
But these approaches share a subtle limitation.
They treat experience as data to retrieve, not knowledge to evolve.
| Approach | Memory Strategy | Limitation |
|---|---|---|
| Trajectory storage | Store successful task histories | Hard to generalize patterns |
| Reflection agents | Use language summaries of failures | Weak attribution of root causes |
| Skill libraries | Encode successful behaviors | Failures often discarded |
This becomes particularly problematic in embodied environments—physical or simulated worlds where errors arise from navigation, spatial interaction, resource constraints, or environmental hazards.
In such environments, simply telling an agent “that failed” is rarely useful.
What matters is why.
Analysis — The Steve‑Evolving framework
The proposed architecture introduces a self‑evolution pipeline built around three stages:
- Experience Anchoring
- Experience Distillation
- Knowledge‑Driven Closed‑Loop Control
Together they form a loop where interaction continuously generates knowledge.
1. Experience Anchoring
Every action taken by the agent is recorded as a structured event rather than a raw trajectory.
Each experience is stored as:
$$ e_t = \langle s_{pre}^{(t)}, a_t, D(s_t,a_t), s_{post}^{(t)} \rangle $$
Here the diagnostic function $D(s_t, a_t)$ returns detailed signals about what happened during execution.
These include:
- state differences
- failure categories
- continuous indicators
- loop detection signals
The result is a high‑information execution record, far richer than simple success or failure.
To manage scale, the system organizes experience into a three‑tier structure:
| Layer | Role |
|---|---|
| Document layer | raw interaction records |
| Index layer | searchable spatial and semantic metadata |
| Summary layer | compressed historical patterns |
This architecture enables the agent to recall past events efficiently while preserving detailed diagnostics.
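The three-tier layout can be sketched as a simple store where cheap index lookups gate access to the heavier document layer. Class and method names here are assumptions, not the paper's API, and the summary layer is reduced to a placeholder list.

```python
class ExperienceStore:
    """Sketch of the document / index / summary tiers."""

    def __init__(self):
        self.documents = []   # document layer: raw interaction records
        self.index = {}       # index layer: tag -> list of record ids
        self.summaries = []   # summary layer: compressed patterns

    def add(self, record, tags):
        rid = len(self.documents)
        self.documents.append(record)
        for tag in tags:  # e.g. location, task name, outcome
            self.index.setdefault(tag, []).append(rid)
        return rid

    def recall(self, tag):
        # Cheap lookup through the index layer; full diagnostics stay
        # in the document layer until actually needed.
        return [self.documents[i] for i in self.index.get(tag, [])]

store = ExperienceStore()
store.add({"action": "mine_iron", "ok": True}, tags=["iron", "success"])
store.add({"action": "mine_iron", "ok": False}, tags=["iron", "failure"])
```

The design choice matters at scale: retrieval touches only metadata, so detailed diagnostics never bloat the search path.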
2. Dual‑Track Experience Distillation
The key innovation lies in how experience is transformed into reusable knowledge.
Instead of treating success and failure symmetrically, the framework extracts two different types of knowledge.
Positive track: Skill distillation
Successful action sequences are generalized into reusable skills.
Each skill contains:
| Component | Meaning |
|---|---|
| Preconditions | environmental requirements |
| Action flow | stable step sequence |
| Verification rule | how success is confirmed |
| Effects | resulting state change |
This converts episodic experience into procedural expertise.
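A distilled skill maps naturally onto a small record type with the four components from the table above. The concrete schema and the crafting example are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    preconditions: list[str]   # environmental requirements
    action_flow: list[str]     # stable step sequence
    verification: str          # how success is confirmed
    effects: dict[str, int]    # resulting state change

craft_pickaxe = Skill(
    preconditions=["planks >= 3", "sticks >= 2", "near crafting_table"],
    action_flow=["open_table", "place_recipe", "take_output"],
    verification="inventory contains wooden_pickaxe",
    effects={"wooden_pickaxe": +1, "planks": -3, "sticks": -2},
)

def applicable(skill, state_facts):
    # A skill is only offered to the planner when every precondition
    # holds; here preconditions are checked as literal facts.
    return all(p in state_facts for p in skill.preconditions)
```

The precondition/effect pairing is what makes the skill reusable: it tells the planner when the workflow applies and what it buys, not just what happened once.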
Negative track: Guardrail extraction
Failures are not discarded.
Instead, repeated failures produce guardrails—explicit rules that prevent dangerous actions.
Example guardrail:
| Trigger | Forbidden Action | Reason |
|---|---|---|
| Near lava pool | Mining movement loops | Navigation hazard |
These guardrails act as negative prompts injected into the planner.
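One plausible way to realize this injection, sketched with an assumed trigger/action vocabulary, is to render active guardrails as explicit negative constraints in the planner prompt:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    trigger: str     # condition under which the rule fires
    forbidden: str   # action pattern the planner must avoid
    reason: str      # diagnostic attribution from past failures

rails = [
    Guardrail("near_lava", "mine_downward", "fell into lava twice"),
]

def guardrail_prompt(state_tags):
    # Only guardrails whose trigger matches the current state are
    # rendered, keeping the planner context small and relevant.
    active = [g for g in rails if g.trigger in state_tags]
    return "\n".join(f"NEVER {g.forbidden}: {g.reason}" for g in active)
```

Because rules are filtered by trigger, the context window carries only the constraints that matter in the current situation.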
The result is a planning system that simultaneously knows:
- what usually works
- what definitely fails
Humans would call this “experience.” AI systems call it “structured constraints.” Same concept.
3. Knowledge‑Driven Closed‑Loop Planning
Once distilled, knowledge feeds back into the planning process.
At planning time, the system retrieves relevant skills and guardrails using a hybrid similarity score:
$$ \psi(c_t, e) = \alpha \cdot \cos\big(E(c_t), E(e)\big) + \beta \cdot \delta\big(h_{cond}(c_t), e\big) $$
This blends semantic similarity with structural condition matching.
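Numerically, the score is straightforward. The sketch below stands in real embeddings with plain vectors and chooses a simple condition-overlap ratio for $\delta$; the weights and the matching function are assumptions, since the paper leaves them parameterized.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def psi(ctx_emb, exp_emb, ctx_conds, exp_conds, alpha=0.7, beta=0.3):
    semantic = cosine(ctx_emb, exp_emb)  # cos(E(c_t), E(e))
    # delta: fraction of the context's structural conditions that the
    # stored experience satisfies (one simple choice of matching rule).
    structural = len(ctx_conds & exp_conds) / max(len(ctx_conds), 1)
    return alpha * semantic + beta * structural

score = psi([1.0, 0.0], [1.0, 0.0],
            {"has_pickaxe"}, {"has_pickaxe", "daytime"})
```

With identical embeddings and full condition overlap, the score is $0.7 \cdot 1 + 0.3 \cdot 1 = 1.0$; either mismatched semantics or unmet conditions pulls it down.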
The retrieved knowledge is then injected into the LLM planner’s context window.
Two mechanisms guide the resulting plan:
| Knowledge type | Effect on planning |
|---|---|
| Skills | Demonstrate successful workflows |
| Guardrails | Prevent risky or unproductive actions |
If repeated failures occur during execution, the system triggers local replanning.
The agent halts, diagnoses the failure, generates new constraints, and resamples a plan.
The loop continues indefinitely.
Experience → Knowledge → Better Planning → New Experience.
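The loop above can be compressed into a toy control skeleton. Every component here is a stub standing in for the real planner, executor, and distiller, so this shows only the control flow, not the system itself:

```python
def run_episode(plan, execute, distill_guardrail, max_retries=3):
    # Plan, execute, and on failure distill a new constraint and replan.
    guardrails = []
    for _ in range(max_retries):
        steps = plan(guardrails)           # knowledge-conditioned planning
        ok, failure = execute(steps)
        if ok:
            return steps, guardrails
        guardrails.append(distill_guardrail(failure))  # failure -> rule
    return None, guardrails

# Stub environment: the first plan fails near lava; once the guardrail
# exists, the planner picks a safe alternative.
plan = lambda rails: ["mine"] if rails else ["mine_downward"]
execute = lambda steps: (True, None) if steps == ["mine"] else (False, "near_lava")

result, rails = run_episode(plan, execute, lambda f: f)
```

The key property is that the second planning call sees the constraint distilled from the first failure: knowledge accumulates within the loop rather than being rediscovered each episode.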
Findings — Performance in long‑horizon tasks
The framework was evaluated in the Minecraft MCU technology tree benchmark, which includes 70 tasks across seven progression stages.
| Task Group | Description |
|---|---|
| Wooden | basic resource gathering |
| Stone | early tools |
| Iron | advanced crafting |
| Golden | specialized items |
| Redstone | circuitry |
| Diamond | high‑tier equipment |
| Armor | defensive equipment |
Results show consistent improvements over existing agents.
Overall success rate comparison
| Model Backbone | Jarvis‑1 | Optimus‑1 | Steve‑Evolving |
|---|---|---|---|
| Qwen‑3.5‑flash | 41.75% | 45.83% | 50.09% |
| Qwen‑3.5‑plus | 42.59% | 47.42% | 52.52% |
| GLM‑4.7 | 40.79% | 45.23% | 48.43% |
| Gemini‑3‑flash | 42.04% | 46.73% | 52.04% |
| Gemini‑3‑pro | 42.67% | 47.63% | 53.37% |
The most striking improvements appear in later stages such as Iron, Diamond, and Armor.
These tasks require long dependency chains, making accumulated experience especially valuable.
Ablation studies further show that removing knowledge injection dramatically reduces performance—confirming that structured experience, not model size, drives the improvement.
Implications — Toward self‑evolving AI systems
The deeper message of this work is architectural rather than algorithmic.
The future of capable AI agents may depend less on bigger models and more on better experience management.
Three implications stand out for industry systems.
1. Memory should evolve into knowledge
Most agent frameworks treat memory as searchable logs.
Steve‑Evolving suggests that memory must transform itself—progressively abstracting raw events into skills and constraints.
2. Failure is a first‑class signal
Many systems discard failed trajectories.
Here, failure becomes a structured resource for generating safety guardrails.
This design aligns naturally with AI safety and reliability requirements.
3. Continuous improvement without retraining
Because knowledge is injected through prompts rather than parameters, the system evolves without expensive model updates.
That makes the approach attractive for production agents operating in dynamic environments.
The architecture resembles a professional training loop:
| Stage | Human analogy |
|---|---|
| Experience recording | field notes |
| Skill extraction | best practices |
| Guardrail creation | safety regulations |
| Replanning | adaptive problem solving |
Machines, it seems, are finally discovering the corporate handbook.
Conclusion — Experience is the new training data
Large language models already reason well.
What they lack is institutional memory.
Steve‑Evolving demonstrates that when interaction experience is organized, distilled, and reinjected into planning, agents can steadily improve over time—even without retraining their underlying models.
The result is not just a smarter agent.
It is an agent that grows wiser with experience.
And that may be the missing ingredient for truly autonomous systems.
Cognaptus: Automate the Present, Incubate the Future.