Opening — Why this matters now

AI agents are graduating from chat windows into worlds.

Robots assemble parts. Digital assistants browse the web. Game agents mine diamonds in Minecraft with suspiciously human determination. Yet as soon as these agents face long-horizon tasks—problems that require dozens or hundreds of coordinated actions—they tend to collapse under their own memory of mistakes.

Not because they cannot reason.

But because they cannot learn from experience in a structured way.

A recent research paper introduces Steve‑Evolving, a framework designed to remove precisely this bottleneck. Instead of relying on model retraining or massive trajectory logs, the system continuously converts experience into structured knowledge—skills and constraints—that directly guide future planning.

In short: the agent learns the way professionals do.

Not by remembering every event, but by extracting the rules that matter.


Background — The limits of memory‑based agents

Many modern LLM agents already operate in complex environments. Systems such as Voyager, Jarvis‑1, and Optimus‑1 attempt to handle long‑horizon tasks by storing trajectories or building knowledge graphs.

But these approaches share a subtle limitation.

They treat experience as data to retrieve, not knowledge to evolve.

| Approach | Memory Strategy | Limitation |
|---|---|---|
| Trajectory storage | Store successful task histories | Hard to generalize patterns |
| Reflection agents | Use language summaries of failures | Weak attribution of root causes |
| Skill libraries | Encode successful behaviors | Failures often discarded |

This becomes particularly problematic in embodied environments—physical or simulated worlds where errors arise from navigation, spatial interaction, resource constraints, or environmental hazards.

In such environments, simply telling an agent “that failed” is rarely useful.

What matters is why.


Analysis — The Steve‑Evolving framework

The proposed architecture introduces a self‑evolution pipeline built around three stages:

  1. Experience Anchoring
  2. Experience Distillation
  3. Knowledge‑Driven Closed‑Loop Control

Together they form a loop where interaction continuously generates knowledge.

1. Experience Anchoring

Every action taken by the agent is recorded as a structured event rather than a raw trajectory.

Each experience is stored as:

$$ e_t = \langle s_{pre}^{(t)}, a_t, D(s_t,a_t), s_{post}^{(t)} \rangle $$

where the diagnostic function $D(s_t, a_t)$ returns detailed signals about what happened during execution.

These include:

  • state differences
  • failure categories
  • continuous indicators
  • loop detection signals

The result is a high‑information execution record, far richer than simple success or failure.

To manage scale, the system organizes experience into a three‑tier structure:

| Layer | Role |
|---|---|
| Document layer | raw interaction records |
| Index layer | searchable spatial and semantic metadata |
| Summary layer | compressed historical patterns |

This architecture enables the agent to recall past events efficiently while preserving detailed diagnostics.
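A toy version of the three-tier store might look like the sketch below. The layer names follow the article; the internal structure (tag-based indexing, count-only summaries) is an assumption for illustration:

```python
from collections import defaultdict

class TieredMemory:
    """Toy three-tier experience store: documents, index, summaries."""
    def __init__(self):
        self.documents = []             # document layer: raw interaction records
        self.index = defaultdict(list)  # index layer: tag -> record ids
        self.summaries = {}             # summary layer: compressed patterns

    def add(self, record: dict, tags: list[str]) -> None:
        rid = len(self.documents)
        self.documents.append(record)
        for t in tags:
            self.index[t].append(rid)

    def recall(self, tag: str) -> list[dict]:
        """Fetch full records through the index layer, not by scanning."""
        return [self.documents[i] for i in self.index[tag]]

    def summarize(self, tag: str) -> None:
        """Compress matching history into a small summary entry."""
        self.summaries[tag] = {"count": len(self.recall(tag))}

mem = TieredMemory()
mem.add({"action": "mine_stone", "ok": True}, tags=["stone", "success"])
mem.add({"action": "mine_stone", "ok": False}, tags=["stone", "failure"])
mem.summarize("stone")
```

The point of the split is that retrieval touches the cheap index and summary layers first, and only dereferences full documents when detailed diagnostics are needed.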


2. Dual‑Track Experience Distillation

The key innovation lies in how experience is transformed into reusable knowledge.

Instead of treating success and failure symmetrically, the framework extracts two different types of knowledge.

Positive track: Skill distillation

Successful action sequences are generalized into reusable skills.

Each skill contains:

| Component | Meaning |
|---|---|
| Preconditions | environmental requirements |
| Action flow | stable step sequence |
| Verification rule | how success is confirmed |
| Effects | resulting state change |

This converts episodic experience into procedural expertise.
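The four components above can be sketched as a small data structure. The `Skill` class and the crafting example are hypothetical, but they show how preconditions make a distilled skill checkable rather than just descriptive:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative reusable skill distilled from successful trajectories."""
    name: str
    preconditions: list[str]  # environmental requirements
    action_flow: list[str]    # stable step sequence
    verification: str         # how success is confirmed
    effects: list[str]        # resulting state change

    def applicable(self, state: set[str]) -> bool:
        """A skill may be invoked only when all preconditions hold."""
        return all(p in state for p in self.preconditions)

# Hypothetical skill distilled from successful crafting episodes.
craft_pickaxe = Skill(
    name="craft_wooden_pickaxe",
    preconditions=["has_planks", "has_sticks", "near_crafting_table"],
    action_flow=["open_crafting_table", "place_recipe", "collect_item"],
    verification="inventory contains wooden_pickaxe",
    effects=["has_wooden_pickaxe"],
)
```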

Negative track: Guardrail extraction

Failures are not discarded.

Instead, repeated failures produce guardrails—explicit rules that prevent dangerous actions.

Example guardrail:

| Trigger | Forbidden Action | Reason |
|---|---|---|
| Near lava pool | Mining movement loops | Navigation hazard |

These guardrails act as negative prompts injected into the planner.
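One plausible way to realize "guardrails as negative prompts" is to render each rule into a constraint line that gets prepended to the planner's context. The class and rendering format below are assumptions, not the paper's exact mechanism:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    """Illustrative constraint distilled from repeated failures."""
    trigger: str           # context in which the rule fires
    forbidden_action: str  # what the planner must not propose
    reason: str            # root cause attributed during diagnosis

    def to_prompt(self) -> str:
        """Render the rule as one negative-prompt line for the planner."""
        return f"When {self.trigger}, never {self.forbidden_action} ({self.reason})."

# Hypothetical guardrail matching the example above.
rails = [Guardrail("near a lava pool", "loop mining movements", "navigation hazard")]
constraint_block = "\n".join(r.to_prompt() for r in rails)
```

Because the rules are plain text, they slot into any LLM planner's context window without touching model weights.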

The result is a planning system that simultaneously knows:

  • what usually works
  • what definitely fails

Humans would call this “experience.” AI systems call it “structured constraints.” Same concept.


3. Knowledge‑Driven Closed‑Loop Planning

Once distilled, knowledge feeds back into the planning process.

At planning time, the system retrieves relevant skills and guardrails using a hybrid similarity score:

$$ \psi(c_t, e) = \alpha \cdot \cos\big(E(c_t), E(e)\big) + \beta \cdot \delta\big(h_{cond}(c_t), e\big) $$

This blends semantic similarity with structural condition matching.
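A minimal sketch of this scoring function is below. The cosine term follows the formula directly; treating $\delta$ as a Jaccard overlap over symbolic conditions is my assumption, since the paper only specifies a structural match term:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(ctx_emb: list[float], entry_emb: list[float],
                 ctx_conds: set[str], entry_conds: set[str],
                 alpha: float = 0.7, beta: float = 0.3) -> float:
    """psi = alpha * cos(E(c_t), E(e)) + beta * delta(h_cond(c_t), e).
    delta is modeled here as Jaccard overlap of symbolic conditions
    (an illustrative choice); alpha and beta are assumed weights."""
    union = ctx_conds | entry_conds
    delta = len(ctx_conds & entry_conds) / len(union) if union else 0.0
    return alpha * cosine(ctx_emb, entry_emb) + beta * delta

# Identical embedding and identical conditions score the maximum, 1.0.
score = hybrid_score([1.0, 0.0], [1.0, 0.0], {"near_lava"}, {"near_lava"})
```

The structural term matters because two experiences can be semantically similar ("mining near heat") yet apply under different preconditions; blending both keeps retrieval precise.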

The retrieved knowledge is then injected into the LLM planner’s context window.

Two mechanisms guide the resulting plan:

| Knowledge type | Effect on planning |
|---|---|
| Skills | Demonstrate successful workflows |
| Guardrails | Prevent risky or unproductive actions |

If repeated failures occur during execution, the system triggers local replanning.

The agent halts, diagnoses the failure, generates new constraints, and resamples a plan.

The loop continues indefinitely.

Experience → Knowledge → Better Planning → New Experience.
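The whole loop can be caricatured in a few lines. Everything below (the toy planner, the hard-coded hazard, the `Memory` class) is an illustrative stand-in for the real pipeline, but it shows the control flow: a failure is anchored, distilled into a guardrail, and the very next plan avoids it:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    step: str
    failed: bool

class Memory:
    """Toy stand-in for the experience store."""
    def __init__(self):
        self.events, self.guardrails = [], []
    def anchor(self, r: StepResult) -> None:
        self.events.append(r)                       # record structured event
    def distill(self, r: StepResult) -> None:
        self.guardrails.append(f"avoid {r.step}")   # failure -> guardrail
    def retrieve(self) -> list[str]:
        return list(self.guardrails)

def plan(guardrails: list[str]) -> list[str]:
    """Toy planner: propose steps, dropping any a guardrail forbids."""
    steps = ["approach_lava", "mine_iron", "smelt_iron"]
    return [s for s in steps if f"avoid {s}" not in guardrails]

def closed_loop(memory: Memory, hazard: str = "approach_lava") -> list[str]:
    """One pass of Experience -> Knowledge -> Better Planning -> New Experience."""
    for step in plan(memory.retrieve()):
        result = StepResult(step, failed=(step == hazard))
        memory.anchor(result)
        if result.failed:
            memory.distill(result)
            return closed_loop(memory, hazard)  # local replanning with new knowledge
    return [e.step for e in memory.events if not e.failed]

m = Memory()
done = closed_loop(m)
```

After the first failure near lava, the distilled guardrail removes `approach_lava` from every subsequent plan, so `done` contains only the productive steps.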


Findings — Performance in long‑horizon tasks

The framework was evaluated in the Minecraft MCU technology tree benchmark, which includes 70 tasks across seven progression stages.

| Task Group | Description |
|---|---|
| Wooden | basic resource gathering |
| Stone | early tools |
| Iron | advanced crafting |
| Golden | specialized items |
| Redstone | circuitry |
| Diamond | high‑tier equipment |
| Armor | defensive equipment |

Results show consistent improvements over existing agents.

Overall success rate comparison

| Model Backbone | Jarvis‑1 | Optimus‑1 | Steve‑Evolving |
|---|---|---|---|
| Qwen‑3.5‑flash | 41.75% | 45.83% | 50.09% |
| Qwen‑3.5‑plus | 42.59% | 47.42% | 52.52% |
| GLM‑4.7 | 40.79% | 45.23% | 48.43% |
| Gemini‑3‑flash | 42.04% | 46.73% | 52.04% |
| Gemini‑3‑pro | 42.67% | 47.63% | 53.37% |

The most striking improvements appear in later stages such as Iron, Diamond, and Armor.

These tasks require long dependency chains, making accumulated experience especially valuable.

Ablation studies further show that removing knowledge injection dramatically reduces performance—confirming that structured experience, not model size, drives the improvement.


Implications — Toward self‑evolving AI systems

The deeper message of this work is architectural rather than algorithmic.

The future of capable AI agents may depend less on bigger models and more on better experience management.

Three implications stand out for industry systems.

1. Memory should evolve into knowledge

Most agent frameworks treat memory as searchable logs.

Steve‑Evolving suggests that memory must transform itself—progressively abstracting raw events into skills and constraints.

2. Failure is a first‑class signal

Many systems discard failed trajectories.

Here, failure becomes a structured resource for generating safety guardrails.

This design aligns naturally with AI safety and reliability requirements.

3. Continuous improvement without retraining

Because knowledge is injected through prompts rather than parameters, the system evolves without expensive model updates.

That makes the approach attractive for production agents operating in dynamic environments.

The architecture resembles a professional training loop:

| Stage | Human analogy |
|---|---|
| Experience recording | field notes |
| Skill extraction | best practices |
| Guardrail creation | safety regulations |
| Replanning | adaptive problem solving |

Machines, it seems, are finally discovering the corporate handbook.


Conclusion — Experience is the new training data

Large language models already reason well.

What they lack is institutional memory.

Steve‑Evolving demonstrates that when interaction experience is organized, distilled, and reinjected into planning, agents can steadily improve over time—even without retraining their underlying models.

The result is not just a smarter agent.

It is an agent that grows wiser with experience.

And that may be the missing ingredient for truly autonomous systems.

Cognaptus: Automate the Present, Incubate the Future.