Opening — Why this matters now

AI agents are graduating from chat windows into worlds.

Robots assemble parts. Digital assistants browse the web. Game agents mine diamonds in Minecraft with suspiciously human determination. Yet as soon as these agents face long-horizon tasks—problems that require dozens or hundreds of coordinated actions—they tend to collapse under their own memory of mistakes.

Not because they cannot reason.

But because they cannot learn from experience in a structured way.

A recent research paper introduces Steve‑Evolving, a framework designed to remove precisely this bottleneck. Instead of relying on model retraining or massive trajectory logs, the system continuously converts experience into structured knowledge—skills and constraints—that directly guide future planning.

In short: the agent learns the way professionals do.

Not by remembering every event, but by extracting the rules that matter.


Background — The limits of memory‑based agents

Many modern LLM agents already operate in complex environments. Systems such as Voyager, Jarvis‑1, and Optimus‑1 attempt to handle long‑horizon tasks by storing trajectories or building knowledge graphs.

But these approaches share a subtle limitation.

They treat experience as data to retrieve, not knowledge to evolve.

| Approach | Memory Strategy | Limitation |
|---|---|---|
| Trajectory storage | Store successful task histories | Hard to generalize patterns |
| Reflection agents | Use language summaries of failures | Weak attribution of root causes |
| Skill libraries | Encode successful behaviors | Failures often discarded |

This becomes particularly problematic in embodied environments—physical or simulated worlds where errors arise from navigation, spatial interaction, resource constraints, or environmental hazards.

In such environments, simply telling an agent “that failed” is rarely useful.

What matters is why.


Analysis — The Steve‑Evolving framework

The proposed architecture introduces a self‑evolution pipeline built around three stages:

  1. Experience Anchoring
  2. Experience Distillation
  3. Knowledge‑Driven Closed‑Loop Control

Together they form a loop where interaction continuously generates knowledge.

1. Experience Anchoring

Every action taken by the agent is recorded as a structured event rather than a raw trajectory.

Each experience is stored as:

$$ e_t = \langle s_{pre}^{(t)}, a_t, D(s_t,a_t), s_{post}^{(t)} \rangle $$

where the diagnostic function $D(s_t, a_t)$ returns detailed signals about what happened during execution.

These include:

  • state differences
  • failure categories
  • continuous indicators
  • loop detection signals

The result is a high‑information execution record, far richer than simple success or failure.

To manage scale, the system organizes experience into a three‑tier structure:

| Layer | Role |
|---|---|
| Document layer | raw interaction records |
| Index layer | searchable spatial and semantic metadata |
| Summary layer | compressed historical patterns |

This architecture enables the agent to recall past events efficiently while preserving detailed diagnostics.
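A toy version of the three-tier store might look like the sketch below. The layer names follow the article; the internal structure (tag-based indexing, count-only summaries) is an assumption for illustration:

```python
from collections import defaultdict

class TieredMemory:
    """Toy three-tier experience store: documents, index, summaries."""
    def __init__(self):
        self.documents = []             # document layer: raw interaction records
        self.index = defaultdict(list)  # index layer: tag -> record ids
        self.summaries = {}             # summary layer: compressed patterns

    def add(self, record: dict, tags: list[str]) -> None:
        rid = len(self.documents)
        self.documents.append(record)
        for t in tags:
            self.index[t].append(rid)

    def recall(self, tag: str) -> list[dict]:
        """Fetch full records through the index layer, not by scanning."""
        return [self.documents[i] for i in self.index[tag]]

    def summarize(self, tag: str) -> None:
        """Compress matching history into a small summary entry."""
        self.summaries[tag] = {"count": len(self.recall(tag))}

mem = TieredMemory()
mem.add({"action": "mine_stone", "ok": True}, tags=["stone", "success"])
mem.add({"action": "mine_stone", "ok": False}, tags=["stone", "failure"])
mem.summarize("stone")
```

The point of the split is that retrieval touches the cheap index and summary layers first, and only dereferences full documents when detailed diagnostics are needed.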


2. Dual‑Track Experience Distillation

The key innovation lies in how experience is transformed into reusable knowledge.

Instead of treating success and failure symmetrically, the framework extracts two different types of knowledge.

Positive track: Skill distillation

Successful action sequences are generalized into reusable skills.

Each skill contains:

| Component | Meaning |
|---|---|
| Preconditions | environmental requirements |
| Action flow | stable step sequence |
| Verification rule | how success is confirmed |
| Effects | resulting state change |

This converts episodic experience into procedural expertise.
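The four components above can be sketched as a small data structure. The `Skill` class and the crafting example are hypothetical, but they show how preconditions make a distilled skill checkable rather than just descriptive:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative reusable skill distilled from successful trajectories."""
    name: str
    preconditions: list[str]  # environmental requirements
    action_flow: list[str]    # stable step sequence
    verification: str         # how success is confirmed
    effects: list[str]        # resulting state change

    def applicable(self, state: set[str]) -> bool:
        """A skill may be invoked only when all preconditions hold."""
        return all(p in state for p in self.preconditions)

# Hypothetical skill distilled from successful crafting episodes.
craft_pickaxe = Skill(
    name="craft_wooden_pickaxe",
    preconditions=["has_planks", "has_sticks", "near_crafting_table"],
    action_flow=["open_crafting_table", "place_recipe", "collect_item"],
    verification="inventory contains wooden_pickaxe",
    effects=["has_wooden_pickaxe"],
)
```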

Negative track: Guardrail extraction

Failures are not discarded.

Instead, repeated failures produce guardrails—explicit rules that prevent dangerous actions.

Example guardrail:

| Trigger | Forbidden Action | Reason |
|---|---|---|
| Near lava pool | Mining movement loops | Navigation hazard |

These guardrails act as negative prompts injected into the planner.
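One plausible way to realize "guardrails as negative prompts" is to render each rule into a constraint line that gets prepended to the planner's context. The class and rendering format below are assumptions, not the paper's exact mechanism:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    """Illustrative constraint distilled from repeated failures."""
    trigger: str           # context in which the rule fires
    forbidden_action: str  # what the planner must not propose
    reason: str            # root cause attributed during diagnosis

    def to_prompt(self) -> str:
        """Render the rule as one negative-prompt line for the planner."""
        return f"When {self.trigger}, never {self.forbidden_action} ({self.reason})."

# Hypothetical guardrail matching the example above.
rails = [Guardrail("near a lava pool", "loop mining movements", "navigation hazard")]
constraint_block = "\n".join(r.to_prompt() for r in rails)
```

Because the rules are plain text, they slot into any LLM planner's context window without touching model weights.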

The result is a planning system that simultaneously knows:

  • what usually works
  • what definitely fails

Humans would call this “experience.” AI systems call it “structured constraints.” Same concept.


3. Knowledge‑Driven Closed‑Loop Planning

Once distilled, knowledge feeds back into the planning process.

At planning time, the system retrieves relevant skills and guardrails using a hybrid similarity score:

$$ \psi(c_t, e) = \alpha \cdot \cos\big(E(c_t), E(e)\big) + \beta \cdot \delta\big(h_{cond}(c_t), e\big) $$

This blends semantic similarity with structural condition matching.
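A minimal sketch of this scoring function is below. The cosine term follows the formula directly; treating $\delta$ as a Jaccard overlap over symbolic conditions is my assumption, since the paper only specifies a structural match term:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(ctx_emb: list[float], entry_emb: list[float],
                 ctx_conds: set[str], entry_conds: set[str],
                 alpha: float = 0.7, beta: float = 0.3) -> float:
    """psi = alpha * cos(E(c_t), E(e)) + beta * delta(h_cond(c_t), e).
    delta is modeled here as Jaccard overlap of symbolic conditions
    (an illustrative choice); alpha and beta are assumed weights."""
    union = ctx_conds | entry_conds
    delta = len(ctx_conds & entry_conds) / len(union) if union else 0.0
    return alpha * cosine(ctx_emb, entry_emb) + beta * delta

# Identical embedding and identical conditions score the maximum, 1.0.
score = hybrid_score([1.0, 0.0], [1.0, 0.0], {"near_lava"}, {"near_lava"})
```

The structural term matters because two experiences can be semantically similar ("mining near heat") yet apply under different preconditions; blending both keeps retrieval precise.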

The retrieved knowledge is then injected into the LLM planner’s context window.

Two mechanisms guide the resulting plan:

| Knowledge type | Effect on planning |
|---|---|
| Skills | Demonstrate successful workflows |
| Guardrails | Prevent risky or unproductive actions |

If repeated failures occur during execution, the system triggers local replanning.

The agent halts, diagnoses the failure, generates new constraints, and resamples a plan.

The loop continues indefinitely.

Experience → Knowledge → Better Planning → New Experience.
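The whole loop can be caricatured in a few lines. Everything below (the toy planner, the hard-coded hazard, the `Memory` class) is an illustrative stand-in for the real pipeline, but it shows the control flow: a failure is anchored, distilled into a guardrail, and the very next plan avoids it:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    step: str
    failed: bool

class Memory:
    """Toy stand-in for the experience store."""
    def __init__(self):
        self.events, self.guardrails = [], []
    def anchor(self, r: StepResult) -> None:
        self.events.append(r)                       # record structured event
    def distill(self, r: StepResult) -> None:
        self.guardrails.append(f"avoid {r.step}")   # failure -> guardrail
    def retrieve(self) -> list[str]:
        return list(self.guardrails)

def plan(guardrails: list[str]) -> list[str]:
    """Toy planner: propose steps, dropping any a guardrail forbids."""
    steps = ["approach_lava", "mine_iron", "smelt_iron"]
    return [s for s in steps if f"avoid {s}" not in guardrails]

def closed_loop(memory: Memory, hazard: str = "approach_lava") -> list[str]:
    """One pass of Experience -> Knowledge -> Better Planning -> New Experience."""
    for step in plan(memory.retrieve()):
        result = StepResult(step, failed=(step == hazard))
        memory.anchor(result)
        if result.failed:
            memory.distill(result)
            return closed_loop(memory, hazard)  # local replanning with new knowledge
    return [e.step for e in memory.events if not e.failed]

m = Memory()
done = closed_loop(m)
```

After the first failure near lava, the distilled guardrail removes `approach_lava` from every subsequent plan, so `done` contains only the productive steps.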


Findings — Performance in long‑horizon tasks

The framework was evaluated in the Minecraft MCU technology tree benchmark, which includes 70 tasks across seven progression stages.

| Task Group | Description |
|---|---|
| Wooden | basic resource gathering |
| Stone | early tools |
| Iron | advanced crafting |
| Golden | specialized items |
| Redstone | circuitry |
| Diamond | high‑tier equipment |
| Armor | defensive equipment |

Results show consistent improvements over existing agents.

Overall success rate comparison

| Model Backbone | Jarvis‑1 | Optimus‑1 | Steve‑Evolving |
|---|---|---|---|
| Qwen‑3.5‑flash | 41.75% | 45.83% | 50.09% |
| Qwen‑3.5‑plus | 42.59% | 47.42% | 52.52% |
| GLM‑4.7 | 40.79% | 45.23% | 48.43% |
| Gemini‑3‑flash | 42.04% | 46.73% | 52.04% |
| Gemini‑3‑pro | 42.67% | 47.63% | 53.37% |

The most striking improvements appear in later stages such as Iron, Diamond, and Armor.

These tasks require long dependency chains, making accumulated experience especially valuable.

Ablation studies further show that removing knowledge injection dramatically reduces performance—confirming that structured experience, not model size, drives the improvement.


Implications — Toward self‑evolving AI systems

The deeper message of this work is architectural rather than algorithmic.

The future of capable AI agents may depend less on bigger models and more on better experience management.

Three implications stand out for industry systems.

1. Memory should evolve into knowledge

Most agent frameworks treat memory as searchable logs.

Steve‑Evolving suggests that memory must transform itself—progressively abstracting raw events into skills and constraints.

2. Failure is a first‑class signal

Many systems discard failed trajectories.

Here, failure becomes a structured resource for generating safety guardrails.

This design aligns naturally with AI safety and reliability requirements.

3. Continuous improvement without retraining

Because knowledge is injected through prompts rather than parameters, the system evolves without expensive model updates.

That makes the approach attractive for production agents operating in dynamic environments.

The architecture resembles a professional training loop:

| Stage | Human analogy |
|---|---|
| Experience recording | field notes |
| Skill extraction | best practices |
| Guardrail creation | safety regulations |
| Replanning | adaptive problem solving |

Machines, it seems, are finally discovering the corporate handbook.


Conclusion — Experience is the new training data

Large language models already reason well.

What they lack is institutional memory.

Steve‑Evolving demonstrates that when interaction experience is organized, distilled, and reinjected into planning, agents can steadily improve over time—even without retraining their underlying models.

The result is not just a smarter agent.

It is an agent that grows wiser with experience.

And that may be the missing ingredient for truly autonomous systems.

Cognaptus: Automate the Present, Incubate the Future.