Learning From the Punches: How AI Agents Turn Mistakes into Skills

Mistakes are cheap until an agent repeats them.

A human worker who keeps failing at the same task usually leaves traces: a blocked aisle, a missing tool, a wrong form field, an error message, a process exception. A competent manager does not simply tell the worker to “try again with more confidence.” The useful move is more boring and more valuable: identify the pattern, write the repair rule, and make sure the next attempt starts from the point of failure rather than from the beginning.

AI agents are still learning this extremely unglamorous lesson.

The paper behind today’s article, MineEvolve: Self-Evolution with Accumulated Knowledge for Long-Horizon Embodied Minecraft Agents, studies this problem in Minecraft, where an agent must collect resources, craft tools, navigate terrain, handle GUI states, and recover from missing prerequisites across long dependency chains.¹ That sounds like a game. It is also a clean laboratory for a business problem: how do we build agents that improve from their own operational history without constantly retraining the model or stuffing the prompt with a landfill of past logs?

The paper’s answer is MineEvolve, a four-part framework that turns execution traces into structured behavioral knowledge. The important word is structured. The system does not merely remember that something failed. It records what changed, what did not change, what failure type appeared, whether progress stalled, and whether the next plan should reuse a skill or insert a remedy.

That is the useful part. Not the Minecraft. Not the diamond pickaxe. Certainly not the heroic fantasy that “memory” automatically becomes wisdom. Memory, left unattended, is just clutter with timestamps.

Self-evolution here does not mean fine-tuning the model

A common misunderstanding is that an agent “evolves” only when its neural weights change. MineEvolve uses a different path. The language model planner is not retrained. Capability improves through an external behavioral knowledge system that is continually updated from interaction.

The loop is simple enough to state, but difficult to implement well:

subgoal execution
      ↓
Monitor: typed feedback from state, inventory, failures, progress, stagnation
      ↓
Inducer: skills from success, remedies from failure or stagnation
      ↓
Curator: validate, merge, filter, retrieve under a prompt budget
      ↓
Adaptor: repair only the unfinished part of the plan

This mechanism-first view matters because the paper is not merely saying “add memory and scores go up.” We have heard that song before; it has many remixes and most of them end in a bloated context window.

MineEvolve makes a sharper claim: experience becomes useful when it is converted into typed, verifiable, executable knowledge. The distinction is small in wording and large in engineering.

Module	What it does in MineEvolve	What it prevents	Business analogue
Monitor	Converts each subgoal execution into typed feedback	Treating success/failure as a one-bit signal	Instrumentation and exception logging
Inducer	Converts success into skills and failure/stagnation into remedies	Keeping raw traces without operational meaning	Standard operating procedures and repair playbooks
Curator	Validates, merges, filters, and retrieves knowledge	Polluting memory with vague or conflicting advice	Knowledge-base governance
Adaptor	Repairs the remaining plan after repeated failure	Restarting the whole workflow or blindly retrying	Local workflow recovery

The paper’s practical contribution lives in this pipeline. Each module removes one way an agent can waste experience.

Monitor: the agent needs telemetry, not vibes

MineEvolve begins with a Monitor that compresses each subgoal execution into typed feedback. The paper includes fields such as subgoal success, state change, inventory change, failure type, execution progress, and stagnation. In Minecraft terms, this means the system can see whether the agent actually gained logs, changed position, opened the correct GUI, crafted the target item, or simply wandered near an obstacle while accomplishing nothing useful.

That last case is more important than it sounds. Many failures in long-horizon agents are not dramatic. They are quiet non-events. The agent moves, calls tools, waits, retries, and produces the appearance of work. In a dashboard, it looks busy. In reality, no state variable that matters has changed.

MineEvolve explicitly treats low progress and stagnation as signals. If the agent keeps moving near a target block without inventory gain, the system can mark the behavior as a navigation failure rather than merely recording another unsuccessful attempt. If the agent fails to craft an item because it lacks a crafting table, the missing prerequisite can become a future repair condition.
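A minimal sketch of what such a Monitor record might look like, in Python. The field names, the `summarize` helper, and the three-step stall threshold are illustrative assumptions rather than the paper's schema; the paper only lists the kinds of fields involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypedFeedback:
    """One subgoal execution compressed into typed fields (names are illustrative)."""
    subgoal: str
    success: bool
    state_changed: bool          # position moved or inventory differs
    inventory_delta: dict        # item name -> net change in count
    failure_type: Optional[str]  # None when the subgoal succeeded
    progress: float              # fraction of the target reached, 0.0 to 1.0
    stagnant: bool


def summarize(subgoal, target_item, target_count, inv_before, inv_after,
              moved, steps_without_gain, stall_threshold=3,
              missing_prerequisite=None):
    """Build typed feedback from before/after snapshots of one execution."""
    gained = inv_after.get(target_item, 0) - inv_before.get(target_item, 0)
    delta = {k: inv_after.get(k, 0) - inv_before.get(k, 0)
             for k in set(inv_before) | set(inv_after)
             if inv_after.get(k, 0) != inv_before.get(k, 0)}
    success = gained >= target_count
    progress = max(0.0, min(1.0, gained / target_count))
    stagnant = (not success) and steps_without_gain >= stall_threshold

    failure_type = None
    if not success:
        if missing_prerequisite is not None:
            failure_type = f"missing_prerequisite:{missing_prerequisite}"
        elif stagnant and moved:
            # Movement without inventory gain: busy, but nothing that matters changed.
            failure_type = "navigation_failure"
        else:
            failure_type = "unknown"

    return TypedFeedback(subgoal, success, moved or bool(delta), delta,
                         failure_type, progress, stagnant)


fb = summarize("collect 3 logs", "log", 3, {"log": 0}, {"log": 0},
               moved=True, steps_without_gain=5)
print(fb.failure_type, fb.stagnant)
```

Under these assumptions, five steps of pacing near a tree with no logs gained comes back as a `navigation_failure` with `stagnant=True`, not as one more anonymous failed attempt.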

For business-process agents, this is the first uncomfortable lesson: prompts alone are not enough. A production agent needs domain telemetry. A customer-service agent needs ticket state transitions, escalation outcomes, policy-rule hits, user confirmation signals, and unresolved-loop detectors. A procurement agent needs vendor-status changes, missing-document flags, approval bottlenecks, and payment exceptions. A software agent needs test results, stack traces, dependency changes, and file-level diffs.

Without that telemetry, the agent can still produce reflections. They will just be the usual corporate poetry: “be more careful next time.” Very inspiring. Not executable.

Inducer: success becomes skills, failure becomes remedies

Once typed feedback exists, MineEvolve separates experience into two kinds of knowledge.

Successful executions become skills. A skill records the trigger context, preconditions, steps, verification rule, observed effects, confidence, and supporting feedback. This is procedural knowledge: when the context matches and the preconditions hold, reuse the behavior.

Failed or stagnant executions become remedies. A remedy records the trigger context, failure type, risk pattern, repair action, scope, confidence, and supporting feedback. This is recovery knowledge: when the same kind of failure appears, change the unfinished plan instead of repeating the broken subgoal.

That split is one of the paper’s strongest design choices. Many agent-memory systems overvalue successful trajectories because they are pleasant to store and easy to narrate. MineEvolve treats failure as equally productive, provided the failure is converted into a repairable pattern. In difficult tasks, that matters more than another polished success story.

The paper’s examples make the distinction concrete. A successful sequence for collecting logs, crafting planks, and crafting sticks can become a reusable skill. A repeated navigation stall can become a remedy: clear the blocking block or select another route before retrying. A crafting failure caused by the absence of a crafting table can become a remedy: obtain or place the crafting table before attempting the recipe again.
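The two record types can be sketched as plain data classes. The fields follow the lists above; everything else is an assumption made for illustration: the `induce` function, the `REPAIRS` lookup table that stands in for the actual induction step, the starting confidence of 0.5, and the wording of the scope.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Skill:
    trigger: str
    preconditions: list
    steps: list
    verification: str
    effects: dict
    confidence: float
    support: list = field(default_factory=list)


@dataclass
class Remedy:
    trigger: str
    failure_type: str
    risk_pattern: str
    repair_action: str
    scope: str
    confidence: float
    support: list = field(default_factory=list)


# Hypothetical lookup standing in for the induction step:
# failure type -> (risk pattern, repair action).
REPAIRS = {
    "navigation_failure": (
        "moving near the target without inventory gain",
        "clear the blocking block or select another route",
    ),
    "missing_crafting_table": (
        "recipe attempted without a crafting table",
        "obtain or place a crafting table",
    ),
}


def induce(feedback: dict) -> Optional[Union[Skill, Remedy]]:
    """Turn one typed-feedback record into a skill, a remedy, or nothing."""
    subgoal = feedback["subgoal"]
    if feedback["success"]:
        effects = feedback.get("inventory_delta", {})
        return Skill(
            trigger=subgoal,
            preconditions=feedback.get("preconditions", []),
            steps=feedback.get("steps", []),
            verification="inventory contains " + ", ".join(sorted(effects)),
            effects=effects,
            confidence=0.5,  # arbitrary starting value for a single observation
            support=[feedback],
        )
    repair = REPAIRS.get(feedback.get("failure_type"))
    if repair is None:
        return None  # no repairable pattern recognized, so store nothing
    risk_pattern, repair_action = repair
    return Remedy(
        trigger=subgoal,
        failure_type=feedback["failure_type"],
        risk_pattern=risk_pattern,
        repair_action=repair_action,
        scope="unfinished plan only",
        confidence=0.5,
        support=[feedback],
    )
```

The design point is the third return path: a failure with no recognizable pattern produces no entry at all, which is preferable to storing a remedy that says nothing.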

The business translation is direct:

Operational event	Weak memory version	MineEvolve-style knowledge
A support bot fails to resolve billing disputes when invoice metadata is missing	“Billing dispute failed before”	Remedy: if invoice ID is absent, request or retrieve invoice metadata before applying refund policy
A coding agent repeatedly fails tests after editing only one dependent file	“Tests failed”	Remedy: if API signature changed, inspect callers and update dependent tests before rerun
A procurement agent cannot complete vendor onboarding because tax documents are absent	“Vendor onboarding failed”	Remedy: insert missing-document collection before approval submission
A warehouse robot cannot reach a target bin through a blocked route	“Navigation failed”	Remedy: reroute or clear obstruction before retrying pickup

The value is not that the agent “remembers.” It is that the remembered event has a trigger, a scope, and a repair action. That is the difference between a diary and a playbook.

Curator: the memory bank needs a bouncer

A self-improving agent can become worse if it stores every lesson it thinks it learned. MineEvolve addresses this with a Curator that validates and maintains the knowledge store.

The Curator checks whether a candidate knowledge entry is complete, matchable to current state fields, executable by the low-level executor, specific enough to matter, and not in conflict with higher-confidence knowledge. Overly generic advice such as “try again” or “be careful” is rejected. A small mercy for everyone who has ever read an action item from a bad meeting.
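Those five checks translate naturally into a gate function. The sketch below is one possible reading, not the paper's implementation: the entry layout, the sets of state fields and executor actions, and the list of generic phrases are all hypothetical.

```python
# Assumed vocabulary for illustration; a real system would derive these
# from its own state schema and executor interface.
STATE_FIELDS = {"failure_type", "inventory", "gui_state", "position"}
EXECUTOR_ACTIONS = {"place", "craft", "mine", "move_to", "collect"}
GENERIC_ADVICE = {"try again", "be careful", "be more careful next time"}


def validate(entry, store):
    """Return (accepted, reason) for a candidate knowledge entry."""
    trigger, repair = entry.get("trigger"), entry.get("repair")
    # 1. Complete: trigger, repair, and confidence must all be present.
    if not trigger or not repair or "confidence" not in entry:
        return False, "incomplete"
    # 2. Matchable: the trigger may only name fields the Monitor produces.
    if not set(trigger) <= STATE_FIELDS:
        return False, "trigger not matchable to state fields"
    # 3. Specific: reject stock phrases and repairs with no concrete target.
    advice = entry.get("advice", "").strip().lower()
    if advice in GENERIC_ADVICE or not repair.get("target"):
        return False, "too generic"
    # 4. Executable: the repair must use an action the executor supports.
    if repair.get("action") not in EXECUTOR_ACTIONS:
        return False, "not executable"
    # 5. Consistent: same trigger, different repair, higher confidence wins.
    for other in store:
        if (other["trigger"] == trigger and other["repair"] != repair
                and other["confidence"] > entry["confidence"]):
            return False, "conflicts with higher-confidence entry"
    return True, "accepted"


good = {
    "trigger": {"failure_type": "missing_crafting_table"},
    "repair": {"action": "place", "target": "crafting_table"},
    "advice": "place a crafting table before crafting",
    "confidence": 0.8,
}
vague = {
    "trigger": {"failure_type": "navigation_failure"},
    "repair": {"action": "retry", "target": ""},
    "advice": "Try again",
    "confidence": 0.9,
}
print(validate(good, []))
print(validate(vague, []))
```

Note that the vague entry is rejected despite carrying the higher confidence score. Confidence is only consulted after an entry has proven it says something.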

This part of the paper is easy to underappreciate because curation does not sound as exciting as planning. But for business use, it may be the most important part.

Any agent that learns from operations needs memory hygiene. Otherwise, the system will accumulate contradictory rules, outdated workarounds, duplicated procedures, brittle one-off fixes, and overgeneralized lessons. A failed refund in Brazil becomes a global refund policy. A temporary API outage becomes a permanent coding superstition. A one-time user typo becomes a “known risk pattern.” The agent becomes experienced in the same way an office rumor mill is experienced.

MineEvolve’s Curator is not a complete solution to enterprise knowledge governance, but the design points in the right direction: structured entries, validation checks, support counts, confidence, usage history, retrieval constraints, and conflict filtering.

The broader principle is simple: learning agents require deletion, rejection, and compression, not just accumulation.

Adaptor: repair the suffix, do not restart the universe

The Adaptor is where MineEvolve turns stored knowledge back into action. When repeated failures or stagnation occur, the system does not regenerate the entire plan. It preserves the completed or still-valid prefix and repairs only the unfinished suffix.

This is a quiet but important engineering decision.

Long-horizon tasks create partial progress. In Minecraft, the agent may already have collected wood, crafted basic tools, mined stone, and smelted iron before failing on a later crafting step. In business workflows, an agent may already have verified the customer, retrieved the account, checked policy rules, and drafted a response before discovering a missing approval. Restarting the whole process is wasteful and sometimes dangerous. Blindly retrying the failed step is not much better.

MineEvolve’s repair strategy keeps what still works and changes what remains. If a remedy says that a missing crafting table caused the failure, the Adaptor inserts the prerequisite step into the unfinished plan. If a navigation remedy says the path is blocked, the Adaptor inserts a corrective subgoal before retrying the resource collection.
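The repair itself is small once the failure point and the remedy are known. A minimal sketch, assuming the plan is a flat list of subgoal strings and that repair kicks in after two failures; the function name, the `retry_limit` parameter, and the plan contents are illustrative.

```python
def repair_suffix(plan, failed_index, failure_count, remedy_steps, retry_limit=2):
    """Return the next plan after a failure at plan[failed_index].

    Below the retry limit the plan is returned unchanged (a plain retry).
    At or above it, the completed prefix is kept and the remedy steps are
    inserted immediately before the failed subgoal.
    """
    if not 0 <= failed_index < len(plan):
        raise IndexError("failed_index must point at a subgoal in the plan")
    if failure_count < retry_limit:
        return list(plan)
    prefix = plan[:failed_index]   # completed work, left untouched
    suffix = plan[failed_index:]   # failed subgoal and everything after it
    return prefix + list(remedy_steps) + suffix


plan = ["collect logs", "craft planks", "craft sticks",
        "craft wooden_pickaxe", "mine stone"]
remedy = ["craft crafting_table", "place crafting_table"]
print(repair_suffix(plan, failed_index=3, failure_count=2, remedy_steps=remedy))
```

The first three subgoals survive unchanged, the two remedy steps land in front of the failed crafting step, and nothing is re-planned from scratch.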

This is the operational heart of the paper: improvement happens not through inspirational self-reflection, but through local plan surgery.

The main evidence: gains are largest where dependency chains hurt

The paper evaluates MineEvolve on a 70-task subset of the Minecraft MCU tech-tree benchmark. The task groups range from easier Wooden, Stone, and Gold tasks to harder Iron, Redstone, Diamond, and Armor tasks. The hard groups are where missing prerequisites, failure propagation, and recovery planning become more visible.

MineEvolve is compared against DEPS, JARVIS-1, Optimus-1, and Optimus-2 across multiple planner backbones. The compact results show that MineEvolve achieves the highest overall success rate and hard-task average across the tested planner backbones.

A useful way to read the result is not “MineEvolve beats everything everywhere by a dramatic margin.” The easy tasks are already near saturation for stronger systems. The signal is clearer in the hard groups, where long dependency chains create more opportunities for structured remedies to matter.

Planner backbone	MineEvolve Overall SR	Strongest baseline Overall SR	MineEvolve Hard Avg.	Strongest baseline Hard Avg.
Qwen3.5-Flash	50.09%	48.84%	34.06%	32.11%
Qwen3.5-Plus	52.52%	49.75%	37.13%	33.25%
GLM-4.7	48.43%	48.06%	32.06%	31.25%
Gemini-3-Flash	52.04%	50.45%	36.42%	34.19%
GPT-5.5	54.93%	51.66%	40.23%	35.70%

The “strongest baseline” in this table is Optimus-2, which uses its native GOAP-based policy rather than the same STEVE-1 low-level policy used by several other methods. That makes it a strong system-level comparison but not a perfectly isolated same-controller comparison. The paper is aware of this and separately controls the interface, action primitives, memory-token budget, retrieval top-k, and evaluation-time LLM-call budget for STEVE-1-based methods.

The interpretation should therefore be careful. MineEvolve is not proving that its framework is the only route to better Minecraft agents. It is showing that across five planner backbones, converting execution feedback into curated skills and remedies provides consistent gains over the strongest baseline (0.37 to 3.27 points overall, 0.81 to 4.53 points on the hard-task average), especially where recovery matters.
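The size of those gains is easy to check from the table. The short script below only subtracts the reported figures; the numbers are copied from the table above and nothing is recomputed from raw results.

```python
# (MineEvolve overall, baseline overall, MineEvolve hard avg, baseline hard avg), in percent.
RESULTS = {
    "Qwen3.5-Flash": (50.09, 48.84, 34.06, 32.11),
    "Qwen3.5-Plus": (52.52, 49.75, 37.13, 33.25),
    "GLM-4.7": (48.43, 48.06, 32.06, 31.25),
    "Gemini-3-Flash": (52.04, 50.45, 36.42, 34.19),
    "GPT-5.5": (54.93, 51.66, 40.23, 35.70),
}


def gains(row):
    """Return (overall gain, hard-task gain) in percentage points."""
    overall, base_overall, hard, base_hard = row
    return round(overall - base_overall, 2), round(hard - base_hard, 2)


for backbone, row in RESULTS.items():
    overall_gain, hard_gain = gains(row)
    print(f"{backbone:15s} overall +{overall_gain:.2f}  hard +{hard_gain:.2f}")
```

In every row the hard-task gain is larger than the overall gain, from +0.81 versus +0.37 on GLM-4.7 up to +4.53 versus +3.27 on GPT-5.5. The margins are modest, but they point the same way on all five backbones.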

The ablation result says the quiet part clearly: binary feedback is too poor

The most useful evidence in the paper is not the leaderboard table. It is the ablation logic.

The authors test whether the quality of remedies depends on the granularity of execution feedback. The comparison is revealing:

Source of remedies	Iron	Redstone	Diamond	Armor	Hard Avg.
Binary Feedback	42.36%	24.72%	10.28%	18.36%	27.06%
Trajectory-level Reflection	46.88%	26.79%	11.94%	20.36%	29.98%
Typed Execution Feedback	55.83%	31.24%	17.06%	27.63%	37.13%

This table is doing more than adding a nice ablation checkbox. It directly supports the paper’s core mechanism.

Binary feedback tells the system whether the attempt succeeded or failed. That is not enough. Trajectory-level reflection gives the LLM more language to work with, and it does improve over binary feedback (29.98% versus 27.06% hard average). But typed execution feedback performs substantially better, at 37.13%, because it gives the system fields that are matchable, verifiable, and actionable.

In plain business language: “failed” is not a root cause. “Failed because the required document was missing, no state transition occurred, and the workflow retried the same API call three times” is closer to one.
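To see why "matchable" is the operative word, compare three records of the same failed refund. This is an illustration in business terms, not data from the paper: the field names, the `submit_refund` call, and the `matches` helper are all hypothetical.

```python
# Three ways to record the same failure.
binary = {"success": False}
reflection = "The refund step failed; be more careful with documents next time."
typed = {
    "success": False,
    "failure_type": "missing_document",
    "state_transition": False,
    "repeated_call": "submit_refund",
    "repeat_count": 3,
}

# A remedy trigger names the fields and values it applies to.
trigger = {"failure_type": "missing_document", "state_transition": False}


def matches(trigger, feedback):
    """A trigger matches only if every field it names is present with the same value."""
    if not isinstance(feedback, dict):
        return False  # free text has no fields to match against
    return all(k in feedback and feedback[k] == v for k, v in trigger.items())


for name, record in [("binary", binary), ("reflection", reflection), ("typed", typed)]:
    print(name, matches(trigger, record))
```

Only the typed record can fire the remedy. The reflection contains roughly the same information for a human reader, but nothing in it can be checked against a rule without another round of interpretation.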

The paper also reports that Skills Only and Remedies Only each improve over reflection, with the full system performing best. Remedies Only outperforms Skills Only on hard tasks, which is exactly what we should expect in long dependency chains: the rare skill is useful, but the repeated repair often saves the day.

Knowledge accumulation improves performance, but the budget still matters

MineEvolve is an accumulated-knowledge system, so one practical concern is obvious: does the knowledge base keep growing until retrieval becomes slow and the prompt turns into a municipal archive?

The supplementary experiments address this with a knowledge-accumulation analysis. On held-out hard tasks, MineEvolve’s hard average rises from 25.13% at zero episodes to 35.02% after 400 episodes. Static Store reaches only 28.22%, and Text Reflection reaches 29.56% at the same checkpoint.

Method	Hard Avg. at 0 episodes	Hard Avg. at 400 episodes	Interpretation
Static Store	25.13%	28.22%	Some benefit from stored knowledge, but limited without active structured evolution
Text Reflection	25.13%	29.56%	Free-form reflection helps, but weakly
MineEvolve	25.13%	35.02%	Typed, curated skills and remedies accumulate into stronger transfer

The overhead table is also important. After 400 episodes, the knowledge base contains 143 skills and 191 remedies. Average retrieved tokens reach 1,216, retrieval latency is 0.25 seconds, and evaluation-time LLM calls rise from 7.82 to 8.31. That does not prove the method scales indefinitely. It does show that under the paper’s prompt-budget control, the knowledge store does not immediately explode into unusable context.

For production systems, this is the part to copy with discipline. Accumulated knowledge is valuable only if retrieval remains selective. A system that remembers everything equally remembers nothing operationally.
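Selective retrieval under a budget can be sketched as rank-then-pack. The scoring rule here (tag overlap times confidence), the greedy packing, and the entry layout are assumptions chosen for brevity; the paper controls a memory-token budget and a retrieval top-k, but this is not its retrieval algorithm.

```python
def retrieve(entries, query_tags, token_budget, top_k=5):
    """Rank matching entries, then pack them greedily within the token budget.

    Each entry is a dict with "id", "tags", "confidence", and a precomputed
    "tokens" count. Returns (chosen entries, tokens used).
    """
    query = set(query_tags)
    scored = []
    for entry in entries:
        overlap = len(set(entry["tags"]) & query)
        if overlap:  # entries with no matching tag are never retrieved
            scored.append((overlap * entry["confidence"], entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen, used = [], 0
    for _, entry in scored:
        if len(chosen) == top_k:
            break
        if used + entry["tokens"] <= token_budget:
            chosen.append(entry)
            used += entry["tokens"]
    return chosen, used


store = [
    {"id": "a", "tags": ["crafting", "pickaxe"], "confidence": 0.9, "tokens": 300},
    {"id": "b", "tags": ["crafting"], "confidence": 0.8, "tokens": 500},
    {"id": "c", "tags": ["navigation"], "confidence": 0.95, "tokens": 200},
    {"id": "d", "tags": ["crafting", "pickaxe"], "confidence": 0.5, "tokens": 400},
]
chosen, used = retrieve(store, ["crafting", "pickaxe"], token_budget=800)
print([e["id"] for e in chosen], used)
```

Entry `c` has the highest confidence in the store and is still left out, because it is about the wrong thing. Entry `b` is relevant but does not fit in what remains of the budget. Both exclusions are the system working as intended.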

The appendix is not a second thesis, but it usefully tests the learning pathway

The paper includes a curriculum-style experiment on Diamond tasks. This is not the main MCU evaluation. It is better read as an exploratory extension that asks a narrower question: does the path by which the knowledge base is constructed matter?

The answer is yes.

Cold Start remains below 3% across checkpoints and ends at 2.9%. Diamond-only Self-learning rises to 13.9%. Curriculum PretrainFreeze, where lower-tier knowledge is built first and then frozen, reaches 16.4%. Mixed Sampling, where lower-tier and Diamond experience are jointly updated, reaches 17.6%.

Strategy	Final Diamond SR	What it suggests
Cold Start	2.9%	High-tier tasks remain extremely hard without cross-episode behavioral knowledge
Diamond-only Self-learning	13.9%	Target-task experience helps, but early learning is inefficient
Curriculum PretrainFreeze	16.4%	Lower-tier skills and remedies transfer to harder tasks
Mixed Sampling (1:1)	17.6%	Combining foundational and target-specific experience works best in this setup

This is a useful business analogy, though not a proof of enterprise transfer. Agents that learn only from rare, high-stakes exceptions may improve slowly because they see too few successful recoveries. A better route may be to accumulate knowledge from simpler adjacent workflows first, then use mixed updates as harder cases arrive.

Translated badly, this becomes “train on easy tasks first.” Translated properly, it becomes: construct the agent’s knowledge base along the dependency structure of the work.
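As a rough illustration of what a 1:1 mixed schedule could look like, the sketch below interleaves lower-tier and target tasks in fixed-ratio blocks. The function, the task names, and the reading of "1:1" as lower-tier to Diamond episodes are assumptions; the paper's actual sampling procedure may differ.

```python
import random


def mixed_schedule(lower_tier, target, episodes, ratio=(1, 1), seed=0):
    """Build an episode schedule mixing lower-tier and target tasks.

    Tasks are drawn in blocks of ratio[0] lower-tier and ratio[1] target
    episodes, shuffled within each block so neither kind always goes first.
    """
    rng = random.Random(seed)  # seeded for a reproducible schedule
    low_n, target_n = ratio
    schedule = []
    while len(schedule) < episodes:
        block = [("lower", rng.choice(lower_tier)) for _ in range(low_n)]
        block += [("target", rng.choice(target)) for _ in range(target_n)]
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:episodes]


schedule = mixed_schedule(
    lower_tier=["wooden_pickaxe", "stone_pickaxe", "iron_pickaxe"],
    target=["diamond_pickaxe", "diamond_sword"],
    episodes=10,
)
for kind, task in schedule:
    print(kind, task)
```

Under this scheme the knowledge base keeps receiving skills and remedies from the prerequisite tiers while Diamond experience arrives, instead of freezing the lower-tier knowledge first.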

What each experiment supports, and what it does not

The paper is strongest when read as a mechanism study, not as a universal claim about autonomous agents. The evidence map looks like this:

Evidence item	Likely purpose	What it supports	What it does not prove
Main MCU results across planner backbones	Main evidence and system comparison	MineEvolve improves success rates, especially on hard dependency-heavy tasks	That the same gains will appear in every business domain
Feedback-granularity ablation	Core ablation	Typed execution feedback is more useful than binary feedback or free-form reflection	That the chosen feedback fields are universally optimal
Skills/remedies/curator/adaptor ablations	Component ablation	The full loop matters; remedies and local repair are important	That no simpler architecture could approximate the result
Knowledge accumulation and overhead	Robustness and operational feasibility check	Curated knowledge can grow while retrieval remains bounded in this setup	That long-term enterprise memory will remain clean without stronger governance
Diamond curriculum experiment	Exploratory extension	Knowledge-construction pathway affects hard-task learning	That a specific curriculum rule transfers directly outside Minecraft
Qualitative case studies	Mechanism illustration	Typed feedback can produce executable repairs in concrete failures	Statistical generality

This classification matters because it protects the article from the usual research-blog disease: turning every appendix table into a revolution. The paper’s main claim is already interesting enough. No need to make it wear a cape.

The business value is cheaper diagnosis, not magical autonomy

For Cognaptus readers, the paper is most relevant to agentic business-process automation. Not because companies secretly want Minecraft bots, although some quarterly meetings do feel like survival mode. The relevance comes from the architecture of learning from execution.

A business agent that follows MineEvolve’s logic would need four layers.

First, it needs operational telemetry. Every workflow step should produce structured state changes: completed fields, failed validations, missing inputs, API responses, user confirmations, approval states, and tool-call outcomes. The agent must know not only whether a step failed, but what failed to change.

Second, it needs a knowledge induction layer. Successful workflows should become reusable procedures with preconditions and verification checks. Failed or stagnant workflows should become remedies with triggers and repair scope. A remedy should not say “handle exception better.” It should say “if policy lookup returns no match because product category is missing, retrieve category from invoice metadata before retrying eligibility check.”

Third, it needs curation. Candidate knowledge should be rejected if it is vague, non-executable, unsupported, obsolete, or in conflict with higher-confidence rules. In regulated domains, this layer would also need human review, audit trails, approval states, and rollback.

Fourth, it needs local repair. When the agent fails at step 7 of a 12-step process, it should not restart at step 1 unless the earlier state is invalid. It should preserve the valid prefix, insert the repair, and continue from the smallest safe point.

MineEvolve idea	Business-process equivalent	ROI relevance
Typed execution feedback	Structured logs and state transitions	Reduces diagnosis cost
Skills	Verified reusable workflows	Reduces repeated planning and variation
Remedies	Exception-handling playbooks	Reduces repeated failure and escalation
Curator	Governed operational knowledge base	Reduces memory pollution and compliance risk
Adaptor	Local workflow repair	Reduces rework and wasted tool calls

The likely economic benefit is not that the agent becomes charmingly autonomous. Charm is abundant. Reliability is scarce. The nearer-term value is cheaper diagnosis, fewer repeated failures, better exception handling, and a more reusable operational knowledge base.

The boundary: Minecraft gives the agent unusually clean feedback

The main limitation is not that Minecraft is a game. Games can be excellent testbeds because they expose state, rules, actions, and outcomes with unusual clarity. The limitation is that real business environments rarely provide feedback this cleanly by default.

MineEvolve can observe inventory changes, GUI state, coordinates, task progress, and failure categories. Many enterprise workflows cannot even agree on what “done” means across three systems and a spreadsheet maintained by someone named Linda.

To transfer the idea, a business system needs domain-specific equivalents:

Minecraft signal	Business analogue
Inventory change	Database record, document set, account status, order state
GUI state	Application screen, form state, modal dialog, validation result
Failure type	Error code, policy conflict, missing field, timeout, rejected approval
Progress signal	Completed subtasks, accepted tool output, user confirmation
Stagnation	Repeated tool calls, no state transition, looped conversation, unchanged ticket status
Verification rule	Test pass, audit check, policy compliance, final user acceptance

This is where many agent deployments will fail. They will buy the planner and forget the instrumentation. Then they will wonder why the agent’s memory consists of verbose little essays about its feelings.
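The stagnation row of that table is the cheapest piece of instrumentation to build. A minimal sketch for a ticketing workflow, assuming each tool call is logged with the ticket status before and after it; the event layout, the `lookup_policy` tool name, and the window of three are hypothetical.

```python
def detect_stagnation(events, window=3):
    """Flag a loop: the last `window` events repeat one identical tool call
    and none of them changed the ticket status."""
    if len(events) < window:
        return False
    recent = events[-window:]
    same_call = len({(e["tool"], e["args"]) for e in recent}) == 1
    no_transition = all(e["status_before"] == e["status_after"] for e in recent)
    return same_call and no_transition


looping = [
    {"tool": "lookup_policy", "args": "ticket=42",
     "status_before": "open", "status_after": "open"}
    for _ in range(3)
]
progressing = looping[:2] + [
    {"tool": "lookup_policy", "args": "ticket=42",
     "status_before": "open", "status_after": "pending_approval"}
]
print(detect_stagnation(looping), detect_stagnation(progressing))
```

A detector like this does not explain the failure. It only guarantees that the third identical call gets recorded as stagnation, which gives a remedy something to trigger on.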

The paper also does not prove that the method handles adversarial users, changing regulations, multi-agent organizational politics, or messy partial observability in enterprise systems. Nor does it remove the need for human governance. In high-stakes workflows, remedies should not be allowed to silently rewrite policy. They should propose repairs within an approved operational envelope.

So the practical lesson is bounded but valuable: MineEvolve gives a design pattern for turning execution history into usable agent knowledge. The pattern still needs domain telemetry, validation rules, safety constraints, and auditability before it becomes a production architecture.

The agent that learns is the agent that edits its playbook

MineEvolve is a useful paper because it refuses a lazy answer. It does not say that longer context is enough. It does not say that generic reflection is enough. It does not say that the model must be retrained after every mistake.

Instead, it shows a more operational route:

observe execution with structured signals;
convert success into reusable skills;
convert failure into scoped remedies;
curate the knowledge base aggressively;
repair only the unfinished part of the plan.

That is not glamorous. It is almost bureaucratic. Which is precisely why it matters.

Real organizations do not improve by remembering every bad day. They improve by converting repeated problems into procedures, controls, and repair paths. MineEvolve applies the same logic to embodied agents. The agent gets punched by the environment, but the useful outcome is not pain. It is the updated playbook.

The business question, then, is not whether your agent has memory. It is whether the memory has been promoted into operational knowledge.

Most systems are not there yet.

But at least now we have a clearer way to ask why.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Jinhan Li, Chenglong Li, Zikai Xiao, Jingwei Song, Jinhao Jing, Vireo Zhang, and Kun Wang, “MineEvolve: Self-Evolution with Accumulated Knowledge for Long-Horizon Embodied Minecraft Agents,” arXiv:2603.13131v3, May 10, 2026, https://arxiv.org/abs/2603.13131.
