A support ticket goes wrong. A workflow agent chooses the wrong tool. A finance assistant misses a procedural step. The usual response is familiar: add the failure to memory, rewrite a prompt, perhaps ask the agent to “reflect” before trying again. This is useful, in the same way that putting a sticky note on a broken machine is useful. It may prevent the same mistake next time. It does not prove the machine has learned how to improve.
That distinction is the center of “Training Language Agents to Learn from Experience,” a new paper by Yuval Shalev, Zifeng Ding, and Mateja Jamnik [1]. The paper is not mainly about giving agents a longer memory. It is not merely another reflection trick. Its sharper contribution is a testbed for a harder question: can a language agent observe experience from tasks it will not repeat, extract a reusable lesson, and improve on future unseen tasks?
That is the business-relevant version of the problem as well. Most operational failures are not replayable. A customer conversation cannot be reset. A warehouse exception cannot be repeated under lab conditions. A compliance review cannot be run ten times until the AI finally discovers the right policy interpretation. If an agent only improves by retrying the same instance, it is less an autonomous learner than a very patient intern with infinite do-overs.
The paper proposes a cleaner loop. An actor attempts tasks. A reflector reads the actor’s trajectories and rewrites the actor’s system prompt. The next actor attempt happens on new tasks, not the same task. The reflector is then trained with reinforcement learning to become better at writing those prompt updates. In other words, the paper studies whether “learning from experience” can itself be trained as a capability.
The answer is cautiously positive. The trained reflectors outperform an untrained Qwen2.5-7B-Instruct reflector on most held-out task families in ALFWorld and MiniHack. The evidence is meaningful, but narrow. This is not a proof that production agents can safely self-improve in the wild. It is a mechanism for studying how that kind of improvement might be built without pretending that memory alone is learning.
The paper replaces “try again” with “learn for the next unseen case”
Most reflection methods for language agents follow a convenient pattern. The agent fails a task, writes a reflection, and tries the same task again. This is sensible for benchmarks. It is also a little too generous. The exact same problem appears again, usually with the same hidden structure, and the agent can use the previous failure as a direct hint.
The paper’s In-context Training, or ICT, changes the test. The actor does not get to repeat the same task instance. At each meta-turn, it faces a new batch of tasks sampled from a training set. The reflector observes what happened, writes a new system prompt, and that prompt is evaluated on a fixed held-out validation set. The target is not “solve this failed example.” The target is “write a better general instruction after seeing these examples.”
The mechanism is simple enough to draw as a loop:
| Step | Component | What happens | Why it matters |
|---|---|---|---|
| 1 | Actor | Runs a batch of task episodes under the current system prompt | Produces behavioral evidence, not just final scores |
| 2 | Reflector | Reads the prompt and full trajectories | Diagnoses how the instruction shaped behavior |
| 3 | Reflector | Generates an improved system prompt | Converts experience into reusable natural-language policy |
| 4 | Evaluation | Tests the new prompt on held-out tasks | Measures transfer, not repeated-task correction |
| 5 | Meta-loop | Repeats the process for several turns | Tests whether prompt improvement compounds |
This is a small change with large consequences. In a repeated-task setup, the agent can overfit to the specific failed instance. In ICT, that shortcut is less useful. A prompt that only solves yesterday’s exact case will not necessarily help on tomorrow’s unseen case.
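The loop in the table can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's implementation: `run_ict`, `toy_actor`, `toy_reflector`, and the `name:skill` task strings are all hypothetical, and the "actor" and "reflector" are simple string rules standing in for language models.

```python
import random
from typing import Callable, List, Tuple

Trajectory = Tuple[str, bool]  # (task, success); real trajectories hold full action logs


def run_ict(
    actor: Callable[[str, str], bool],
    reflector: Callable[[str, List[Trajectory]], str],
    train_tasks: List[str],
    val_tasks: List[str],
    initial_prompt: str,
    turns: int = 4,
    k: int = 3,
    seed: int = 0,
) -> Tuple[str, float]:
    """Run a toy ICT loop; return the best prompt and its held-out success rate."""
    rng = random.Random(seed)

    def val_score(prompt: str) -> float:
        return sum(actor(t, prompt) for t in val_tasks) / len(val_tasks)

    prompt = initial_prompt
    best_prompt, best_score = prompt, val_score(prompt)
    for _ in range(turns):
        batch = rng.sample(train_tasks, k)                     # step 1: fresh tasks
        trajectories = [(t, actor(t, prompt)) for t in batch]  # one attempt per task
        prompt = reflector(prompt, trajectories)               # steps 2-3: rewrite
        score = val_score(prompt)                              # step 4: held-out check
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


def toy_actor(task: str, prompt: str) -> bool:
    # Toy rule (assumption): a "name:skill" task succeeds only if the prompt names the skill.
    return task.split(":")[1] in prompt


def toy_reflector(prompt: str, trajectories: List[Trajectory]) -> str:
    # Toy rule (assumption): add one hint per skill that appears in a failed trajectory.
    missing = sorted({task.split(":")[1] for task, ok in trajectories if not ok})
    return prompt + "".join(f" Remember to {skill}." for skill in missing)


best_prompt, best_score = run_ict(
    toy_actor,
    toy_reflector,
    train_tasks=["room-1:explore", "room-2:explore", "scroll-1:read", "scroll-2:read"],
    val_tasks=["room-9:explore", "scroll-9:read"],
    initial_prompt="You are a careful agent.",
)
print(best_score, best_prompt)
```

The point of the sketch is the data flow: validation tasks never enter the batch the reflector sees, so a prompt only scores well if the lesson carries over.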
The paper formalizes the reflector’s objective as finding a system prompt that maximizes the actor’s reward across validation tasks:
$$ sp^* = \arg\max_{sp} \sum_{e_i \in E_{val}} R(LLM_{act}, e_i, sp) $$
The notation is less important than the design principle. The prompt is treated as the policy surface. The actor’s model weights remain frozen. The reflector’s job is to improve the actor by rewriting the instructions that condition its behavior.
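In practice the loop cannot search over all possible prompts; it can only compare the prompts it actually produced. One way to write that approximation, in notation that is ours rather than the paper's ($sp_t$ for the prompt after meta-turn $t$, $\tau_t$ for the batch of trajectories collected at turn $t$, $LLM_{ref}$ for the reflector, $N$ for the number of meta-turns, and the assumption that the initial prompt $sp_0$ is also a candidate):
$$ sp_t = LLM_{ref}(sp_{t-1}, \tau_t), \qquad \hat{sp} = \arg\max_{t \in \{0, \dots, N\}} \sum_{e_i \in E_{val}} R(LLM_{act}, e_i, sp_t) $$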
That makes the experiment unusually relevant for business systems. Most companies will not fine-tune a full agent model after every week of operational failures. They might, however, update role instructions, escalation rules, tool-use policies, exception-handling playbooks, and domain-specific “what to do when…” guidance. ICT is basically a research-grade version of that loop.
The actor acts; the reflector turns behavior into instructions
The actor and reflector have different jobs. The actor is the operational agent. It interacts with ALFWorld or MiniHack using a ReAct-style prompt: observe, think, act, repeat. The reflector is not acting in the environment. It receives the previous system prompt and the actor’s trajectories, then outputs two things: an analysis of what went right or wrong, and a full improved system prompt.
This division matters. Many agent frameworks blur action, memory, critique, and planning into one large context. The paper deliberately separates them. The actor performs. The reflector audits performance. The output of the audit is not a private thought, not a vector memory, and not a loose suggestion. It is a revised system prompt that can be inspected.
That inspectability is not a decorative feature. It is one reason prompt-based self-improvement remains attractive despite its obvious limitations. A learned weight update is hard to audit. A changed instruction can be read. It may still be wrong, vague, or overfitted, but at least the failure is visible in human language. In regulated or high-friction operational settings, visible failure is still preferable to invisible cleverness. A low bar, yes, but a useful one.
The paper’s examples show the reflector doing something more specific than motivational self-critique. In MiniHack-Read, the reflector observes that the actor succeeds when it picks up and reads a resource, then writes a prompt telling the actor to continue exploring until it finds an item and to pick up or read resources depending on the situation. In ALFWorld Cool and Place, it notices that the actor fails to identify where cooled objects should be handled and adds guidance about locations such as fridges and side tables.
The important detail is that the training data did not contain tasks requiring cooling or reading. These examples are not proof of broad general intelligence. They are evidence that the reflector can infer a task-relevant procedural rule from trajectories and express it as a prompt update.
That is closer to what a business would want from post-incident learning. Not “remember ticket #8472.” Rather: “When a refund request includes contradictory delivery data, first reconcile shipment status before escalating.” The value is not the stored event. The value is the operational rule extracted from the event.
The training signal is clever, but it is also a compromise
The reflector is trained using reinforcement learning, specifically GRPO, with Qwen2.5-7B-Instruct as both the base actor and base reflector. The actor is frozen. Only the reflector is fine-tuned.
Here is the slightly awkward part. The evaluation objective is future unseen tasks, but the training reward is computed by replaying the previous batch of tasks under each candidate prompt. In plain English: the reflector sees a batch of trajectories, proposes candidate prompt updates, and those candidates are scored by rerunning the same batch with the frozen actor.
At first glance, that sounds like the paper has quietly returned to repeated-task learning through the side door. The authors handle this by distinguishing the training reward from the ICT evaluation goal. The replayed batch provides a reward signal during training because reinforcement learning needs something to score. The reflector is still prompted to produce instructions for future tasks, and final evaluation uses held-out task types.
This design should be read as an implementation compromise, not a philosophical contradiction. The paper needs a scalable reward signal without human labels. Replaying the recent batch is a practical way to ask: “Did this prompt actually fix the behaviors revealed by the trajectories?” If a candidate prompt cannot improve the situations it was supposed to diagnose, it is unlikely to be a good general lesson.
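A minimal sketch of that replay-based scoring, assuming a binary success signal and a toy keyword-matching actor. The function name `replay_rewards` and the example prompts are illustrative, not taken from the paper:

```python
from typing import Callable, Dict, List


def replay_rewards(
    actor: Callable[[str, str], bool],
    batch: List[str],
    candidates: List[str],
) -> Dict[str, float]:
    """Score each candidate prompt by rerunning the same batch with the frozen actor."""
    return {
        prompt: sum(actor(task, prompt) for task in batch) / len(batch)
        for prompt in candidates
    }


def toy_actor(task: str, prompt: str) -> bool:
    # Toy rule (assumption): the task succeeds if the prompt mentions its first word.
    return task.split()[0] in prompt.lower()


batch = ["cool the apple", "cool the mug", "heat the egg"]
candidates = [
    "Be careful.",
    "Use the fridge to cool objects.",
    "Use the fridge to cool and the microwave to heat objects.",
]
rewards = replay_rewards(toy_actor, batch, candidates)
for prompt, reward in rewards.items():
    print(f"{reward:.2f}  {prompt}")
```

Note what the reward cannot see: a candidate that hard-codes fixes for exactly these three tasks would score as well as one that states a general rule. That gap is what the held-out evaluation has to cover.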
But the boundary is real. A reward based on replayed tasks can encourage narrow fixes. The paper tries to control this with held-out task-type evaluation. That is why the task partitioning matters so much.
Task-type splits are the quiet guardrail against fake learning
A weak version of this experiment would split task instances randomly into train and test. That would look rigorous while allowing an easier shortcut: the reflector could learn a prompt that works for a known task type and reproduce it later. Nice benchmark number. Thin claim.
The paper instead partitions by task type. In ALFWorld, the meta-train set includes Pick and Place, Examine in Light, Clean and Place, and Heat and Place. The meta-test set includes Cool and Place and Pick Two and Place. In MiniHack, the meta-train set includes Room-Random, Eat, Wield, and Wear. The meta-test set includes PutOn, Zap, Read, and Room-Dark.
This is the right experimental pressure. It asks whether the reflector has learned a transferable reflection strategy, not whether it memorized a prompt for one familiar family.
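The split described above is easy to encode and check. The sketch below uses the task-type names listed in this article; the helper names `leaked_task_types` and `split_instances` are hypothetical and only illustrate routing by type rather than by instance:

```python
META_SPLITS = {
    "ALFWorld": {
        "train": {"Pick and Place", "Examine in Light", "Clean and Place", "Heat and Place"},
        "test": {"Cool and Place", "Pick Two and Place"},
    },
    "MiniHack": {
        "train": {"Room-Random", "Eat", "Wield", "Wear"},
        "test": {"PutOn", "Zap", "Read", "Room-Dark"},
    },
}


def leaked_task_types(splits):
    """Return, per environment, any task type present in both train and test."""
    return {env: s["train"] & s["test"] for env, s in splits.items()}


def split_instances(instances, test_types):
    """Route (task_type, instance_id) pairs by task type, never by instance."""
    train = [inst for inst in instances if inst[0] not in test_types]
    test = [inst for inst in instances if inst[0] in test_types]
    return train, test


leaks = leaked_task_types(META_SPLITS)
train, test = split_instances(
    [("Heat and Place", 1), ("Cool and Place", 2), ("Heat and Place", 3)],
    META_SPLITS["ALFWorld"]["test"],
)
print(leaks, train, test)
```

A random instance-level split would pass no such check: the same task type would land on both sides, and the disjointness test would have nothing to enforce.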
The evidence table should be read through that lens. The trained reflector beats the untrained Qwen2.5-7B reflector in 11 of the 12 held-out settings (six task types at two batch sizes), although the reported intervals overlap in several cells:
| Environment | Task | Initial prompt | Untrained reflector, $k=3$ | Trained reflector, $k=3$ | Untrained reflector, $k=5$ | Trained reflector, $k=5$ |
|---|---|---|---|---|---|---|
| ALFWorld | Cool and Place | 13.3±4.3% | 42.4±10.9% | 49.2±10.1% | 38.8±12.0% | 48.0±15.0% |
| ALFWorld | Pick Two | 27.7±6.5% | 34.6±5.6% | 42.5±7.2% | 28.3±6.1% | 41.2±7.6% |
| MiniHack | Dark | 35.0±7.2% | 48.4±5.8% | 51.9±4.0% | 42.8±9.7% | 48.1±3.5% |
| MiniHack | Read | 29.1±6.4% | 34.7±4.1% | 38.4±5.4% | 32.2±5.9% | 40.0±5.4% |
| MiniHack | PutOn | 24.8±5.2% | 26.6±9.0% | 30.0±4.7% | 22.8±10.0% | 34.1±2.2% |
| MiniHack | Zap | 10.4±5.8% | 18.8±5.6% | 14.1±3.5% | 23.1±7.9% | 27.8±4.5% |
The result is not uniformly heroic. The trained $k=3$ reflector underperforms the untrained reflector on MiniHack-Zap (14.1% versus 18.8%). The authors interpret Zap as a harder task where small batches may not provide enough successful examples or reward signal. This matters because sparse success is common in real workflows too. If almost every attempt fails, a reflective learner may have little positive evidence to imitate. It can diagnose pain beautifully and still not discover the cure. Very human, unfortunately.
The more important pattern is not that every cell improves. It is that trained reflection improves performance across most held-out task types, including task types not seen during reflector training. That supports the paper’s main claim: learning from experience can be trained as a meta-skill.
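For readers who want the table as arithmetic, here is a small sketch that computes trained-minus-untrained gaps from the mean success rates above. It deliberately ignores the ± intervals, so it counts direction only and says nothing about statistical significance:

```python
# (task, batch size k) -> (untrained, trained) mean success rate in %, copied from the table
RESULTS = {
    ("Cool and Place", 3): (42.4, 49.2),
    ("Cool and Place", 5): (38.8, 48.0),
    ("Pick Two", 3): (34.6, 42.5),
    ("Pick Two", 5): (28.3, 41.2),
    ("Dark", 3): (48.4, 51.9),
    ("Dark", 5): (42.8, 48.1),
    ("Read", 3): (34.7, 38.4),
    ("Read", 5): (32.2, 40.0),
    ("PutOn", 3): (26.6, 30.0),
    ("PutOn", 5): (22.8, 34.1),
    ("Zap", 3): (18.8, 14.1),
    ("Zap", 5): (23.1, 27.8),
}

# Gap in percentage points; positive means the trained reflector scored higher.
gains = {
    key: round(trained - untrained, 1)
    for key, (untrained, trained) in RESULTS.items()
}
wins = [key for key, gain in gains.items() if gain > 0]
losses = [key for key, gain in gains.items() if gain < 0]
largest = max(gains, key=gains.get)

print(f"trained ahead in {len(wins)} of {len(gains)} settings")
print(f"trained behind in: {losses}")
print(f"largest gap: {largest} at {gains[largest]} points")
```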
More turns test whether improvement compounds rather than merely appears
The paper also studies what happens across meta-turns. The trained reflector was built from ICT loops with $N=4$ turns during dataset creation. At evaluation time, the ICT loop runs for 10 turns. If performance stopped improving after turn 4, the model might simply be reproducing the horizon it saw during training. Instead, the success rate of the best prompt found so far continues to rise beyond that point.
The reported detail is useful: in ALFWorld, 55% of best prompts for $k=3$ and 45% for $k=5$ are achieved after turn 4. In MiniHack, 32.5% for $k=3$ and 37.5% for $k=5$ emerge after turn 4. For ALFWorld $k=3$, 25% of the best prompts are not reached until turn 10.
This is not a minor appendix curiosity. It is one of the stronger mechanism checks in the paper. The reflector is not merely producing a one-shot improvement. In many runs, the loop continues to extract incremental value from additional experience.
For business readers, this is the difference between a static postmortem and a living operating manual. A static postmortem says, “Here is what went wrong last time.” A living manual changes as more edge cases accumulate. The paper’s evidence suggests that prompt-level operating rules can improve across multiple cycles. It does not show that such improvement remains stable indefinitely. That distinction should stay visible.
Batch size is not just a cost parameter; it changes the learning signal
The paper tests batch sizes $k=3$ and $k=5$. The obvious interpretation is cost: larger batches mean more episodes, more tokens, more execution. But the paper points out a second role. Batch size changes the learning signal.
A larger batch gives the reflector more diverse trajectories to compare. It also makes RL training less brittle when rewards are sparse. If every candidate prompt gets zero reward, GRPO has no useful advantage signal. More episodes increase the chance that at least some candidate produces measurable success.
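The zero-signal problem is visible in the arithmetic of group-relative advantages. The sketch below standardizes rewards within one group of candidates, which is the core idea behind GRPO's advantage estimate; the function name, the epsilon, and the use of the population standard deviation are assumptions for illustration, and real implementations differ in detail:

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of candidates sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every candidate prompt fails: all advantages are zero, so there is nothing to reinforce.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One candidate succeeds: it gets a positive advantage, the rest a negative one.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(all_fail)
print(one_success)
```

A single success in the group is enough to create a gradient direction. That is the sense in which more episodes per candidate can matter more than their raw count suggests.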
The results are mixed in a way that is operationally useful. In ALFWorld, $k=3$ and $k=5$ are broadly similar, and $k=3$ may be preferable once the cost of additional solving attempts is considered. In MiniHack, $k=5$ often performs better, especially where tasks are harder and sparse rewards make learning from small batches more difficult.
That is a practical lesson disguised as a benchmark detail. More experience is not always better. More experience is better when it changes the quality of diagnosis or supplies enough positive signal to learn. Otherwise, it is just a larger bill with academic formatting.
A business translation would look like this:
| Operational condition | Better evidence batch | Reason |
|---|---|---|
| Routine workflow with clear success/failure patterns | Smaller batch may be enough | Failures are interpretable and positive examples are common |
| Sparse-success process, rare edge cases, hard recovery paths | Larger batch may help | The learner needs more variation to find actionable signal |
| Long or complex trajectories | Larger batch may hurt smaller models | Too much context can overload diagnosis |
| Expensive human review loop | Smaller, curated batch may dominate | Quality of examples beats raw quantity |
The paper’s untrained Qwen2.5-7B reflector also performs worse under $k=5$ than under $k=3$ on every held-out task in the table above except Zap. The likely explanation is context workload: processing five episodes at once may be too much for the untrained model. This is a useful warning. Feeding more logs into an agent is not the same as giving it better judgment. Sometimes it merely gives the model a larger space in which to be confused.
Cross-benchmark transfer is promising, but it is the most fragile claim
The paper includes a cross-benchmark test: a reflector trained only on MiniHack is evaluated on ALFWorld. This is a stronger transfer setting because MiniHack’s grid-world tasks differ substantially from ALFWorld’s household manipulation tasks.
The $k=5$ MiniHack-trained reflector outperforms the untrained Qwen2.5-7B baseline on 4 of 6 ALFWorld tasks and is comparable on the remaining two. It performs especially well on Examine in Light and Heat and Place. The $k=3$ reflector, however, does not show the same pattern and sometimes performs worse than the untrained model.
This test is best read as an exploratory extension, not the central proof. It suggests that the reflector may learn a general habit of reading trajectories and rewriting instructions, rather than merely learning benchmark-specific tricks. But the result depends on batch size, model scale, task selection, and the structure of the environments. It is encouraging. It is not yet a license to deploy cross-domain self-improving agents into procurement, logistics, compliance, and customer service while everyone goes for lunch.
Still, the direction is important. If the transferable object is not the task solution but the reflection procedure, then enterprise agents could eventually learn better operational playbooks across related workflows. A claims-processing agent might improve from denial disputes, then apply some of the same diagnostic discipline to prior-authorization exceptions. Not because insurance is a dungeon game, despite appearances, but because both require turning trajectories into procedural updates.
MetaGym is infrastructure for studying improvement loops, not a side gift
The paper’s third contribution is MetaGym, a Python library for constructing meta-environments around agentic tasks. It wraps Gym-like environments so that the meta-action is a system prompt and the meta-observation is a batch of actor rollouts. Internally, it manages parallel task execution and trajectory collection.
This may sound like tooling, but it matters to the research agenda. Without infrastructure, every paper on self-improving agents invents its own loop, evaluation convention, and data collection process. Then everyone gets numbers, and nobody knows which numbers mean anything. A familiar ceremony.
MetaGym makes ICT-style experimentation easier to reproduce and extend. It supports ALFWorld and MiniHack in this paper, but the concept is broader: construct a meta-environment where an agent’s “action” is not a physical move but an update to the instruction policy that governs future action.
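To make the idea concrete, here is a minimal sketch of a meta-environment in which the meta-action is a system prompt and the meta-observation is a batch of rollouts. It is not MetaGym's actual API: `PromptMetaEnv`, `ToyEnv`, `Rollout`, and `toy_policy` are hypothetical names, and the inner environment is a one-step toy rather than ALFWorld or MiniHack.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ToyEnv:
    """Tiny Gym-like task: the episode succeeds if the agent sends the target action."""

    def __init__(self, target: str) -> None:
        self.target = target

    def reset(self) -> str:
        return f"goal: {self.target}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        return "done", float(action == self.target), True


@dataclass
class Rollout:
    observations: List[str]
    actions: List[str]
    reward: float


class PromptMetaEnv:
    """Hypothetical wrapper: meta-action = system prompt, meta-observation = rollouts."""

    def __init__(self, envs: List[ToyEnv], policy: Callable[[str, str], str]) -> None:
        self.envs = envs
        self.policy = policy  # maps (system_prompt, observation) -> action

    def step(self, system_prompt: str) -> Tuple[List[Rollout], float]:
        rollouts = []
        for env in self.envs:  # a real harness would run these in parallel
            obs = env.reset()
            action = self.policy(system_prompt, obs)
            next_obs, reward, _done = env.step(action)
            rollouts.append(Rollout([obs, next_obs], [action], reward))
        mean_reward = sum(r.reward for r in rollouts) / len(rollouts)
        return rollouts, mean_reward


def toy_policy(system_prompt: str, observation: str) -> str:
    # Toy rule (assumption): the actor pursues the goal only if the prompt says so.
    if "follow the goal" in system_prompt:
        return observation.split(": ", 1)[1]
    return "wait"


meta_env = PromptMetaEnv([ToyEnv("open door"), ToyEnv("read scroll")], toy_policy)
_, weak_reward = meta_env.step("Do nothing risky.")
rollouts, strong_reward = meta_env.step("Always follow the goal.")
print(weak_reward, strong_reward, rollouts[0].actions)
```

The design choice worth noticing is the interface: once a prompt is an action and rollouts are an observation, any reflector, trained or not, can be plugged in and compared on the same footing.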
For business teams, the equivalent would be an internal learning harness:
| MetaGym concept | Enterprise analogue |
|---|---|
| Actor rollout | Workflow execution trace, ticket transcript, tool-call log |
| System prompt | Operating policy, agent role instruction, escalation rule |
| Reflector | Postmortem model or review agent |
| Meta-observation | Batch of recent successes and failures |
| Validation tasks | Held-out cases, simulation suite, golden-process tests |
| Meta-turn | Review cycle that updates the operating manual |
This is where the paper becomes more than a benchmark result. It offers a vocabulary for agent improvement pipelines. The real enterprise question is not whether to copy MetaGym directly. It is whether AI systems should be evaluated at the level of individual task performance or at the level of improvement loops. The second level is harder. It is also closer to how operational value accumulates.
What the paper directly shows
The paper directly shows four things.
First, ICT provides a structured way to test cross-task self-improvement. The actor gets one attempt at each task, with no retries on the same instance. The reflector must convert experience into a prompt that helps on future tasks. This is a better match for non-repeatable operational settings than repeated self-correction.
Second, trained reflectors outperform an untrained Qwen2.5-7B-Instruct reflector on most held-out task families in ALFWorld and MiniHack. The evidence supports the claim that reflection can be trained as a meta-skill, at least under these benchmark conditions.
Third, performance often continues to improve beyond the four-turn horizon used in dataset construction. This suggests that the trained reflector can participate in a recursive improvement loop rather than simply producing a one-shot prompt edit.
Fourth, cross-benchmark transfer appears possible in some settings, especially with the $k=5$ MiniHack-trained reflector evaluated on ALFWorld. This is suggestive evidence that the learned behavior may include general reflection skill, not only environment-specific prompt patches.
That is enough to make the paper interesting. It is not enough to make the result universal.
What Cognaptus infers for business use
The business interpretation is not “let agents rewrite themselves.” That is the kind of sentence that keeps security teams employed.
A more reasonable inference is: operational traces can become reusable process instructions if the learning loop is designed explicitly for transfer.
Most companies already collect logs. Support transcripts, CRM notes, task exceptions, tool-call histories, and failed automation runs are everywhere. The problem is that logs are usually treated as records, not training material for better procedures. A human team may review them during incident analysis, but the output is often scattered: a Slack warning, a playbook edit, a meeting note, a new checklist line that may or may not be used.
ICT points toward a more disciplined loop:
- Collect trajectory evidence from agent behavior.
- Diagnose how the current instruction produced success or failure.
- Rewrite the operating instruction.
- Test the new instruction on held-out or simulated cases.
- Keep the best-performing instruction only if it improves transfer.
The ROI pathway is not mysterious. Better instructions reduce repeated failures, lower human escalation load, and shorten the time between incident discovery and process improvement. The catch is that the validation layer must exist. Without it, self-improvement becomes self-confident prompt drift. Very efficient, very dangerous, very on brand for badly governed automation.
The boundary conditions are not footnotes; they define the deployment gap
The authors state two practical limitations directly. First, they train only the reflector while keeping the actor frozen, which isolates prompt evolution but requires maintaining separate actor and reflector models. That is computationally expensive. Their experiments used four NVIDIA GH200 Grace Hopper Superchips, with each reflector trained for about 20 hours and total training and evaluation across experiments requiring about 168 node hours.
Second, ICT uses a fixed held-out task set to identify the best-performing prompt. Real operations may not have a clean validation set. A company can build test suites and replay simulations, but many high-value workflows have shifting distributions, ambiguous success criteria, and delayed outcomes. The paper mentions lifelong learning as one possible direction, but long-term prompt stability remains unresolved.
There are also interpretation boundaries beyond the explicit limitations.
The actor is Qwen2.5-7B-Instruct. The environments are ALFWorld and MiniHack. These are useful agentic benchmarks, not full business processes. Success rates are task-specific and should not be translated into enterprise productivity percentages. The reflector writes system prompts, not database schema changes, API policies, compliance decisions, or human staffing plans. The loop improves instructions inside a controlled experimental harness. Production environments add adversarial users, messy tools, conflicting objectives, privacy constraints, and boring things like audit committees.
That does not weaken the research contribution. It prevents category error. The paper is evidence for an architecture of learning, not a turnkey deployment pattern.
The real contribution is a better unit of analysis
The easiest way to misread this paper is to focus on the prompt. The prompt is visible, so it attracts attention. But the deeper unit of analysis is the improvement loop.
A static prompt asks: “What instruction works now?”
An ICT-style loop asks: “What process turns experience into a better instruction for cases we have not seen yet?”
That second question is far more important for agentic AI. As agents move from answering questions to operating workflows, the hard part is not a single perfect prompt. The hard part is maintaining a controlled learning process: what evidence is collected, what gets updated, how updates are validated, when adaptation stops, and how humans inspect the change.
This is where businesses should pay attention. Not because ALFWorld robots are about to reorganize customer operations. They are not. But because the paper gives a precise form to an emerging management problem: agents will generate traces; traces will contain lessons; organizations will need a governed way to convert those lessons into better behavior.
The companies that treat this as “memory” will store more context and wonder why failures repeat in slightly different forms. The companies that treat it as “training from experience” will build feedback loops, validation suites, and review mechanisms around instruction updates.
The difference is small in vocabulary and large in consequence. Memory preserves the past. Learning changes the next action.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yuval Shalev, Zifeng Ding, and Mateja Jamnik, “Training Language Agents to Learn from Experience,” arXiv:2605.20477, 2026, https://arxiv.org/abs/2605.20477.