Failure is usually treated as waste. The demo breaks, the agent apologises, someone adds a prompt patch, and everyone pretends the next retry will be more mature. Very enterprise. Very ceremonial.

The SaMuLe paper makes a more useful claim: failed agent runs are not just embarrassing logs. They are the curriculum.[1] More precisely, they are raw material for a structured reflection pipeline that turns messy trajectories into error taxonomies, cross-task lessons, and finally a small retrospective model trained to diagnose future failures.

That distinction matters. The paper is not saying, “Ask the model to reflect harder.” It is saying that reflection only becomes useful when it has a supply chain: reference-guided diagnosis, error classification, cross-trajectory clustering, merged feedback, and supervised fine-tuning into a dedicated model. The lesson for builders is less mystical than the word “self-learning” suggests. Agents do not improve because they gaze inward. They improve because someone has built a better error-accounting system.

The real unit of learning is the failed trajectory

Most agent reflection systems follow a simple loop. The agent tries a task. If it fails, another prompt asks it what went wrong. The resulting “reflection” is appended to the next attempt.

This can work on small tasks. It also produces a lot of premium-grade corporate mush: “be careful with constraints”, “verify the final answer”, “plan step by step”. All technically correct. All operationally anaemic.

SaMuLe starts from a sharper observation. In complex planning tasks, success can be rare, but failure is abundant. If a method depends on successful trajectories, it is trying to learn from the scarce part of the distribution. That is tolerable when the agent already succeeds often. It is a poor strategy when the base agent keeps violating budgets, dates, policies, locations, and tool constraints.

The paper’s target is exactly that high-error regime: travel planning, natural-language planning, and interactive customer-service-style tasks where agents must satisfy multiple constraints over long trajectories. The mechanism is built around the idea that a failed run contains several types of information:

| Failure signal | What it can teach | Why generic reflection misses it |
| --- | --- | --- |
| A wrong step in one trajectory | The local correction needed for the next attempt | The model may not know which step caused the failure |
| Repeated errors across trials of the same task | A task-specific failure pattern | Single-run reflection sees each mistake in isolation |
| Similar errors across different tasks | A transferable rule | Prompt memory rarely clusters failures by type |
| Mismatch between predicted and actual user response | A live reason to change course | Post-hoc reflection arrives too late |

That is the paper’s core move. It does not merely store experience. It classifies experience.

SaMuLe turns reflection into a three-level production line

The system has two main stages. First, it synthesises high-quality reflections from training trajectories. Then it fine-tunes a smaller retrospective model to generate those reflections at inference time, when reference answers are no longer available.

The synthesis stage has three levels.

Micro-level reflection looks at a single failed trajectory. The agent’s attempt is compared with a valid reference output. The reference is not treated as the only possible solution; it is a diagnostic aid. The goal is to identify concrete errors and produce a concise corrective plan.

Meso-level reflection looks across multiple trials of the same task. Here the system builds an error taxonomy and labels actions with error types. In TravelPlanner, example taxonomy entries include budget constraint violations, accommodation minimum-stay violations, restaurant selection errors, invalid location selection, travel-time planning errors, and policy violations. This is where vague “do better” feedback becomes typed failure analysis.

Macro-level reflection groups trajectories from different tasks that share the same error type. The system then synthesises broader reflections for those clusters. A budget error in one itinerary and a budget error in another become evidence for a reusable planning principle, not just two isolated disappointments.
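The hand-off from meso-level labels to macro-level clusters can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `LabeledTrajectory`, `cluster_by_error_type`, `cross_task_clusters`, the `min_tasks` threshold, and the error-type strings are all hypothetical names modelled on the taxonomy entries mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledTrajectory:
    """A failed trial plus the error types assigned at the meso level (hypothetical schema)."""
    task_id: str
    trial: int
    error_types: frozenset


def cluster_by_error_type(trajectories):
    """Group failed trajectories under every error type they were labelled with."""
    clusters = defaultdict(list)
    for traj in trajectories:
        for error_type in sorted(traj.error_types):
            clusters[error_type].append(traj)
    return dict(clusters)


def cross_task_clusters(clusters, min_tasks=2):
    """Keep error types seen in at least `min_tasks` distinct tasks (assumed threshold)."""
    return {
        error_type: members
        for error_type, members in clusters.items()
        if len({m.task_id for m in members}) >= min_tasks
    }


# Invented example data: three tasks, four failed trials.
failed = [
    LabeledTrajectory("trip-01", 1, frozenset({"budget_violation", "min_stay_violation"})),
    LabeledTrajectory("trip-01", 2, frozenset({"budget_violation"})),
    LabeledTrajectory("trip-02", 1, frozenset({"budget_violation"})),
    LabeledTrajectory("trip-03", 1, frozenset({"invalid_location"})),
]

macro_inputs = cross_task_clusters(cluster_by_error_type(failed))
for error_type, members in macro_inputs.items():
    tasks = sorted({m.task_id for m in members})
    print(f"{error_type}: {len(members)} trajectories across tasks {tasks}")
```

In this toy data only the budget error recurs across different tasks, so it is the only cluster that would be passed on for a cross-task reflection. The other two error types stay task-local.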

Finally, SaMuLe merges the micro, meso, and macro reflections into a final target reflection for each trajectory. That merged reflection becomes training data for a retrospective model. The paper uses Qwen 2.5 3B, trained with LoRA and DeepSpeed, while Claude 3.5 Sonnet-v2 is used as the actor and reflection synthesiser in the main setting.

The practical architecture looks like this:

| Pipeline step | Technical role | Operational consequence |
| --- | --- | --- |
| Log failed trajectories | Preserve the agent’s actual reasoning and actions | Makes failure inspectable rather than anecdotal |
| Compare with a valid reference | Locate concrete deviations | Reduces hallucinated blame |
| Build an error taxonomy | Standardise failure language | Enables measurement and repeated correction |
| Cluster same-typed errors across tasks | Extract reusable patterns | Turns local mistakes into domain playbooks |
| Merge reflections | Combine specific and general feedback | Avoids both overfitting and empty generality |
| Fine-tune a retrospective model | Generate trajectory-specific reflections at inference | Removes the need for references during deployment |
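The last two rows, merging reflections and building fine-tuning data, might look roughly like the sketch below. It rests on several assumptions: `ReflectionBundle`, `merge_reflections`, and `to_sft_record` are hypothetical names, the prompt/completion record layout is a common convention rather than the paper's format, and plain concatenation stands in for whatever merge procedure the authors actually use.

```python
import json
from dataclasses import dataclass


@dataclass
class ReflectionBundle:
    """Reflections synthesised for one failed trajectory (hypothetical schema)."""
    trajectory_text: str
    micro: str  # single-trajectory diagnosis, written with a reference available
    meso: str   # task-level pattern drawn from the error taxonomy
    macro: str  # cross-task lesson for the same error type


def merge_reflections(bundle):
    """Assumed stand-in for the merge step: join non-empty levels, most specific first."""
    parts = [bundle.micro, bundle.meso, bundle.macro]
    return "\n".join(p.strip() for p in parts if p.strip())


def to_sft_record(bundle):
    """Build one prompt/completion pair for supervised fine-tuning.

    The prompt holds only the trajectory, never the reference, because the
    retrospective model must work without references at inference time.
    """
    return {
        "prompt": "Diagnose this failed trajectory:\n" + bundle.trajectory_text,
        "completion": merge_reflections(bundle),
    }


# Invented example content.
bundle = ReflectionBundle(
    trajectory_text="Step 3: booked a hotel for one night. Step 7: total cost over budget.",
    micro="The hotel requires a two-night minimum; book two nights or pick another hotel.",
    meso="This task repeatedly fails on minimum-stay rules.",
    macro="Check every accommodation constraint before fixing the itinerary.",
)
record = to_sft_record(bundle)
print(json.dumps(record))  # one JSONL line per failed trajectory
```

The design point worth noticing is the asymmetry: references shape the completion during synthesis, but they never appear in the prompt the retrospective model learns from.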

This is why the paper is more interesting than another “LLM agents can self-improve” headline. SaMuLe’s improvement does not come from a more dramatic agent loop. It comes from treating error analysis as training data engineering.

The misconception: reflection is not introspection

The most tempting misread is that SaMuLe proves models can simply think about their mistakes and improve. That is not what the method shows.

SaMuLe’s strongest reflections are not spontaneous inner monologues. They are produced through a structured, reference-guided synthesis process. During training, the system uses valid reference outputs to help diagnose failed trajectories. It builds and applies an error taxonomy. It clusters failures by type. Then it trains a separate retrospective model to imitate the resulting reflections.

That matters because raw self-reflection is fragile. A model asked to diagnose its own failure may blame irrelevant factors, especially when the task is long and constraint-heavy. The paper’s qualitative appendix makes this concrete. In a TravelPlanner example, the failed plan selected accommodation requiring a two-night minimum while scheduling only one night, and the total cost exceeded the $1,900 budget. SaMuLe’s reflection identified those two actual problems. The Retroformer variant instead focused on issues such as geographic coordination and meal scheduling, which were not the implicated errors in that output.

That example is not the main evidence, but it explains the mechanism. The advantage is diagnostic precision. Better reflection is not more verbose reflection. It is reflection that points to the right failure mode.

The main results show a pattern, not just a leaderboard

The paper evaluates SaMuLe on TravelPlanner, NATURAL PLAN, and Tau-bench. The benchmarks differ enough to make the evidence useful: TravelPlanner stresses constraint-heavy itinerary planning; NATURAL PLAN tests structured planning over real-world tool-derived information; Tau-bench tests task completion in simulated user-agent interactions with APIs and policies.

The most important results are not just that SaMuLe wins. The useful pattern is where it wins and against whom.

| Test | Likely purpose | Key result | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| TravelPlanner with Claude 3.5 Sonnet-v2 | Main evidence on a difficult planning benchmark | SaMuLe reaches 20.00% pass rate vs ReAct 4.44%, Reflexion 5.56%, Inter-task Error Reflection 9.44%, Expel 0.00%, Retroformer variant 12.78% | Multi-level, failure-centric reflection helps most when ordinary reflection struggles | Production readiness for open-ended travel agents |
| NATURAL PLAN Trip / Meeting | Main evidence on structured planning | SaMuLe reaches 60.31% / 48.50% vs Reflexion 50.00% / 40.50% | Gains transfer beyond TravelPlanner | Robustness across all planning domains |
| TravelPlanner with Claude 3.7 Sonnet | Robustness across actor model | SaMuLe reaches 29.44% vs Retroformer variant 21.67% and Reflexion 16.67% | The mechanism is not tied to one actor model | Full model-family generality |
| Tau-bench non-interactive and interactive | Extension to dialogue-style agents | In non-interactive Retail/Airline, SaMuLe reaches 87.83% / 66.00%; in interactive mode, 75.97% / 55.32% | Reflection helps both after complete trajectories and during interaction | That online reflection is always better than multiple offline retries |
| Reference-placement ablation | Ablation on where references help | Reference at micro only: 20.00%; no reference: 18.33%; reference at micro + meso: 15.56% | References help local diagnosis but can over-constrain taxonomy formation | That more ground truth is always better |
| Error reduction analysis | Mechanism validation | SaMuLe error reduction: 0.67 / 0.73 / 0.53 vs Reflexion 0.13 / 0.42 / 0.22 on TravelPlanner, NATURAL PLAN Trip, NATURAL PLAN Meeting | Reflections reduce identified error types, not just improve headline scores | That all remaining errors are captured by the taxonomy |

The TravelPlanner numbers are the clearest demonstration of the paper’s thesis. Reflexion barely improves over ReAct. Expel collapses to zero. The Retroformer variant improves but remains below SaMuLe. That ordering is informative because the methods differ in what they learn from.
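To make that ordering concrete, the gaps can be worked out directly from the TravelPlanner pass rates quoted in the table above. The snippet is plain arithmetic on those reported figures; the variable names are just for illustration.

```python
# TravelPlanner pass rates (%) with Claude 3.5 Sonnet-v2, as quoted in the results table.
pass_rate = {
    "ReAct": 4.44,
    "Reflexion": 5.56,
    "Inter-task Error Reflection": 9.44,
    "Expel": 0.00,
    "Retroformer variant": 12.78,
    "SaMuLe": 20.00,
}

samule = pass_rate["SaMuLe"]

# Absolute gap, in percentage points, between SaMuLe and each baseline.
gaps = {
    name: round(samule - rate, 2)
    for name, rate in pass_rate.items()
    if name != "SaMuLe"
}

# How much plain Reflexion adds on top of ReAct.
reflexion_gain = round(pass_rate["Reflexion"] - pass_rate["ReAct"], 2)

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"SaMuLe vs {name}: +{gap:.2f} points")
print(f"Reflexion vs ReAct: +{reflexion_gain:.2f} points")
```

Reflexion adds 1.12 points over ReAct, while SaMuLe sits 7.22 points above the closest baseline, the Retroformer variant. Ratios are deliberately left out because Expel's 0.00% makes a relative comparison undefined.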

Expel depends heavily on successful or paired success-failure trajectories. In a hard setting where success is scarce, that becomes a weakness. The appendix examples show Expel producing broad TravelPlanner advice focused mainly on cost management, missing other important error classes such as accommodation policy violations and travel-time planning errors. SaMuLe, by contrast, treats the failed attempts themselves as the dataset.

The comparison with the Retroformer variant is also important. Retroformer-style methods use more complex reinforcement-learning machinery. For this baseline, the paper replaces PPO with DPO for memory efficiency, because long TravelPlanner trajectories can exceed 10,000 tokens; it reports that the DPO variant is comparable to original Retroformer on HotPotQA, with 44% versus 43% success. That appendix result is a comparison with prior work, not the paper’s central claim. Its purpose is to justify using the DPO variant as the practical baseline under long-context constraints.

The larger lesson is blunt: if your reflection data is poor, a fancier optimiser mostly learns to be wrong with better posture.

Static reflections are useful, but trajectory-specific reflections are better

One especially useful comparison is between SaMuLe and “Inter-task Error Reflection.” Both use cross-task error information. The difference is that Inter-task Error Reflection applies static, precomputed error-type reflections, while SaMuLe trains a model to generate a reflection for the current trajectory.

That distinction shows up in the numbers. On TravelPlanner, Inter-task Error Reflection reaches 9.44%, while SaMuLe reaches 20.00%. On NATURAL PLAN Trip, it is 51.56% versus 60.31%. On NATURAL PLAN Meeting, 42.00% versus 48.50%.

The operational interpretation is simple. Static playbooks help, but they are blunt. A general “check accommodation policies” rule is useful. A trajectory-specific reflection that says the chosen hotel violates a two-night minimum stay, and that the cost exceeds the stated budget, is more useful.
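The contrast can be shown with a toy example. In the sketch below, a static playbook returns the same sentence for every failure, while a trajectory-specific check names the actual violation. The rule-based function is only a stand-in for a trained retrospective model, and `Stay`, `STATIC_PLAYBOOK`, `trajectory_specific_feedback`, the hotel name, and every cost figure except the $1,900 budget are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stay:
    hotel: str
    nights_booked: int
    min_nights: int
    cost: float


# Static reflections: one fixed sentence per error type, regardless of the trajectory.
STATIC_PLAYBOOK = {
    "min_stay_violation": "Check accommodation policies.",
    "budget_violation": "Keep total cost within the stated budget.",
}


def trajectory_specific_feedback(stays, other_costs, budget):
    """Rule-based stand-in for a retrospective model: name the concrete violations."""
    feedback = []
    for stay in stays:
        if stay.nights_booked < stay.min_nights:
            feedback.append(
                f"{stay.hotel} requires {stay.min_nights} nights "
                f"but only {stay.nights_booked} booked."
            )
    total = sum(s.cost for s in stays) + other_costs
    if total > budget:
        feedback.append(
            f"Total cost {total:.0f} exceeds budget {budget:.0f} by {total - budget:.0f}."
        )
    return feedback


stays = [Stay(hotel="Hotel A", nights_booked=1, min_nights=2, cost=450.0)]
feedback = trajectory_specific_feedback(stays, other_costs=1600.0, budget=1900.0)

print("Static: ", STATIC_PLAYBOOK["min_stay_violation"])
for line in feedback:
    print("Specific:", line)
```

The static line is true but leaves the agent to rediscover which hotel and which number are wrong. The specific lines hand it the repair.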

This is where the paper becomes relevant to enterprise systems. Many companies already have policy documents, escalation playbooks, and best-practice prompts. Those are static reflections. SaMuLe suggests a more adaptive layer: keep the taxonomy, but train a model that maps the live trajectory to the relevant subset of lessons.

In business language, this is not “agent consciousness.” It is targeted exception handling.

The reference ablation is the quiet warning label

The most interesting ablation asks when the reference output should be added during reflection synthesis. The answer is not “everywhere.”

On TravelPlanner, providing no reference gives an 18.33% pass rate. Providing the reference at both the micro level (Single-Trajectory Learning) and the meso level (Intra-Task Learning) drops performance to 15.56%. Providing the reference only at the micro level gives the best result: 20.00%.

That is a small table with a large implication. A reference plan helps when the system needs to diagnose a concrete failed step. But at the taxonomy-building level, too much reference exposure can narrow the model’s attention to one valid plan. In planning tasks, one reference output is not the only correct solution. Overusing it can turn error analysis into imitation.
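One way to encode that finding is as an explicit policy for which synthesis levels may see the reference. The sketch below is an assumption about how such a switch could be wired, not the paper's code: `ReferencePolicy`, `ABLATION`, and `build_prompt` are hypothetical names, and only the three pass rates come from the ablation quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferencePolicy:
    """Which synthesis levels are allowed to see the reference output."""
    micro: bool
    meso: bool


# TravelPlanner pass rates (%) from the reference-placement ablation.
ABLATION = {
    ReferencePolicy(micro=False, meso=False): 18.33,
    ReferencePolicy(micro=True, meso=False): 20.00,
    ReferencePolicy(micro=True, meso=True): 15.56,
}

best = max(ABLATION, key=ABLATION.get)


def build_prompt(level, trajectory, reference, policy):
    """Assemble a synthesis prompt, adding the reference only where the policy allows."""
    if level not in ("micro", "meso"):
        raise ValueError(f"unknown level: {level}")
    sections = [f"[{level}] Failed trajectory:\n{trajectory}"]
    if getattr(policy, level):
        sections.append(f"Valid reference (one of possibly many):\n{reference}")
    return "\n\n".join(sections)


print(best)
print(build_prompt("micro", "booked one night", "book two nights", best))
print(build_prompt("meso", "booked one night", "book two nights", best))
```

With the best-performing policy, the micro prompt carries the reference and the meso prompt does not, which keeps taxonomy building anchored to the agent's errors and not to one valid plan.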

For enterprise deployment, this is a useful constraint. If you are building a reflection system for claims processing, travel operations, compliance workflows, or customer support, gold examples should be treated as diagnostic instruments, not as sacred templates. The system should ask: “What constraint did the agent violate?” not “Why did the agent fail to copy this one historical resolution?”

That difference determines whether the taxonomy generalises.

Foresight reflection moves the method into live interaction

The non-interactive setup is familiar: complete a task, observe failure, generate reflection, retry. Real enterprise agents often do not get that luxury. A support bot cannot always run four full conversations with the same customer and choose the best one. Humans are famously unwilling to sit through controlled ablation studies disguised as customer service.

SaMuLe extends its framework with foresight-based reflection. At each turn, the agent predicts the user’s response. When the actual response diverges meaningfully from the prediction, it triggers a reflection step and adds the generated feedback into the ongoing interaction.
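The trigger logic can be sketched as follows. How the paper decides that a response "diverges meaningfully" is not described here, so the token-overlap measure and the 0.3 threshold are placeholder assumptions, and `jaccard_similarity`, `should_reflect`, `run_turn`, and `reflect_fn` are hypothetical names. A production system would more plausibly use a model-based judgement.

```python
def jaccard_similarity(a, b):
    """Token-set overlap between two utterances (a crude, assumed divergence proxy)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def should_reflect(predicted, actual, threshold=0.3):
    """Trigger reflection when the actual reply looks unlike the predicted one."""
    return jaccard_similarity(predicted, actual) < threshold


def run_turn(predicted, actual, reflect_fn, context):
    """Append generated feedback to the dialogue context only when a mismatch fires."""
    if should_reflect(predicted, actual):
        return context + [reflect_fn(predicted, actual)]
    return context


# Hypothetical reflection generator; in SaMuLe this role belongs to the retrospective model.
def reflect_fn(predicted, actual):
    return f"Expected '{predicted}' but user said '{actual}'. Re-check the user's goal."


context = []
context = run_turn("yes please book the flight", "yes please book it", reflect_fn, context)
context = run_turn(
    "yes please book the flight",
    "no I wanted a refund not a new flight",
    reflect_fn,
    context,
)
print(context)
```

Only the second turn fires, because the user's reply shares almost nothing with the prediction. The first turn passes through untouched, which is the point: reflection costs tokens, so it should be spent on surprises.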

The Tau-bench results support the extension, with SaMuLe outperforming ReAct and Reflexion in both non-interactive and interactive settings. But the numbers should be read carefully. SaMuLe’s non-interactive three-trial performance is higher than its interactive performance: 87.83% versus 75.97% in Retail, and 66.00% versus 55.32% in Airline. That is not a failure of the method. It reflects the difference between getting multiple full attempts and having to recover mid-conversation.

The business inference is narrower and more useful: live reflection can improve adaptive dialogue, but it is not a magic retry without the retry cost. It is a recovery mechanism. It should be evaluated on fewer derailments, faster correction, lower escalation rate, and fewer repeated policy violations, not only on final pass rate.

What Cognaptus infers for builders

The paper directly shows that multi-level, failure-centric reflection improves benchmark performance against several reflection baselines. It also shows that a compact SFT-trained retrospective model can outperform a more complex RL-flavoured retrospective baseline when the reflection data is better structured.

The business inference is that failed workflows can become a reusable learning asset. Not merely a monitoring dashboard. Not a graveyard of stack traces. A training substrate.

A practical enterprise version would look like this:

  1. **Instrument agent trajectories.** Log actions, observations, tool calls, user messages, constraint checks, and final outcome. Without step-level traces, “reflection” is just theatre.

  2. **Define a domain error taxonomy.** Start small: policy violation, missing information, invalid tool parameter, budget breach, timing conflict, eligibility mismatch, location mismatch, unsupported user request.

  3. **Label failures at the action level.** Attach error types and short critiques to the steps that caused downstream failure.

  4. **Cluster errors across cases.** Identify repeated failure classes across customers, workflows, and task variants.

  5. **Generate operational reflections.** Keep them concise: what failed, why it failed, what invariant must be checked next time.

  6. **Train or distil a retrospective model.** The model should map the current trajectory to targeted feedback. It should not dump the entire playbook into context like a nervous consultant.

  7. **Wire reflection into control flow.** A reflection should trigger concrete behaviour: rerun a validation tool, ask a clarification question, change a plan, escalate, or stop.
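Steps 1 to 3 and step 7 can be sketched together as a small data model plus a dispatch table. Everything here is an illustrative assumption: the `Step` schema, the four-entry `ErrorType` subset, the `Action` names, and especially the `RESPONSE` mapping, which any real deployment would need to define per domain.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    """A deliberately small starting taxonomy (subset of the list above)."""
    POLICY_VIOLATION = "policy_violation"
    MISSING_INFORMATION = "missing_information"
    INVALID_TOOL_PARAMETER = "invalid_tool_parameter"
    BUDGET_BREACH = "budget_breach"


class Action(Enum):
    RERUN_VALIDATION = "rerun_validation"
    ASK_CLARIFICATION = "ask_clarification"
    REPLAN = "replan"
    ESCALATE = "escalate"


@dataclass
class Step:
    """One logged agent step with an optional action-level label (hypothetical schema)."""
    index: int
    action: str
    observation: str
    error_type: Optional[ErrorType] = None
    critique: str = ""


# Assumed mapping from error type to control-flow response; tune per domain.
RESPONSE = {
    ErrorType.POLICY_VIOLATION: Action.ESCALATE,
    ErrorType.MISSING_INFORMATION: Action.ASK_CLARIFICATION,
    ErrorType.INVALID_TOOL_PARAMETER: Action.RERUN_VALIDATION,
    ErrorType.BUDGET_BREACH: Action.REPLAN,
}


def next_actions(steps):
    """Return the distinct actions implied by labelled steps, in first-seen order."""
    seen, actions = set(), []
    for step in steps:
        if step.error_type is None:
            continue
        action = RESPONSE[step.error_type]
        if action not in seen:
            seen.add(action)
            actions.append(action)
    return actions


trace = [
    Step(1, "search_flights", "3 results"),
    Step(2, "book_hotel", "confirmed", ErrorType.BUDGET_BREACH, "Hotel pushes total over budget."),
    Step(3, "issue_refund", "done", ErrorType.POLICY_VIOLATION, "Refund outside policy window."),
    Step(4, "book_dinner", "confirmed", ErrorType.BUDGET_BREACH, "Dinner adds further overspend."),
]
print([a.value for a in next_actions(trace)])
```

For this invented trace the dispatcher asks for a replan and an escalation, once each, even though the budget breach was labelled twice. The value of the explicit mapping is that a reflection ends in a behaviour the orchestrator can execute, not just a sentence in the context window.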

The ROI logic is not just higher task success. It is cheaper diagnosis. If a system can classify repeated failures and generate targeted corrective instructions, teams spend less time manually reading logs and more time fixing the actual failure modes. That is where “self-learning” becomes operationally meaningful.

Where the result should not be over-sold

The paper is promising, but its boundaries matter.

First, the results are benchmark results. TravelPlanner, NATURAL PLAN, and Tau-bench are useful tests, but they are not messy production deployments with adversarial users, partial observability, shifting policies, and changing databases.

Second, the TravelPlanner setup uses the sole-planning setting, where relevant background information is provided directly. The authors avoid the two-stage setting because trajectories are already long, sometimes above 10,000 tokens, and the more tool-heavy version would be computationally expensive. That is a reasonable experimental choice, but it means the result is not a full demonstration of open-ended tool-search planning.

Third, the synthesis pipeline uses references during training. The paper’s own ablation shows reference placement is delicate. Many enterprise workflows do not have clean reference outputs, and when they do, those references may encode old policy, human inconsistency, or one valid resolution among many.

Fourth, the error taxonomy is static at inference. The authors note this limitation directly. A static taxonomy can drift as products, policies, tools, and customer behaviour change. In production, taxonomy refresh has to be part of the system, not a quarterly archaeological dig.

Fifth, the offline synthesis cost is real. The final retrospective model may be lightweight, but creating the reflection dataset requires trajectory generation, reference comparison, taxonomy construction, classification, clustering, and summarisation. That is not free. It may still be cheaper than repeatedly fine-tuning large agents or manually debugging every failure, but it belongs in the cost model.

The strategic lesson: build the error system before the agent learns the error

SaMuLe is useful because it reframes agent improvement as an information architecture problem. The agent does not simply need more attempts. It needs better labels for what went wrong, better abstractions over repeated failures, and a way to inject the right lesson at the right moment.

That is also the likely direction for serious enterprise agents. The next step is not a bot that says “I will reflect on this” with greater emotional range. The next step is a failure intelligence layer: taxonomies, trajectory diagnostics, reflection models, and live triggers.

The uncomfortable part is that this sounds less glamorous than autonomous self-improvement. Good. Glamour is usually where the budget goes to hide.

If SaMuLe’s thesis holds beyond benchmarks, the winners will not be the teams with the longest prompts or the most theatrical agent loops. They will be the teams that turn their failures into structured data before their competitors finish writing another “be more careful” instruction.

**Cognaptus: Automate the Present, Incubate the Future.**


  1. Yubin Ge, Salvatore Romeo, Jason Cai, Monica Sunkara, and Yi Zhang, “SaMuLe: Self-Learning Agents Enhanced by Multi-level Reflection,” arXiv:2509.20562, 2025. https://arxiv.org/abs/2509.20562