Sink or Skill: Why Agent Experience Needs Governance

TL;DR for operators

AI agents do not become useful by remembering everything. That is not intelligence; it is a data landfill with a chatbot interface.

Two recent arXiv papers, one on medical reasoning agents and one on physically based swimming control, make a shared operational point from very different directions. SkeMex shows how a medical agent can improve after deployment by converting interaction trajectories into structured, evaluated, and governed clinical skills.¹ SWIM shows how a simulated swimmer can learn robust control from a single reference motion when body-fluid interaction is represented at the right level and scarce experience is sampled efficiently.²

The common lesson is not “agents need memory” or “one example is enough.” The sharper lesson is that experience must be engineered. Useful agent experience has to be captured selectively, compressed into reusable structure, scored by outcomes, retrieved under context, retired when harmful, and tested outside the cosy little garden where it was born.

For business leaders, this points to a new operating discipline: experience operations. The advantage will not go to teams that merely store every trace, prompt, tool call, transaction, and customer case. It will go to teams that know which parts of experience are reusable, which are misleading, and where reuse stops being competence and starts becoming liability.

The problem now: agents are leaving the lab and meeting reality

The first wave of enterprise AI treated models as clever answer engines. Feed them a document, ask a question, enjoy the synthetic confidence. That era is not over, regrettably, but the target has moved. Businesses now want AI systems that handle multi-step workflows, gather evidence, operate tools, learn from prior cases, adapt to changing settings, and avoid repeating the same mistakes with artisanal enthusiasm.

That changes the problem. In a static benchmark, the model answers and the evaluator scores. In an operational setting, the system accumulates experience. A support agent sees repeated edge cases. A clinical assistant sees similar but not identical patient presentations. A robotic or simulated controller discovers that a behaviour working in one environment fails when the medium, load, target, or body changes. The issue is no longer only model capability. It is whether prior interaction improves future behaviour without poisoning it.

This is where the two papers form a useful chain. SkeMex lives at the symbolic reasoning layer: it asks how a medical agent can turn clinical trajectories into reusable procedural memory without updating the base model. SWIM lives at the embodied control layer: it asks how a physically simulated swimmer can learn stable behaviour from almost no motion data while interacting continuously with a dynamic fluid environment. Medicine and swimming are not the same domain, unless your hospital has a very unusual triage protocol. But the systems face the same operational structure: interaction is expensive, raw experience is noisy, and generalization depends on selecting the right abstraction.

The shared logic chain

The relationship between the two papers is best understood as a complementary logic chain, not as a pair of parallel summaries.

Chain step	SkeMex contribution	SWIM contribution	Business translation
1. Static performance is insufficient	Clinical tasks require multi-step reasoning, tool use, evidence gathering, and revision.	Swimming requires continuous full-body control under fluid forces, not just imitation in air.	Benchmarks and demos do not prove operational maturity.
2. Experience is expensive	Medical feedback and clinical trajectories are too valuable to waste as raw logs.	Body-fluid simulation is costly, and full-body swimming motion data is scarce.	Every useful interaction should be treated as a scarce asset.
3. Raw traces are noisy	SkeMex distils trajectories into structured skills with triggers and clinical steps.	SWIM uses smoothed per-body force and torque rather than raw volatile fluid contacts.	Logs need abstraction before reuse.
4. Reuse must be selective	Skills are retrieved by semantic relevance, utility, and memory strength.	The replay buffer keeps informative successes and failures while evicting less useful middle cases.	More memory is not automatically better memory.
5. Generalization needs boundary tests	SkeMex tests in-domain, out-of-domain, online, and cross-backbone transfer.	SWIM tests unseen goals, trajectories, fluid properties, waves, perturbations, and body changes.	Transfer is a claim that must be stress-tested, not implied.
6. Deployment needs governance	The skill repository merges, matures, deprecates, and capacity-controls memories.	The policy still fails under severe body-geometry changes and strong environment shifts.	Experience systems need maintenance, not vibes.

That chain matters because businesses are building agent systems as if the hard part were giving the model more context. It is not. The hard part is deciding what experience deserves to survive.

SkeMex: memory as a governed skill repository

SkeMex starts from a very practical complaint: many medical agents either solve each case independently or store raw historical traces that are redundant, noisy, and poorly governed. That is a dangerous habit in medicine, where “similar case” is not the same as “safe precedent.” A prior reasoning path may be useful. It may also be a polished route to the wrong conclusion.

The paper’s answer is to treat interaction trajectories as sources of reusable procedural skills rather than as memories to replay wholesale. SkeMex does not update the backbone model. Instead, it attaches an evolving external skill repository to a ReAct-style medical agent. The repository is divided into three branches: general reasoning skills, task-level clinical skills, and action-level tool-use skills. That separation is important. A broad diagnostic strategy, a specialty-specific reasoning pattern, and a drug-interaction tool instruction should not compete as if they are interchangeable memories wearing different hats.

During inference, SkeMex retrieves skills at the episode level. The retrieval score combines semantic match, historical utility, and a temporal memory-strength signal. This is the first useful business lesson: relevance alone is too weak. A memory that sounds similar but has repeatedly produced poor outcomes is not a memory. It is a recurring incident report.
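A minimal sketch of that kind of scoring, assuming a weighted sum and an exponential decay for memory strength. The weights, the half-life, and every name here (`SkillRecord`, `memory_strength`, `retrieval_score`, `top_k`) are hypothetical illustrations; the paper's exact formula is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class SkillRecord:
    name: str
    similarity: float      # semantic match to the current case, in [0, 1]
    utility: float         # running estimate of past usefulness, in [0, 1]
    days_since_use: float  # time since the skill was last retrieved


def memory_strength(days_since_use: float, half_life_days: float = 30.0) -> float:
    """Assumed decay: strength halves every `half_life_days`."""
    return 0.5 ** (days_since_use / half_life_days)


def retrieval_score(skill: SkillRecord,
                    w_sim: float = 0.5,
                    w_util: float = 0.3,
                    w_mem: float = 0.2) -> float:
    """Weighted sum of the three signals (weights are illustrative)."""
    return (w_sim * skill.similarity
            + w_util * skill.utility
            + w_mem * memory_strength(skill.days_since_use))


def top_k(skills, k=2):
    """Return the k highest-scoring skills, best first."""
    return sorted(skills, key=retrieval_score, reverse=True)[:k]


skills = [
    # Looks more similar, but has a poor track record and has gone stale.
    SkillRecord("sounds_similar_but_fails", similarity=0.9, utility=0.1, days_since_use=60),
    # Slightly less similar, but has repeatedly helped and was used recently.
    SkillRecord("proven_workup", similarity=0.7, utility=0.9, days_since_use=5),
]
ranked = top_k(skills)
```

With these assumed weights, the less similar but historically useful skill ranks first, which is the behaviour that similarity-only retrieval cannot produce.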

After the agent completes a case, SkeMex uses a Read–Write–Assess–Govern lifecycle. It filters trajectories, distils useful patterns into skill drafts or patches, updates skill utilities from feedback, and periodically governs the repository by merging similar skills, promoting mature ones, deprecating low-utility entries, and enforcing capacity limits. In other words, it gives memory a lifecycle. That should sound obvious. It is also exactly the part most enterprise “memory layer” decks skip, presumably because “we built a semantic cache” looks nicer on a slide.
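The govern step described above can be sketched as a single periodic pass over the repository. This is an illustration under stated assumptions, not the paper's implementation: the thresholds, the `Skill` fields, and the use of an identical trigger string as a stand-in for "similar skills" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    trigger: str
    utility: float         # running outcome-based estimate, in [0, 1]
    uses: int              # how often the skill has been applied
    status: str = "draft"  # "draft", "mature", or "deprecated"


def govern(skills, promote_at=0.7, deprecate_below=0.3, min_uses=5, capacity=100):
    # 1. Merge: collapse entries with the same trigger (assumed stand-in
    #    for a real semantic-similarity check), averaging utility by use count.
    merged = {}
    for s in skills:
        if s.trigger in merged:
            m = merged[s.trigger]
            total = m.uses + s.uses
            if total:
                m.utility = (m.utility * m.uses + s.utility * s.uses) / total
            m.uses = total
        else:
            merged[s.trigger] = Skill(s.trigger, s.utility, s.uses, s.status)

    # 2. Promote or deprecate only skills with enough evidence.
    for s in merged.values():
        if s.uses >= min_uses:
            if s.utility >= promote_at:
                s.status = "mature"
            elif s.utility < deprecate_below:
                s.status = "deprecated"

    # 3. Enforce capacity: drop deprecated entries, keep the best of the rest.
    active = [s for s in merged.values() if s.status != "deprecated"]
    active.sort(key=lambda s: s.utility, reverse=True)
    return active[:capacity]


repo = [
    Skill("chest pain workup", utility=0.8, uses=4),
    Skill("chest pain workup", utility=0.9, uses=6),   # duplicate to merge
    Skill("stale dosing rule", utility=0.1, uses=10),  # should be deprecated
    Skill("new draft", utility=0.6, uses=1),           # too little evidence yet
]
kept = govern(repo, capacity=2)
```

In this toy run the duplicates merge into one mature skill, the low-utility rule is dropped, and the under-tested draft survives without being promoted.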

The experiments support the architectural point. In offline settings, SkeMex improves over ReAct and other memory baselines across multiple medical benchmarks. With DeepSeek-V3.2, the reported average improves from 48.20% for ReAct to 56.08% for SkeMex (+7.88 points); with Qwen3.6-Plus, from 48.63% to 59.22% (+10.59 points). In online evaluation, where memory updates over streaming tasks, SkeMex rises from 76.39% at the first epoch to 78.56% by the third (+2.17 points), while some baselines regress on particular settings after memory updates. That regression is the little gremlin hiding inside naïve experience reuse: learning from the past can make you worse if the past is stored badly.

The ablations are more interesting than the headline numbers. Removing buffer gating drops performance sharply. Removing maturation also hurts. The full three-branch memory outperforms partial branch combinations. The message is consistent: performance comes not merely from writing memories, but from filtering, structuring, validating, and ageing them properly.

SkeMex is not a clinical deployment certificate. The paper uses benchmarks and automated or rubric-based evaluation settings, not real hospital accountability. The authors also show a failure case where early tool-format errors consume the interaction budget and the agent converges prematurely on an insufficient diagnosis. That failure is useful because it shows where governed skills still do not remove the need for robust execution control. Memory can guide. It cannot magically refund wasted steps.

SWIM: experience efficiency in a body that cannot fake physics

SWIM looks unrelated at first glance. It is about physically based character swimming, not clinical reasoning. The method learns controllable swimming from a single reference motion using reinforcement learning, body-fluid simulation, and a structured state representation. If SkeMex is about remembering how to reason, SWIM is about learning how to move when the environment pushes back.

That difference is exactly why the pairing is useful. SWIM shows the same experience-economy logic in a domain where bad abstraction is immediately punished by physics. A swimmer cannot bluff its way through fluid dynamics with a confident paragraph.

The problem is hard for three reasons. First, full-body swimming requires coordinated motion under continuous body-water interaction. Second, motion data is scarce: the paper uses one freestyle and one butterfly motion, each around three to four seconds. Third, simulation is expensive. Pure trial-and-error is not attractive when every sample requires coupled body-fluid dynamics. Reality, as ever, has not read the GPU budget.

SWIM’s first major design choice is state abstraction. The authors test several environment-state representations, from no fluid state to total force/torque, raw per-body force/torque, lightly smoothed values, smoothed values, and a quantized latent representation. Their chosen representation, SmoothFT, uses smoothed per-body force and torque. That sits between two bad extremes. Too little environment state makes the policy insensitive to water. Too much raw force detail exposes learning to volatile contact dynamics. The useful signal is local, structured, and damped.
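To make "smoothed per-body force and torque" concrete, here is a small sketch that damps a volatile signal with an exponential moving average. The choice of filter, the `alpha` value, and the function name `smooth_forces` are assumptions for illustration; the smoothing the authors actually use is not specified here.

```python
def smooth_forces(frames, alpha=0.2):
    """Exponential moving average over per-body force/torque readings.

    frames: list of time steps; each time step is a list with one vector
            (a list of floats) per body part.
    alpha:  weight on the newest reading; smaller means heavier damping.
    """
    smoothed = []
    prev = None
    for frame in frames:
        if prev is None:
            # First time step: nothing to blend with yet.
            cur = [list(vec) for vec in frame]
        else:
            cur = [
                [alpha * x + (1 - alpha) * p for x, p in zip(vec, prev_vec)]
                for vec, prev_vec in zip(frame, prev)
            ]
        smoothed.append(cur)
        prev = cur
    return smoothed


# One body part, one force component, with a single contact spike.
raw = [[[0.0]], [[10.0]], [[0.0]]]
out = smooth_forces(raw, alpha=0.5)
# The spike is halved on arrival and then decays instead of vanishing.
```

The policy still sees that something pushed on the body, but not the full frame-to-frame volatility of the raw contact signal.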

That is the embodied version of SkeMex’s trajectory-to-skill distillation. Both systems reject raw experience as too noisy. One compresses clinical trajectories into procedural skill items. The other compresses fluid interaction into a manageable body-part force representation. The medium changes; the principle does not.

SWIM’s second major design choice is efficient sampling. The authors build on PPO but add an off-policy replay component with progressive eviction. Early in training, the buffer behaves more like FIFO. Later, it becomes more reward-aware. Crucially, the reward-aware strategy keeps both high-reward successful episodes and low-reward failure episodes, while preferentially evicting mid-reward samples. That is a quietly important design move. Failures are not garbage. Boring ambiguity often is.
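The eviction idea can be sketched as follows. This is a simplified illustration, not SWIM's implementation: the `progress` parameter, the probabilistic switch between FIFO and reward-aware eviction, and the "closest to the median reward" rule for picking a mid-reward victim are all assumptions.

```python
import random
import statistics


class ProgressiveReplayBuffer:
    """Replay buffer that shifts from FIFO to reward-aware eviction."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []                 # (episode_id, reward), oldest first
        self.rng = random.Random(seed)

    def add(self, episode_id, reward, progress):
        """progress in [0, 1]: 0 = start of training, 1 = end of training."""
        if len(self.items) >= self.capacity:
            self._evict(progress)
        self.items.append((episode_id, reward))

    def _evict(self, progress):
        if self.rng.random() < progress:
            # Reward-aware: drop the sample nearest the median reward, so
            # clear successes and clear failures both survive.
            median = statistics.median(r for _, r in self.items)
            victim = min(range(len(self.items)),
                         key=lambda i: abs(self.items[i][1] - median))
        else:
            # Early training: plain FIFO, drop the oldest sample.
            victim = 0
        del self.items[victim]


early, late = ProgressiveReplayBuffer(3), ProgressiveReplayBuffer(3)
for i, reward in enumerate([0.1, 0.5, 0.9, 0.95]):
    early.add(i, reward, progress=0.0)  # always FIFO
    late.add(i, reward, progress=1.0)   # always reward-aware

early_rewards = [r for _, r in early.items]
late_rewards = [r for _, r in late.items]
```

In the early buffer the oldest sample (the 0.1 failure) is lost; in the late buffer the failure is kept and the unremarkable 0.5 episode is evicted instead.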

The results show that SmoothFT with progressive eviction converges faster and trains more stably than alternatives. In comparison against adapted imitation-learning and RL baselines, SWIM completes all three observed learning stages (not drifting sideways, swimming toward the goal, and reaching the target) within a five-million-sample budget. The paper then tests zero-shot generalization beyond the simple training setup: unseen goals, curved trajectory following, different fluid properties, wave conditions, perturbations, and body-geometry modifications.

The limitations are equally instructive. SWIM still relies on adequate motion data. It struggles under severe distribution shifts, including strong waves and amputated or heavily altered bodies. Moderate fins can improve propulsion, but limb removal causes instability. That is not a footnote. It is the boundary of the representation. The system generalizes because its experience is structured well for a range of variations, not because it has discovered universal swimming essence in a three-second clip. There is a difference. Investors occasionally forget this. Engineers rarely do.

The combined thesis: experience needs an operating model

Together, these papers suggest a simple operating equation:

$$ E_{\text{useful}} \approx C \times A \times V \times G \times T $$

where:

$C$ is capture: did the system observe meaningful interaction?
$A$ is abstraction: was raw experience converted into a reusable representation?
$V$ is valuation: was usefulness estimated from outcomes?
$G$ is governance: can bad, stale, duplicate, or immature experience be managed?
$T$ is transfer testing: has the system been evaluated outside its comfort zone?

The multiplication is intentional. If any term collapses, useful experience collapses with it. A huge memory store with no valuation is just hoarding. A clever abstraction with no governance becomes stale doctrine. A replay buffer with no boundary tests becomes a confidence machine trained on its own narrow biography.
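As a worked illustration, assume each factor is scored on a 0-to-1 scale (the scale and the numbers are hypothetical, not taken from either paper). A system that is strong everywhere except governance scores:

$$ E_{\text{useful}} \approx 0.9 \times 0.9 \times 0.9 \times 0.1 \times 0.9 \approx 0.066 $$

An additive average of the same five scores would report

$$ \frac{0.9 + 0.9 + 0.9 + 0.1 + 0.9}{5} = 0.74 $$

which is exactly the kind of flattering number the multiplicative form is meant to prevent. One weak term drags the product down by an order of magnitude; it barely dents the average.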

This is the business point: agent learning is becoming less like model training and more like operations management. It needs inventory control, quality assurance, lifecycle rules, audit trails, and stress testing. Yes, the language is less glamorous than “autonomous intelligence.” That is usually how you know it might survive procurement.

What the papers show, and what the business interpretation adds

It is worth separating the evidence from the extrapolation.

The papers show the following:

Evidence from the papers	What it supports
SkeMex improves medical-agent benchmark performance using external skill memory without backbone updates.	Post-deployment improvement can come from governed memory, not only retraining.
SkeMex ablations show drops when filtering, maturation, branch structure, or valuation components are removed.	Memory quality depends on lifecycle design, not just storage.
Some memory baselines regress during online updates.	Experience reuse can harm performance when poorly governed.
SWIM learns swimming control from one short reference motion using structured state and efficient sampling.	Scarce demonstrations can be useful when paired with strong representation and sampling design.
SWIM generalizes across several unseen physical conditions but fails under severe shifts.	Transfer has boundaries that must be mapped explicitly.

The business interpretation is broader:

Business interpretation	Why it follows, cautiously
Enterprise agent memory should be treated as a governed asset.	Both systems improve through selected, structured experience rather than raw accumulation.
Feedback attribution is a product requirement, not a research luxury.	SkeMex depends on utility updates; SWIM depends on reward-aware sample retention.
Logs are not automatically reusable knowledge.	Both papers convert raw interaction into more stable abstractions before reuse.
Deployment claims should include transfer boundaries.	Both papers test beyond the training setting and reveal failure zones.
“More data” can be an expensive way to avoid designing the right representation.	SWIM’s single-instance learning and SkeMex’s compact skill repository both point toward experience efficiency.

The extrapolation should not be abused. A medical skill repository is not the same as a replay buffer in fluid simulation. A clinical benchmark is not a hospital ward. A simulated swimmer is not a warehouse robot, a drone, or a financial agent. But the common pattern is strong enough to matter: reusable experience is designed, not dumped.

Experience Operations: a practical framework

For teams building AI agents, the combined lesson can be turned into an operating framework. Call it Experience Operations, or ExpOps, if we must keep feeding the acronym furnace.

Function	Question to ask	Bad default	Better design
Capture	Which interactions are worth learning from?	Store everything.	Filter for informative successes, failures, and edge cases.
Compress	What reusable structure should survive?	Keep raw transcripts and traces.	Distil into skills, patterns, state features, or validated procedures.
Value	How do we know this memory helped?	Retrieve by semantic similarity only.	Track outcome-conditioned utility and context-specific reliability.
Retrieve	When should prior experience influence action?	Inject all similar memories.	Rank by relevance, utility, recency, and task context.
Govern	What happens to stale or harmful experience?	Let memory grow forever.	Merge duplicates, mature validated items, deprecate weak ones, enforce capacity.
Stress-test	Where does reuse fail?	Test on familiar cases.	Evaluate across domains, tools, environments, user types, and perturbations.

This matters most in workflows where mistakes are expensive and feedback is uneven: healthcare, finance, legal operations, industrial control, logistics, insurance, cybersecurity, and customer support at scale. In those settings, the agent’s memory layer is not a convenience feature. It is an operational risk surface.

A customer-service agent that remembers a workaround may reduce handle time. It may also keep applying an obsolete policy after compliance changes. A finance agent that learns from prior portfolio reviews may improve personalization. It may also overfit to a market regime that has politely died. A clinical agent may reuse a diagnostic workflow. It may also prematurely converge because the old workflow fits the first half of the case and ignores the part where reality becomes rude.

The answer is not to disable memory. The answer is to manage it.

Why “more context” is not the answer

The current industry reflex is to throw larger context windows at memory problems. That helps with some tasks. It does not solve experience governance.

A long context window can carry more information, but it does not decide what is worth carrying. It does not know whether a prior trace was useful, harmful, outdated, redundant, or context-specific unless the system has mechanisms to evaluate those properties. SkeMex makes this explicit by combining semantic match with utility and memory strength. SWIM makes the physical analogue explicit by avoiding raw volatile force signals and using a smoothed, structured representation.

Long context is a bigger suitcase. These papers are about packing discipline.

That distinction becomes more important as agent systems connect to tools. Tool-using agents generate messy traces: failed calls, malformed arguments, partial observations, corrections, retries, irrelevant detours, and occasional success achieved by luck wearing a lab coat. If all of that becomes memory, the agent inherits its own debris. If none of it becomes memory, the agent repeats avoidable mistakes. The value lies in the middle: selective learning with governance.

The failure boundary is the product boundary

Both papers are unusually useful because they ask not only whether the system improves but also where it weakens.

SkeMex reports strong benchmark gains but still shows a failure case where early execution errors consume the interaction budget and lead to premature diagnostic convergence. That is an execution-control problem, not just a memory problem. It tells operators that skill memory must be paired with robust tool protocols, error recovery, step budgeting, and escalation rules.

SWIM generalizes across several conditions but struggles with severe body-geometry changes and strong waves. That tells operators that representation is not neutral. A state abstraction captures some invariances and misses others. SmoothFT helps with body-fluid interaction across useful variation, but it remains tied to the training body geometry. The moment the body changes too much, the learned policy’s assumptions become visible.

For business deployment, this means every agent memory system needs a declared transfer envelope. Not a marketing claim. A testable boundary.

A usable transfer statement sounds like this:

Weak transfer claim	Better transfer claim
“The agent learns from experience.”	“The agent reuses validated skills for these task families and suppresses skills below this utility threshold.”
“The controller generalizes.”	“The controller was trained in condition A and tested under variations B, C, and D; it fails under E.”
“The memory improves performance.”	“The memory improves these metrics versus these baselines, but regresses when filtering or maturation is removed.”
“The system is adaptive.”	“The system updates memory every defined window, with capacity limits and deprecation rules.”
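One way to keep such a statement testable is to record it as data rather than prose. The sketch below is a hypothetical structure (`TransferEnvelope` and its fields are invented for illustration); the example conditions loosely follow the SWIM results described earlier and are not an exhaustive or official list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferEnvelope:
    """Declared boundary of where a trained system has been shown to work."""
    trained_on: str
    passed: frozenset = frozenset()  # variations tested and handled
    failed: frozenset = frozenset()  # variations tested and not handled

    def status(self, condition: str) -> str:
        if condition in self.passed:
            return "supported"
        if condition in self.failed:
            return "known failure"
        # Anything outside both sets has simply never been evaluated.
        return "untested"


swimmer = TransferEnvelope(
    trained_on="single freestyle reference motion",
    passed=frozenset({"unseen goals", "curved trajectories", "different fluid properties"}),
    failed=frozenset({"strong waves", "limb removal"}),
)

checks = {c: swimmer.status(c) for c in ("unseen goals", "strong waves", "ice")}
```

The third category is the important one. A condition that is neither passed nor failed is not a quiet success; it is a gap in the test plan.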

This is less exciting than “self-improving AI.” It is also less likely to embarrass everyone in a quarterly risk review.

What managers should do differently

The operational takeaway is not to copy SkeMex or SWIM directly. Most businesses are not building medical benchmark agents or simulated swimmers. The takeaway is to update the deployment checklist.

First, assign ownership of memory. If an agent has persistent experience, someone owns its lifecycle. That includes retention rules, evaluation criteria, rollback procedures, and auditability. “The vector database has it” is not ownership.

Second, separate raw logs from reusable memory. Logs are evidence. Memory is an operational intervention. Promoting a trace into reusable guidance should require filtering, abstraction, and validation.

Third, score memory by outcomes, not just similarity. A retrieved item should answer two questions: “Is it relevant?” and “Has it helped before in this context?” Similarity without utility is how systems confidently repeat elegant mistakes.

Fourth, keep failures. SWIM’s replay strategy values low-reward failures because they are informative. SkeMex also keeps informative failures for skill evolution. In business systems, failures should not only feed incident reports; they should feed controlled learning loops.

Fifth, retire things. Stale policies, duplicated procedures, old pricing logic, deprecated APIs, and obsolete customer exceptions should not live forever because nobody wanted to touch the memory layer. Memory without deletion is archaeology.

Sixth, test transfer like a product feature. If an agent is expected to operate across regions, customer types, document formats, regulatory regimes, or market conditions, then the memory system must be tested across those variations. Otherwise “generalization” means “worked on the examples we happened to like.”
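The third point, asking whether a memory "has helped before in this context", can be sketched as a small per-context utility tracker. Everything here is an assumption for illustration: the `UtilityTracker` name, the neutral prior of 0.5, the learning rate, and the example skill and context labels.

```python
from collections import defaultdict


class UtilityTracker:
    """Tracks how useful each memory item has been, separately per context."""

    def __init__(self, prior=0.5, lr=0.2):
        self.prior = prior  # assumed neutral starting estimate
        self.lr = lr        # how strongly one outcome moves the estimate
        self.utility = defaultdict(lambda: prior)

    def record(self, skill, context, outcome):
        """outcome: 1.0 if the skill helped, 0.0 if it hurt or was useless."""
        key = (skill, context)
        self.utility[key] += self.lr * (outcome - self.utility[key])

    def get(self, skill, context):
        # Unseen (skill, context) pairs fall back to the prior without
        # creating an entry.
        return self.utility.get((skill, context), self.prior)


tracker = UtilityTracker()
for _ in range(2):
    tracker.record("refund_workaround", "policy_v1", outcome=1.0)
    tracker.record("refund_workaround", "policy_v2", outcome=0.0)

old_policy = tracker.get("refund_workaround", "policy_v1")
new_policy = tracker.get("refund_workaround", "policy_v2")
unseen = tracker.get("refund_workaround", "policy_v3")
```

The same workaround ends up trusted under the old policy and distrusted under the new one, which is what a single global similarity score cannot express.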

The uncomfortable strategic implication

The uncomfortable implication is that the next moat in agent systems may not be the foundation model. It may be the experience layer around it: the structured, evaluated, domain-specific operating memory that accumulates over time.

That does not mean every company should rush to build a grand proprietary agent brain. Please do not create another internal platform named something like Cortex360. The point is narrower and more useful. If agents are doing repeated work in a domain, their interactions can become an asset only when converted into governed reusable experience. Without that conversion, the company has logs. With bad conversion, it has liability. With good conversion, it has compounding operational knowledge.

SkeMex shows this at the level of medical reasoning: skill memory can improve agent behaviour without retraining the model, but only when written, valued, and governed. SWIM shows it at the level of embodied control: one reference motion can seed robust behaviour, but only when the system uses the right state abstraction and sampling discipline.

Together, they argue for a less glamorous but more durable view of AI progress. Intelligence is not just the ability to generate an answer or imitate a motion. It is the ability to turn experience into reusable structure while knowing when that structure stops applying.

That is the actual frontier. Not memory as a checkbox. Memory as an operating system with quality control.

Badly governed experience is technical debt with better branding. Governed experience is where agents start to become useful.

Cognaptus: Automate the Present, Incubate the Future.

1. Haoran Sun et al., “Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory,” arXiv:2606.09365, 2026. https://arxiv.org/abs/2606.09365
2. Binglun Wang, Edmond S. L. Ho, and He Wang, “SWIM: Single-instance Whole-body Imitation for swiMming,” arXiv:2605.31120, 2026. https://arxiv.org/abs/2605.31120
