Repetition is where most automation systems quietly embarrass themselves.

Ask an AI agent to book a hotel once, and it may inspect the screen, reason through options, click through menus, and eventually finish the task. Ask it to do something similar tomorrow, and many systems perform the same little theatre again: perceive, reason, click, wait, reason, click, apologize, recover. Very intelligent. Very expensive. Slightly absurd.

That absurdity is the starting point for MobiMem, the system proposed in “Beyond Training: Enabling Self-Evolution of Agents with MobiMem.” [1] The paper’s central claim is not that agents need a larger model, a longer context window, or yet another heroic fine-tuning loop. Its more interesting claim is that deployed agents can improve by changing what they remember.

That sounds simple until the word “memory” gets abused, as it usually does. In many AI products, memory means a longer chat history, a vector database of user facts, or a polite way of saying “we stored something somewhere and hope retrieval works.” MobiMem uses the word more operationally. It separates memory into three kinds: profile memory for user preferences, experience memory for reusable task logic, and action memory for repeated interface operations. Then it wraps those memories in operating-system-like services: scheduling, record-and-replay, and exception handling.

The result is not merely an agent with a notebook. It is closer to an agent with a small operating system around it.

That distinction matters for business automation because most companies do not need agents that become philosophers after deployment. They need agents that stop making the same mistake, stop asking the model to reason through the same form, and stop burning latency on interactions that could have been reused safely. MobiMem offers one concrete architecture for that shift: memory over models.

The paper’s real argument is not “agents need memory”

The obvious summary would say: MobiMem adds memory to GUI agents and improves performance. True, but bland enough to be dangerous.

The paper’s actual argument is more specific. GUI agents fail to self-evolve after deployment along three different axes:

| Bottleneck | Naive solution | MobiMem’s replacement | Practical meaning |
| --- | --- | --- | --- |
| The agent does not know the user well enough | Store more user facts or fine-tune per user | Profile Memory with DisGraph | Retrieve relevant preferences quickly without LLM-based graph traversal |
| The agent cannot generalize from prior task executions | Train on more traces | Experience Memory with multi-level templates | Reuse the control logic of task families while filling task-specific parameters |
| The agent repeats expensive reasoning for familiar UI actions | Cache whole trajectories or rerun the model | Action Memory with ActTree and ActChain | Reuse safe UI actions while checking for stale screens and app changes |

This three-part split is the paper’s strongest design move. It refuses to treat “agent improvement” as one problem. Personalization, capability expansion, and execution efficiency are not the same failure mode, so one memory store should not be expected to solve all three.

That is the part product teams should notice. If a deployed workflow agent is slow, inaccurate, and not personalized, the answer is not necessarily “use a bigger model.” It may be that the system lacks the right persistent structures around the model.

Profile Memory: user preference is a retrieval problem before it is a reasoning problem

The first memory type is Profile Memory. Its job is to help the agent infer missing task details from learned user behavior.

A user says: “Book a train ticket for tomorrow.” The agent may need to infer preferred seat class, usual departure time, price sensitivity, station preference, or whether the user tends to avoid transfers. A normal RAG system can store these preferences as chunks and retrieve semantically similar text. The problem is that similarity is not the same as relevance. “Travel” may retrieve many plausible but wrong memories. GraphRAG can add relational structure, but graph construction and traversal often require LLM calls, which makes retrieval slower.

MobiMem’s answer is DisGraph, a distance graph that moves semantic information into nodes and keeps edges lightweight. Nodes are either abstract concepts or concrete entities. Edges mainly indicate membership or relevance, not rich semantic relations. Retrieval starts with embedding search to find relevant nodes, then expands through breadth-first search over nearby graph nodes. In plain English: use embeddings to find the neighborhood, then use graph distance to collect related facts without asking the LLM to think its way through the graph.
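The two-stage retrieval described above can be sketched in a few lines. This is a toy illustration, not the paper’s implementation: the node names, the three-dimensional hand-made embeddings, and the `retrieve`, `top_k`, and `max_hops` names are all assumptions, and a real system would use a learned embedding model and a vector database for the first stage.

```python
import math
from collections import deque

# Hypothetical toy data: node -> (kind, embedding).
NODES = {
    "travel": ("concept", (1.0, 0.0, 0.0)),
    "train seat class": ("concept", (0.9, 0.1, 0.0)),
    "prefers window seat, second class": ("entity", (0.8, 0.2, 0.0)),
    "usually departs before 9am": ("entity", (0.7, 0.0, 0.3)),
    "food delivery": ("concept", (0.0, 1.0, 0.0)),
    "orders spicy noodles on Fridays": ("entity", (0.1, 0.9, 0.0)),
}

# Lightweight, unlabeled edges: membership or relevance only.
EDGES = {
    "travel": ["train seat class", "usually departs before 9am"],
    "train seat class": ["travel", "prefers window seat, second class"],
    "prefers window seat, second class": ["train seat class"],
    "usually departs before 9am": ["travel"],
    "food delivery": ["orders spicy noodles on Fridays"],
    "orders spicy noodles on Fridays": ["food delivery"],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, top_k=1, max_hops=2):
    """Stage 1: embedding search picks seed nodes.
    Stage 2: breadth-first search collects nodes within max_hops of a seed.
    No LLM call is involved in either stage."""
    ranked = sorted(NODES, key=lambda n: cosine(query_vec, NODES[n][1]), reverse=True)
    seeds = ranked[:top_k]
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    # Return concrete facts only; concept nodes act as routing structure here.
    return sorted(n for n in seen if NODES[n][0] == "entity")


# A "travel"-like query pulls in travel facts and leaves food delivery alone.
facts = retrieve((1.0, 0.0, 0.0))
print(facts)
```

The design point the sketch tries to capture is that the graph walk is plain traversal over untyped edges, so its cost depends on neighborhood size rather than on model inference.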

That is the mechanism. The evidence is direct. In the paper’s synthetic user-profile benchmark, each profile includes facts, preferences, and behavior patterns across categories such as shopping, hotel booking, travel, food delivery, and entertainment. The benchmark uses 500 historical tasks and 30 ambiguous test tasks per user. The test checks whether the memory system retrieves the profile information needed to rewrite ambiguous tasks into personalized, complete instructions.

The results show the intended trade-off clearly:

| Profile system | Write latency | Retrieval latency | Profile alignment |
| --- | --- | --- | --- |
| Vanilla RAG | 1.76 ms | 19.58 ms | 66.4% |
| GraphRAG | 37,688.73 ms | 6,675.82 ms | 81.1% |
| MobiMem DisGraph | 6,138.87 ms | 23.83 ms | 83.1% |

This table is not just “ours is better.” It is a design diagnosis.

Vanilla RAG is fast but loses relational context. GraphRAG recovers alignment but pays for it with LLM-heavy graph operations. DisGraph keeps GraphRAG-level alignment (83.1% versus 81.1%) while moving retrieval back into cheap embedding search plus graph traversal. Writes are not free: at about 6.1 seconds, a DisGraph write is roughly six times faster than GraphRAG’s 37.7 seconds but far slower than Vanilla RAG’s 1.76 ms. The retrieval latency difference is the business-relevant part: 23.83 ms versus 6.68 seconds, a gap of roughly 280×, is not an academic rounding error. It is the difference between invisible infrastructure and a user watching an agent think about what it should already know.

The scalability test reinforces the same point. As the graph grows from 100 to 100,000 nodes, DisGraph traversal itself remains almost flat, from about 0.15 ms to 0.35 ms. Total retrieval latency grows mainly because vector database search grows from 8.77 ms to 1.29 seconds. Storage reaches about 1.35 GB at 100,000 nodes. So the bottleneck shifts away from graph reasoning and toward conventional retrieval infrastructure.

That is a useful boundary. MobiMem does not make large user memory free. It makes the graph walk cheap and removes LLM calls from retrieval. At very large profile sizes, vector search and storage engineering still matter. The magic, regrettably, remains scheduled for a later quarter.

Experience Memory: capability improves when task logic becomes a template

The second memory type is Experience Memory. This is where the paper moves from remembering facts to remembering how work gets done.

A GUI task often has two layers. One layer is invariant control logic: open the travel app, enter departure and destination, select date, compare options, confirm details. The other layer is variable data: today’s city, tomorrow’s date, the user’s preferred price range, the specific contact, the product name. Training a model on many examples tries to absorb both layers into weights. MobiMem instead stores reusable templates where invariant steps are kept and variable slots are filled at runtime.
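A minimal sketch of that split, assuming a template is just an ordered list of steps with named slots. The `ExperienceTemplate` class, the `BOOK_TRAIN` example, and the slot names are hypothetical; the paper’s templates are richer and multi-level.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class ExperienceTemplate:
    """Invariant control logic with named {slots} for the variable data."""
    task_family: str
    steps: tuple

    def slots(self):
        # Collect every {slot} name that appears in any step.
        names = set()
        for step in self.steps:
            for _, field, _, _ in Formatter().parse(step):
                if field:
                    names.add(field)
        return names

    def instantiate(self, **params):
        # Refuse to run with unfilled slots instead of guessing values.
        missing = self.slots() - set(params)
        if missing:
            raise ValueError(f"missing slot values: {sorted(missing)}")
        return [step.format(**params) for step in self.steps]


# Hypothetical template for one task family.
BOOK_TRAIN = ExperienceTemplate(
    task_family="book train ticket",
    steps=(
        "open travel app",
        "enter departure {origin}",
        "enter destination {destination}",
        "select date {date}",
        "compare options",
        "confirm details",
    ),
)

plan = BOOK_TRAIN.instantiate(origin="Shanghai", destination="Beijing", date="tomorrow")
print(plan)
```

The steps without slots are the part that never needs to be relearned; only the slot values change from one request to the next.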

This is important because GUI automation suffers from long-tail task families. There are too many apps, screens, layouts, and user variations to fine-tune a model for every practical workflow. But many task families share enough structure to be templated.

MobiMem uses multi-level templates. Higher-level templates generalize more broadly but require more reasoning from the model. Lower-level templates are more concrete and easier to execute, but less flexible. This is a sensible trade-off. Business workflows have the same shape: “process an invoice” is too broad for reliable automation, while “click this exact button at this exact coordinate” is too brittle. Useful automation lives between those two.

The paper evaluates Experience Memory on AndroidWorld, using 116 tasks across 20 Android apps, and compares several GUI agent models with and without experience templates. The headline result is that experience templates improve success rates across all tested agents. The weaker UI-TARS-1.5-7B model receives the largest relative improvement, 50.3%. Stronger models such as Gemini-2.5-Flash and Qwen3-VL-30B-A3B improve by about 21%–22%, while GUI-Owl gains 10.5%.

The interpretation is not simply “templates help smaller models.” It is more nuanced. The paper reports that general-purpose models benefit more from lower-level templates because UI interaction details are difficult to infer through reasoning alone. Domain-specific GUI models already have some interaction skill, so higher-level templates can be enough to reduce planning burden.

That is a useful lesson for deployment. Template granularity should match model competence. A powerful model may only need task-level structure. A smaller or cheaper model may need more concrete UI-level guidance. The cheapest model is not always the best model; the cheapest model plus the right operational memory may be.

The cost-effectiveness comparison makes this sharper:

| Method for adding capability | Data required | Human effort | GPU hours | Accuracy |
| --- | --- | --- | --- | --- |
| Fine-tuning | ~100 examples | 4.0 person-hours | 0.25 | 58.5% |
| Manual experience templates | ~5 examples | 0.2 person-hours | 0 | 63.5% |
| Automatically synthesized templates | ~5 examples | 0 | 0.0027 | 60.1% |

This experiment resembles a comparison with prior work, but it functions more importantly as an operational cost test. It asks: if a company wants to add a new agent capability, does it need to build a training pipeline, or can it add structured reusable experience?

The result favors templates in this setting. Manual templates perform best (63.5%), automatically synthesized templates also edge past fine-tuning (60.1% versus 58.5%), and both need roughly 5 examples instead of roughly 100. But there is a quiet caveat inside the success: templates work when the task family has separable invariant logic and variable parameters. If the workflow changes heavily across cases, or if success depends on judgment rather than repeatable interaction structure, template memory will be less powerful.

So the business reading is not “replace training forever.” It is: do not train the model to memorize procedures that can be represented as procedures.

Action Memory: latency falls when the agent stops reasoning about muscle memory

The third memory type, Action Memory, is the most practical and perhaps the easiest to underestimate.

Experience Memory remembers task logic. Action Memory remembers concrete interaction sequences. If a user repeatedly performs similar tasks inside the same app, many actions are shared: open the app, navigate to a search field, enter a familiar section, choose a known tab, confirm a recurring workflow. A human does not consciously reason through each of these steps every time. The paper borrows that intuition and turns it into action reuse.

MobiMem uses two structures:

| Structure | Reuse pattern | When it helps |
| --- | --- | --- |
| ActTree | Prefix reuse | Tasks in the same app share early navigation steps |
| ActChain | Prefix-suffix reuse | Tasks mapped to an experience template share invariant steps before and after variable slots |

ActTree is useful when tasks share an initial path but diverge later. ActChain is more powerful when a template has already separated invariant and variable steps. Invariant actions can be replayed directly. Variable actions are reused only when parameters match. Otherwise the model performs fresh reasoning.
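Prefix reuse of the ActTree kind can be pictured as a trie over recorded action sequences. The sketch below is an assumption-laden toy: actions are plain strings, `ActTree.record` and `ActTree.reusable_prefix` are hypothetical names, and the real structure keys on UI state rather than on action labels alone.

```python
class ActTreeNode:
    def __init__(self):
        self.children = {}  # action label -> ActTreeNode


class ActTree:
    """Toy prefix tree over recorded action sequences for a single app."""

    def __init__(self):
        self.root = ActTreeNode()

    def record(self, actions):
        # Insert one executed trace; shared prefixes share nodes.
        node = self.root
        for action in actions:
            node = node.children.setdefault(action, ActTreeNode())

    def reusable_prefix(self, planned):
        """Longest prefix of `planned` that already exists in a recorded trace."""
        node, prefix = self.root, []
        for action in planned:
            if action not in node.children:
                break  # paths diverge here; the model takes over
            node = node.children[action]
            prefix.append(action)
        return prefix


tree = ActTree()
tree.record(["open hotel app", "tap search", "type city", "pick dates"])
tree.record(["open hotel app", "tap orders", "open latest order"])

prefix = tree.reusable_prefix(["open hotel app", "tap search", "open filters"])
print(prefix)
```

In this example the first two steps can be replayed and only the third needs fresh reasoning, which is the "shared early navigation" case from the table.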

This is where the paper’s OS analogy becomes more than decoration. Action Memory is effectively a cache, and caches become dangerous when stale. Apps update, screens change, buttons move, dynamic content appears. MobiMem therefore checks UI hierarchy information before replaying cached actions. If a target element no longer matches expected properties such as resource ID, class, or text, the system rolls back, falls back to model execution, and updates memory.
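The validation step can be sketched as a guard in front of every cached action. Everything here is hypothetical scaffolding: `UiElement`, `CachedAction`, `find_element`, and `run_model` are illustrative names, the rollback step is omitted, and "updating memory" is reduced to dropping the stale tail of the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UiElement:
    resource_id: str
    class_name: str
    text: str


@dataclass(frozen=True)
class CachedAction:
    action: str
    expected: UiElement  # properties the target had when the action was recorded


def replay_or_fallback(cache, find_element, run_model):
    """Replay cached actions while the live UI matches the recorded properties.

    find_element(resource_id) returns the current UiElement or None.
    run_model(index) is a stand-in for model execution from step `index` on.
    """
    executed = []
    for index, step in enumerate(cache):
        current = find_element(step.expected.resource_id)
        if current != step.expected:
            # Stale entry: stop replaying, let the model handle the rest,
            # and forget the cached tail so later runs do not trust it.
            executed.extend(("model", action) for action in run_model(index))
            del cache[index:]
            return executed
        executed.append(("replay", step.action))
    return executed


cache = [
    CachedAction("tap search", UiElement("id/search", "Button", "Search")),
    CachedAction("tap book", UiElement("id/book", "Button", "Book now")),
]
# The app was updated: the booking button now says "Reserve".
screen = {
    "id/search": UiElement("id/search", "Button", "Search"),
    "id/book": UiElement("id/book", "Button", "Reserve"),
}
result = replay_or_fallback(cache, screen.get, lambda index: ["model-chosen tap"])
print(result)
```

The comparison is deliberately strict. A production system would likely tolerate some property changes and reject others, and that policy is exactly where the safety of action reuse gets decided.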

That stale-cache handling is not a minor implementation detail. It is the difference between “automation acceleration” and “the agent confidently taps the wrong button because yesterday’s screen looked nicer.”

The evaluation uses 454 tasks across eight categories: email, train ticketing, food delivery, hotel booking, shopping, web browser, media playback, and map navigation. ActTree achieves 37.5% average reuse. ActChain with LLM-generated templates reaches 59.7%. ActChain with human-crafted templates reaches 77.3%.

The gap between LLM-generated and human-crafted templates is worth pausing over. Human-crafted templates separate invariant and variable actions more cleanly, which gives the replay system more safe reusable segments. This is not a failure of the method. It is a reminder that “template quality” is now part of agent performance engineering. In a production setting, the workflow designer may matter almost as much as the model.
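Why cleaner templates yield more reuse is easier to see with the invariant/variable split written down. The sketch below assumes each step in a chain is either invariant or tied to one named slot; `ChainStep` and `plan_reuse` are hypothetical names and not the paper’s API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChainStep:
    action: str
    slot: Optional[str] = None  # None marks an invariant step


def plan_reuse(chain, cached_params, new_params):
    """Label each step 'replay' or 'model' for a new run of the same template."""
    decisions = []
    for step in chain:
        same_value = (
            step.slot in cached_params
            and cached_params[step.slot] == new_params.get(step.slot)
        )
        if step.slot is None or same_value:
            decisions.append((step.action, "replay"))
        else:
            decisions.append((step.action, "model"))
    return decisions


chain = [
    ChainStep("open hotel app"),
    ChainStep("type city", slot="city"),
    ChainStep("pick dates", slot="dates"),
    ChainStep("tap search"),
    ChainStep("sort by price"),
]
decisions = plan_reuse(
    chain,
    cached_params={"city": "Paris", "dates": "May 1-3"},
    new_params={"city": "Paris", "dates": "Jun 7-9"},
)
reuse_rate = sum(1 for _, how in decisions if how == "replay") / len(decisions)
print(decisions, reuse_rate)
```

If a template wrongly marks an invariant step as variable, that step falls back to the model on every parameter change, which is one plausible reading of why hand-crafted templates reach higher reuse.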

Latency results show why this matters. With Action Memory enabled, MobiMind-4B drops from 14.1 seconds to 8.6 seconds, UI-TARS-1.5-7B from 14.7 seconds to 8.8 seconds, and GUI-Owl-7B from 38.0 seconds to 16.2 seconds. The paper reports speedups up to 4.5× for tasks such as hotel query and train ticketing, where reuse rates exceed 92%. Tasks with lower reuse, such as shopping and browser workflows at roughly 70% reuse, still require more LLM inference on cache misses.
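The average figures quoted above translate into more modest ratios than the 4.5× peak, which is worth keeping in view when reading the headline number. A quick calculation, using only the latencies from the text:

```python
# Average task latency in seconds: (without Action Memory, with Action Memory).
LATENCY_S = {
    "MobiMind-4B": (14.1, 8.6),
    "UI-TARS-1.5-7B": (14.7, 8.8),
    "GUI-Owl-7B": (38.0, 16.2),
}

# For each model: (speedup ratio, percent of latency removed).
summary = {
    model: (round(before / after, 2), round(100 * (before - after) / before))
    for model, (before, after) in LATENCY_S.items()
}

for model, (ratio, saved) in summary.items():
    print(f"{model}: {ratio}x faster, {saved}% less latency")
```

On these averages the speedup is roughly 1.6× to 2.3×; the 4.5× figure applies to the task categories with the highest reuse.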

The hardware experiment is even more revealing. Without Action Memory, the same MobiMind-4B agent averages 14.1 seconds on an A100 GPU, 27.4 seconds on an Ascend 910B NPU, and 153.2 seconds on a Snapdragon 8 Elite CPU-only mobile setup. With Action Memory, many tasks complete in the 10–50 second range across platforms, and the mobile device sees the largest gains, up to 9×.

This is the business value in one sentence: Action Memory converts repeated LLM inference into verified UI execution.

That does not make the model irrelevant. It makes the model less overworked. The model handles new or uncertain steps; memory handles the repeated, verified ones. That is what mature automation should look like. Not a genius rethinking every click. More like a competent operator who has learned the route to the coffee machine.

The system services are not decoration; they make memory operational

The three memories are the paper’s main architecture, but MobiMem also includes system-level services: an Agent Scheduler, Agent Record-and-Replay, and an Agent Exception Handler. These are easy to skim past. That would be a mistake.

Memory does not help much if it is retrieved, updated, and replayed in the wrong order. MobiMem’s scheduler coordinates profile retrieval, experience retrieval, task rewriting, execution, and memory updates. It also supports parallelism across apps and across fine-grained workflow steps.

The scheduler evaluation uses multi-app tasks involving search, shopping, and social applications. The paper compares serial execution, coarse-grained parallel execution, and fine-grained parallel execution. Fine-grained scheduling achieves up to 1.98× speedup over serial execution. In one representative multi-shop-plus-social task, serial execution takes 48.27 seconds. Coarse-grained parallelism reduces this to 37.65 seconds by querying two shopping apps in parallel. Fine-grained parallelism reduces it further to 29.84 seconds by letting the social app prepare non-dependent steps while price queries are still running.

This is not just a benchmark trick. It points to an important workflow principle: app-level parallelism is weaker than dependency-level parallelism. A business process is rarely just a list of apps. It is a graph of dependencies. If the agent can identify that the message body depends on price data but opening the chat does not, it can do useful work while waiting.
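Dependency-level scheduling can be sketched with `asyncio`: each step waits only for the steps it actually depends on. The workflow, step names, and durations below are hypothetical, and `asyncio.sleep` stands in for real UI work; this is not the paper’s scheduler.

```python
import asyncio

# Hypothetical workflow: step -> (duration in seconds, steps it depends on).
WORKFLOW = {
    "query shop A": (0.05, []),
    "query shop B": (0.05, []),
    "open chat": (0.01, []),
    "send price message": (0.01, ["query shop A", "query shop B", "open chat"]),
}


async def run_workflow(workflow):
    log = []
    tasks = {}

    async def run(name):
        duration, deps = workflow[name]
        # Wait only for this step's own dependencies, not for whole apps.
        await asyncio.gather(*(tasks[dep] for dep in deps))
        log.append(f"start {name}")
        await asyncio.sleep(duration)  # stand-in for real UI work
        log.append(f"done {name}")

    # All tasks are created before any of them starts running,
    # so every dependency lookup in run() will succeed.
    for name in workflow:
        tasks[name] = asyncio.create_task(run(name))
    await asyncio.gather(*tasks.values())
    return log


log = asyncio.run(run_workflow(WORKFLOW))
print(log)
```

Opening the chat finishes while the price queries are still in flight, and only the message step waits for prices. For scale, the representative task in the paper works out to about 1.28× for coarse-grained and about 1.62× for fine-grained parallelism over the 48.27-second serial run, so the 1.98× figure is a best case.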

AgentRR, the record-and-replay mechanism, provides the trace capture needed for Action Memory. It records UI states, action contexts, and execution steps, then uses those traces to populate ActTree and ActChain. Unlike traditional record-and-replay, it does not aim to reproduce one exact historical execution forever. It supports reusable action memory while preserving fallback and generalization.

The exception handler deals with user intervention. If a user interrupts the agent, manually corrects a step, or issues a new instruction mid-execution, MobiMem suspends execution, preserves context, lets the user correct the flow, and then analyzes the combined trace. The correction is not blindly pasted into memory. The Experience Generator compares the original plan with the correction and distills why the original plan failed.
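Before anything can be distilled from a correction, the system has to locate where the user’s trace departs from the plan. The sketch below covers only that first step, using the standard library’s `difflib`; it is a stand-in chosen for illustration, since the paper’s Experience Generator goes further and reasons about why the plan failed. The `locate_corrections` name and the example traces are hypothetical.

```python
from difflib import SequenceMatcher


def locate_corrections(planned, corrected):
    """Return (tag, planned_steps, corrected_steps) for every divergence."""
    matcher = SequenceMatcher(a=planned, b=corrected, autojunk=False)
    return [
        (tag, planned[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


planned = ["open train app", "search tomorrow", "pick cheapest", "pay"]
# The user stepped in and added a filter before the agent picked a ticket.
corrected = [
    "open train app",
    "search tomorrow",
    "filter direct trains",
    "pick cheapest",
    "pay",
]

diffs = locate_corrections(planned, corrected)
print(diffs)
```

Here the diff shows an inserted step rather than a replaced one, which hints that the original plan was incomplete rather than wrong. Turning that hint into a reusable rule such as "this user avoids transfers" is the part that still needs the model.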

That is the right instinct. A user correction is not just an action. It is evidence about intent.

What the experiments support, and what they do not

The paper’s experiments are best read as six separate tests, not one giant proof of “memory solves agents.”

| Evidence block | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Profile benchmark | Main evidence for Profile Memory | DisGraph improves the accuracy-latency trade-off versus Vanilla RAG and GraphRAG-style memory | Real users will produce equally clean profile signals |
| Profile scalability test | Robustness/sensitivity test | Retrieval graph traversal stays cheap as node count grows; vector search becomes the main retrieval cost | Infinite profile growth is free or privacy-safe |
| AndroidWorld success-rate test | Main evidence for Experience Memory | Templates improve success across several GUI agent models | Every enterprise workflow can be templated |
| Fine-tuning vs template cost table | Cost comparison | Experience templates can add task-family capability with less data and compute in tested settings | Fine-tuning is never useful |
| Action reuse and latency tests | Main evidence for Action Memory | Reusing verified action sequences reduces inference and latency, especially on constrained devices | Cached actions remain safe without strong UI validation |
| Multi-task scheduling test | System integration evidence | Fine-grained dependency scheduling improves cross-app workflow latency | Scheduling alone solves accuracy or personalization |

This separation matters because papers often mix main evidence, ablations, robustness tests, and implementation results in a way that tempts readers to overgeneralize. Here, the main evidence is fairly coherent: each memory type is tested against the bottleneck it was designed to address. The scalability and scheduling results are supportive, not separate theses.

The best-supported claim is that memory-centric design can reduce post-deployment dependence on model retraining for GUI agents. The stronger claim, which the evidence would not support, is that memory-centric design eliminates the need for model improvement. The paper does not show that, and does not need to.

What this means for business automation

For companies building AI workflow automation, the paper suggests a useful architecture checklist.

First, separate user memory from procedural memory. A customer preference, an invoice-processing pattern, and a repeated click path should not live in the same undifferentiated vector store. They have different update rules, retrieval rules, and failure modes.

Second, treat latency as an architecture problem, not only a model-serving problem. Faster inference helps, but the bigger win may be avoiding inference when the next step is already known and verifiable. This is especially important for mobile, edge, and desktop agents where interface operations are frequent and repeated.

Third, design workflow templates as product assets. In MobiMem, manual templates outperform automatically synthesized ones for action reuse because they separate invariant and variable steps more cleanly. In business terms, this means process analysts, operations staff, and automation engineers still have valuable work to do. Sorry, “fully autonomous agents by Friday” people. The spreadsheet refuses to die.

Fourth, build correction loops that learn from interventions. When a user interrupts an agent, that moment is not just an error event. It is a training signal, though not necessarily a model-training signal. It can update templates, exception handlers, and workflow rules.

For Cognaptus-style automation, the paper’s broader lesson is that agent deployment should not be imagined as a model sitting inside a product. It should be imagined as a memory-and-execution system where the model is only one component. The competitive advantage may come less from having the largest model and more from having the best accumulated operational memory around the model.

Where the result probably transfers, and where it may not

MobiMem is strongest in environments with structured interface introspection and repeatable workflows. The paper focuses on mobile and GUI automation, and its discussion argues that similar mechanisms could transfer to desktop systems through accessibility APIs, to browsers through DOM inspection and tools such as Selenium or Playwright, and to command-line environments through process monitoring and output parsing.

That transfer is plausible, but not automatic.

Enterprise SaaS workflows can be messier than benchmarked mobile apps. Permissions differ by role. Interfaces vary across tenants. Data can be sensitive. Workflows include compliance approvals, ambiguous business judgment, and exception-heavy edge cases. A cached action sequence that is harmless in a media app may be dangerous in payroll, banking, or procurement.

The profile benchmark also uses synthetic users and LLM-based judging. That is reasonable for controlled evaluation, but it does not settle privacy, consent, data retention, or profile drift questions in real deployment. A business system would need governance around what gets stored, how memories expire, how users inspect or correct memory, and how the agent separates preference from policy.

There is also a quality-control issue for experience templates. The paper shows that manual templates can be highly cost-effective, but manual template design introduces human process quality into the automation stack. Bad templates can encode bad workflows very efficiently. This is the classic enterprise automation problem wearing a newer jacket.

So the practical boundary is clear: MobiMem’s architecture is most compelling where tasks are repeatable, interface state is observable, action validity can be checked, and user corrections can be safely incorporated into procedural memory. It is less settled where workflows are highly discretionary, UI state is opaque, or errors carry high legal or financial cost.

The agent grows up by becoming less model-centric

MobiMem is interesting because it changes the metaphor.

A model-centric agent improves like a student: give it more examples, fine-tune it, hope the behavior generalizes. A memory-centric GUI agent improves more like an operating system: cache repeated operations, schedule independent tasks, record and replay safe actions, handle exceptions, and update structured state.

That is a more boring metaphor. It is also more useful.

The paper does not prove that future agents will stop needing model training. It does show that for GUI automation, a large share of post-deployment improvement can be moved out of model weights and into structured memory. Profile Memory makes personalization retrieval fast. Experience Memory makes task-family logic reusable. Action Memory turns repeated reasoning into verified replay. The scheduler and exception mechanisms make those memories usable in real workflows rather than decorative in a diagram.

For businesses, the lesson is not “buy a memory system.” The lesson is sharper: stop treating every agent failure as a model problem. Some failures are memory-design problems. Some are workflow-abstraction problems. Some are cache-validation problems. Some are scheduling problems.

Once you see that, “self-evolving agent” stops sounding like a mystical creature raised in a GPU farm. It starts looking like a practical software system that learns where software systems have always learned best: in state, structure, and repeated use.

Cognaptus: Automate the Present, Incubate the Future.


[1] Zibin Liu, Cheng Zhang, Xi Zhao, Yunfei Feng, Bingyu Bai, Dahu Feng, Erhu Feng, Yubin Xia, and Haibo Chen, “Beyond Training: Enabling Self-Evolution of Agents with MobiMem,” arXiv:2512.15784, 2025. https://arxiv.org/abs/2512.15784