Echoes, Not Amnesia: Teaching GUI Agents to Remember What Worked

Memory is not a folder

A useful employee does not fill out the same form from scratch every morning as if yesterday never happened. They remember which menu hides the export button, which warning can be ignored, which field must be filled before the “Next” button wakes up, and which apparently harmless click sends the process into a small bureaucratic swamp.

Many GUI agents still behave like that employee on their first day. Every day.

They can see the screen. They can click, type, scroll, and reason through a task. Some are impressively competent in isolated demos. But when the next task arrives, much of their previous operational experience disappears. The agent may have just learned how a mobile app behaves, but unless the system has a memory mechanism around it, that lesson is not retained as reusable knowledge. The model is not “growing up.” It is just performing again.

EchoTrail-GUI, proposed by Runze Li and colleagues, targets exactly this problem: the “digital amnesia” of GUI agents.¹ The paper’s central contribution is not simply that retrieval helps, nor that more examples in a prompt improve performance. That would be the boring version, and fortunately this paper is not only that.

The interesting claim is sharper: memory for GUI agents only works when it is earned, filtered, retrieved with relevance, and injected in a form the agent can actually use. Save everything and you create a junk drawer. Retrieve randomly and you create confusion. Add too many examples and the prompt becomes a committee meeting. EchoTrail-GUI is valuable because it treats memory as an operating pipeline, not a sentimental archive.

The real bottleneck is operational experience, not screenshots

Modern GUI agents are built on vision-language models that can interpret screenshots and generate actions. That is the visible magic. The less glamorous problem is trajectory knowledge: how a task actually unfolds across screens.

A GUI task is rarely a single action. It is a sequence:

observe the screen;
infer what matters;
choose an action;
observe the next screen;
avoid loops, dead ends, and irrelevant controls;
know when the task is finished.

The paper formulates this as a partially observable decision problem. At each step, the agent sees a GUI state, receives a task instruction, and chooses an action. A standard agent relies on its frozen base policy and action history. EchoTrail-GUI adds a retrieved memory set, so the next action is conditioned not only on the current screen and task, but also on relevant successful trajectories from the past:

$$ a_t \sim \pi_{aug}(a_t \mid s_t, I, H_t, M_t) $$

That formula matters because it tells us where the intervention sits. EchoTrail-GUI is not fine-tuning the model weights. It is adding non-parametric operating memory around the agent. In business language, it is closer to giving a worker a well-indexed playbook of successful procedures than sending the worker back to university.

The catch is obvious once stated: a playbook is only useful if it contains the right procedures. Bad memory is not neutral. It actively misleads.
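To make the placement of the intervention concrete, here is a minimal sketch of a policy whose weights are untouched and whose input simply gains a memory slot. The names `StepContext`, `build_prompt`, and `act`, along with the prompt layout, are illustrative assumptions and not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepContext:
    state: str          # s_t: text rendering of the current screen
    instruction: str    # I: the task instruction
    history: List[str]  # H_t: actions taken so far
    memories: List[str] = field(default_factory=list)  # M_t: retrieved trajectories


def build_prompt(ctx: StepContext) -> str:
    """Assemble the policy input; the layout here is hypothetical."""
    parts = [f"Task: {ctx.instruction}", f"Screen: {ctx.state}"]
    if ctx.history:
        parts.append("History: " + " -> ".join(ctx.history))
    for i, memory in enumerate(ctx.memories, start=1):
        parts.append(f"Past success {i}: {memory}")
    return "\n".join(parts)


def act(base_policy: Callable[[str], str], ctx: StepContext) -> str:
    # The frozen base policy is called as-is; only its input changes.
    return base_policy(build_prompt(ctx))
```

With an empty `memories` list the same function describes the standard agent, which is the point: memory is an input, not a weight update.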

EchoTrail-GUI works as a three-stage memory lifecycle

The paper’s mechanism is best read as a memory lifecycle: collect, filter, retrieve, apply. The authors describe three stages: critic-guided self-exploration, dynamic memory injection, and memory-augmented inference.

That sounds tidy, almost too tidy. The details are where the paper earns attention.

Stage	What the system does	Why it matters operationally	Failure mode if done badly
Critic-guided self-exploration	An exploration agent interacts with Android environments and generates candidate trajectories	Reduces dependence on manual demonstrations	Unguided exploration produces noisy, useless traces
Critic filtering	A reward model scores trajectories on a 1–5 scale and keeps only those scoring at or above the threshold of 4	Prevents memory from becoming a landfill	Bad memories perform worse than no memory
Trajectory abstraction	Stored memories keep interface descriptions, intent summaries, and actions, not raw screenshot chains	Makes memory lighter and more transferable	Raw pixels are expensive, redundant, and device-specific
Hybrid retrieval	Dense retrieval via embeddings is combined with BM25 keyword matching	Balances semantic similarity and literal task matching	Wrong examples give confident but irrelevant guidance
Memory injection	Retrieved trajectories are formatted as step-by-step guidance inside the GUI agent’s prompt	Turns past experience into immediate action guidance	Too many examples dilute context and introduce conflict

This is the first important correction to the casual reading of the paper. EchoTrail-GUI is not saying, “Agents need memory.” Everyone now says that. Usually right before producing a vector database and calling it architecture.

The paper says something more useful: agents need actionable memory. That means the memory must be produced from plausible task behavior, judged for quality, compacted into reusable abstractions, retrieved for the current task, and injected in a bounded way.

Self-exploration creates the raw material, but the critic decides what survives

The first stage is autonomous trajectory generation. An exploration agent, powered in the implementation by Gemini 2.5 Flash, interacts with GUI environments and creates trajectories, with exploration capped at 30 steps per trajectory. The resulting EchoTrail-4K dataset contains 4,143 episodes; valid trajectories average 4.8 steps, and the longest observed runs 22 steps, well under the cap. The paper reports about 4.3 minutes per valid trajectory on a single device.

The exploration process is not pure wandering. EchoTrail-GUI uses what the authors call Progressive Intent Focus. The agent begins in a curiosity-driven mode, touching novel interface elements. After a few steps, it shifts into target-focused mode, where it forms a concrete sub-goal and tries to complete it.

That design solves a subtle problem. If exploration is too random, it discovers buttons but not workflows. If it is too goal-directed too early, it may never explore enough of the interface to build broad memory. The paper’s solution is a sequence: first discover the environment, then commit to a plausible intent.
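The two-phase schedule can be sketched in a few lines. This is only an illustration: the article says the switch happens "after a few steps", so the `switch_after=3` default and the mode names are assumptions, and the paper's actual trigger may not be a fixed step count.

```python
MAX_STEPS = 30  # exploration cap per trajectory, as reported in the article


def exploration_mode(step: int, switch_after: int = 3) -> str:
    """Return the exploration mode for a zero-based step index.

    switch_after is a hypothetical parameter standing in for "a few steps".
    """
    return "curiosity" if step < switch_after else "target_focused"


# Mode at every step of a maximum-length exploration episode.
schedule = [exploration_mode(t) for t in range(MAX_STEPS)]
```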

But the more important step is the critic.

Each trajectory is evaluated by a reward model, implemented with Gemini 2.5 Flash Lite. The critic assigns a score from 1 to 5, and only trajectories scoring at least 4 enter the permanent memory database. This is the paper’s least glamorous component and probably its most business-relevant one.

Why? Because process memory in enterprise automation is only valuable if it is trustworthy enough to reuse. A memory system that stores every failed click path, half-finished action, and accidental detour does not create institutional knowledge. It creates institutional confusion, but now with embeddings.

The ablation results later make this painfully clear: removing critic-based filtering drops AndroidWorld average success from the full system’s 46.6% to 31.0% on Qwen2.5-VL-72B-Instruct, below the no-memory backbone baseline of 34.1%. In other words, unfiltered memory is worse than amnesia. Elegant, brutal, and useful.
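The filtering rule itself is small enough to write down. The sketch below assumes critic scores have already been produced; `ScoredTrajectory` and `filter_for_permanent_memory` are hypothetical names, and only the 1 to 5 scale and the keep-at-4-or-above rule come from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredTrajectory:
    intent: str
    critic_score: int  # integer score on the 1-5 scale


KEEP_THRESHOLD = 4  # trajectories scoring at least 4 are kept


def filter_for_permanent_memory(
    trajectories: List[ScoredTrajectory], threshold: int = KEEP_THRESHOLD
) -> List[ScoredTrajectory]:
    """Keep only trajectories whose critic score meets the threshold."""
    for t in trajectories:
        if not 1 <= t.critic_score <= 5:
            raise ValueError(f"score out of range: {t.critic_score}")
    return [t for t in trajectories if t.critic_score >= threshold]
```

The interesting engineering is in the critic, not the comparison. The code only shows where the gate sits in the pipeline.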

The paper stores compressed procedures, not visual hoarding

A second mechanism is easy to miss: EchoTrail-GUI does not store raw screenshot sequences as memory.

Instead, each trajectory is abstracted into:

a textual description of the interface at each step;
the agent’s intent summary;
the executed action.

The paper reports that an average five-step trajectory occupies around 1,000 tokens in this format, compared with more than 10,000 tokens for an equivalent concatenated screenshot history. That is about a 90% reduction.

This matters for two reasons.

First, it reduces context cost. If memory injection requires stuffing many screenshots into the prompt, the system becomes expensive and brittle. Second, abstraction makes the memory less tied to a specific device layout. The agent does not need to remember that “the blue button was at coordinate x.” It needs to remember that the next step was to open the settings menu, select the relevant option, confirm the change, or finish after the expected state appears.

For enterprise automation, this is the difference between recording a screen video and building an operating procedure. The first is evidence. The second is reusable process knowledge.
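As a rough sketch of what such a compact record might look like, the three fields below mirror the list above. The class name, the rendering format, and the helper `token_reduction` are assumptions for illustration; the token figures plugged in are the approximate ones the article quotes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AbstractStep:
    interface_description: str  # textual description of the screen
    intent_summary: str         # what the agent was trying to do
    action: str                 # the executed action


def render_memory(steps: List[AbstractStep]) -> str:
    """Render a trajectory as plain text (hypothetical format)."""
    return "\n".join(
        f"Step {i}: screen={s.interface_description}; "
        f"intent={s.intent_summary}; action={s.action}"
        for i, s in enumerate(steps, start=1)
    )


def token_reduction(abstract_tokens: int, screenshot_tokens: int) -> float:
    """Fraction of tokens saved by the abstract format."""
    return 1 - abstract_tokens / screenshot_tokens


# Roughly 1,000 tokens versus 10,000 tokens for a five-step trajectory.
saving = token_reduction(1_000, 10_000)
```

With the article's round numbers, `saving` comes out at 0.9, which is the "about 90%" figure. Since the screenshot side is quoted as "more than 10,000", the true saving is at least that.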

Retrieval is where memory becomes relevant instead of merely available

Once the permanent memory database exists, the second stage retrieves relevant trajectories for a new instruction.

EchoTrail-GUI uses a hybrid retrieval score:

$$ Score(\tau, I) = \alpha \cdot S_{dense}(\tau, I) + (1-\alpha) \cdot S_{sparse}(\tau, I) $$

The dense score compares embeddings of the current instruction and the final intent of a stored trajectory, using FAISS. The sparse score uses BM25 lexical matching. This combination is sensible: GUI tasks often contain both semantic similarity and exact interface language. “Add a calendar event” and “create an appointment” are semantically close. But “VLC,” “OsmAnd,” or a specific setting label may require literal matching. Pure semantic retrieval can drift. Pure keyword retrieval can miss paraphrases. The hybrid approach says: please use both eyes, not one.
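The scoring formula above can be sketched directly. This example assumes the dense and sparse scores have already been computed and scaled to the range 0 to 1; it does not reproduce the FAISS or BM25 steps. The function names, the candidate scores, and `alpha=0.5` are all hypothetical.

```python
from typing import Dict, List, Tuple


def hybrid_score(dense: float, sparse: float, alpha: float) -> float:
    """Score = alpha * S_dense + (1 - alpha) * S_sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * dense + (1 - alpha) * sparse


def retrieve_top_k(
    scores: Dict[str, Tuple[float, float]], alpha: float, k: int
) -> List[str]:
    """Return the ids of the k trajectories with the highest hybrid score."""
    ranked = sorted(
        scores, key=lambda tid: hybrid_score(*scores[tid], alpha), reverse=True
    )
    return ranked[:k]


# Made-up (dense, sparse) scores for three stored trajectories.
candidates = {
    "create_calendar_event": (0.90, 0.20),  # close paraphrase, few shared words
    "open_vlc_playlist": (0.40, 0.95),      # exact app name matches
    "change_wifi_setting": (0.30, 0.10),
}
top_two = retrieve_top_k(candidates, alpha=0.5, k=2)
```

With these made-up numbers the literal match on the app name outranks the paraphrase. Moving `alpha` toward 1 would reverse that order.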

The paper’s sensitivity test on the number of injected memories is a good example of a result that should not be overread. It is not a second thesis. It is a robustness and configuration test.

On AndroidWorld, with Qwen2.5-VL-72B-Instruct:

Number of retrieved memories $K$	AndroidWorld SR
0	34.1%
1	40.5%
2	46.6%
3	43.1%

The peak occurs at $K=2$. One memory helps. Two help more. Three do worse than two, though still better than none.

That curve is the paper’s anti-hoarding lesson. More context is not always more intelligence. After a point, the agent receives overlapping or conflicting precedents, the prompt becomes longer, and useful guidance is diluted. Anyone who has seen a meeting with seven “relevant” stakeholders will understand the mechanism immediately.
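The shape of that curve is easy to check from the table. The snippet below only restates the four reported numbers; the variable names are illustrative.

```python
# Success rates (%) from the K-sensitivity table above.
sr_by_k = {0: 34.1, 1: 40.5, 2: 46.6, 3: 43.1}

# K with the highest reported success rate.
best_k = max(sr_by_k, key=sr_by_k.get)

# Change in percentage points from adding one more memory.
marginal_pp = {
    k: round(sr_by_k[k] - sr_by_k[k - 1], 1) for k in sorted(sr_by_k) if k > 0
}
```

The first and second memories add 6.4 and 6.1 points respectively, and the third subtracts 3.5. That is a configuration finding for this benchmark and backbone, not a law about $K$.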

The main evidence: performance improves without retraining

EchoTrail-GUI is evaluated on AndroidWorld and AndroidLab. The authors test it with GPT-4o and Qwen2.5-VL-72B-Instruct as backbones, emphasizing that the framework is training-free and plug-and-play.

The main results support the mechanism-first story.

On AndroidWorld, GPT-4o alone scores a 34.5% success rate. EchoTrail-GUI with GPT-4o reaches 51.7%. For the open-source Qwen2.5-VL-72B-Instruct backbone, the main results table lists a 35.0% baseline (the ablation and $K$-sensitivity tables report 34.1% for the same no-memory configuration), while EchoTrail-GUI reaches 46.6%, matching UI-TARS-72B-SFT in the table without fine-tuning.

On AndroidLab, the improvement is also meaningful:

Backbone	Benchmark	Baseline SR	EchoTrail-GUI SR	Gain
GPT-4o	AndroidWorld	34.5%	51.7%	+17.2 pp
Qwen2.5-VL-72B-Instruct	AndroidWorld	35.0%	46.6%	+11.6 pp
GPT-4o	AndroidLab	31.2%	48.1%	+16.9 pp
Qwen2.5-VL-72B-Instruct	AndroidLab	23.9%	37.5%	+13.6 pp
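The Gain column is in percentage points, which is not the same thing as relative improvement. A short check, using only the numbers in the table (the variable names are illustrative), recomputes both:

```python
# (backbone, benchmark, baseline SR %, EchoTrail-GUI SR %) from the table above.
results = [
    ("GPT-4o", "AndroidWorld", 34.5, 51.7),
    ("Qwen2.5-VL-72B-Instruct", "AndroidWorld", 35.0, 46.6),
    ("GPT-4o", "AndroidLab", 31.2, 48.1),
    ("Qwen2.5-VL-72B-Instruct", "AndroidLab", 23.9, 37.5),
]

gains = {}
for backbone, benchmark, baseline, augmented in results:
    gain_pp = round(augmented - baseline, 1)            # percentage points
    relative = 100 * (augmented - baseline) / baseline  # percent of baseline
    gains[(backbone, benchmark)] = (gain_pp, relative)
    print(f"{backbone} / {benchmark}: +{gain_pp} pp, {relative:.0f}% relative")
```

The percentage-point figures reproduce the Gain column. The relative figures are larger because the baselines are low, which is worth remembering before quoting either number alone.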

The right interpretation is not “GUI agents are solved.” They are not. A 51.7% success rate is impressive relative to the benchmark, but still awkward if imagined as a production system clicking through payroll, procurement, or compliance workflows with no guardrails. Please do not let the demo goblin near your month-end close.

The better interpretation is that memory augmentation can produce substantial gains without retraining the model. That is strategically important because fine-tuning GUI agents is expensive, data-hungry, and often slow to adapt when interfaces change. A memory layer can be updated faster than model weights.

AndroidLab adds more detail because it reports not only final success, but sub-goal success, redundancy, and operation reasonableness. On Qwen2.5-VL-72B-Instruct, EchoTrail-GUI raises Sub-SR (sub-goal success rate) from 26.1 to 41.1, RRR (reversed redundancy ratio, where higher means fewer wasted steps) from 68.7 to 89.4, ROR (reasonable operation ratio) from 81.4 to 92.1, and SR from 23.9 to 37.5. Read the table rather than the celebratory phrasing: the final success rate does not magically become perfect, but the agent makes more intermediate progress, wastes fewer steps, and performs more reasonable operations.

That combination matters. A GUI agent that fails later but makes more correct sub-goal progress may be easier to supervise, interrupt, or repair. A system that reduces redundant operations may also reduce latency and cost. The paper does not provide an enterprise cost model, but the operational direction is clear.

The ablations show that memory has three separate jobs

The ablation table is the most useful part of the paper for builders because it separates the components. On AndroidWorld using Qwen2.5-VL-72B-Instruct, the full EchoTrail-GUI system reaches 46.6% average SR. Removing pieces gives different kinds of damage:

Variant	Easy	Medium	Hard	Avg. SR	Likely purpose of test
Qwen2.5-VL-72B-Instruct baseline	46.7	23.6	13.2	34.1	No-memory baseline
w/o Critic-based Filtering	47.5	13.9	10.5	31.0	Ablation: memory quality
w/o Hybrid Retrieval	60.7	20.8	13.2	40.5	Ablation: retrieval relevance
w/o Real-time Guidance	62.3	25.0	13.2	42.7	Ablation: exploration quality
EchoTrail-GUI	65.6	30.6	15.8	46.6	Full mechanism

This table tells a more specific story than “memory helps.”

Critic-based filtering is the safety valve. Remove it, and performance falls below the baseline. This is the strongest evidence for the paper’s core misconception correction: memory is not just storage. Quality control is part of memory.

Hybrid retrieval is the relevance engine. Remove it, and the system still beats the baseline, but loses a large part of the full gain. That suggests the memory database has value, but access quality controls whether the agent receives useful precedent or vaguely related noise.

Real-time guidance affects memory creation. It improves the trajectories generated during exploration, which later affects inference quality. This matters because the memory system is not merely a database attached after the fact. It shapes the data generation loop itself.

The hard-task numbers also deserve restraint. Full EchoTrail-GUI improves hard-task SR from 13.2 to 15.8. That is a gain, but not a transformation. The larger gains appear in easy and medium tasks. The paper says gains are pronounced on medium and hard tasks, but the table suggests the practical lift is still modest on the hardest category. That is not a fatal weakness. It is a boundary.
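One way to read the ablation table is to ask how much of the full system's gain each ablated variant keeps. The sketch below uses only the Avg. SR column; "share of gain retained" is an illustrative framing, not a metric from the paper.

```python
# Average success rates (%) from the ablation table above.
avg_sr = {
    "baseline": 34.1,
    "w/o critic filtering": 31.0,
    "w/o hybrid retrieval": 40.5,
    "w/o real-time guidance": 42.7,
    "full": 46.6,
}

# Gain of the full system over the no-memory baseline, in percentage points.
full_gain = avg_sr["full"] - avg_sr["baseline"]


def retained_share(variant: str) -> float:
    """Fraction of the full gain that a variant still delivers."""
    return (avg_sr[variant] - avg_sr["baseline"]) / full_gain


for name in ("w/o critic filtering", "w/o hybrid retrieval", "w/o real-time guidance"):
    print(f"{name}: {retained_share(name):.0%} of the full gain")
```

Removing hybrid retrieval keeps roughly half of the gain, removing real-time guidance keeps roughly two thirds, and removing the critic turns the gain negative.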

The supporting analyses test whether the memory base is plausible

The paper includes several analyses beyond the main benchmark tables. They are useful, but they should be read with the right labels.

Evidence item	Likely purpose	What it supports	What it does not prove
AndroidWorld and AndroidLab benchmark tables	Main evidence	EchoTrail-GUI improves task success and operating metrics under benchmark settings	Production reliability across arbitrary enterprise software
Component ablation table	Ablation	Critic filtering, retrieval, and real-time guidance each matter	That the same component weights are optimal everywhere
$K$ sensitivity test	Robustness / configuration	Two retrieved memories perform best among tested values	Universal optimality of $K=2$
UMAP comparison of generated and real task intents	Exploratory validation	Self-explored trajectories are semantically aligned with benchmark tasks and diverse	That all generated trajectories are useful or safe
High-quality trajectory rate over exploration stages	Mechanism analysis	Real-time guidance improves exploration quality over stages	That exploration converges in all apps or domains
Critic-human agreement and alternative critics	Reliability check	Critic labels are reasonably aligned with human judgment and can use open-weight alternatives	That critic errors are harmless in high-risk workflows

The UMAP figure is particularly easy to misuse. It shows generated task intents overlapping with AndroidLab ground-truth instructions while also extending into new semantic areas. That supports the claim that self-exploration is not simply random button pressing. It is producing task-like trajectories. But UMAP is a visualization, not a guarantee of operational safety. It is evidence of plausible coverage, not certification.

The high-quality trajectory-rate figure shows improvement across four exploration stages for representative apps, with the paper noting roughly 20 percentage-point gains for more complex apps like OsmAnd and VLC. That supports the dual-database story: a short-term processing database, kept separate from the permanent memory database, helps the exploration agent avoid repeating mistakes and generate better permanent memories over time.

The critic reliability analysis is also pragmatic. On a stratified sample of 100 trajectories, the critic’s decisions achieve Cohen’s $\kappa \approx 0.72$ against human expert labels, which the paper interprets as substantial agreement. Alternative critics are tested against the Gemini baseline. Qwen3-VL-235B-A22B reaches accuracy 0.84, $\kappa=0.68$, precision 0.77, and F1 0.86; GPT-4o mini and GPT-4o are close behind. The takeaway is not that the critic is perfect. It is that the quality-control role does not appear to require one proprietary model in this setup.

For business builders, that distinction matters. A critic can be swapped, audited, or specialized. A memory architecture that depends on one vendor’s private model is a procurement headache wearing a lab coat.
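For anyone auditing a swapped-in critic, Cohen's kappa is straightforward to compute from paired labels: it is observed agreement corrected for the agreement expected by chance. The sketch below is a generic implementation with made-up labels, not the paper's evaluation data.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical keep (1) / reject (0) decisions on ten trajectories.
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
critic = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(human, critic)
```

In this toy example the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is 0.6. Raw agreement of 80% sounds better than it is once chance is removed, which is why the paper reports kappa rather than accuracy alone.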

What this means for enterprise GUI automation

The direct result is benchmark performance on Android environments. The business inference is broader but must be handled carefully.

Many organizations still run important work through GUI-heavy systems: ERP screens, CRM dashboards, insurance portals, tax systems, vendor forms, banking interfaces, government platforms, internal admin tools, and legacy software that nobody wants to touch because the person who understands it retired in 2018 and now raises orchids.

GUI agents are attractive because they can automate across systems without waiting for clean APIs. But they are also risky because real software is messy. Buttons move. Login states expire. Pop-ups appear. Form validation is inconsistent. One “almost correct” click may create a bad record rather than a harmless failure.

EchoTrail-GUI points toward a more realistic architecture for enterprise use: not one monolithic model, but a memory-augmented operating system around the model.

What the paper directly shows	Cognaptus business inference	What remains uncertain
Successful Android trajectories can be self-generated and critic-filtered	Firms could build private libraries of successful workflow traces from their own software environments	Enterprise processes may require human approval before traces become reusable memory
Abstracted trajectories reduce token load by around 90% versus screenshot histories	SOP-like memory can be cheaper and more portable than raw screen recordings	Abstractions may omit compliance-critical visual details
Hybrid retrieval plus bounded injection improves benchmark performance	Process memory should be retrieved selectively, not dumped into prompts	Retrieval relevance may degrade when task language is ambiguous or internal jargon-heavy
Bad memory performs worse than no memory	Memory governance is a core safety function, not an optional cleanup step	Automated critics may miss subtle business errors
Gains occur without model retraining	A memory layer may be faster to update than fine-tuned model weights	Long-term maintenance cost of memory databases is not measured

This is where the paper becomes more interesting than another agent benchmark. The practical prize is not merely “better clicking.” It is a route toward reusable operational experience.

A company could imagine a controlled system where agents explore sandbox versions of business applications, generate candidate workflow traces, have critics and human reviewers approve them, and store compact procedural memories. At runtime, the agent retrieves the closest approved memories and uses them as guidance. Over time, the memory library becomes a process asset.

That sounds less glamorous than “autonomous agent that does everything.” Good. Less glamorous systems are sometimes the ones that survive contact with operations teams.

The limits are not footnotes; they define deployment shape

EchoTrail-GUI’s results are promising, but the deployment boundary is clear.

First, the experiments are on AndroidWorld and AndroidLab. These are meaningful benchmarks, but they are not the full diversity of enterprise desktop, browser, mainframe, SaaS, and hybrid workflows. A mobile benchmark does not prove reliability in a finance department’s five-system reconciliation ritual.

Second, the backbones are GPT-4o and Qwen2.5-VL-72B-Instruct. The framework is presented as model-agnostic, and the mechanism supports that claim conceptually, but empirical evidence is still limited to these tested backbones for main inference.

Third, the paper focuses on successful task trajectories. Real enterprise automation also needs negative memory: known traps, forbidden actions, compliance constraints, audit flags, and cases where a human must take over. EchoTrail-GUI’s processing database uses failed in-progress trajectories during exploration, but the permanent memory database is built around high-quality completed trajectories. That is sensible for guidance. It is not a full risk-control layer.

Fourth, the critic is useful but not legally accountable. Cohen’s $\kappa \approx 0.72$ is substantial agreement, not divine judgment. In low-risk workflows, that may be enough to reduce manual labeling. In regulated workflows, it should become a pre-filter, not the final authority.

Finally, the system improves success rates but does not eliminate failure. The best reported AndroidWorld success rate is 51.7% with GPT-4o. That is strong for the benchmark and still far from unattended production autonomy. The sane deployment path is supervised automation, exception handling, and memory governance—not handing an agent the keys and hoping the vector database has good manners.

The deeper lesson: agents need institutional memory

The most useful idea in EchoTrail-GUI is not Android-specific. It is architectural.

A GUI agent should not be treated as a stateless genius. Stateless genius is expensive, unreliable, and frankly exhausting. The more scalable design is an agent surrounded by institutional memory: successful traces, quality filters, retrieval mechanisms, critics, and explicit boundaries on what gets reused.

That architecture changes how businesses should think about automation ROI.

The value is not only in completing more tasks today. It is in accumulating reusable operational knowledge tomorrow. Every successful workflow can become a memory candidate. Every rejected workflow can reveal a gap in exploration, retrieval, or quality control. Every recurring task can become less novel.

But the paper also warns us against the lazy version of that dream. Memory is not a folder of screenshots. It is not “RAG for clicks.” It is not stuffing the prompt with every vaguely similar past action and calling the result experience.

Good memory is selective. Bad memory is sabotage with a timestamp.

EchoTrail-GUI’s contribution is to show a concrete route from stateless GUI action to experience-aware operation: self-exploration creates traces, a critic filters them, retrieval selects the relevant ones, and memory injection turns them into step-by-step guidance. The benchmark gains matter because they validate the loop. The ablations matter more because they show why the loop fails when one part is removed.

For enterprise AI, that is the quieter but more durable lesson. The next generation of GUI agents may not win by seeing more pixels or sounding more confident. They may win by remembering what worked, forgetting what didn’t, and knowing that two good precedents beat three mediocre ones.

Annoyingly human. Which, in this case, is progress.

Cognaptus: Automate the Present, Incubate the Future.

¹ Runze Li, Yuwen Zhai, Bo Xu, Liwu Xu, Nian Shi, Wei Zhang, Ran Lin, and Liang Wang, “EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration,” arXiv:2512.19396v3, 2026. https://arxiv.org/abs/2512.19396
