TL;DR for operators

Software automation usually breaks at the interface between “the process is known” and “the application has changed again.” A button moves. A settings panel is renamed. A vendor ships a redesign with the emotional restraint of a toddler near glitter. The usual answer is more labelled demonstrations, more brittle scripts, or more human babysitting.

SEAgent attacks that bottleneck from the opposite direction: let the agent explore unfamiliar desktop software, build its own task curriculum, learn which actions helped or hurt, and then use those experiences to retrain itself.[1] The paper’s important claim is not that GUI agents are solved. They are not. The strongest SEAgent model reaches 34.5% average success on five OSWorld software environments, versus 74.5% reported human performance. That is progress, not coronation.

The useful operator takeaway is narrower and more valuable: SEAgent sketches a training loop for lowering the cost of adapting computer-use agents to specialised software when human-labelled workflows are scarce. It combines four moving parts:

| Component | What it does | Operational meaning |
| --- | --- | --- |
| World State Model | Judges full trajectories and identifies failed or redundant steps | Turns messy GUI interaction into trainable feedback |
| Curriculum Generator | Uses observed GUI changes to generate harder tasks | Converts exploration into structured practice |
| GRPO on positive steps | Reinforces actions that contribute to progress | Makes successful behaviour more likely |
| Adversarial imitation on failure steps | Pushes the policy away from actions that caused errors | Teaches the agent from mistakes, not just victories |

The business interpretation is therefore not “autonomous agents can now operate any software.” That would be adorable. The better interpretation is: if your organisation needs agents to learn niche, frequently changing interfaces, the durable asset may be the adaptation loop, not the base model.

The real bottleneck is not clicking; it is knowing whether the click mattered

A GUI agent can look impressive while still being operationally useless. It can move the mouse, click menus, type text, and produce a transcript dense enough to frighten a compliance department. None of that proves it knows whether the task is progressing.

This is the reward problem in computer use. In a mathematical task, the reward can often be checked against a final answer. In a GUI task, the final screen may lie by omission. A screen saying “saved successfully” does not confirm that the agent chose the right file format, used the required template, preserved the right metadata, or avoided a redundant detour that would quietly destroy productivity at scale.

SEAgent’s first meaningful design choice is to judge the whole trajectory, not just the final screenshot. Its World State Model takes sequences of screenshots from an episode and outputs captions, reasoning, and a structured judgment that can include correctness, redundancy, and the first error step. That last field matters. If the agent fails at step nine, punishing all nine steps equally is lazy accounting. Some early steps may have been correct; one later step may have derailed the workflow. SEAgent tries to separate them.
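
To make the accounting point concrete, here is a minimal sketch of per-step labelling from a trajectory-level verdict. The `TrajectoryJudgment` container, its field names, and the `"ignored"` label are illustrative assumptions, not the paper's schema; in particular, dropping the steps that come after the first error is a choice made here for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrajectoryJudgment:
    """Hypothetical container for a judge's verdict on one episode."""
    success: bool                    # did the episode complete the task?
    redundant_steps: List[int]       # 0-based indices judged redundant
    first_error_step: Optional[int]  # 0-based index of the first error, or None


def label_steps(num_steps: int, judgment: TrajectoryJudgment) -> List[str]:
    """Give each step its own label instead of one label for the whole episode."""
    err = judgment.first_error_step
    labels = []
    for i in range(num_steps):
        if err is not None and i == err:
            labels.append("negative")   # the step that derailed the task
        elif err is not None and i > err:
            labels.append("ignored")    # assumption: no clear signal after the error
        elif i in judgment.redundant_steps:
            labels.append("ignored")    # harmless but unhelpful detour
        else:
            labels.append("positive")   # contributed to progress
    return labels


# A five-step episode that wanders at step 1 and breaks at step 3.
verdict = TrajectoryJudgment(success=False, redundant_steps=[1], first_error_step=3)
print(label_steps(5, verdict))
```

The early correct steps keep their credit, and only the step flagged as the first error is marked for punishment.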

The authors fine-tune this judge from Qwen2.5-VL-7B using GPT-4o-labelled trajectories from Chrome in OSWorld, retaining 860 high-quality annotated trajectories whose judgments matched the benchmark’s ground-truth evaluation. They also add 1,000 before/after screenshot pairs labelled with change descriptions. This is where the “self-evolving” label needs adult supervision. The target software learning loop is autonomous, but the judge itself is bootstrapped with external labelled supervision. The paper does not hide this. Readers should not either.

The reward-model benchmark is best read as a diagnostic test for the system’s foundation. On full-process screenshots, the fine-tuned World State Model with change-description co-training reaches 71.6 precision on AgentRewardBench, 73.9 on OSWorld-Full, and 69.3 on the professional/office subset. GPT-4o with full-process screenshots reaches 72.1, 74.6, and 70.4 respectively. The interesting point is not that the open model beats GPT-4o. It does not. The interesting point is that a 7B open model, after targeted trajectory-judgment training, gets close enough to serve as the training-time reward model.
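
For readers who want the metric spelled out: precision is the share of trajectories the judge calls successful that really are successful. A false positive is the costly error for a training-time judge, because it rewards a failed trajectory. The sketch below uses made-up verdicts and does not reproduce the benchmark's scoring protocol.

```python
from typing import List


def precision(judged_success: List[bool], truly_success: List[bool]) -> float:
    """Fraction of judged-successful trajectories that are truly successful."""
    pairs = list(zip(judged_success, truly_success))
    true_pos = sum(1 for j, t in pairs if j and t)
    false_pos = sum(1 for j, t in pairs if j and not t)
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Illustrative verdicts only: three "success" calls, one of them wrong.
judge_says = [True, True, False, True]
ground_truth = [True, False, False, True]
print(round(precision(judge_says, ground_truth), 3))
```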

That is the first business-relevant mechanism: SEAgent tries to make adaptation less dependent on expensive, repeated human annotation for every new software environment by creating a reusable judge that can turn exploration into feedback.

The curriculum is not a syllabus; it is memory under pressure

Once an agent has a judge, it still needs tasks worth attempting. Random exploration is a romantic idea in the same way that letting interns discover the ERP system by clicking everything is romantic. It may produce knowledge. It may also produce an invoice from IT.

SEAgent’s Curriculum Generator is the second major mechanism. It starts with the initial GUI state of unfamiliar software. The World State Model describes the interface: menus, buttons, visible elements, and state changes after actions. The Curriculum Generator then creates initial tasks and a software guidebook. As the actor tries tasks, the judge evaluates each trajectory and describes what changed on screen. The generator updates its guidebook and proposes harder tasks.
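
The loop described above can be sketched in a few lines. Everything here is a stand-in: `attempt`, `judge`, and `propose` are hypothetical callables for the actor, the World State Model, and the Curriculum Generator, and the policy update that would follow each round is left out.

```python
from typing import Callable, List, Tuple


def run_curriculum(
    initial_tasks: List[str],
    attempt: Callable[[str], List[str]],                   # task -> screenshots (stub)
    judge: Callable[[str, List[str]], Tuple[bool, str]],   # -> (success, change description)
    propose: Callable[[List[str], List[str]], List[str]],  # (guidebook, solved) -> next tasks
    rounds: int,
) -> Tuple[List[str], List[str]]:
    """Alternate between attempting tasks and growing the guidebook."""
    guidebook: List[str] = []   # accumulated descriptions of observed GUI changes
    solved: List[str] = []      # tasks the judge marked as successful
    tasks = list(initial_tasks)
    for _ in range(rounds):
        for task in tasks:
            screenshots = attempt(task)
            success, change = judge(task, screenshots)
            guidebook.append(change)
            if success:
                solved.append(task)
        # A real system would also update the actor's policy here (omitted).
        tasks = propose(guidebook, solved)
    return guidebook, solved


# Toy stubs: short tasks "succeed", and the generator makes the last solved task harder.
book, done = run_curriculum(
    ["add a rectangle"],
    attempt=lambda task: [f"screenshot after: {task}"],
    judge=lambda task, shots: (len(task) < 20, f"observed: {task}"),
    propose=lambda guide, solved: [t + " in green" for t in solved[-1:]],
    rounds=2,
)
print(book, done)
```

The design point is that the generator only ever sees the guidebook and the solved list, which is why the quality of the judge's change descriptions matters so much.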

The paper’s example from LibreOffice Impress is useful because it is boring in exactly the right way. The agent learns to “add a rectangle.” The judge observes that a property panel appears and that options such as fill, line, colour, width, transparency, and corner style are visible. The generator then creates harder tasks: add a green rectangle, add a red rectangle with 50% transparency, and so on. This is not magic. It is structured exploitation of newly observed affordances.

That detail matters for enterprise automation. Most organisations do not lack workflows because nobody knows that buttons exist. They lack scalable adaptation because workflow knowledge is scattered across training videos, SOPs, vendor documents, support tickets, and the memory of the one employee everyone messages at 11:43 p.m. A guidebook-style curriculum is a crude but plausible analogue of institutional software knowledge: observe capability, document it, create tasks that test it, and update the policy.

The appendix comparison with instruction-generation strategies sharpens the point. Against adapted versions of WebRL and NNetNav task-generation methods, SEAgent’s guidebook-based Curriculum Generator performs best on Celestia, a more out-of-domain scientific application. With Qwen2.5-72B, it reaches 9.09% on Celestia versus 0.00% for WebRL and 0.00% for NNetNav. With Gemini-2.5-Pro-thinking, it reaches 12.12% versus 3.03% and 5.05%. These are not high success rates. They are evidence for a more modest claim: systematic exploration helps when the software is genuinely unfamiliar.

On VSCode, the picture is more mixed. NNetNav with Gemini reaches 43.6%, while the SEAgent Curriculum Generator with Gemini reaches 42.3%. The authors interpret this as a trade-off: reverse instruction generation can exploit known functionality efficiently, while guidebook-based exploration better supports unfamiliar environments. That is a useful distinction. In business terms, exploitative task generation is better when you already know the workflow; exploratory curriculum generation is better when you are still discovering the application surface.

SEAgent learns from success, but the failure signal does the interesting work

SEAgent’s policy learning has two sides.

For actions judged positive, the system applies Group Relative Policy Optimization (GRPO), using action-specific reward functions. Clicks and mouse movements are scored by coordinate distance. Drag and selection actions use intersection-over-union for regions. Typing, hotkeys, key presses, scrolling, copy, and paste use character-level BLEU-style comparisons. Finish and wait actions get fixed positive rewards.

This is practical engineering, not theological revelation. GUI actions differ. A click is wrong by being in the wrong place. A typed string is wrong by containing the wrong characters. A drag is wrong by selecting the wrong region. Treating them all as generic text outputs would be tidy, and also deeply unhelpful.
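
A rough sketch of what action-specific scoring can look like follows. The function names, the 100-pixel distance scale, and the clipped character bigram precision (a simplified stand-in for a BLEU-style score) are all assumptions for illustration, not the paper's definitions.

```python
import math
from collections import Counter
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def click_reward(pred: Point, target: Point, max_dist: float = 100.0) -> float:
    """1.0 on target, falling linearly to 0.0 at max_dist pixels (assumed scale)."""
    dist = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return max(0.0, 1.0 - dist / max_dist)


def iou_reward(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned regions."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def char_overlap_reward(pred: str, target: str, n: int = 2) -> float:
    """Clipped character n-gram precision: a simplified stand-in for BLEU."""
    def grams(s: str) -> Counter:
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    pred_grams, target_grams = grams(pred), grams(target)
    total = sum(pred_grams.values())
    if total == 0:  # string shorter than n: fall back to exact match
        return 1.0 if pred == target else 0.0
    matched = sum(min(count, target_grams[g]) for g, count in pred_grams.items())
    return matched / total


print(click_reward((0, 0), (30, 40)))          # 50 px away from the target
print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))  # partially overlapping regions
print(char_overlap_reward("ctrl+s", "ctrl+s"))
```

Each function returns a value in [0, 1], so the three action families can feed the same optimiser without pretending to be the same kind of mistake.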

The more distinctive part is what SEAgent does with bad actions. If the World State Model identifies a first error step, actions before that step can be treated as useful, while the failure action is labelled negative. The policy is then trained with an adversarial imitation loss that pushes it away from the failure-inducing action. In plain English: do more of what worked, and explicitly become less like the move that broke the task.
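
The two sides of the update can be sketched as follows. The group-relative advantage is the standard GRPO normalisation: each sampled action's reward is compared with the mean and standard deviation of its group. The `push_away_term` function is a toy penalty written for this article, not the paper's adversarial imitation loss; it only illustrates the direction of the pressure.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalisation of rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def push_away_term(logp_policy: float, logp_reference: float) -> float:
    """Toy penalty for a failure action (assumption, not the paper's loss).

    It is positive when the policy assigns the failure action more probability
    than the reference model does, so minimising it lowers that probability.
    """
    return logp_policy - logp_reference


# Four sampled actions for the same step, scored by an action-specific reward.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))

# The policy currently likes the failure action more than the reference does.
print(push_away_term(math.log(0.5), math.log(0.25)))
```

Better-than-average samples get positive advantages and are reinforced, worse-than-average ones are discouraged, and the failure term adds explicit pressure against the step the judge flagged.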

That distinction shows up in the VSCode ablation. The UI-TARS baseline, before any SEAgent training, has a 13.0% success rate. Using Qwen2.5-VL-72B as a reward model with GRPO yields only 10.1%. Using Qwen2.5-VL-72B with supervised fine-tuning yields 11.6%. With the World State Model and GRPO, success jumps to 34.8%. Adding adversarial imitation raises it further to 37.7%.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Reward-model benchmark | Main evidence for judge reliability | Full-trajectory judging is stronger than final-state-only judging for GUI tasks | The judge is universally reliable in enterprise software |
| OSWorld five-software evaluation | Main performance evidence | SEAgent improves over UI-TARS and adapted RL baselines | General-purpose software mastery |
| VSCode ablation | Component attribution | World State Model, RFT/GRPO, and failure-action punishment each matter | Same component weights will hold across all software |
| Curriculum-generator comparison | Robustness / exploratory extension | Guidebook-based task generation helps unfamiliar software exploration | It is always better than exploitative task generation |
| TARS-1.5 / ScienceBoard test | Robustness / out-of-domain check | SEAgent appears more useful when the base model has not already seen the environment | Broad scientific workflow competence |

The ablation is not a second thesis. It is a stress test of the mechanism. The important lesson is that reward quality dominates. A bigger or more capable generic vision-language model is not enough if it cannot judge full GUI trajectories reliably. Once the reward model improves, reinforcement fine-tuning becomes useful; once failure actions are identified, punishing them adds further gain. Elegant? Reasonably. Fragile? Also reasonably.

The headline number is 34.5%, and no, that is not human-level automation

The main OSWorld evaluation covers five professional or office-related software environments: VSCode, GIMP, LibreOffice Impress, VLC, and LibreOffice Writer. The initial actor is UI-TARS-7B-DPO, which averages 11.3% success across these tasks.

SEAgent improves performance in stages. Specialist RL agents trained separately for each software average 32.2%. Direct generalist RL across all software averages 30.6%. General supervised fine-tuning reaches 27.9%. The best result comes from specialist-to-generalist training: train five software specialists, distil 3.5K successful trajectories with reasoning traces into a new generalist base model, then refine it with RL across all five software environments. That model averages 34.5%.

The result is genuinely interesting because the generalist beats the ensemble of specialists. Usually, one expects a specialist to win in its own domain and a generalist to trade peak performance for breadth. SEAgent reports the opposite: the specialist-to-generalist model scores 40.5% on VSCode, 42.3% on GIMP, 22.7% on Impress, 35.3% on VLC, and 31.8% on Writer. The specialist ensemble scores 37.7%, 38.5%, 22.0%, 33.3%, and 29.0% on the same applications, in the same order.
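
A quick arithmetic check on the figures quoted above shows the generalist ahead on every application, by margins ranging from 0.7 to 3.8 points. The unweighted means come out at 34.52 and 32.1, close to the reported 34.5% and 32.2%; a small difference like that would be expected if the reported averages are weighted by the number of tasks per application, which is an assumption here.

```python
apps = ["VSCode", "GIMP", "Impress", "VLC", "Writer"]
generalist = [40.5, 42.3, 22.7, 35.3, 31.8]    # specialist-to-generalist model
specialists = [37.7, 38.5, 22.0, 33.3, 29.0]   # one specialist per application

# Per-application margin of the generalist over the matching specialist.
deltas = {app: round(g - s, 1) for app, g, s in zip(apps, generalist, specialists)}

# Simple unweighted means across the five applications.
mean_generalist = round(sum(generalist) / len(generalist), 2)
mean_specialists = round(sum(specialists) / len(specialists), 2)

print(deltas)
print(mean_generalist, mean_specialists)
```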

This suggests that trajectories from separate software specialists contain transferable interaction knowledge. Menus, panels, document states, selection patterns, save flows, modal windows, toolbar affordances, and correction strategies are not identical across applications, but they rhyme. A generalist trained after specialist experience may inherit enough shared structure to outperform the isolated experts. Apparently, the agent also benefits from cross-training. Somewhere, a productivity consultant just nodded solemnly.

Still, magnitude matters. Human performance in the same table is 74.5% overall. The best SEAgent model is less than half that. Even the strongest software-specific scores remain in the low 40s. On Impress, the best model reaches 22.7%. That is not a deploy-and-forget operator. That is a research prototype showing a better adaptation loop.

For business readers, the right comparison is not against humans today. It is against the cost of building and maintaining labelled GUI automation datasets for every piece of niche software tomorrow.

The specialist-to-generalist result is the paper’s quiet strategic clue

Many agent systems chase the generalist directly. Train one model across everything; hope scale makes the mess beautiful. SEAgent’s experiments suggest a more staged path.

First, train software specialists through autonomous exploration and reinforcement from experience. Each specialist learns a particular environment. Then collect successful trajectories from those specialists. Use them to supervise a new generalist. Finally, run another round of reinforcement across all environments.

This resembles how enterprises actually build operational competence. A company does not begin with a universal “business process genius.” It begins with people who know procurement, finance, design tools, customer support consoles, or legacy reporting systems. General operational intelligence often emerges from combining specialised practice, not from pretending all workflows are one workflow with different wallpaper.

The paper’s specialist-to-generalist result therefore has a practical implication: enterprise GUI automation may benefit from local adaptation first and consolidation later. Instead of deploying one agent across the whole software estate and hoping it learns everything, a firm could train or fine-tune agents around high-value software clusters, collect successful traces, and distil them into a broader internal agent. The broader model would not replace governance, testing, or access control. It would reduce repeated adaptation cost.

That inference goes beyond what the paper directly proves. SEAgent demonstrates the pattern on five benchmark environments. Enterprise software introduces permissions, audit logs, dirty data, custom plugins, network latency, vendor-specific quirks, and users who click “Cancel” with the confidence of Roman emperors. But the direction is credible: specialist experience may be a better source of generalist competence than generic pretraining alone.

What the paper directly shows, and what Cognaptus infers

The distinction matters, because agent papers tend to invite over-reading. A benchmark improvement can easily be inflated into a platform strategy by people with slide decks and suspiciously round TAM numbers.

| Layer | Claim | Status |
| --- | --- | --- |
| Direct paper result | SEAgent improves UI-TARS-7B-DPO from 11.3% to 34.5% overall on five OSWorld software environments | Shown in the paper’s main evaluation |
| Direct paper result | Specialist-to-generalist training outperforms direct general RL and the specialist ensemble | Shown in the OSWorld table |
| Direct paper result | The fine-tuned World State Model approaches GPT-4o precision for full-trajectory judgment in the reported settings | Supported by the reward-model benchmark |
| Cognaptus inference | The loop could reduce adaptation cost for niche or changing enterprise software | Plausible, but not directly tested in enterprise systems |
| Cognaptus inference | Specialist-first training may be a practical deployment pattern for internal GUI agents | Plausible, based on the reported transfer result |
| Still uncertain | Whether the method scales to long, security-sensitive, multi-hour workflows | Explicitly unresolved |

This is the article’s central point: SEAgent is not mainly a new “agent can use software” story. It is a proposal for making software-use agents learnable after the software changes. That is a subtler problem, and a more useful one.

The enterprise value is cheaper adaptation, not magical autonomy

In business settings, GUI automation fails for tedious reasons. The process has exceptions. The interface changes. The software is internally customised. The workflow crosses tools. The agent needs to know not only where to click, but whether the document is now in the right state for the next human or system.

SEAgent’s mechanism maps to three operational problems.

First, it reduces dependence on labelled target-software demonstrations. The system still needs a trained judge, but the target application training loop can generate tasks, attempt them, judge them, and refine the policy without human-labelled demonstrations for every new task. For companies with long-tail internal tools, this is the economic opening.

Second, it creates a software guidebook as a by-product of exploration. That guidebook is not a full SOP library. It is a machine-generated, experience-updated representation of what the software can do and how visible UI states change. In a production version, this could become useful not only for training agents but for documenting automation coverage, debugging failures, and identifying unknown interface regions.

Third, it separates failure diagnosis from final-task outcome. A human supervisor does not want to know merely that the agent failed. They want to know where it failed, whether earlier steps were valid, and which action should be suppressed. The World State Model’s first-error-step framing points toward better audit trails. That is essential if GUI agents are ever expected to touch real workflows rather than toy demos and carefully watered benchmarks.

The likely early use cases are not financial approvals, legal filings, or healthcare record changes. Those domains demand far stronger guarantees. More plausible early targets include internal productivity tools, document formatting, report assembly, multimedia editing, low-risk data entry, software QA workflows, training sandboxes, and back-office tasks where mistakes are recoverable and monitoring is feasible.

Boundaries: this is autonomous practice, not autonomous trust

The paper’s limitations are not cosmetic. They cut straight into deployment.

The first boundary is judge dependence. SEAgent’s learning loop is bounded by the World State Model. If the judge misidentifies success, misses a critical error, or rewards a shortcut that looks right visually but violates business logic, the actor can learn the wrong lesson. In benchmark settings, rule-based evaluation and curated tasks help. In enterprise software, “correct” may depend on hidden state, permissions, external records, policy constraints, or downstream reconciliation. Screenshots are not the whole world.

The second boundary is workflow length. The authors note that although the tested software includes tools such as LibreOffice and GIMP, tasks are still relatively simple, typically taking a human expert fewer than 20 steps. Multi-hour workflows are a different species. Errors compound. Context decays. Intermediate states multiply. A judge must reason across far more actions, more hidden assumptions, and more opportunities for silent failure.

The third boundary is environment cleanliness. OSWorld is designed for benchmarking. Real enterprise desktops are not. They contain pop-ups, lag, custom themes, password prompts, half-installed plugins, spreadsheet files named “final_final_v7_REAL.xlsx,” and occasionally a browser tab that should absolutely not be there. A method that works in controlled GUI environments needs hardening before it becomes operational infrastructure.

The fourth boundary is governance. Autonomous exploration is powerful precisely because it tries things. In production systems, “trying things” needs sandboxing, permission limits, logging, rollback, and human review. Otherwise the agent is not learning; it is vandalising with a research grant.

What to watch next

The next useful SEAgent-style research will not merely report a higher OSWorld score. It will answer harder operational questions.

Can the judge incorporate non-visual state, such as file contents, database records, API responses, or audit logs? Can the curriculum respect permission constraints and avoid unsafe actions during exploration? Can the system learn from partial human review rather than fully labelled trajectories? Can specialist-to-generalist transfer work across proprietary enterprise applications, not just open or benchmarked desktop software? Can the generated guidebook be inspected, corrected, and versioned like a real operational artifact?

The appendix sensitivity analysis hints at another practical issue: context management. Performance improves as more generated tasks are added, plateauing around 100 tasks in the VSCode sensitivity test. For change descriptions, 50 to 100 descriptions work well, while 200 degrades performance, likely because the generator receives too much context. More memory is not automatically better. Sometimes it is just a longer meeting.
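
One way to act on that finding is to cap what the generator is shown. The sketch below keeps the most recent unique change descriptions up to a budget. The recency-and-deduplication rule and the function name are assumptions made for illustration; the paper reports the sensitivity result, not this selection policy.

```python
from typing import List


def select_change_descriptions(changes: List[str], budget: int = 100) -> List[str]:
    """Keep at most `budget` unique descriptions, preferring the most recent."""
    seen = set()
    latest_first = []
    for description in reversed(changes):
        if description not in seen:      # drop exact repeats of the same change
            seen.add(description)
            latest_first.append(description)
    kept = latest_first[:budget]
    return list(reversed(kept))          # restore chronological order


# 300 logged observations, many of them repeats of the same 150 changes.
log = [f"change {i % 150}" for i in range(300)]
context = select_change_descriptions(log, budget=100)
print(len(context), context[0], context[-1])
```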

That point is worth taking seriously. Enterprise automation systems will not be short of logs, screenshots, documents, and traces. They will be short of disciplined summarisation and feedback selection. SEAgent’s guidebook idea is promising because it compresses exploration into a task-generation memory. Its future usefulness will depend on whether that memory remains clean as the environment grows messier.

The useful lesson is the loop, not the label

“Self-evolving” is an attractive phrase. It sounds like a model quietly improving itself overnight while humans sleep, which is exactly the kind of sentence that causes procurement departments to lose their natural scepticism. SEAgent is more grounded than that.

The paper shows a computer-use agent learning unfamiliar software through a loop: observe the interface, generate tasks, execute them, judge trajectories, label good and bad actions, update the policy, expand the curriculum, then consolidate specialist experience into a stronger generalist. The reported improvement is substantial within the benchmark. The remaining gap to human performance is also substantial. Both facts can be true. Annoying, but useful.

For operators, SEAgent should be read as an architecture for adaptation. It does not prove that GUI agents are ready to run enterprise workflows unsupervised. It does suggest that the path to useful software agents may run through autonomous practice environments, trajectory-level judges, and specialist-first learning.

The old automation model asked humans to describe the workflow in advance. SEAgent’s model asks the agent to discover enough of the software to practise, fail, and improve. That is a meaningful shift. Not a revolution. More like replacing hand-labelled babysitting with structured apprenticeship. Less glamorous, more deployable. A rare combination.

Cognaptus: Automate the Present, Incubate the Future.


[1] Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang, “SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience,” arXiv:2508.04700, 2025.