A workflow breaks in a boring way.
The agent found the website yesterday. Today the button moved. Yesterday it parsed the file path correctly. Today the file name has a space, a date, and some human creativity sprinkled in for punishment. Yesterday the chart script worked. Today the data source changed its column names because apparently stability was not on the roadmap.
The usual story says an AI agent should “learn from experience.” In many systems, that means saving a reflection: what worked, what failed, what to try next time. Useful, yes. But also a little like telling a junior analyst, “Remember to be careful with Excel,” and calling that process automation.
The arXiv paper “AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse” makes a sharper move: it argues that agents should not merely store memories of successful work; they should store executable subagents that can run again, be modified, and eventually be exported into other agent systems.[1]
That is the real shift: from memory to machinery. Not a better diary. A growing toolbox.
AgentFactory saves the thing that runs, not the story about it
Most self-improving agent designs still treat experience as text. A system solves a task, records a lesson, and later retrieves that lesson when a similar task appears. This is sensible if the task is mostly reasoning. It is less convincing when the task is procedural: open a web page, extract structured data, generate a chart, book something through a web interface, transcribe audio, or produce a file.
For procedural work, the hard part is not always knowing what should happen. The hard part is making the same thing happen again without forcing the main model to reconstruct every step from scratch.
AgentFactory’s answer is simple enough to be dangerous: preserve successful task solutions as Python subagents. A subagent is not just a note saying “use regex if path parsing fails.” It is executable code that performs a specialized function, accompanied by documentation describing what it does, what parameters it expects, and how it should be invoked.
This gives the paper its core mechanism:
| Phase | What AgentFactory does | What gets accumulated | Business interpretation |
|---|---|---|---|
| Install | Builds subagents from scratch when no suitable capability exists | Initial executable skills | Early client work becomes reusable operating machinery |
| Self-Evolve | Reuses existing subagents, diagnoses failures, modifies code, and validates the new version | More robust versions of existing skills | Failure becomes maintenance input, not just another billable rerun |
| Deploy | Exports mature subagents as standalone Python modules with documentation | Portable capabilities | Useful automation can move across agent stacks instead of being trapped in one framework |
The important word is not “agent.” The important word is “executable.” AgentFactory treats solved work as a reusable operational asset. The system still relies on a language model, but the model’s job changes. It becomes less of a heroic improviser and more of an orchestrator that builds, selects, runs, and repairs smaller workers.
A less glamorous description would be: this is code maintenance wrapped inside an agent loop. That description is not an insult. For business automation, it may be exactly the point.
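To make “executable plus documented” concrete, here is a minimal sketch of what a saved subagent might look like. Everything in it is hypothetical: the `column_normalizer` name, the `ALIASES` table, and the `SKILL_DOC` dictionary (a stand-in for the accompanying documentation, whose actual format is not reproduced here) are illustrative assumptions, not code from the paper.

```python
"""Hypothetical subagent: map drifting column names back to canonical ones."""
import json

# Stand-in for the documentation that travels with the code (assumed fields).
SKILL_DOC = {
    "name": "column_normalizer",
    "description": "Map known column-name variants to canonical names.",
    "parameters": {"columns": "list of header strings"},
    "invoke": "run(columns) -> list of canonical names",
}

# Known variants seen in earlier runs (illustrative values only).
ALIASES = {"rev (usd)": "revenue", "revenue_usd": "revenue", "area": "region"}


def run(columns):
    """Return canonical names; unknown columns pass through trimmed and lowercased."""
    normalized = []
    for col in columns:
        key = col.strip().lower()
        normalized.append(ALIASES.get(key, key))
    return normalized


if __name__ == "__main__":
    # Small demo so the module also works as a standalone script.
    print(json.dumps(run(["Rev (USD)", "Area"])))
```

The point of the sketch is the shape, not the logic: a callable entry point, a description of its parameters, and nothing that requires the original conversation to make sense of it.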
The architecture turns orchestration into skill production
AgentFactory has three main architectural pieces: a Meta-Agent, a Skill System, and a Workspace Manager.
The Meta-Agent is the coordinator. It decomposes a user task into sub-problems, decides which tools are needed, creates subagents when necessary, runs them, and later modifies them if feedback suggests they are too narrow or brittle. This is not unusual by itself; many agent systems already have some version of a planner or orchestrator.
The more interesting part is the Skill System. AgentFactory separates skills into three levels.
First, there are meta skills: operations such as creating a subagent, running a subagent, modifying a subagent, listing saved subagents, viewing subagent code, and finishing a task. These are the lifecycle controls.
Second, there are tool skills: web search, web reading, browser automation, and shell command execution. These are the primitive capabilities that subagents can use.
Third, there are subagent skills: generated Python scripts created during task execution. These are the accumulated capabilities. Unlike a fixed tool library, this layer grows over time.
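The three levels can be pictured as one registry with a level tag on each entry. The sketch below is an assumption about how such a registry could be organized, not the paper's implementation; `SkillLevel`, `Skill`, `SkillRegistry`, and the three registered toy skills are all hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class SkillLevel(Enum):
    META = "meta"          # lifecycle controls: create, run, modify, list, view, finish
    TOOL = "tool"          # primitives: web search, web reading, browser, shell
    SUBAGENT = "subagent"  # generated scripts that accumulate over time


@dataclass
class Skill:
    name: str
    level: SkillLevel
    fn: Callable[..., object]


@dataclass
class SkillRegistry:
    skills: Dict[str, Skill] = field(default_factory=dict)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def names(self, level: SkillLevel) -> List[str]:
        """Sorted names of all skills at the given level."""
        return sorted(s.name for s in self.skills.values() if s.level == level)

    def run(self, name: str, *args, **kwargs):
        return self.skills[name].fn(*args, **kwargs)


registry = SkillRegistry()
# Toy stand-ins; real tool skills would call out to a search API, a browser, etc.
registry.register(Skill("web_search", SkillLevel.TOOL, lambda q: f"results for {q}"))
registry.register(
    Skill("list_subagents", SkillLevel.META, lambda: registry.names(SkillLevel.SUBAGENT))
)
registry.register(Skill("readme_writer", SkillLevel.SUBAGENT, lambda p: f"# README for {p}"))
```

Only the `SUBAGENT` level is expected to grow between tasks; the other two stay fixed, which is what separates this layout from a static tool library.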
The Workspace Manager gives each task an isolated execution directory. That detail matters. Self-modifying code is one of those phrases that can sound futuristic until it deletes the wrong folder. Isolation allows the system to test and modify subagents without immediately corrupting the persistent library. Successful changes can then be promoted from the workspace into the saved skill pool.
So the framework is not merely “an agent that writes code.” It is a pipeline for deciding when code should become a reusable skill, when that skill should be revised, and when it should be exported.
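Workspace isolation followed by promotion can be sketched with nothing beyond the standard library. This is a simplified illustration under stated assumptions: `promote_if_valid` and `compiles` are hypothetical helpers, and the validation step here only checks that the candidate parses as Python, which is far weaker than what a real system would need.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def promote_if_valid(
    candidate_code: str,
    name: str,
    library: Path,
    validate: Callable[[Path], bool],
) -> bool:
    """Write the candidate into a throwaway workspace and copy it into the
    persistent library only if validation passes."""
    with tempfile.TemporaryDirectory() as workspace:
        candidate = Path(workspace) / f"{name}.py"
        candidate.write_text(candidate_code, encoding="utf-8")
        if not validate(candidate):
            return False  # the library is never touched on failure
        library.mkdir(parents=True, exist_ok=True)
        shutil.copy2(candidate, library / candidate.name)
        return True


def compiles(path: Path) -> bool:
    """Toy validation: does the file at least parse as Python?"""
    try:
        compile(path.read_text(encoding="utf-8"), str(path), "exec")
        return True
    except SyntaxError:
        return False
```

A call such as `promote_if_valid(code, "readme_writer", Path("skills"), compiles)` leaves the `skills` directory unchanged whenever the candidate fails the check, which is the property the isolation is there to protect.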
Install is where work becomes inventory
In the Install phase, AgentFactory encounters a task for which no relevant subagent exists. The Meta-Agent analyzes the request, decomposes it into smaller parts, creates specialized Python subagents, runs them, and decides which ones deserve to be saved.
This is where the economics begin.
A one-off automation script is useful once. A reusable subagent is useful whenever a sufficiently similar task appears. The difference is not philosophical. It is inventory accounting. The first time the system solves a task, it pays the setup cost. Later, if the created subagent is general enough, the system can reuse it instead of asking the orchestrating model to re-plan the entire workflow.
The paper’s examples make this concrete. A task may require searching the web, extracting information, creating a chart, manipulating files, or controlling a browser. Each solved component can become a saved capability. Over time, the system is no longer starting from an empty workbench.
This is also where the paper differs from reflection-based self-evolution. A textual reflection may help the next run, but it still requires the model to reinterpret the lesson, write or adapt code, and decide whether the prior lesson applies. Executable subagents skip part of that reconstruction. They do not merely remind the model what worked. They carry forward the mechanism that worked.
Self-Evolve is code repair, not motivational journaling
The Self-Evolve phase begins when a saved subagent is relevant but imperfect.
That distinction is important. AgentFactory is not claiming that every generated subagent is magically robust. It expects brittleness. It expects edge cases. It expects that yesterday’s solution may fail on today’s variant. The system’s contribution is to make that failure actionable: retrieve the relevant subagent, run it, inspect the execution feedback, identify the limitation, modify the code, and validate the new version.
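The run, inspect, modify, validate cycle can be written as a short loop. The sketch below is an assumption about the control flow, not the paper's code: `self_evolve`, `brittle_v1`, and `toy_repair` are hypothetical, and the repair step, which in a real system would be a model editing source code, is replaced by a function that simply swaps in a sturdier version.

```python
from typing import Callable, Tuple

Subagent = Callable[[str], str]


def self_evolve(
    subagent: Subagent,
    task: str,
    validate: Callable[[str], bool],
    repair: Callable[[Subagent, str], Subagent],
    max_rounds: int = 3,
) -> Tuple[Subagent, str, int]:
    """Run, inspect feedback, repair, and re-validate until the result passes.

    Returns the (possibly revised) subagent, its result, and the repair count.
    """
    repairs = 0
    while True:
        try:
            result = subagent(task)
            if validate(result):
                return subagent, result, repairs
            feedback = f"validation failed for result {result!r}"
        except Exception as exc:  # execution feedback from a crash
            feedback = f"{type(exc).__name__}: {exc}"
        if repairs >= max_rounds:
            raise RuntimeError(f"still failing after {repairs} repairs: {feedback}")
        subagent = repair(subagent, feedback)
        repairs += 1


# Toy stand-ins (hypothetical): a brittle first version and a canned "repair".
def brittle_v1(task: str) -> str:
    return task.split("=")[1]  # raises IndexError when "=" is missing


def toy_repair(old: Subagent, feedback: str) -> Subagent:
    return lambda task: task.split("=")[-1].strip()
```

Calling `self_evolve(brittle_v1, "report.csv", bool, toy_repair)` fails once, repairs once, and returns the revised callable along with its result. The bounded `max_rounds` is a design choice worth keeping in any real version: a repair loop without a ceiling is just a slower way to fail.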
The paper’s Figure 2 is best read as an implementation demonstration, not as the main quantitative evidence. It shows a README generation subagent evolving across three runs. In the first run, the path resolution mechanism is hardcoded for a specific project. In the second, the subagent tries LLM-based path parsing but keeps a fragile hardcoded fallback. In the third, the Meta-Agent replaces that with regex-based extraction, making the behavior more robust.
The example is small. Good. Small failures are where automation systems spend most of their lives.
This demonstration supports a narrow but useful claim: AgentFactory can modify generated subagent code in response to observed failure modes. It does not prove universal self-improvement. It does not prove that every modification will be correct. It does show what “learning” looks like when the learned object is executable code rather than a paragraph of advice.
A text-memory system might write, “The previous approach used a fragile fallback; consider regex.” AgentFactory attempts to turn that observation into a revised worker.
That is a practical difference.
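As an illustration of the kind of change described in the README example, the sketch below contrasts a hardcoded path with regex-based extraction. It is not the paper's code: the function names, the hardcoded path, and the pattern are assumptions, and the pattern only handles POSIX-style absolute paths, optionally quoted so that spaces survive.

```python
import re
from typing import Optional

HARDCODED_PROJECT = "/home/user/projects/demo"  # hypothetical value


def resolve_path_v1(request: str) -> str:
    """First-generation behavior: ignores the request entirely."""
    return HARDCODED_PROJECT


# Either a quoted absolute path (may contain spaces) or a bare one (no spaces).
_PATH_RE = re.compile(r"""["'](/[^"']+)["']|(/[\w.\-/]+)""")


def resolve_path_v3(request: str) -> Optional[str]:
    """Later-generation behavior: pull the path out of the request text."""
    match = _PATH_RE.search(request)
    if match is None:
        return None
    path = match.group(1) or match.group(2)
    return path.rstrip("/.")  # drop a trailing slash or sentence-ending period
```

For example, `resolve_path_v3('Write a README for "/data/Q3 report 2024-10-01/src"')` returns the quoted path with its spaces intact, while `resolve_path_v1` returns the same hardcoded directory no matter what was asked.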
Deploy turns local tricks into portable capabilities
The Deploy phase is where the paper becomes more than another agent orchestration design.
Because saved subagents are pure Python code with accompanying SKILL.md documentation, the authors argue that they can be used outside the AgentFactory runtime. A subagent can run as a standalone script, or an external agent framework can be prompted to inspect the documentation and invoke the script when relevant.
The paper’s Figure 3 is a cross-system reuse demonstration. It describes trajectories in which AgentFactory creates and saves subagents such as an Audio Transcriber, a QQ Music Player, and a Document Creator. Later, a different agent system, Claude Code, reads the saved skill documentation and uses the Audio Transcriber and Document Creator to complete a new task without recreating those subagents from scratch.
This demonstration is not a large benchmark. It is better understood as a portability proof-of-concept. It supports the claim that executable subagents can be treated as transferable capabilities, at least when the host environment can run Python and understand the documentation.
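From the host side, reuse might look something like the sketch below: scan a directory for documented skills, hand the documentation to whatever decides relevance, and run the script as a subprocess. The directory layout (one folder per skill containing `SKILL.md` and a `main.py` entry point) and both function names are assumptions made for illustration; the paper's actual export layout may differ.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, Tuple


def discover_skills(root: Path) -> Dict[str, Tuple[str, Path]]:
    """Map skill-folder name -> (documentation text, script path).

    Assumes each skill lives in its own folder with SKILL.md and main.py.
    """
    skills = {}
    for doc in sorted(root.glob("*/SKILL.md")):
        script = doc.parent / "main.py"  # hypothetical entry-point name
        if script.exists():
            skills[doc.parent.name] = (doc.read_text(encoding="utf-8"), script)
    return skills


def invoke(script: Path, *args: str, timeout: float = 60.0) -> str:
    """Run a skill script with the current interpreter and return its stdout."""
    proc = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return proc.stdout.strip()
```

Nothing here depends on the framework that generated the skill, which is the portability claim in miniature. It also shows where the claim stops: the host still needs a compatible interpreter, the script's dependencies, and permission to execute it.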
For business readers, this matters because many AI deployments die in the gap between “the demo worked once” and “the capability is now part of the operating system.” If a useful agent workflow can be packaged as inspectable Python, documented, versioned, and moved across systems, it starts to look less like ephemeral chat output and more like software infrastructure.
Yes, that is less magical. That is also why it might work.
The experiment measures orchestration effort, not total automation cost
The paper’s quantitative evaluation is focused and should be read carefully.
The authors evaluate three methods: ReAct, a text-based self-evolving agent that saves experience summaries, and AgentFactory. They use two batches of 15 tasks each. Batch 1 contains diverse Python and web-oriented tasks: web information retrieval, browser automation, data visualization, audio processing, mini-game generation, travel planning, and similar work. Batch 2 mirrors the structure of Batch 1 but changes the specific requirements. This mirrored design is meant to test transfer: can skills created in the first batch reduce effort in the second?
The reported metric is average output tokens per task from the orchestrating model. Lower token count means the main coordinator had to do less visible work. The paper explicitly excludes subagent-internal LLM consumption, which is a major interpretation boundary. This is not a full cost accounting exercise. It is a measurement of orchestration burden.
Here are the reported average output tokens per task:
| Method | Task setting | Opus 4.6 (avg. output tokens per task) | Sonnet 4.6 (avg. output tokens per task) |
|---|---|---|---|
| ReAct | Batch 1 | 8,298 | 6,893 |
| ReAct | Batch 2 | 7,022 | 7,029 |
| Text-based self-evolving agent | Batch 1, from scratch | 8,608 | 8,163 |
| Text-based self-evolving agent | Batch 2, with saved experience | 6,210 | 8,223 |
| AgentFactory | Batch 1, from scratch | 4,324 | 9,199 |
| AgentFactory | Batch 2, with saved subagents | 2,971 | 3,862 |
The Batch 2 result is the cleanest evidence for the paper’s main claim. With saved subagents, AgentFactory uses 2,971 orchestration tokens per task with Opus 4.6, compared with 7,022 for ReAct and 6,210 for the text-experience baseline. That is about 58% lower than ReAct and about 52% lower than the text-based self-evolving agent.
With Sonnet 4.6, AgentFactory uses 3,862 tokens in Batch 2, compared with 7,029 for ReAct and 8,223 for the text-experience baseline. That is about 45% lower than ReAct and about 53% lower than the text-based self-evolving agent.
So the evidence supports the claim that saved executable subagents can reduce orchestration effort on structurally similar tasks.
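The percentages above follow directly from the Batch 2 rows of the table. A few lines of Python reproduce them; the only inputs are the reported averages, and the dictionary layout and function name are mine.

```python
# Reported Batch 2 averages (orchestrator output tokens per task).
BATCH2 = {
    "Opus 4.6": {"ReAct": 7022, "Text-based": 6210, "AgentFactory": 2971},
    "Sonnet 4.6": {"ReAct": 7029, "Text-based": 8223, "AgentFactory": 3862},
}


def pct_reduction(baseline: int, new: int) -> int:
    """Percentage by which `new` is lower than `baseline`, rounded to an integer."""
    return round(100 * (baseline - new) / baseline)


reductions = {
    (model, base): pct_reduction(row[base], row["AgentFactory"])
    for model, row in BATCH2.items()
    for base in ("ReAct", "Text-based")
}

for (model, base), value in reductions.items():
    print(f"{model} vs {base}: {value}% lower")
# Opus 4.6 vs ReAct: 58% lower
# Opus 4.6 vs Text-based: 52% lower
# Sonnet 4.6 vs ReAct: 45% lower
# Sonnet 4.6 vs Text-based: 53% lower
```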
But the Batch 1 results differ sharply between the two models. With Opus 4.6, AgentFactory already uses far fewer orchestration tokens in Batch 1: 4,324 versus 8,298 for ReAct. The authors interpret this as evidence that a stronger model can begin reusing subagents even within the initial batch, despite limited overlap.
With Sonnet 4.6, however, AgentFactory is worse in Batch 1: 9,199 tokens versus 6,893 for ReAct and 8,163 for the text-experience baseline. That suggests the machinery has overhead. When the model is less effective at opportunistic reuse or the task sequence has not yet produced mature subagents, creating and managing subagents can cost more than solving directly.
This is not fatal to the paper. It is useful. It tells us the framework is not free magic. It has a setup cost, and the payoff appears strongest when reuse becomes real.
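A back-of-envelope calculation makes the setup-cost point tangible for the Sonnet 4.6 runs. It uses only the reported averages and the 15-task batch size; treating the two batches as one combined workload is my simplification, and subagent-internal tokens remain excluded, as in the paper's metric.

```python
TASKS_PER_BATCH = 15

# Reported Sonnet 4.6 averages: (Batch 1, Batch 2) orchestrator output tokens per task.
sonnet = {"ReAct": (6893, 7029), "AgentFactory": (9199, 3862)}

overhead = sonnet["AgentFactory"][0] - sonnet["ReAct"][0]  # extra per task in Batch 1
saving = sonnet["ReAct"][1] - sonnet["AgentFactory"][1]    # saved per task in Batch 2

# Total orchestration tokens across both batches for each method.
totals = {method: TASKS_PER_BATCH * sum(avgs) for method, avgs in sonnet.items()}
net = totals["ReAct"] - totals["AgentFactory"]

print(overhead, saving, net)
# 2306 3167 12915
```

On these numbers, AgentFactory spends about 2,306 more orchestration tokens per task while building its library and recovers about 3,167 per task once it can reuse it. Across all 30 tasks the net difference is 12,915 tokens in AgentFactory's favor, roughly 6% of the ReAct total. One mirrored batch is enough to cross break-even here, but not by a wide margin.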
The evidence table is strongest when it is not overread
The paper includes method design, demonstrations, quantitative evaluation, and appendix task lists. These pieces do different jobs.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Three-phase Install–Self-Evolve–Deploy method | Main mechanism | AgentFactory’s lifecycle for creating, refining, and exporting subagents | That the lifecycle is optimal or safe in all environments |
| Figure 2 README/path refinement example | Implementation demonstration | Subagent code can be modified through execution feedback | General reliability across arbitrary workflows |
| Figure 3 cross-system reuse example | Portability demonstration | Saved Python subagents can be reused by another agent system under the right conditions | Broad interoperability across all enterprise platforms |
| Table 1 token results | Main quantitative evidence | Saved executable subagents reduce orchestration-token burden on mirrored Batch 2 tasks | Total cost reduction, ROI, accuracy superiority, or production readiness |
| Appendix task lists | Evaluation design transparency | Batch 2 mirrors Batch 1 structurally, making transfer plausible to test | Generalization to non-web, non-Python, high-risk operational domains |
This separation matters because the tempting headline is: “Agents can now improve themselves.” That is a bit too convenient. The more accurate headline is: “On Python and web-oriented tasks with reusable structure, an agent framework can preserve successful procedural work as executable subagents and reduce future orchestration effort.”
Less sexy. More useful.
The business value is a capability library, not a cheaper prompt
The most obvious business reading is cost reduction. That is fair, but incomplete.
The deeper business implication is that repeated AI work can become an accumulating capability base. Every recurring workflow becomes a candidate for a reusable subagent. Every failure becomes a candidate for code revision. Every mature subagent becomes something that can be inspected, versioned, tested, and potentially reused across clients or internal teams.
This is especially relevant for business process automation, where the valuable work is often repetitive but not perfectly identical. Reports change slightly. Websites shift layouts. Input files arrive with new column names. Managers ask for the same analysis, except this time grouped by region, or filtered by client type, or exported to a different format because chaos has a calendar invite.
AgentFactory suggests a practical pathway:
| Business layer | What the paper directly shows | Cognaptus inference | Boundary |
|---|---|---|---|
| Workflow automation | Subagents can be generated for procedural tasks and reused later | Repeated client workflows can become reusable automation modules | The paper does not test long-term client deployments |
| Cost control | Orchestration output tokens fall sharply in Batch 2 | Mature skill libraries may reduce repeated planning cost | Subagent-internal LLM cost is excluded |
| Reliability improvement | Subagents can be modified based on execution feedback | Failure logs can drive concrete code hardening | No guarantee that autonomous edits are always correct |
| Governance | Saved subagents are readable Python plus documentation | Review, version control, and audit become more concrete than text memory review | Generated code still needs security controls |
| Platform strategy | Subagents can be exported and reused across another agent system in demonstration | Firms may build internal skill libraries independent of one front-end agent framework | Portability depends on runtime, permissions, dependencies, and host-agent competence |
This is where the article’s title earns its keep. “Learning to write themselves” does not mean agents are becoming autonomous software companies. Please, we have enough LinkedIn poetry already. It means that useful procedural knowledge can be turned into executable modules, then repaired and reused.
The business asset is not the conversation. The asset is the working capability left behind after the conversation.
Executable memory makes governance more concrete, not automatically easy
There is a governance angle here that should not be missed.
Textual memory is difficult to audit. A reflection may be ambiguous, incomplete, or context-dependent. Executable code is also risky, but it is at least inspectable in a familiar way. Teams can read it, test it, diff it, version it, restrict it, and decide whether it should be promoted into production.
AgentFactory’s design leans into that advantage. The subagents are saved as Python code with SKILL.md documentation. The paper also notes safety concerns around autonomous code generation and execution, especially shell commands and browser automation. Its proposed mitigations include checks for destructive shell operations, human-readable saved code, documentation for inspection, and proper authorization for web interactions.
This does not solve enterprise AI governance. It gives governance something more concrete to hold.
For a company deploying agentic automation, that distinction matters. Reviewing an opaque model “memory” is hard. Reviewing a generated script that books meetings, downloads data, or edits spreadsheets is still hard, but at least it resembles ordinary software review. The future of agent governance may look suspiciously like pull requests, access controls, test cases, and boring logs.
Boring logs are underrated. They are where accountability goes to become useful.
The boundary: efficient reuse under mirrored tasks is not production proof
The paper is strongest when interpreted as a mechanism and early evidence for executable skill accumulation. It is weaker if treated as a finished enterprise automation blueprint.
Several boundaries matter.
First, the evaluation tasks are Python and web-oriented. That is a reasonable testbed for agent automation, but not the same as regulated banking workflows, hospital systems, ERP migrations, or cross-department business processes with messy permissions and legal exposure.
Second, Batch 2 mirrors Batch 1 structurally. This is exactly what a transfer evaluation should do, but it also means the results should not be read as proof of broad generalization. AgentFactory performs well when future tasks resemble earlier tasks enough for saved subagents to matter.
Third, the metric excludes subagent-internal LLM consumption. The paper measures orchestration-token efficiency, not full end-to-end compute cost. If subagents themselves call LLMs heavily, total cost may be less dramatic than the coordinator-token table suggests.
Fourth, the paper reports that all tasks completed without runtime errors, but the main quantitative metric is not output quality, business accuracy, security exposure, or maintenance burden over months. Those are the questions a production buyer would ask after the demo, preferably before signing anything expensive.
Fifth, autonomous code modification introduces its own risks. A subagent can become more general, or it can become more creatively wrong. The framework’s workspace isolation and inspectable code help, but production systems would still need approval gates, tests, rollback mechanisms, dependency management, permission boundaries, and monitoring.
None of these limitations invalidate the paper. They locate it. AgentFactory is not a proof that agents can safely run a company. It is evidence that agent systems may become more economically interesting when they preserve working procedures as reusable, modifiable machinery.
That is enough to pay attention.
The next agent stack may remember less and install more
AgentFactory’s most useful idea is not that agents should become more “autonomous.” That word has been stretched so far it now needs ergonomic support.
The useful idea is that agent learning should leave behind artifacts that run.
A saved reflection can advise. A saved subagent can execute. A modified subagent can incorporate a fix. A documented Python module can be inspected by another system. When enough of these pieces accumulate, an agent platform starts to look less like a chat interface and more like a skill production line.
For Cognaptus, the business lesson is straightforward: the next serious phase of AI automation will not be won merely by asking larger models to think harder. It will be won by building systems that remember in executable form, harden what fails, and turn repeated work into reusable capability.
The agent does not need to become a genius every morning.
It needs to stop rebuilding the same ladder.
Cognaptus: Automate the Present, Incubate the Future.
[1] Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, and Zheng Liu, “AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse,” arXiv:2603.18000, 2026, https://arxiv.org/html/2603.18000.