Agents That Build Agents: The ALITA-G Revolution

A good employee does not only finish the task. A good employee leaves behind a better way to do it next time.

Most enterprise AI agents do not. They solve a ticket, answer a question, call a tool, browse a page, generate a report, and then politely forget the operational trick that made the task work. The transcript may be logged. The result may be saved. But the capability itself usually evaporates into the great corporate compost heap of “learnings”. Very nourishing. Not especially executable.

Alita-G, introduced in the paper “Alita-G: Self-Evolving Generative Agent for Agent Generation”, attacks that waste directly.¹ The paper’s central idea is not that agents should think harder. It is that agents should convert successful work into reusable machinery. When a generalist agent solves a task, Alita-G encourages it to externalise useful sub-solutions as Model Context Protocol, or MCP, tools. Those raw tools are then abstracted, cleaned, documented, stored in an MCP Box, and retrieved later when similar tasks appear.

That is the mechanism. And the mechanism matters more than the slogan.

The phrase “self-evolving agent” is easy to overread. It invites visions of recursive model improvement, autonomous retraining, or a small digital creature quietly upgrading itself in the server room. Delightful for pitch decks; less delightful for governance. Alita-G is more restrained and therefore more interesting. It does not retrain the base model. It does not claim open-ended general intelligence. It performs non-parametric capability accumulation: successful task behaviour is distilled into callable tools, and future agents retrieve the most relevant ones.

In business terms, this is not “AI becomes alive”. It is closer to “automation finally stops throwing away its own reusable work”.

The agent’s real memory is not text; it is reusable procedure

The paper begins from a familiar weakness in LLM agent systems. A generalist agent can reason, call tools, browse, decompose a problem, and recover from mistakes. But its improvement is often shallow. Some systems revise prompts. Others retry failed actions. Others store reflections in memory. These help, but they rarely transform a one-time execution into a durable operational capability.

Alita-G changes the unit of learning. The object being accumulated is not merely a note, an embedding, or a summary of a past experience. It is an executable MCP component.

The framework has three main stages:

| Stage | What happens | Why it matters |
| --- | --- | --- |
| Task-driven MCP generation | A master agent repeatedly executes tasks and generates candidate MCPs from successful trajectories | Solving a task becomes an opportunity to create reusable capability |
| MCP abstraction and box construction | Raw MCPs are generalised, stripped of task-specific details, standardised, and documented | One-off scripts become reusable primitives rather than brittle souvenirs |
| RAG-enhanced MCP selection | At inference time, a new task is matched against MCP descriptions and use cases to retrieve relevant tools | The specialist agent receives a focused toolkit instead of drowning in tool clutter |

The shift is subtle but important. A normal agent pipeline treats tools as fixed infrastructure. Alita-G treats tools as accumulated by-products of successful work. The agent does not merely consume its environment; it leaves the environment more capable than before.

This is why the paper’s mechanism-first reading is essential. If we start with the benchmark scores, Alita-G looks like another “our agent beats your agent” paper, a genre now so crowded it should probably have zoning regulations. The more useful interpretation is that Alita-G proposes a repeatable loop for turning task execution into agent infrastructure.

The abstraction step is where the trick becomes reusable

Raw tool generation is not enough. A tool created during one task may contain hard-coded values, assumptions about the original question, brittle formatting, or undocumented interfaces. That is not capability. That is a script with abandonment issues.

Alita-G therefore adds an abstraction step. The paper describes four transformations:

- hard-coded values are replaced with configurable parameters;
- task-specific references are removed;
- interfaces are standardised for FastMCP compatibility;
- docstrings and type annotations are improved.

This is the part enterprise readers should not skim. It is also the part that makes the “self-evolution” claim less mystical and more operational.

In ordinary software terms, Alita-G is doing automated refactoring after successful execution. The agent produces an ad hoc solution, then another model converts it into a generalised callable primitive. The resulting MCP is stored with metadata: a functional description and the use case that triggered its creation. Later, retrieval uses both to decide whether the tool belongs in the current task context.
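To make the refactoring concrete, here is a minimal sketch of what a raw tool and its abstracted counterpart might look like. Everything in it is illustrative: the function names, the regular expression, the `McpRecord` class, and the way `retrieval_text` joins the two metadata fields are assumptions for this example, not code or formats taken from the paper. The functions are plain Python; registering them with an MCP server is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Callable


# Raw tool, as an agent might write it mid-task: the unit is hard-coded
# and there is no documentation. (Hypothetical example.)
def raw_extract_temperature(text):
    return [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*K\b", text)]


# Abstracted version: the unit becomes a parameter, the task reference is
# gone, and the interface carries type annotations and a docstring.
def extract_measurements(text: str, unit: str) -> list[float]:
    """Return every numeric value in `text` that is followed by `unit`.

    Args:
        text: Free text, for example a paragraph extracted from a PDF.
        unit: Unit symbol to match literally, such as "K", "kg" or "mL".
    """
    pattern = rf"(\d+(?:\.\d+)?)\s*{re.escape(unit)}\b"
    return [float(m) for m in re.findall(pattern, text)]


@dataclass
class McpRecord:
    """One MCP Box entry: the callable plus the two metadata fields."""
    name: str
    description: str              # general statement of what the tool does
    use_case: str                 # the task that triggered its creation
    func: Callable[..., object]

    def retrieval_text(self) -> str:
        # Assumption: description and use case are simply concatenated
        # before embedding; the paper's exact format may differ.
        return f"{self.description}\nUse case: {self.use_case}"


record = McpRecord(
    name="extract_measurements",
    description="Extract numeric values followed by a given unit from text.",
    use_case="Reading reported temperatures from a scientific PDF.",
    func=extract_measurements,
)

print(record.func("Heated to 300 K, then cooled to 77.5 K.", unit="K"))
print(record.retrieval_text())
```

The raw function only ever answers the question that produced it. The abstracted one answers a family of questions, and its record carries enough text for a retriever to find it again.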

That dual metadata design turns out to matter. In the paper’s GAIA ablation, using both description and use case for retrieval performs best, reaching 83.03% average accuracy. Description alone is close at 81.82%, while use case alone falls to 77.57%. The interpretation is straightforward: descriptions provide the general semantic signal; use cases add situational grounding. Use cases by themselves are too tied to the original task.

For enterprise use, this is a neat warning. A reusable agent capability should not be indexed only by the incident that created it. “Used during customer refund dispute 48291” is not the same as “extract invoice mismatch fields and reconcile against payment status”. One is a memory. The other is a capability.

Retrieval prevents the toolbox from becoming a junk drawer

Generated tools create a new problem: tool sprawl.

Anyone who has watched enterprise automation mature knows the pattern. First there are no tools. Then there are a few useful tools. Then there are seventeen versions of “final_invoice_parser_v3_really_final.py”, and everyone pretends this is a platform.

Alita-G addresses this with RAG-enhanced MCP selection. For each new query, the system embeds the task and compares it with MCP representations built from description plus use case. It can then use either threshold-based selection or top-k selection.

The distinction is operationally important:

| Selection method | Behaviour | Practical trade-off |
| --- | --- | --- |
| Threshold-based | Selects all MCPs above a similarity threshold | Adapts the number of tools to task complexity, but requires threshold tuning |
| Top-k | Always selects a fixed number of the most similar MCPs | Predictable cost, but can include irrelevant tools or exclude useful extras |

The paper’s sensitivity test favours threshold-based selection. On a 25-question GAIA validation subset, a threshold of 0.70 gives the best reported accuracy at 84.0%. Lower thresholds include irrelevant MCPs; higher thresholds exclude useful ones. Top-k selection is generally weaker in that test, with top-2 and top-3 reaching 80.0%, and larger values declining.

This is not a universal law of retrieval. It is a useful implementation signal. If tasks vary in complexity, fixed tool budgets can be too rigid. Some questions need one specialised primitive. Others need several. A threshold gives the agent a chance to scale its working toolkit with the task, provided the embedding signal is good enough.
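The difference between the two selection rules is easy to see in a few lines. The sketch below is a toy, not the paper's implementation: the three-dimensional vectors stand in for real embeddings, the tool names are invented, and treating "above the threshold" as `>=` is an assumption. The default of 0.70 simply mirrors the best value in the paper's sensitivity test.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: list[float], box: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every MCP in the box against the query, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in box.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def select_by_threshold(query, box, tau: float = 0.70) -> list[str]:
    """Keep every MCP whose similarity is at least tau (count varies)."""
    return [name for name, score in rank(query, box) if score >= tau]


def select_top_k(query, box, k: int = 3) -> list[str]:
    """Keep the k most similar MCPs regardless of score (count is fixed)."""
    return [name for name, _ in rank(query, box)[:k]]


# Toy "embeddings"; real ones would come from an embedding encoder.
box = {
    "extract_measurements": [0.9, 0.1, 0.0],
    "convert_units": [0.7, 0.6, 0.1],
    "parse_invoice": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

print(select_by_threshold(query, box))  # two relevant tools
print(select_top_k(query, box, k=3))    # all three, including the unrelated one
```

With this toy box, the threshold rule hands over two tools and leaves the invoice parser in the drawer, while top-3 includes it because three was the number it was told to fetch.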

That last clause is doing work, and the embedding encoder analysis shows how much. On the same 25-question subset, OpenAI’s text-embedding-3-large reaches 84.0%, text-embedding-3-small reaches 80.0%, Qwen3-Embedding-8B reaches 76.0%, and NV-Embed-v2 and BGE-M3 reach 72.0%. The paper uses this as evidence that retrieval quality materially affects downstream agent performance.

So the business lesson is not “just add a vector database”. That sentence has caused enough architectural vandalism already. The lesson is that reusable capability only compounds if retrieval can select the right capability at the right moment.

The main results show performance and cost moving in the right direction

The core empirical evidence comes from three benchmarks: GAIA, PathVQA, and Humanity’s Last Exam (HLE). GAIA is evaluated on the full validation set of 165 questions. PathVQA and HLE are evaluated on randomly sampled 100-example subsets due to resource constraints.

The paper compares Alita-G against several baselines, including Octotools, ODR-smolagents, and the original master agent system without the specialised MCP Box. The strongest comparison is with the original agent system because that isolates the value of the MCP Box and retrieval mechanism around the same underlying agent architecture.

On GAIA, the original agent system achieves 75.15% pass@1 accuracy with an average of 12,305 tokens. Alita-G with three MCP-generation passes reaches 83.03% pass@1 with 10,394 average tokens. That is a gain of 7.88 percentage points while reducing average token consumption by about 15.5%.

The pass@3 result also improves: the original system reaches 87.27%, while Alita-G 3× reaches 89.09%. The gain is smaller because pass@3 already gives the baseline three chances. This is exactly what one would expect: reusable tools help most when the system has fewer attempts and less room to recover through repetition.

The same pattern appears in the sampled PathVQA and HLE evaluations. On PathVQA, the original pass@1 result is 52%, while Alita-G 3× reaches 60%. On HLE, the original pass@1 result is 24%, while Alita-G 3× reaches 33%. The paper also reports token reductions in these settings: PathVQA drops from 12,542 average tokens to 10,574, and HLE drops from 14,730 to 11,956 under Alita-G 3× pass@1.
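The arithmetic behind these comparisons is simple enough to check by hand, but a short script keeps the two kinds of change apart: accuracy gains are differences in percentage points, while token savings are relative reductions. The numbers are the pass@1 figures quoted above; the helper names are invented for this sketch.

```python
def pct_point_gain(before: float, after: float) -> float:
    """Absolute accuracy change in percentage points."""
    return round(after - before, 2)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in average tokens, as a percentage of the baseline."""
    return round(100 * (before - after) / before, 1)


# (baseline, Alita-G 3x) pairs for pass@1 accuracy and average tokens.
results = {
    "GAIA": {"acc": (75.15, 83.03), "tokens": (12305, 10394)},
    "PathVQA": {"acc": (52.0, 60.0), "tokens": (12542, 10574)},
    "HLE": {"acc": (24.0, 33.0), "tokens": (14730, 11956)},
}

summary = {
    name: (pct_point_gain(*r["acc"]), relative_reduction(*r["tokens"]))
    for name, r in results.items()
}

for name, (gain, saving) in summary.items():
    print(f"{name}: +{gain} points, {saving}% fewer tokens")
```

On these figures the token saving is roughly 15.5% on GAIA, 15.7% on PathVQA, and 18.8% on HLE.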

The result is not just “better accuracy”. It is better accuracy with less generated text. That combination matters commercially because agent systems are not priced only by cleverness. They are priced by latency, token usage, orchestration overhead, debugging cost, and how many times the agent wanders around the tool shed before finding the wrench.

The ablations explain why the result is not accidental

The paper’s analysis section is especially useful because it separates the main evidence from the machinery that supports it. These tests should not be treated as independent proof of a second thesis. They are mostly ablations, sensitivity checks, and mechanism probes.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main benchmark comparison across GAIA, PathVQA, and HLE | Main evidence | Alita-G specialists outperform the original generalist while using fewer tokens | Broad production readiness across real enterprise domains |
| Description vs use case vs combined retrieval | Ablation | Tool metadata design affects retrieval quality; description plus use case works best | That this metadata scheme is optimal in all domains |
| MCP Box generation iterations | Scalability and sensitivity test | More MCP-generation passes improve coverage until redundancy accumulates | That three passes is universally optimal |
| Threshold vs top-k retrieval | Retrieval sensitivity test | Adaptive thresholding can outperform fixed top-k selection | That thresholding always beats top-k in other retrieval systems |
| Embedding encoder comparison | Implementation sensitivity test | Better semantic retrieval improves agent performance | That one encoder will remain best as models change |
| MCP behaviour analysis | Mechanism validation | Improved answers use MCPs more heavily and regressions are rare | That MCP usage is always causally sufficient for correctness |
| Case study figures | Illustrative extension | Shows how an abstracted MCP can flip a failed answer into a correct one | Statistical generalisation by itself |

The scalability result deserves particular attention. With one MCP-generation pass, Alita-G reaches 80.00% average GAIA accuracy and curates 26 MCPs. With two passes, it reaches 81.82% and 46 MCPs. With three passes, it reaches 83.03% and 74 MCPs. Four passes stay flat at 83.03% with 102 MCPs. Five passes nudge to 83.63% with 128 MCPs.

The paper interprets this as diminishing returns. The similarity and clustering statistics support that reading: as the MCP Box grows, the number of independent clusters rises more slowly than the number of total MCPs. In plain English, later passes increasingly generate near-duplicates or narrow variants. The system keeps learning, but less efficiently.
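One rough way to see the diminishing returns is to ask how much accuracy each additional MCP buys from one pass to the next. The sketch below uses only the pass counts, accuracies, and box sizes quoted above; "accuracy points per added MCP" is a back-of-envelope measure chosen for this illustration, not a metric from the paper.

```python
# (generation passes, GAIA accuracy in %, MCPs in the box)
passes = [
    (1, 80.00, 26),
    (2, 81.82, 46),
    (3, 83.03, 74),
    (4, 83.03, 102),
    (5, 83.63, 128),
]


def marginal_gain_per_mcp(rows):
    """Accuracy points gained per MCP added, for each step to the next pass."""
    gains = []
    for (_, prev_acc, prev_n), (n_pass, acc, n) in zip(rows, rows[1:]):
        gains.append((n_pass, round((acc - prev_acc) / (n - prev_n), 3)))
    return gains


for n_pass, gain in marginal_gain_per_mcp(passes):
    print(f"pass {n_pass}: {gain} points per added MCP")
```

The second pass buys about 0.09 points per new tool, the third about 0.04, the fourth nothing, and the fifth about 0.02. The curve is not perfectly smooth, but the direction is clear: the box grows faster than the accuracy does.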

For enterprises, this is the beginning of an MCP governance problem. Capability accumulation sounds excellent until the repository becomes redundant, inconsistent, or full of overlapping tools. The paper’s answer is partial: preserve diversity, use abstraction, measure similarity, and rely on retrieval. That is promising. It is not yet a complete lifecycle policy for enterprise tool libraries.

The behaviour analysis shows the MCP Box is actually being used

One common weakness in agent papers is that the proposed mechanism is plausible but not directly connected to observed behaviour. Alita-G includes a useful behavioural analysis on GAIA to address this.

As MCP Boxes mature from one to three generation rounds, average MCP calls per question rise from 1.9 to 2.4. For improved questions (cases where the baseline was wrong but the MCP-equipped agent becomes correct), the average is higher, rising from 2.7 to 3.4. This suggests that the MCP Box is not merely decorative. The improved cases are precisely the cases where tool usage becomes more active.

The correctness flips are also informative. With one generation, 9 baseline-wrong questions become correct, while 1 baseline-correct question becomes wrong. With two generations, the pattern is 12 wrong-to-right and 1 right-to-wrong. With three generations, it is 13 wrong-to-right and 0 right-to-wrong.
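These flip counts can be cross-checked against the accuracy figures quoted elsewhere in this article. The sketch assumes that the one, two, and three "generations" here correspond to the one-, two-, and three-pass GAIA accuracies (80.00%, 81.82%, 83.03%) and that the baseline is the original system's 75.15%. If the numbers describe the same runs, each net flip should be worth the same number of percentage points in every round.

```python
BASELINE_ACC = 75.15  # original agent system, GAIA pass@1 (%)

# generation round -> (wrong-to-right flips, right-to-wrong flips, accuracy %)
rounds = {
    1: (9, 1, 80.00),
    2: (12, 1, 81.82),
    3: (13, 0, 83.03),
}

net_flips = {r: w2r - r2w for r, (w2r, r2w, _) in rounds.items()}
points_per_flip = {
    r: round((acc - BASELINE_ACC) / net_flips[r], 2)
    for r, (_, _, acc) in rounds.items()
}

print(net_flips)        # net questions gained in each round
print(points_per_flip)  # accuracy points per net flip
```

The ratio comes out at about 0.61 points per net flip in all three rounds, which is what one would expect if the flips and the accuracies were measured on the same question set.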

That does not prove MCP usage is the sole cause of improvement in every case. Agent execution is stochastic, and the paper itself attributes the rare regressions to reasoning errors rather than faulty MCP integration. But the pattern is consistent with the mechanism: a richer MCP Box creates more useful intervention opportunities, and those interventions are concentrated on harder questions.

The case study makes the mechanism concrete. A raw MCP created during a task involving measurement extraction from scientific PDFs is abstracted into a reusable tool. Later, in a thermodynamics question, the baseline agent fails, while the MCP-equipped agent retrieves the abstracted measurement-extraction tool and answers correctly. The example is not the proof; it is the microscope slide. It shows what the aggregate numbers are probably made of.

The business value is capability reuse, not agent mythology

The practical interpretation of Alita-G is strongest when stripped of mythology. This is not a production recipe for autonomous enterprise self-improvement. It is a research demonstration of a useful pattern:

Convert repeated successful workflows into reusable tool primitives, index them with operational context, retrieve them selectively, and let specialised agents inherit the resulting capability library.

That pattern maps well onto enterprise environments because much business work is repetitive without being identical. Claims handling, compliance review, procurement analysis, contract redlining, finance reconciliation, sales operations, and technical support all contain subroutines that recur across cases. Human teams already recognise these subroutines and gradually turn them into templates, macros, checklists, scripts, and standard operating procedures.

Alita-G suggests how agent systems might automate part of that conversion.

| Paper mechanism | Enterprise analogue | Business interpretation |
| --- | --- | --- |
| Successful trajectories generate raw MCPs | Analysts discover repeatable workarounds during real tasks | Execution becomes a source of process discovery |
| Abstraction removes task-specific details | Engineers refactor scripts into reusable services | One-off automation becomes maintainable capability |
| MCP Box stores curated tools | Internal library of approved operational components | Agent competence becomes inspectable infrastructure |
| RAG selects relevant MCPs | Staff choose the right SOP, template, or system tool | Context-aware reuse reduces wasted exploration |
| Token reduction accompanies accuracy gains | Less wandering, fewer retries, faster completion | Lower inference cost and potentially better latency |

The inferred business value is therefore not just cheaper inference. It is organisational memory in executable form.

That phrase needs discipline. A normal knowledge base stores facts and documents. An MCP Box stores procedures. A policy document can tell an agent how invoice reconciliation should work. An MCP can actually perform the reconciliation step, provided the interface, permissions, and data access are correctly governed. This is where the paper’s idea becomes commercially interesting: the reusable artefact is closer to software than to memory.

Naturally, this is also where risk enters. Executable organisational memory is more powerful than textual memory. It can also be wrong faster.

The uncertainty boundary is narrow but useful

The paper’s results are strong, but their interpretation has boundaries.

First, the evaluations are benchmark-based. GAIA is a full validation evaluation, but PathVQA and HLE use 100-example samples. That does not invalidate the results. It does mean the cross-domain generalisation claim should be read as suggestive, not exhaustive.

Second, Alita-G relies on strong underlying agents. The Manager Agent uses Claude-Sonnet-4, and the Web Agent uses GPT-4.1. The framework is not showing that weak models become brilliant because they collect tools. It is showing that a strong generalist can become a better domain specialist when it accumulates and retrieves reusable MCPs.

Third, retrieval quality matters. The embedding encoder ablation makes this explicit. If the retrieval layer selects the wrong tools, the specialised agent may receive irrelevant context or miss the needed primitive. In production, this becomes a monitoring problem: retrieval precision, tool-call success, regression rates, and repository redundancy all need measurement.
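As a rough illustration of what that measurement could look like, the sketch below computes three of those signals from per-task logs. The `TaskLog` fields and metric definitions are assumptions made for this example; in particular, "share of retrieved tools that were actually called" is only a proxy for retrieval precision, since an unused tool is not necessarily an irrelevant one. Repository redundancy is left out because it needs the tools themselves, not the logs.

```python
from dataclasses import dataclass


@dataclass
class TaskLog:
    """Hypothetical per-task record; field names are illustrative."""
    retrieved: set[str]      # MCPs handed to the agent for this task
    called: set[str]         # MCPs the agent actually invoked
    failed_calls: int        # tool calls that raised or returned an error
    total_calls: int         # all tool calls made during the task
    baseline_correct: bool   # did the agent without the MCP Box get it right?
    correct: bool            # did the MCP-equipped agent get it right?


def monitor(logs: list[TaskLog]) -> dict[str, float]:
    """Aggregate simple health metrics over a batch of task logs."""
    retrieved = sum(len(log.retrieved) for log in logs)
    used = sum(len(log.retrieved & log.called) for log in logs)
    calls = sum(log.total_calls for log in logs)
    failures = sum(log.failed_calls for log in logs)
    regressions = sum(log.baseline_correct and not log.correct for log in logs)
    return {
        # Proxy for retrieval precision: retrieved tools that were used.
        "retrieved_tool_usage": used / retrieved if retrieved else 0.0,
        "tool_call_success": 1 - failures / calls if calls else 0.0,
        # Right-to-wrong flips relative to the baseline agent.
        "regression_rate": regressions / len(logs) if logs else 0.0,
    }


logs = [
    TaskLog({"a", "b"}, {"a"}, 0, 2, True, True),
    TaskLog({"c"}, {"c"}, 1, 2, True, False),
]
metrics = monitor(logs)
print(metrics)
```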

Fourth, MCP governance is underdeveloped relative to enterprise needs. The paper abstracts, standardises, and documents MCPs, but production systems would also need permission controls, security review, versioning, deprecation, provenance tracking, audit logs, testing suites, and rollback mechanisms. “The agent generated a useful tool” is not the same as “the organisation should let it run against customer data”. One hopes this distinction is obvious. Experience suggests otherwise.

Finally, the method is domain-specialisation, not broad general automation. Alita-G works by harvesting tools from a target task distribution. It is strongest where future tasks resemble past successful trajectories enough for retrieval and reuse to help. If the domain shifts sharply, the MCP Box may become less useful or actively distracting.

The real revolution is boring, which is why it may matter

The old version of the agent story was: give the model tools, memory, and a goal, then let it act. Alita-G adds a more mature question: after the agent acts successfully, what infrastructure remains?

That question is more important than it sounds. Enterprises do not only need agents that complete tasks. They need systems that accumulate operational competence without turning every improvement into a bespoke engineering project. Alita-G is one research answer: let the agent generate tools from successful executions, abstract them into reusable MCPs, store them in a curated box, and retrieve them when context demands.

The paper’s results support that mechanism. On GAIA, Alita-G improves pass@1 accuracy from 75.15% to 83.03% while reducing average token use by about 15.5%. On sampled PathVQA and HLE tasks, it also improves accuracy and lowers token consumption. The ablations show that metadata design, retrieval strategy, embedding quality, and MCP Box size all matter. The behaviour analysis suggests the tools are being used where improvement actually occurs.

The broader lesson is not that agents have become self-improving in the grand philosophical sense. They have not. The useful lesson is narrower and more practical: agent systems can turn solved work into reusable capability, and that capability can be retrieved rather than rediscovered.

That is less cinematic than recursive intelligence. It is also closer to how real organisations improve.

Every business already knows the value of a good employee who finishes the job and leaves behind a better process. Alita-G asks what happens when agents are designed to do the same. Not merely to answer. Not merely to act. To leave tools behind.

The future of enterprise agents may not be a single omniscient model. It may be a growing library of well-indexed, well-governed, machine-generated procedures.

Less oracle. More workshop.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiahao Qiu et al., “Alita-G: Self-Evolving Generative Agent for Agent Generation,” arXiv:2510.23601, 2025. https://arxiv.org/abs/2510.23601
