Agent teams are easy to draw and hard to run.

On a slide, the architecture looks comforting: a planner, a researcher, a coder, a reviewer, perhaps a compliance agent standing in the corner with a clipboard. Everyone has a role. Everyone collaborates. The diagram is tidy, which is usually the first warning sign.

The problem is that real tasks do not politely arrive in the shape of an org chart. A customer-support escalation may require database inspection, contract interpretation, account-history summarization, and a carefully worded response. A software-maintenance task may require reading an issue, locating the relevant file, reproducing a failure, patching the code, and testing the fix. A research task may require search, document extraction, image inspection, code execution, and a final answer. The subtask changes every few minutes. The correct helper is not always “the research agent” or “the coding agent.” Sometimes it is “the temporary executor that should see only these three facts, use only these tools, and run on a model strong enough for this step but not absurdly expensive.”

That is the useful shift in “AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration” [1]. The paper is not merely another entry in the “more agents, more intelligence” parade, a parade already crowded enough to need traffic control. Its more interesting claim is architectural: sub-agents should not be treated as fixed roles. They should be treated as recipes created at runtime.

The recipe has four ingredients:

| Ingredient | Operational meaning | Why it matters |
| --- | --- | --- |
| Instruction | What this sub-agent must accomplish | Prevents vague delegation |
| Context | The task-relevant working memory it should see | Avoids both amnesia and context pollution |
| Tools | The actions it is allowed to take | Reduces unnecessary capability exposure |
| Model | The underlying LLM used for execution | Enables cost-performance routing |

This is the paper’s central mechanism. A sub-agent is not “the analyst.” It is an instantiation of (Instruction, Context, Tools, Model). The orchestrator’s job is to repeatedly create these instantiations, delegate execution, read the returned result, and decide whether to finish or create another executor.

That sounds almost too simple. It is not. The simplicity is the point.

The important design move is separating orchestration from execution

Most agent systems blur two jobs. One component plans the task and also performs environment actions. It searches, edits files, runs commands, reads results, reasons about next steps, and eventually submits an answer. This can work for short tasks. For long-horizon tasks, it often becomes a messy inner monologue with tool calls attached.

AOrchestra separates the two responsibilities. The orchestrator does not directly act in the environment. It has only two system-level actions: delegate a subtask, or finish. When it delegates, it constructs the four-tuple: instruction, context, tools, and model. The sub-agent then executes in the environment and returns a structured observation.

That separation is more than software hygiene. It changes what can be optimized.

If the orchestrator is also the executor, improvement means making one agent better at everything. If the orchestrator only delegates, improvement can focus on a narrower skill: deciding what work should be done next, what information should be passed forward, which tools should be exposed, and which model should be used. In business terms, this is the difference between hiring a mythical employee who can do everything and building a dispatcher that knows when to call accounting, legal, engineering, or a cheap script.

The paper’s formalism supports this view. It frames complex tasks as multi-step interactions with an environment and treats delegation as a structured action. The orchestrator’s policy chooses between delegating and finishing. The executor’s capabilities are determined by the delegated tuple. This makes orchestration itself a learnable policy, rather than an accidental side effect of prompt writing.

That is the part worth noticing. AOrchestra is not just a wrapper around tool use. It is an attempt to make delegation explicit enough that it can be evaluated, trained, and cost-optimized.
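The delegate-or-finish loop described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class names `Delegate` and `Finish`, the `orchestrate` function, the step budget, and the toy policy and executor are all assumptions made for the example. In a real system the policy and the executor would each call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class Delegate:
    """One delegation: the (Instruction, Context, Tools, Model) four-tuple."""
    instruction: str
    context: str
    tools: Tuple[str, ...]
    model: str


@dataclass(frozen=True)
class Finish:
    """The orchestrator's only other action: stop and return an answer."""
    answer: str


Action = Union[Delegate, Finish]
History = List[Tuple[Delegate, str]]  # (delegation, returned observation)


def orchestrate(
    task: str,
    policy: Callable[[str, History], Action],
    execute: Callable[[Delegate], str],
    max_steps: int = 8,  # hypothetical budget, not a value from the paper
) -> str:
    """Ask the policy for an action, run delegations, stop on Finish."""
    history: History = []
    for _ in range(max_steps):
        action = policy(task, history)
        if isinstance(action, Finish):
            return action.answer
        history.append((action, execute(action)))
    raise RuntimeError("step budget exhausted before the orchestrator finished")


# Toy stand-ins (hypothetical): real versions would call language models.
def toy_policy(task: str, history: History) -> Action:
    if not history:
        return Delegate(
            instruction=f"Find the failing test for: {task}",
            context="Repo uses pytest; the issue mentions date parsing.",
            tools=("read_file", "run_tests"),
            model="cheap-model",
        )
    return Finish(answer=history[-1][1])


def toy_execute(recipe: Delegate) -> str:
    return f"{recipe.model} used {', '.join(recipe.tools)}: done"


result = orchestrate("fix date bug", toy_policy, toy_execute)
```

Note that the orchestrator never touches the environment itself. Everything it can do is expressed as one of two return types, which is what makes the policy something one can log, compare, and train.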

Fixed-role agents solve yesterday’s decomposition problem

The paper contrasts AOrchestra with two common sub-agent patterns.

The first pattern treats sub-agents as context-isolated threads. This helps avoid context rot: the long conversation does not drown the worker in irrelevant history. But isolation alone does not create specialization. A blank helper with a narrow context may still lack the right tools or task framing.

The second pattern uses static specialist roles. A system might have prebuilt agents for coding, research, testing, or documentation. This looks mature because it resembles a company org chart. It also inherits the usual org-chart problem: real work leaks across boxes. A fixed set of roles can be too rigid for open-ended environments, and designing those roles manually becomes another engineering burden.

AOrchestra’s alternative is runtime specialization. The orchestrator does not choose from a small list of permanent staff members. It composes a temporary worker for the current subtask.

Here is the core correction for business readers:

| Reader belief | Better interpretation | Practical consequence |
| --- | --- | --- |
| “Multi-agent systems mean more named agents.” | The useful unit is a delegated execution recipe. | Count fewer agents; define better delegation parameters. |
| “Context isolation is enough.” | The worker needs the right context, not merely less context. | Build context selection and compression into the workflow. |
| “A stronger model should run everything.” | Different subtasks may justify different model costs. | Route difficult steps to stronger models and routine steps to cheaper ones. |
| “Tool access should be broad so the agent has options.” | Tool access should match the delegated subtask. | Reduce tool clutter and operational risk. |

This is why the “recipe, not role” framing matters. Roles are persistent identities. Recipes are composed for use.

The benchmark results support the mechanism, but they should not be read as a deployment certificate

The main experiments evaluate AOrchestra on three agentic benchmarks: GAIA, Terminal-Bench 2.0, and SWE-Bench-Verified. These are not toy multiple-choice tests. They involve multi-step tool use, terminal operations, coding environments, file handling, and real-world-style workflows.

In the training-free setting, Table 1 compares AOrchestra with ReAct, OpenHands, Mini-SWE-Agent, and Claude Code under different model backbones. With Gemini-3-Flash, AOrchestra reports:

| Benchmark | AOrchestra Pass@1 | Strongest same-model baseline in table | Interpretation |
| --- | --- | --- | --- |
| GAIA | 80.00 | 66.06 (OpenHands) | Strong gain on general tool-using tasks |
| Terminal-Bench 2.0 | 52.86 | 34.29 (Mini-SWE-Agent) | Strong gain on terminal workflows |
| SWE-Bench-Verified | 82.00 | 64.00 (ReAct) | Strong result on software-engineering tasks |
| Average Pass@1 | 71.62 | 49.49 (Mini-SWE-Agent average) | Broad improvement in the reported setting |

The magnitude is meaningful, especially because the benchmarks cover different environments. GAIA stresses information seeking, file processing, multimodal operations, and multi-hop reasoning. Terminal-Bench stresses command-line workflow execution. SWE-Bench-Verified stresses real repository bug fixing and test-passing patches.
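The arithmetic behind the table can be checked directly. The short script below recomputes the average and the per-benchmark gaps from the numbers quoted above. One detail is worth seeing explicitly: the 49.49 in the average row is a single framework's average, whereas the mean of the per-benchmark best baselines is higher. The second comparison is an extra calculation added here for perspective, not a figure reported in the paper.

```python
# Pass@1 values as quoted in the table above (Gemini-3-Flash setting).
aorchestra = {"GAIA": 80.00, "Terminal-Bench 2.0": 52.86, "SWE-Bench-Verified": 82.00}
best_baseline = {"GAIA": 66.06, "Terminal-Bench 2.0": 34.29, "SWE-Bench-Verified": 64.00}
reported_baseline_average = 49.49  # Mini-SWE-Agent's average across the three

# Unweighted mean across the three benchmarks.
average = round(sum(aorchestra.values()) / len(aorchestra), 2)

# Gap, in percentage points, to the strongest same-model baseline per benchmark.
gains = {name: round(aorchestra[name] - best_baseline[name], 2) for name in aorchestra}

# Two ways to read the average gap:
gap_vs_reported = round(average - reported_baseline_average, 2)
mean_of_best = round(sum(best_baseline.values()) / len(best_baseline), 2)
gap_vs_mean_of_best = round(average - mean_of_best, 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"average {average:.2f}; gap {gap_vs_reported:.2f} vs best single framework, "
      f"{gap_vs_mean_of_best:.2f} vs mean of per-benchmark bests")
```

Either way the gap is large, but the stricter comparator narrows it by several points.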

Still, this evidence should be read carefully. The paper’s benchmark results show that AOrchestra performs strongly under the authors’ experimental setup. They do not show that every enterprise workflow will benefit by the same amount. Enterprise tasks have messy permissions, unreliable internal documentation, latency constraints, compliance boundaries, and unhappy humans who do not grade with Pass@1. A minor inconvenience, apparently.

The more useful reading is not “AOrchestra wins benchmarks, therefore buy orchestration.” The better reading is: across diverse long-horizon tasks, dynamic delegation appears to outperform several fixed or less adaptive agent frameworks. That supports the mechanism-first argument.

The context ablation is small, but it explains the whole paper

The most revealing experiment is not the biggest benchmark table. It is the context-control ablation.

The authors isolate the effect of the Context field by comparing three ways of invoking sub-agents on a 50-sample subset of GAIA:

| Setting | Level 1 | Level 2 | Level 3 | Average |
| --- | --- | --- | --- | --- |
| No-Context | 89.47 | 81.48 | 75.00 | 86.00 |
| Full-Context | 94.74 | 77.78 | 75.00 | 84.00 |
| AOrchestra curated context | 100.00 | 88.89 | 75.00 | 96.00 |

This is an ablation, not the main benchmark claim. Its purpose is to test whether curated context actually matters when instruction, tools, and model are otherwise held constant.

The result is exactly the kind of result enterprise teams should care about. No-context delegation loses critical traces from prior work. Full-context delegation gives the sub-agent everything, including irrelevant details. Curated context does better because the orchestrator passes enough history to continue intelligently without burying the executor under debris.

This is painfully familiar in business automation. Teams often treat retrieval or memory as a dump truck: if the system might need a document, include it; if it might need a prior chat, include that too; if it might need a policy PDF from 2021, why not toss it in? The model then becomes a very expensive intern sitting under a pile of folders.

AOrchestra’s result supports a more disciplined pattern: context should be selected, compressed, and scoped to the subtask. Not too little. Not everything. The right slice.

That principle may matter more than the particular framework name.
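To make “selected and scoped” concrete, here is a deliberately simple sketch. It assumes working memory is stored as tagged notes and picks the ones relevant to a subtask under a size budget. The tag-overlap heuristic, the `Note` type, and the `max_chars` budget are all illustrative assumptions; in AOrchestra the orchestrator model writes the context itself rather than filtering by tags.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Note:
    """One entry of working memory (hypothetical structure)."""
    text: str
    tags: FrozenSet[str]


def curate_context(notes: List[Note], subtask_tags: FrozenSet[str], max_chars: int = 400) -> str:
    """Keep notes sharing a tag with the subtask, preferring recent ones,
    and stop adding once the character budget would be exceeded."""
    selected: List[str] = []
    used = 0
    for note in reversed(notes):  # newest first
        if not (note.tags & subtask_tags):
            continue  # irrelevant to this subtask: leave it out
        if used + len(note.text) > max_chars:
            continue  # would blow the budget: skip it
        selected.append(note.text)
        used += len(note.text)
    return "\n".join(reversed(selected))  # restore chronological order


notes = [
    Note("Failing test: test_parse_date in tests/test_dates.py", frozenset({"bugfix", "tests"})),
    Note("User prefers replies in French", frozenset({"style"})),
    Note("Suspect file: utils/dates.py", frozenset({"bugfix"})),
]

no_context = ""                                       # amnesia
full_context = "\n".join(n.text for n in notes)       # everything, relevant or not
curated = curate_context(notes, frozenset({"bugfix"}))
```

Here `curated` contains the two bug-related notes and omits the style preference, which is the middle path between the empty string and the full dump.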

The learnable orchestrator turns delegation into a skill

A second contribution is that AOrchestra makes orchestration trainable. Because delegation is represented as a structured action over the four-tuple, the system can learn how to produce better delegations.

The paper tests this in two ways.

First, it fine-tunes Qwen3-8B as an orchestrator while keeping Gemini-3-Flash as the sub-agent executor. This is not the main performance result; it is a test of whether orchestration quality can be improved by supervised fine-tuning. The reported GAIA accuracy rises from 56.97 to 68.48. The cost also increases from $0.36 to $0.68 average cost per task, so the gain is not free. Reality insists on appearing in the invoice.

Second, the paper uses iterative in-context learning for cost-aware routing. Here the system optimizes the orchestrator instruction through repeated interaction, using both performance and monetary cost feedback. In the mixed-model setting, the result improves from 72.12 accuracy at $0.70 average cost to 75.15 accuracy at $0.57.

That is important because it separates two ideas that are often conflated:

| Optimization target | What is being improved | Evidence role in the paper |
| --- | --- | --- |
| Task orchestration via SFT | Decomposition, context selection, tool allocation | Shows orchestration can be learned through training |
| Cost-aware routing via in-context learning | Choosing models more efficiently across subtasks | Shows orchestration can improve the cost-performance frontier |
| Main benchmark comparison | End-to-end performance across agentic environments | Provides primary evidence of system effectiveness |
| Context-control ablation | Whether curated context matters | Supports the mechanism behind the system |
| Plug-and-play sub-agent tests | Whether the orchestrator can work with different executor backends | Tests framework-level flexibility |

For business use, the implication is subtle but powerful. The orchestrator is not merely a prompt. It is a policy surface. It can be trained, monitored, compared, and optimized. That creates a path toward operational learning: better delegations after observing which subtasks fail, which context packages help, which tools are overexposed, and which model choices waste money.

Model routing is where the cost argument becomes credible

Many enterprise AI cost arguments are glorified wishful thinking. “Use a smaller model for simple tasks” is easy to say and hard to implement safely. The interesting part is not knowing that cheap models exist. Everyone knows that. The hard part is deciding, at runtime, which step can safely be cheap.

AOrchestra’s four-tuple includes the model as part of the delegated sub-agent. That means model choice becomes a normal orchestration decision rather than a global system setting. A simple extraction step can use a cheaper model. A complex reasoning or code-repair step can use a stronger model. A failed attempt can be retried with a different executor recipe.

The paper’s cost-aware in-context learning experiment is therefore more than a pricing anecdote. It demonstrates a design path: collect trajectories, measure outcome and cost, revise the orchestrator’s routing behavior, and move toward a better Pareto frontier.
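A toy version of per-step routing looks like the sketch below. Everything in it is hypothetical: the tier names, the prices, the integer difficulty scale, and the rule of escalating one tier per failed attempt. The paper learns routing behavior from trajectories rather than hard-coding a rule like this, so treat it as an illustration of where the decision lives, not of how AOrchestra makes it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_call: float  # illustrative dollars, not real prices
    capability: int       # illustrative 1..3 scale


# Hypothetical tiers; substitute whatever models and prices apply.
TIERS: Tuple[ModelTier, ...] = (
    ModelTier("small-model", 0.01, 1),
    ModelTier("mid-model", 0.05, 2),
    ModelTier("large-model", 0.30, 3),
)


def pick_model(difficulty: int, failed_attempts: int = 0) -> ModelTier:
    """Choose the cheapest tier that meets the step's difficulty,
    escalating by one capability level per failed attempt."""
    required = difficulty + failed_attempts
    eligible = [tier for tier in TIERS if tier.capability >= required]
    if not eligible:
        # Nothing is strong enough on paper: fall back to the strongest tier.
        return max(TIERS, key=lambda tier: tier.capability)
    return min(eligible, key=lambda tier: tier.cost_per_call)


extraction_step = pick_model(difficulty=1)
retry_after_failure = pick_model(difficulty=1, failed_attempts=1)
code_repair_step = pick_model(difficulty=3)
```

The point of the sketch is structural: model choice is an argument of each delegation, so it can be measured and revised per subtask instead of being fixed once for the whole system.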

Business translation:

  1. Do not begin with “Which model should our company use?”
  2. Begin with “What subtasks does this workflow contain?”
  3. Then ask which subtasks need expensive reasoning, which need cheap extraction, which need tools, and which need no model at all.

The model bill is not only a vendor-pricing issue. It is an orchestration-design issue.

Plug-and-play executors matter because enterprise systems are never clean

The paper also tests whether AOrchestra depends on one particular executor implementation. On Terminal-Bench, the authors swap in different sub-agent backends while keeping the orchestrator fixed. ReAct-style and Mini-SWE-style sub-agents both improve over their standalone baselines in the reported setup.

This is best read as a robustness and implementation-flexibility test. It does not prove that any arbitrary executor can be plugged in and magically improved. It does suggest that the orchestrator abstraction is not tightly bound to one sub-agent implementation.

That matters in enterprise systems because few organizations have the luxury of a clean agent stack. Some workflows use browser automation. Some use internal APIs. Some use document extraction. Some require code execution in a sandbox. Some are trapped inside a legacy system that nobody wants to discuss but everybody depends on.

A useful orchestration layer should not care whether the executor is a ReAct agent, a code agent, a retrieval agent, or a narrow deterministic tool chain. It should care about the contract: what instruction is given, what context is passed, which tools are exposed, which model is used, and what observation comes back.

This contract-based view is where AOrchestra becomes more interesting for system design than for benchmark watching.
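One way to express that contract in code is a structural interface: anything that accepts a recipe and returns an observation qualifies as an executor. The sketch below uses Python's `typing.Protocol` for this. The names `Recipe`, `Observation`, `Executor`, and the two stub executors are assumptions for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass(frozen=True)
class Recipe:
    instruction: str
    context: str
    tools: Tuple[str, ...]
    model: str


@dataclass(frozen=True)
class Observation:
    succeeded: bool
    summary: str


class Executor(Protocol):
    """The contract: take a recipe, return an observation."""
    def run(self, recipe: Recipe) -> Observation: ...


class DeterministicToolChain:
    """No model at all: a fixed script behind the same contract."""
    def run(self, recipe: Recipe) -> Observation:
        return Observation(True, f"script handled: {recipe.instruction}")


class StubAgent:
    """Stand-in for an LLM-backed agent (ReAct-style, code agent, and so on)."""
    def run(self, recipe: Recipe) -> Observation:
        if not recipe.tools:
            return Observation(False, "no tools exposed for this subtask")
        return Observation(True, f"{recipe.model} ran with {len(recipe.tools)} tool(s)")


def dispatch(executor: Executor, recipe: Recipe) -> Observation:
    """The orchestration layer only depends on the contract, not the backend."""
    return executor.run(recipe)


recipe = Recipe("Extract invoice total", "PDF already OCR'd", ("read_file",), "small-model")
from_script = dispatch(DeterministicToolChain(), recipe)
from_agent = dispatch(StubAgent(), recipe)
```

Neither executor inherits from `Executor`; they satisfy it by shape alone, which is a reasonable fit for stacks where the backends were built by different teams at different times.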

What Cognaptus would infer for enterprise automation

The paper directly shows benchmark improvements, a context ablation, learnability through SFT, cost-aware routing through prompt optimization, and executor pluggability tests.

Cognaptus would infer a broader enterprise design pattern:

| Paper result | Business interpretation | Boundary |
| --- | --- | --- |
| Four-tuple sub-agent abstraction | Treat each delegated step as a controlled execution package | Requires workflow instrumentation and good task decomposition |
| Curated context beats no-context and full-context in the ablation | Memory management should be selective, not maximal | The ablation uses a 50-sample GAIA subset |
| SFT improves a weaker orchestrator | Delegation quality can become an internal capability | Training data quality and governance matter |
| Mixed-model ICL improves accuracy while reducing cost | Cost control should be learned at the subtask level | Pricing and model behavior change over time |
| Plug-in sub-agents outperform standalone baselines in tests | Orchestration can sit above heterogeneous executors | Integration cost may dominate in real deployments |

The practical path is not to create a colorful family of named agents and hope they form a functioning department. The practical path is to instrument workflows into delegable units.

A useful enterprise implementation would likely start with a small set of high-value workflows:

  • customer support escalation;
  • internal compliance review;
  • financial document extraction;
  • software maintenance triage;
  • research-to-report generation;
  • sales operations cleanup;
  • procurement or vendor-screening workflows.

For each workflow, the design question becomes: what are the recurring subtask types, what context does each need, which tools should be exposed, and what model tier is justified?

That turns “agentic AI” from a branding category into an operating design.

The boundaries are real, and they are mostly operational

AOrchestra is promising, but the paper’s results should not be stretched into claims it does not make.

First, the evidence is benchmark-based. GAIA, Terminal-Bench, and SWE-Bench-Verified are valuable because they are harder and more realistic than simple Q&A tests. But they are still controlled evaluations. Enterprise deployment adds permission design, audit trails, user acceptance, exception handling, data residency, security review, and integration with systems that were apparently designed to test human patience.

Second, some evaluations are sampled for cost reasons. Terminal-Bench uses a random sample of 70 tasks from an 89-task split in the main experiments, and SWE-Bench-Verified uses 100 sampled tasks from a 500-task test split. That does not invalidate the results, but it does affect how confidently one should generalize exact magnitudes.
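A back-of-envelope calculation shows what those sample sizes mean for precision. The sketch below applies the standard normal-approximation interval for a proportion to the reported Pass@1 figures. This is an added illustration, not an analysis from the paper, and it assumes tasks are independent and ignores the fact that the samples were drawn from finite task pools. Under those assumptions the 95% intervals are roughly plus or minus 12 points for Terminal-Bench and plus or minus 7.5 points for SWE-Bench-Verified.

```python
import math
from typing import Tuple


def pass_rate_interval(pass_rate_percent: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for a pass rate measured on n tasks.

    Returns (low, high) in percentage points. z=1.96 gives roughly 95% coverage.
    """
    p = pass_rate_percent / 100
    half_width = z * math.sqrt(p * (1 - p) / n) * 100
    return (pass_rate_percent - half_width, pass_rate_percent + half_width)


# Reported AOrchestra Pass@1 with the sample sizes quoted in the text.
terminal_low, terminal_high = pass_rate_interval(52.86, n=70)
swe_low, swe_high = pass_rate_interval(82.00, n=100)

print(f"Terminal-Bench 2.0: {terminal_low:.1f} to {terminal_high:.1f}")
print(f"SWE-Bench-Verified: {swe_low:.1f} to {swe_high:.1f}")
```

The reported gaps over the baselines are larger than these half-widths, so the direction of the result looks solid. The exact size of the gap is what deserves caution.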

Third, cost results depend on model prices, model behavior, and routing assumptions. The paper’s cost-aware routing result is valuable as a system-design demonstration. It should not be copied as a fixed pricing forecast.

Fourth, the system depends heavily on the orchestrator’s ability to summarize history and create good delegation instructions. Bad orchestration can still fail elegantly, which is a fancy way of failing.

Finally, tool safety becomes more important, not less. AOrchestra’s design restricts sub-agent tools per task, which is good. But a production system still needs policy enforcement outside the model. The model should not be the only adult in the room.
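Enforcement outside the model can be as plain as a gateway that checks every tool call against the delegated allowlist before dispatching it. The sketch below is one hypothetical shape for that check. The names `make_gateway`, `ToolNotAllowed`, and the two example tools are invented for illustration, and a production version would also need authentication, logging, and argument validation.

```python
from typing import Any, Callable, Dict, FrozenSet


class ToolNotAllowed(PermissionError):
    """Raised when a sub-agent requests a tool outside its delegated set."""


def make_gateway(
    registry: Dict[str, Callable[..., Any]],
    allowed: FrozenSet[str],
) -> Callable[..., Any]:
    """Return a call function that only dispatches tools on the allowlist.

    The check runs in ordinary code, so it holds even if the model
    asks for something it was never given.
    """
    def call(tool_name: str, *args: Any, **kwargs: Any) -> Any:
        if tool_name not in allowed:
            raise ToolNotAllowed(f"{tool_name!r} is not permitted for this subtask")
        return registry[tool_name](*args, **kwargs)
    return call


# Hypothetical tools for the example.
REGISTRY: Dict[str, Callable[..., Any]] = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}

# This sub-agent was delegated read access only.
gateway = make_gateway(REGISTRY, frozenset({"read_file"}))
allowed_result = gateway("read_file", "a.txt")

try:
    gateway("delete_file", "a.txt")
    blocked = False
except ToolNotAllowed:
    blocked = True
```

The allowlist comes from the delegation, but the refusal comes from the gateway, so a confused or manipulated sub-agent cannot widen its own permissions.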

The real lesson is not “more agents.” It is better delegation.

AOrchestra’s contribution is useful because it moves the conversation away from agent headcount.

The paper’s strongest idea is that sub-agents should be dynamically created as execution recipes. The orchestrator decides the instruction, context, tools, and model. The executor performs the work. The returned observation informs the next delegation. This makes context management explicit, tool exposure controlled, and model routing optimizable.

That is a cleaner architecture than a static cast of heroic agents with job titles.

For business readers, the takeaway is practical: do not ask whether your workflow needs a “research agent,” a “coding agent,” or a “review agent” first. Ask what temporary executor should be created for the next step, what it should know, what it should be allowed to do, and how much intelligence that step is worth.

That is less glamorous than a multi-agent org chart. It is also more likely to survive contact with work.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jianhao Ruan et al., “AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration,” arXiv:2602.03786, 2026. https://arxiv.org/html/2602.03786