Right Tool, Right Thought: Difficulty-Aware Orchestration for Agentic LLMs

Tickets are not equal.

Some user requests are glorified form-filling. Some are ambiguous investigations with missing context, tool calls, intermediate checks, and enough failure modes to keep a compliance officer quietly blinking at the ceiling. Yet many agentic systems still behave as if every query deserves the same ritual: summon the agents, run the workflow, pass outputs around, maybe add a debate round for theatrical effect, and hope the bill does not look too much like modern art.

That is the problem Difficulty-Aware Agentic Orchestration, or DAAO, tries to solve.¹ Its central claim is not that multi-agent systems need more agents. That would be the obvious, expensive, and faintly desperate answer. The paper’s claim is sharper: an agentic workflow should grow with the difficulty of the query. Easy questions should receive shallow, cheap workflows. Hard questions should earn deeper reasoning, broader operator use, and stronger model allocation.

The difference matters because enterprise AI does not fail only by giving wrong answers. It also fails by answering trivial questions with absurd machinery, routing everything through premium models, and treating latency as somebody else’s character-building exercise. DAAO is interesting because it puts cost and capability into the same control loop.

The expensive mistake is treating every query like a crisis

The paper positions DAAO against three families of prior approaches.

First, there are single-agent and prompting methods such as Chain-of-Thought, ComplexCoT, and Self-Consistency. These improve reasoning behaviour, but they do not really design a workflow. They are tactics, not an operating model.

Second, there are automated workflow systems such as ADAS, AFlow, and MaAS. These are closer to the target: they search over agentic structures and can build more sophisticated pipelines. But the paper argues that task-level workflows often become too uniform. A workflow optimised for a benchmark category may perform well on average while over-processing easy examples and under-serving unusually hard ones.

Third, there are LLM routers such as PromptLLM, RouteLLM, and MasRouter. These focus on choosing models, often with cost-performance trade-offs in mind. But routing alone does not decide whether the task needs CoT, ReAct, Debate, Self-Consistency, Testing, Review, or Ensemble. Choosing the model without choosing the operator is like assigning a very talented employee to a meeting whose agenda nobody wrote. Technically possible. Organisationally suspect.

DAAO’s intervention is to connect these decisions. It asks three questions per query:

Decision	What DAAO controls	Operational meaning
Difficulty	How hard does this query appear to be?	Decide whether the system should stay shallow or escalate.
Operator allocation	Which reasoning or collaboration operators should be activated, and at what depth?	Choose the workflow structure, not just the model.
LLM routing	Which model should run each selected operator?	Spend expensive model calls only where they are useful.

The paper’s useful idea is not any one row in that table. It is the coupling. Difficulty becomes the prior that shapes the workflow, and the workflow then shapes the model-routing decision.

DAAO starts by pricing the question before building the workflow

DAAO uses a query difficulty estimator as the first mechanism in the pipeline. The estimator is implemented as a variational autoencoder with a learned difficulty head. It encodes the input query into a latent representation $z$ and decodes a scalar difficulty score $d \in (0,1)$.

That scalar is not just decorative telemetry. It conditions the workflow depth, the operator allocation, and the model router. Higher difficulty encourages larger-capacity workflows: more layers, more active operators, and potentially stronger model assignment. Lower difficulty encourages conservative workflows.

The paper defines the overall optimisation objective as a utility-minus-cost problem:

$$ \max_{P(G \mid Q)} \; \mathbb{E}_{(Q,a)\sim D,\; G\sim P(G\mid Q)} \left[ U(G;Q,a) - \lambda\, C(G;Q) \right] $$

In plain English: find query-specific workflows that perform well without spending as if every input were a doctoral qualifying exam.
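A toy version makes the trade-off easier to see. The Python sketch below scores three hypothetical candidate workflows by utility minus λ times cost and picks the winner. Everything in it is an assumption for illustration: the candidate names, the utility and cost figures, and the `Candidate` and `best_workflow` helpers are invented, and the paper learns a distribution over workflows rather than enumerating a short list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical workflow with made-up utility and cost estimates."""
    name: str
    utility: float  # stands in for U(G; Q, a), e.g. expected correctness
    cost: float     # stands in for C(G; Q), in arbitrary cost units


def objective(candidate: Candidate, lam: float) -> float:
    """Utility minus lambda-weighted cost, mirroring the bracketed term."""
    return candidate.utility - lam * candidate.cost


def best_workflow(candidates: list[Candidate], lam: float) -> Candidate:
    """Pick the candidate with the highest penalised utility."""
    return max(candidates, key=lambda c: objective(c, lam))


# Illustrative numbers only; they do not come from the paper.
CANDIDATES = [
    Candidate("single CoT pass", utility=0.70, cost=1.0),
    Candidate("CoT + Review", utility=0.80, cost=2.5),
    Candidate("Debate + Ensemble", utility=0.84, cost=6.0),
]

if __name__ == "__main__":
    for lam in (0.0, 0.05, 0.1):
        print(lam, best_workflow(CANDIDATES, lam).name)
```

With λ at zero the most elaborate workflow wins on raw utility. As λ rises, the choice slides toward the cheaper options, which is the behaviour the objective is meant to encourage.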

The important part is the feedback loop. DAAO updates difficulty estimates using workflow outcomes. If a workflow succeeds, the estimated difficulty can be lowered, making future workflows simpler. If it fails, difficulty is raised, encouraging more capable workflows. This is not manual difficulty labelling. It is outcome-conditioned calibration.
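The direction of that feedback can be sketched in a few lines of Python. This is not the paper's training procedure, which updates a learned estimator; it swaps in a simple moving-average nudge to show the sign of the update. The `update_difficulty` function and its `step` parameter are hypothetical.

```python
def update_difficulty(d: float, succeeded: bool, step: float = 0.2) -> float:
    """Nudge a difficulty estimate toward 0 after success and 1 after failure.

    A stand-in for outcome-conditioned calibration, not the paper's method.
    Keeping 0 < step < 1 keeps the result strictly inside (0, 1).
    """
    if not 0.0 < d < 1.0:
        raise ValueError("difficulty must lie strictly between 0 and 1")
    if not 0.0 < step < 1.0:
        raise ValueError("step must lie strictly between 0 and 1")
    target = 0.0 if succeeded else 1.0
    return d + step * (target - d)


if __name__ == "__main__":
    d = 0.5
    for outcome in (False, False, True):
        d = update_difficulty(d, outcome)
        print(round(d, 3))
```

Two failures push the estimate up and one success pulls it back down, so the next workflow built for a similar query would be deeper or shallower accordingly.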

That calibration is the paper’s quiet centre of gravity. Without it, “difficulty-aware” becomes just another nice phrase on a slide, possibly between “AI-native” and “synergistic,” where many good words go to lose meaning.

The mechanism is depth, width, and model choice—not just routing

DAAO’s workflow is represented as a directed acyclic graph, layered from earlier operators to later ones. Each operator pairs a model with a reasoning or collaboration protocol. The operator inventory used in the experiments includes Chain-of-Thought, LLM-Debate, Review, Ensemble, ReAct, Self-Consistency, and Testing.

The architecture adapts in three dimensions.

First, depth changes with predicted difficulty. The paper sets a maximum layer count and uses the scalar difficulty to determine how many workflow layers to activate. In the case study figure, a simple query with estimated difficulty $d=0.2$ receives one layer; a medium query at $d=0.5$ receives three layers; a hard spreadsheet-style query at $d=0.7$ receives four layers. This figure is best read as an explanatory case study, not as independent proof. Its purpose is to make the mechanism visible.
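One simple rule that reproduces those three cases is to take the ceiling of difficulty times a maximum of five layers. That rule is an assumption made here for illustration; this article does not state the paper's exact formula, and `active_layers` is a hypothetical helper.

```python
import math


def active_layers(d: float, max_layers: int = 5) -> int:
    """Map a difficulty score in (0, 1) to a number of workflow layers.

    Assumed rule: ceil(d * max_layers), clamped to [1, max_layers].
    The tiny epsilon guards against floating-point overshoot at exact
    multiples such as 0.2 * 5.
    """
    if not 0.0 < d < 1.0:
        raise ValueError("difficulty must lie strictly between 0 and 1")
    if max_layers < 1:
        raise ValueError("max_layers must be at least 1")
    layers = math.ceil(d * max_layers - 1e-9)
    return max(1, min(max_layers, layers))


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.7):
        print(d, active_layers(d))
```

Under this assumed rule the case-study difficulties of 0.2, 0.5, and 0.7 map to one, three, and four layers, matching the figure as described.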

Second, operator width changes layer by layer. DAAO scores candidate operators based on the query, the difficulty representation, and the workflow history. It then activates operators using a cumulative-threshold rule. The threshold acts like a budget control: allow enough operators to cover the query, but do not keep adding machinery simply because the operator catalogue exists.
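A minimal Python sketch of a cumulative-threshold rule follows. It assumes operator scores are normalised with a softmax and that operators are added in descending order until their cumulative probability reaches the threshold; that is one plausible reading of the rule, not a transcription of the paper. The scores are invented, and `select_operators` is a hypothetical helper that ignores workflow history.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw operator scores into probabilities that sum to 1."""
    if not scores:
        raise ValueError("need at least one operator score")
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def select_operators(scores: dict[str, float], threshold: float) -> list[str]:
    """Activate top-ranked operators until cumulative probability >= threshold."""
    probs = softmax(scores)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    selected: list[str] = []
    cumulative = 0.0
    for name, prob in ranked:
        selected.append(name)
        cumulative += prob
        if cumulative >= threshold:
            break
    return selected


if __name__ == "__main__":
    toy_scores = {"CoT": 2.0, "Review": 1.0, "Debate": 0.0}  # illustrative
    for threshold in (0.3, 0.7, 0.95):
        print(threshold, select_operators(toy_scores, threshold))
```

A low threshold is satisfied by the single best operator, while a higher one pulls in more of the catalogue. That is why the threshold behaves like a budget dial.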

Third, model choice is made per selected operator. The LLM router uses embeddings of the query, difficulty, operator, and candidate model to assign different models to different workflow components. This matters because “use the best model” is rarely a strategy. It is a procurement confession. Some steps may need strong planning, others cheap generation, others final verification. DAAO treats those as different routing decisions.
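The per-operator idea can be illustrated without any embeddings. The sketch below replaces the paper's learned router with a hand-written lookup table of operator-model fit scores and a price penalty, purely to show that different operators can land on different models. The model names, fit scores, prices, and the `route` helper are all hypothetical, and difficulty conditioning is left out.

```python
def route(
    operators: list[str],
    fit: dict[tuple[str, str], float],
    price: dict[str, float],
    lam: float,
) -> dict[str, str]:
    """Assign each operator the model with the best fit minus price penalty."""
    plan: dict[str, str] = {}
    for op in operators:
        plan[op] = max(price, key=lambda model: fit[(op, model)] - lam * price[model])
    return plan


# Illustrative numbers only: how well each placeholder model suits each operator.
FIT = {
    ("CoT", "small-model"): 0.80,
    ("CoT", "large-model"): 0.85,
    ("Testing", "small-model"): 0.60,
    ("Testing", "large-model"): 0.90,
}
PRICE = {"small-model": 1.0, "large-model": 4.0}  # arbitrary cost units

if __name__ == "__main__":
    print(route(["CoT", "Testing"], FIT, PRICE, lam=0.05))
```

In this toy setup the large model's small edge on CoT is not worth its price, but its larger edge on Testing is, so one workflow ends up using both models.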

The main evidence says orchestration and routing work better together

The main benchmark comparison covers MMLU, GSM8K, MATH, HumanEval, and MBPP. DAAO is compared against vanilla model use, prompting methods, automated workflow systems, and LLM routers. The model pool includes GPT-4o-mini, Gemini-1.5-Flash, Llama-3.1-70B, and Qwen-2-72B.

The headline result is straightforward: DAAO reports the highest average score across the five main benchmarks.

Method	MMLU	GSM8K	MATH	HumanEval	MBPP	Average
Best single listed vanilla model	80.22	87.45	48.00	85.71	73.90	74.08
AFlow, best listed variant	83.10	91.16	52.00	90.93	81.67	79.73
MaAS, best listed variant	83.42	92.30	52.25	92.85	82.69	80.43
MasRouter	84.25	92.00	52.42	90.62	84.00	80.66
DAAO	84.90	94.40	55.37	94.65	86.95	83.26

The most useful comparison is not with weak baselines. It is with MasRouter and MaAS, because those already represent more serious attempts at model routing and query-specific agentic structure. DAAO’s average of 83.26 beats MasRouter’s 80.66 and MaAS’s strongest listed average of 80.43. On MATH, DAAO reaches 55.37 versus MasRouter’s 52.42.

That is not an earth-shattering gap in absolute terms. Good. We should distrust earth-shattering gaps by default; they tend to come with footnotes, benchmark quirks, or a magician’s handkerchief. The result is more credible as an incremental but consistent gain from coupling decisions that are often separated: how much workflow, which operators, and which models.

GAIA shows where deeper orchestration is supposed to matter

The GAIA benchmark is used as a comparison with prior agentic systems in more complex, tool-augmented settings. This is the main comparative evidence for the paper’s claim that DAAO helps when tasks require multi-step planning, tool use, and more realistic workflows.

Method	Level 1	Level 2	Level 3	Average
GPT-4o-mini	7.53	4.40	0.00	4.65
ADAS	13.98	4.40	0.00	6.69
AFlow	10.75	8.81	4.08	8.00
MaAS	20.45	18.61	6.25	17.64
DAAO	30.42	24.00	8.50	25.97

This table is easy to over-read. DAAO’s GAIA average of 25.97 is materially higher than MaAS at 17.64 and AFlow at 8.00. That supports the paper’s argument that dynamic workflow construction and per-operator routing help in complex settings.

But Level 3 remains low at 8.50. That matters. The right interpretation is not “DAAO solves realistic agentic work.” It is “DAAO improves over the compared systems on a difficult benchmark, while the hard end of the benchmark remains very hard.” For enterprise use, that distinction is the difference between a useful orchestration pattern and a premature deployment memo.

The cost result is the business hook, but only because accuracy does not collapse

The MATH cost comparison is especially relevant for business readers because it reports training cost, inference cost, overall cost, and accuracy side by side. This is the evidence for DAAO’s efficiency claim.

Method	Training cost	Inference cost	Overall cost	Accuracy
AFlow	$22.50	$1.66	$24.16	51.82
MaAS	$3.38	$0.42	$3.80	51.82
MasRouter	$3.56	$0.65	$4.21	52.42
DAAO	$2.34	$0.27	$2.61	55.37

The business meaning is not merely “DAAO is cheaper.” Cheap systems are easy. A blank response is very cheap. The meaningful claim is that DAAO reports the lowest total cost in this comparison while also reporting the highest MATH accuracy.

Against AFlow, the cost reduction is dramatic: $2.61 versus $24.16 overall. Against MasRouter, the comparison is more strategically interesting: $2.61 versus $4.21, while accuracy rises from 52.42 to 55.37. That suggests the saving is not just from using cheaper models. It comes from avoiding redundant collaborative steps and not over-building workflows for easy queries.
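For readers who want the percentages, the short Python check below recomputes the overall costs from the training and inference columns of the table above and expresses DAAO's saving against each baseline. The only inputs are the table's figures; the helper names are ours.

```python
# (training cost, inference cost) in dollars, copied from the MATH cost table.
COSTS = {
    "AFlow": (22.50, 1.66),
    "MaAS": (3.38, 0.42),
    "MasRouter": (3.56, 0.65),
    "DAAO": (2.34, 0.27),
}


def overall(method: str) -> float:
    """Training plus inference cost, rounded to cents."""
    training, inference = COSTS[method]
    return round(training + inference, 2)


def saving_vs(baseline: str) -> float:
    """Percentage by which DAAO's overall cost undercuts the baseline."""
    return round(100 * (1 - overall("DAAO") / overall(baseline)), 1)


if __name__ == "__main__":
    for name in ("AFlow", "MasRouter", "MaAS"):
        print(name, overall(name), saving_vs(name))
```

The totals match the table's overall column, and the savings come to roughly 89% against AFlow, 38% against MasRouter, and 31% against MaAS.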

For enterprise AI teams, this is the part worth taking seriously. The question is not whether every organisation should reproduce DAAO exactly. The question is whether their agent stack currently has any principled way to decide that a query deserves one cheap pass, three tool calls, or a full multi-agent workflow. In many cases, the answer is a very expensive shrug.

The ablations identify the actual control knob

The ablation study is not a second thesis. Its purpose is to test whether DAAO’s main components contribute to the reported performance-cost balance.

The paper removes three components: the difficulty-aware module, the LLM selector, and the cost-awareness term. The results are reported on HumanEval and MATH.

Variant	HumanEval Pass@1	HumanEval cost	MATH accuracy	MATH cost
Full DAAO	94.65	1.10	55.37	0.55
Without difficulty awareness	92.21	1.64	52.18	0.88
Without LLM selection	92.69	1.38	53.24	0.79
Without cost-awareness term	94.72	1.88	55.40	1.00

Costs in this table are reported in $10^{-3}$ dollars per query.

The most important row is “without difficulty awareness.” Removing it lowers performance and increases cost. On MATH, accuracy drops from 55.37 to 52.18 while cost rises from 0.55 to 0.88. That supports the paper’s core mechanism: estimating query difficulty is not an ornamental classifier sitting politely beside the system. It is the control knob.

The “without cost-awareness” row is also instructive. Performance barely changes, and in the table it even ticks up slightly on MATH from 55.37 to 55.40, but cost nearly doubles from 0.55 to 1.00. This is exactly what one would expect if the cost term disciplines the architecture rather than creating reasoning ability by itself. It keeps the system from buying marginal gains with unnecessary computation.
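The same reading can be done mechanically. This Python snippet takes the four rows of the ablation table above and reports, for each ablated variant, the change in score and the cost multiple relative to full DAAO. The figures are the table's; the dictionary keys and the `versus_full` helper are our own naming.

```python
# (HumanEval pass@1, HumanEval cost, MATH accuracy, MATH cost) from the table.
ABLATION = {
    "full": (94.65, 1.10, 55.37, 0.55),
    "no_difficulty": (92.21, 1.64, 52.18, 0.88),
    "no_llm_selection": (92.69, 1.38, 53.24, 0.79),
    "no_cost_term": (94.72, 1.88, 55.40, 1.00),
}


def versus_full(variant: str) -> dict[str, float]:
    """Score deltas (points) and cost multiples relative to full DAAO."""
    he_acc, he_cost, math_acc, math_cost = ABLATION[variant]
    f_he_acc, f_he_cost, f_math_acc, f_math_cost = ABLATION["full"]
    return {
        "humaneval_points": round(he_acc - f_he_acc, 2),
        "humaneval_cost_x": round(he_cost / f_he_cost, 2),
        "math_points": round(math_acc - f_math_acc, 2),
        "math_cost_x": round(math_cost / f_math_cost, 2),
    }


if __name__ == "__main__":
    for variant in ("no_difficulty", "no_llm_selection", "no_cost_term"):
        print(variant, versus_full(variant))
```

Removing difficulty awareness costs about 3.2 MATH points at 1.6 times the cost. Removing the cost term gains 0.03 MATH points at roughly 1.8 times the cost.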

That is the kind of result business readers should like: not magical, but legible.

The sensitivity tests say there is a budget frontier, not a free lunch

The sensitivity analysis examines parameters such as maximum layer count, cost penalty, sample count, and threshold for activating operators. This is robustness and configuration evidence, not the main proof.

One result is especially practical. Increasing the maximum workflow depth from 4 to 5 improves HumanEval performance from 92.9% to 94.6%, but further increases produce only marginal gains while raising inference cost. Similarly, the operator activation threshold shows a familiar pattern: higher thresholds can improve performance slightly, but they also activate more operators and increase cost.

For HumanEval, cost rises from 0.86 at threshold 0.1 to 2.30 at threshold 0.7, while performance moves from 92.80 to 94.80. For GSM8K, cost rises from 0.40 to 0.96 over the same threshold range, while performance moves from 91.99 to 94.40, with the best reported value of 94.75 at threshold 0.6. The authors choose 0.3 as the balance point.
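Those endpoints can be turned into a rough marginal price: how much extra cost each additional point of performance required between threshold 0.1 and threshold 0.7. The Python sketch below uses only the figures quoted in the paragraph above; the cost unit is whatever the sensitivity analysis uses, and `marginal_cost_per_point` is a hypothetical helper. It treats the range as a straight line, which the 0.6 peak on GSM8K shows is a simplification.

```python
def marginal_cost_per_point(
    low: tuple[float, float], high: tuple[float, float]
) -> float:
    """Extra cost paid per extra point of performance between two settings.

    Each argument is a (performance, cost) pair.
    """
    perf_low, cost_low = low
    perf_high, cost_high = high
    gain = perf_high - perf_low
    if gain <= 0:
        raise ValueError("the higher setting must improve performance")
    return (cost_high - cost_low) / gain


# (performance, cost) at thresholds 0.1 and 0.7, as quoted in the text.
HUMANEVAL = marginal_cost_per_point((92.80, 0.86), (94.80, 2.30))
GSM8K = marginal_cost_per_point((91.99, 0.40), (94.40, 0.96))

if __name__ == "__main__":
    print(round(HUMANEVAL, 3), round(GSM8K, 3))
```

On these endpoints each extra HumanEval point costs about 0.72 cost units and each extra GSM8K point about 0.23, so the same dial has a different price on different workloads.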

The lesson is not that 0.3 is some universal constant sent down from the cloud infrastructure gods. It is that agent orchestration needs an explicit budget frontier. Without one, every improvement invites another operator, another layer, another model call, and eventually another invoice.

The inductive tests are promising, but they are not deployment proof

The paper also evaluates cross-domain optimisation and router adaptation to an unseen model.

In cross-domain training, MATH training transfers well to GSM8K, producing 95.44 on GSM8K. Joint MATH+GSM8K training slightly improves MATH to 56.42 and GSM8K to 95.70. HumanEval training transfers less strongly to MATH, with 54.46 on MATH, which the paper interprets as domain-specific overfitting. Joint HumanEval+MATH training improves HumanEval to 95.00 and MATH to 55.50.

This is useful exploratory evidence. It suggests the difficulty and workflow policy can carry some shared structure across related domains, especially where reasoning patterns overlap. But the gains are modest, roughly in the 0.35 to 1.05 range for joint training in the paper’s discussion. That makes the result encouraging, not definitive.

The router analysis adds DeepSeek-v3 to the model pool. DAAO selects the new model 29% of the time on MATH and 15% on MMLU, improving MATH from 55.37 to 56.20 and MMLU from 84.90 to 85.66. This supports the idea that the router can incorporate a new model without being completely retrained around it.

Again, the enterprise interpretation should be restrained. This is evidence for adaptive model selection in the tested setup. It is not proof that a production router will behave safely across proprietary models, changing vendor APIs, sensitive data, compliance regimes, and real user traffic. Annoying sentence, yes. Necessary sentence, also yes.

What this implies for enterprise agent stacks

The practical business pathway is adaptive workload triage.

Most enterprise agent systems already have some implicit triage. A developer adds a shortcut here, a fallback there, a “use GPT-4 if confidence is low” rule somewhere else, and eventually the whole thing resembles a small government. DAAO points toward making that triage explicit and learnable.

A production version does not need to copy the paper’s full VAE-based design on day one. The useful operational pattern is:

Layer	Production analogue	What to measure
Difficulty estimate	Classify request complexity using input features, retrieval uncertainty, tool requirements, and historical success	Solve rate by difficulty band; escalation error rate
Workflow depth	Decide whether to use a shallow answer, tool-using workflow, or multi-agent verification	Latency, token cost, completion quality
Operator allocation	Select CoT, ReAct, Review, Testing, Debate, or Ensemble only when justified	Operator-level contribution and failure rate
Model routing	Assign models per operator rather than per whole query	Cost per successful answer; model usage concentration
Feedback update	Adjust difficulty and routing from outcome logs	Whether easy cases get cheaper and hard cases get better

The most important metric is not average accuracy alone. It is accuracy per unit of orchestration. That means tracking whether each extra layer, operator, and premium model call actually improves the outcome. If not, the system is not agentic. It is just expensive choreography.
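One way to operationalise that metric is cost per successful answer, broken down by difficulty band. The Python sketch below computes it from a list of run logs. The `RunLog` schema, the band labels, and the sample data are assumptions for illustration, not something the paper prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunLog:
    """One completed request; a hypothetical logging schema."""
    band: str      # difficulty band, e.g. "easy" or "hard"
    cost: float    # total spend on the request, in any consistent unit
    success: bool  # whether the outcome was judged correct


def cost_per_success(logs: list[RunLog]) -> dict[str, float]:
    """Total spend divided by successful answers, per difficulty band.

    Bands with no successes report infinity so they stand out.
    """
    spend: dict[str, float] = defaultdict(float)
    wins: dict[str, int] = defaultdict(int)
    for log in logs:
        spend[log.band] += log.cost
        wins[log.band] += int(log.success)
    return {
        band: (spend[band] / wins[band] if wins[band] else float("inf"))
        for band in spend
    }


if __name__ == "__main__":
    sample = [  # invented data
        RunLog("easy", 1.0, True),
        RunLog("easy", 1.0, True),
        RunLog("hard", 4.0, True),
        RunLog("hard", 6.0, False),
    ]
    print(cost_per_success(sample))
```

Failed runs still count toward spend, which is the point: a band whose cost per success climbs after a new operator or model is added is paying for choreography rather than outcomes.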

For customer support, that could mean routing simple policy questions to a shallow retrieval-and-answer path, escalating ambiguous refund disputes to tool use and review, and reserving multi-agent debate for high-value exceptions. For code generation, it could mean using lightweight generation for simple snippets, adding Testing for executable tasks, and reserving Self-Consistency or Ensemble for high-risk changes. For finance or compliance workflows, it could mean using difficulty signals to trigger retrieval breadth, audit trails, and human review.

The inference is ours, not the paper’s direct claim: DAAO is best understood as a governance pattern for agentic cost and capability. It gives teams a vocabulary for controlling how much thinking a system is allowed to buy.

Where the result should not be over-sold

DAAO is benchmark-tested, not enterprise-certified. The experiments cover six public benchmarks, and the paper reports strong results across math reasoning, coding, general knowledge, and GAIA-style tool-augmented tasks. That is meaningful. It is not the same as proving reliability inside a bank, insurer, hospital, logistics network, or public-sector case-management system.

The main boundaries are clear.

First, difficulty estimation can be wrong. A hard query mis-scored as easy may receive a shallow workflow and fail silently unless the system has escalation checks. In production, difficulty should not be a single irreversible gate. It should be paired with confidence probes, verification, and fallback triggers.
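A minimal Python sketch of a reversible gate follows. It starts at the tier suggested by the difficulty estimate, checks the answer with a verifier, and escalates to the next tier on failure. The tiers, the verifier, and `answer_with_escalation` are all hypothetical; a real system would plug in its own workflows and checks.

```python
from typing import Callable, Sequence

Workflow = Callable[[str], str]
Verifier = Callable[[str, str], bool]


def answer_with_escalation(
    query: str,
    tiers: Sequence[Workflow],
    start_tier: int,
    verify: Verifier,
) -> tuple[str, int, bool]:
    """Run workflows from start_tier upward until one passes verification.

    Returns (answer, tier_used, verified). If even the last tier fails the
    check, verified is False so the caller can hand off to human review.
    """
    if not tiers:
        raise ValueError("need at least one workflow tier")
    tier = min(max(start_tier, 0), len(tiers) - 1)  # clamp to a valid index
    while True:
        answer = tiers[tier](query)
        if verify(query, answer):
            return answer, tier, True
        if tier == len(tiers) - 1:
            return answer, tier, False
        tier += 1


if __name__ == "__main__":
    toy_tiers = [lambda q: "guess", lambda q: "checked", lambda q: "reviewed"]
    toy_verify = lambda q, a: a != "guess"  # stand-in for a real check
    print(answer_with_escalation("refund dispute", toy_tiers, 0, toy_verify))
```

The difficulty estimate only chooses where to start. A mis-scored hard query costs one wasted shallow pass instead of a silent failure, provided the verifier is good enough to notice.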

Second, the operator set is finite and hand-defined. DAAO chooses among CoT, Debate, Review, Ensemble, ReAct, Self-Consistency, and Testing. That is a practical catalogue, but it is still a catalogue. Enterprise workflows may require domain-specific tools, approval steps, access controls, and audit events that do not look like benchmark operators.

Third, costs depend on the model pool and pricing environment. The paper’s cost results are persuasive within its setup, but vendor pricing, latency, rate limits, and data residency rules can change the economic picture. A router that looks efficient in one model pool may become foolish after a procurement change. Many systems have achieved this without academic assistance.

Fourth, the evaluation is still answer-centric. Even GAIA, useful as it is, cannot fully represent production concerns such as partial completion, user trust, privacy leakage, tool failure recovery, adversarial prompts, or legal accountability. Those concerns require separate evaluation harnesses.

The right conclusion is not “deploy DAAO.” It is “stop designing agent workflows as if query difficulty were somebody else’s problem.”

The takeaway: agentic systems need budgeted intelligence

DAAO’s contribution is to make multi-agent orchestration conditional. The system estimates query difficulty, uses that signal to adjust workflow depth and operator width, and routes each operator to an appropriate model under cost constraints. The evidence supports the mechanism: strong benchmark averages, better GAIA performance than compared agentic baselines, lower MATH cost with higher accuracy, and ablations showing that difficulty awareness is doing real work.

For business teams, the lesson is admirably unromantic. Better agentic AI is not necessarily more agents, more debate, or larger models. It is the right amount of reasoning, applied at the right point, with the right model, under a budget that someone can defend.

That may sound less glamorous than “autonomous digital workforce.” It is also far more likely to survive contact with an invoice.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jinwei Su et al., “Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows,” arXiv:2509.11079, 2026, https://arxiv.org/abs/2509.11079.
