Enterprise AI teams love an architecture diagram. Boxes, arrows, specialist agents, memory stores, tool registries, a tasteful orchestrator sitting at the top like a middle manager with JSON access. It looks reassuring. It looks intentional. It also looks suspiciously like the kind of thing that can fail in six different places while still producing a beautifully formatted answer.
That is why ServiceNow’s AgentArch paper is useful.[1] It does not ask whether agents are exciting. We have had enough excitement; it is starting to look like a liability. Instead, it asks a more operational question: when you vary the architecture of an enterprise agent system, what actually changes?
The answer is not the tidy one vendors would prefer. More agents do not automatically help. ReAct-style reasoning does not reliably improve execution. Complete memory is not always worth its context cost. “Thinking tools” help in some arithmetic-heavy workflows and barely move the needle elsewhere. And the best architecture for one model on one workflow may be mediocre for another.
This is not a paper about the end of agents. It is more annoying than that. It is a paper about architecture being contingent, measurable, and easy to over-romanticise.
The benchmark tests org design, not just model IQ
AgentArch evaluates 18 agentic configurations across six LLMs on two mock enterprise workflows. The models are GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini, LLaMA 3.3 70B, and Claude Sonnet 4, with temperature set to zero for reproducibility.
The workflows are deliberately business-flavoured rather than puzzle-flavoured:
| Workflow | What it tests | Scale |
|---|---|---|
| Requesting Time Off | PTO eligibility, date counting, leave balances, policy checks, approval or rejection | 8 tools, 3 agents |
| Customer Request Routing | Customer-service triage, duplicate detection, case creation, classification, escalation, document verification, response generation | 31 tools, 9 agents |
That split matters. The PTO task is closer to a deterministic administrative process: count days, check balances, apply policy. The customer-routing task is messier: it requires classification, routing, escalation judgement, and preserving context across a larger tool surface.
The benchmark then varies four architectural choices:
| Design dimension | Options tested | What this isolates |
|---|---|---|
| Orchestration | Single agent; orchestrator-led isolated agents; orchestrator-led open agent network | Whether specialisation and coordination beat one generalist |
| Agent style | Function calling; ReAct | Whether explicit thought/action formatting helps execution |
| Memory | Complete; summarized | Whether full context improves performance enough to justify longer prompts |
| Thinking tools | Enabled; disabled | Whether extra structured reasoning space helps models make decisions |
The key point is that AgentArch is not merely comparing models. It is comparing model-plus-architecture bundles. That is much closer to how enterprise systems are actually purchased, built, and regretted.
The success metric is harsh because production is harsh
AgentArch’s primary metric is the Acceptable Score. A run is acceptable only if three things are correct at the same time:
- The system chooses the required tools.
- The tool arguments are correct.
- The final business decision is correct.
This is a good metric because it refuses to flatter the model. In a real leave-request workflow, it is not enough to say, “Yes, Sarah’s leave should be approved.” The system also has to call the right approval tool, use the right employee ID, pass the right dates, and not accidentally perform a write operation it was never meant to perform. Enterprise automation is not graded by vibes. Usually.
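As a rough illustration of how strict that conjunction is, the sketch below scores one run against a hand-written ground truth. The `ToolCall` and `Run` types, the tool names, and the exact-sequence comparison are assumptions made for this example; they are not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Run:
    tool_calls: list
    final_decision: str


def score_run(run: Run, expected: Run) -> dict:
    """Check tools, arguments, and decision; acceptable only if all three hold."""
    tools_ok = [c.name for c in run.tool_calls] == [c.name for c in expected.tool_calls]
    # Arguments are only comparable once the tool sequence itself matches.
    args_ok = tools_ok and all(
        got.args == want.args
        for got, want in zip(run.tool_calls, expected.tool_calls)
    )
    decision_ok = run.final_decision == expected.final_decision
    return {
        "tools": tools_ok,
        "args": args_ok,
        "decision": decision_ok,
        "acceptable": tools_ok and args_ok and decision_ok,
    }


# Hypothetical ground truth for one leave request.
expected = Run(
    tool_calls=[
        ToolCall("get_leave_balance", {"employee_id": "E123"}),
        ToolCall("approve_leave", {"employee_id": "E123", "days": 3}),
    ],
    final_decision="approve",
)

# Right decision, right tools, wrong employee ID on the write call.
wrong_id = Run(
    tool_calls=[
        ToolCall("get_leave_balance", {"employee_id": "E123"}),
        ToolCall("approve_leave", {"employee_id": "E132", "days": 3}),
    ],
    final_decision="approve",
)

result = score_run(wrong_id, expected)
```

Here `result["decision"]` is true while `result["acceptable"]` is false, which is exactly the "almost right" case the metric is designed to reject.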
The paper also reports supporting metrics: strict and lenient tool correctness, final-decision accuracy, hallucination rate, repeated tool calls, missing required tools, and $pass^K$. These serve different purposes.
| Evidence type | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| Acceptable pass@1 | Main evidence | End-to-end workflow success under a specific configuration | That the architecture is production-ready |
| Correct final decision rate | Diagnostic metric | Whether the system reaches the right conclusion even when tool execution fails | That the workflow was safely completed |
| Hallucination rate | Failure-mode analysis | Whether agents invent tools, agents, or schema elements | Full safety or compliance behaviour |
| Tool repetition and missing-tool rates | Implementation diagnostics | Whether systems loop or skip required workflow steps | Root cause without deeper trace analysis |
| $pass^K$ | Reliability stress test | Whether repeated trials are consistently correct | Expected production reliability under different sampling settings |
This is the first major business lesson: measure the workflow as a workflow. A model that reaches the right conclusion but mangles a write tool is not “almost right.” It is a liability wearing a lab coat.
Comparison one: function calling beats ReAct where the work has to execute
The most consistent pattern in AgentArch is that function calling generally outperforms ReAct, especially in multi-agent settings.
That should make enterprise teams pause. ReAct has a strong intellectual appeal: reason, act, observe, repeat. It feels transparent. It looks auditable. It gives the model a little stage on which to explain itself before touching tools. In demos, this is delightful. In enterprise workflows, the paper suggests it can become a schema-hallucination machine with better narration.
The failure is especially visible in multi-agent ReAct. The authors report that no model achieves its best performance under multi-agent ReAct, and hallucinations concentrate in ReAct settings. Sonnet 4, for example, shows hallucination rates of around 36% in multi-agent ReAct configurations and 0% in every other configuration reported in the discussion.
The likely mechanism is not mysterious. ReAct asks the model to produce structured reasoning and action output in a prescribed format. Multi-agent orchestration adds another layer: the orchestrator may select agents, agents may select tools, and communication has to remain faithful to available schemas. That creates more opportunities to invent an unavailable tool, select a non-existent agent, or drift from the required JSON structure.
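One way to make that failure mode concrete is to check every emitted call against the registry of tools that actually exist. The sketch below is a minimal version under stated assumptions: the registry contents are hypothetical, and counting a run as hallucinated when it contains at least one unknown tool or argument is a simplification, not necessarily how the paper computes its rate.

```python
# Hypothetical registry: tool name -> set of argument names its schema declares.
REGISTRY = {
    "get_leave_balance": {"employee_id"},
    "approve_leave": {"employee_id", "start_date", "end_date"},
}


def find_hallucinations(calls, registry):
    """Return (tool_name, reason) pairs for calls that do not match the registry."""
    problems = []
    for name, args in calls:
        if name not in registry:
            problems.append((name, "unknown tool"))
            continue
        unknown_args = sorted(set(args) - registry[name])
        if unknown_args:
            problems.append((name, "unknown arguments: " + ", ".join(unknown_args)))
    return problems


def hallucination_rate(runs, registry):
    """Fraction of runs containing at least one hallucinated tool or argument."""
    flagged = sum(1 for calls in runs if find_hallucinations(calls, registry))
    return flagged / len(runs)
```

The same check works one level up for orchestrators: replace the tool registry with the set of agents the orchestrator is allowed to select.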
Function calling gives the model a narrower lane. That lane is not glamorous, but enterprise systems often prefer boring lanes. Boring lanes have fewer cliffs.
This does not mean ReAct is useless. LLaMA 3.3 70B achieved its best PTO score using single-agent ReAct, though the paper also reports that LLaMA performed poorly overall and near zero on customer routing. The better reading is narrower: ReAct may help some models in some single-agent settings, but AgentArch gives no support to the idea that multi-agent ReAct is a safe enterprise default.
Comparison two: multi-agent systems improve decisions, not necessarily execution
The paper’s most interesting result is not simply “single agent good” or “multi-agent good.” That would be too convenient, and therefore suspicious.
On end-to-end acceptable score, some models perform best with single-agent function calling. GPT-4.1 posts the top score on the simpler Time Off workflow, 70.8%, and Sonnet 4 posts the top Customer Routing acceptable score, 35.3%. Both results use single-agent function calling.
But final-decision accuracy tells a more nuanced story. On the more complex Customer Routing task, multi-agent function-calling systems often produce better final decisions even when their overall acceptable score is lower. The paper reports that GPT-4.1 reaches 97–99% correct final decision rates with multi-agent function calling on Customer Routing, compared with 79–86% in single-agent function-calling setups. Sonnet 4 shows a similar pattern: 84–87% final-decision accuracy in multi-agent function calling versus 72–76% in single-agent settings.
That distinction is operationally important.
A multi-agent system may be better at dividing judgement: one agent validates, another extracts intent, another checks duplicates, another decides escalation. The division of labour can improve the final answer. But the same system also increases coordination complexity. More handoffs mean more chances to miss a required tool, pass a bad argument, or lose context between agents.
So the business interpretation is not “use multi-agent systems for hard workflows.” It is more precise:
| Business priority | Better starting point | Why |
|---|---|---|
| Exact tool execution on a constrained workflow | Single-agent function calling | Fewer handoffs, simpler control flow, smaller coordination surface |
| Final decision quality in complex triage | Multi-agent function calling | Specialist roles can improve classification and routing judgement |
| Fully autonomous write-heavy workflows | Neither, without validators | Acceptable scores remain too low for unsupervised deployment |
| Human-in-the-loop decision support | Multi-agent may be attractive | Final-decision accuracy can matter more than exact autonomous execution |
This is where many agent programmes go wrong. They mistake organisational analogy for engineering evidence. A sales agent, a support agent, a compliance agent, and an orchestrator sound like a company. But software does not become reliable merely because its boxes resemble departments.
Comparison three: thinking tools help arithmetic, not judgement
AgentArch includes “thinking tools”: a math tool and a synthesis tool. These are not external calculators in the usual sense. The paper describes them as giving the model space to generate extra tokens in tool-call form; the argument is returned and added to memory. In other words, the tool creates structured scratchpad-like space inside the workflow.
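A minimal sketch of that idea, assuming a simple list-based memory: the tool performs no calculation at all, it just hands the model's own text back so that it lands in memory. The function names and the memory format here are illustrative, not taken from the paper.

```python
def math_tool(reasoning: str) -> str:
    """Thinking tool: does no computation; it only echoes the model's own text."""
    return reasoning


def run_thinking_tool(tool, argument: str, memory: list) -> str:
    """Call a thinking tool and append its returned argument to shared memory."""
    output = tool(argument)
    memory.append({"tool": tool.__name__, "output": output})
    return output


memory = []
run_thinking_tool(
    math_tool,
    "Request covers 3 working days; balance is 5 days; 5 - 3 = 2 days remain.",
    memory,
)
```

Any arithmetic in the argument is the model's own work. That is why a thinking tool can help a model organise a calculation but cannot guarantee the result.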
On the Time Off task, thinking tools help several non-reasoning models. GPT-4.1 improves from 48.5% to 70.8% in single-agent function calling with summarized memory when thinking tools are enabled. That is not a rounding error. It is the difference between “interesting prototype” and “maybe worth a controlled pilot, if you enjoy paperwork.”
The mechanism is plausible. PTO workflows require date calculations, leave-balance comparisons, and policy sequencing. The appendix specifically notes cases involving multiple months, leap years, invalid leave types, conflicts, and insufficient balances. A structured math or synthesis step helps the model slow down at exactly the point where administrative errors occur.
On Customer Routing, however, thinking tools have minimal impact across models. That task is less about arithmetic and more about classification, ambiguity, escalation, instruction following, and navigating a much larger tool space. Adding a “think harder” tool does not solve tool explosion. It also does not magically produce better business judgement. Tragic, but efficient.
For enterprise teams, this is a clean design rule: thinking tools should be attached to known cognitive bottlenecks, not sprinkled everywhere like paprika.
Use them when the workflow contains:
- date intervals;
- quantity comparisons;
- policy thresholds;
- multi-step tabulation;
- structured evidence synthesis before a decision.
Do not expect them to fix:
- vague user intent;
- poor tool descriptions;
- overloaded agent roles;
- missing business rules;
- weak escalation logic;
- noisy retrieval from enterprise systems.
A thinking tool is not governance. It is a place for the model to do its homework.
Comparison four: memory strategy matters less than people expect
AgentArch compares complete memory against summarized memory. Complete memory gives agents all prior tool calls, parameters, and responses. Summarized memory gives them final summaries from previous agents.
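The difference is easiest to see in what each strategy puts into the next prompt. The sketch below assumes a hypothetical `history` structure in which each prior step records its agent, its tool calls, and a final summary; the field names and tool names are invented for illustration.

```python
import json


def build_context(history, strategy):
    """Render prior steps as prompt context under one of two memory strategies."""
    if strategy not in ("complete", "summarized"):
        raise ValueError(f"unknown memory strategy: {strategy}")
    lines = []
    for step in history:
        if strategy == "complete":
            # Every tool call, with its parameters and response, verbatim.
            for call in step["tool_calls"]:
                lines.append(json.dumps(call, sort_keys=True))
        else:
            # Only the final summary each previous agent produced.
            lines.append(f"{step['agent']}: {step['summary']}")
    return "\n".join(lines)


# Hypothetical history from two earlier agents in a leave-request workflow.
history = [
    {
        "agent": "balance_agent",
        "tool_calls": [
            {
                "tool": "get_leave_balance",
                "args": {"employee_id": "E123"},
                "response": {"days_available": 5, "policy_version": "2025-01"},
            }
        ],
        "summary": "Employee E123 has 5 days of leave available.",
    },
    {
        "agent": "calendar_agent",
        "tool_calls": [
            {
                "tool": "count_working_days",
                "args": {"start": "2025-03-03", "end": "2025-03-05"},
                "response": {"working_days": 3},
            }
        ],
        "summary": "The request covers 3 working days.",
    },
]

complete_context = build_context(history, "complete")
summarized_context = build_context(history, "summarized")
```

The summarized context is shorter, but it carries only what each summary preserved. An identifier dropped by the summarizer is gone for every later step.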
The paper finds that memory strategy has relatively small effects compared with agent style and model choice. In GPT-4.1’s best single-agent PTO setup, performance is nearly identical between complete and summarized memory, with summarized memory slightly ahead in the reported best score. In orchestrated configurations, complete memory sometimes has a slight advantage, but not enough to justify a universal rule.
This matters because memory is one of the easiest places to waste money. Teams often assume more context is safer. Sometimes it is. Sometimes it just gives the model more irrelevant JSON to trip over while billing you for the privilege.
The better operating principle is:
| Memory choice | Use when | Avoid when |
|---|---|---|
| Summarized memory | Prior steps can be reliably compressed into state, decisions, and completed actions | The summarizer is untested or drops required identifiers |
| Complete memory | Later steps depend on exact tool outputs, arguments, or audit trails | Context is long, repetitive, or full of irrelevant metadata |
| External trace logging | Always | Never confuse audit storage with prompt context |
| Retrieval into memory | When specific prior facts are needed | When retrieval becomes a second unvalidated agent |
AgentArch does not prove summarized memory is always enough. It proves something more practical: complete memory is not automatically a performance upgrade. In an enterprise system, full logs belong in observability and audit infrastructure. Only the necessary state belongs in the model prompt.
The model ranking is less useful than the architecture sensitivity
The obvious reading of any benchmark is to rank models. AgentArch provides some of that: GPT-4.1 and Sonnet 4 are the strongest overall and the most consistent across architectures. GPT-4.1 reaches the best Time Off score, while Sonnet 4 reaches the best Customer Routing acceptable score. o3-mini is highly architecture-sensitive. GPT-4o and GPT-4.1-mini show mixed behaviour. LLaMA 3.3 70B struggles badly in this benchmark.
But the more useful reading is architectural sensitivity.
The paper reports coefficients of variation across configurations. GPT-4.1 and Sonnet 4 are comparatively robust on the simpler task, with lower variation. o3-mini is extremely sensitive: it reaches 56.7% on Time Off with single-agent function calling, but drops to 1.3% with orchestrated ReAct. GPT-4.1-mini can perform poorly under some configurations and strongly under others; its Time Off peak of 67.1% is close to Sonnet 4’s 68.5%, but only under the right setup.
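For teams running their own comparison, the coefficient of variation is simply the standard deviation of a model's scores across configurations divided by their mean. The sketch below uses the population standard deviation, which is an assumption (the paper's exact convention is not stated here), and the score lists are made-up placeholders rather than AgentArch results.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean of the scores."""
    avg = mean(scores)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(scores) / avg


# Made-up acceptable scores (%) for two hypothetical models across four configurations.
robust_model = [62.0, 65.0, 68.0, 70.0]
sensitive_model = [1.0, 12.0, 40.0, 57.0]

cv_robust = coefficient_of_variation(robust_model)
cv_sensitive = coefficient_of_variation(sensitive_model)
```

A higher value means the model's score depends more heavily on the architecture wrapped around it, which is the property that matters for procurement.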
That should change procurement behaviour. The question should not be, “Which model is best for agents?” That question is too blunt to be useful.
The better question is:
Which model, under which architecture, for which workflow, under which success metric?
That is less catchy. It is also less likely to waste a quarter.
The deployment gap is not subtle
The best Time Off acceptable score is 70.8%. The best Customer Routing acceptable score is 35.3%. The best $pass^K$ acceptable score across all models and configurations is 6.34%: with $K = 8$, even the best-scoring setup completed the workflow correctly in all eight repeated trials only 6.34% of the time.
That number should ruin at least one slide deck.
There is a common enterprise-AI storytelling habit: show a strong single example, imply general reliability, then hide the variance behind “human-in-the-loop” language. AgentArch attacks that habit directly. A system that succeeds on a single run may still be inconsistent across repeated attempts. For workflows involving approvals, case creation, customer communication, or escalation decisions, inconsistency is not a cosmetic issue. It is the issue.
The $pass^K$ result does not mean agents are useless. It means autonomous enterprise execution is still a reliability-engineering problem, not just a prompting problem.
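To see why this metric is so much harsher than pass@1, it helps to compute it. The sketch below assumes the combinatorial estimator commonly used for this metric: for each task with `c` successes in `n` trials, take C(c, k) / C(n, k), then average over tasks. Whether AgentArch computes it in exactly this way is an assumption, and the success counts are invented.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate the chance that k trials of a task all succeed, averaged over tasks."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    denominator = comb(n_trials, k)
    # comb(c, k) is 0 whenever a task has fewer than k successes.
    per_task = [comb(c, k) / denominator for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Invented example: four tasks, eight trials each, with these success counts.
successes = [8, 7, 8, 0]

single_run_rate = pass_hat_k(successes, n_trials=8, k=1)   # ordinary average success
all_eight_rate = pass_hat_k(successes, n_trials=8, k=8)    # every trial must succeed
```

With `k` equal to the number of trials, a task counts only if every one of its runs succeeded, so the task with seven successes out of eight contributes nothing. One intermittent failure is enough to erase it.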
A practical deployment plan should therefore separate three layers:
| Layer | What the paper directly tests | Cognaptus inference for deployment |
|---|---|---|
| Model reasoning | Final decision correctness | Use model choice and role decomposition to improve judgement |
| Tool execution | Required tools and exact arguments | Add schema validators, deterministic argument repair, dry-run modes, and write gates |
| Workflow reliability | Repeated success across trials | Use shadow evaluation, canary release, monitoring, and human approval for irreversible actions |
The paper tests model-and-agent architectures, not full production control systems. That boundary is important. A production system can wrap models with deterministic validators, policy engines, queue controls, retries, permissioning, and audit layers. Those wrappers may raise operational reliability substantially. But they do not erase the benchmark result. They explain why wrappers are necessary in the first place.
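As one example of such a wrapper, the sketch below puts a schema check and a write gate in front of tool execution. Everything in it is hypothetical: the tool names, the type-based schemas, and the `allow_writes` flag are illustrative choices, and a real deployment would more likely use a proper schema language and a permissioning system.

```python
# Hypothetical schemas: tool name -> {argument name: required Python type}.
SCHEMAS = {
    "get_leave_balance": {"employee_id": str},
    "approve_leave": {"employee_id": str, "days": int},
}
# Tools that change state and therefore need an explicit gate.
WRITE_TOOLS = {"approve_leave"}


def validate_args(args, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing argument: {key}" for key in schema if key not in args]
    errors += [f"unexpected argument: {key}" for key in args if key not in schema]
    errors += [
        f"{key} must be {expected.__name__}"
        for key, expected in schema.items()
        if key in args and not isinstance(args[key], expected)
    ]
    return errors


def guarded_call(name, args, execute, allow_writes=False):
    """Validate a proposed call, then execute it or return a dry-run record."""
    if name not in SCHEMAS:
        return {"status": "rejected", "errors": [f"unknown tool: {name}"]}
    errors = validate_args(args, SCHEMAS[name])
    if errors:
        return {"status": "rejected", "errors": errors}
    if name in WRITE_TOOLS and not allow_writes:
        # Report what would have happened without touching the system of record.
        return {"status": "dry_run", "would_call": name, "args": args}
    return {"status": "executed", "result": execute(name, args)}
```

The `execute` parameter is whatever function actually dispatches the call. Keeping it outside the guard means the same gate can wrap a single agent or every specialist in a multi-agent system.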
The appendix is mostly diagnostic, not a second thesis
AgentArch’s appendix gives useful implementation detail: how single-agent instructions are constructed, how edge cases are represented, what thinking-tool examples look like, how prompts are structured, and which supplemental metrics are reported.
The most important appendix point is that the benchmark’s ground truths are human-annotated and deterministic. For each use case, the authors define expected tool inputs, expected outcomes, and expected tool order. That makes the evaluation stricter than a loose “did it seem helpful?” judgement.
The additional metrics also help diagnose why a configuration failed:
- strict tool correctness shows whether the model followed the exact expected tool sequence;
- lenient tool correctness allows extra read-only tools but penalises harmful write operations;
- repetition rate catches looping behaviour;
- missing required tool rate shows skipped workflow steps;
- hallucination rate identifies invented tools, agents, or schema elements.
These are not merely academic extras. They map directly to an enterprise agent runbook. When an agent fails, “the model got confused” is not a diagnosis. It is a shrug with a GPU budget. The failure needs to be classified: wrong decision, wrong tool, wrong argument, missing tool, repeated tool, hallucinated schema, or bad handoff.
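A failure classifier along those lines can be small. The sketch below labels a single run by comparing it with its ground truth; the label names follow the list above, while the data shapes, the tool names, and the specific rules (for example, treating a tool as repeated when it is called more often than expected) are assumptions for illustration. Bad handoffs are left out because detecting them needs inter-agent traces, not just tool calls.

```python
from collections import Counter


def classify_failures(calls, expected_calls, decision, expected_decision, registry):
    """Return the set of failure labels that apply to one run."""
    labels = set()
    names = [name for name, _ in calls]
    expected_names = [name for name, _ in expected_calls]

    if decision != expected_decision:
        labels.add("wrong_decision")
    if any(name not in registry for name in names):
        labels.add("hallucinated_tool")
    if set(expected_names) - set(names):
        labels.add("missing_tool")

    counts, expected_counts = Counter(names), Counter(expected_names)
    if any(counts[name] > max(expected_counts[name], 1) for name in counts):
        labels.add("repeated_tool")

    for name, wanted_args in expected_calls:
        actual_args = [args for called, args in calls if called == name]
        # The tool was called, but never with the expected arguments.
        if actual_args and wanted_args not in actual_args:
            labels.add("wrong_argument")
    return labels


# Hypothetical customer-routing run: loops on a read, invents a tool, skips the write.
registry = {"get_case", "create_case"}
expected_calls = [("get_case", {"id": "C1"}), ("create_case", {"priority": "high"})]
calls = [("get_case", {"id": "C1"}), ("get_case", {"id": "C1"}), ("escalate_now", {})]

labels = classify_failures(calls, expected_calls, "route", "escalate", registry)
```

Aggregating these labels per configuration turns "the model got confused" into a count of which failure mode each architecture actually produces.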
AgentArch’s contribution is not just the benchmark table. It is the insistence that these failure modes should be measured separately.
A practical architecture rubric for enterprise teams
The useful output of AgentArch is not a universal architecture. It is a disciplined way to choose one.
Here is a conservative starting rubric.
| Workflow pattern | Default architecture to test first | Add only if evidence supports it | Main risk to monitor |
|---|---|---|---|
| Deterministic administrative process | Single-agent function calling with summarized memory | Thinking tools for calculations and policy synthesis | Wrong arguments or skipped write steps |
| Complex triage or escalation | Multi-agent function calling with clear specialist roles | Isolated orchestration before open agent networks | Better decisions but poorer execution |
| Large tool registry | Orchestrated specialists or routing before tool access | Complete memory for roles that need exact prior outputs | Tool selection errors and missing required tools |
| Arithmetic-heavy policy workflow | Function calling plus math/synthesis thinking tools | Deterministic calculator or rule engine for critical calculations | Model-generated arithmetic masquerading as certainty |
| Regulated write actions | Human approval, dry-run tools, validators | Autonomous writes only after repeated shadow success | False confidence from pass@1 alone |
| ReAct-based transparency | Single-agent, narrow-scope experiments | ReAct for one role, not the whole multi-agent stack | Hallucinated tools and schema drift |
The implied workflow for teams is straightforward:
- Define the business process as a deterministic evaluation set.
- Record expected tool sequence, argument values, and final decisions.
- Test at least three architectures: single-agent function calling, isolated multi-agent function calling, and the team’s preferred “fancy” architecture for emotional closure.
- Measure final decision accuracy separately from acceptable end-to-end execution.
- Add validators before adding more agents.
- Promote write tools gradually: read-only, dry-run, supervised write, then limited autonomous write.
- Treat any architecture claim as local until tested on the actual workflow.
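The steps above can be sketched as a small evaluation harness that reports decision accuracy and acceptable rate side by side. The case format and the stub architecture are hypothetical stand-ins; in practice `architecture` would be a function that runs a real agent system and returns its tool calls and final decision.

```python
def evaluate(architecture, cases):
    """Score an architecture on decision accuracy and end-to-end acceptability."""
    decision_hits = 0
    acceptable_hits = 0
    for case in cases:
        calls, decision = architecture(case["request"])
        decision_ok = decision == case["expected_decision"]
        execution_ok = calls == case["expected_calls"]
        decision_hits += decision_ok
        acceptable_hits += decision_ok and execution_ok
    total = len(cases)
    return {
        "decision_accuracy": decision_hits / total,
        "acceptable_rate": acceptable_hits / total,
    }


# Hypothetical deterministic evaluation set for a leave-request workflow.
cases = [
    {
        "request": "3 days off, balance 5",
        "expected_calls": [("approve_leave", {"days": 3})],
        "expected_decision": "approve",
    },
    {
        "request": "9 days off, balance 5",
        "expected_calls": [("reject_leave", {"reason": "insufficient_balance"})],
        "expected_decision": "reject",
    },
]


def stub_architecture(request):
    """Stand-in for a real agent run: always decides correctly, once with a bad argument."""
    if request.startswith("3"):
        return [("approve_leave", {"days": 3})], "approve"
    return [("reject_leave", {"reason": "policy"})], "reject"


report = evaluate(stub_architecture, cases)
```

In this toy run the stub gets every decision right but only half the cases are acceptable, the same gap between judgement and execution that the benchmark reports at scale. Running the same `cases` through each candidate architecture gives a like-for-like comparison.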
This is not glamorous architecture. It is evidence-based plumbing. Enterprise software has always been mostly plumbing. The agents were never going to save us from that.
What the paper does not settle
AgentArch is valuable because it is specific. Its limitations come from the same source.
The benchmark covers two text-only enterprise workflows with 60 samples each. That is enough to reveal meaningful architectural differences, but not enough to generalise across all industries, document-heavy processes, multimodal workflows, or conversational support settings.
The model set is also limited: of the six models, only one is open-source and only one is a reasoning model. o3-mini’s behaviour is interesting, but the paper cannot tell us whether all reasoning models behave similarly under these architectures.
The experiments use temperature zero. That helps reproducibility, but production systems may use different sampling settings, retries, or controlled stochasticity. The interaction between sampling and architecture remains open.
The benchmark also excludes important production economics: latency, token cost, orchestration overhead, infrastructure complexity, monitoring cost, and human review load. A configuration that scores slightly higher may still be worse commercially if it doubles latency and triples trace complexity. Yes, reality continues to be rude.
Finally, the Acceptable Score requires full correctness across tool choice, arguments, and final decision. That is appropriate for autonomous workflows, but some business processes can tolerate partial success if a human reviewer catches errors before execution. AgentArch should therefore guide deployment design, not replace risk analysis.
The real lesson: benchmark the org chart before hiring the robots
The fashionable way to build agent systems is to draw a team of specialised agents and assume the architecture has inherited the virtues of human organisations. AgentArch says: not so fast.
Single agents can execute constrained workflows more cleanly. Multi-agent systems can improve final decisions in complex routing while worsening tool hygiene. Function calling is usually safer than ReAct for execution. Thinking tools are useful when the bottleneck is calculation or structured synthesis, not when the bottleneck is ambiguity. Memory should be treated as an operating cost, not a moral good. And the same model can look impressive or hopeless depending on the architecture wrapped around it.
For business leaders, the message is simple but inconvenient: do not buy “agentic architecture” as a category. Test workflow by workflow. Measure decision quality and execution correctness separately. Benchmark the model and the org chart together.
The future of enterprise AI may indeed involve agents. But before giving them departments, managers, memories, and tools, it is worth checking whether the org chart works.
Cognaptus: Automate the Present, Incubate the Future.
[1] Tara Bogavelli, Hari Subramani, and Roshnee Sharma, “AgentArch: A Benchmark for Evaluating Agent Architectures in Enterprise Workflows,” ServiceNow, arXiv:2509.10769, https://arxiv.org/html/2509.10769.