Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

Enterprise AI teams love an architecture diagram. Boxes, arrows, specialist agents, memory stores, tool registries, a tasteful orchestrator sitting at the top like a middle manager with JSON access. It looks reassuring. It looks intentional. It also looks suspiciously like the kind of thing that can fail in six different places while still producing a beautifully formatted answer.

That is why ServiceNow’s AgentArch paper is useful.¹ It does not ask whether agents are exciting. We have had enough excitement; it is starting to look like a liability. Instead, it asks a more operational question: when you vary the architecture of an enterprise agent system, what actually changes?

The answer is not the tidy one vendors would prefer. More agents do not automatically help. ReAct-style reasoning does not reliably improve execution. Complete memory is not always worth its context cost. “Thinking tools” help in some arithmetic-heavy workflows and barely move the needle elsewhere. And the best architecture for one model on one workflow may be mediocre for another.

This is not a paper about the end of agents. It is more annoying than that. It is a paper about architecture being contingent, measurable, and easy to over-romanticise.

The benchmark tests org design, not just model IQ

AgentArch evaluates 18 agentic configurations across six LLMs on two mock enterprise workflows. The models are GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini, LLaMA 3.3 70B, and Claude Sonnet 4, with temperature set to zero for reproducibility.

The workflows are deliberately business-flavoured rather than puzzle-flavoured:

Workflow	What it tests	Scale
Requesting Time Off	PTO eligibility, date counting, leave balances, policy checks, approval or rejection	8 tools, 3 agents
Customer Request Routing	Customer-service triage, duplicate detection, case creation, classification, escalation, document verification, response generation	31 tools, 9 agents

That split matters. The PTO task is closer to a deterministic administrative process: count days, check balances, apply policy. The customer-routing task is messier: it requires classification, routing, escalation judgement, and preserving context across a larger tool surface.

The benchmark then varies four architectural choices:

Design dimension	Options tested	What this isolates
Orchestration	Single agent; orchestrator-led isolated agents; orchestrator-led open agent network	Whether specialisation and coordination beat one generalist
Agent style	Function calling; ReAct	Whether explicit thought/action formatting helps execution
Memory	Complete; summarized	Whether full context improves performance enough to justify longer prompts
Thinking tools	Enabled; disabled	Whether extra structured reasoning space helps models make decisions
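As a rough sketch of how a team might represent this design space in its own harness, the four dimensions can be written as a small configuration grid. The class and option names below are illustrative assumptions, not identifiers from the paper. Note that the naive cross-product of the table yields 24 combinations, more than the 18 configurations AgentArch evaluates, so the benchmark evidently does not run every combination; which ones it omits is not covered here.

```python
from dataclasses import dataclass
from itertools import product

# Option names are hypothetical labels for the four dimensions in the table above.
ORCHESTRATION = ("single_agent", "orchestrator_isolated", "orchestrator_open_network")
AGENT_STYLE = ("function_calling", "react")
MEMORY = ("complete", "summarized")
THINKING_TOOLS = (True, False)


@dataclass(frozen=True)
class AgentConfig:
    orchestration: str
    agent_style: str
    memory: str
    thinking_tools: bool


def full_grid() -> list:
    """Naive cross-product of every option; a real study would prune this."""
    return [
        AgentConfig(*combo)
        for combo in product(ORCHESTRATION, AGENT_STYLE, MEMORY, THINKING_TOOLS)
    ]


grid = full_grid()
print(len(grid))  # 24, i.e. 3 * 2 * 2 * 2
```

Making the grid explicit is mostly a discipline device: it forces the team to say which combinations it is skipping and why.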

The key point is that AgentArch is not merely comparing models. It is comparing model-plus-architecture bundles. That is much closer to how enterprise systems are actually purchased, built, and regretted.

The success metric is harsh because production is harsh

AgentArch’s primary metric is the Acceptable Score. A run is acceptable only if three things are correct at the same time:

The system chooses the required tools.
The tool arguments are correct.
The final business decision is correct.

This is a good metric because it refuses to flatter the model. In a real leave-request workflow, it is not enough to say, “Yes, Sarah’s leave should be approved.” The system also has to call the right approval tool, use the right employee ID, pass the right dates, and not accidentally perform a write operation it was never meant to perform. Enterprise automation is not graded by vibes. Usually.
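A minimal sketch of that all-or-nothing check might look like the following. It assumes a strict, ordered comparison of tool calls, which is a simplification of whatever matching rules the paper actually applies; the tool names, fields, and the `is_acceptable` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Run:
    tool_calls: list
    final_decision: str


def is_acceptable(actual: Run, expected: Run) -> bool:
    """True only if tools, arguments, and final decision all match."""
    tools_ok = [c.name for c in actual.tool_calls] == [c.name for c in expected.tool_calls]
    # Arguments are only compared once the tool names already line up.
    args_ok = tools_ok and all(
        a.arguments == e.arguments
        for a, e in zip(actual.tool_calls, expected.tool_calls)
    )
    decision_ok = actual.final_decision == expected.final_decision
    return tools_ok and args_ok and decision_ok


expected = Run(
    [ToolCall("check_leave_balance", {"employee_id": "E100"}),
     ToolCall("approve_leave", {"employee_id": "E100", "days": 3})],
    "approve",
)
# Right decision, right tools, wrong employee ID on the write call.
wrong_employee = Run(
    [ToolCall("check_leave_balance", {"employee_id": "E100"}),
     ToolCall("approve_leave", {"employee_id": "E101", "days": 3})],
    "approve",
)
print(is_acceptable(expected, expected))        # True
print(is_acceptable(wrong_employee, expected))  # False
```

The second run is the "almost right" case: the decision is correct, but the write lands on the wrong employee, so the run fails.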

The paper also reports supporting metrics: strict and lenient tool correctness, final-decision accuracy, hallucination rate, repeated tool calls, missing required tools, and $pass^K$. These serve different purposes.

Evidence type	Likely purpose in the paper	What it supports	What it does not prove
Acceptable pass@1	Main evidence	End-to-end workflow success under a specific configuration	That the architecture is production-ready
Correct final decision rate	Diagnostic metric	Whether the system reaches the right conclusion even when tool execution fails	That the workflow was safely completed
Hallucination rate	Failure-mode analysis	Whether agents invent tools, agents, or schema elements	Full safety or compliance behaviour
Tool repetition and missing-tool rates	Implementation diagnostics	Whether systems loop or skip required workflow steps	Root cause without deeper trace analysis
$pass^K$	Reliability stress test	Whether repeated trials are consistently correct	Expected production reliability under different sampling settings

This is the first major business lesson: measure the workflow as a workflow. A model that reaches the right conclusion but mangles a write tool is not “almost right.” It is a liability wearing a lab coat.

Comparison one: function calling beats ReAct where the work has to execute

The most consistent pattern in AgentArch is that function calling generally outperforms ReAct, especially in multi-agent settings.

That should make enterprise teams pause. ReAct has a strong intellectual appeal: reason, act, observe, repeat. It feels transparent. It looks auditable. It gives the model a little stage on which to explain itself before touching tools. In demos, this is delightful. In enterprise workflows, the paper suggests it can become a schema-hallucination machine with better narration.

The failure is especially visible in multi-agent ReAct. The authors report that no model performs best under multi-agent ReAct, and that hallucinations concentrate in ReAct settings. Sonnet 4, for example, shows hallucination rates around 36% in multi-agent ReAct configurations, while showing 0% hallucination in every other configuration reported in the discussion.

The likely mechanism is not mysterious. ReAct asks the model to produce structured reasoning and action output in a prescribed format. Multi-agent orchestration adds another layer: the orchestrator may select agents, agents may select tools, and communication has to remain faithful to available schemas. That creates more opportunities to invent an unavailable tool, select a non-existent agent, or drift from the required JSON structure.

Function calling gives the model a narrower lane. That lane is not glamorous, but enterprise systems often prefer boring lanes. Boring lanes have fewer cliffs.
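One way to keep the lane narrow regardless of agent style is to check every proposed call against the registry before anything executes. The sketch below is an assumption about how such a guard could look, not something the paper implements; the registry contents and the `validate_call` helper are hypothetical.

```python
# Hypothetical registry: tool name -> set of required argument names.
REGISTRY = {
    "check_leave_balance": {"employee_id"},
    "approve_leave": {"employee_id", "start_date", "end_date"},
}


def validate_call(name: str, arguments: dict, registry: dict = REGISTRY) -> list:
    """Return a list of problems; an empty list means the call matches the schema."""
    if name not in registry:
        return [f"unknown tool: {name}"]
    required = registry[name]
    provided = set(arguments)
    problems = []
    for missing in sorted(required - provided):
        problems.append(f"missing argument: {missing}")
    for extra in sorted(provided - required):
        problems.append(f"unexpected argument: {extra}")
    return problems


print(validate_call("auto_approve_everything", {}))
# ['unknown tool: auto_approve_everything']
print(validate_call("approve_leave", {"employee_id": "E100"}))
# ['missing argument: end_date', 'missing argument: start_date']
```

A guard like this does not make the model hallucinate less. It only ensures an invented tool becomes a logged rejection rather than an executed action.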

This does not mean ReAct is useless. LLaMA 3.3 70B achieved its best PTO score using single-agent ReAct, though the paper also reports that LLaMA performed poorly overall and near zero on customer routing. The better reading is narrower: ReAct may help some models in some single-agent settings, but AgentArch gives no support to the idea that multi-agent ReAct is a safe enterprise default.

Comparison two: multi-agent systems improve decisions, not necessarily execution

The paper’s most interesting result is not simply “single agent good” or “multi-agent good.” That would be too convenient, and therefore suspicious.

On end-to-end acceptable score, some models perform best with single-agent function calling. GPT-4.1 reaches the top PTO score: 70.8% on the simpler Time Off workflow. Sonnet 4 reaches the top Customer Routing acceptable score: 35.3%, also using single-agent function calling.

But final-decision accuracy tells a more nuanced story. On the more complex Customer Routing task, multi-agent function-calling systems often produce better final decisions even when their overall acceptable score is lower. The paper reports that GPT-4.1 reaches 97–99% correct final decision rates with multi-agent function calling on Customer Routing, compared with 79–86% in single-agent function-calling setups. Sonnet 4 shows a similar pattern: 84–87% final-decision accuracy in multi-agent function calling versus 72–76% in single-agent settings.

That distinction is operationally important.

A multi-agent system may be better at dividing judgement: one agent validates, another extracts intent, another checks duplicates, another decides escalation. The division of labour can improve the final answer. But the same system also increases coordination complexity. More handoffs mean more chances to miss a required tool, pass a bad argument, or lose context between agents.

So the business interpretation is not “use multi-agent systems for hard workflows.” It is more precise:

Business priority	Better starting point	Why
Exact tool execution on a constrained workflow	Single-agent function calling	Fewer handoffs, simpler control flow, smaller coordination surface
Final decision quality in complex triage	Multi-agent function calling	Specialist roles can improve classification and routing judgement
Fully autonomous write-heavy workflows	Neither, without validators	Acceptable scores remain too low for unsupervised deployment
Human-in-the-loop decision support	Multi-agent may be attractive	Final-decision accuracy can matter more than exact autonomous execution

This is where many agent programmes go wrong. They mistake organisational analogy for engineering evidence. A sales agent, a support agent, a compliance agent, and an orchestrator sound like a company. But software does not become reliable merely because its boxes resemble departments.

Comparison three: thinking tools help arithmetic, not judgement

AgentArch includes “thinking tools”: a math tool and a synthesis tool. These are not external calculators in the usual sense. The paper describes them as giving the model space to generate extra tokens in tool-call form; the argument is returned and added to memory. In other words, the tool creates structured scratchpad-like space inside the workflow.
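To make the idea concrete, here is a minimal sketch of a tool that does nothing except echo its argument and record it in memory. It follows the description above, but the wiring, the memory format, and the `make_thinking_tool` name are assumptions for illustration.

```python
def make_thinking_tool(memory: list, name: str):
    """Build a tool that performs no computation: it stores and returns its argument."""

    def tool(thought: str) -> str:
        # The model's own text is the payload; the tool just makes it part of memory.
        memory.append({"tool": name, "content": thought})
        return thought

    return tool


memory = []
math_tool = make_thinking_tool(memory, "math")
result = math_tool("Requested 5 working days; balance is 4; 5 > 4, so balance is insufficient.")
print(memory[0]["tool"])  # math
```

Nothing here calculates anything. Any benefit comes from the model writing its intermediate steps down before it commits to a write call.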

On the Time Off task, thinking tools help several non-reasoning models. GPT-4.1 improves from 48.5% to 70.8% in single-agent function calling with summarized memory when thinking tools are enabled. That is not a rounding error. It is the difference between “interesting prototype” and “maybe worth a controlled pilot, if you enjoy paperwork.”

The mechanism is plausible. PTO workflows require date calculations, leave-balance comparisons, and policy sequencing. The appendix specifically notes cases involving multiple months, leap years, invalid leave types, conflicts, and insufficient balances. A structured math or synthesis step helps the model slow down at exactly the point where administrative errors occur.
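For the calculation itself, a deterministic helper is the obvious complement to a scratchpad. The sketch below counts Monday-to-Friday days across a leap-year boundary using the standard library. The counting rule (inclusive range, weekends excluded, no public holidays) is an assumed policy, not one taken from the benchmark.

```python
from datetime import date, timedelta


def working_days(start: date, end: date) -> int:
    """Count Monday-Friday days from start to end, both inclusive."""
    if end < start:
        raise ValueError("end date is before start date")
    total = 0
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            total += 1
        day += timedelta(days=1)
    return total


def exceeds_balance(start: date, end: date, balance_days: int) -> bool:
    return working_days(start, end) > balance_days


# 2024 is a leap year, so 29 February exists and falls inside this range.
print(working_days(date(2024, 2, 26), date(2024, 3, 1)))  # 5
# The same calendar dates in 2023 start on a Sunday and skip 29 February.
print(working_days(date(2023, 2, 26), date(2023, 3, 1)))  # 3
```

The same dates give different answers in different years, which is exactly the kind of detail a model tends to get wrong when it does the counting in its head.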

On Customer Routing, however, thinking tools have minimal impact across models. That task is less about arithmetic and more about classification, ambiguity, escalation, instruction following, and navigating a much larger tool space. Adding a “think harder” tool does not solve tool explosion. It also does not magically produce better business judgement. Tragic, but efficient.

For enterprise teams, this is a clean design rule: thinking tools should be attached to known cognitive bottlenecks, not sprinkled everywhere like paprika.

Use them when the workflow contains:

date intervals;
quantity comparisons;
policy thresholds;
multi-step tabulation;
structured evidence synthesis before a decision.

Do not expect them to fix:

vague user intent;
poor tool descriptions;
overloaded agent roles;
missing business rules;
weak escalation logic;
noisy retrieval from enterprise systems.

A thinking tool is not governance. It is a place for the model to do its homework.

Comparison four: memory strategy matters less than people expect

AgentArch compares complete memory against summarized memory. Complete memory gives agents all prior tool calls, parameters, and responses. Summarized memory gives them final summaries from previous agents.
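The difference is easy to see once the two strategies are written as prompt builders. This is a simplified sketch: it assumes each prior step carries its own summary, and the record fields, agent names, and `build_context` function are hypothetical rather than the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    agent: str
    tool: str
    arguments: dict
    response: dict
    summary: str


def build_context(steps: list, strategy: str) -> str:
    """Render prior steps into prompt text under one of the two memory strategies."""
    if strategy == "complete":
        # Every call, with its parameters and raw response.
        lines = [f"{s.agent} called {s.tool}({s.arguments}) -> {s.response}" for s in steps]
    elif strategy == "summarized":
        # Only the summary each previous agent produced.
        lines = [f"{s.agent}: {s.summary}" for s in steps]
    else:
        raise ValueError(f"unknown memory strategy: {strategy}")
    return "\n".join(lines)


steps = [
    AgentStep(
        agent="hr_agent",
        tool="check_leave_balance",
        arguments={"employee_id": "E100"},
        response={"days_available": 4},
        summary="E100 has 4 days available",
    )
]
complete = build_context(steps, "complete")
summarized = build_context(steps, "summarized")
print(len(summarized) < len(complete))  # True
```

The summarized form is shorter, but it is only as good as the summary: if the summarizer drops the employee ID, the next agent never sees it.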

The paper finds that memory strategy has relatively small effects compared with agent style and model choice. In GPT-4.1’s best single-agent PTO setup, performance is nearly identical between complete and summarized memory, with summarized memory slightly ahead in the reported best score. In orchestrated configurations, complete memory sometimes has a slight advantage, but not enough to justify a universal rule.

This matters because memory is one of the easiest places to waste money. Teams often assume more context is safer. Sometimes it is. Sometimes it just gives the model more irrelevant JSON to trip over while billing you for the privilege.

The better operating principle is:

Memory choice	Use when	Avoid when
Summarized memory	Prior steps can be reliably compressed into state, decisions, and completed actions	The summarizer is untested or drops required identifiers
Complete memory	Later steps depend on exact tool outputs, arguments, or audit trails	Context is long, repetitive, or full of irrelevant metadata
External trace logging	Always	Never confuse audit storage with prompt context
Retrieval into memory	When specific prior facts are needed	When retrieval becomes a second unvalidated agent

AgentArch does not prove summarized memory is always enough. It proves something more practical: complete memory is not automatically a performance upgrade. In an enterprise system, full logs belong in observability and audit infrastructure. Only the necessary state belongs in the model prompt.

The model ranking is less useful than the architecture sensitivity

The obvious reading of any benchmark is to rank models. AgentArch provides some of that: GPT-4.1 and Sonnet 4 are the strongest overall and hold up best across architectures. GPT-4.1 reaches the best Time Off score, while Sonnet 4 reaches the best Customer Routing acceptable score. o3-mini is highly architecture-sensitive. GPT-4o and GPT-4.1-mini show mixed behaviour. LLaMA 3.3 70B struggles badly in this benchmark.

But the more useful reading is architectural sensitivity.

The paper reports coefficients of variation across configurations. GPT-4.1 and Sonnet 4 are comparatively robust on the simpler task, with lower variation. o3-mini is extremely sensitive: it reaches 56.7% on Time Off with single-agent function calling, but drops to 1.3% with orchestrated ReAct. GPT-4.1-mini can perform poorly under some configurations and strongly under others; its Time Off peak of 67.1% is close to Sonnet 4’s 68.5%, but only under the right setup.
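For teams that want the same sensitivity measure on their own results, the coefficient of variation is simply the standard deviation divided by the mean. The sketch below uses the population standard deviation, which is an assumption (the choice of estimator is not specified here), and the score lists are invented for illustration rather than taken from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores: list) -> float:
    """Standard deviation divided by the mean; higher means more architecture-sensitive."""
    avg = mean(scores)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return pstdev(scores) / avg


# Hypothetical acceptable scores (%) for two models across three configurations.
steady_model = [60.0, 65.0, 70.0]
sensitive_model = [5.0, 30.0, 55.0]

print(coefficient_of_variation(steady_model) < coefficient_of_variation(sensitive_model))  # True
```

A low value means the model tolerates architectural choices. A high value means the architecture is doing much of the work, for better or worse.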

That should change procurement behaviour. The question should not be, “Which model is best for agents?” That question is too blunt to be useful.

The better question is:

Which model, under which architecture, for which workflow, under which success metric?

That is less catchy. It is also less likely to waste a quarter.

The deployment gap is not subtle

The best Time Off acceptable score is 70.8%. The best Customer Routing acceptable score is 35.3%. The best $pass^K$ acceptable score across all models and configurations is 6.34%, meaning only a 6.34% chance of executing the workflow correctly across all eight trials.
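A small sketch shows why this metric is so much harsher than a single-run score. It assumes exactly k trials are recorded per sample and treats a sample as passing only if every trial is acceptable; the sample IDs and the `pass_hat_k` helper are hypothetical.

```python
def pass_hat_k(trials: dict) -> float:
    """Fraction of samples whose every recorded trial was acceptable."""
    if not trials:
        raise ValueError("no samples provided")
    passed = sum(1 for results in trials.values() if all(results))
    return passed / len(trials)


# Hypothetical results: eight trials per sample, True means an acceptable run.
trials = {
    "sample_a": [True] * 8,
    "sample_b": [True] * 7 + [False],  # one bad run out of eight fails the sample
}
print(pass_hat_k(trials))  # 0.5

# Intuition only: if runs were independent with per-run success rate p,
# the chance that all eight succeed would be p ** 8.
print(round(0.9 ** 8, 2))  # 0.43
```

The independence calculation is only an intuition pump, since repeated runs of the same system are not independent draws. The direction of the effect still holds: a per-run rate that sounds respectable collapses quickly once every run has to be right.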

That number should ruin at least one slide deck.

There is a common enterprise-AI storytelling habit: show a strong single example, imply general reliability, then hide the variance behind “human-in-the-loop” language. AgentArch attacks that habit directly. A system that succeeds on a single run may still be inconsistent across repeated attempts. For workflows involving approvals, case creation, customer communication, or escalation decisions, inconsistency is not a cosmetic issue. It is the issue.

The $pass^K$ result does not mean agents are useless. It means autonomous enterprise execution is still a reliability-engineering problem, not just a prompting problem.

A practical deployment plan should therefore separate three layers:

Layer	What the paper directly tests	Cognaptus inference for deployment
Model reasoning	Final decision correctness	Use model choice and role decomposition to improve judgement
Tool execution	Required tools and exact arguments	Add schema validators, deterministic argument repair, dry-run modes, and write gates
Workflow reliability	Repeated success across trials	Use shadow evaluation, canary release, monitoring, and human approval for irreversible actions

The paper tests model-and-agent architectures, not full production control systems. That boundary is important. A production system can wrap models with deterministic validators, policy engines, queue controls, retries, permissioning, and audit layers. Those wrappers may raise operational reliability substantially. But they do not erase the benchmark result. They explain why wrappers are necessary in the first place.

The appendix is mostly diagnostic, not a second thesis

AgentArch’s appendix gives useful implementation detail: how single-agent instructions are constructed, how edge cases are represented, what thinking-tool examples look like, how prompts are structured, and which supplemental metrics are reported.

The most important appendix point is that the benchmark’s ground truths are human annotated and deterministic. For each use case, the authors define expected tool inputs, expected outcomes, and expected tool order. That makes the evaluation stricter than a loose “did it seem helpful?” judgement.

The additional metrics also help diagnose why a configuration failed:

strict tool correctness shows whether the model followed the exact expected tool sequence;
lenient tool correctness allows extra read-only tools but penalises harmful write operations;
repetition rate catches looping behaviour;
missing required tool rate shows skipped workflow steps;
hallucination rate identifies invented tools, agents, or schema elements.

These are not merely academic extras. They map directly to an enterprise agent runbook. When an agent fails, “the model got confused” is not a diagnosis. It is a shrug with a GPU budget. The failure needs to be classified: wrong decision, wrong tool, wrong argument, missing tool, repeated tool, hallucinated schema, or bad handoff.
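A first-pass classifier for several of those categories can be quite small. The sketch below compares tool names only, so it covers hallucinated, wrong, repeated, and missing tools plus the final decision; argument checks and handoff analysis would need richer traces. The label names and the `classify_failures` function are illustrative assumptions, not the paper's definitions.

```python
from collections import Counter


def classify_failures(actual_tools, expected_tools, known_tools, decision_ok):
    """Return sorted failure labels for one run; an empty list means no failure found."""
    labels = set()
    actual_counts = Counter(actual_tools)
    expected_counts = Counter(expected_tools)

    for tool, count in actual_counts.items():
        if tool not in known_tools:
            labels.add("hallucinated_tool")   # not in the registry at all
        elif tool not in expected_counts:
            labels.add("wrong_tool")          # real tool, but not part of this workflow
        elif count > expected_counts[tool]:
            labels.add("repeated_tool")       # called more often than expected

    if any(tool not in actual_counts for tool in expected_counts):
        labels.add("missing_tool")
    if not decision_ok:
        labels.add("wrong_decision")
    return sorted(labels)


known = {"check_balance", "approve_leave", "reject_leave"}
print(classify_failures(
    actual_tools=["check_balance", "check_balance", "magic_approve"],
    expected_tools=["check_balance", "approve_leave"],
    known_tools=known,
    decision_ok=True,
))
# ['hallucinated_tool', 'missing_tool', 'repeated_tool']
```

Counting these labels across runs turns "the agent is flaky" into a ranked list of things to fix.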

AgentArch’s contribution is not just the benchmark table. It is the insistence that these failure modes should be measured separately.

A practical architecture rubric for enterprise teams

The useful output of AgentArch is not a universal architecture. It is a disciplined way to choose one.

Here is a conservative starting rubric.

Workflow pattern	Default architecture to test first	Add only if evidence supports it	Main risk to monitor
Deterministic administrative process	Single-agent function calling with summarized memory	Thinking tools for calculations and policy synthesis	Wrong arguments or skipped write steps
Complex triage or escalation	Multi-agent function calling with clear specialist roles	Isolated orchestration before open agent networks	Better decisions but poorer execution
Large tool registry	Orchestrated specialists or routing before tool access	Complete memory for roles that need exact prior outputs	Tool selection errors and missing required tools
Arithmetic-heavy policy workflow	Function calling plus math/synthesis thinking tools	Deterministic calculator or rule engine for critical calculations	Model-generated arithmetic masquerading as certainty
Regulated write actions	Human approval, dry-run tools, validators	Autonomous writes only after repeated shadow success	False confidence from pass@1 alone
ReAct-based transparency	Single-agent, narrow-scope experiments	ReAct for one role, not the whole multi-agent stack	Hallucinated tools and schema drift

The implied workflow for teams is straightforward:

Define the business process as a deterministic evaluation set.
Record expected tool sequence, argument values, and final decisions.
Test at least three architectures: single-agent function calling, isolated multi-agent function calling, and the team’s preferred “fancy” architecture for emotional closure.
Measure final decision accuracy separately from acceptable end-to-end execution.
Add validators before adding more agents.
Promote write tools gradually: read-only, dry-run, supervised write, then limited autonomous write.
Treat any architecture claim as local until tested on the actual workflow.
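The staged promotion of write tools in the steps above can be sketched as a small gate that sits between the agent and any tool with side effects. The stage names mirror the list; the return values and the `gate_write` function are assumptions about one possible design, not a prescribed implementation.

```python
from enum import IntEnum


class WriteStage(IntEnum):
    READ_ONLY = 0
    DRY_RUN = 1
    SUPERVISED_WRITE = 2
    LIMITED_AUTONOMOUS_WRITE = 3


def gate_write(stage: WriteStage, human_approved: bool = False) -> str:
    """Decide what happens to a proposed write call at the current promotion stage."""
    if stage == WriteStage.READ_ONLY:
        return "blocked"
    if stage == WriteStage.DRY_RUN:
        return "simulate"  # log what would have been written, change nothing
    if stage == WriteStage.SUPERVISED_WRITE:
        return "execute" if human_approved else "await_approval"
    # Scope limits for autonomous writes would be enforced elsewhere.
    return "execute"


print(gate_write(WriteStage.DRY_RUN))                                # simulate
print(gate_write(WriteStage.SUPERVISED_WRITE))                       # await_approval
print(gate_write(WriteStage.SUPERVISED_WRITE, human_approved=True))  # execute
```

Keeping the stage as configuration rather than prompt text means promotion is an explicit, reviewable change instead of something the model can talk its way past.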

This is not glamorous architecture. It is evidence-based plumbing. Enterprise software has always been mostly plumbing. The agents were never going to save us from that.

What the paper does not settle

AgentArch is valuable because it is specific. Its limitations come from the same source.

The benchmark covers two text-only enterprise workflows with 60 samples each. That is enough to reveal meaningful architectural differences, but not enough to generalise across all industries, document-heavy processes, multimodal workflows, or conversational support settings.

The model set is also limited: six models, only one of them open-source (LLaMA 3.3 70B) and only one a reasoning model (o3-mini). o3-mini’s behaviour is interesting, but the paper cannot tell us whether all reasoning models behave similarly under these architectures.

The experiments use temperature zero. That helps reproducibility, but production systems may use different sampling settings, retries, or controlled stochasticity. The interaction between sampling and architecture remains open.

The benchmark also excludes important production economics: latency, token cost, orchestration overhead, infrastructure complexity, monitoring cost, and human review load. A configuration that scores slightly higher may still be worse commercially if it doubles latency and triples trace complexity. Yes, reality continues to be rude.

Finally, the Acceptable Score requires full correctness across tool choice, arguments, and final decision. That is appropriate for autonomous workflows, but some business processes can tolerate partial success if a human reviewer catches errors before execution. AgentArch should therefore guide deployment design, not replace risk analysis.

The real lesson: benchmark the org chart before hiring the robots

The fashionable way to build agent systems is to draw a team of specialised agents and assume the architecture has inherited the virtues of human organisations. AgentArch says: not so fast.

Single agents can execute constrained workflows more cleanly. Multi-agent systems can improve final decisions in complex routing while worsening tool hygiene. Function calling is usually safer than ReAct for execution. Thinking tools are useful when the bottleneck is calculation or structured synthesis, not when the bottleneck is ambiguity. Memory should be treated as an operating cost, not a moral good. And the same model can look impressive or hopeless depending on the architecture wrapped around it.

For business leaders, the message is simple but inconvenient: do not buy “agentic architecture” as a category. Test workflow by workflow. Measure decision quality and execution correctness separately. Benchmark the model and the org chart together.

The future of enterprise AI may indeed involve agents. But before giving them departments, managers, memories, and tools, it is worth checking whether the org chart works.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tara Bogavelli, Hari Subramani, and Roshnee Sharma, “AgentArch: A Benchmark for Evaluating Agent Architectures in Enterprise Workflows,” ServiceNow, arXiv:2509.10769, https://arxiv.org/html/2509.10769.
