The most expensive sentence in agentic AI is “Let me think”
Every enterprise agent has a little theatre inside it.
A user asks for something routine: find a customer record, check a document, submit a form, update a profile, send a message. The agent pauses, reasons, chooses a tool, receives an observation, reasons again, chooses another tool, receives another observation, and continues until the task is finished or the budget is quietly set on fire.
This is useful when the task is genuinely uncertain. It is less charming when the agent is performing the same authentication flow, search routine, or web form interaction for the five-thousandth time. At that point, “reasoning” begins to look less like intelligence and more like a very expensive intern rereading the office manual.
The paper “Optimizing Agentic Workflows using Meta-tools” introduces Agent Workflow Optimization, or AWO, as a way to remove this kind of avoidable reasoning from agent systems [1]. Its central idea is simple, but not shallow: if many agent executions contain recurring tool-call sequences, those sequences can be detected from traces, merged into a graph representation, and compiled into deterministic composite tools called meta-tools.
A meta-tool is not a better prompt. It is not a motivational speech for the agent. It is a new tool that bundles a repeated multi-step behavior into one callable operation, so the LLM does not need to re-plan the obvious each time.
That framing matters because the common misconception is that agent efficiency is mainly a model-quality problem. Use a better model. Write a better system prompt. Add better tool descriptions. Those things help, but the paper points to a different layer: the workflow itself. Some agentic inefficiency is not caused by bad reasoning. It is caused by making the model reason at all.
AWO treats agent traces like execution profiles, not chat logs
The mechanism begins with traces.
An agent execution can be viewed as a sequence of tool calls. The LLM’s private reasoning is not the object of optimization here; the paper focuses on externally observable tool interactions. This is important. The authors are not trying to read the agent’s mind. They are watching what it does.
AWO turns those executions into a state graph. Each path through the graph represents one agent run. Each node represents the state reached after some history of tool calls. Each edge represents a transition from one state to another, typically caused by another LLM decision and tool invocation.
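That graph is then compressed in two conceptually different ways, horizontal merging and vertical extraction, which sit between graph construction and toolset augmentation in the overall pipeline: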
| Step | What it does | Why it matters |
|---|---|---|
| State graph construction | Converts many agent executions into a shared graph of tool-call histories | Makes repeated workflow structure visible |
| Horizontal merging | Collapses states that are semantically equivalent, using domain-aware rules | Reveals duplication hidden behind different IDs, orderings, or surface arguments |
| Vertical extraction | Finds frequent sub-paths whose replacement would reduce total edge weight | Identifies candidate meta-tools with real efficiency value |
| Toolset augmentation | Adds the resulting meta-tools to the agent’s available tools | Lets future agents skip repeated reasoning segments |
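The graph-construction step in the table can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each trace is a plain list of tool-call names (the names below are hypothetical), and it merges identical prefixes as it goes, whereas the paper starts from a disjoint graph and merges afterwards.

```python
from collections import Counter

def build_state_graph(traces):
    """Build a prefix-style state graph from tool-call traces.

    A state is the tuple of tool calls made so far. Each edge weight
    counts how many runs took that transition.
    """
    edges = Counter()
    for trace in traces:
        state = ()  # root state: no tool calls yet
        for call in trace:
            next_state = state + (call,)
            edges[(state, next_state)] += 1
            state = next_state
    return edges

# Hypothetical traces: three runs that share a login prefix.
traces = [
    ["get_credentials", "login", "search_songs"],
    ["get_credentials", "login", "list_playlists"],
    ["get_credentials", "login", "search_songs"],
]
graph = build_state_graph(traces)
for (src, dst), weight in graph.items():
    print(len(src), "->", dst[-1], "weight", weight)
```

Even in this toy case the heavy edges sit on the shared login prefix, which is the kind of structure the later steps look for.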
The paper’s strongest idea sits in the horizontal merging step.
Without merging, agent traces often look more diverse than they really are. Two executions may use different element IDs on a website, different user IDs in an API, or a different ordering of read-only calls. Superficially, the paths differ. Semantically, they may reach the same state.
For example, reading document A and then document B may be equivalent to reading document B and then document A, provided both are read-only operations and no side effects depend on order. Similarly, a user-specific identifier may need to be abstracted away so the same workflow over different users can be recognized as the same pattern.
This is where the paper avoids the usual “AI will discover everything automatically” fairy dust. The authors emphasize that horizontal merging is application-dependent. It requires knowledge of tool behavior, return values, side effects, and safe equivalence. In other words, the hard part is not merely finding repeated strings in logs. The hard part is knowing when two paths truly mean the same thing.
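A minimal sketch of two such merging rules, under stated assumptions: the tool names, the `user_<digits>` argument pattern, and the contents of `READ_ONLY` are all hypothetical, and the paper's actual rules are richer. The idea is to map each tool-call history to a canonical key so that histories differing only in user IDs or in the order of adjacent read-only calls collapse into one state.

```python
import re

# Assumption: these tools have no side effects, so their order is irrelevant.
READ_ONLY = {"read_document", "get_profile"}

def normalize_call(call):
    """Replace user-specific identifiers with a placeholder."""
    return re.sub(r"user_\d+", "<USER>", call)

def canonical_state(history):
    """Map a tool-call history to a canonical, hashable key.

    Runs of consecutive read-only calls are sorted. State-changing calls
    act as barriers, so reads are never reordered across them.
    """
    key, pending_reads = [], []
    for call in map(normalize_call, history):
        name = call.split("(", 1)[0]
        if name in READ_ONLY:
            pending_reads.append(call)
        else:
            key.extend(sorted(pending_reads))
            pending_reads = []
            key.append(call)
    key.extend(sorted(pending_reads))
    return tuple(key)

a = ["read_document(A)", "read_document(B)", "send_file(user_17)"]
b = ["read_document(B)", "read_document(A)", "send_file(user_42)"]
print(canonical_state(a) == canonical_state(b))
```

The code is the easy part. Deciding which tools belong in `READ_ONLY`, and which arguments are safe to abstract away, is the domain judgment the authors describe.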
That is also where the enterprise lesson begins.
Companies already collect traces from workflow systems, RPA tools, APIs, customer support agents, and internal copilots. Most of those traces are treated as observability residue. AWO treats them as optimization material. The question is not only “Did the agent complete the task?” but “Which parts of the task have become so routine that we should stop asking the LLM to rediscover them?”
Horizontal merging is where correctness lives
AWO’s horizontal merging is not just a compression trick. It is the safety gate.
The paper gives different merging logic for different environments. In VisualWebArena, where agents interact with simulated web applications, the authors normalize ephemeral interface details. A search box might have a specific element ID, but the relevant semantic action is “type query into search box.” A comment field may appear with a page-specific ID, but the recurring operation is “write comment and submit.” In AppWorld, where agents interact with simulated application APIs, the authors use rules such as read-only operation commutativity, regex-based normalization of user-specific arguments, domain-level commutativity, and recurrent-action merging.
This is not decorative implementation detail. It defines what AWO is allowed to compress.
A bad merge can create a bad meta-tool. And a bad meta-tool is worse than an inefficient agent because it gives the agent a fast path to the wrong state. Wonderful. We saved tokens and automated the mistake.
The paper’s approach is therefore closer to compiler optimization than to prompt tuning. Compilers can inline functions, fuse kernels, and remove redundant operations only when they preserve program meaning. AWO asks the analogous question for agentic workflows: can this repeated sequence of tool calls be replaced by a deterministic composite operation without changing the intended state?
That question is easy for authentication routines. It is harder for workflows with side effects, conditional dependencies, personal data, mutable state, or messy user interfaces.
A business reader should notice the implication: meta-tools are not just technical artifacts. They are governance artifacts. Each one encodes a claim about what is safe to automate.
Vertical extraction chooses the routines worth compiling
Once a merged state graph exists, AWO identifies sub-paths that are worth turning into meta-tools. The goal is not to create a bespoke meta-tool for every trace. That would be the agentic version of hoarding cables in a drawer: technically reusable, mostly clutter.
The authors use a threshold-based extraction procedure. Candidate edges in the merged graph are considered only when they appear frequently enough. The algorithm greedily extends candidate chains when doing so reduces the weighted edge count of the graph. In plain English: a meta-tool is attractive when it replaces a repeated sequence that many executions pass through.
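The greedy extension can be sketched as follows. This is a simplified reading, not the paper's exact algorithm: it assumes the merged graph is a dictionary from each state to its weighted successors, names states after the tool call that reaches them, and uses a hypothetical saving rule in which replacing a chain of k edges shared by w runs removes (k - 1) * w weighted edges.

```python
def extract_chain(edges, start, threshold):
    """Greedily grow one candidate meta-tool chain from `start`.

    edges: dict mapping state -> {next_state: number_of_runs}.
    Returns (chain_of_states, runs_covering_the_whole_chain).
    """
    chain, state, weight = [], start, None
    while state in edges:
        nxt, w = max(edges[state].items(), key=lambda kv: kv[1])
        if w < threshold:
            break  # transition is too rare to be worth compiling
        new_weight = w if weight is None else min(weight, w)
        k = len(chain)
        # Extend only if the longer chain saves more weighted edges
        # than the current one: k * new_weight vs (k - 1) * weight.
        if weight is not None and k * new_weight <= (k - 1) * weight:
            break
        chain.append(nxt)
        state, weight = nxt, new_weight
    return chain, weight

# Hypothetical merged graph: ten runs log in, then branch three ways.
edges = {
    "start": {"get_credentials": 10},
    "get_credentials": {"login": 10},
    "login": {"search": 4, "playlists": 3, "profile": 3},
}
chain, runs = extract_chain(edges, "start", threshold=3)
print(chain, runs)
```

In this example the chain stops at the login step. Adding `search` would cover only four runs and save 2 * 4 = 8 weighted edges, fewer than the 1 * 10 = 10 saved by the shorter chain that all ten runs share.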
This is the right economic instinct. Meta-tools have a maintenance cost. They must be implemented, validated, described, exposed to the agent, monitored, and updated when APIs or interfaces change. A meta-tool used once is not optimization. It is craftwork disguised as infrastructure.
The paper’s examples make the point concrete.
In VisualWebArena, the implemented meta-tools include recurring search and posting/review flows across Reddit, Classifieds, and Shopping tasks. These collapse repeated UI actions such as typing into a search box and clicking the search button, or scrolling to a review area, setting a rating, typing a title, writing a review, and submitting it.
In AppWorld, the resulting meta-tools are even more revealing: they are mostly auto-login routines, such as spotify_auto_login(), file_system_auto_login(), venmo_auto_login(), phone_auto_login(), and simple_note_auto_login(). Each replaces a repeated sequence where the agent retrieves account credentials and logs into the relevant service.
That sounds almost embarrassingly mundane. Good. Mundane is where operational cost hides.
The value of AWO is not that it discovers some profound new reasoning strategy. It finds the boring hot paths that agents keep walking and turns them into paved roads.
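To make the auto-login meta-tools concrete, here is a minimal sketch of what such a composite tool might look like. The signatures of `get_credentials` and `login` are assumptions for illustration, not AppWorld's actual API, and the stand-in functions only record that they were called.

```python
def make_auto_login(service, get_credentials, login):
    """Bundle 'fetch credentials, then log in' into one callable tool."""
    def auto_login():
        creds = get_credentials(service)
        return login(service, creds["username"], creds["password"])
    auto_login.__name__ = f"{service}_auto_login"
    return auto_login

# Stand-in tools (hypothetical); real ones would call the application API.
calls = []

def get_credentials(service):
    calls.append(("get_credentials", service))
    return {"username": "demo_user", "password": "demo_pass"}

def login(service, username, password):
    calls.append(("login", service))
    return f"{service}-session-for-{username}"

spotify_auto_login = make_auto_login("spotify", get_credentials, login)
session = spotify_auto_login()
print(session, calls)
```

From the agent's point of view, two tool decisions become one. The underlying calls still happen, but no LLM turn is spent choosing between them.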
The main evidence: fewer LLM calls, fewer tokens, sometimes higher success
The authors evaluate AWO on two benchmarks: VisualWebArena and AppWorld. VisualWebArena covers interactive web tasks across simulated sites such as Reddit, Classifieds, and Shopping. AppWorld covers API-driven tasks across simulated applications such as Gmail and Spotify.
The primary evidence is not a single heroic accuracy number. It is a pattern across cost, call count, workflow length, and success rate.
| Result area | What the paper reports | Likely purpose of the evidence | Business interpretation |
|---|---|---|---|
| LLM calls | Reduced by 5.6% to 10.2% on VisualWebArena and 7.2% to 11.9% on AppWorld, depending on model | Main efficiency evidence | Fewer paid reasoning turns and less latency exposure |
| Token usage | Reduced across benchmark/model settings, including 5.6% to 10.2% on VisualWebArena and 9.4% to 14.9% on AppWorld | Main cost evidence | Savings come from removing whole steps, not making each step cheaper |
| Total cost | Reduced by up to 10.2% on VisualWebArena and up to 15.0% on AppWorld | Main operational-cost evidence | Workflow compilation can affect cloud bills without changing the base model |
| Task success | Improves in several settings, including up to +4.2 percentage points | Quality check, not just efficiency evidence | Shorter trajectories can reduce opportunities for drift and compounding errors |
| Meta-tool usage | 98.2% in AppWorld for GPT 5.1 and Claude 4.5; 30.6% and 16.3% in VisualWebArena | Mechanism validation | Reuse depends heavily on task structure and tool environment |
The cost results are especially useful because they clarify what AWO is actually saving. Meta-tools do not magically make each LLM call cheaper. They remove calls. That means they also remove the output generation associated with those calls, and output tokens are typically priced higher per token than input tokens.
This distinction matters for engineering decisions. If your agent spends most of its budget on long context retrieval, AWO may not be the first lever. If your agent spends heavily on repeated planning-and-tool-selection turns, AWO is directly relevant.
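A back-of-the-envelope model shows why removing calls translates directly into cost. Every number below is hypothetical (call counts, token counts, and prices are not from the paper), and the model assumes each call costs the same, which ignores context growth over a run.

```python
def run_cost(llm_calls, input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one run.

    Token counts are per call; prices are dollars per million tokens.
    """
    per_call = input_tokens * price_in + output_tokens * price_out
    return llm_calls * per_call / 1_000_000

# Hypothetical run: 20 LLM calls, of which a meta-tool removes 2.
baseline = run_cost(20, 4_000, 300, price_in=1.0, price_out=10.0)
with_meta_tools = run_cost(18, 4_000, 300, price_in=1.0, price_out=10.0)
saving = 1 - with_meta_tools / baseline
print(f"baseline ${baseline:.3f}, with meta-tools ${with_meta_tools:.3f}, saving {saving:.0%}")
```

Under this simple model the saving equals the fraction of calls removed, regardless of token prices. In a real agent the effect depends on where in the run the removed calls sit and how much context they carry.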
The success-rate results should be read carefully. The paper reports that AWO generally helps task success, with the largest gain being +4.2 percentage points. But this is not a universal law of agent systems. The authors also note a slight decrease for Claude 4.5 on AppWorld and attribute it to randomness after manually checking the meta-tools. That is plausible, but it should still be treated as a boundary: deterministic shortcuts can reduce error surfaces, but they do not guarantee higher task success in every model-environment combination.
The stronger claim is narrower and more useful: when meta-tools correctly replace routine sub-tasks, they can reduce the number of opportunities for the LLM to wander, forget, or choose the wrong next action. Less thinking can mean fewer chances to be creatively wrong. A small mercy.
The benchmark contrast explains where AWO works best
AppWorld and VisualWebArena behave differently, and that difference is more informative than the headline numbers.
In AppWorld, meta-tool utilization reaches 98.2% for both GPT 5.1 and Claude 4.5. This is because many tasks share a common early-stage routine: authentication and session initialization. The workflow has a hot prefix. Once the login step is compiled into a meta-tool, nearly every task can use it.
In VisualWebArena, utilization is lower: 30.6% for GPT 5.1 and 16.3% for Claude 4.5. That does not mean AWO fails there. It means web tasks branch earlier. Search, posting, review, and navigation routines are repeated, but they are less universally shared across tasks than login flows in AppWorld.
That contrast gives us a practical diagnostic.
| Workflow pattern | AWO fit | Reason |
|---|---|---|
| Repeated login, authentication, or session setup | Strong | Shared prefix across many tasks |
| Repeated search-and-open routines | Strong to moderate | Common pattern, but query and environment variation matter |
| Form submission with stable fields | Strong | Deterministic composite action is often safe |
| Multi-API workflows with independent read operations | Moderate | Requires correct commutativity rules |
| Highly bespoke reasoning tasks | Weak | Few recurring tool paths to compile |
| Workflows with sensitive side effects | Conditional | Safe only with strict equivalence and validation |
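One rough way to apply this diagnostic to your own traces is to measure how many runs share the same opening sequence. The sketch below is an assumption-laden starting point rather than anything from the paper: traces are lists of hypothetical tool names, and only exact prefix matches are counted, so it will understate repetition that semantic merging would reveal.

```python
from collections import Counter

def hottest_prefix(traces, length):
    """Return the most common opening sequence of `length` tool calls
    and the share of all runs that start with it."""
    prefixes = Counter(tuple(t[:length]) for t in traces if len(t) >= length)
    if not prefixes:
        return None, 0.0
    prefix, count = prefixes.most_common(1)[0]
    return prefix, count / len(traces)

# Hypothetical traces: three API runs share a login prefix, one web run does not.
traces = [
    ["get_credentials", "login", "search_songs"],
    ["get_credentials", "login", "list_playlists"],
    ["get_credentials", "login", "play_song"],
    ["search_posts", "open_post", "write_comment"],
]
prefix, share = hottest_prefix(traces, length=2)
print(prefix, share)
```

A high share suggests an AppWorld-like hot prefix. A low share suggests a VisualWebArena-like workload, where candidate routines sit deeper in the graph and are shared by fewer tasks.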
This is where the paper is useful for enterprise AI planning. It does not say “add meta-tools everywhere.” It suggests looking for structural repetition in production traces.
An internal HR agent may repeatedly look up employees, fetch documents, check permissions, and send files. A customer-support agent may repeatedly search account records, retrieve recent tickets, check refund eligibility, and draft updates. A finance operations agent may repeatedly reconcile invoice IDs, vendor records, payment status, and approval flows. In each case, the agent should not need to narrate its way through the same plumbing forever.
The right question is not “Can an LLM do this?” The right question is “How many times should we pay the LLM to decide this?”
The appendices test robustness, not a second thesis
The appendices are not just extra furniture. They explain how much of AWO’s success depends on implementation choices.
The VisualWebArena appendix details the horizontal merging rules. These are mainly implementation details, but they support the mechanism by showing how UI-level noise is normalized. Search bars, comment fields, review boxes, rating stars, and submit buttons are mapped from page-specific element IDs into semantic actions.
The AppWorld appendix is more revealing because it shows cumulative graph compression. Starting from a disjoint graph with 2,428 nodes and 2,427 edges, the authors apply merging strategies one on top of another. With rules for read-like operations, regex normalization, and action-level merging in place, the graph shrinks to 249 nodes and 447 edges, roughly a 90% reduction in nodes and an 82% reduction in edges. This is not direct business value by itself, but it supports the claim that large parts of agent behavior can collapse once semantic equivalence is recognized.
The automated optimization loop in Appendix C is best read as an exploratory extension, not as the paper’s main evidence. The authors test whether an agent can discover merging rules by analyzing traces and applying primitives such as regex substitution, domain tagging, and semantic-type assignment. The loop reduces nodes substantially in one run, from 5,432 to 1,163 over 100 iterations on 168 AppWorld tasks. But the authors explicitly caution that evaluation is not yet mature, that the agent may generate thousands of rules, and that verifying rule quality can become harder than expert analysis.
That is an important warning. “Let an agent optimize the agent” sounds elegant until the optimizer quietly contaminates your equivalence rules. Then you have recursive automation with the governance properties of a haunted spreadsheet.
The additional GPT-OSS results are also important. The open-source model benefits in task completion but becomes less efficient on both benchmarks. The authors observe that it often forgets what it has done and repeatedly selects meta-tools because they appear useful. This is a useful negative result. Adding meta-tools does not automatically make every model behave efficiently. The model must use the new abstraction appropriately.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark results | Main evidence | AWO can reduce calls, tokens, cost, and sometimes improve success | Universal gains across all agent systems |
| VisualWebArena merging rules | Implementation detail supporting mechanism | UI-specific noise can be normalized into semantic actions | Fully automatic meta-tool discovery |
| AppWorld graph compression table | Mechanism and sensitivity evidence | Merging rules can reveal strong hidden repetition | Every compressed path is deployable as a safe meta-tool |
| Automated optimization loop | Exploratory extension | Agents may assist rule discovery | Agent-discovered rules are production-safe |
| GPT-OSS results | Robustness and boundary test | Model behavior affects meta-tool value | Meta-tools always improve efficiency |
This reading matters because without it, the paper can be oversold. The result is not “agents can optimize themselves now.” The result is more precise: trace-driven workflow compilation can reduce repeated LLM decision-making when safe equivalence rules and useful hot paths exist.
That is less flashy. It is also much more deployable.
The business value is workflow engineering, not prompt polishing
The practical implication is a shift from prompt engineering to workflow engineering.
Prompt engineering asks the model to behave better. Workflow engineering asks whether the model should be involved in a step at all.
For enterprise agents, that distinction changes the optimization roadmap.
A basic agent deployment usually begins with tool descriptions, system prompts, retry logic, and maybe better retrieval. A more mature deployment adds monitoring, evaluation, role-specific toolsets, and guardrails. AWO suggests a further stage: use accumulated traces to identify repeated paths, validate equivalence, and promote stable routines into deterministic capabilities.
That creates a new kind of agent maturity curve:
| Stage | Optimization focus | Typical artifact | Failure mode |
|---|---|---|---|
| Prompt-level | Improve reasoning instructions | System prompts, examples, tool descriptions | The agent still reasons through routine work |
| Tool-level | Expose better actions | Cleaner APIs, narrower tools, validation wrappers | Toolset becomes fragmented or too fine-grained |
| Trace-level | Compile repeated behavior | Meta-tools from historical executions | Unsafe merges or stale routines |
| Governance-level | Manage meta-tool lifecycle | Review, monitoring, versioning, rollback | Fast deterministic mistakes |
The final row is not optional. Meta-tools need lifecycle management. APIs change. Website layouts shift. Business policies evolve. A login routine that is safe today may become insufficient tomorrow if multi-factor authentication, permission checks, or audit requirements change.
For Cognaptus-style automation work, the business pathway is clear but bounded:
- Instrument agent executions so tool-call traces are available.
- Identify high-frequency workflow prefixes and repeated sub-routines.
- Separate read-only, idempotent, and state-changing operations.
- Define equivalence rules with domain experts.
- Convert only high-confidence repeated sequences into meta-tools.
- Measure call count, latency, cost, task success, and error type before and after deployment.
- Monitor meta-tool usage so agents do not overuse shortcuts in the wrong context.
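The last step can start very simply. The sketch below flags runs in which the same meta-tool is invoked more than once, the kind of over-selection the paper reports for GPT-OSS. Treating any repeat as suspicious is an assumption made for illustration, and the tool names are hypothetical. Some workflows legitimately call a routine twice.

```python
from collections import Counter

def repeated_meta_calls(trace, meta_tools):
    """Return meta-tools that were called more than once in a single run."""
    counts = Counter(call for call in trace if call in meta_tools)
    return {tool: n for tool, n in counts.items() if n > 1}

META_TOOLS = {"spotify_auto_login", "venmo_auto_login"}

suspicious = ["spotify_auto_login", "search_songs", "spotify_auto_login", "play_song"]
clean = ["spotify_auto_login", "search_songs", "play_song"]

print(repeated_meta_calls(suspicious, META_TOOLS))
print(repeated_meta_calls(clean, META_TOOLS))
```

Flagged runs are candidates for review, not proof of failure. The point is to make shortcut usage visible alongside call count, cost, and success rate.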
The ROI case is strongest when there is high repetition, high LLM-call cost, and stable tool semantics. It is weaker when tasks are highly heterogeneous, APIs are unstable, or correctness depends on subtle context that cannot be safely captured inside a deterministic composite tool.
The uncomfortable boundary: someone must know what “same state” means
AWO is powerful because it compresses redundancy. It is risky for the same reason.
The phrase “equivalent state” sounds technical, but in production it can become a policy decision. Are two customer records equivalent after different permission checks? Are two document retrieval paths equivalent if one uses cached data and the other queries the source system? Are two form submissions equivalent if one includes a compliance confirmation and one does not? Can read operations commute if one read changes what later tools return through logging, rate limits, or personalization?
These are not philosophical questions. They determine whether a meta-tool preserves meaning.
The paper is honest about this. Horizontal merging relies on domain knowledge, and the automated optimization loop remains preliminary. In AppWorld, many of the candidate sequences revealed by merging could not reasonably be converted into deployable meta-tools. That distinction matters: graph compression can show theoretical redundancy, but operationalizing it as a safe tool is a stricter requirement.
The GPT-OSS result adds another boundary. Even a well-defined meta-tool can be used poorly by a model that over-selects it or forgets prior actions. Tool availability changes the agent’s action space. It does not guarantee rational use.
So the operational message is not “compile everything.” It is “compile the boring parts after proving they are boring.”
Agents should improvise only where improvisation pays
The best way to read AWO is not as a replacement for agentic reasoning. It is a budget discipline for agentic reasoning.
LLM agents are useful because they can adapt. They can interpret ambiguous requests, choose among tools, recover from failures, and assemble workflows that were not explicitly hard-coded. But flexibility is expensive. If an agent spends that flexibility on authentication, search-box clicks, repeated API setup, or standard form submission, the system is wasting its most costly component on clerical repetition.
AWO shows that some of this waste can be discovered from traces and removed. The method is mechanism-first: build state graphs, merge semantically equivalent states, extract frequent sub-paths, and expose them as deterministic meta-tools. The evidence suggests meaningful reductions in LLM calls, token usage, and cost, with success-rate improvements in several settings.
The business lesson is not that agents should stop thinking. It is that agent systems need a better division of labor.
Let the LLM handle ambiguity, judgment, and composition. Let meta-tools handle the repeated plumbing. Let domain experts decide which paths are truly equivalent. And let monitoring tell you when yesterday’s shortcut has become tomorrow’s incident report.
Agentic AI does not become more enterprise-ready by thinking louder. Sometimes it becomes ready by learning when to shut up and call the right tool.
Cognaptus: Automate the Present, Incubate the Future.
[1] Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, and Martijn de Vos, “Optimizing Agentic Workflows using Meta-tools,” arXiv:2601.22037, 2026, https://arxiv.org/abs/2601.22037.