Waiting is the least glamorous part of an AI agent.
A user asks for a report, a code fix, a dataset analysis, or a literature scan. The agent thinks, calls a tool, waits, reads the result, thinks again, calls another tool, waits again, and repeats this little ritual until the final answer appears. From the outside, this looks like “reasoning.” From the system side, much of it is simply queueing around tools.
That distinction matters. If an agent feels slow, the default instinct is to blame the model: inference is too expensive, context is too long, decoding is too slow, the GPUs are not sufficiently worshipped. Sometimes that is true. But the paper behind today’s article, Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution, points to a different bottleneck: in practical agent workloads, the model often sits inside a serial LLM-tool loop where tool execution absorbs a large part of end-to-end time.[1]
The authors report that tool execution accounts for about 60% of latency in coding tasks, 50% in deep-research tasks, and 36% in scientific tasks in their measured workloads. That is not a rounding error. It is the shape of the system.
The proposed system, PASTE, does not try to make the LLM itself dramatically faster. It asks a more operational question: when the next tool call is predictable, can the system start that work before the LLM explicitly asks for it?
The answer is yes, but only if the system can solve two problems that are easy to underestimate. First, it must predict not only the next tool type, but also the arguments that tool will need. Second, it must do so without letting a wrong guess mutate the environment, waste scarce resources, or slow down the real agent path. In other words: speculation is useful only when it is cheap, bounded, and boringly controlled. The exciting version would break things faster.
PASTE’s contribution is therefore best understood as a mechanism, not a slogan. It builds a pattern abstraction for predicting tool calls, then places those predictions behind an opportunistic scheduler that uses idle resources without competing with authoritative execution. The paper’s headline numbers are strong: up to 48.5% average task-completion latency reduction, up to 55.2% average tool-latency reduction, and a 67% reduction in tool-wait time. But the more useful lesson is not “agents can be faster.” The useful lesson is where the speed comes from.
The real bottleneck is the serialized LLM-tool loop
A conventional LLM request is already expensive, but it is at least conceptually simple: send input to the model, receive output. Agent execution is different. The model repeatedly alternates between reasoning and tool use. A tool may be a web search, page fetch, file edit, shell command, Python interpreter, package installation, vector database query, or scientific workflow step.
The loop looks like this:
LLM thinks → tool runs → LLM reads result → next tool runs → LLM reads result → ...
This looks naturally sequential because each step often depends on the previous one. A search result determines which URL to fetch. A grep result determines which file to open. A code edit determines which test command to run. A failed package import determines which dependency to install. The agent cannot always know the future with certainty.
But “not always” is doing a lot of work.
The paper’s central observation is that agent behavior is not random at the tool level. The language request may be open-ended, but many tool sequences follow recurring application-level patterns. In coding tasks, an edit is often followed by a validation command. A search through code is often followed by opening or editing a discovered file. In research tasks, a search is commonly followed by fetching one or more returned URLs. These are not deep philosophical revelations. They are workflow habits. And workflow habits are schedulable.
The paper distinguishes this from older runtime optimization work. Cold-start mitigation helps when the overhead is environment startup. General LLM serving optimizations help when the bottleneck is model inference. Static workflow scheduling helps when the task graph is known in advance. Agent systems have a nastier shape: the graph is generated online by the LLM, and tool arguments are context-dependent.
So PASTE targets the middle layer: not the model weights, not the business task itself, but the runtime boundary where tools are called.
That boundary is where many business deployments quietly lose time.
PASTE predicts structure first, then binds arguments later
The mechanism begins with the paper’s core abstraction: the Pattern Tuple.
A simplified version is:
Pattern Tuple = (context, prediction, value-mapping function, probability)
This tuple matters because it separates two things that are often mixed together.
The first is control flow: what kind of tool call tends to follow what kind of prior event. The second is data flow: where the arguments for that future tool call will come from.
A naive system might try to predict a full future call directly: “next, fetch this exact URL.” That is brittle. URLs, file paths, search strings, and command arguments vary across tasks. Asking another LLM to hallucinate them in advance would be a fine way to produce confident garbage at industrial scale. PASTE instead treats the event structure and the argument derivation as separate objects.
Consider a deep research workflow:
Search succeeds → Web_fetch usually follows
The context is not the full text of the query. It is the event signature: a successful search. The prediction is the next tool type: Web_fetch. The value-mapping function says how to derive the argument: for example, use the URL field from the first item in the search result. The probability records how often this pattern held in validation traces.
The same idea applies in coding tasks:
Grep finds a file path → file editor or file reader often follows
File edit succeeds → test command often follows
Test fails → open relevant file or install dependency may follow
The key is late binding. PASTE does not need to know the concrete URL or file path before the upstream tool has produced it. It needs to know how that value is normally extracted once it exists.
This is why the mechanism is stronger than a simple next-tool predictor. A tool identity alone is insufficient. Knowing that the agent will probably call Web_fetch is operationally useless unless the system also knows what to fetch. PASTE’s value-mapping function turns a pattern into an executable prediction.
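To make the late-binding idea concrete, here is a minimal Python sketch of a pattern with a value-mapping function. Everything in it is an illustrative assumption rather than the paper's implementation: the `PatternTuple` class, the `search:ok` signature format, the shape of the search output, and the probability value are all made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class PatternTuple:
    """Hypothetical container for the four elements described above."""
    context: Tuple[str, ...]                    # recent event signatures
    prediction: str                             # likely next tool type
    bind_args: Callable[[Any], Optional[dict]]  # value-mapping function
    probability: float                          # frequency in past traces


def first_result_url(search_output: Any) -> Optional[dict]:
    """Derive fetch arguments from a search output, or None if not derivable."""
    try:
        return {"url": search_output["results"][0]["url"]}
    except (KeyError, IndexError, TypeError):
        return None


search_then_fetch = PatternTuple(
    context=("search:ok",),
    prediction="Web_fetch",
    bind_args=first_result_url,
    probability=0.8,  # made-up value, for illustration only
)


def speculate(pattern: PatternTuple, recent_events: Tuple[str, ...], last_output: Any):
    """Return (tool, args) once the upstream output exists, else None."""
    if recent_events[-len(pattern.context):] != pattern.context:
        return None  # the event context does not match this pattern
    args = pattern.bind_args(last_output)  # late binding happens here
    return None if args is None else (pattern.prediction, args)


output = {"results": [{"url": "https://example.com/a"}]}
candidate = speculate(search_then_fetch, ("search:ok",), output)
```

The pattern itself never stores a URL. It only stores the rule for extracting one, which is why the same pattern can be reused across unrelated queries.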
The Pattern Tuple is a small abstraction with a large operational consequence
The paper’s Pattern Tuple is useful because it is intentionally unromantic. It does not claim to understand the user’s full intention. It does not reconstruct the agent’s “plan” in natural language. It does not ask the LLM for a beautiful explanation of why the next step is inevitable.
It watches tool traces.
Pattern mining proceeds in two phases. First, PASTE strips away high-variance payloads and mines recurring event signatures. This identifies the “where”: after which tool-event context does another tool tend to appear? Second, it infers simple symbolic dependencies that explain the “how”: how can the predicted tool’s arguments be derived from earlier outputs?
The paper mentions transformations such as structured field lookup, index-based choice with fallback, and basic string formatting or normalization. These are not universal program synthesis. They are deliberately narrow rules for recurring agent data flows.
That limitation is also part of the strength. Business systems do not need the runtime to become a second agent with its own opinions. They need it to recognize boring patterns reliably:
| Pattern element | What PASTE tracks | Business interpretation |
|---|---|---|
| Context | Recent tool-event signatures, not full natural-language payloads | Reusable workflow shape across many user requests |
| Prediction | Likely next tool invocation type | A candidate operation to prepare early |
| Function | Symbolic rule for deriving arguments from prior outputs | A way to execute without asking the LLM again |
| Probability | Empirical confidence from historical traces | A scheduling signal, not a guarantee |
This is the right level of abstraction for infrastructure. It is not trying to replace agent reasoning. It is trying to notice when the agent’s next move is so conventional that waiting for the model to spell it out is unnecessary.
Quietly devastating, really. The agent may appear to be planning. Sometimes it is just following the same operational groove as yesterday.
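The two mining phases can be sketched in a few lines of Python. This is a toy version under stated assumptions: the trace format (`tool`, `status`, `args`, `output` keys), the `min_count` threshold, and the single candidate rule being tested are all hypothetical, and the paper's miner is more general than this.

```python
from collections import Counter, defaultdict


def signature(event):
    """Phase 1 helper: drop the high-variance payload, keep tool and status."""
    return f"{event['tool']}:{event['status']}"


def mine_transitions(traces, min_count=2):
    """Estimate how often each tool follows each event signature."""
    follow = defaultdict(Counter)
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            follow[signature(prev)][nxt["tool"]] += 1
    patterns = {}
    for ctx, counter in follow.items():
        total = sum(counter.values())
        for tool, n in counter.items():
            if n >= min_count:
                patterns[(ctx, tool)] = n / total
    return patterns


def first_url_rule_support(traces, ctx="search:ok", tool="web_fetch"):
    """Phase 2 helper: how often does 'fetch the first returned URL' explain the args?"""
    hits = total = 0
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            if signature(prev) == ctx and nxt["tool"] == tool:
                total += 1
                urls = prev.get("output", {}).get("urls", [])
                if urls and nxt.get("args", {}).get("url") == urls[0]:
                    hits += 1
    return hits / total if total else 0.0


def search(urls):
    return {"tool": "search", "status": "ok", "args": {}, "output": {"urls": urls}}


def fetch(url):
    return {"tool": "web_fetch", "status": "ok", "args": {"url": url}, "output": {}}


traces = [
    [search(["a", "b"]), fetch("a")],
    [search(["c"]), fetch("c")],
    [search(["d", "e"]), fetch("e")],  # the agent picked the second result
]
patterns = mine_transitions(traces)
support = first_url_rule_support(traces)
```

In this toy corpus the control-flow pattern holds every time, while the "first URL" rule explains only two of three fetches. Keeping those two numbers separate is the point of mining in two phases.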
Speculation is safe only when it is subordinate
Speculative execution has a bad reputation for good reasons. If the system guesses wrong, it can waste resources. If the operation has side effects, it can corrupt state. If speculation competes with real work, it can make latency worse. The cure becomes another bottleneck, which is a very systems-engineering way to embarrass oneself.
PASTE handles this with a scheduler that separates two categories of work:
- Authoritative invocations: tool calls actually issued by the agent. These are correctness-critical.
- Speculative invocations: tool calls predicted by PASTE. These are best-effort and may never be used.
The scheduler gives authoritative work strict priority. Speculative jobs run only on slack resources and within a speculation budget. If contention appears, speculative work is preempted. If an authoritative request arrives and matches an existing speculative job, the speculative job can be promoted: if already complete, its result is reused; if still running, it becomes the real job.
That promotion step is where speculation becomes useful. The system is not merely “doing extra work early.” It is creating a chance that the future authoritative call arrives to find its work already in progress or already completed.
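A minimal sketch of promotion, using Python's standard `concurrent.futures`, might look like the following. The `SpeculationTable` class and its matching key are assumptions made for illustration. The sketch omits preemption, budgets, and thread safety, all of which a real scheduler would need.

```python
import json
from concurrent.futures import ThreadPoolExecutor


class SpeculationTable:
    """Hypothetical registry that maps (tool, args) to in-flight speculative jobs."""

    def __init__(self, executor):
        self._executor = executor
        self._jobs = {}

    @staticmethod
    def key(tool_name, args):
        # Canonical key so that equal arguments match regardless of dict order.
        return tool_name + "|" + json.dumps(args, sort_keys=True)

    def speculate(self, tool_name, fn, args):
        k = self.key(tool_name, args)
        if k not in self._jobs:  # do not launch duplicate predictions
            self._jobs[k] = self._executor.submit(fn, **args)

    def call(self, tool_name, fn, args):
        """Authoritative call: promote a matching speculative job if one exists."""
        job = self._jobs.pop(self.key(tool_name, args), None)
        if job is not None:
            try:
                # Finished job: result is reused. Running job: we wait on it.
                return job.result(), True
            except Exception:
                pass  # a failed guess must not fail the real call
        return fn(**args), False


calls = []


def fetch(url):
    calls.append(url)
    return "body of " + url


with ThreadPoolExecutor(max_workers=1) as pool:
    table = SpeculationTable(pool)
    table.speculate("web_fetch", fetch, {"url": "https://example.com"})
    hit_result, hit_promoted = table.call("web_fetch", fetch, {"url": "https://example.com"})
    miss_result, miss_promoted = table.call("web_fetch", fetch, {"url": "https://example.org"})
```

The first authoritative call finds its work already submitted and executes nothing new. The second has no matching speculation and runs normally, which is the fallback path the agent would have taken anyway.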
The scheduler ranks speculative jobs by expected utility, combining the likelihood that the prediction will be consumed, the potential latency benefit, and the cost of executing it. The exact mathematical objective is less important for business readers than the operational rule: do not speculate because something is possible; speculate because the expected latency reduction justifies the resource and safety cost.
This is the paper’s most business-relevant engineering principle. In deployed agent systems, speculation should be treated as a budgeted optimization policy, not a personality trait.
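As a rough illustration of budgeted ranking, the sketch below scores each candidate by expected latency saved per unit of cost and admits candidates greedily until the budget runs out. The scoring formula, the `Candidate` fields, and every number are assumptions for the example; the paper's actual objective may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    tool: str
    probability: float      # chance the prediction is actually consumed
    latency_saved_s: float  # seconds hidden if the guess is right
    cost: float             # resource cost in arbitrary budget units

    @property
    def utility(self) -> float:
        # Assumed scoring rule: expected seconds saved per unit of cost.
        return self.probability * self.latency_saved_s / self.cost


def select(candidates, budget):
    """Greedily admit the highest-utility candidates that fit the budget."""
    chosen, remaining = [], budget
    for cand in sorted(candidates, key=lambda c: c.utility, reverse=True):
        if cand.cost <= remaining:
            chosen.append(cand)
            remaining -= cand.cost
    return chosen


candidates = [
    Candidate("web_fetch", probability=0.6, latency_saved_s=2.0, cost=1.0),
    Candidate("run_tests", probability=0.9, latency_saved_s=10.0, cost=6.0),
    Candidate("pip_install_dry_run", probability=0.3, latency_saved_s=8.0, cost=4.0),
]
admitted = [c.tool for c in select(candidates, budget=8.0)]
# admitted == ["run_tests", "web_fetch"]; the dry run no longer fits the budget
```

The candidate that is rejected here is not impossible, only poor value for the remaining budget. That is the operational rule in miniature.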
Side effects are policy problems, not model vibes
A research agent fetching a webpage is relatively harmless. A coding agent installing packages, editing files, running shell commands, or touching external services is not. A wrong speculative command can change the environment before the real agent has decided to do so.
PASTE does not pretend that the runtime can magically infer safety. Instead, it uses a speculation eligibility policy. Operators specify which tools can be speculated, what level of speculation is allowed, and how duplicate predictions should be resolved.
The paper’s example policy allows full speculation for a web_search tool but restricts a pip_install operation to dry-run mode. That distinction is important. Some predicted calls can be executed end-to-end. Others should only trigger shallow preparation, such as environment warm-up, container initialization, dependency loading, or execution against staging resources.
This gives us a practical deployment framework:
| Tool type | Safe speculative action | Riskier action to avoid or transform |
|---|---|---|
| Web search | Run search early, cache result | None, unless rate limits or privacy apply |
| URL fetch | Fetch and cache page content | Fetch authenticated or sensitive URLs without policy |
| File read | Open likely files in sandbox or cache | Reading restricted paths without permission scope |
| Code test | Prepare runtime, maybe run in isolated workspace | Mutating shared state or external services |
| Package install | Dry run, dependency resolution, staging environment | Installing into the live execution environment |
| External API call | Usually no full speculation unless explicitly idempotent | Committing transactions, sending messages, writing records |
The important phrase is “unless explicitly.” PASTE’s paper states that the system does not infer side-effect freedom automatically. This is not a weakness; it is a necessary restraint. In business automation, policy should carry the responsibility for what may be pre-executed. The model should not be trusted to decide whether a CRM update, payment action, or compliance filing is “probably fine.”
Probably fine is not a control framework.
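An eligibility policy of this kind can be expressed as plain data plus a lookup. The sketch below keeps the paper's two examples (full speculation for `web_search`, dry-run only for `pip_install`). The other entries, the `Level` names, and the default-deny behavior for unlisted tools are assumptions added for illustration.

```python
from enum import Enum


class Level(Enum):
    NONE = "none"        # never speculate
    DRY_RUN = "dry_run"  # shallow preparation only
    FULL = "full"        # may execute end-to-end


POLICY = {
    "web_search": Level.FULL,       # example from the paper
    "pip_install": Level.DRY_RUN,   # example from the paper
    "web_fetch": Level.FULL,        # hypothetical entry
    "send_email": Level.NONE,       # hypothetical entry
}


def allowed_level(tool_name, policy=POLICY):
    """Unlisted tools are never speculated (assumed default-deny rule)."""
    return policy.get(tool_name, Level.NONE)


def plan_speculation(tool_name, args):
    """Turn a predicted call into an allowed action, or None if forbidden."""
    level = allowed_level(tool_name)
    if level is Level.FULL:
        return {"tool": tool_name, "args": args, "mode": "execute"}
    if level is Level.DRY_RUN:
        return {"tool": tool_name, "args": args, "mode": "dry_run"}
    return None


install_plan = plan_speculation("pip_install", {"package": "requests"})
unknown_plan = plan_speculation("update_crm_record", {"id": 42})
```

The operator writes the table; the runtime only reads it. A tool nobody classified gets no speculation at all, which is the conservative reading of "policy should carry the responsibility."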
The evidence says PASTE wins by overlap, not by magic
The evaluation pairs three agents with three benchmark families. The agents are VirtualLab for scientific workflows, Qwen Deep Research for open-ended research, and gemini-cli for coding and general-purpose tasks. The benchmarks include DeepResearchBench, SWE-bench, and ScholarQA. The baselines are ORION, a serverless DAG execution system, and SpecFaaS, a speculative serverless execution system.
The experimental setup is substantial: four nodes, each with 96 AMD EPYC vCPUs, 512 GB memory, and eight NVIDIA A100 80GB GPUs. The paper uses both proprietary LLM APIs and a locally hosted Qwen-DeepResearch-30B model, and mines patterns on historical tasks while evaluating on disjoint new tasks.
The strongest headline results are:
| Result area | Reported result | Likely purpose of test | What it supports | What it does not prove |
|---|---|---|---|---|
| End-to-end latency | Up to 48.5% average reduction; p95/p99 reductions up to 48.6%/61.9% | Main evidence | PASTE improves user-visible completion time in tested agent workloads | That every agent workload will see similar gains |
| Tool latency | Up to 55.2% average reduction; p95/p99 reductions up to 59.3%/60.6% | Main evidence | Speculative execution shortens effective tool critical path | That underlying tool execution itself became faster |
| Tool-wait time | 67% reduction | Mechanism evidence | Speedup comes from hiding stalls through overlap | That the agent reasons better or uses fewer tools |
| Scalability | At each concurrency level, PASTE sustains at least 1.76×/2.05× higher speedup versus ORION/SpecFaaS | Robustness/sensitivity test | Opportunistic scheduling remains useful under concurrent sessions | That resource budgets can be ignored |
| Prediction quality | Up to 27.8% Top-1 accuracy, 43.9% Top-3 recall, 93.8% overall hit rate | Diagnostic / mechanism evidence | Useful speculation can work even when the top prediction is imperfect | That the predictor always knows the next action precisely |
| Side effects | 602 potentially side-effecting speculative actions detected among over 20,000 speculative actions; no divergent final result | Safety evaluation | Policy and sandboxing can contain unsafe guesses in the tested setup | That all side effects in arbitrary enterprise tools are solved |
| Resource overhead | Per second of latency reduction: 0.02 core-seconds CPU, 2.6 MB memory, 0.9 MB bandwidth; less than 100 ms scheduling/prediction overhead | Overhead / practicality test | Sidecar deployment can be lightweight under tested settings | That overhead is negligible under every tool mix |
The most interesting evidence is not the average latency number. Average latency numbers are good for headlines and bad for understanding. The mechanism evidence is more valuable: PASTE reduces waiting by overlapping speculative tool work with LLM generation. The agent is still in a serial loop logically, but the runtime starts to pipeline parts of that loop whenever patterns make the future sufficiently predictable.
That distinction prevents a common misreading. PASTE does not remove dependencies. It exploits cases where the dependency is already partially resolved. Once a search result exists, the next likely fetch target may be derivable. Once a file edit succeeds, the validation command is often predictable. Once a tool failure occurs, the fallback path may be patterned.
The system is not clairvoyant. It is opportunistic.
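A back-of-envelope model shows where the saving comes from. In the sketch below each step is an LLM phase followed by a tool phase, and a correctly predicted tool is assumed to start at the beginning of the LLM phase that will eventually request it. The durations and hit flags are invented, and the model ignores scheduling overhead and resource contention.

```python
def serial_latency(steps):
    """Every tool starts only after the LLM has finished asking for it."""
    return sum(llm + tool for llm, tool, _hit in steps)


def overlapped_latency(steps):
    """A correctly speculated tool runs while the LLM is still generating."""
    total = 0.0
    for llm, tool, hit in steps:
        # On a hit, only the part of the tool that outlasts the LLM is waited for.
        wait = max(0.0, tool - llm) if hit else tool
        total += llm + wait
    return total


# (llm_seconds, tool_seconds, speculation_hit), all hypothetical
steps = [
    (2.0, 3.0, True),   # tool outlasts the LLM by 1 s
    (1.0, 4.0, False),  # wrong guess: full tool wait remains
    (3.0, 2.0, True),   # tool finishes before the LLM does
]
baseline = serial_latency(steps)       # 15.0
pipelined = overlapped_latency(steps)  # 11.0
```

No tool got faster and no dependency was removed. The four seconds saved are tool time that now overlaps with generation instead of following it.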
Low Top-1 accuracy is not fatal if the scheduler can afford candidates
One detail deserves special attention: the predictor’s Top-1 accuracy reaches only up to 27.8%, while the overall hit rate reaches up to 93.8%.
At first glance, that may look contradictory. It is not.
Top-1 accuracy asks whether the single most probable predicted tool is exactly the next tool. Overall hit rate asks whether any speculatively executed prediction matches the actual next tool. Since PASTE can execute multiple candidates when spare resources are available, it can still achieve useful overlap even when the top guess is wrong.
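The gap between the two metrics is easy to reproduce on toy data. The sketch assumes the runtime executes the top `k` ranked candidates at every step, which is a simplification of the paper's budget-dependent behavior, and the tool names and rankings are invented.

```python
def top1_accuracy(predictions, actual):
    """Share of steps where the single best guess was the real next tool."""
    hits = sum(1 for ranked, tool in zip(predictions, actual)
               if ranked and ranked[0] == tool)
    return hits / len(actual)


def hit_rate(predictions, actual, k):
    """Share of steps where any of the k executed guesses was the real next tool."""
    hits = sum(1 for ranked, tool in zip(predictions, actual)
               if tool in ranked[:k])
    return hits / len(actual)


# Ranked candidates per step (hypothetical) and the tool the agent really called.
predictions = [
    ["web_fetch", "search", "read_file"],
    ["search", "web_fetch", "read_file"],
    ["read_file", "edit_file", "run_tests"],
    ["run_tests", "edit_file", "read_file"],
]
actual = ["web_fetch", "web_fetch", "run_tests", "grep"]

top1 = top1_accuracy(predictions, actual)  # 0.25
hit3 = hit_rate(predictions, actual, k=3)  # 0.75
```

The same predictor scores 25% or 75% depending on how many candidates the scheduler can afford to run.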
For business readers, this is a helpful mental model. Prediction quality is not a binary “right or wrong” property. Its value depends on the cost of being wrong and the budget for parallel guesses.
A cheap, safe, high-latency operation can be worth speculating even with moderate confidence. An expensive, state-changing, low-latency operation may not be worth speculating even with higher confidence. The scheduler’s job is to turn prediction confidence into operational decisions.
That is also why the paper’s explicit resource budget matters. Without it, the path from “we can predict some tools” to “let us run everything early” would be short, tempting, and professionally irresponsible.
What this means for companies building agent systems
The practical takeaway is not that every company should immediately implement PASTE as described. The paper is an academic system prototype, not a managed cloud product. The more useful takeaway is that agent latency should be measured at the tool-loop level.
For a company deploying research, coding, analytics, customer-support, or workflow agents, the implementation pathway looks like this:
- Instrument the agent runtime. Log tool calls, timestamps, arguments, outputs, status, and session boundaries. Without traces, there is no pattern mining. There is only narrative optimism, the cheapest benchmark.
- Separate model time from tool time. Measure LLM generation, active tool execution, tool stall, initialization, retries, and queueing. If tools are not a material share of latency, PASTE-like speculation is not the first investment.
- Mine recurring tool sequences. Look for edit-verify loops, search-fetch funnels, read-transform-write chains, retry patterns, and fallback paths.
- Identify argument derivation rules. The strongest candidates are values that are copied or lightly transformed from prior outputs: URLs, file paths, IDs, table names, error messages, dependency names, and search-result fields.
- Classify side effects by policy. Decide which tools are read-only, idempotent, dry-run capable, sandboxable, or never speculatable.
- Start with shallow speculation. Warm runtimes, prefetch read-only resources, resolve dependencies, cache search/fetch results, and stage likely file reads before moving toward full execution.
- Validate on held-out traces. Do not tune speculation on the same logs used to discover patterns. That is not optimization; it is nostalgia with statistics.
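The second step in the list above, separating model time from tool time, can start as a very small script over logged spans. The span format and the `kind` labels below are assumptions, and the sketch presumes non-overlapping spans from a single session.

```python
def latency_shares(spans):
    """Return each span kind's share of total recorded time."""
    totals = {}
    for span in spans:
        duration = span["end"] - span["start"]
        totals[span["kind"]] = totals.get(span["kind"], 0.0) + duration
    overall = sum(totals.values())
    return {kind: total / overall for kind, total in totals.items()}


# Hypothetical spans for one session, times in seconds.
spans = [
    {"kind": "llm", "start": 0.0, "end": 2.0},
    {"kind": "tool_queue", "start": 2.0, "end": 2.5},
    {"kind": "tool_exec", "start": 2.5, "end": 5.5},
    {"kind": "llm", "start": 5.5, "end": 7.5},
    {"kind": "tool_exec", "start": 7.5, "end": 10.0},
]
shares = latency_shares(spans)
```

If the tool-side shares come out small on real traces, that is a signal to optimize inference first and leave speculation for later.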
The ROI logic is straightforward but conditional. PASTE-like optimization matters most when agents are tool-heavy, tool calls are slow enough to dominate the critical path, workflows repeat across sessions, and unused compute or I/O capacity exists during model generation. It matters less when tasks are short, mostly model-bound, highly novel, or dominated by unsafe external actions.
A practical architecture for agent-runtime speculation
For a business implementation, the architecture does not need to begin as a full research-grade system. A staged version could look like this:
Agent runtime
↓
Tool proxy / middleware
↓
Event logger ──→ Pattern miner ──→ Pattern pool
↓ ↓
Policy engine ←── Prediction engine
↓
Speculative scheduler
↓
Tool backend + cache + sandbox
The tool proxy is the natural insertion point. It observes every tool request without forcing the agent designer to rewrite all task logic. The event logger builds the trace corpus. The pattern miner identifies recurring event signatures and argument mappings. The policy engine decides whether a predicted call may run fully, partially, in dry-run mode, in staging, or not at all. The scheduler uses spare resources and promotes successful speculation when the real agent call arrives.
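A tool proxy can begin as a thin wrapper that forwards calls and records one event per invocation. The sketch below is a hypothetical starting point: the `ToolProxy` class, the event fields, and the in-memory event list are illustrative assumptions, and a production version would write to durable storage and redact sensitive arguments.

```python
import time


class ToolProxy:
    """Hypothetical middleware: forwards tool calls and logs an event for each."""

    def __init__(self, tools, clock=time.monotonic):
        self._tools = tools  # mapping of tool name to callable
        self._clock = clock
        self.events = []     # in-memory trace corpus for this sketch

    def call(self, session_id, tool_name, args):
        start = self._clock()
        status, output = "ok", None
        try:
            output = self._tools[tool_name](**args)
            return output
        except Exception:
            status = "error"
            raise  # the agent still sees the original failure
        finally:
            # Runs on both success and failure, so every call leaves a trace.
            self.events.append({
                "session": session_id, "tool": tool_name, "args": args,
                "status": status, "start": start, "end": self._clock(),
                "output": output,
            })


def failing_tool():
    raise RuntimeError("backend unavailable")


proxy = ToolProxy({
    "search": lambda query: {"urls": ["https://example.com"]},
    "flaky": failing_tool,
})
found = proxy.call("session-1", "search", {"query": "speculative execution"})
try:
    proxy.call("session-1", "flaky", {})
except RuntimeError:
    pass
```

Because the agent only ever talks to `call`, the logger, the pattern miner, and later the scheduler can all be attached here without touching task logic.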
This architecture also has a governance benefit. Speculation decisions become inspectable. Instead of letting agent prompts quietly evolve into operational behavior, the system records which patterns triggered which speculative actions under which policies. That audit trail will matter for enterprise deployments, especially when agents touch codebases, databases, customer records, or regulated workflows.
Where PASTE should not be overread
The paper is persuasive, but the boundary conditions are real.
First, PASTE depends on repetition. If an agent is deployed into a domain where tool sequences are highly idiosyncratic and arguments cannot be derived from prior outputs, speculation will have less to exploit. Pattern-aware systems need patterns. Annoying, but true.
Second, the safety story depends on explicit policies, sandboxing, and correct tool classification. The paper’s side-effect evaluation is encouraging: 602 potentially side-effecting speculative actions were detected among more than 20,000 speculative actions, and no final task result diverged from the baselines. Still, enterprise tools are messy. A “read” endpoint may trigger logging, billing, access updates, rate limits, notifications, or other external effects. Side-effect analysis is not always obvious from the API name.
Third, PASTE reduces effective latency by overlap. It does not make the underlying tool intrinsically faster. If tools are slow because external systems are unreliable, rate-limited, or frequently failing, speculation may hide some waiting but not fix service quality.
Fourth, the evaluation is done under controlled hardware, workloads, models, and baselines. The numbers are meaningful for the studied conditions, but they should be treated as performance evidence, not universal constants. A company should expect to reproduce the measurement pattern, not copy the percentage.
Finally, there is a product-design question. Faster agents are better only when speed improves the user experience without making behavior harder to understand. For long-running research or coding tasks, reducing waiting can be valuable. For high-risk workflows, it may be better to expose intermediate confirmations rather than silently pre-execute more steps. The right design is not always maximum anticipation.
The business value is runtime discipline, not agent bravado
The paper’s title, Act While Thinking, sounds almost human. But the deeper lesson is not that AI agents are becoming wonderfully proactive. The deeper lesson is that agent systems are acquiring the same infrastructure needs as other serious distributed systems: tracing, scheduling, caching, sandboxing, admission control, and policy enforcement.
PASTE is interesting because it treats agent behavior as a runtime workload, not as a mystical reasoning stream. It notices that many tool calls are patterned. It formalizes those patterns without pretending to understand everything. It executes guesses only when policy and resources allow. And it measures whether the result is actually lower latency rather than nicer diagrams.
For Cognaptus readers, the implication is practical: when evaluating an AI agent platform, do not ask only which model it uses. Ask how it manages the LLM-tool loop. Ask whether it logs tool traces. Ask whether it can distinguish tool execution from tool stall. Ask whether safe prefetching, dry-run execution, and result caching are supported. Ask whether speculation is governed by policy or by vibes wearing a YAML jacket.
The next wave of agent performance will not come only from larger models. It will also come from systems that stop making the model wait for work the runtime could have started safely in the background.
That is the quiet engineering idea behind PASTE. The agent does not need to know the entire future. It only needs a runtime that recognizes the next boring step early enough to get it out of the way.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang, “Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution,” arXiv:2603.18897, 2026. https://arxiv.org/abs/2603.18897