TL;DR for operators

Pre-Act is a useful reminder that enterprise agents do not fail only because they choose the wrong tool. They also fail because they lose the plot. A customer asks for help, the agent gathers one fact, calls one API, sees an unexpected result, and then behaves as if the workflow has reset. Charming, in the same way a lift that forgets floors is charming.

The paper introduces Pre-Act, a planning format for LLM agents that extends ReAct. Instead of producing a local “thought” for the immediate next tool call, the agent generates a multi-step execution plan, records what has already happened, proposes what should happen next, executes a tool or final response, observes the result, and then revises the plan [1]. That is the mechanism. It is not merely “more chain-of-thought”, nor another decorative wrapper around tool calling.

The reported gains are substantial. Across five pretrained models, Pre-Act improves action recall over ReAct by an average of 70% on the Almita out-of-domain dataset and 102% on the authors’ proprietary dataset. After curriculum fine-tuning, Llama 3.1 70B with Pre-Act reaches 0.9238 action recall on Almita, compared with 0.5449 for GPT-4-turbo with Pre-Act. In end-to-end simulated conversations on five Almita use cases, the fine-tuned 70B model averages 0.82 goal completion, compared with 0.64 for GPT-4-turbo using Pre-Act and 0.32 for GPT-4-turbo using ReAct.

For business use, the article-level lesson is not “use Pre-Act and retire your workflow team”. Please. The practical lesson is sharper: if XAgent-style systems are expected to run real service or operations workflows, they need explicit planning state, workflow-aware evaluation, and fine-tuning data that teaches recovery paths. The technical method is interesting. The operating discipline around it is the actual moat.

The agent’s real enemy is not tool access. It is short-term memory with confidence

A customer-service agent can have every API in the world and still be useless.

Imagine a customer reports that a delivery never arrived. The agent needs to retrieve the order, check delivery status, verify address, decide whether to escalate, offer a remedy, and close the ticket only when the workflow conditions are satisfied. None of that is exotic. It is ordinary support work. Which is precisely why it exposes the weakness.

A ReAct-style agent reasons and acts in cycles: think about the immediate next move, call a tool, observe the result, then think again. That pattern works well when the task is a sequence of locally obvious actions. It is less reliable when the task has hidden dependencies: “do not close before confirmation”, “transfer to human if delivery was marked complete but customer disputes receipt”, “offer expedited shipping only under this condition”, “never invent missing account details”. The problem is not that the agent cannot call tools. The problem is that it may not keep a stable representation of the workflow while those tool calls unfold.

Pre-Act attacks that specific failure mode. It asks the agent to maintain a plan, not just a next action.

The distinction sounds small until one watches agents in workflow systems. A next-action agent asks, “What should I do now?” A planning agent asks, “What am I trying to complete, what have I already done, what remains, and how should the latest observation change the route?” That second question is more expensive cognitively, but it is closer to how business processes actually work.

This is why the paper deserves attention from teams building systems like XAgent. It is not an argument that every agent needs a grand symbolic planner. It is an argument that tool-using agents need a visible, updateable execution state. Without that, they become conversational slot machines: sometimes impressive, sometimes expensive, always a little too proud of themselves.

Pre-Act changes the loop, not just the wording

The paper’s central contribution is mechanical. ReAct produces a thought for the next action. Pre-Act produces a plan around the next action.

The authors describe Pre-Act as generating a sequence of steps:

$$ S = \{s_1, s_2, \ldots, s_n, s_{fa}\} $$

Each step specifies an intended action and reasoning, while the final step represents the final answer. When a step involves a tool call, the agent executes the action, receives an observation, and uses the accumulated context to refine the next step. In other words, the plan is not a one-shot itinerary. It is revised as the environment responds.
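The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Step` and `PlanState` structures, the `run_pre_act` function, the `get_order_status` tool, and the `toy_planner` that stands in for the LLM call are all hypothetical names introduced here. The point it shows is that the plan is regenerated on every turn from the accumulated steps and observations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One planned step: the reasoning plus either a tool call or a final answer."""
    reasoning: str
    tool: Optional[str] = None            # None marks the final-answer step s_fa
    args: Dict[str, str] = field(default_factory=dict)
    answer: Optional[str] = None


@dataclass
class PlanState:
    """Hypothetical planning state carried across turns."""
    previous: List[Tuple[Step, str]] = field(default_factory=list)  # (step, observation)
    upcoming: List[Step] = field(default_factory=list)              # current plan


def run_pre_act(
    task: str,
    planner: Callable[[str, PlanState], List[Step]],
    tools: Dict[str, Callable[..., str]],
    max_turns: int = 8,
) -> str:
    state = PlanState()
    for _ in range(max_turns):
        # Re-plan every turn so the latest observation can change the route.
        state.upcoming = planner(task, state)
        step = state.upcoming[0]
        if step.tool is None:
            return step.answer or ""
        observation = tools[step.tool](**step.args)
        state.previous.append((step, observation))
    raise RuntimeError("no final answer within max_turns")


def toy_planner(task: str, state: PlanState) -> List[Step]:
    """Stand-in for the LLM; a real agent would prompt a model here."""
    if not state.previous:
        return [
            Step("Need the order record before replying.",
                 tool="get_order_status", args={"order_id": "A1"}),
            Step("Reply once the status is known.", answer="(placeholder)"),
        ]
    _, observation = state.previous[-1]
    return [Step("Status retrieved; report it.", answer=f"Order A1 is {observation}.")]


result = run_pre_act(
    "Where is my order?",
    toy_planner,
    {"get_order_status": lambda order_id: "delayed"},
)
print(result)
```

The design choice worth noticing is that `previous` is part of the planner's input. A ReAct-style loop would pass the transcript too, but it would not ask the planner to restate the remaining route each time.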

A simple contrast helps:

| Agent pattern | What it tracks | Typical strength | Typical weakness |
| --- | --- | --- | --- |
| ReAct | Immediate thought, action, observation | Good local tool-use loop | Can reason one move at a time without preserving workflow structure |
| Pre-Act | Previous steps, next steps, action, observation, revised plan | Better coherence across multi-step execution | Requires more structured prompting or fine-tuning data |
| Rule-based workflow | Explicit states and transitions | High control in known paths | Slow to build, brittle outside designed flows |

The paper’s Figure 1 is an implementation illustration rather than the main evidence. It compares a ReAct trace with a Pre-Act trace on a Glaive news-headline example. The example itself is simple: one tool call, then a final answer. Its purpose is not to prove superiority. It shows the format. ReAct says, roughly, “use the tool.” Pre-Act says, “here are previous steps, here are next steps, here is why this tool is needed, and after the observation here is the revised path to the answer.”

That difference matters more in harder workflows than in the example. For a single API call, planning can look like paperwork. For a branching support process, planning is the difference between an agent that follows the case and an agent that merely reacts to the latest line of chat.

The paper is also careful to position Pre-Act as applicable to conversational and non-conversational agents. That is important. The underlying issue is not chat. It is sequential decision-making under tool feedback. A procurement assistant, claims triage bot, internal IT resolver, or compliance intake agent has the same structural problem: actions are not independent; they are steps inside a process.

The training story is where Pre-Act becomes operational

Prompting a large model to use Pre-Act is one route. Fine-tuning smaller models to internalise the format is the more operationally interesting route.

The authors use a two-stage curriculum learning setup for Llama 3.1 8B and 70B. In the first stage, they fine-tune on Glaive Function-Calling v2 using a ReAct-style format with minimal reasoning. The reason is practical: Glaive is large, and adding full Pre-Act annotations to it would be expensive. This stage teaches basic agentic behaviour: when to call a tool, how to supply parameters, and when to produce a final response.

The second stage fine-tunes from that checkpoint on a smaller proprietary dataset adapted for Pre-Act. This dataset spans over 100 use cases across healthcare, manufacturing, telecommunications, banking, and finance. The authors say the multi-step plan sequence is derived from the dataset itself, while expert annotators supply the reasoning for each step. LoRA is used so that only a small fraction of model parameters are modified.
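To make "a small fraction" concrete: LoRA freezes a d × k weight matrix and trains two low-rank factors of shape d × r and r × k, which adds r(d + k) trainable parameters for that matrix. The sketch below only does that arithmetic. The matrix size and rank are illustrative assumptions, not the paper's settings, and `lora_trainable_fraction` is a hypothetical helper.

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Trainable LoRA parameters relative to one frozen d x k weight matrix.

    LoRA trains a d x r factor and an r x k factor, i.e. r * (d + k)
    parameters, while the original d * k weights stay frozen.
    """
    if not 1 <= r <= min(d, k):
        raise ValueError("rank must be between 1 and min(d, k)")
    return r * (d + k) / (d * k)


# Illustrative numbers only: a square 4096 x 4096 projection with rank 16.
fraction = lora_trainable_fraction(d=4096, k=4096, r=16)
print(f"{fraction:.2%} of that matrix's parameter count is trainable")
```

Under these assumed numbers the adapter is well under one percent of the matrix it modifies, which is the practical reason the second stage can be run on a much smaller dataset without rewriting the whole model.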

The pragmatic design is worth noticing. They do not attempt to annotate the world. They first teach broad tool-use mechanics on a large public dataset, then teach higher-quality planning behaviour on a smaller, workflow-rich dataset. That is probably closer to how enterprise agent training will work in practice: broad competence first, domain process discipline second.

There is also a small but important appendix result on catastrophic forgetting. Table 4 compares Glaive performance after stage one and stage two. The 8B model’s action recall drops from 0.9960 to 0.9881, and the 70B model’s action recall drops from 0.9965 to 0.9929. The authors report these as only 0.80% and 0.36% drops, respectively. This is not the main thesis of the paper. It is a robustness check for the curriculum: after learning Pre-Act on a different dataset, the model mostly preserves its earlier Glaive tool-use ability.

That matters because fine-tuning is often sold with a small-print nightmare: improved behaviour in one domain, quiet degradation elsewhere. Here, at least on Glaive, the degradation appears minimal. It does not prove there is no forgetting in production. It does show that the two-stage design did not obviously erase the basic function-calling competence the first stage was meant to teach.

The evidence separates local correctness from workflow completion

The paper proposes two levels of evaluation.

Level 1 is turn-level evaluation. Given a conversation turn, the model must choose either a final answer or a tool call. If the ground truth is a tool call, the evaluation measures tool F1 and full parameter match. If the ground truth is a final answer, it measures final-answer F1 and semantic similarity.
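A turn-level scorer of this kind might look like the sketch below. It is an assumption-laden illustration rather than the paper's evaluation code: the `Action` structure, the `score_turns` function, the tool names, and the exact definitions of precision, recall, and parameter match are introduced here and may differ from the authors' metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Action:
    """A single turn's action; tool=None means the turn is a final answer."""
    tool: Optional[str] = None
    params: Dict[str, str] = field(default_factory=dict)


def score_turns(gold: List[Action], pred: List[Action]) -> Dict[str, float]:
    """Score tool selection per turn (assumed definitions, see lead-in)."""
    pairs = list(zip(gold, pred))
    gold_calls = sum(1 for g, _ in pairs if g.tool is not None)
    pred_calls = sum(1 for _, p in pairs if p.tool is not None)
    # A true positive is a turn where the gold tool was called by name.
    hits = sum(1 for g, p in pairs if g.tool is not None and g.tool == p.tool)
    # A full match additionally requires every parameter to be identical.
    full = sum(
        1 for g, p in pairs
        if g.tool is not None and g.tool == p.tool and g.params == p.params
    )
    precision = hits / pred_calls if pred_calls else 0.0
    recall = hits / gold_calls if gold_calls else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tool_precision": precision,
        "tool_recall": recall,
        "tool_f1": f1,
        "full_param_match": full / gold_calls if gold_calls else 0.0,
    }


gold = [
    Action("get_order", {"order_id": "A1"}),
    Action("get_delivery_status", {"order_id": "A1"}),
    Action(),  # final answer expected
]
pred = [
    Action("get_order", {"order_id": "A1"}),
    Action("get_delivery_status", {"order_id": "B2"}),  # right tool, wrong parameter
    Action("issue_refund"),                             # tool call where an answer was due
]
scores = score_turns(gold, pred)
print(scores)
```

In this toy run the agent picks the right tool on both tool-call turns, yet only one call has fully correct parameters, and the spurious third call costs precision. Separating these numbers is what makes turn-level results diagnosable.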

Level 2 is end-to-end evaluation. Here the agent runs full simulated conversations with tool access. The evaluation checks whether the agent completes workflow milestones in the right sequence. GPT-4 is used to generate milestone graphs, which are then human-verified and refined. GPT-4 is also used as the judge over simulated conversations, with instructions to avoid false positives, hallucinated milestone completion, and incorrect parameter validation.

That split is sensible because action accuracy and workflow success are related but not identical. An agent can choose several correct tools and still fail the case. It can also make partial progress without completing the final objective. For enterprise automation, this distinction is not academic. A support agent that gets 80% of the way through refund processing and then closes the ticket incorrectly has not delivered 80% of the business value. It has created a mess with a transcript.

The paper’s evaluation design can be read as follows:

| Component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Table 1 dataset statistics | Evaluation context | Shows scale and domain differences across Glaive, proprietary data, and Almita | Does not establish data quality or production representativeness |
| Table 2 turn-level metrics | Main evidence | Tests whether Pre-Act improves action and tool selection at individual turns | Does not show full workflow completion |
| Table 3 end-to-end results | Main evidence | Tests whether agents complete milestone-based workflows in simulated conversations | Depends on synthetic users, GPT-4 judging, and five selected use cases |
| Table 4 curriculum comparison | Robustness check | Suggests stage-two Pre-Act fine-tuning causes minimal forgetting on Glaive | Does not rule out forgetting on other tasks or domains |
| Prompt-template appendices | Implementation detail | Makes the planning and evaluation machinery more inspectable | Does not independently validate the method |

This is stronger than the usual “our agent solved some demos” routine. It still has boundaries, but it asks the right question: did the agent complete the job, or merely perform plausible intermediate actions?

The headline gains are real, but the fine-tuning does much of the heavy lifting

The first major result is that Pre-Act improves pretrained models over ReAct.

Across five vanilla models, the paper reports that Pre-Act improves action recall by an average of 102% on the proprietary dataset and 70% on the Almita dataset. The models include Llama 3.1 8B, Llama 3.1 70B, Nvidia Nemotron 70B, DeepSeek-distil Llama 3.1 70B, and GPT-4-turbo. The comparison is not available on Glaive because Glaive lacks Pre-Act annotations.

The pattern is not uniform across every metric. The paper notes a minor drop in final-answer similarity for Llama 3.1 8B and 70B on the proprietary dataset. But the overall pattern is clear: when the task requires tool selection and action planning, making the model explicitly plan before acting improves its odds of choosing the correct action.

The second result is larger: fine-tuned Pre-Act models outperform the prompted baselines.

On Almita, GPT-4-turbo with Pre-Act reaches 0.5449 action recall. The fine-tuned Llama 3.1 70B with Pre-Act reaches 0.9238. That is the paper’s reported 69.5% improvement over GPT-4-turbo with Pre-Act. The fine-tuned 8B model also performs strongly, reaching 0.8706 action recall on Almita.
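The 69.5% figure is a relative gain, not a difference in percentage points, and the two are easy to confuse on a slide. The short check below reproduces it from the two recall values quoted above; `relative_improvement` is a helper named here for illustration.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage gain of candidate over baseline, relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


gpt4_pre_act = 0.5449        # GPT-4-turbo + Pre-Act, Almita action recall
llama70b_finetuned = 0.9238  # fine-tuned Llama 3.1 70B + Pre-Act, same metric

relative_gain = relative_improvement(gpt4_pre_act, llama70b_finetuned)
absolute_gain_points = (llama70b_finetuned - gpt4_pre_act) * 100.0

print(f"relative gain: {relative_gain:.1f}%")
print(f"absolute gain: {absolute_gain_points:.1f} points")
```

The relative gain rounds to 69.5%, while the absolute gain in recall is a little under 38 points. Both are large, but they answer different questions.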

The end-to-end numbers are more business-readable:

| Model and approach | Average goal completion on five Almita use cases | Interpretation |
| --- | --- | --- |
| GPT-4-turbo + ReAct | 0.32 | The baseline often makes partial progress but fails many workflows |
| GPT-4-turbo + Pre-Act | 0.64 | Explicit planning substantially improves completion |
| Fine-tuned Llama 3.1 70B + Pre-Act | 0.82 | Training the planning behaviour beats prompting GPT-4-turbo in this setup |

The fine-tuned 70B model does not dominate every single use case. On “Digital Download”, GPT-4-turbo with Pre-Act reports 0.66 goal completion, while the fine-tuned 70B reports 0.60. That detail is useful. It prevents the wrong reading: this is not magic dust sprinkled on Llama. The stronger claim is that, averaged across the five simulated workflows, the fine-tuned Pre-Act model performs better and more consistently.

The paper also reports progress rate, which is a softer but useful measure. If goal completion is binary at the workflow level, progress rate asks how far the agent advanced through the milestone graph. This matters for diagnosis. A low-completion, high-progress agent may be failing near the end. A low-progress agent may not understand the workflow at all. Those are different engineering problems.
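The difference between the two measures is easier to see in code. The sketch below assumes that progress rate is the share of milestones reached with their prerequisites satisfied, and that goal completion is whether a designated goal milestone was reached. Those definitions, the `evaluate_run` function, and the milestone names are assumptions made for illustration, and the paper's scoring is done by a GPT-4 judge rather than a deterministic function like this one.

```python
from typing import Dict, List, Set


def evaluate_run(
    deps: Dict[str, Set[str]], achieved: List[str], goal: str
) -> Dict[str, float]:
    """Score one simulated conversation against a milestone dependency map.

    deps maps each milestone to its prerequisites; achieved lists the
    milestones the agent claims, in the order they occurred.
    """
    done: Set[str] = set()
    for milestone in achieved:
        # Count a milestone only if it is known and its prerequisites came first.
        if milestone in deps and deps[milestone] <= done:
            done.add(milestone)
    return {
        "progress_rate": len(done) / len(deps),
        "goal_completion": 1.0 if goal in done else 0.0,
    }


deps = {
    "get_order": set(),
    "check_delivery": {"get_order"},
    "offer_remedy": {"check_delivery"},
    "close_ticket": {"offer_remedy"},
}
# The agent closes the ticket without ever offering a remedy.
run = evaluate_run(deps, ["get_order", "check_delivery", "close_ticket"], goal="close_ticket")
print(run)
```

Here the premature `close_ticket` is not credited, so the run scores half the milestones and no goal completion. That is the "failing near the end" pattern, and it points to a different fix than an agent that never retrieves the order at all.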

The business value is not “smaller model beats GPT-4”. It is workflow control at lower marginal cost

The tempting headline is obvious: smaller fine-tuned model beats GPT-4. Fine. Put it on a slide, then delete the slide before anyone important sees it.

The more useful business interpretation is that Pre-Act shifts agent design from conversational improvisation toward workflow control. That is valuable in three ways.

First, it makes the agent’s decision process more inspectable. A plan with previous steps and next steps gives developers and evaluators a better handle on why the agent did something. This is not full interpretability. It is still generated text from a model. But it is operationally more legible than a bare tool call.

Second, it creates a training target. If a company has transcripts, workflows, tool definitions, and outcomes, it can potentially transform them into planning traces. That turns agent quality from “prompt wizardry performed by whoever is still awake” into a data engineering and annotation problem. Not cheap, but at least recognisable.

Third, it supports model substitution. If a smaller model can internalise the planning format, a business may reduce dependency on expensive proprietary models for high-volume workflows. The paper does not provide a full cost or latency benchmark, so the ROI case remains inferred rather than directly measured. Still, the mechanism is plausible: a fine-tuned smaller model that chooses tools correctly and completes workflows can be more attractive than a larger model that needs heavy prompting and still drifts.

For XAgent-style systems, the implication is straightforward. The architecture should not treat the LLM as a stateless reasoning oracle attached to tools. It should treat the LLM as a planner whose intermediate state is part of the product.

That affects product design:

| Technical contribution | Operational consequence | ROI relevance |
| --- | --- | --- |
| Multi-step execution plan | Agent tracks the workflow, not only the next API call | Fewer broken handoffs and incomplete cases |
| Observation-based plan revision | Agent can adapt when tool outputs surprise it | Better exception handling |
| Curriculum fine-tuning | Planning behaviour can be taught to smaller models | Potential cost and latency reduction |
| Milestone-based evaluation | Teams can score end-to-end workflow progress | Better QA than demo-based evaluation |
| Human-verified milestone graphs | Evaluation becomes tied to actual process logic | Useful for regulated or high-stakes workflows |

The hidden cost is data preparation. Pre-Act is not free because the prompt has a nicer name. The authors relied on a proprietary dataset with expert reasoning annotations. For many companies, the bottleneck will not be choosing between ReAct and Pre-Act. It will be whether they can produce clean workflow definitions, tool schemas, exception paths, and high-quality annotated traces without turning the project into a consulting swamp.

Milestone graphs are the paper’s quiet business contribution

The paper’s evaluation machinery may be as important as the agent method.

For end-to-end evaluation, the authors create milestone dependency graphs from workflow and tool descriptions. These graphs include functional milestones, which correspond directly to tool calls, and non-functional milestones, which capture states or conditions in the workflow. For example, a workflow may require not only calling get_customer_details, but also reaching a state where the customer has confirmed satisfaction or where escalation is justified.

This matters because enterprise workflows are full of non-functional gates. “Customer agreed”, “identity verified”, “case eligible”, “handoff required”, “final confirmation received” — these are often not API calls, but they are operationally decisive. A tool-call-only benchmark may miss them. A milestone graph can represent them.
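One way to represent such a graph is sketched below, using Python's standard `graphlib` module to check that the dependencies form a valid order. The `Milestone` structure, the `completion_order` function, and the milestone names other than the `get_customer_details` tool are hypothetical; the paper generates its graphs with GPT-4 and has humans verify them, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Milestone:
    name: str
    tool: Optional[str] = None        # set for functional milestones only
    requires: Tuple[str, ...] = ()    # names of prerequisite milestones

    @property
    def functional(self) -> bool:
        return self.tool is not None


def completion_order(milestones: List[Milestone]) -> List[str]:
    """Return one valid completion order.

    Raises ValueError for an undefined prerequisite and graphlib.CycleError
    if the dependencies loop back on themselves.
    """
    names = {m.name for m in milestones}
    for m in milestones:
        missing = set(m.requires) - names
        if missing:
            raise ValueError(f"{m.name} requires undefined milestones: {sorted(missing)}")
    graph = {m.name: set(m.requires) for m in milestones}
    return list(TopologicalSorter(graph).static_order())


workflow = [
    Milestone("customer_details_retrieved", tool="get_customer_details"),
    Milestone("identity_verified", requires=("customer_details_retrieved",)),
    Milestone("final_confirmation_received", requires=("identity_verified",)),
]
order = completion_order(workflow)
non_functional = [m.name for m in workflow if not m.functional]
print(order)
print(non_functional)
```

In this three-milestone example, two of the gates have no tool behind them. A benchmark that only counts tool calls would see one milestone and miss the two that decide whether the case was handled properly.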

The evaluation still uses GPT-4 as part of the pipeline, so it is not deterministic in the way a traditional test suite is deterministic. The authors acknowledge future work should move toward more deterministic evaluation and reduce volatility from LLM-as-judge assessment. But the direction is right. Agents should be evaluated against the workflow they claim to automate.

For businesses, this suggests a useful QA hierarchy:

  1. Tool-call validity: Did the agent call the right function with valid parameters?
  2. Turn correctness: Was the next action correct for the immediate user input?
  3. Milestone progress: Did the agent move the case through the required workflow states?
  4. Goal completion: Was the business task actually completed?
  5. Recovery behaviour: Did the agent handle deviations, missing data, incorrect user input, or tool failures?

Most agent demos obsess over the first two. Customers care about the last three. Funny how that works.

Pre-Act corrects a common misconception about “reasoning agents”

A shallow reading of the paper would say: Pre-Act works because it gives the model more reasoning tokens.

That is incomplete.

The important move is not token volume. It is structure. Pre-Act forces the model to represent what has already happened and what should happen next. It ties reasoning to an execution plan, and it updates that plan after observing tool outputs. This is why the method is more relevant to workflow automation than generic chain-of-thought prompting.

More reasoning can even be counterproductive if it is unstructured. A model can produce a beautiful paragraph explaining the wrong action. It can rationalise missing information. It can hallucinate that a condition has been satisfied. The paper’s GPT-4 judge prompt explicitly warns against false positives, hallucinated data, incomplete parameter validation, and milestone claims unsupported by the transcript. That warning exists because agent reasoning is not self-validating. The model saying it completed a step is not the same as completing it. Shocking, yes.

The replacement belief should be this: agent reasoning becomes useful when it is constrained by workflow state, tool observations, and evaluable milestones. Pre-Act is one concrete attempt to impose that structure.

Where the result applies, and where it does not yet travel

The paper is strongest for task-oriented agents with tool calls and identifiable workflows. Customer service is the obvious fit, and the Almita use cases are in that family: order discrepancy, internet ping, gift card, digital download, and delivery. Similar logic applies to internal IT, claims handling, onboarding, procurement support, appointment scheduling, and banking service flows.

The result is less directly transferable to open-ended knowledge work where success is ambiguous, tools are optional, and milestones are hard to define. A research assistant writing a market memo may benefit from planning, but milestone graphs are harder to specify. A coding agent may need planning too, but its evaluation should include repository state, tests, build logs, and code review outcomes, not only conversation milestones.

Several boundaries matter.

First, the strongest fine-tuning results depend on proprietary data. The paper describes the dataset size and domain coverage, but external readers cannot fully inspect the annotation quality or distribution. That does not invalidate the result. It limits how confidently one can generalise it.

Second, the end-to-end evaluation uses five Almita use cases selected from eighteen after filtering out similar cases and those lacking tools or workflow information. Five workflows are enough to illustrate a serious evaluation method. They are not enough to claim universal production reliability.

Third, GPT-4 is used both as synthetic user and judge in parts of the evaluation pipeline. The authors mitigate this with milestone graphs and human verification, and the judge prompt is cautious. Still, LLM-as-judge remains a source of variance and bias. The paper itself points to more deterministic evaluation as future work.

Fourth, the paper does not provide a production cost-latency analysis. It argues that smaller fine-tuned models can reduce latency and cost, which is plausible and operationally important. But the reported experiments are primarily accuracy and completion evaluations, not a full deployment economics study.

Finally, Pre-Act may increase prompt verbosity when used purely as prompting. The operational win likely comes when the behaviour is internalised through fine-tuning or when the additional planning cost is offset by fewer failed workflows. That trade-off needs to be measured per deployment.

What XAgent should borrow from Pre-Act

The practical design lesson for XAgent is not to copy the prompt template and declare victory. Prompt templates are where good ideas go to become fragile rituals.

The better takeaway is architectural.

XAgent should maintain an explicit planning state that persists across tool calls. It should distinguish between previous actions, current observations, unresolved dependencies, and next required steps. That state should be visible enough for debugging and evaluation. It should also be revisable, because real workflows contain failed tools, ambiguous user answers, and exception branches.

XAgent should also separate tool correctness from workflow success. A tool call can be valid and still premature. A final answer can be fluent and still operationally wrong. Evaluation should ask whether the agent completed the process, not whether it looked competent at each turn.

Finally, XAgent should treat workflow data as training infrastructure. The Pre-Act paper’s strongest gains appear when the model is fine-tuned on planning-rich traces. That suggests a product roadmap: collect workflow transcripts, map them to tool calls and milestones, annotate reasoning only where it improves action selection, and test against end-to-end goal completion. The result is less glamorous than “autonomous AI employee”. It is also more likely to survive contact with users.

Planning before action is not bureaucracy. It is how agents stop wandering

Pre-Act is not the final answer to agent reliability. It is a disciplined move in the right direction.

It says that acting agents should not merely react. They should carry a plan, update it with observations, and be evaluated on whether they complete the workflow. The paper’s results support that claim at both turn level and end-to-end level, with especially strong gains after curriculum fine-tuning. Its limitations are real: proprietary annotations, synthetic evaluation, GPT-4 judging, and a small set of end-to-end workflows. But those limitations are also useful. They show what serious agent deployment will require.

The lazy version of enterprise AI says: connect an LLM to tools and wait for productivity. The Pre-Act version says: define the workflow, teach the plan, observe the execution, revise the route, and measure the outcome.

Less magical. More useful. A terrible trade for conference demos, perhaps. A better one for businesses that prefer completed work.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mrinal Rawat, Ambuje Gupta, Rushil Goomer, Alessandro Di Bari, Neha Gupta, and Roberto Pieraccini, “Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents,” arXiv:2505.09970v2, 2025.