Plan, Then Rewrite: Why Explicit Intent Wins in Agent Workflows

A user starts by asking for Italian restaurants, answers a few clarification questions, then changes their mind and asks for Mexican instead. A human hears the reversal. A planner may hear: pizza, pasta, Italian, Mexican, recommendations, and perhaps a vague invitation to overachieve. Naturally, it may then produce a plan with the confidence of a consultant who attended only half the meeting.

That is the failure mode RECAP studies: not whether a large language model can “understand language” in the abstract, but whether an agentic workflow is given the right operational input before it starts decomposing tasks.¹ The paper’s answer is simple and inconvenient: feeding raw chat history into a planner is often the wrong interface. The planner needs an explicit, current, compact statement of intent.

This is not ordinary summarisation. A summary is for a reader. An intent rewrite is for a planner. The difference matters.

The mechanism is a handoff, not a nicer transcript

RECAP assumes a fairly familiar agent architecture. A chat agent holds the conversation. A planner receives some representation of what the user wants and produces a directed acyclic graph, or DAG, of subtasks. Action agents could then execute those subtasks: search, draft, book, compare, generate, file, or otherwise do the work.

The paper inserts one component between conversation and planner:

Raw USER-AGENT dialogue
        ↓
Intent rewriter
        ↓
Explicit current intent
        ↓
Planner
        ↓
Task DAG

That middle layer is the point. The rewriter is not there to reduce token count as a housekeeping exercise, although it may do that. It is there to remove stale goals, preserve current constraints, separate multiple intents, and avoid letting the planner mistake the agent’s own clarification questions for the user’s real request. Small distinction. Large blast radius.
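A minimal sketch of that handoff, assuming a plan can be modelled as a mapping from each subtask to the subtasks it depends on. Every name here (`Turn`, `run_handoff`, the two toy rewriters, the keyword planner) is hypothetical and stands in for a model call; none of it comes from the paper. The "last user turn" rewriter is deliberately naive and exists only to make the contrast visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str


# Hypothetical plan shape: subtask -> subtasks it depends on.
Plan = Dict[str, List[str]]


def run_handoff(
    dialogue: List[Turn],
    rewriter: Callable[[List[Turn]], str],
    planner: Callable[[str], Plan],
) -> Plan:
    """Rewrite the dialogue into an intent string, then plan from that string only."""
    return planner(rewriter(dialogue))


def dummy_rewriter(dialogue: List[Turn]) -> str:
    # Passes the raw transcript through unchanged.
    return "\n".join(f"{t.speaker}: {t.text}" for t in dialogue)


def last_user_turn_rewriter(dialogue: List[Turn]) -> str:
    # Naive stand-in for a real rewriter: keeps only the final user turn.
    user_turns = [t.text for t in dialogue if t.speaker == "user"]
    return user_turns[-1] if user_turns else ""


def keyword_planner(intent: str) -> Plan:
    # Toy planner: one search subtask per cuisine mentioned, then a ranking step.
    cuisines = [c for c in ("italian", "mexican") if c in intent.lower()]
    plan: Plan = {f"search {c} restaurants": [] for c in cuisines}
    plan["rank results"] = list(plan)  # ranking depends on every search
    return plan


dialogue = [
    Turn("user", "Find me Italian restaurants nearby."),
    Turn("agent", "Sure. Any price range in mind?"),
    Turn("user", "Actually, make it Mexican instead."),
]

raw_plan = run_handoff(dialogue, dummy_rewriter, keyword_planner)
clean_plan = run_handoff(dialogue, last_user_turn_rewriter, keyword_planner)
print(sorted(raw_plan))
print(sorted(clean_plan))
```

The planner is identical in both calls; only the handoff changes. With the raw transcript, the stale Italian search survives into the plan. With the rewritten intent, it does not.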

The authors compare three prompt-based rewrite styles:

Rewriter	What it gives the planner	Operational risk
Dummy	The raw dialogue, unchanged	Planner must infer latest intent from all turns, including stale and distracting ones
Basic	A generic summary of the conversation	May compress away the very details that determine the plan
Advanced	A concise instruction-style statement of the user’s latest intended task	Better aligned with planning, but can still misread subtle “fake shifts”

The Advanced rewriter is explicitly prompted to capture the final user goal, account for true intent shifts, preserve necessary specifications, filter noise, and avoid introducing unsupported information. That is already more than “summarise this chat.” It is closer to a contract: here is the task the planner should plan for, and here is what no longer matters.
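To make the "contract" idea concrete, here is a sketch of how such a prompt could be assembled. The rule wording, the `REWRITE_RULES` list, and `build_rewrite_prompt` are all assumptions written for illustration; the paper's actual prompt is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical wording that paraphrases the five requirements described above.
REWRITE_RULES = [
    "State the user's final goal as one instruction addressed to a planner.",
    "If the user truly changed their mind, keep only the latest goal.",
    "Keep every specification the plan still depends on.",
    "Ignore small talk, stale requests, and the agent's own questions.",
    "Do not add information the user never provided.",
]


def build_rewrite_prompt(dialogue: List[Tuple[str, str]]) -> str:
    """Assemble a rewrite prompt from (speaker, text) turns."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(REWRITE_RULES, start=1))
    transcript = "\n".join(f"{speaker.upper()}: {text}" for speaker, text in dialogue)
    return (
        "Rewrite the conversation below as the task a planner should plan for.\n"
        f"Rules:\n{rules}\n\n"
        f"Conversation:\n{transcript}\n\n"
        "Rewritten intent:"
    )


prompt = build_rewrite_prompt([
    ("user", "Find me Italian restaurants nearby."),
    ("agent", "Any price range in mind?"),
    ("user", "Actually, make it Mexican instead."),
])
print(prompt)
```

Keeping the rules in a list rather than in one prose blob makes them easy to version and to test one at a time.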

RECAP tests whether the handoff changes the plan

The benchmark itself is built around the situations where agent planning usually becomes brittle. RECAP contains 810 validated USER-AGENT conversation instances, generated synthetically and then vetted, across five domains: cooking, programming, health, flights, and restaurants. It balances three conversation lengths and five intent categories: shifted intent, noisy input, underspecified intent, multi-intent, and perfect intent.

That design choice is important. A benchmark with only crisp single-turn tasks would mostly test whether the model can follow instructions. RECAP is trying to test the messier interface problem: what happens when the user’s desired task is distributed across a conversation, partially revised, and sometimes blurred by irrelevant turns.

The paper evaluates each rewrite by its downstream effect, not by whether the rewrite sounds elegant. Each rewrite is fed into a static GPT-4o planner at temperature 0, which generates a DAG of subtasks. The resulting plans are compared using structural metrics, semantic distance, and preference judgments.

The preference rubric is practical. Evaluators judge whether a plan reflects the latest intent, avoids fabricated or redundant steps, has appropriate granularity, is complete, and follows a logical order. In other words, the evaluation asks whether the plan is useful, not whether the intermediate text has a charming personality. Blessedly.
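"Follows a logical order" has a mechanical floor: the plan must actually be acyclic, and every subtask must come after its prerequisites. A minimal sketch of that check, assuming the plan is a mapping from subtask to prerequisites and Python 3.9 or later for the standard-library `graphlib` module. The `execution_order` helper and the example plans are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional


def execution_order(plan: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return subtasks in dependency order, or None if the plan contains a cycle."""
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError:
        return None


valid_plan = {
    "search mexican restaurants": [],
    "filter by price": ["search mexican restaurants"],
    "rank results": ["filter by price"],
}
cyclic_plan = {"a": ["b"], "b": ["a"]}

print(execution_order(valid_plan))
print(execution_order(cyclic_plan))
```

This only catches structural nonsense. Whether the order is sensible for the user's goal is what the human preference judgments are for.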

The first result: planners are input-sensitive

The sensitivity experiment is the paper’s foundation. The authors compare planning from raw dialogue against planning from Advanced rewrites on two datasets: a 70-instance IN3 subset and a 70-instance RECAP-toy set. This is the main evidence, not an ablation. Its purpose is to show whether the planner’s output changes meaningfully when the intent representation changes.

On RECAP-toy, human annotators prefer plans generated from Advanced rewrites over raw dialogue, and the advantage grows with conversation length:

Conversation length	Dummy preferred	Tie	Advanced preferred
Short	26.67%	23.33%	50.00%
Medium	20.00%	20.00%	60.00%
Long	16.67%	20.00%	63.33%

That pattern is the business story in miniature. The longer the conversation, the more damage stale or ambiguous context can do. A planner that looks competent on short prompts may degrade when it is handed a transcript containing reversals, clarifications, side constraints, and conversational lint.
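The percentages in the table above are tallies of pairwise judgments, one label per plan pair. A small sketch of that tally follows; the `preference_rates` helper is hypothetical, and the ten labels are invented for illustration, not data from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def preference_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Turn per-pair labels ('dummy', 'tie', 'advanced') into percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarise")
    return {k: round(100 * counts[k] / total, 2) for k in ("dummy", "tie", "advanced")}


# Invented labels for ten plan pairs.
labels = ["advanced"] * 5 + ["tie"] * 2 + ["dummy"] * 3
print(preference_rates(labels))
```

Running the same tally per conversation-length bucket is what exposes a trend like the one in the table.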

The same experiment on IN3 produces many more ties. The authors interpret this as evidence that IN3 is less sensitive to the rewrite problem because it does not surface the same range of evolving intent behaviours. In plain terms: a benchmark can be too tidy to reveal where production agents actually trip.

The robustness check then repeats the sensitivity analysis with LLaMA 3.3-70B and o3-mini planners. The paper reports similar sensitivity trends. That does not prove every planner in the wild behaves identically, but it weakens the lazy objection that the result is only a GPT-4o quirk.

Generic summaries are not enough

The most useful misconception to kill here is that “intent rewriting” means “summarise the conversation before planning.” RECAP’s prompt-based comparison says otherwise.

On the RECAP train split, Advanced rewrites achieve the highest overall win rate against the other prompt-based rewrite strategies:

Rewriter	Win rate	Tie rate	Loss rate
Dummy	16.67%	56.19%	27.14%
Basic	13.81%	59.52%	26.67%
Advanced	31.90%	59.52%	8.57%

The details matter more than the aggregate. Advanced is strongest on shifted intent and multi-intent cases, where the planner needs to know what changed and which goals are still active. For shifted intent, Advanced reaches a 50.00% win rate with only a 4.76% loss rate. For multi-intent, it reaches 40.48% wins and 7.14% losses.

Basic summarisation performs poorly in shifted-intent cases. This is predictable: generic summaries tend to retain the conversational arc rather than the operational target. They may describe what happened, not what should now be done. A planner does not need a memoir. It needs a work order.

There is one useful counterexample. In underspecified-intent cases, Basic slightly beats Advanced on win rate, 20.00% versus 17.50%, although Advanced has a lower loss rate. The paper attributes this to “fake intent shifts”: cases where the user appears to start a new request but is actually refining the original one. If a rewriter is too eager to detect a shift, it can misclassify refinement as replacement.

That caveat is not a weakness of the paper; it is one of its best product lessons. Intent rewriting is not “always prefer the latest noun phrase.” It needs to distinguish reversal, refinement, elaboration, and distraction. Otherwise the cleaner interface becomes a very tidy way to be wrong.

The plans change structurally, not just cosmetically

The structural results explain why this is more than wording preference. Plans generated from different rewrite inputs diverge in node counts, edge counts, graph edit distance, and semantic distance. The largest graph edit distance appears between Basic and Advanced rewrites:

Plan comparison	Δ nodes	Δ edges	Graph edit distance	Semantic distance
Dummy vs Basic	1.68	2.18	4.99	0.10
Dummy vs Advanced	1.70	2.36	5.56	0.11
Basic vs Advanced	1.87	2.49	6.44	0.11

This is implementation evidence. It shows that the planner is not merely producing differently worded versions of the same plan. The rewrite changes the decomposition: which subtasks appear, how they depend on each other, and what the system may later execute.
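A rough sketch of how two plans can be compared structurally, assuming each plan is a mapping from subtask to prerequisites. The `label_edit_distance` below is a simplification, not the paper's metric: it matches nodes by exact label and counts insertions and deletions, whereas a full graph edit distance would also search over node correspondences. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Plan = Dict[str, List[str]]  # subtask -> subtasks it depends on


def nodes_and_edges(plan: Plan) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    nodes: Set[str] = set(plan)
    edges: Set[Tuple[str, str]] = set()
    for task, deps in plan.items():
        for dep in deps:
            nodes.add(dep)
            edges.add((dep, task))  # edge points from prerequisite to dependent
    return nodes, edges


def structural_diff(a: Plan, b: Plan) -> Dict[str, int]:
    nodes_a, edges_a = nodes_and_edges(a)
    nodes_b, edges_b = nodes_and_edges(b)
    return {
        "delta_nodes": abs(len(nodes_a) - len(nodes_b)),
        "delta_edges": abs(len(edges_a) - len(edges_b)),
        # Node and edge insertions plus deletions, with nodes matched by exact label.
        "label_edit_distance": len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b),
    }


plan_from_raw = {
    "search italian": [],
    "search mexican": [],
    "rank": ["search italian", "search mexican"],
}
plan_from_rewrite = {"search mexican": [], "rank": ["search mexican"]}

print(structural_diff(plan_from_raw, plan_from_rewrite))
```

Here the stale Italian search costs one extra node and one extra edge, so the label-matched distance is 2. Small numbers, but each unit is a subtask or dependency something downstream may execute.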

For enterprise workflows, that is the uncomfortable part. A bad rewrite does not stay in the language layer. It turns into tasks, dependencies, API calls, file writes, recommendations, emails, bookings, or approvals. The mistake becomes operational.

Preference learning helps, but the supervision source matters

The paper then trains DPO-based rewriters. This is not a separate thesis; it is an extension of the same mechanism. If the system can observe which plans humans prefer, it can trace those plans back to the rewrites that produced them and fine-tune the rewriter accordingly.
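The tracing step can be sketched as a data transformation: a preference between two plans becomes a preference between the two rewrites that produced them. The record fields, the `build_dpo_pairs` helper, and the decision to skip ties are assumptions made for illustration; the `prompt` / `chosen` / `rejected` layout is a common convention for preference-tuning data, not something the source specifies.

```python
from typing import Dict, List


def build_dpo_pairs(comparisons: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert plan-level preferences into rewrite-level (chosen, rejected) pairs."""
    pairs = []
    for c in comparisons:
        verdict = c["preferred_plan"]
        if verdict == "tie":
            continue  # assumption: ties carry no training signal and are skipped
        if verdict not in ("a", "b"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        if verdict == "a":
            chosen, rejected = c["rewrite_a"], c["rewrite_b"]
        else:
            chosen, rejected = c["rewrite_b"], c["rewrite_a"]
        pairs.append({"prompt": c["dialogue"], "chosen": chosen, "rejected": rejected})
    return pairs


# Invented comparison records; the plans themselves are not needed once judged.
comparisons = [
    {
        "dialogue": "USER: Italian places? ... USER: Actually, Mexican.",
        "rewrite_a": "The user asked about Italian and then Mexican restaurants.",
        "rewrite_b": "Recommend Mexican restaurants.",
        "preferred_plan": "b",
    },
    {
        "dialogue": "USER: Fix the login bug.",
        "rewrite_a": "Fix the login bug.",
        "rewrite_b": "Resolve the login defect.",
        "preferred_plan": "tie",
    },
]

pairs = build_dpo_pairs(comparisons)
print(pairs)
```

Note that the label is earned by the plan but attached to the rewrite. That indirection is the whole idea.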

There are two trained variants:

Trained rewriter	Supervision source	Purpose
DPO:human	Human plan preferences	Test whether human-aligned downstream preference improves rewrites
DPO:LLM	LLM-judge plan preferences	Test whether cheaper synthetic preference labels can scale the same idea

On the held-out RECAP test set, DPO:human outperforms the Advanced prompt rewriter in the direct human evaluation:

Rewriter compared with Advanced	Win rate	Tie rate	Loss rate
DPO:human	48.88%	28.90%	22.22%
DPO:LLM	28.88%	33.33%	37.78%

The result is strongest for DPO:human. It wins across nearly all challenge categories, including shifted intent, noisy input, multi-intent, underspecified intent, and perfect intent. DPO:LLM is more mixed. It is competitive in some categories but does not consistently beat Advanced.

That distinction should survive into any business interpretation. Preference learning looks promising, but the cheap label path is not automatically equivalent to human supervision. Shocking, yes: the shortcut is not free.

The evaluator is useful, not omniscient

To make preference evaluation scalable, the authors also train LLM-as-judge models to predict human plan preferences. This is an implementation detail with strategic implications: if every rewrite experiment requires human comparison, iteration becomes slow and expensive.

Zero-shot LLM evaluators perform weakly. Fine-tuning improves them substantially. The best model, fine-tuned GPT-4.1, reaches 65.01% test accuracy and 0.65 F1, compared with a zero-shot GPT-4.1 baseline at 45.00% accuracy and 0.46 F1.

That is useful, but not magical. A 65% preference predictor can support triage, regression testing, and experiment filtering. It should not become the Supreme Court of Plan Quality. The paper’s own appendix makes this clear: when the best fine-tuned evaluator is used to compare DPO:human and Advanced across the entire 810-instance RECAP dataset, preferences are largely neutral overall, with 26.80% wins for DPO:human, 44.53% ties, and 28.67% losses.
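Checking a judge against human labels is cheap to sketch. The version below assumes three labels per plan pair (`a`, `b`, `tie`) and reports accuracy plus a macro-averaged F1; whether the paper's F1 is macro-averaged is not stated here, so treat that choice as an assumption. The `judge_agreement` helper and the six labels are invented.

```python
from typing import Dict, List, Sequence


def judge_agreement(human: Sequence[str], judge: Sequence[str]) -> Dict[str, float]:
    """Accuracy and macro F1 of judge labels measured against human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
    f1_scores: List[float] = []
    for label in sorted(set(human) | set(judge)):
        tp = sum(h == label and j == label for h, j in zip(human, judge))
        fp = sum(h != label and j == label for h, j in zip(human, judge))
        fn = sum(h == label and j != label for h, j in zip(human, judge))
        # Each label occurs at least once, so the denominator is never zero.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


human_labels = ["a", "b", "tie", "a", "b", "tie"]
judge_labels = ["a", "b", "a", "a", "tie", "tie"]
print(judge_agreement(human_labels, judge_labels))
```

On these invented labels the judge agrees with the humans on four of six pairs. A number in that range is a filter, not a verdict.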

This is why the safe reading is nuanced:

Result	What it supports	What it does not prove
Advanced beats raw dialogue on RECAP-toy, especially in longer conversations	Explicit intent formulation improves downstream planning under conversational complexity	Every agent workflow will see the same gain
Advanced beats Dummy and Basic on most RECAP train challenges	Task-aware rewriting is different from generic summarisation	Prompting alone solves intent understanding
DPO:human beats Advanced on held-out human evaluation	Human preference signals can improve rewrite utility	Human-trained rewriters always dominate at full scale
Fine-tuned LLM judges improve over zero-shot judges	Automated plan preference evaluation can help scale iteration	LLM judges can replace human evaluation without audit

The paper is most valuable when read as an interface study, not a leaderboard announcement.

The business move is to make intent observable

For business systems, the main inference is straightforward: put an explicit intent layer between conversation and planning, then evaluate that layer through downstream plan quality.

That sounds simple until one asks what the layer should actually output. A production version should not merely emit a sentence. It should create a small, inspectable artefact with fields such as:

Field	Why it matters
Current active intent(s)	Prevents stale requests from leaking into the plan
Superseded intent(s)	Makes reversals auditable
Hard constraints	Captures budget, time, permissions, jurisdiction, format, and other non-negotiables
Soft preferences	Allows ranking without over-constraining
Open questions	Prevents the planner from inventing missing facts
Multi-intent grouping	Separates parallel tracks from dependent subtasks
Evidence turns	Lets reviewers trace why the rewrite says what it says

This is where the paper becomes commercially useful. Most teams trying to build agents focus on planner sophistication: better decomposition, better tool routing, better orchestration, better retries. All sensible. But RECAP suggests that some planning failures are upstream. The planner is not necessarily stupid; it may simply be receiving a bad contract.

The rewrite layer also improves observability. Raw chat logs are noisy. Full plan DAGs can be too operationally detailed. A compact intent artefact is easier to log, inspect, diff, version, and A/B test. It becomes the thing product managers, QA teams, compliance reviewers, and workflow owners can actually read without needing a ceremonial sacrifice to the token gods.

Where this applies first

The strongest near-term use cases are not high-stakes autonomous execution. They are workflows where user intent evolves across turns and the plan can be reviewed, constrained, or partially executed.

Customer support is an obvious candidate. A user starts with a refund question, adds delivery context, mentions a damaged product, then decides they want replacement rather than refund. The rewrite layer should prevent the planner from opening the wrong workflow.

Procurement and internal operations are another. A requester begins with “find vendors,” then adds budget, region, compliance constraints, and a preference for existing suppliers. A raw transcript can easily produce a bloated plan. An explicit intent can separate sourcing, compliance review, and approval routing.

Coding agents have the same issue. A developer asks for a bug fix, then narrows the scope, rejects a library, adds test requirements, and asks not to touch a module. The intent rewrite should make those constraints explicit before a planner generates file edits. Otherwise “agentic coding” becomes a very fast way to create tomorrow’s incident report.

Travel, HR, analytics, content workflows, and research automation all fit the same pattern. The user’s goal is not a static prompt. It is a negotiated object. Treating the final chat log as the goal is lazy architecture with better branding.

The boundary: this is plan quality, not live ROI

RECAP’s evidence is useful, but it has boundaries.

First, the dataset is synthetic, even though it is human-vetted. Synthetic conversations can capture controlled challenge types, but they may still miss the full weirdness of real enterprise users, who are creative in the same way unattended spreadsheets are creative.

Second, the experiments are text-only. Real business agents often have system state, permissions, documents, visual context, workflow history, user profiles, and external constraints. A production intent layer must bind to those signals, not just a transcript.

Third, the planner is static. It generates DAGs but does not execute them in an environment. Structural and preference metrics are valuable proxies, but they do not fully answer whether two different plans are functionally equivalent, executable, cheaper, safer, or more reliable in production.

Fourth, the main human-annotation experiments use sampled subsets because evaluation is expensive. The broader 810-instance appendix evaluation relies on the fine-tuned judge and softens the DPO:human advantage. That does not invalidate the human test result, but it does prevent a victory-lap interpretation.

The right conclusion is not “DPO rewriters will transform enterprise agents.” The right conclusion is more disciplined: explicit intent rewriting is a promising control point; plan-level evaluation is the right metric; human preference data helps; automated judges help with scale but need calibration; and live execution remains the next proof.

How to ship the idea without embarrassing yourself

A practical rollout would start with a prompt-based Advanced-style rewriter, not a fine-tuned model. The first target is not model glory. It is instrumentation.

A sensible implementation path:

1. Add a rewrite step before planning.
2. Store the rewrite alongside the raw conversation and generated plan.
3. Evaluate plans by human preference on difficult cases: shifted intent, multi-intent, underspecified requests, and long conversations.
4. Track execution outcomes separately: tool failures, reversions, clarification rates, cycle time, and human override frequency.
5. Use LLM judges only after validating them against human labels in the target domain.
6. Fine-tune only once enough preference pairs exist to justify the operational complexity.
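The storage and tracking steps in the path above amount to one log record per planning call. A minimal sketch, assuming JSON Lines as the log format; the field names, `make_trace_record`, and `to_jsonl_line` are hypothetical, and the execution counters simply mirror the outcomes listed above.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def make_trace_record(
    dialogue: List[str],
    rewrite: str,
    plan: Dict[str, List[str]],
    rewriter_version: str,
) -> Dict[str, Any]:
    """Bundle the raw conversation, the rewrite, and the plan it caused."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "rewriter_version": rewriter_version,
        "dialogue": dialogue,
        "rewrite": rewrite,
        "plan": plan,
        "human_preference": None,  # filled in later, if a reviewer compares plans
        "execution": {"tool_failures": 0, "human_overrides": 0, "clarifications": 0},
    }


def to_jsonl_line(record: Dict[str, Any]) -> str:
    # One record per line, so logs can be appended, diffed, and sampled easily.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


record = make_trace_record(
    dialogue=["USER: Italian places?", "USER: Actually, Mexican."],
    rewrite="Recommend Mexican restaurants.",
    plan={"search mexican restaurants": [], "rank results": ["search mexican restaurants"]},
    rewriter_version="advanced-prompt-v1",
)
line = to_jsonl_line(record)
print(line)
```

Keeping the rewriter version on every record is what makes later A/B comparisons and preference-pair mining possible without archaeology.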

The key design rule is that the rewrite must be judged by the plan it causes. A rewrite that sounds concise but causes a bad plan is not good. A rewrite that looks slightly awkward but preserves the true goal is doing its job. This may disappoint people who enjoy polished prose. They will recover.

The real lesson is interface discipline

RECAP’s contribution is not that agents need to “understand users better,” which is true in the same way that aircraft benefit from not colliding with mountains. Its contribution is more specific: agentic planning depends on the quality of the intent representation handed to the planner.

That representation should be explicit, current, constrained, and testable.

The business lesson is therefore not to buy a smarter planner every time an agent makes a bad plan. Sometimes the planner is reasoning from a contaminated input: old goals, clarification text, irrelevant details, and half-resolved ambiguity. Before upgrading the brain, check the briefing.

In agent workflows, the user’s intent is not the transcript. It is the cleaned, current, operational instruction extracted from the transcript. RECAP gives that boring interface a benchmark, an evaluation method, and some evidence that it matters.

Boring interfaces are underrated. They are also where many expensive systems quietly stop being ridiculous.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kushan Mitra, Dan Zhang, Hannah Kim, and Estevam Hruschka, “RECAP: REwriting Conversations for Intent Understanding in Agentic Planning,” arXiv:2509.04472, 2025. https://arxiv.org/pdf/2509.04472
