A user starts by asking for Italian restaurants, answers a few clarification questions, then changes their mind and asks for Mexican instead. A human hears the reversal. A planner may hear: pizza, pasta, Italian, Mexican, recommendations, and perhaps a vague invitation to overachieve. Naturally, it may then produce a plan with the confidence of a consultant who attended only half the meeting.
That is the failure mode RECAP studies: not whether a large language model can “understand language” in the abstract, but whether an agentic workflow is given the right operational input before it starts decomposing tasks.[1] The paper’s answer is simple and inconvenient: feeding raw chat history into a planner is often the wrong interface. The planner needs an explicit, current, compact statement of intent.
This is not ordinary summarisation. A summary is for a reader. An intent rewrite is for a planner. The difference matters.
The mechanism is a handoff, not a nicer transcript
RECAP assumes a fairly familiar agent architecture. A chat agent holds the conversation. A planner receives some representation of what the user wants and produces a directed acyclic graph, or DAG, of subtasks. Action agents could then execute those subtasks: search, draft, book, compare, generate, file, or otherwise do the work.
The paper inserts one component between conversation and planner:
Raw USER-AGENT dialogue
↓
Intent rewriter
↓
Explicit current intent
↓
Planner
↓
Task DAG
That middle layer is the point. The rewriter is not there to reduce token count as a housekeeping exercise, although it may do that. It is there to remove stale goals, preserve current constraints, separate multiple intents, and avoid letting the planner mistake the agent’s own clarification questions for the user’s real request. Small distinction. Large blast radius.
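A minimal Python sketch of that handoff may make the wiring concrete. Everything in it is hypothetical: `plan_from_dialogue`, `last_user_turn_rewriter`, and `toy_planner` are illustrative names, not anything from the paper. The stub rewriter stands in for an LLM call and simply keeps the last user turn, which is exactly the shortcut a real rewriter cannot rely on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); speaker is "user" or "agent"


@dataclass
class Plan:
    intent: str          # what the planner was actually asked to plan for
    subtasks: List[str]  # flattened stand-in for a task DAG


def dummy_rewriter(dialogue: List[Turn]) -> str:
    """Hand the raw transcript to the planner unchanged."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in dialogue)


def last_user_turn_rewriter(dialogue: List[Turn]) -> str:
    """Placeholder for an LLM rewriter (assumption): keep only the last user turn."""
    user_turns = [text for speaker, text in dialogue if speaker == "user"]
    return user_turns[-1] if user_turns else ""


def toy_planner(intent: str) -> List[str]:
    """Placeholder planner (assumption): a real one would return a DAG."""
    return [f"search for: {intent}", "rank candidates", "present recommendations"]


def plan_from_dialogue(
    dialogue: List[Turn],
    rewriter: Callable[[List[Turn]], str],
    planner: Callable[[str], List[str]],
) -> Plan:
    """The planner only ever sees the rewriter's output, never the raw turns."""
    intent = rewriter(dialogue)
    return Plan(intent=intent, subtasks=planner(intent))


dialogue = [
    ("user", "Find me an Italian restaurant nearby."),
    ("agent", "Any budget in mind?"),
    ("user", "Mexican restaurants under $30 per person instead."),
]
plan = plan_from_dialogue(dialogue, last_user_turn_rewriter, toy_planner)
```

The design point is the function signature rather than the stubs: swapping `dummy_rewriter` for another rewriter changes what the planner sees without touching the planner itself.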
The authors compare three prompt-based rewrite styles:
| Rewriter | What it gives the planner | Operational risk |
|---|---|---|
| Dummy | The raw dialogue, unchanged | Planner must infer latest intent from all turns, including stale and distracting ones |
| Basic | A generic summary of the conversation | May compress away the very details that determine the plan |
| Advanced | A concise instruction-style statement of the user’s latest intended task | Better aligned with planning, but can still misread subtle “fake shifts” |
The Advanced rewriter is explicitly prompted to capture the final user goal, account for true intent shifts, preserve necessary specifications, filter noise, and avoid introducing unsupported information. That is already more than “summarise this chat.” It is closer to a contract: here is the task the planner should plan for, and here is what no longer matters.
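To see how those five requirements might be packed into a prompt, here is a hedged sketch. The wording of `REWRITE_REQUIREMENTS` and the `build_rewrite_prompt` helper are my own illustration, not the paper's actual prompt; only the five requirements themselves come from the description above.

```python
from typing import List, Tuple

# Illustrative phrasing (assumption): one rule per requirement listed in the text.
REWRITE_REQUIREMENTS = [
    "State the user's final goal as a single, concise instruction.",
    "If the user genuinely changed their mind, keep only the latest goal.",
    "Preserve every specification the planner needs to act on the goal.",
    "Drop small talk, stale requests, and the agent's own clarification questions.",
    "Do not add information that the conversation does not support.",
]


def build_rewrite_prompt(dialogue: List[Tuple[str, str]]) -> str:
    """Assemble a rewrite prompt from (speaker, text) turns."""
    transcript = "\n".join(f"{speaker.upper()}: {text}" for speaker, text in dialogue)
    rules = "\n".join(
        f"{number}. {rule}" for number, rule in enumerate(REWRITE_REQUIREMENTS, start=1)
    )
    return (
        "Rewrite the conversation below as the task a planner should plan for.\n"
        f"Rules:\n{rules}\n\n"
        f"Conversation:\n{transcript}\n\n"
        "Rewritten intent:"
    )


prompt = build_rewrite_prompt(
    [
        ("user", "Italian restaurants near me?"),
        ("agent", "Any price range?"),
        ("user", "Actually, Mexican instead."),
    ]
)
```

Keeping the rules in a list rather than a single string makes each requirement individually testable and easy to ablate.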
RECAP tests whether the handoff changes the plan
The benchmark itself is built around the situations where agent planning usually becomes brittle. RECAP contains 810 validated USER-AGENT conversation instances, generated synthetically and then vetted, across five domains: cooking, programming, health, flights, and restaurants. It balances three conversation lengths and five intent categories: shifted intent, noisy input, underspecified intent, multi-intent, and perfect intent.
That design choice is important. A benchmark with only crisp single-turn tasks would mostly test whether the model can follow instructions. RECAP is trying to test the messier interface problem: what happens when the user’s desired task is distributed across a conversation, partially revised, and sometimes blurred by irrelevant turns.
The paper evaluates the output downstream, not by asking whether the rewrite sounds elegant. Each rewrite is fed into a static GPT-4o planner at temperature 0, which generates a DAG of subtasks. The resulting plans are compared using structural metrics, semantic distance, and preference judgments.
The preference rubric is practical. Evaluators judge whether a plan reflects the latest intent, avoids fabricated or redundant steps, has appropriate granularity, is complete, and follows a logical order. In other words, the evaluation asks whether the plan is useful, not whether the intermediate text has a charming personality. Blessedly.
The first result: planners are input-sensitive
The sensitivity experiment is the paper’s foundation. The authors compare planning from raw dialogue against planning from Advanced rewrites on two datasets: a 70-instance IN3 subset and a 70-instance RECAP-toy set. This is the main evidence, not an ablation. Its purpose is to show whether the planner’s output changes meaningfully when the intent representation changes.
On RECAP-toy, human annotators prefer plans generated from Advanced rewrites over raw dialogue, and the advantage grows with conversation length:
| Conversation length | Dummy preferred | Tie | Advanced preferred |
|---|---|---|---|
| Short | 26.67% | 23.33% | 50.00% |
| Medium | 20.00% | 20.00% | 60.00% |
| Long | 16.67% | 20.00% | 63.33% |
That pattern is the business story in miniature. The longer the conversation, the more damage stale or ambiguous context can do. A planner that looks competent on short prompts may degrade when it is handed a transcript containing reversals, clarifications, side constraints, and conversational lint.
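The widening gap can be read directly off the table above. A small sketch of the arithmetic, using only the reported percentages (the `advantage` helper is an illustrative name):

```python
# Preference shares from the RECAP-toy table: (dummy, tie, advanced), in percent.
preferences = {
    "Short": (26.67, 23.33, 50.00),
    "Medium": (20.00, 20.00, 60.00),
    "Long": (16.67, 20.00, 63.33),
}


def advantage(shares):
    """Percentage-point margin of Advanced over Dummy, ignoring ties."""
    dummy, _tie, advanced = shares
    return round(advanced - dummy, 2)


margins = {length: advantage(shares) for length, shares in preferences.items()}
```

The margin rises from roughly 23 percentage points on short conversations to 40 on medium and roughly 47 on long ones.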
The same experiment on IN3 produces many more ties. The authors interpret this as evidence that IN3 is less sensitive to the rewrite problem because it does not surface the same range of evolving intent behaviours. In plain terms: a benchmark can be too tidy to reveal where production agents actually trip.
The robustness check then repeats the sensitivity analysis with LLaMA 3.3-70B and o3-mini planners. The paper reports similar sensitivity trends. That does not prove every planner in the wild behaves identically, but it weakens the lazy objection that the result is only a GPT-4o quirk.
Generic summaries are not enough
The most useful misconception to kill here is that “intent rewriting” means “summarise the conversation before planning.” RECAP’s prompt-based comparison says otherwise.
On the RECAP train split, Advanced rewrites achieve the highest overall win rate against the other prompt-based rewrite strategies:
| Rewriter | Win rate | Tie rate | Loss rate |
|---|---|---|---|
| Dummy | 16.67% | 56.19% | 27.14% |
| Basic | 13.81% | 59.52% | 26.67% |
| Advanced | 31.90% | 59.52% | 8.57% |
The details matter more than the aggregate. Advanced is strongest on shifted intent and multi-intent cases, where the planner needs to know what changed and which goals are still active. For shifted intent, Advanced reaches a 50.00% win rate with only a 4.76% loss rate. For multi-intent, it reaches 40.48% wins and 7.14% losses.
Basic summarisation performs poorly in shifted-intent cases. This is predictable: generic summaries tend to retain the conversational arc rather than the operational target. They may describe what happened, not what should now be done. A planner does not need a memoir. It needs a work order.
There is one useful counterexample. In underspecified-intent cases, Basic slightly beats Advanced on win rate, 20.00% versus 17.50%, although Advanced has a lower loss rate. The paper attributes this to “fake intent shifts”: cases where the user appears to start a new request but is actually refining the original one. If a rewriter is too eager to detect a shift, it can misclassify refinement as replacement.
That caveat is not a weakness of the paper; it is one of its best product lessons. Intent rewriting is not “always prefer the latest noun phrase.” It needs to distinguish reversal, refinement, elaboration, and distraction. Otherwise the cleaner interface becomes a very tidy way to be wrong.
The plans change structurally, not just cosmetically
The structural results explain why this is more than wording preference. Plans generated from different rewrite inputs diverge in node counts, edge counts, graph edit distance, and semantic distance. The largest graph edit distance appears between Basic and Advanced rewrites:
| Plan comparison | Δ nodes | Δ edges | Graph edit distance | Semantic distance |
|---|---|---|---|---|
| Dummy vs Basic | 1.68 | 2.18 | 4.99 | 0.10 |
| Dummy vs Advanced | 1.70 | 2.36 | 5.56 | 0.11 |
| Basic vs Advanced | 1.87 | 2.49 | 6.44 | 0.11 |
This is implementation evidence. It shows that the planner is not merely producing differently worded versions of the same plan. The rewrite changes the decomposition: which subtasks appear, how they depend on each other, and what the system may later execute.
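For intuition about what "structural divergence" means, the sketch below compares two toy plans. It is an assumption-laden simplification: `PlanGraph`, `make_plan`, and `structural_diff` are hypothetical names, and `label_edit_proxy` is a crude label-based count, not the graph edit distance the paper reports.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Edge = Tuple[str, str]  # (prerequisite, dependent)


class PlanGraph(NamedTuple):
    nodes: FrozenSet[str]
    edges: FrozenSet[Edge]


def make_plan(edges: Iterable[Edge]) -> PlanGraph:
    """Build a plan graph from its dependency edges (isolated nodes are not modelled)."""
    edge_set = frozenset(edges)
    nodes = frozenset(node for edge in edge_set for node in edge)
    return PlanGraph(nodes, edge_set)


def structural_diff(a: PlanGraph, b: PlanGraph) -> dict:
    """Compare two plans by size and by labelled content."""
    return {
        "delta_nodes": abs(len(a.nodes) - len(b.nodes)),
        "delta_edges": abs(len(a.edges) - len(b.edges)),
        # Proxy only: nodes and edges present in one plan but not the other.
        "label_edit_proxy": len(a.nodes ^ b.nodes) + len(a.edges ^ b.edges),
    }


# Plan from a raw transcript that still carries the stale Italian request.
plan_from_raw = make_plan(
    [("search_italian", "rank"), ("search_mexican", "rank"), ("rank", "recommend")]
)
# Plan from a rewrite that keeps only the current intent.
plan_from_rewrite = make_plan([("search_mexican", "rank"), ("rank", "recommend")])

diff = structural_diff(plan_from_raw, plan_from_rewrite)
```

Even in this toy case the stale intent shows up as an extra subtask and an extra dependency, which is the kind of difference that later becomes an extra tool call.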
For enterprise workflows, that is the uncomfortable part. A bad rewrite does not stay in the language layer. It turns into tasks, dependencies, API calls, file writes, recommendations, emails, bookings, or approvals. The mistake becomes operational.
Preference learning helps, but the supervision source matters
The paper then trains DPO-based rewriters. This is not a separate thesis; it is an extension of the same mechanism. If the system can observe which plans humans prefer, it can trace those plans back to the rewrites that produced them and fine-tune the rewriter accordingly.
There are two trained variants:
| Trained rewriter | Supervision source | Purpose |
|---|---|---|
| DPO:human | Human plan preferences | Test whether human-aligned downstream preference improves rewrites |
| DPO:LLM | LLM-judge plan preferences | Test whether cheaper synthetic preference labels can scale the same idea |
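The tracing step, from preferred plan back to the rewrite that produced it, can be sketched as follows. This is a hypothetical illustration: `PlanComparison` and `to_dpo_pairs` are invented names, the prompt/chosen/rejected layout is a common convention in preference-tuning tooling rather than something confirmed for this paper, and dropping ties is my assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlanComparison:
    dialogue: str   # the conversation both rewrites were derived from
    rewrite_a: str
    rewrite_b: str
    preferred: str  # "a", "b", or "tie"; a judgement on the PLANS, not the rewrites


def to_dpo_pairs(comparisons: List[PlanComparison]) -> List[Dict[str, str]]:
    """Turn plan-level preferences into rewrite-level training pairs."""
    pairs = []
    for comparison in comparisons:
        if comparison.preferred == "tie":
            continue  # assumption: ties carry no preference signal and are skipped
        if comparison.preferred == "a":
            chosen, rejected = comparison.rewrite_a, comparison.rewrite_b
        elif comparison.preferred == "b":
            chosen, rejected = comparison.rewrite_b, comparison.rewrite_a
        else:
            raise ValueError(f"unknown preference label: {comparison.preferred!r}")
        pairs.append(
            {"prompt": comparison.dialogue, "chosen": chosen, "rejected": rejected}
        )
    return pairs


comparisons = [
    PlanComparison("user: Italian... user: Mexican instead", "Find Italian and Mexican food", "Find Mexican restaurants", "b"),
    PlanComparison("user: book a flight", "Book a flight", "Book the flight", "tie"),
]
pairs = to_dpo_pairs(comparisons)
```

The label is attached to the rewrite only by inheritance from its plan, which is why the supervision source behind the plan preference matters so much.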
On the held-out RECAP test set, DPO:human outperforms the Advanced prompt rewriter in the direct human evaluation:
| Rewriter compared with Advanced | Win rate | Tie rate | Loss rate |
|---|---|---|---|
| DPO:human | 48.88% | 28.90% | 22.22% |
| DPO:LLM | 28.88% | 33.33% | 37.78% |
The result is strongest for DPO:human. It wins across nearly all challenge categories, including shifted intent, noisy input, multi-intent, underspecified intent, and perfect intent. DPO:LLM is more mixed. It is competitive in some categories but does not consistently beat Advanced.
That distinction should survive into any business interpretation. Preference learning looks promising, but the cheap label path is not automatically equivalent to human supervision. Shocking, yes: the shortcut is not free.
The evaluator is useful, not omniscient
To make preference evaluation scalable, the authors also train LLM-as-judge models to predict human plan preferences. This is an implementation detail with strategic implications: if every rewrite experiment requires human comparison, iteration becomes slow and expensive.
Zero-shot LLM evaluators perform weakly. Fine-tuning improves them substantially. The best model, fine-tuned GPT-4.1, reaches 65.01% test accuracy and 0.65 F1, compared with a zero-shot GPT-4.1 baseline at 45.00% accuracy and 0.46 F1.
That is useful, but not magical. A 65% preference predictor can support triage, regression testing, and experiment filtering. It should not become the Supreme Court of Plan Quality. The paper’s own appendix makes this clear: when the best fine-tuned evaluator is used to compare DPO:human and Advanced across the entire 810-instance RECAP dataset, preferences are largely neutral overall, with 26.80% wins for DPO:human, 44.53% ties, and 28.67% losses.
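Checking a judge against human labels is simple to instrument. The sketch below computes accuracy and a macro-averaged F1 over three preference labels; the label set, the `accuracy` and `macro_f1` helpers, and the choice of macro averaging are assumptions for illustration, since the averaging behind the reported F1 is not stated here.

```python
from typing import Sequence

LABELS = ("a", "b", "tie")  # assumed label set: plan A preferred, plan B preferred, tie


def accuracy(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of comparisons where the judge agrees with the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def macro_f1(human: Sequence[str], judge: Sequence[str], labels=LABELS) -> float:
    """Unweighted mean of per-label F1, with F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        tp = sum(h == label and j == label for h, j in zip(human, judge))
        fp = sum(h != label and j == label for h, j in zip(human, judge))
        fn = sum(h == label and j != label for h, j in zip(human, judge))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


human_labels = ["a", "b", "tie", "a", "b", "a"]
judge_labels = ["a", "b", "a", "a", "tie", "b"]

judge_accuracy = accuracy(human_labels, judge_labels)
judge_macro_f1 = macro_f1(human_labels, judge_labels)
```

Reporting both numbers per domain, rather than one pooled score, makes it easier to spot where the judge is only fit for triage.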
This is why the safe reading is nuanced:
| Result | What it supports | What it does not prove |
|---|---|---|
| Advanced beats raw dialogue on RECAP-toy, especially in longer conversations | Explicit intent formulation improves downstream planning under conversational complexity | Every agent workflow will see the same gain |
| Advanced beats Dummy and Basic on most RECAP train challenges | Task-aware rewriting is different from generic summarisation | Prompting alone solves intent understanding |
| DPO:human beats Advanced on held-out human evaluation | Human preference signals can improve rewrite utility | Human-trained rewriters always dominate at full scale |
| Fine-tuned LLM judges improve over zero-shot judges | Automated plan preference evaluation can help scale iteration | LLM judges can replace human evaluation without audit |
The paper is most valuable when read as an interface study, not a leaderboard announcement.
The business move is to make intent observable
For business systems, the main inference is straightforward: put an explicit intent layer between conversation and planning, then evaluate that layer through downstream plan quality.
That sounds simple until one asks what the layer should actually output. A production version should not merely emit a sentence. It should create a small, inspectable artefact with fields such as:
| Field | Why it matters |
|---|---|
| Current active intent(s) | Prevents stale requests from leaking into the plan |
| Superseded intent(s) | Makes reversals auditable |
| Hard constraints | Captures budget, time, permissions, jurisdiction, format, and other non-negotiables |
| Soft preferences | Allows ranking without over-constraining |
| Open questions | Prevents the planner from inventing missing facts |
| Multi-intent grouping | Separates parallel tracks from dependent subtasks |
| Evidence turns | Lets reviewers trace why the rewrite says what it says |
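One way to make those fields concrete is a small typed record. The sketch below is an assumption about shape, not a schema from the paper: `IntentArtefact`, its field names, and the `is_plannable` rule are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class IntentArtefact:
    active_intents: List[str]
    superseded_intents: List[str] = field(default_factory=list)
    hard_constraints: List[str] = field(default_factory=list)
    soft_preferences: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)
    # Each group lists indices into active_intents that belong to one track.
    intent_groups: List[List[int]] = field(default_factory=list)
    # Indices of the dialogue turns that justify the rewrite.
    evidence_turns: List[int] = field(default_factory=list)

    def is_plannable(self) -> bool:
        """Assumed gate: plan only when a goal exists and nothing is unresolved."""
        return bool(self.active_intents) and not self.open_questions

    def to_json(self) -> str:
        """Serialise with stable key order so artefacts can be logged and diffed."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


artefact = IntentArtefact(
    active_intents=["Recommend Mexican restaurants"],
    superseded_intents=["Recommend Italian restaurants"],
    hard_constraints=["Under $30 per person"],
    open_questions=["Which neighbourhood?"],
    evidence_turns=[0, 4, 6],
)
```

In this example `is_plannable()` is false because a question is still open, so a caller following this assumed rule would route back to the chat agent instead of the planner.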
This is where the paper becomes commercially useful. Most teams trying to build agents focus on planner sophistication: better decomposition, better tool routing, better orchestration, better retries. All sensible. But RECAP suggests that some planning failures are upstream. The planner is not necessarily stupid; it may simply be receiving a bad contract.
The rewrite layer also improves observability. Raw chat logs are noisy. Full plan DAGs can be too operationally detailed. A compact intent artefact is easier to log, inspect, diff, version, and A/B test. It becomes the thing product managers, QA teams, compliance reviewers, and workflow owners can actually read without needing a ceremonial sacrifice to the token gods.
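As a rough illustration of the "diff" claim, two versions of an intent record can be compared with the standard library alone. The dictionary keys and the `diff_intents` helper are hypothetical; only `json.dumps` and `difflib.unified_diff` are real library calls.

```python
import difflib
import json
from typing import List


def diff_intents(old: dict, new: dict) -> List[str]:
    """Line-level unified diff of two intent records serialised with sorted keys."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="before", tofile="after", lineterm=""
        )
    )


before = {
    "active_intents": ["recommend Italian restaurants"],
    "superseded_intents": [],
}
after = {
    "active_intents": ["recommend Mexican restaurants"],
    "superseded_intents": ["recommend Italian restaurants"],
}

changes = diff_intents(before, after)
```

A reviewer reading `changes` sees the reversal as a handful of added and removed lines, which is considerably less work than rereading the transcript.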
Where this applies first
The strongest near-term use cases are not high-stakes autonomous execution. They are workflows where user intent evolves across turns and the plan can be reviewed, constrained, or partially executed.
Customer support is an obvious candidate. A user starts with a refund question, adds delivery context, mentions a damaged product, then decides they want replacement rather than refund. The rewrite layer should prevent the planner from opening the wrong workflow.
Procurement and internal operations are another candidate. A requester begins with “find vendors,” then adds budget, region, compliance constraints, and a preference for existing suppliers. A raw transcript can easily produce a bloated plan. An explicit intent can separate sourcing, compliance review, and approval routing.
Coding agents have the same issue. A developer asks for a bug fix, then narrows the scope, rejects a library, adds test requirements, and asks not to touch a module. The intent rewrite should make those constraints explicit before a planner generates file edits. Otherwise “agentic coding” becomes a very fast way to create tomorrow’s incident report.
Travel, HR, analytics, content workflows, and research automation all fit the same pattern. The user’s goal is not a static prompt. It is a negotiated object. Treating the final chat log as the goal is lazy architecture with better branding.
The boundary: this is plan quality, not live ROI
RECAP’s evidence is useful, but it has boundaries.
First, the dataset is synthetic, even though it is human-vetted. Synthetic conversations can capture controlled challenge types, but they may still miss the full weirdness of real enterprise users, who are creative in the same way unattended spreadsheets are creative.
Second, the experiments are text-only. Real business agents often have system state, permissions, documents, visual context, workflow history, user profiles, and external constraints. A production intent layer must bind to those signals, not just a transcript.
Third, the planner is static. It generates DAGs but does not execute them in an environment. Structural and preference metrics are valuable proxies, but they do not fully answer whether two different plans are functionally equivalent, executable, cheaper, safer, or more reliable in production.
Fourth, the main human-annotation experiments use sampled subsets because evaluation is expensive. The broader 810-instance appendix evaluation relies on the fine-tuned judge and softens the DPO:human advantage. That does not invalidate the human test result, but it does prevent a victory-lap interpretation.
The right conclusion is not “DPO rewriters will transform enterprise agents.” The right conclusion is more disciplined: explicit intent rewriting is a promising control point; plan-level evaluation is the right metric; human preference data helps; automated judges help with scale but need calibration; and live execution remains the next proof.
How to ship the idea without embarrassing yourself
A practical rollout would start with a prompt-based Advanced-style rewriter, not a fine-tuned model. The first target is not model glory. It is instrumentation.
A sensible implementation path:
- Add a rewrite step before planning.
- Store the rewrite alongside the raw conversation and generated plan.
- Evaluate plans by human preference on difficult cases: shifted intent, multi-intent, underspecified requests, and long conversations.
- Track execution outcomes separately: tool failures, reversions, clarification rates, cycle time, and human override frequency.
- Use LLM judges only after validating them against human labels in the target domain.
- Fine-tune only once enough preference pairs exist to justify the operational complexity.
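The storage and tracking steps in that list might look like the sketch below: one JSON line per planning run, with evaluation and execution outcomes kept in separate fields. `PlanningRecord`, its field names, and `append_record` are hypothetical, chosen only to mirror the list.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, TextIO


@dataclass
class PlanningRecord:
    conversation_id: str
    raw_dialogue: List[str]
    rewrite: str
    plan_steps: List[str]
    rewriter_version: str
    # Plan-quality signals, filled in later by reviewers or a validated judge.
    human_preference: Optional[str] = None
    # Execution outcomes, tracked separately from plan preference.
    tool_failures: int = 0
    clarification_turns: int = 0
    human_override: bool = False
    tags: List[str] = field(default_factory=list)  # e.g. "shifted", "multi-intent"


def append_record(stream: TextIO, record: PlanningRecord) -> None:
    """Write one record as a single JSON line."""
    stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


log = io.StringIO()  # stand-in for a file or log sink
append_record(
    log,
    PlanningRecord(
        conversation_id="conv-001",
        raw_dialogue=["user: Italian?", "agent: Budget?", "user: Mexican instead."],
        rewrite="Recommend Mexican restaurants.",
        plan_steps=["search", "rank", "recommend"],
        rewriter_version="advanced-prompt-v1",
        tags=["shifted"],
    ),
)
stored = json.loads(log.getvalue().splitlines()[0])
```

Recording `rewriter_version` alongside each plan is what later makes A/B comparisons and preference-pair collection possible without re-running old conversations.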
The key design rule is that the rewrite must be judged by the plan it causes. A rewrite that sounds concise but causes a bad plan is not good. A rewrite that looks slightly awkward but preserves the true goal is doing its job. This may disappoint people who enjoy polished prose. They will recover.
The real lesson is interface discipline
RECAP’s contribution is not that agents need to “understand users better,” which is true in the same way that aircraft benefit from not colliding with mountains. Its contribution is more specific: agentic planning depends on the quality of the intent representation handed to the planner.
That representation should be explicit, current, constrained, and testable.
The business lesson is therefore not to buy a smarter planner every time an agent makes a bad plan. Sometimes the planner is reasoning from a contaminated input: old goals, clarification text, irrelevant details, and half-resolved ambiguity. Before upgrading the brain, check the briefing.
In agent workflows, the user’s intent is not the transcript. It is the cleaned, current, operational instruction extracted from the transcript. RECAP gives that boring interface a benchmark, an evaluation method, and some evidence that it matters.
Boring interfaces are underrated. They are also where many expensive systems quietly stop being ridiculous.
Cognaptus: Automate the Present, Incubate the Future.
[1] Kushan Mitra, Dan Zhang, Hannah Kim, and Estevam Hruschka, “RECAP: REwriting Conversations for Intent Understanding in Agentic Planning,” arXiv:2509.04472, 2026. https://arxiv.org/pdf/2509.04472