Teach Me Once, Then Please Stop Calling the API
A familiar enterprise automation story starts with a competent but expensive expert in the loop.
At first, the expert is useful. They interpret messy instructions, break tasks into sensible stages, and recover when something goes wrong. Then the workflow scales. Suddenly the expert is being called for every transaction, every exception, every tiny decision that could probably have been handled by a trained local process. What began as intelligence becomes latency, cost, and operational dependency. Very elegant. Very billable. Not always very deployable.
The SCOPE paper asks a version of that problem for long-horizon text agents: can a large language model teach a smaller planner once, before training, and then disappear from runtime?[1]
The answer, in this paper, is cautiously yes. But the useful lesson is not “small models beat big models.” That would be the cheap headline, and cheap headlines are how dashboards become landfill. The more interesting mechanism is that SCOPE moves LLM intelligence from runtime reasoning into initialization. The LLM does not keep planning during execution. It produces subgoal-decomposition logic from demonstrations, the system uses that logic to pretrain lightweight manager and employee agents, and reinforcement learning then repairs the rough edges through interaction with a world model.
That shift matters because many business AI systems do not actually need a frontier model to deliberate forever. They need a good decomposition of work, a local policy that can execute repeated subtasks, and a feedback loop that prevents the first decomposition from becoming an expensive superstition.
The Problem Is Not Planning Once, But Planning Cheaply Again and Again
SCOPE works in TextCraft, a simplified text-only Minecraft-style environment. The agent receives a final crafting goal, a list of available crafting commands, and an inventory state. It must issue textual actions, such as getting base items and crafting intermediate or final items. Reward arrives only when the final target item is produced.
That makes the environment deliberately simple and deliberately annoying. The task is not natural conversation. It is not web browsing. It is not robotics. It is a compositional planning problem where success depends on getting dependencies in the right order: gather base materials, craft intermediates, then craft the target item.
The difficulty is not that the agent lacks words. The difficulty is that the agent must connect words to a sequence of state changes. A final instruction like “craft lime stained glass pane” hides a dependency chain: collect or create ingredients, combine them in allowed recipes, and avoid actions that look plausible but do not advance the inventory toward the target. Anyone who has watched an agent confidently do the wrong small thing twelve times will appreciate the genre.
LLM-based planners are attractive here because they can propose high-level structure. Prior approaches such as ADaPT use an LLM as part of hierarchical planning during execution. That gives flexibility, but it also keeps the system dependent on repeated model calls. SCOPE takes the opposite bet: use the LLM once to extract subgoal logic from example trajectories, then train smaller agents to operate without ongoing LLM inference.
The distinction is important:
| Design choice | Runtime behavior | Operational consequence |
|---|---|---|
| LLM as repeated planner | The LLM is queried during task execution | More flexible, but higher latency and dependency cost |
| LLM as training-time teacher | The LLM generates subgoal logic once | Cheaper inference, but quality depends on the extracted decomposition |
| No hierarchy | The agent pursues the final goal directly | Simpler architecture, but weaker guidance for long-horizon tasks |
SCOPE is not trying to make the LLM smarter. It is trying to make the LLM’s contribution amortizable.
SCOPE Turns Demonstrations Into Subgoal Machinery
The mechanism starts with successful but suboptimal demonstration trajectories. The paper simulates 500,000 such rollouts, with 1,000 reserved for validation and 1,000 for testing. These trajectories are not treated as perfect plans. They include random actions at a 10% rate to mimic noisy human-like exploration.
From these demonstrations, the LLM is asked to generate a Python subgoal-decomposition function, called $f_{dc}$, and a subgoal-completion function, called $f_{sg}$. This is a neat move. Instead of asking the LLM to output bespoke subgoals for every future task, the authors ask it to produce a reusable procedure for decomposing trajectories into subgoals.
That procedure converts a full trajectory into shorter subtrajectories, each ending when a subgoal is achieved. The resulting training data then supports two policies:
- an employee agent, which learns to execute concrete actions to satisfy a given subgoal;
- a manager agent, which learns to propose the next subgoal toward the ultimate goal.
This is standard hierarchical planning logic, but with an LLM-generated bootstrap. The manager thinks in intermediate objectives. The employee acts in environment commands. The LLM teaches the initial decomposition and then leaves the building.
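To make the idea of a reusable decomposition procedure concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Step` record, the rule that a segment closes at each successful craft action, and the quantities in the toy trajectory. The paper's actual LLM-generated $f_{dc}$ and $f_{sg}$ are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    action: str          # textual command issued by the agent
    inventory: Counter   # inventory AFTER the action was applied


def f_sg(inventory: Counter, subgoal: Counter) -> bool:
    """Subgoal check: inventory holds at least the requested quantity of each item."""
    return all(inventory[item] >= qty for item, qty in subgoal.items())


def f_dc(trajectory: List[Step]) -> List[Tuple[Counter, List[Step]]]:
    """Split a trajectory into (subgoal, subtrajectory) pairs.

    Hypothetical rule: a segment closes at each craft action that increases
    some item count; the subgoal is the resulting count of those items.
    Steps after the last such craft action are dropped.
    """
    segments = []
    current: List[Step] = []
    prev_inv: Counter = Counter()
    for step in trajectory:
        current.append(step)
        if step.action.startswith("craft"):
            gained = {k: v for k, v in step.inventory.items() if v > prev_inv[k]}
            if gained:
                segments.append((Counter(gained), current))
                current = []
        prev_inv = step.inventory
    return segments


# Toy demonstration with made-up quantities.
traj = [
    Step("get 2 birch logs", Counter({"birch logs": 2})),
    Step("craft 4 birch planks using 1 birch logs",
         Counter({"birch logs": 1, "birch planks": 4})),
    Step("craft 4 birch planks using 1 birch logs",
         Counter({"birch planks": 8})),
    Step("craft 2 birch trapdoor using 6 birch planks",
         Counter({"birch planks": 2, "birch trapdoor": 2})),
]
segments = f_dc(traj)
```

Each `(subgoal, subtrajectory)` pair could then serve as employee training data, while the ordered list of subgoals for a given final goal could serve as manager training data.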
The mechanism can be summarized like this:
Demonstration trajectories
↓
One-time LLM-generated decomposition function
↓
Subgoal-labeled training data
↓
Employee pretraining: state + subgoal → action
Manager pretraining: state + final goal → subgoal
↓
World-model-based RL fine-tuning
↓
Hierarchical agent deployed without runtime LLM calls
The key phrase is “without runtime LLM calls.” SCOPE’s agent has only 11.04 million parameters. It uses a variational sequence-to-sequence architecture built around LSTMs, with text signals converted into a common format so both manager and employee can share the same general design. The appendix matters here because it shows that SCOPE is not secretly a giant language model wearing a small hat. It is a lightweight neural planner trained around text-encoded state and action representations.
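The runtime control flow can be sketched as a plain nested loop. This is a toy under stated assumptions: the two policies are hand-written stubs standing in for the trained networks, and the one-ingredient `RECIPES` table and the environment are invented for the example. The only point is that nothing in the loop calls an LLM.

```python
from typing import Dict, List, Tuple

Inventory = Dict[str, int]
Action = Tuple[str, str]  # (verb, item), e.g. ("get", "logs")

# Hypothetical one-ingredient recipes: crafted item -> required ingredient.
RECIPES = {"planks": "logs", "trapdoor": "planks"}


def has(inventory: Inventory, item: str) -> bool:
    """Completion check used here for both subgoals and the final goal."""
    return inventory.get(item, 0) > 0


def manager(inventory: Inventory, goal: str) -> str:
    """Stub manager: walk back from the goal until the ingredient is available."""
    item = goal
    while item in RECIPES and not has(inventory, RECIPES[item]):
        item = RECIPES[item]
    return item


def employee(inventory: Inventory, subgoal: str) -> Action:
    """Stub employee: craft if a recipe exists, otherwise gather the base item."""
    return ("craft", subgoal) if subgoal in RECIPES else ("get", subgoal)


def env_step(inventory: Inventory, action: Action) -> Inventory:
    """Toy environment: infeasible actions leave the inventory unchanged."""
    verb, item = action
    new_inv = dict(inventory)
    if verb == "get" and item not in RECIPES:
        new_inv[item] = new_inv.get(item, 0) + 1
    elif verb == "craft" and item in RECIPES and has(new_inv, RECIPES[item]):
        new_inv[RECIPES[item]] -= 1
        new_inv[item] = new_inv.get(item, 0) + 1
    return new_inv


def run_episode(goal: str, max_subgoals: int = 5, max_actions: int = 3):
    inventory: Inventory = {}
    proposed: List[str] = []
    for _ in range(max_subgoals):          # manager loop
        if has(inventory, goal):
            break
        subgoal = manager(inventory, goal)
        proposed.append(subgoal)
        for _ in range(max_actions):       # employee loop
            if has(inventory, subgoal):
                break
            inventory = env_step(inventory, employee(inventory, subgoal))
    return has(inventory, goal), proposed


success, subgoals = run_episode("trapdoor")
```

In the real system the two stubs would be learned policies, which is exactly why the manager needs to adapt when the employee fails to deliver a subgoal.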
The LLM’s Job Is Not to Be Perfect; It Is to Give the First Useful Cut
The subgoals generated by SCOPE’s one-time LLM process are not necessarily interpretable. The paper is unusually explicit about this. The LLM-derived subgoals may be suboptimal and less explainable than hand-engineered alternatives.
In one example, a demonstration for crafting a birch trapdoor is decomposed differently by the LLM and by a hand-engineered method. The hand-engineered version tracks inventory states after intermediate items are crafted. The LLM version can be looser: it may ask for “birch logs,” then “birch planks,” then “birch trapdoor.” It is not always the cleanest decomposition from a human recipe-graph perspective.
Yet the performance penalty is small in the main comparison. SCOPE with LLM-generated subgoals reaches a 0.56 success rate. The hand-engineered-subgoal variant reaches 0.58. That two-point gap is evidence with a very specific meaning: in this TextCraft setting, the LLM’s decomposition does not need to be perfectly interpretable to be operationally useful.
That is the first business-relevant idea in the paper. Many workflows do not require a beautiful ontology. They require intermediate states that are aligned enough to train a smaller execution system.
There is a trap here, though. “Aligned enough” does not mean “anything goes.” The paper’s later ablations show that vague but aligned subgoals can still help; misleading subgoals can destroy performance. This is the difference between a rough checklist and a wrong checklist. One is annoying. The other sends the intern to buy cement for a software deployment.
The Main Result Is Small, But the Latency Difference Is Not
The headline comparison is straightforward. On TextCraft, ADaPT with GPT-3.5 achieves a 0.52 success rate. SCOPE achieves 0.56 with 11.04 million parameters. It also completes the game in an average of 3.0 seconds on a single NVIDIA A10 GPU, while ADaPT with a GPT-3.5 backend requires 164.4 seconds under ideal network conditions.
Here is the paper’s core evidence in business-readable form:
| System or variant | Success rate | Parameters | Likely purpose of test | Interpretation |
|---|---|---|---|---|
| ADaPT with GPT-3.5 | 0.52 | 175B | Comparison with prior LLM-based hierarchical planner | Runtime LLM planning is strong but expensive |
| SCOPE | 0.56 | 11.04M | Main evidence | One-time LLM initialization plus RL fine-tuning can match or slightly exceed GPT-3.5 ADaPT on TextCraft |
| SCOPE with hand-engineered subgoals | 0.58 | 11.04M | Ablation on subgoal quality | Better, more interpretable subgoals help, but the LLM-generated version is close |
| SCOPE without manager RL fine-tuning | 0.24 | 11.04M | Ablation on manager adaptation | The manager’s RL correction is not decorative; it is central |
The absolute success-rate improvement over GPT-3.5 ADaPT is not huge: 0.56 versus 0.52. The operational difference is larger: 3.0 seconds versus 164.4 seconds. The paper’s practical force comes from the combination, not from either number alone.
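As a quick worked calculation on the reported figures (the inputs are the paper's numbers as quoted above; the ratio is plain arithmetic):

$$\frac{164.4\ \text{s}}{3.0\ \text{s}} = 54.8, \qquad 0.56 - 0.52 = 0.04$$

So the reported trade is roughly a 55-fold reduction in time per game alongside a four-point difference in success rate.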
A fair reading is this: SCOPE does not prove that small neural planners generally dominate LLM planners. It shows that in a structured text-planning environment, a one-time LLM teacher can seed a much smaller hierarchical agent that is competitive in success rate and dramatically cheaper at inference.
That is already enough to be interesting.
The Manager Is the Repair Layer, Not Just a Fancy Scheduler
The most important ablation is not the comparison to ADaPT. It is the “without manager RL fine-tuning” result.
Remove manager-level RL fine-tuning, and success falls to 0.24. That collapse tells us what SCOPE is really doing. The system is not merely executing a fixed recipe list extracted by the LLM. A fixed sequence of subgoals is brittle because the employee is imperfect. It may fail to complete a subgoal, enter an unfamiliar inventory state, collect irrelevant items, or attempt infeasible crafting commands. Once that happens, a static plan becomes a polite fiction.
The RL-finetuned manager learns to compensate. If one subgoal does not work because the employee cannot reliably execute it from the current state, the manager receives no positive feedback and can learn to propose alternatives. The paper describes this as the manager discovering easier, achievable subgoals that compensate for employee limitations.
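The "learn to propose what the employee can actually do" effect can be illustrated with a deliberately tiny stand-in. The sketch below is an epsilon-greedy bandit, not the paper's procedure, which fine-tunes the manager with RL against a world model. The two candidate subgoals and their employee success probabilities are assumed numbers chosen for the example.

```python
import random

# Hypothetical employee reliability for two candidate subgoals (assumed values).
EMPLOYEE_SUCCESS = {
    "craft target directly": 0.2,
    "craft intermediate first": 0.9,
}


def train_manager(episodes: int = 2000, epsilon: float = 0.1, seed: int = 0):
    """Estimate the value of each subgoal from success/failure feedback."""
    rng = random.Random(seed)
    value = {sg: 0.0 for sg in EMPLOYEE_SUCCESS}
    pulls = {sg: 0 for sg in EMPLOYEE_SUCCESS}
    for _ in range(episodes):
        if rng.random() < epsilon:
            subgoal = rng.choice(list(value))      # explore
        else:
            subgoal = max(value, key=value.get)    # exploit current estimate
        # Reward is 1 only if the employee completes the proposed subgoal.
        reward = 1.0 if rng.random() < EMPLOYEE_SUCCESS[subgoal] else 0.0
        pulls[subgoal] += 1
        # Incremental sample-average update.
        value[subgoal] += (reward - value[subgoal]) / pulls[subgoal]
    return value


values = train_manager()
preferred = max(values, key=values.get)
```

The manager never sees the employee's internals. It only observes which proposals get completed, and that is enough for its preference to shift toward the achievable subgoal.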
This is the part managers should notice, for both human and machine reasons. Planning value is not only in decomposing a goal. It is in re-decomposing around actual execution capability.
In enterprise workflows, the same principle appears everywhere:
| Workflow analogy | Employee equivalent | Manager equivalent |
|---|---|---|
| Claims processing | Extract fields, classify documents, check rules | Choose next review step based on missing or inconsistent evidence |
| Procurement automation | Generate purchase order actions | Decide whether to request clarification, match vendor records, or escalate |
| Customer support routing | Draft response or retrieve account details | Select the next subtask when the first resolution path fails |
| Compliance review | Verify a document or transaction | Re-plan the checklist when evidence is incomplete or contradictory |
The lesson is not “use a manager agent because hierarchy sounds sophisticated.” The lesson is: if lower-level automation is imperfect, the high-level planner must learn around those imperfections. Otherwise the first failed subtask poisons the whole chain.
Subgoal Reliability Compounds Across the Plan
The paper’s Figure 7 examines the relationship between employee subgoal success and final goal success. The result is intuitive but worth stating: as subgoal success improves, ultimate goal success improves, and the gains accelerate when starting from a stronger employee.
The authors compare the effect to compounding probabilities. If a task requires several subgoals in sequence, each subgoal’s reliability is applied repeatedly. A small improvement in subgoal success can therefore produce a larger improvement in final completion.
A simplified version of the intuition is:
$$P(\text{final success}) \approx p^{n}$$
where $p$ is the probability that a single subgoal is completed and $n$ is the number of required subgoal completions. This is not the paper’s formal model of the environment; it is a useful mental model for why small execution improvements matter more in long-horizon workflows than they appear to matter in isolated subtasks.
For business systems, this is one of the most practical points in the paper. A field extractor that improves from 90% to 94% may not look transformative when tested alone. But if a downstream workflow chains five such steps across extraction, classification, matching, and validation, and each must succeed, the final completion rate may move much more sharply. Long workflows punish small errors with compound interest. Sadly, unlike finance, the compounding usually works against you.
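A short calculation makes the 90% versus 94% example tangible. It assumes, purely for illustration, that all five steps succeed independently with the same probability, so end-to-end success is that probability raised to the fifth power. Real workflows rarely satisfy this assumption exactly.

```python
def chain_success(p: float, n: int) -> float:
    """End-to-end success if n steps each succeed independently with probability p."""
    return p ** n


before = chain_success(0.90, 5)
after = chain_success(0.94, 5)
print(f"{before:.3f} -> {after:.3f}")  # prints "0.590 -> 0.734"
```

Under these assumptions, a four-point gain per step becomes a gain of about fourteen points end to end.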
Vague Subgoals Are Bad; Misaligned Subgoals Are Worse
The paper’s best diagnostic section studies subgoal quality. It separates two issues that are often blurred in AI workflow design:
- vagueness or lack of specificity;
- misalignment with true environment outcomes.
To test vagueness, the authors remove item quantities from LLM-generated subgoals. A subgoal becomes satisfied once the listed item types appear at least once, regardless of quantity. Success drops to 0.30. A non-hierarchical baseline that pursues the final goal directly reaches 0.28. In other words, vague subgoals still preserve some structure, but much less than standard SCOPE.
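The difference between the standard and the quantity-free check is easy to see in code. This is a hedged sketch: the function names and the example quantities are hypothetical, and the paper's actual completion function is not shown here.

```python
from typing import Dict


def satisfied_with_quantities(inventory: Dict[str, int], subgoal: Dict[str, int]) -> bool:
    """Standard check: every requested item is present in at least the requested amount."""
    return all(inventory.get(item, 0) >= qty for item, qty in subgoal.items())


def satisfied_without_quantities(inventory: Dict[str, int], subgoal: Dict[str, int]) -> bool:
    """Vague check: every requested item type appears at least once."""
    return all(inventory.get(item, 0) >= 1 for item in subgoal)


# Made-up example: the subgoal asks for 6 planks, the agent holds 4.
subgoal = {"birch planks": 6}
inventory = {"birch planks": 4}

strict = satisfied_with_quantities(inventory, subgoal)     # False
vague = satisfied_without_quantities(inventory, subgoal)   # True
```

The vague check declares victory too early, so the employee can be rewarded for a state from which the next crafting step is still infeasible.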
To test misalignment, the authors randomly remap item names in the LLM-generated subgoals while leaving the real task objective unchanged. This breaks the relationship between the subgoal and what actually needs to happen in the environment. At a 25% remapping probability, subgoal success falls to 0.29 and ultimate success falls to 0.09. At full remapping, subgoal success reaches 0.05 and ultimate success 0.02.
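One way such a corruption could be implemented is sketched below. The per-item independent remapping and the function name `remap_subgoal` are assumptions for illustration; the paper's exact remapping procedure may differ.

```python
import random
from typing import Dict, List


def remap_subgoal(subgoal: Dict[str, int], vocabulary: List[str],
                  prob: float, rng: random.Random) -> Dict[str, int]:
    """Replace each item name with a random vocabulary item with probability `prob`."""
    remapped: Dict[str, int] = {}
    for item, qty in subgoal.items():
        new_item = rng.choice(vocabulary) if rng.random() < prob else item
        # Merge quantities if two items collide on the same name.
        remapped[new_item] = remapped.get(new_item, 0) + qty
    return remapped


rng = random.Random(0)
subgoal = {"birch planks": 6, "stick": 2}
vocabulary = ["iron ingot", "wool", "glass"]   # hypothetical item names

unchanged = remap_subgoal(subgoal, vocabulary, prob=0.0, rng=rng)
scrambled = remap_subgoal(subgoal, vocabulary, prob=1.0, rng=rng)
```

The real objective is left untouched, so a remapped subgoal rewards the employee for collecting items that have nothing to do with the target.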
That is a sharper result. Badly aligned subgoals are not merely less helpful than good ones. They can be worse than having no hierarchy at all. The non-hierarchical agent reaches 0.28 ultimate success, while the 25% remapped-subgoal variant falls to 0.09.
This distinction is extremely useful for operational AI:
| Subgoal problem | Paper evidence | Business interpretation |
|---|---|---|
| Less interpretable but aligned | SCOPE 0.56 vs hand-engineered 0.58 | Imperfect intermediate labels can still be valuable |
| Vague but partly aligned | No-quantity variant around 0.30 | Ambiguous milestones weaken guidance but may retain structure |
| Misaligned with outcomes | 25% remapping drops ultimate success to 0.09 | Wrong milestones actively mislead the system |
| No hierarchy | Non-hierarchical variant 0.28 | Sometimes no decomposition is safer than corrupted decomposition |
This is the paper’s most transferable lesson. In business automation, the dangerous failure mode is not always a missing process map. It is a confident process map with the wrong intermediate states.
The Experiment Is Preliminary, and That Word Does Work Here
The paper itself calls the empirical study preliminary. That is not academic modesty; it affects how the result should be used.
TextCraft is structured. Goals are explicit. Actions are textual commands. Completion can be checked from inventory states. The ultimate goal checker is simple: does the inventory contain the target item? Demonstrations are simulated, not collected from messy human operations. The world model is trained from generated trajectories. These choices are reasonable for isolating the mechanism, but they also narrow the claim.
SCOPE directly shows that one-time LLM-guided subgoal decomposition can work in a text-based crafting environment with stable rules, observable states, and trainable completion checks. It does not show that the same architecture will automatically work in open customer conversations, ambiguous legal workflows, volatile supply chains, or physical robotics settings where state observation is partial and actions have irreversible consequences.
The dependency on completion functions is especially important. SCOPE uses $f_{sg}$ to decide whether a subgoal is achieved and $f_{ug}$ to decide whether the ultimate goal is achieved. In TextCraft, those checks are clean. In business workflows, they may require classifiers, validators, database reconciliation, or human review. A subgoal decomposition is only useful if the system can tell when the subgoal has actually been satisfied.
So the correct boundary is not “this only works in games.” That is too dismissive. The better boundary is: this approach is most promising when the workflow has repeatable structure, observable intermediate states, and enough demonstration data to train local policies after LLM-assisted decomposition.
That describes more business processes than one might expect. It also excludes more than a vendor deck would prefer. A rare moment of balance.
The Business Value Is Design-Time Intelligence, Not Model Shrinking for Its Own Sake
The obvious business reading is cost reduction: replace repeated LLM calls with a small local model. That is true, but incomplete. The deeper value is architectural.
SCOPE suggests a pattern:
- use an LLM to infer the structure of work from examples;
- convert that structure into subgoals and completion checks;
- train smaller specialized agents to execute and coordinate those subgoals;
- keep reinforcement or feedback loops to adapt planning around execution errors;
- reserve large-model calls for redesign, exception analysis, or periodic retraining, not every routine step.
This is not a universal recipe. It is a deployment pattern for repeated workflows where the cost of runtime deliberation is disproportionate to the novelty of each case.
For Cognaptus-style business automation, the inference is clear but bounded. The paper supports the idea that expensive LLM cognition can sometimes be shifted upstream into workflow design and training. It does not prove that every business process should be distilled into a small planner. The decision depends on volume, repeatability, state observability, compliance risk, and the cost of wrong intermediate goals.
A practical evaluation checklist would look like this:
| Question | Why it matters for SCOPE-like automation |
|---|---|
| Are tasks repeated with similar structure? | One-time decomposition only pays off if the pattern recurs |
| Are intermediate states observable? | Subgoal completion must be checkable |
| Are demonstrations available? | The LLM needs trajectories from which to infer decomposition logic |
| Can errors be simulated or safely explored? | RL-style refinement needs feedback without dangerous real-world cost |
| Are runtime LLM calls expensive or slow enough to matter? | If not, the architecture may be over-engineering |
| Are subgoals aligned with real outcomes? | Misaligned subgoals can underperform no hierarchy at all |
The paper’s main contribution is therefore not a new slogan about “small models.” It is a more disciplined view of where large models belong in an agent pipeline. Sometimes the LLM should not be the worker. Sometimes it should be the one-time methods consultant whose recommendations are later tested, corrected, and quietly embedded into operations.
Preferably without continuing to invoice by the token.
What to Take From SCOPE
SCOPE is strongest as a mechanism paper. It shows a path for moving LLM guidance from repeated runtime planning into one-time initialization, then using hierarchical reinforcement learning to adapt the resulting planner. Its TextCraft result is modest in success-rate margin but large in efficiency difference: 0.56 success with 11.04 million parameters and 3.0 seconds of inference, compared with GPT-3.5 ADaPT at 0.52 and 164.4 seconds.
The ablations are what make the paper useful. Hand-engineered subgoals improve performance only slightly over LLM-generated ones, suggesting that imperfect but aligned subgoals can be enough. Removing manager RL fine-tuning collapses performance, showing that adaptation around employee imperfections is central. Making subgoals vague hurts. Making them misaligned hurts much more.
For business readers, the message is not that LLM planners are obsolete. It is that runtime LLM dependence is not the only way to use language-model intelligence. In stable, repeated workflows, the better architecture may be to let the LLM teach once, let smaller systems execute often, and let feedback decide whether the taught decomposition actually survives contact with operations.
The old automation fantasy was that a model could reason its way through every task forever. SCOPE points to a quieter possibility: teach the system the shape of the work, train the parts that repeat, and call the expensive brain only when the shape changes.
That is less glamorous than an all-knowing agent. It is also more likely to fit on a budget.
Cognaptus: Automate the Present, Incubate the Future.
[1] Haoye Lu, Pavan Seshadri, and Kaheer Suleman, “SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments,” arXiv:2512.09897, 2025, https://arxiv.org/pdf/2512.09897.