A map query is easy: get me from A to B.

A service request is harder: leave after lunch, avoid tolls, find a charging station before the battery becomes theatrical, stop somewhere quiet for dinner, and make sure the restaurant is still open when we arrive.

Every additional clause turns a lookup into a sequence of commitments. Locations must be resolved. Routes must be calculated. Opening hours, traffic, weather, prices, and travel times must remain mutually consistent. An incorrect essay can still sound intelligent. An incorrect itinerary can leave someone beside a closed charging station.

Alibaba Amap’s STAgent is designed for precisely this less glamorous form of intelligence: reasoning that must remain attached to time, geography, and tool evidence.[1] The model operates with ten tools covering maps, transport, weather, and information retrieval. It is trained on data distilled from approximately 30 million anonymized user queries and evaluated on both travel-specific and general benchmarks.

The interesting contribution, however, is not simply that Amap trained a travel agent.

It is the mechanism used to decide which interactions are worth teaching, which are worth practising through reinforcement learning, and which should be discarded before they consume more compute.

A common assumption is that a specialized agent improves mainly by receiving more domain logs or by starting from a larger model. STAgent offers a more operational answer: raw logs must first become a structured curriculum, and that curriculum must change according to what the current model can already do.

A log warehouse is not a lesson plan. It is merely a warehouse.

The complete pipeline begins before model training

STAgent is presented as an end-to-end pipeline rather than a single training technique:

30 million historical queries
→ Intent taxonomy and constraint annotation
→ Lexical, semantic, and geometric filtering
→ Approximately 200,000 candidate queries
→ Seed supervised fine-tuning
→ Capability probing on the current model
  • Higher-certainty trajectories → further SFT
  • Harder, learnable queries → reinforcement learning
→ STAgent

The sequence matters.

Training an agent before stabilizing its tools creates noisy trajectories. Curating prompts without understanding the tool environment produces attractive but unexecutable tasks. Selecting difficult examples without measuring the model’s present ability wastes supervision on problems that are either already solved or currently hopeless.

STAgent therefore treats environment construction, data selection, and post-training as connected parts of one system.

STAgent first makes the operational world replayable

The model interacts with ten specialized tools divided into four categories:

| Category | Example capabilities |
| --- | --- |
| Map and navigation | Search places, calculate routes, find locations along a route, identify central meeting points |
| Travel | Search flights and train schedules |
| Weather | Retrieve current conditions and multi-day forecasts |
| Information retrieval | Search for policies, events, and other open-domain information |

These tools are wrapped through FastMCP to standardize invocation formats. Their outputs are converted into structured natural language, giving the model a more consistent interface for interpreting results. The sandbox also uses normalized parameters and an LRU cache to reduce repeated API calls during large-scale training.

This infrastructure is not decorative plumbing. Reinforcement learning requires the model to generate many trajectories, call tools repeatedly, receive observations, and obtain rewards. An unstable tool response may punish a correct action. A slow API may turn experimentation into an expensive queueing system. A changing schema may teach yesterday’s syntax very efficiently.

The paper reports that its asynchronous ROLL infrastructure improved training efficiency by nearly 80% relative to the comparison framework used by the team. It does not provide enough cost or systems detail to independently assess that figure, but the architectural point is sound: an agent’s training environment must be treated as part of the model.

For businesses, this reverses the usual order of enthusiasm. The first question is not, “Which model should we fine-tune?” It is, “Can the operational environment produce repeatable, auditable interactions at training scale?”

Without that, the model is learning from weather. Occasionally literal weather.

Thirty million queries are not thirty million useful lessons

Amap begins with anonymized queries collected over a three-month period. The raw volume is approximately 30 million. After curation, the candidate pool contains roughly 200,000 prompts—less than 1% of the original data.

That reduction is not an incidental cleaning step. It is one of the paper’s central design decisions.

Real user logs are dominated by common requests, repeated wording, ambiguous fragments, and tasks requiring little reasoning. Simply sampling from the observed distribution would teach the model popular behaviours repeatedly while leaving rare but operationally important cases underrepresented.

STAgent addresses this through a hierarchical intent taxonomy containing five primary categories, 16 second-level categories, and 30 fine-grained leaf nodes. The top-level categories include rules and policies, discovery, planning and decision, dynamic information, and application interaction.

The taxonomy is developed iteratively. Human-curated seed prompts are labelled, an LLM proposes categories, experts review ambiguity and missing coverage, and an “Other” category is used to identify emerging intents from the wider log distribution.

Each query is then annotated across multiple dimensions:

  • its primary intent;
  • supporting secondary intents;
  • explicit constraints such as location, time, vehicle type, or budget;
  • its likely execution difficulty.

This makes it possible to distinguish two requests that appear similar linguistically but demand very different agent behaviour. “Find a restaurant” is retrieval. “Find a quiet restaurant along my route that remains open after I finish a meeting” is constraint-aware planning with temporal verification.
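One way to picture the annotation is as a small record per query. The sketch below is an assumption about shape, not the paper's schema: the class name, field names, constraint keys, and the difficulty values are all illustrative. Only the intent labels are borrowed from the top-level categories listed above.

```python
from dataclasses import dataclass, field


@dataclass
class QueryAnnotation:
    """Hypothetical per-query record covering the four annotated dimensions."""
    text: str
    primary_intent: str
    secondary_intents: list[str] = field(default_factory=list)
    constraints: dict[str, str] = field(default_factory=dict)
    difficulty: int = 1  # illustrative value, not a score from the paper

    def needs_constraint_checks(self) -> bool:
        # A query with explicit constraints must be verified, not just retrieved.
        return bool(self.constraints)


simple = QueryAnnotation(text="Find a restaurant", primary_intent="discovery")

planned = QueryAnnotation(
    text="Find a quiet restaurant along my route that remains open "
         "after I finish a meeting",
    primary_intent="planning and decision",
    secondary_intents=["discovery", "dynamic information"],
    constraints={
        "ambience": "quiet",
        "location": "along the current route",
        "time": "open after the meeting ends",
    },
    difficulty=3,  # illustrative value
)
```

The two example queries share most of their vocabulary, yet only `planned` carries constraints that force route, time, and opening-hours checks.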

Redundancy is removed three times

The paper applies a funnel with three forms of filtering.

First, locality-sensitive hashing removes lexical duplicates and near-duplicates. Second, embedding-based similarity search removes semantically repetitive examples within intent categories. Third, K-Center-Greedy selection preserves representative but distant points in embedding space, helping retain long-tail and corner-case queries.

| Filtering stage | What it removes | Why it matters |
| --- | --- | --- |
| Lexical filtering | Literal and near-literal repetition | Avoids paying to teach the same wording repeatedly |
| Semantic filtering | Different phrasings of essentially identical tasks | Increases behavioural diversity within each intent |
| Geometric selection | Dense clusters of similar examples | Preserves unusual and boundary cases |
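The geometric stage is the least familiar of the three, so a small sketch may help. This is the textbook greedy k-center heuristic on toy two-dimensional points, assuming Euclidean distance. The paper's embedding model, distance metric, and selection budget are not specified here, and the function name is illustrative.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedily pick k indices, each time taking the point farthest
    from everything already selected."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected centre so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected


# Toy "embeddings": a dense cluster near the origin plus two outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (0.05, 0.05), (-4.0, 3.0)]
picked = k_center_greedy(points, k=3)
```

With these points, the selection keeps one representative of the dense cluster and both outliers, which is the behaviour the table describes: crowded regions are thinned while boundary cases survive.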

The business lesson is not that every company needs precisely these algorithms. It is that domain logs become strategically valuable only after the organization can describe what behaviours they cover, which constraints they contain, and what additional learning each example may provide.

Volume measures storage. Coverage measures potential.

Static difficulty is useful, but model-relative difficulty decides what to teach

STAgent initially assigns each query a difficulty score from -1 to 5. The score reflects three dimensions:

  1. the cognitive load involved in selecting the correct tools;
  2. the depth and dependency structure of the execution chain;
  3. the density of spatial, temporal, and preference constraints.

A direct weather lookup may receive a low score. A multi-city itinerary with budget, timing, and transport constraints receives a higher one. Queries that cannot be completed with available information or tools receive a score of -1, while requests requiring no domain tool receive 0.

The paper’s difficulty-distribution figure shows that simple attribute queries concentrate at lower difficulty levels, while long-trip planning contains more high-difficulty examples. This is primarily a dataset-characterization check: it demonstrates that the scoring system produces intuitively plausible distributions. It does not prove that the scoring method improves the final model.

More importantly, STAgent does not stop at this static definition of difficulty.

A problem may be difficult in the abstract but easy for the current model. Another may look simple to a powerful teacher model while remaining entirely outside the smaller policy model’s abilities. Treating both as equally useful training samples confuses task complexity with learnability.

STAgent instead defines learnability relative to the current policy.

After an initial supervised warm-up using a random 10% subset of the curated pool, the seed model attempts each candidate query eight times. A verifier scores the resulting trajectories. For query $i$, the system calculates the empirical reward mean $\hat{\mu}_i$ and reward variance $\hat{\sigma}_i^2$.

The paper defines a learnability-potential score:

$$ S_i = \hat{\sigma}_i^2 \cdot \hat{\mu}_i $$

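The mean indicates whether the model can sometimes produce a useful answer. The variance indicates whether its performance remains unstable. Because the score is a product, it collapses toward zero when either factor does: consistently solved tasks have almost no variance, and hopeless tasks have almost no mean reward.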

Together, these divide tasks into three operational regions:

| Region | Reward pattern | Interpretation | Training decision |
| --- | --- | --- | --- |
| Trivial | High mean, low variance | The model already solves the task consistently | Reduce further training emphasis |
| Noise or out-of-reach | Near-zero mean, low variance | The model cannot currently find a productive trajectory, or the sample is defective | Avoid forcing supervision |
| Learnable | Non-zero mean, high variance | The model sometimes succeeds but remains inconsistent | Prioritize for additional training |
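A short sketch makes the three regions concrete. It assumes rewards in [0, 1], uses the population variance over the eight rollouts, and invents two thresholds (`var_floor` and the 0.5 mean split) purely for illustration. None of these choices is confirmed by the paper, which may estimate variance and draw region boundaries differently.

```python
from statistics import fmean, pvariance


def learnability(rewards):
    """Return (mean, variance, S) where S = variance * mean."""
    mu = fmean(rewards)
    var = pvariance(rewards, mu)  # assumption: population variance
    return mu, var, var * mu


def region(rewards, var_floor=0.01):
    """Map a reward sample to one of the three regions.
    Both thresholds here are hypothetical."""
    mu, var, _ = learnability(rewards)
    if var > var_floor and mu > 0:
        return "learnable"
    return "trivial" if mu >= 0.5 else "noise_or_out_of_reach"


always_right = [1, 1, 1, 1, 1, 1, 1, 1]
always_wrong = [0, 0, 0, 0, 0, 0, 0, 0]
inconsistent = [1, 0, 1, 1, 0, 1, 0, 1]  # 5 successes out of 8 attempts

for sample in (always_right, always_wrong, inconsistent):
    mu, var, score = learnability(sample)
    print(region(sample), round(mu, 3), round(var, 3), round(score, 3))
```

Both the always-right and the always-wrong sample score exactly zero, for opposite reasons. Only the inconsistent sample, with mean 0.625 and variance about 0.234, earns a non-zero score.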

This is the paper’s most transferable idea.

The best training example is not necessarily the hardest one. It is the example located near the model’s current capability boundary: difficult enough to produce an informative correction, but not so distant that the correction becomes imitation without understanding.

The process also allocates expensive teacher-model sampling according to learnability. Higher-value uncertain tasks may receive up to eight teacher trajectories, while easier tasks receive fewer calls.
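The allocation rule itself is not spelled out here, so the following is only one plausible shape for it: a linear mapping from learnability score to a call budget capped at eight. The function name, the minimum of one call, and the linear form are assumptions. The normalizer 4/27 is the largest value that variance times mean can take when rewards lie in [0, 1], which is itself an assumption about the reward scale.

```python
import math

# For rewards in [0, 1], variance <= mean * (1 - mean), so
# variance * mean <= mean**2 * (1 - mean), which peaks at 4/27 when mean = 2/3.
MAX_SCORE = 4 / 27


def teacher_budget(score, max_score=MAX_SCORE, max_calls=8, min_calls=1):
    """Hypothetical rule: scale teacher calls linearly with learnability."""
    if max_score <= 0:
        return min_calls
    frac = min(max(score / max_score, 0.0), 1.0)
    return min_calls + math.ceil(frac * (max_calls - min_calls))


budgets = {
    "already solved (S = 0)": teacher_budget(0.0),
    "somewhat uncertain (S = 0.05)": teacher_budget(0.05),
    "highly uncertain (S = 0.146)": teacher_budget(0.146484375),
}
```

Under these assumptions a solved task gets a single teacher call, a moderately uncertain one gets four, and a task near the top of the score range gets the full eight.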

For an organization using commercial foundation models as teachers, this creates a plausible cost-control mechanism. Teacher inference is concentrated where it has the highest expected training value rather than distributed uniformly across the dataset.

The paper does not publish a financial comparison against uniform sampling, so the return on investment remains an inference. Still, the operational logic is clearer than the familiar instruction to “generate more synthetic data,” which is often what people say shortly before discovering an impressive invoice.

Supervised learning establishes a workable route; RL practises near the edge

The training pipeline uses supervised fine-tuning and reinforcement learning for different purposes.

During supervised data construction, DeepSeek-R1 generates eight candidate tool-integrated reasoning trajectories for each selected query. Gemini-3-Pro-Preview evaluates them using the paper’s reward criteria, and only trajectories receiving perfect scores across all dimensions are retained.

The system also synthesizes rare long-tail queries by selecting unusual combinations of tools and asking a strong model to create tasks that require those combinations. These synthetic queries must then pass the same execution and verification process.

Supervised fine-tuning develops three basic capabilities:

  • decomposing user requests into executable plans;
  • calling the correct tools with valid parameters;
  • summarizing tool outputs without inventing unsupported details.

During training, tokens originating from tool observations are masked from the loss. The model learns to produce its reasoning, calls, and final responses without being trained to reproduce the environment’s returned text as though it generated those observations itself.
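Observation masking is easy to state and easy to get wrong, so here is a framework-free sketch. The `Segment` type, the role names, and the token ids are hypothetical. A real trainer would build the same mask as a tensor and apply it to the per-token cross-entropy.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    role: str             # "reasoning", "tool_call", "observation", or "answer"
    token_ids: list[int]  # toy token ids


def build_loss_mask(segments):
    """Flatten a trajectory and mark which tokens contribute to the loss.
    Tokens returned by the environment get mask 0."""
    tokens, mask = [], []
    for seg in segments:
        keep = 0 if seg.role == "observation" else 1
        tokens.extend(seg.token_ids)
        mask.extend([keep] * len(seg.token_ids))
    return tokens, mask


def masked_mean_loss(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


trajectory = [
    Segment("reasoning", [11, 12]),
    Segment("tool_call", [21]),
    Segment("observation", [31, 32, 33]),  # text returned by the tool
    Segment("answer", [41, 42]),
]
tokens, mask = build_loss_mask(trajectory)
```

The resulting mask is `[1, 1, 1, 0, 0, 0, 1, 1]`: the model is graded on what it wrote, not on what the tool said.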

After supervised fine-tuning, harder learnable queries are used for reinforcement learning in the sandbox. STAgent uses Group Sequence Policy Optimization, a sequence-level variant of GRPO intended to stabilize updates for its mixture-of-experts base model.
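What "sequence-level" means can be sketched without a deep-learning framework. The code below assumes the formulation commonly associated with GSPO: one length-normalized likelihood ratio per response, group-normalized rewards as advantages, and clipping applied to the whole sequence. It is not taken from the STAgent report, and the clip range and log-probabilities are made-up illustrative values.

```python
import math
from statistics import fmean, pstdev


def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence likelihood ratio:
    exp(mean over tokens of (new log-prob - old log-prob))."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(fmean(diffs))


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same query."""
    mu, sd = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style pessimistic clipping applied to one ratio per sequence.
    The clip value is illustrative only."""
    clipped = min(max(ratio, 1 - clip), 1 + clip)
    return min(ratio * advantage, clipped * advantage)


# One response whose tokens all became slightly more likely under the new policy.
ratio = sequence_ratio([-1.0, -1.0, -1.0], [-1.1, -1.1, -1.1])
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_objective(ratio, advantages[0])
```

The contrast with token-level GRPO is that every token in a response shares the same ratio, so one badly estimated token cannot dominate the update on its own.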

The distinction between SFT and RL is practical.

SFT shows the agent valid routes through the task. RL lets it repeatedly explore uncertain routes, receive feedback, and improve its policy. Sending everything to RL would be expensive and unstable. Sending everything to SFT would leave the model imitating teacher trajectories without practising how to recover when tool feedback changes the plan.

The reward model treats hallucination as a service failure

STAgent scores trajectories across three dimensions:

  • reasoning and proactive planning;
  • information fidelity and integration;
  • presentation and closure of the service loop.

The evaluator dynamically changes the weights according to the request. A complex itinerary places more weight on reasoning. A direct information query emphasizes fidelity. A consultation request gives greater importance to presentation.

The notable feature is the hallucination veto. If the model invents a factual value such as a time, price, or distance that is not grounded in tool output, the complete trajectory receives a reward of zero.

Formally, the reward is a weighted score multiplied by an indicator that becomes zero when hallucination is detected:

$$ R = \mathbb{1}_{\{H = 0\}} \cdot \sum_{k \in \{\text{reasoning},\, \text{information},\, \text{presentation}\}} w_k s_k $$

This prevents an eloquent, mostly correct answer from receiving a respectable score after quietly fabricating one operationally critical number.
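The veto is simple enough to write down directly. In this sketch the weight profiles, the request-type names, and the example scores are all hypothetical. The paper's evaluator is an LLM judge that chooses weights dynamically, which a lookup table only approximates.

```python
DIMENSIONS = ("reasoning", "information", "presentation")

# Hypothetical weight profiles per request type; each sums to 1.
WEIGHTS = {
    "itinerary":    {"reasoning": 0.5, "information": 0.3, "presentation": 0.2},
    "lookup":       {"reasoning": 0.2, "information": 0.6, "presentation": 0.2},
    "consultation": {"reasoning": 0.3, "information": 0.3, "presentation": 0.4},
}


def trajectory_reward(scores, request_type, hallucinated):
    """Weighted score across the three dimensions, zeroed by the veto.
    `hallucinated` plays the role of H != 0 in the formula above."""
    if hallucinated:
        return 0.0
    weights = WEIGHTS[request_type]
    return sum(weights[k] * scores[k] for k in DIMENSIONS)


scores = {"reasoning": 0.9, "information": 1.0, "presentation": 0.8}
honest = trajectory_reward(scores, "itinerary", hallucinated=False)
fabricated = trajectory_reward(scores, "itinerary", hallucinated=True)
```

With identical dimension scores, the honest trajectory earns about 0.91 and the one containing an invented number earns 0. The multiplicative form means no amount of polish elsewhere can buy back a fabricated value.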

The paper also includes negative and irrelevant queries, training the model to recognize when a requested action falls outside its available toolset or lacks required information. This matters because real agent reliability depends partly on knowing when not to proceed.

The desired behaviour is not maximum refusal. The evaluator penalizes agents that abandon a solvable task because of a minor error or ambiguous place name. STAgent is rewarded for correcting recoverable mistakes, retrieving missing information when possible, and refusing only when the task is genuinely unsupported.

That balance—proactive without becoming fictional—is considerably harder than adding a generic instruction to “avoid hallucinations.”

The results support the complete pipeline, not its individual components

The paper evaluates STAgent through a private online benchmark, the public TravelBench environment, and a collection of general-capability benchmarks.

Understanding the purpose of each test matters because they support different claims.

| Evidence | Likely purpose | What it supports | What it does not establish |
| --- | --- | --- | --- |
| Private online benchmark with 1,000 queries | Main in-domain comparison under Amap-style tasks | STAgent strongly improves over its base model and competes with some larger services | Independent reproducibility or superiority over all larger models |
| TravelBench | Main public in-domain evidence | The complete STAgent pipeline improves travel-agent performance | Which pipeline component caused the improvement |
| General benchmark suite | Capability-retention and transfer check | Specialization largely preserves general performance and improves function calling on BFCL | Universal improvement or absence of any capability loss |
| Difficulty-distribution figure | Dataset-characterization check | Static scoring produces plausible task profiles | That the curriculum itself improves performance |
| Appendix case studies | Implementation illustration | Shows how tools can be chained into useful responses | Average reliability across real deployments |
| Training-reward curve | Training diagnostic | Rewards increased during RL training | Causal evidence that the chosen curriculum is optimal |

TravelBench shows a material gain over the initialization model

On TravelBench, STAgent scores 70.3, compared with 61.8 for Qwen3-30B-A3B-Thinking-2507, an improvement of 8.5 points.

It also narrowly exceeds Qwen3-235B-A22B-Instruct-2507, which scores 69.9, and DeepSeek-R1-0528, which scores 64.7.

The aggregate score deserves decomposition:

| Model | Multi-turn | Single-turn | Unsolved | Overall |
| --- | --- | --- | --- | --- |
| Qwen3-30B-A3B-Thinking-2507 | 59.6 | 69.4 | 56.3 | 61.8 |
| Qwen3-235B-A22B-Instruct-2507 | 60.1 | 69.7 | 80.0 | 69.9 |
| DeepSeek-R1-0528 | 34.3 | 76.1 | 83.7 | 64.7 |
| STAgent | 66.6 | 73.4 | 71.0 | 70.3 |

STAgent achieves the strongest multi-turn result, which is consistent with the pipeline’s focus on tool orchestration and intermediate verification. Its unsolved-task score improves substantially over its initialization model but remains below the strongest comparison models. The model has learned more about refusing or handling unsupported requests; it has not cornered the market on knowing its limits.
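The decomposition can be checked with a few lines of arithmetic. The figures below are copied from the table. The aggregation rule is not stated in this article, so the unweighted mean is an observation about the numbers rather than a documented definition.

```python
# (multi-turn, single-turn, unsolved, overall) as reported above.
ROWS = {
    "Qwen3-30B-A3B-Thinking-2507":   (59.6, 69.4, 56.3, 61.8),
    "Qwen3-235B-A22B-Instruct-2507": (60.1, 69.7, 80.0, 69.9),
    "DeepSeek-R1-0528":              (34.3, 76.1, 83.7, 64.7),
    "STAgent":                       (66.6, 73.4, 71.0, 70.3),
}


def unweighted_mean(row):
    """Mean of the three sub-scores, rounded to one decimal like the table."""
    multi, single, unsolved, _ = row
    return round((multi + single + unsolved) / 3, 1)


consistent = {name: abs(unweighted_mean(row) - row[3]) < 1e-9
              for name, row in ROWS.items()}

base = ROWS["Qwen3-30B-A3B-Thinking-2507"]
agent = ROWS["STAgent"]
gains = {
    "multi-turn": round(agent[0] - base[0], 1),
    "single-turn": round(agent[1] - base[1], 1),
    "unsolved": round(agent[2] - base[2], 1),
    "overall": round(agent[3] - base[3], 1),
}
```

Every reported overall score matches the unweighted mean of its three sub-scores. Against the initialization model, the 8.5-point overall gain splits into 7.0 points on multi-turn, 4.0 on single-turn, and 14.7 on unsolved tasks, so the largest single improvement is in handling unsupported requests.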

The private online comparison is more mixed than the headline score

Against its Qwen3-30B-A3B-Thinking base model, STAgent wins 77.5% of overall pairwise comparisons and performs particularly well in summarization and presentation.

It also wins 58.8% overall against Qwen-Plus-Latest. However, it loses most comparisons against Gemini-3-Pro, Kimi-K2, and MiniMax-M2. The authors attribute part of this gap to the model’s smaller scale.

This distinction matters. TravelBench shows that a specialized 30B model can outperform substantially larger models in a defined travel-agent environment, although its margin over the 235B Qwen3 model is only 0.4 points. The private online evaluation shows that specialization does not automatically defeat stronger general systems across every quality dimension.

The narrower claim is also the more useful one: disciplined specialization can close or reverse a size disadvantage on tasks aligned with the training environment.

General capabilities are largely preserved, not universally improved

Compared with the corresponding Qwen3-30B-A3B base in the paper’s general evaluation table, STAgent improves BFCL V3 function-calling performance by 4.4 points. Its math, coding, MMLU-Pro, and C-Eval results remain close to the base model.

There are also declines. ArenaHard-v2.0 falls by 5.0 points, while IFEval declines by 1.8 points.

The appropriate interpretation is that domain specialization largely preserves broad capabilities while producing a meaningful in-domain gain. It does not establish that specialized training is free of trade-offs.

This general-benchmark suite functions as a retention check. It is valuable precisely because a travel agent that becomes better at planning while becoming dramatically worse at everything else would be a rather expensive form of tunnel vision.

For businesses, the data-selection loop may matter more than the final model

The paper directly demonstrates that the complete STAgent pipeline improves performance on travel-specific benchmarks while mostly maintaining general capability.

Several broader business implications are reasonable, but they remain interpretations rather than direct experimental findings.

| Paper directly shows | Cognaptus inference for business use | Still uncertain |
| --- | --- | --- |
| A stable ten-tool sandbox supports training and evaluation | Firms should make operational APIs replayable before investing heavily in agent post-training | Cost of building and maintaining the sandbox |
| Roughly 200,000 prompts are selected from about 30 million logs | Existing interaction logs may contain enough variety to build a useful specialized curriculum | How much historical volume is required in other domains |
| Learnability is estimated from the current model’s reward distribution | Teacher-model and annotation spending can be targeted toward uncertain-but-solvable cases | Actual savings relative to simpler sampling strategies |
| Negative samples teach tool boundaries | Refusal and escalation behaviours should be trained as first-class operational tasks | Effects on real-world error rates and user trust |
| STAgent reaches 70.3 on TravelBench with a 30B base | A specialized smaller model may sometimes replace a more expensive general model for bounded workflows | Inference cost, latency, maintenance cost, and production reliability |

The strongest practical pathway is therefore not “fine-tune a model on company data.” It is more specific:

  1. stabilize the tools and create a replayable environment;
  2. organize user requests by intent, constraints, and executability;
  3. remove repetitive examples while preserving rare operational cases;
  4. train a seed policy and measure its behaviour across the candidate pool;
  5. spend teacher compute and RL effort near the model’s current capability boundary;
  6. evaluate both domain gains and capability losses.

This pathway is relevant beyond maps. Logistics, customer support, procurement, financial operations, and healthcare administration all involve heterogeneous tools, time-dependent information, invalid requests, and long-tail workflows.

Portability should not be assumed, however. Amap possesses unusually rich domain logs, mature mapping infrastructure, and a natural source of verifiable tool outputs. Organizations with sparse interactions, unstable systems, or poorly defined task completion may find that the agent is not the first problem requiring attention.

The main limitation is attribution, not benchmark quantity

STAgent is evaluated across several settings, but the paper does not provide component-level ablations.

There is no reported comparison showing:

  • the same model trained without hierarchical curation;
  • random sampling versus learnability-based selection;
  • supervised fine-tuning without the later RL stage;
  • rewards with and without the hallucination veto;
  • the effect of negative samples in isolation;
  • the incremental value of individual tools or data-filtering stages.

As a result, the evidence supports the complete pipeline, not a causal ranking of its components. The capability-aware curriculum is conceptually compelling, but the paper does not isolate how much of the TravelBench improvement comes from that curriculum rather than from the sandbox, teacher-generated trajectories, reward model, or reinforcement learning.

Other boundaries affect practical interpretation.

The online benchmark is private and evaluated using an LLM judge. TravelBench also relies on simulated users and model-based grading. The historical logs and operational tools are proprietary. The paper does not report production A/B tests, human satisfaction measures, end-to-end latency, inference costs, annotation costs, or maintenance requirements.

The appendix’s case studies demonstrate implementation patterns, while the rising RL reward curve shows that the training objective improved during optimization. Neither constitutes independent evidence of reliable deployment.

These limitations do not erase the result. They determine which claim the result can responsibly support.

STAgent provides strong evidence that an integrated environment-data-training pipeline can produce a capable specialized travel agent. It provides weaker evidence about which element delivers the largest return, how cheaply the approach can be reproduced, or how reliably it transfers beyond Amap.

Teaching agents where learning is still possible

The obvious story about STAgent is that Alibaba trained a map agent capable of planning routes, checking constraints, and coordinating travel tools.

The more useful story is about selection.

The team does not treat every historical query as valuable. It does not treat difficulty as a permanent label. It does not spend teacher compute evenly. It does not send already-solved tasks and currently impossible tasks through the same training process.

Instead, STAgent repeatedly asks a practical question:

Where is the current model uncertain, but still capable of succeeding?

That question turns training data from a static collection into an evolving curriculum. It also suggests a more disciplined approach to enterprise agents. The competitive asset may not be the largest log archive or the cleverest prompt. It may be the system that continually identifies the next behaviour worth teaching.

Maps start thinking only after someone decides which wrong turns are educational.

Cognaptus: Automate the Present, Incubate the Future.


  1. AMAP AI Agent LLM Team, “AMAP Agentic Planning Technical Report,” arXiv:2512.24957, version 2, January 2026. https://arxiv.org/abs/2512.24957