When Language Models Ask for Help: The Curious Case of Uncertain AI

Escalation is the least glamorous part of automation. It is also where many systems either become useful or become expensive theatre.

In a normal business workflow, we understand escalation almost instinctively. A junior analyst handles routine invoices. An exception goes to a senior reviewer. A suspicious transaction goes to compliance. A warehouse robot follows a route until the floor plan stops behaving like yesterday’s floor plan. Nobody sensible asks the senior reviewer to approve every invoice. Nobody sensible lets the junior analyst improvise when the case is clearly outside their experience.

AI systems often forget this. Or, more accurately, their designers forget it for them.

The paper behind this article, “When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning,” proposes a useful correction.¹ It does not say that language models should replace reinforcement learning policies. It does not say that reinforcement learning is enough. It says something less dramatic and more operationally important: when a trained policy becomes uncertain, ask a language model for help — but only then.

That “only then” is the paper’s real contribution. The language model is not the hero. The routing rule is.

The useful comparison is not RL versus LMs, but four ways to fail

The easiest summary would be: the authors combine reinforcement learning with language models and get better out-of-distribution generalization. That summary is not wrong. It is just too blunt to be useful.

The paper is better read as a comparison among four systems:

System design	What it tries to do	What the paper shows	Business reading
PPO-only policy	Let a trained reinforcement learning agent act alone	Strong in familiar same-size maps; fails in downward transfer	Specialist systems are efficient when the operating context stays close to training
LM-only control	Let the language model directly choose actions	Near-zero test performance across the navigation tasks	General reasoning is not the same as sequential control
Poorly calibrated LM intervention	Ask the LM at the wrong times, or let unreliable advice overwrite the policy	Mid-size models can degrade performance despite occasional intervention	Escalation can create risk if the reviewer is not actually competent
ASK	Query the LM only when policy uncertainty crosses a threshold	No meaningful in-domain gain, but strong downward transfer with sufficiently capable models	The value is orchestration: route exceptions, do not replace the workflow

This comparison matters because the misconception the paper targets is highly plausible. Many readers will assume that attaching a language model to an RL agent should improve planning automatically. That is the current industry reflex: add a language model, sprinkle a prompt, call the result “agentic,” and wait for the demo video.

The paper is less forgiving. Language models alone fail. PPO alone fails under some distribution shifts. Hybridization helps only when the intervention is gated and the language model is capable enough. A clumsy hybrid is not a compromise between two strengths. It is two weaknesses sharing a desk.

What ASK actually does: the policy acts, uncertainty decides, the LM advises

ASK stands for Adaptive Safety through Knowledge. The architecture is intentionally external to the trained RL policy. That is important. The authors do not retrain the PPO agent. They do not fine-tune the language model. They add an inference-time mechanism around an already trained policy.

The loop is simple:

1. The PPO policy proposes an action.
2. Monte Carlo Dropout estimates the policy’s uncertainty.
3. If total uncertainty is below a fixed threshold $\tau$, the PPO action is executed.
4. If uncertainty exceeds $\tau$, the system prompts a language model for an alternative action.
5. If the LM output is a valid action, it can overwrite the PPO choice; otherwise, the system falls back to the policy.
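
The loop above can be sketched as a single decision function. This is a minimal illustration, not the authors' implementation: `ask_step` and its three callables (`propose_action`, `total_uncertainty`, `query_lm`) are hypothetical names, and treating uncertainty exactly equal to $\tau$ as "not exceeding" is an assumption the article does not settle.

```python
from typing import Callable, Optional, Tuple

VALID_ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")


def ask_step(
    obs: object,
    propose_action: Callable[[object], str],         # hypothetical: trained PPO policy
    total_uncertainty: Callable[[object], float],    # hypothetical: MC Dropout estimate
    query_lm: Callable[[object, str], Optional[str]],  # hypothetical: LM call
    tau: float,
) -> Tuple[str, bool, bool]:
    """Return (action, consulted, overwritten) for one decision step."""
    ppo_action = propose_action(obs)

    # Gate: at or below the threshold, the policy acts alone (assumed tie rule).
    if total_uncertainty(obs) <= tau:
        return ppo_action, False, False

    # Above the threshold, ask the LM, passing the policy's own suggestion.
    lm_reply = query_lm(obs, ppo_action)

    # Only a valid action may replace the policy's choice.
    if lm_reply in VALID_ACTIONS:
        return lm_reply, True, lm_reply != ppo_action

    # Invalid or missing reply: fall back to the policy.
    return ppo_action, True, False
```

Returning the two flags alongside the action is a design choice for this sketch: it lets the caller log how often the LM was consulted and how often it changed the decision without touching the policy or the LM.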

The uncertainty estimate combines two signals. Epistemic uncertainty reflects lack of model knowledge — the policy is seeing something unlike what it learned. Aleatoric uncertainty reflects ambiguity in the situation itself. ASK uses the sum of both as the intervention trigger:

$$ U_{total}(o) = U_e(o) + U_a(o) $$

This is not uncertainty as decoration. It is uncertainty as a routing signal.
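
To make the two signals concrete, here is one common entropy-based decomposition applied to the action probabilities collected from repeated stochastic forward passes. It is an assumption for illustration: the article does not state which estimator the paper uses, and `decompose_uncertainty` is a hypothetical name. Aleatoric uncertainty is taken as the average entropy of the individual passes; epistemic uncertainty is taken as the extra entropy that appears once the passes are averaged, which is large when the passes disagree.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def decompose_uncertainty(
    samples: List[Sequence[float]],
) -> Tuple[float, float, float]:
    """samples: one action-probability vector per dropout-enabled pass.

    Returns (epistemic, aleatoric, total) under the assumed decomposition.
    """
    n = len(samples)
    k = len(samples[0])
    mean_p = [sum(s[i] for s in samples) / n for i in range(k)]

    # Aleatoric: how uncertain each individual pass is, on average.
    aleatoric = sum(entropy(s) for s in samples) / n

    # Epistemic: disagreement between passes (clamped against rounding error).
    epistemic = max(entropy(mean_p) - aleatoric, 0.0)

    return epistemic, aleatoric, epistemic + aleatoric
```

Under this particular choice the total equals the entropy of the averaged prediction, so with four actions it can never exceed $\ln 4 \approx 1.386$. Other decompositions exist and would give different scales for $\tau$.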

The implementation detail is worth spelling out because it prevents a common misunderstanding. The language model is not handed a vague instruction like “navigate safely.” The prompt is structured. It includes the agent’s position, the goal position, immediate neighboring tiles, two-step look-ahead tiles, valid action constraints, and the PPO policy’s own recommendation, framed as an “autopilot suggestion.” The model must return exactly one action: UP, DOWN, LEFT, or RIGHT.

That prompt design does two things. It gives the LM just enough symbolic state to reason locally, and it keeps the output compatible with the action space. In business terms, this is not a chatbot wandering into operations. It is a constrained reviewer inside a controlled decision protocol. Less romantic, more deployable.
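
A rough sketch of what such a constrained protocol might look like follows. The prompt wording, the tile labels, and the function names `build_prompt` and `parse_action` are all hypothetical; only the list of fields and the four allowed actions come from the description above.

```python
from typing import Dict, Optional, Tuple

VALID_ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")


def build_prompt(
    agent: Tuple[int, int],
    goal: Tuple[int, int],
    neighbors: Dict[str, str],   # direction -> tile label, e.g. "hole"
    lookahead: Dict[str, str],   # direction -> tile label two steps away
    autopilot: str,              # the PPO policy's proposed action
) -> str:
    """Assemble a structured prompt (illustrative wording only)."""
    lines = [
        "You are assisting a grid navigation agent.",
        f"Agent position (row, col): {agent}",
        f"Goal position (row, col): {goal}",
        "Adjacent tiles: " + ", ".join(f"{d}={t}" for d, t in neighbors.items()),
        "Tiles two steps away: " + ", ".join(f"{d}={t}" for d, t in lookahead.items()),
        f"Autopilot suggestion: {autopilot}",
        "Valid actions: " + ", ".join(VALID_ACTIONS) + ".",
        "Reply with exactly one action and nothing else.",
    ]
    return "\n".join(lines)


def parse_action(reply: str) -> Optional[str]:
    """Accept the reply only if it is exactly one valid action."""
    token = reply.strip().upper()
    return token if token in VALID_ACTIONS else None
```

The parser is deliberately strict in this sketch: a reply such as "UP or LEFT" is rejected rather than guessed at, so the caller can fall back to the policy.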

The benchmark is small, but the comparison is clean

The experiments use deterministic FrozenLake, a grid-world navigation environment. The agent must move from a start tile to a goal tile while avoiding holes. Rewards are sparse: success gives a reward of 1; everything else gives 0. The paper varies map sizes from 4×4 to 8×8 and creates different hole layouts as different contexts.
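
For readers who have not met the environment, the dynamics described above fit in a few lines. This is a simplified stand-in written for illustration, not the benchmark's actual code; the `step` function, the tile letters, and the example layout are assumptions.

```python
from typing import List, Tuple

MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def step(
    grid: List[str], pos: Tuple[int, int], action: str
) -> Tuple[Tuple[int, int], float, bool]:
    """One deterministic move on a grid of S (start), F (frozen), H (hole), G (goal).

    Returns (new_position, reward, done).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    d_row, d_col = MOVES[action]

    # Moving into the border leaves the agent where it is.
    row = min(max(pos[0] + d_row, 0), n_rows - 1)
    col = min(max(pos[1] + d_col, 0), n_cols - 1)

    tile = grid[row][col]
    reward = 1.0 if tile == "G" else 0.0   # sparse: only the goal pays
    done = tile in ("G", "H")              # goal or hole ends the episode
    return (row, col), reward, done


# Hypothetical 4x4 layout; a different hole pattern is a different context.
EXAMPLE_GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
```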

This is not a warehouse, a trading desk, or a claims-processing department. It is a clean diagnostic environment. That is not a flaw by itself. The point is to isolate the interaction among three things: a trained RL policy, language-model guidance, and uncertainty-based intervention.

The authors train PPO policies with fixed hyperparameters, use Qwen2.5 language models from 0.5B to 72B parameters, estimate uncertainty with 100 MC Dropout forward passes at dropout rate 0.2, and select thresholds using Bayesian optimization with Optuna over validation contexts. Test contexts are held out.

Two additional metrics make the results readable:

Metric	Meaning	Why it matters
Intervention Rate (IR)	How often the LM is consulted	Measures computational overhead and how frequently the base policy is judged uncertain
Overwrite Rate (OR)	How often the LM changes the PPO action when consulted	Measures whether the LM is merely validating the policy or actively steering behavior

IR is the cost meter. OR is the behavioral meter. Reward alone would hide too much.
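
Both meters can be computed from a per-step log. The sketch below assumes each step is recorded as a pair of flags, `(consulted, overwritten)`, and that the overwrite rate is measured over consulted steps only, which is one reading of "when consulted"; the function name is hypothetical.

```python
from typing import Iterable, Tuple


def intervention_and_overwrite_rates(
    steps: Iterable[Tuple[bool, bool]],
) -> Tuple[float, float]:
    """steps: (consulted, overwritten) per decision step.

    Returns (IR, OR). OR uses consulted steps as its denominator (assumption).
    """
    log = list(steps)
    consulted = sum(1 for was_consulted, _ in log if was_consulted)
    overwritten = sum(1 for was_consulted, changed in log if was_consulted and changed)

    ir = consulted / len(log) if log else 0.0
    overwrite_rate = overwritten / consulted if consulted else 0.0
    return ir, overwrite_rate
```

For example, a four-step episode in which the LM was consulted twice and changed the action once gives an IR of 0.5 and an OR of 0.5.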

Same-size maps show restraint, not improvement

The in-domain experiment evaluates agents on maps of the same size as training, with different hole layouts. Here the story is deliberately unexciting.

PPO already performs well. On the 6×6 grid, PPO achieves a reward of 0.93 ± 0.26 with an average episode length of 9.49 ± 1.90 steps. On 7×7, PPO reaches 0.86 ± 0.35. On 8×8, it reaches 0.79 ± 0.41.

Adding ASK does not consistently improve these results. Most model sizes match the PPO baseline. Some mid-size configurations slightly hurt performance. On 8×8, the 3B model drops to 0.71 ± 0.46 and length increases to 18.14 ± 21.14 steps, compared with PPO’s 0.79 ± 0.41 and 12.49 ± 3.17 steps.
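
A note on reading the ± figures. Assuming they are standard deviations over episodes (the article does not say which statistic is reported), the wide spreads follow almost mechanically from the reward design. Each episode pays either 0 or 1, so an outcome with success probability $p$ has a standard deviation fixed by its mean:

$$ \sigma = \sqrt{p(1-p)}, \qquad p = 0.93 \Rightarrow \sigma \approx 0.26, \qquad p = 0.86 \Rightarrow \sigma \approx 0.35, \qquad p = 0.79 \Rightarrow \sigma \approx 0.41 $$

These match the PPO figures quoted above. Under that assumption, a large ± on reward reflects binary outcomes rather than erratic behavior; the episode-length spread is the more informative signal of instability.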

This is not a disappointing result. It is a useful negative result.

When the specialist policy already knows what it is doing, the language model has little to add. Successful configurations mostly defer. In same-size tests, models that preserve baseline performance tend to have near-zero overwrite rates. The 0.5B model, for example, is consulted quite often across same-size grids — roughly 50% to 59% of steps — but almost never overwrites the PPO action. It behaves like a quiet reviewer who keeps saying, “Yes, the autopilot is fine.” Occasionally annoying, but safe.

The mid-size models are more interesting. The paper observes a U-shaped pattern: very small and very large models can preserve performance, while some mid-size models create a failure mode with low intervention but relatively damaging overwrites. The problem is not just model size. It is the relationship between model capability, threshold selection, and overwrite behavior.

That distinction is commercially important. A low escalation rate does not prove safety. If the system escalates rarely but makes bad overrides when it does, the rare cases become exactly where losses concentrate. Enterprise risk managers, who enjoy this sort of thing far more than anyone should, will recognize the pattern.

LM-only control fails even in a toy lake

The paper also evaluates an LM-only baseline, where the language model directly controls the agent without PPO guidance. This is the comparison that should make “just prompt the agent” designs slightly embarrassed.

LM-only performance remains near zero on test splits across model sizes and grid dimensions. The authors note that even with explicit environment descriptions and reasoning prompts, the models fail to navigate effectively under distribution shifts.

This result is not saying that language models are useless. It is saying that language models are not automatically planners. They do not, by default, maintain reliable long-horizon state, estimate value, or perform the kind of consistent sequential control that RL policies are built to learn.

The useful role for the LM in ASK is narrower. It is not the driver. It is a semantic critic called in high-uncertainty states, with a structured view of the local situation and the policy’s proposed action.

That is a much more modest claim. Conveniently, it is also more believable.

Downward transfer is where the hybrid earns its keep

The more revealing experiment is downward generalization. Here the PPO policy is trained on 8×8 maps and evaluated on smaller environments: 4×4, 5×5, 6×6, and 7×7.

This is a distribution shift. The trained policy has not simply moved to a new hole layout within the same geometry; it must operate under a changed map size. PPO alone fails. It receives zero reward on 4×4, 6×6, and 7×7, and only about 0.01 on 5×5. LM-only control also fails.

The hybrid only becomes strong above a capability threshold.

Downward test environment	PPO-only reward	ASK with 1.5B	ASK with 32B	ASK with 72B
4×4	0.00 ± 0.00	0.37 ± 0.49	0.95 ± 0.22	0.95 ± 0.22
5×5	0.01 ± 0.10	0.23 ± 0.42	0.87 ± 0.34	0.86 ± 0.35
6×6	0.00 ± 0.00	0.11 ± 0.31	0.69 ± 0.46	0.75 ± 0.44
7×7	0.00 ± 0.00	0.10 ± 0.30	0.58 ± 0.50	0.68 ± 0.47

This is the paper’s strongest evidence. Neither component works alone. The uncertainty-gated combination works when the language model is large enough to provide useful spatial guidance.

The 32B and 72B models reach 0.95 reward on 4×4 and remain functional as the maps get larger. The 1.5B model shows modest success in the smallest maps but does not become a reliable collaborator. The 3B–14B range mostly fails in downward transfer, despite being larger than 1.5B. Scaling, as usual, refuses to behave like a neat spreadsheet column.

The overwrite rates explain part of the story. In downward transfer, the policy is highly uncertain, and the LM is consulted at every step in the reported test results. For 32B and 72B, overwrite rates sit in a moderate band: roughly 46% to 68%, depending on map size. That means the LM is not merely rubber-stamping PPO. It is actively changing decisions — and those changes help.

By contrast, the 0.5B model shows overwrite rates from about 61% to nearly 100%, paired with failure. That is not helpful autonomy. That is confident sabotage in a very small lake.

The lesson is not “use more LLM.” The lesson is “use the LM only when it is both needed and competent enough to change the decision.” The first condition is handled by uncertainty. The second is harder.

The threshold analysis is an ablation on restraint

The paper’s threshold analysis is easy to skim past. That would be a mistake. It is not a second thesis; it is a robustness and sensitivity check on the gating mechanism.

The authors select an uncertainty threshold $\tau$ for each model and environment through Bayesian optimization. Averaged across 6×6, 7×7, and 8×8 same-size environments, the best thresholds show a pattern:

Model	Average threshold $\tau$	Average reward
Qwen2.5-0.5B	0.971	0.83
Qwen2.5-1.5B	0.544	0.83
Qwen2.5-3B	1.063	0.80
Qwen2.5-7B	0.944	0.82
Qwen2.5-14B	1.079	0.82
Qwen2.5-32B	0.763	0.83
Qwen2.5-72B	0.483	0.84

The 3B and 14B models end up with the highest average thresholds, both above 1.0, meaning the gate must suppress their interventions more aggressively. The 0.5B model, at 0.971, is not far behind. The 72B model tolerates the lowest average threshold, 0.483. In one 7×7 case, the paper reports $\tau = 0.12$, so the LM is consulted nearly every step while still maintaining 0.89 reward.

This is not merely a tuning footnote. It tells us that gating and model capability are coupled. A threshold that works for one model size may not work for another. A procurement decision that swaps a model for a cheaper one without recalibrating the gate is not optimization. It is a small experiment with operational risk, hopefully not conducted on customers.
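
In code, recalibration simply means re-running threshold selection whenever the model changes. The sketch below swaps the paper's Bayesian optimization for a plain sweep over candidate values, purely to keep the example dependency-free; `select_threshold` and `validation_reward` are hypothetical names, and the reward function would in practice run the gated agent on validation contexts.

```python
from typing import Callable, Iterable, Tuple


def select_threshold(
    candidates: Iterable[float],
    validation_reward: Callable[[float], float],  # hypothetical: tau -> mean reward
) -> Tuple[float, float]:
    """Return (best_tau, best_reward) from a simple sweep.

    A grid sweep stands in for Bayesian optimization here (substitution).
    On ties, the first candidate tried is kept.
    """
    best_tau = None
    best_reward = float("-inf")
    for tau in candidates:
        reward = validation_reward(tau)
        if reward > best_reward:
            best_tau, best_reward = tau, reward
    if best_tau is None:
        raise ValueError("no candidate thresholds supplied")
    return best_tau, best_reward
```

The point of the sketch is the calling pattern, not the search method: `validation_reward` closes over a specific language model, so a different model means a different function and a fresh search.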

The business lesson is escalation architecture, not model worship

For Cognaptus readers, the business relevance is not that FrozenLake navigation maps directly onto enterprise automation. It does not. The relevance is architectural.

Many business workflows already have a version of PPO-only behavior: a trained classifier, rule engine, robotic process automation script, recommender, scheduler, or pricing model that performs well in routine conditions. The failure cases usually appear at the boundaries: unusual documents, changed vendor behavior, unexpected market regimes, ambiguous customer requests, partial data, policy exceptions.

The tempting response is to put a language model in front of everything. That is often wasteful. It can also be dangerous, because the language model may generate plausible but wrong decisions in cases where a specialist system already had the right answer.

ASK suggests a different pattern:

Design question	ASK-inspired answer	Practical business translation
Who handles routine cases?	The trained policy	Let specialized systems handle stable, high-volume decisions
What detects exception cases?	Uncertainty estimated from the policy	Build confidence, drift, anomaly, or disagreement signals into routing
Who reviews uncertain cases?	A structured LM prompt	Use LLMs as constrained reviewers, not free-form operators
When should the reviewer override?	Only when advice is valid and useful	Track override quality, not just escalation frequency
What must be recalibrated?	The threshold $\tau$	Re-tune gates when models, data, or workflows change

The ROI logic is also different from the usual “LLMs reduce labor cost” pitch. ASK is about cheaper diagnosis and safer delegation. It reduces unnecessary LM calls in familiar states, while preserving the option to use language-based reasoning when the specialist policy is uncertain. In production terms, this can mean lower inference cost, clearer audit trails, and fewer reckless overrides.

But the paper also warns against a lazy version of hybrid AI. It is not enough to add an LLM as a fallback. The fallback must be competent, constrained, and measured. Intervention rate alone is not enough. Overwrite quality matters. Threshold sensitivity matters. The prompt format matters. The base policy’s uncertainty estimate matters.

Architecture is where the magic goes to become boring. Good. Boring is often what survives deployment.

Where the result should not be over-read

The paper’s evidence is clean, but its scope is narrow.

First, FrozenLake is deterministic in this study. Real operational environments are often stochastic, partially observable, multi-agent, and full of missing data. The authors themselves point to more complex and partially observable environments as future work.

Second, the prompt is highly structured and task-specific. It includes coordinates, neighboring tiles, look-ahead tiles, and hard output constraints. That is a strength for the experiment, but it means the result is not evidence that generic prompting will work in open-ended control tasks.

Third, the strongest transfer result appears only for 32B and 72B Qwen models. The method is model-agnostic, but the demonstrated robust benefit is not model-size agnostic. Smaller models are cheaper, but cheaper advice is still expensive if it breaks the process.

Fourth, the reported downward-transfer setting is specific: training on 8×8 maps and testing on smaller maps. This is a useful distribution shift, but it is not the full space of OOD generalization. A larger map, stochastic transitions, hidden states, or adversarial layouts could change the conclusion.

Fifth, MC Dropout requires repeated policy forward passes. The authors use 100 forward passes for uncertainty estimation. That may be lightweight compared with full Bayesian ensembles, but it is not free. In business systems, the cost of uncertainty estimation, LM inference, latency, and monitoring must be counted together.
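
Counting those costs together can start with a back-of-the-envelope formula. The sketch below assumes uncertainty is estimated on every step and the LM is called only on the fraction of steps given by the intervention rate; the function name and all cost inputs are hypothetical placeholders in whatever unit you budget in (milliseconds, dollars, GPU-seconds).

```python
def expected_step_cost(
    policy_pass_cost: float,    # cost of one policy forward pass (placeholder)
    n_passes: int,              # MC Dropout passes per step, e.g. 100
    lm_call_cost: float,        # cost of one LM query (placeholder)
    intervention_rate: float,   # fraction of steps that consult the LM
) -> float:
    """Expected per-step cost: uncertainty estimation plus gated LM calls."""
    if not 0.0 <= intervention_rate <= 1.0:
        raise ValueError("intervention_rate must be between 0 and 1")
    return n_passes * policy_pass_cost + intervention_rate * lm_call_cost
```

The formula leaves out monitoring and engineering overhead, but it makes one trade-off visible: lowering the intervention rate only saves money if the LM call dominates the cost of the repeated policy passes.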

These boundaries do not weaken the paper’s main point. They locate it. ASK is not a universal agent recipe. It is a clear demonstration of a design principle: uncertainty should decide when general reasoning is invited into a specialized control loop.

The quiet lesson: ask for help, but audit the helper

The most useful idea in this paper is almost managerial.

Do not replace a specialist with a generalist. Do not ask the generalist to approve every move. Do not assume the generalist is helpful just because it is larger than yesterday’s model. Give the specialist a way to say, “I am outside my comfort zone.” Then give the generalist a constrained role, measure its overrides, and recalibrate the gate when conditions change.

That is not a grand theory of intelligence. It is workflow design.

The irony, naturally, is that this modest design lesson is more valuable than many grand theories. Most AI failures in organizations will not come from a lack of impressive models. They will come from poor routing: the wrong component making the wrong decision at the wrong time, with everyone later pretending the dashboard looked green.

ASK offers a better question. Not “Can the language model control the agent?” Not “Can the RL policy generalize forever?” But: when the policy is uncertain, is the language model competent enough to help, and do we know when to let it speak?

That is the kind of question production AI systems should learn to ask before they ask for budget.

Cognaptus: Automate the Present, Incubate the Future.

¹ Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, and Adriano Veloso, “When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning,” arXiv:2604.02226v1, 2026, https://arxiv.org/abs/2604.02226.
