A support ticket arrives. Then a compliance question. Then a spreadsheet formula request. Then a genuinely nasty piece of mathematical reasoning wearing the innocent expression of a homework problem. In too many AI systems, all four get sent to the same expensive reasoning model, because the architecture has the subtlety of a hotel buffet: everything goes through the same line.

That is the waste this paper attacks. “Optimizing Reasoning Efficiency through Prompt Difficulty Prediction” asks a narrow but commercially serious question: can an AI system predict how hard a prompt is before spending full reasoning compute on it? [1]

The answer, in this paper, is yes — at least for math-heavy reasoning benchmarks and under a set of controlled assumptions. The important part is not merely that routing saves compute. We already knew that, in the same way we know bicycles are cheaper than helicopters. The useful contribution is the mechanism: use intermediate representations from a strong reasoning model to train lightweight predictors, then route each problem to the smallest model likely to solve it.

That shifts the efficiency story from “make one model think less” to “decide which model should think at all.” A modest distinction, except for the small matter of inference budgets.

The mechanism is prediction first, reasoning second

The paper’s routing system has four moving parts.

First, it extracts intermediate representations from s1.1-32B, a capable reasoning model. These are not final answers, chain-of-thought traces, or surface features such as prompt length. They are hidden-layer embeddings: internal signals produced while the model processes the problem.

Second, it trains lightweight multilayer perceptrons on those representations. The authors build two prediction routes:

| Predictor type | Training signal | Routing decision |
| --- | --- | --- |
| Difficulty predictor | MATH dataset difficulty labels from 1 to 5 | If predicted difficulty exceeds a threshold, send the problem to a larger model |
| Model-correctness predictor | Whether each candidate model answered a MathCombined problem correctly | Send the problem to the weakest model whose predicted correctness clears a threshold |

Third, it evaluates the resulting router against static model use and random assignment. The key metric is not only accuracy. It is accuracy versus average inference time per question, which acts as the paper’s practical proxy for compute cost.

Fourth, it varies thresholds. This matters because routing is not a binary invention. It is an operating policy. A stricter threshold sends more work to expensive models and buys accuracy; a looser one accepts more cheap-model risk. The paper is therefore less about one magic router and more about an efficiency frontier.
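
One way to read a threshold sweep as a frontier is to keep only the operating points that no other point beats on both time and accuracy. The sketch below does that for a handful of made-up (time, accuracy) pairs. The numbers and the `pareto_frontier` helper are illustrative assumptions, not values or code from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (avg_seconds_per_question, accuracy). A point is
    dominated if another point is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    frontier = []
    for time_a, acc_a in points:
        dominated = any(
            time_b <= time_a and acc_b >= acc_a
            and (time_b < time_a or acc_b > acc_a)
            for time_b, acc_b in points
        )
        if not dominated:
            frontier.append((time_a, acc_a))
    return sorted(frontier)


# Hypothetical operating points from sweeping a routing threshold.
sweep = [(2.0, 0.55), (4.0, 0.70), (5.0, 0.68), (7.0, 0.80), (10.0, 0.80)]
frontier = pareto_frontier(sweep)
```

In this toy sweep, the point at 5.0 seconds drops out because the 4.0-second setting is both faster and more accurate, and the 10.0-second point drops out because 7.0 seconds buys the same accuracy.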

That frontier is where the business value lives.

The middle layers know more than the answer layer

The first useful finding is almost architectural: the best prompt-difficulty signal does not come from the final layer.

The authors train a three-layer MLP on the MATH dataset, which contains 7,500 competition math problems labelled by difficulty from level 1 to level 5. They use 6,000 problems for training and 1,500 for validation. The MLP takes s1.1-32B layer outputs as 5,120-dimensional inputs and tries to predict the labelled difficulty.
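
To get a feel for how light such a predictor is, the sketch below counts the parameters of a three-layer fully connected network. The 5,120 inputs and the 5 difficulty levels come from the description above; the hidden widths of 512 and 128 are assumptions chosen purely for illustration, since the widths are not stated here.

```python
def mlp_parameter_count(layer_widths):
    """Count weights and biases in a fully connected network."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_widths, layer_widths[1:])
    )


# 5,120 inputs and 5 output classes follow the text;
# the hidden widths 512 and 128 are hypothetical.
widths = [5120, 512, 128, 5]
n_params = mlp_parameter_count(widths)
```

Under those assumed widths the predictor has fewer than three million parameters, which is tiny next to a model with roughly 32 billion.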

The layer sweep is the paper’s first main diagnostic. Figure 2 is not just decoration; it is a sensitivity test over model depth. Accuracy improves from early layers into the middle of the model, peaks around the mid-to-late region, and then weakens again toward the end. For model-correctness prediction, test loss follows a similar pattern: later is not automatically better, and final representations are not the cleanest place to read task difficulty.

This is plausible. Early layers are still close to token and syntax processing. Final layers are closer to answer formation. Middle layers may preserve a more useful abstraction of problem structure before the model commits to a solution path. The paper uses layer 45 for the main routers, which is an implementation choice grounded in that layer sweep.

The business translation is simple but not trivial: useful routing signals may sit inside the model before the model answers. If you wait until the expensive model has fully reasoned, you have already paid the bill. Very elegant. Also very late.

Difficulty routing is the simpler instrument

The difficulty-based router is the easier mechanism to understand. Train a classifier to estimate problem difficulty. Choose a threshold. Above the threshold, use a larger model. Below it, use a smaller one.
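
The rule is small enough to write down directly. This is a minimal sketch, assuming a predictor that returns a difficulty score on the 1-to-5 MATH scale; the model names, the threshold of 3.0, and the toy scores are placeholders rather than anything taken from the paper.

```python
def route_by_difficulty(predicted_difficulty, threshold,
                        small_model="small-model", large_model="large-model"):
    """Send a problem to the large model only if it looks hard enough."""
    if predicted_difficulty > threshold:
        return large_model
    return small_model


# Toy predicted difficulties on the 1-to-5 scale.
predictions = [1.3, 2.8, 3.1, 4.7]
assignments = [route_by_difficulty(d, threshold=3.0) for d in predictions]
```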

This is tested on MathCombined, a 3,136-problem collection assembled from AIME24, AMC23, GSM8k, Minerva, OlympiadBench, and TheoremQA. The split is 1,882 training problems, 626 validation problems, and 628 router-evaluation problems. The dataset has ground-truth solutions but not difficulty labels, so the difficulty predictor trained on MATH is used as a transferable difficulty estimator.

Figure 3 is the main evidence for this route. The authors compare threshold-based routing between smaller models and s1.1-32B. Across tested thresholds, the router's points (plotted in blue) sit above the random-assignment baseline. In plain language: the predictor is doing something more intelligent than flipping a weighted coin.
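
The weighted-coin comparison can be made concrete. If a coin sends a fraction of problems to the large model regardless of content, expected accuracy is just a blend of the two models' accuracies. The sketch below computes that blend and the router's margin over it, assuming each model's time per question does not depend on which problems it receives, so equal large-model share means equal average time. All numbers are toy values.

```python
def random_baseline_accuracy(frac_large, acc_small, acc_large):
    """Expected accuracy if a weighted coin picks the large model."""
    return (1 - frac_large) * acc_small + frac_large * acc_large


def router_gain(router_accuracy, frac_large, acc_small, acc_large):
    """Accuracy gained over random assignment at the same large-model share."""
    return router_accuracy - random_baseline_accuracy(
        frac_large, acc_small, acc_large)


# Toy numbers: small model 50% accurate, large model 80% accurate,
# router sends half the problems to the large model and scores 72%.
gain = router_gain(0.72, frac_large=0.5, acc_small=0.5, acc_large=0.8)
```

A positive gain at matched share is what "sitting above the baseline" means; a gain near zero would say the predictor adds nothing beyond the coin.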

This does not mean the difficulty classifier has become a general theory of mathematical hardness. It means that, within the tested benchmark family, its internal representation is good enough to allocate work more efficiently than naive routing. That distinction matters. “Good enough to route” is much weaker than “understands difficulty,” but for production systems, weaker claims often pay invoices more reliably than philosophical ones.

Correctness routing is more operationally useful

The second router is more interesting for deployment because it predicts model-specific success. Instead of asking “is this problem hard?”, it asks “which model is likely to solve this problem?”

The authors train a four-layer MLP using s1.1-32B intermediate outputs to predict whether candidate models will answer correctly. The candidate set includes models such as OLMo-2-1124-7B-Instruct, Phi-4, Llama-3.3-70B-Instruct, Llama-3.3-Nemotron-Super-49B, and s1.1-32B, among others. Appendix B adds an important cleaning step: some models are removed from the router pool because their accuracy-time profiles are not useful for the routing setup, including non-reasoning models and other exceptions to the general accuracy-cost trend.

The resulting accuracy-based router assigns each problem to the weakest model whose predicted correctness exceeds a threshold. Figure 4 is the main evidence here. With suitable thresholds, the routed system reaches accuracy comparable to, and in some settings slightly above, always using s1.1-32B, while requiring only about two-thirds of the average inference time, the paper's proxy for compute.
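
The assignment rule can be sketched in a few lines. The model names and probabilities below are hypothetical, and the fallback to the strongest model when nothing clears the threshold is an assumption of this sketch, since the behaviour in that case is not described here.

```python
def route_by_correctness(predicted_correctness, models_weak_to_strong, threshold):
    """Pick the weakest model whose predicted success exceeds the threshold.

    Falls back to the strongest model if none does (an assumption of
    this sketch, not a documented behaviour).
    """
    for model in models_weak_to_strong:
        if predicted_correctness[model] > threshold:
            return model
    return models_weak_to_strong[-1]


# Hypothetical pool, ordered from weakest to strongest.
models = ["model-small", "model-medium", "model-large"]
probs = {"model-small": 0.35, "model-medium": 0.74, "model-large": 0.91}
choice = route_by_correctness(probs, models, threshold=0.7)

# A problem on which no model looks reliable.
hard_case = route_by_correctness(
    {"model-small": 0.1, "model-medium": 0.2, "model-large": 0.4},
    models, threshold=0.7)
```

A deployment might prefer to abstain or escalate to a human in the second case instead of falling back; that is a policy choice, not something the rule itself dictates.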

That is the headline, but the mechanism is the real story. The router is not merely replacing a big model with a small model. It is using a learned estimate of model adequacy as a gate. The system’s unit of optimization is no longer “one model per product”; it becomes “one model per problem.”

For enterprises, that is the difference between buying a single enormous machine and running a dispatch centre.

The appendix explains why routing can beat the strongest model

One easy misconception is that a router can at best approximate the strongest model. That would be true if model capability were perfectly nested: every problem solved by a weaker model would also be solved by every stronger model.

Appendix B shows that the world is less tidy, because naturally it is. The authors compare model pairs and count cases where model $i$ answers correctly while model $j$ does not. Every pair has non-zero disagreement. Even s1.1-32B, the strongest model in the paper’s main comparison, misses a few problems that the weakest model, Mixtral-8x7B-Instruct, answers correctly.

The resulting pairwise heatmap (Figure 7) is not the main performance result. Its likely purpose is diagnostic: it explains why a routed ensemble can sometimes edge past the strongest individual model. The mechanism is complementarity. Stronger average performance does not imply total dominance on every item.
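
Complementarity is easy to check on any set of per-problem correctness logs. The sketch below counts, for each ordered pair of models, the problems the first solved and the second missed, and computes the accuracy of an ideal router that always finds a model that solves the problem when one exists. The two models and their records are toy data, and both helpers are hypothetical.

```python
def disagreement_counts(results):
    """counts[i][j] = number of problems model i solved that model j missed."""
    names = list(results)
    return {
        i: {
            j: sum(a and not b for a, b in zip(results[i], results[j]))
            for j in names if j != i
        }
        for i in names
    }


def oracle_accuracy(results):
    """Accuracy of an ideal router: a problem counts if any model solves it."""
    solved = [any(row) for row in zip(*results.values())]
    return sum(solved) / len(solved)


# Toy correctness records (True = solved) for two hypothetical models.
results = {
    "weak":   [True, True, False, False, True],
    "strong": [True, True, True, True, False],
}
counts = disagreement_counts(results)
ceiling = oracle_accuracy(results)
```

Here the strong model alone scores four out of five, while the ideal router scores five out of five, because the weak model happens to solve the one problem the strong model misses. A real router sits somewhere between those two numbers.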

That observation is especially relevant to business systems. Model portfolios are usually justified by cost tiers: cheap, medium, premium. This paper suggests a second reason: error diversity. A smaller model is not just a worse large model. Sometimes it is wrong differently. Occasionally, inconveniently, it is right differently too.

What each experiment is actually doing

The paper is compact, so the figures carry much of the argument. Their roles are not identical.

| Test or figure | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 routing diagram | Implementation detail | The system predicts before selecting a model | That the predictor is accurate |
| Figure 2 layer sweep on s1.1-32B | Sensitivity test and representation evidence | Middle layers carry useful difficulty and correctness signals | That the same layer works for all domains |
| Figure 3 difficulty-based routing | Main routing evidence | Threshold routing beats random assignment between model pairs | That difficulty labels transfer universally |
| Figure 4 accuracy-based routing | Main routing evidence | Correctness prediction can match or slightly exceed s1.1-32B with lower average inference time | That production cost savings equal the measured time ratio |
| Figure 5 MATH difficulty distribution | Dataset detail | The difficulty predictor has labelled training data across levels | That MATH labels capture enterprise difficulty |
| Figure 6 model accuracy versus time | Model-pool comparison | Higher accuracy generally costs more time, with notable exceptions | That all models form a clean capability ladder |
| Figure 7 pairwise model heatmap | Diagnostic evidence | Different models solve different subsets of problems | That ensemble routing will always outperform the top model |
| Figures 8–10 using Llama Nemotron 8B embeddings | Robustness/sensitivity extension | Smaller-model embeddings can still support decent prediction and routing | That any small model can replace s1.1-32B as the representative embedder |

The final appendix result deserves emphasis. The main experiments use s1.1-32B embeddings to train predictors. Appendix C repeats the idea with Llama-3.1-Nemotron-Nano-8B-v1 embeddings and again finds useful middle-layer signals, with routers still outperforming random assignment. This is a robustness extension, not a second thesis. It weakens the fear that the method only works because the embedder is already a large strong model. It does not eliminate the need to validate the embedder in each deployment environment.

The business value is cost elasticity, not cheaper intelligence in general

Cognaptus’ practical inference is that predictive routing should be treated as an AI infrastructure pattern.

A production version would look something like this:

  1. Collect representative prompts from the target workflow.
  2. Run them through candidate models and record correctness, quality scores, latency, and cost.
  3. Train a lightweight predictor using available prompt representations.
  4. Calibrate thresholds against business tolerance for error, delay, and spend.
  5. Route easy or predictable cases to cheaper models.
  6. Escalate hard, risky, or low-confidence cases to stronger models.
  7. Monitor drift, because user prompts have the unfortunate habit of changing after launch.
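
Step 2 is where most of the raw material for routing comes from. As a minimal sketch, assuming each logged run is a dictionary with `model`, `correct`, and `seconds` fields (hypothetical names), the aggregation into per-model accuracy and mean latency might look like this:

```python
from collections import defaultdict


def summarise_runs(records):
    """Aggregate logged runs into per-model accuracy and mean latency."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "seconds": 0.0})
    for rec in records:
        t = totals[rec["model"]]
        t["n"] += 1
        t["correct"] += int(rec["correct"])
        t["seconds"] += rec["seconds"]
    return {
        model: {"accuracy": t["correct"] / t["n"],
                "mean_seconds": t["seconds"] / t["n"]}
        for model, t in totals.items()
    }


# Hypothetical log rows; the field names are assumptions of this sketch.
records = [
    {"model": "cheap", "correct": True, "seconds": 1.0},
    {"model": "cheap", "correct": False, "seconds": 3.0},
    {"model": "premium", "correct": True, "seconds": 8.0},
    {"model": "premium", "correct": True, "seconds": 12.0},
]
summary = summarise_runs(records)

# Order the pool from cheapest to most expensive by mean latency.
cost_order = sorted(summary, key=lambda m: summary[m]["mean_seconds"])
```

The same per-prompt records double as training labels for a correctness predictor, which is why it pays to log them per prompt and per model instead of keeping only the averages.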

This pattern is relevant anywhere reasoning cost is unevenly distributed: customer-service triage, compliance checks, internal analytics, coding assistants, structured document review, tutoring systems, and decision-support tools. In each case, the point is not to “use small models” as a moral philosophy. The point is to avoid paying premium reasoning rates for tasks that do not require premium reasoning.

The paper also hints at secondary uses. Difficulty predictors could label datasets automatically, support curriculum learning, or trigger abstention when a problem is predicted to be too hard. Those are credible extensions, but they are not the central evidence. The central evidence is routing.

The uncomfortable deployment details

There are several boundaries that matter before anyone starts stapling this idea onto production systems.

First, the evidence is math-heavy. MATH, GSM8k, AIME, AMC, Minerva, OlympiadBench, and TheoremQA are useful reasoning tests, but they are not vendor onboarding emails, insurance claims, audit memos, or messy Excel requests written by someone at 11:47 p.m. The method may transfer, but the paper does not prove broad enterprise generalisation.

Second, the approach assumes access to intermediate representations. That is easy in open-weight or internally hosted models. It is less straightforward when using closed commercial APIs that expose only text outputs and usage metadata. A practical enterprise version may need surrogate features, local embedding models, logit access, confidence estimates, or a separately trained router model.

Third, threshold tuning is not a footnote. A threshold is a policy decision disguised as a number. Lower thresholds reduce cost but increase the risk of under-routing. Higher thresholds protect accuracy but send more work to expensive models. In regulated or customer-facing systems, this calibration needs governance, not vibes and a dashboard.
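
One way to make that policy explicit is to state the accuracy floor first and let it pick the threshold. The sketch below assumes a validation sweep of (threshold, accuracy, average seconds) tuples for a correctness-threshold router; the numbers and the `pick_threshold` helper are invented for illustration.

```python
def pick_threshold(sweep, min_accuracy):
    """Return the cheapest swept setting that meets the accuracy floor.

    `sweep` holds (threshold, accuracy, avg_seconds) tuples measured on
    validation data. Returns None if no setting qualifies.
    """
    eligible = [row for row in sweep if row[1] >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row[2])


# Made-up validation sweep.
sweep = [
    (0.5, 0.71, 3.2),
    (0.7, 0.78, 5.1),
    (0.9, 0.81, 8.4),
]
chosen = pick_threshold(sweep, min_accuracy=0.75)
unreachable = pick_threshold(sweep, min_accuracy=0.95)
```

Returning `None` when the floor cannot be met is deliberate: it forces someone to decide whether to relax the floor or add a stronger model, instead of silently shipping the least bad setting.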

Fourth, the cost metric is average inference time per question. That is useful, but production cost also depends on token pricing, hardware utilisation, batching, memory pressure, vendor contracts, retry rates, and the overhead of running the router itself. Using two-thirds of the inference time in a benchmark does not automatically mean paying two-thirds of the invoice in a real deployment. Finance departments tend to notice such things.
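
A small calculation shows how the two ratios can drift apart. The sketch below computes routed cost relative to always calling one baseline model, once in seconds and once in dollars with a per-call router overhead. The traffic split, prices, and overhead are toy assumptions, not figures from the paper.

```python
def routed_cost_ratio(shares, cost_per_call, baseline_model, router_overhead=0.0):
    """Cost of routed traffic relative to always calling `baseline_model`.

    `shares` maps model name to the fraction of traffic it receives;
    `cost_per_call` maps model name to cost in any consistent unit.
    """
    routed = router_overhead + sum(
        share * cost_per_call[model] for model, share in shares.items())
    return routed / cost_per_call[baseline_model]


# Toy traffic split and per-call costs.
shares = {"small": 0.5, "large": 0.5}
seconds = {"small": 2.0, "large": 10.0}
dollars = {"small": 0.004, "large": 0.010}

time_ratio = routed_cost_ratio(shares, seconds, "large")
bill_ratio = routed_cost_ratio(shares, dollars, "large", router_overhead=0.001)
```

With these made-up inputs the routed system uses 60% of the baseline time but pays 80% of the baseline bill, because the small model is not proportionally cheaper and the router adds a fixed cost per call.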

Finally, the paper’s representative-model assumption is practical but fragile. It uses embeddings from one sufficiently capable model to predict difficulty patterns across a model pool. That may work when the candidate models share similar task structure and benchmark distribution. It may fail when models differ sharply in training data, tool use, language coverage, domain specialisation, or refusal behaviour.

The real lesson: reasoning should be allocated, not worshipped

The paper’s best idea is not that smaller models are useful. That is already known. Nor is it that routing can save money. Also known. The useful step is showing that hidden representations from a reasoning model can forecast enough about prompt difficulty and model correctness to make routing operational.

That changes the architecture of AI reasoning. Instead of treating a large model as the universal destination for every problem, the system first asks a cheaper question: what kind of thinking does this prompt deserve?

For businesses, this is a welcome demotion of brute force. Reasoning models remain valuable, but they become part of a portfolio rather than the only adult in the room. The future of efficient AI may not be a single model that thinks perfectly at every depth. It may be a dispatch system that knows when a cheap mind is enough, when a stronger mind is necessary, and when the problem should be escalated before the invoice becomes an argument.

Fast minds are useful. Cheap thinking is useful. Knowing which one to use is where the money is.

Cognaptus: Automate the Present, Incubate the Future.


  1. Bo Zhao, Berkcan Kapusuzoglu, Kartik Balasubramaniam, Sambit Sahu, Supriyo Chakraborty, and Genta Indra Winata, “Optimizing Reasoning Efficiency through Prompt Difficulty Prediction,” arXiv:2511.03808, 2025, https://arxiv.org/abs/2511.03808