TL;DR for operators
Most model-selection dashboards still ask the wrong question. They ask which LLM gives the best accuracy for the lowest inference cost. Zellinger and Thomson’s paper asks a more operationally honest one: how much does a wrong answer, a slow answer, or no answer cost in this specific workflow? [1]
The paper’s useful move is to convert competing performance metrics into a single expected dollar reward. Inference cost stays in dollars. Latency gets priced in dollars per second or minute. Errors get priced by their business consequence. Abstention gets priced by the cost of failing to answer or escalating to a human. Once everything is in the same unit, the “best model” is no longer the one that looks attractive on a Pareto plot. It is the model with the highest expected reward under the actual economics of the task.
In the paper’s MATH experiments, reasoning models are much more expensive and slower than non-reasoning models, but they also make far fewer mistakes. When latency is ignored, the average reasoning-model category beats the non-reasoning category once the price of an error exceeds about $0.20 per query in the cost-only comparison. The abstract and conclusion state an even lower headline threshold of $0.01 for reasoning models on difficult maths questions; the body’s Figure 3 reports $0.20 for level-3 questions and $0.14 for level-5 questions in the cost-only setting. The exact number is less important than the order of magnitude: if an LLM error costs even pocket change, “cheap but wrong” stops being cheap rather quickly.
Latency changes the answer. When the paper prices latency between $0 and $10 per minute, reasoning models generally dominate above an error price of $10 when latency is worth up to $5 per minute; at $10 per minute, the critical error price rises to $100. So the operational rule is not “reasoning models always win”. It is “reasoning models win when the avoided mistakes are worth more than the extra waiting time and token bill”.
The cascade result is the useful slap on the wrist. A small-to-large cascade sounds economically elegant: use a cheap model first, escalate only when uncertain. But in the paper, sending difficult level-5 MATH questions directly to Qwen3 235B-A22B often beats cascades once the price of error crosses $0.10, assuming latency is not expensive. The exception is revealing: a cascade using Llama3.1 405B as the first model can outperform because that model has unusually good self-verification. In cascades, raw accuracy is not enough. The first model must know when it is likely wrong. Very zen. Also very measurable.
For business use, this paper should push procurement and product teams to build model evaluation around priced failure modes, not token bills. Estimate the cost of a wrong answer, the cost of delay, the cost of handoff, and the value of abstention. Then run sensitivity tables. The paper’s experiments are not a universal purchasing guide: they use the MATH benchmark, filtered numeric answers, June 2025 API prices, and a correctness evaluator that could affect some cascade findings. But the decision mechanism is broadly useful. The model budget should begin with the cost of being wrong.
The cheapest model is only cheap before it makes a mistake
A manager choosing an LLM for customer support, claims review, coding assistance, or internal analysis usually faces a spreadsheet with three familiar columns: accuracy, latency, and cost per query. The cheap model looks disciplined. The frontier model looks indulgent. The cascade looks clever, which is usually enough to make it dangerous in meetings.
The problem is that those columns are not in the same unit. Accuracy is a probability. Latency is time. Inference cost is money. Abstention is an operational event. A Pareto frontier can show that one model is not obviously dominated by another, but it cannot tell you whether a 3% error-rate reduction is worth 14 extra seconds and five more cents. It only draws the battlefield and then politely walks away.
The paper’s central contribution is to stop treating model selection as a beauty contest among metrics. It treats each LLM or LLM system as an agent that produces a per-query reward. The reward is negative because every bad thing is a cost:

$$R = -\left(C + \lambda_L L + \lambda_E \mathbf{1}_E + \lambda_A \mathbf{1}_A\right)$$
Here, $C$ is the inference cost, $L$ is latency, $\mathbf{1}_E$ indicates an error, and $\mathbf{1}_A$ indicates abstention. The coefficients are the important part. $\lambda_E$ is the price of an error. $\lambda_L$ is the price of latency. $\lambda_A$ is the price of abstention. They are not mystical weights from a product manager’s slide deck. They are economic prices: how much the organisation would pay to avoid that bad outcome.
That turns model choice into expected reward maximisation:

$$\theta^* = \arg\max_{\theta} \; \mathbb{E}\left[R(\theta)\right]$$
where $\theta$ can be a model choice, a prompt setting, a cascade threshold, or another system configuration. The point is not that the formula is mathematically exotic. It is not. The point is that it forces an organisation to say the quiet part aloud: how much does a bad answer actually cost?
That is the useful discomfort. A vague preference for “better quality” becomes a dollar-valued operating assumption. A vague concern about latency becomes a price per minute. A vague dislike of abstention becomes a cost of escalation, non-response, or user abandonment.
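A minimal Python sketch of that conversion is below. It assumes average per-query measurements are already available for each candidate. The `Profile` class, the model names, and every number are hypothetical illustrations, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Average per-query measurements for one model or system (hypothetical)."""
    name: str
    cost_usd: float      # mean inference cost C, in dollars
    latency_min: float   # mean latency L, in minutes
    error_rate: float    # probability of a wrong answer
    abstain_rate: float  # probability of abstaining


def expected_reward(p, lam_e, lam_l, lam_a):
    """E[R] = -(C + lam_l * L + lam_e * P(error) + lam_a * P(abstain))."""
    return -(p.cost_usd + lam_l * p.latency_min
             + lam_e * p.error_rate + lam_a * p.abstain_rate)


def best(profiles, lam_e, lam_l, lam_a):
    """Return the profile with the highest expected reward under the given prices."""
    return max(profiles, key=lambda p: expected_reward(p, lam_e, lam_l, lam_a))


# Illustrative numbers only; they are not taken from the paper.
cheap = Profile("cheap-model", 0.001, 0.05, 0.25, 0.0)
strong = Profile("strong-model", 0.05, 0.50, 0.03, 0.0)

for lam_e, lam_l in [(0.01, 0.0), (1.0, 0.0), (1.0, 1.0)]:
    choice = best([cheap, strong], lam_e, lam_l, lam_a=0.0)
    print(f"error price ${lam_e}, latency price ${lam_l}/min -> {choice.name}")
```

The same two profiles swap places as the prices move, which is the whole argument in a dozen lines: the ranking belongs to the prices, not to the models.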
Pareto charts are still useful. The authors even connect their framework back to Pareto optimality through theoretical results showing that sweeping across economic scenarios can recover the Pareto surface under regularity assumptions. But the article-worthy idea is sharper: the Pareto frontier is not the decision. It is the menu. Prices are the ordering system.
Pricing the error is the hard part, which is why it matters
The framework depends on estimating $\lambda_E$, the price of error. That sounds like the part where everyone suddenly has another meeting.
But the paper gives a concrete example for medical diagnosis. It estimates the price of a diagnostic error using malpractice payout data and Bayes’ theorem. The calculation starts with the expected cost conditional on a malpractice lawsuit, the probability that a malpractice lawsuit involves a genuine medical error, the probability of a lawsuit per diagnosis, and the probability of diagnostic error. Plugging the paper’s chosen approximations into the formula yields an estimated diagnostic-error price of about $333. Earlier in the paper, the authors also discuss medical-note-taking as a setting where the price of error may exceed $100.
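The Bayes step can be written out symbolically. This is a sketch under two stated assumptions: the cost of a diagnostic error arrives only through lawsuits, and the expected payout is the figure conditional on a lawsuit. The event labels are illustrative notation rather than the paper's, and no numerical inputs are filled in here because the paper's chosen approximations are not reproduced in this article.

$$\lambda_E \approx \mathbb{E}[\text{cost} \mid \text{lawsuit}] \cdot P(\text{lawsuit} \mid \text{error}), \qquad P(\text{lawsuit} \mid \text{error}) = \frac{P(\text{error} \mid \text{lawsuit}) \, P(\text{lawsuit})}{P(\text{error})}$$

Each factor on the right is one of the four quantities listed above, which is why an error price can be assembled from data an organisation can actually look up.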
The point is not that $333 is the universal price of a medical LLM mistake. It is not. The point is that error cost can be estimated from business primitives: rework time, liability exposure, lost revenue, refund risk, churn, compliance remediation, expert review, or opportunity cost. An organisation does not need metaphysics. It needs accounting with fewer adjectives.
For example:
| Workflow | Plausible error-cost components | What should be priced |
|---|---|---|
| Customer support | refund, escalation, churn, support time | cost of a wrong resolution |
| Legal drafting | lawyer review, contract delay, risk of defective clause | cost of an undetected drafting error |
| Sales operations | wrong lead qualification, wasted outreach, missed conversion | cost of misclassification |
| Medical documentation | correction time, downstream clinical risk, liability exposure | cost of inaccurate note or recommendation |
| Internal analytics | decision delay, wrong forecast, managerial rework | cost of acting on bad analysis |
This is where the paper quietly demotes the token bill. In many serious workflows, the cost of a mistake dominates the cost of inference. A model that costs ten times more per call but prevents a few expensive errors can be cheaper in the only sense finance departments should care about: total expected cost.
The obvious objection is that error prices are uncertain. The paper’s answer is not to pretend otherwise. It suggests evaluating models over a range of economic parameters. That is the right move. A model-selection exercise should not produce a single ceremonial winner; it should produce a sensitivity table showing when the winner changes.
If the chosen model remains optimal across plausible values of $\lambda_E$ and $\lambda_L$, deployment confidence rises. If it flips as soon as the error cost moves from $1 to $3, the organisation has learned something useful: the decision is fragile, and someone should stop calling it “obvious”.
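A sensitivity table of this kind takes only a few lines to produce. The sketch below is one possible layout, assuming two hypothetical models described by average cost, latency, and error rate. All numbers are invented for illustration and abstention is left out for brevity.

```python
def expected_loss(cost_usd, latency_min, error_rate, lam_e, lam_l):
    """Expected per-query loss in dollars (the negative of expected reward)."""
    return cost_usd + lam_l * latency_min + lam_e * error_rate


# Hypothetical profiles: (cost in $, latency in minutes, error rate).
MODELS = {
    "fast-model": (0.001, 0.05, 0.25),
    "reasoning-model": (0.05, 0.50, 0.03),
}


def winner(lam_e, lam_l):
    """Name of the model with the lowest expected loss at these prices."""
    return min(MODELS, key=lambda m: expected_loss(*MODELS[m], lam_e, lam_l))


def sensitivity_table(error_prices, latency_prices):
    """Map every (error price, latency price) pair to its winning model."""
    return {(le, ll): winner(le, ll)
            for le in error_prices for ll in latency_prices}


def is_robust(table):
    """True when a single model wins in every scenario of the table."""
    return len(set(table.values())) == 1


error_prices = [0.1, 1, 3, 10, 100]   # dollars per error
latency_prices = [0, 1, 5, 10]        # dollars per minute
table = sensitivity_table(error_prices, latency_prices)

for le in error_prices:
    row = ", ".join(f"${ll}/min: {table[(le, ll)]}" for ll in latency_prices)
    print(f"error price ${le}: {row}")
print("robust choice" if is_robust(table) else "winner depends on the prices")
```

If `is_robust` comes back false over the plausible price range, the meeting should be about the prices, not the models.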
Reasoning models look expensive until errors have a price
The paper applies the framework to difficult mathematics questions from the MATH benchmark. The authors filter to numeric-answer questions and sample 500 questions each from difficulty levels 1, 3, and 5. The models include three non-reasoning models (Llama3.3 70B, Llama3.1 405B, and GPT-4.1) and three reasoning models (DeepSeek R1, o3, and Qwen3 235B-A22B). The experiments use zero-shot chain-of-thought prompting, commercial API calls, measured latency, and token-based API pricing from June 2025.
The baseline result is unsurprising but important. On the hardest MATH questions, reasoning models have far lower error rates. The figure reports error rates of 30.6%, 24.0%, and 22.0% for Llama3.1 405B, Llama3.3 70B, and GPT-4.1, compared with 5.8%, 3.8%, and 2.8% for DeepSeek R1, o3, and Qwen3 235B-A22B. But the price is visible: reasoning models cost 10–100 times more per query and can take up to 10 times longer.
A shallow summary would stop there and say: “reasoning models are better but more expensive”. Thank you, observability dashboard. The paper instead asks when the extra accuracy is economically worth buying.
In the cost-only comparison, ignoring latency, reasoning models beat non-reasoning models once the price of an error crosses a low threshold. The abstract and conclusion state a headline threshold of $0.01 on difficult maths questions. The body’s Figure 3, which averages by model category and plots levels 3 and 5 separately, reports thresholds of $0.20 and $0.14 respectively, with the caption summarising the crossover as $0.20. This discrepancy should not be hidden, because it matters for careful reading. The safe interpretation is that the threshold is very low, but the precise value depends on the experimental slice and aggregation being referenced.
The paper’s own intuition check is useful. If a human worker takes five minutes to redo a task after a mistake, then a $0.20 error cost is reached once the worker’s wage exceeds about $2.40 per hour (twelve five-minute redos at $0.20 each). That is not exactly a high bar. In most professional workflows, a wrong answer costs more than a few minutes of rework. It may trigger review, correction, delay, escalation, reputational harm, or a customer deciding that the company’s shiny AI assistant is, in fact, a vending machine for nonsense.
The mechanism is simple: as $\lambda_E$ rises, the penalty for error dominates the extra token cost. A high-accuracy model can be economically preferable even when its inference cost looks offensive in isolation.
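The break-even point follows from comparing expected costs directly. In the sketch below, $C_r, e_r$ and $C_n, e_n$ are assumed shorthand for the average inference cost and error rate of a reasoning and a non-reasoning model. These symbols are introduced here for illustration and are not the paper's notation. With latency and abstention ignored, and assuming $e_n > e_r$:

$$C_r + \lambda_E e_r < C_n + \lambda_E e_n \iff \lambda_E > \frac{C_r - C_n}{e_n - e_r}$$

The numerator is the extra token bill and the denominator is the gap in error rates. A small cost gap divided by a large accuracy gap gives a small threshold, which is why the crossover can land at a few cents.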
This is where the common “use the cheapest acceptable model” heuristic becomes suspect. Acceptable by what price of error? Acceptable under whose review process? Acceptable when the wrong output is caught, or when it reaches the customer, the clinician, the regulator, or the spreadsheet that drives next quarter’s inventory order?
Without those answers, “acceptable” is not a threshold. It is a vibe with procurement approval.
Latency is not a footnote; it changes the winner
The cleanest version of the reasoning-model result ignores latency. Real systems rarely get that luxury. A model that returns a brilliant answer after too long may still be a bad product.
The paper’s second analysis introduces a price of latency, ranging from $0 to $10 per minute. The authors frame this range as equivalent to human wages from $0 to $600 per hour, covering many human-task automation settings. They explicitly note that this does not cover more stringent latency environments such as popular web applications or large-scale database iteration, where milliseconds may be economically meaningful.
Once latency is priced, the threshold for reasoning models moves upward. The paper reports that reasoning models generally outperform non-reasoning models for error prices above $10 when latency is priced at no more than $5 per minute. When latency is priced at $10 per minute, the critical error price rises to $100. Across the grid, Qwen3 235B-A22B, o3, and Llama3.3 70B emerge as preferred in many economic scenarios.
This is not a contradiction of the cost-only result. It is the framework doing its job.
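The shift can be seen from the break-even condition once latency carries a price. The sketch below assumes two models that differ by a fixed average cost gap, latency gap, and error-rate gap. The function name and all numbers are hypothetical and do not reproduce the paper's figures.

```python
def critical_error_price(extra_cost_usd, extra_latency_min, error_gap, lam_l):
    """Error price above which the slower, more accurate model wins.

    Derived from  C_acc + lam_l*L_acc + lam_e*e_acc < C_fast + lam_l*L_fast + lam_e*e_fast,
    which rearranges to  lam_e > (extra_cost + lam_l * extra_latency) / error_gap.
    """
    if error_gap <= 0:
        raise ValueError("the accurate model must have a strictly lower error rate")
    return (extra_cost_usd + lam_l * extra_latency_min) / error_gap


# Hypothetical gaps: 3 cents more per query, 0.4 minutes slower, 20 points fewer errors.
for lam_l in [0, 1, 5, 10]:
    threshold = critical_error_price(0.03, 0.4, 0.20, lam_l)
    print(f"latency price ${lam_l}/min -> critical error price ${threshold:.2f}")
```

The threshold grows linearly with the latency price, so every extra dollar per minute of waiting raises the bar that avoided errors have to clear.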
The business distinction is straightforward:
| Deployment setting | Error price | Latency price | Likely implication |
|---|---|---|---|
| Offline document review | medium to high | low | stronger reasoning models are easier to justify |
| Internal analyst assistant | medium | moderate | model choice depends on task urgency and review burden |
| Customer chat | low to medium | high | latency can outweigh incremental accuracy |
| Clinical or legal support | high | often moderate | accuracy may dominate, but escalation design matters |
| High-volume web interaction | variable | very high | the paper’s tested latency range may not apply |
The lesson is not “always use reasoning models”. It is “stop evaluating them on inference price alone”. In a slow-but-serious workflow, latency may be tolerable and errors may be expensive. In a user-facing product, the reverse can be true. The same model can be brilliant in one economic environment and absurd in another. This is why a single leaderboard ranking is operationally lazy.
Cascades save money only when the first model knows when to shut up
LLM cascades are one of those ideas that sound correct before they are measured. Start with a cheaper model. If it is confident, use its answer. If it is uncertain, escalate to a stronger model. Elegant. Efficient. Very easy to put in a systems diagram.
The paper tests this idea by comparing cascades against a standalone large model. The large model, $M_{\text{big}}$, is Qwen3 235B-A22B. The small model varies across GPT-4.1, Llama3.3 70B, and Llama3.1 405B. The experiments focus on level-5 MATH questions. The cascade threshold is tuned on 250 training examples and evaluated on 250 held-out examples. This makes the cascade experiment a direct test of operational routing, not just a benchmark comparison.
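The tune-then-hold-out procedure can be sketched as follows. This is an illustration on synthetic data, not the paper's code: the `Query` fields, the cost constants, the perfectly calibrated confidence signal, and the 3% large-model error rate are all assumptions, and latency is ignored to keep the example short.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    confidence: float  # small model's self-assessed chance of being right
    small_wrong: bool  # would the small model's answer be wrong?
    big_wrong: bool    # would the large model's answer be wrong?


SMALL_COST, BIG_COST = 0.001, 0.05  # hypothetical dollars per query


def cascade_loss(q, threshold, lam_e):
    """Loss for one query: accept the small answer or defer to the large model."""
    if q.confidence >= threshold:
        return SMALL_COST + lam_e * q.small_wrong
    return SMALL_COST + BIG_COST + lam_e * q.big_wrong  # deferring pays for both


def mean_loss(queries, threshold, lam_e):
    return sum(cascade_loss(q, threshold, lam_e) for q in queries) / len(queries)


def tune_threshold(train, lam_e, grid):
    """Pick the grid threshold with the lowest average loss on the training split."""
    return min(grid, key=lambda t: mean_loss(train, t, lam_e))


def synthetic_queries(n, seed):
    """Synthetic queries where the small model's confidence is well calibrated."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        conf = rng.random()
        queries.append(Query(conf, rng.random() > conf, rng.random() < 0.03))
    return queries


data = synthetic_queries(500, seed=0)
train, held_out = data[:250], data[250:]
grid = [i / 20 for i in range(21)]  # 0.0 always accepts, 1.0 always defers

for lam_e in [0.01, 0.1, 1.0, 10.0]:
    t = tune_threshold(train, lam_e, grid)
    print(f"error price ${lam_e}: threshold {t:.2f}, "
          f"held-out loss ${mean_loss(held_out, t, lam_e):.4f}, "
          f"always-defer loss ${mean_loss(held_out, 1.0, lam_e):.4f}")
```

Because the grid includes a threshold of 1.0, the tuned cascade can collapse into "always use the large model" on the training split. Comparing the held-out loss against the always-defer baseline shows whether routing is earning its keep.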
The headline is awkward for cascade enthusiasts. When latency is ignored, sending queries directly to Qwen3 235B-A22B generally beats the cascade once the price of error exceeds $0.10. As latency becomes more expensive, the crossover moves: at $0.50 per minute, the critical error price rises to $10; at $10 per minute, it rises to $1,000. The practical interpretation is clear. Cascades become attractive when latency and cost are meaningful, but they lose ground when mistakes are expensive and the larger model is much more accurate.
Then comes the interesting exception. The cascade using Llama3.1 405B as the first model outperforms the other cascades and, according to the paper, beats Qwen3 235B-A22B across a wide range of economic scenarios, including error prices up to $10,000 and latency prices up to $10 per minute. This is surprising because Llama3.1 405B is worse as a standalone model than Llama3.3 70B in the baseline error-rate chart.
So the first model in a cascade does not simply need to be accurate. It needs to be selectively reliable. It must answer when it is likely correct and defer when it is likely wrong.
The paper formalises this through a covariance-based cascade error expression. In simplified terms, cascade performance depends on whether the deferral decision is correlated with the small model’s actual errors. The authors define cascade error reduction as:
Here, $\mathbf{1}_D$ indicates that the small model deferred, and $\mathbf{1}^{M_{\text{small}}}_{\text{error}}$ indicates that the small model’s own answer would have been wrong. A high value means the small model is deferring precisely on the queries where it is likely to fail. That is the property a cascade needs.
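Why a covariance between these two indicators matters can be seen from a standard probability identity. The lines below are not a quotation of the paper's expression, and the paper's exact definition and normalisation should be checked against the source. Writing $D$ for the deferral event and $E_s$ for the event that the small model's answer is wrong (shorthand introduced here):

$$\mathrm{Cov}\left(\mathbf{1}_D, \mathbf{1}^{M_{\text{small}}}_{\text{error}}\right) = P(D \cap E_s) - P(D)\,P(E_s)$$

$$P(\text{not } D \cap E_s) = \bigl(1 - P(D)\bigr)\,P(E_s) - \mathrm{Cov}\left(\mathbf{1}_D, \mathbf{1}^{M_{\text{small}}}_{\text{error}}\right)$$

The second line is the share of queries where the small model answers and is wrong. At a fixed deferral rate and a fixed small-model error rate, every unit of covariance removes a unit of error that would otherwise slip through.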
This explains why the Llama3.1 405B cascade performs well despite its weaker standalone accuracy. Its self-verification signal is more useful. It knows when to leave the room. A rare corporate virtue.
For operators, the design implication is direct: do not validate a cascade only by measuring the small model’s average accuracy. Measure the quality of its uncertainty signal. If the confidence score, self-verification probability, or router signal is poorly aligned with actual failure, the cascade is just a slow way to launder errors through a systems architecture.
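One way to run that check on logged data is sketched below. It assumes each evaluated query records whether the router deferred and whether the small model's own answer was wrong. The function names and the toy data are hypothetical, and the covariance is the plain population form rather than the paper's exact metric.

```python
def mean(values):
    return sum(values) / len(values)


def deferral_error_covariance(deferred, small_wrong):
    """Covariance between the deferral indicator and the small-model error indicator."""
    if not deferred or len(deferred) != len(small_wrong):
        raise ValueError("inputs must be non-empty and of equal length")
    p_both = mean([bool(d) and bool(e) for d, e in zip(deferred, small_wrong)])
    return p_both - mean(deferred) * mean(small_wrong)


def accepted_error_rate(deferred, small_wrong):
    """Share of all queries where the small model answered and was wrong."""
    return mean([(not d) and bool(e) for d, e in zip(deferred, small_wrong)])


# Toy data: the small model is wrong on the first four of ten queries.
small_wrong = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
aligned = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]      # defers exactly where it fails
misaligned = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]   # defers where it was already right

for name, deferred in [("aligned", aligned), ("misaligned", misaligned)]:
    cov = deferral_error_covariance(deferred, small_wrong)
    leaked = accepted_error_rate(deferred, small_wrong)
    print(f"{name}: covariance {cov:+.2f}, errors passed through {leaked:.0%}")
```

Both routers defer 40% of the time on a small model with the same 40% error rate. Only the sign of the covariance separates a cascade that filters errors from one that pays for the large model and keeps the mistakes anyway.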
What the paper directly shows
It is useful to separate the evidence from the business extrapolation.
The paper directly shows three things in its tested setting.
First, an economic reward framework can rank individual LLMs and LLM systems when their trade-offs involve cost, latency, error, abstention, or other priced objectives. This is the conceptual contribution. It is not tied to MATH specifically.
Second, in filtered numeric MATH problems, reasoning models produce much lower error rates than the tested non-reasoning models, while costing more and taking longer. When the paper prices only accuracy and inference cost, reasoning models become favourable at low error prices. When latency is priced, the crossover point shifts upward.
Third, in the tested cascade setting, standalone Qwen3 235B-A22B often beats small-to-large cascades when errors have even modest economic cost. But cascade performance can become strong when the first model has a high-quality self-verification signal, as measured by cascade error reduction.
That is the evidence. It is already enough.
What Cognaptus infers for business use
The broader business inference is that AI deployment cost should be evaluated as expected operational loss, not API spend.
A model bill is visible. A mistaken answer is often hidden inside rework, review, churn, delay, compliance risk, or downstream bad decisions. This visibility gap biases organisations toward cheap models because token costs arrive as invoices while error costs arrive as “team friction”, “customer dissatisfaction”, or “why is legal still reviewing this?”
The paper gives a disciplined way to fight that bias. For a production workflow, a practical evaluation should include at least five steps:
| Step | Question | Output |
|---|---|---|
| Define the unit of work | What counts as one query or task? | comparable per-query economics |
| Price mistakes | What happens when the output is wrong and not caught? | $\lambda_E$ range |
| Price delay | What is the cost of an extra second or minute? | $\lambda_L$ range |
| Price abstention or escalation | What does human handoff or non-response cost? | $\lambda_A$ range |
| Run sensitivity analysis | When does the optimal model switch? | robust deployment choice |
This is especially relevant for organisations deploying LLMs into knowledge work rather than toy demos. If the model is summarising low-stakes internal notes, error prices may be small. If it is drafting legal clauses, triaging medical notes, recommending financial actions, or driving operational decisions, the price of error rises quickly. At that point, the obsession with shaving fractions of a cent from inference becomes charmingly irrelevant.
The paper also changes how teams should think about cascades. Cascades are not automatically economical. They are economical only when the routing or deferral mechanism is economically aligned. A cheap model that is confidently wrong is not a cost-saving layer. It is a liability multiplier with a friendly architecture diagram.
The tests in the paper serve different roles
Not every result in the paper carries the same evidentiary weight. The main experiments support the business argument; the appendices and theoretical results clarify mechanisms and boundaries.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Reward framework | Main contribution | Multi-objective LLM evaluation can be reduced to expected dollar reward | That every organisation can estimate prices perfectly |
| MATH reasoning vs non-reasoning experiment | Main evidence | Error cost can dominate inference cost for difficult reasoning tasks | That the same thresholds apply to all enterprise workflows |
| Latency grid | Sensitivity test | Model choice changes when latency has economic value | That the tested latency range covers high-scale web systems |
| Cascade comparison | Main evidence for systems | Cascades can lose to a strong standalone model when errors matter | That cascades are generally bad |
| Cascade error reduction | Mechanism / ablation-style explanation | Self-verification quality explains why one cascade works better | That self-verification will be reliable in all domains |
| Output-token appendix | Robustness / implementation context | Reasoning models have higher token baselines, while both categories scale with difficulty | That token count alone explains performance |
| Pricing table | Implementation detail | API costs reflect June 2025 provider prices | That future model prices preserve the same rankings |
| Limitation section | Boundary condition | MATH contamination and evaluator choice may affect interpretation | That the framework itself depends on MATH |
This matters because the wrong reading of the paper is easy: “Use the biggest model.” The better reading is: “Use the model whose expected operational cost is lowest after pricing the failure modes.” Sometimes that will be the biggest model. Sometimes it will be a faster non-reasoning model. Sometimes it will be a cascade. The framework does not worship scale. It taxes mistakes.
The boundaries are real, not decorative
The paper’s limitations are not boilerplate. They affect how aggressively one should use the empirical thresholds.
The experiments use the MATH benchmark, and specifically numeric-answer questions. That makes evaluation cleaner, but it narrows the domain. Enterprise work often involves ambiguous outputs, partial correctness, subjective quality, policy compliance, and multi-step workflows where errors compound. A wrong legal clause is not the same object as a wrong numeric answer.
The paper uses the training split of MATH to obtain enough difficult examples. The authors acknowledge that this raises the possibility of data contamination, because evaluated models may have seen similar questions during training. If contamination differs across models, it can distort measured accuracy.
Correctness evaluation uses Llama3.1 405B with a prompt containing the ground-truth reference answer. The authors note that this could artificially inflate that model’s self-verification accuracy, although they argue the effect should be limited because correctness evaluation has access to the ground truth while self-verification only sees the model’s own proposed answer. Still, for business deployment, evaluator independence matters. If the same model family helps produce, verify, and judge outputs, comforting metrics can become circular. The machine grades its own homework. Historically, this has not been humanity’s strongest governance pattern.
The API prices are from June 2025. Model pricing changes. Hosted latency changes. Provider routing changes. New models arrive with better speed-cost-accuracy profiles. That does not weaken the framework, but it does mean the empirical winner list should be treated as perishable.
Finally, the paper’s latency regime is framed around automating human tasks. It does not cover ultra-low-latency product surfaces, high-frequency applications, or batch processes where millions of records make small per-query differences enormous. In those settings, $\lambda_L$ and volume effects need separate treatment.
These boundaries do not make the paper less useful. They keep it from becoming a procurement horoscope.
A better model-selection meeting
The practical output of this paper should be a different kind of model-selection meeting.
Not: “Which model is cheapest while still good enough?”
Not: “Which model is on the Pareto frontier?”
Not: “Can we use a cascade because it sounds efficient?”
The better questions are:
- What is the expected cost of an undetected wrong answer?
- What is the expected cost of an extra second or minute of latency?
- What is the cost of abstention, escalation, or human review?
- How often will this task run?
- Across plausible values of those costs, which model or system has the highest expected reward?
- If using a cascade, is the first model’s uncertainty signal actually correlated with its failures?
This is less glamorous than a leaderboard. It is also more likely to survive contact with a finance team.
The paper’s strongest contribution is not the claim that reasoning models often win. That will change as models and prices change. Its strongest contribution is the conversion layer: translate technical metrics into economic terms, then select the model that maximises expected reward.
Once errors have prices, the cheap model has to defend itself like everyone else.
**Cognaptus: Automate the Present, Incubate the Future.**
[1] Michael J. Zellinger and Matt Thomson, “Economic Evaluation of LLMs,” arXiv:2507.03834, 2025, https://arxiv.org/abs/2507.03834.