TL;DR for operators
Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation?
That sounds like ordinary scoring. It is not.
The paper behind this article defines reasoning-intensive regression: tasks where a model must first reason through text and then output a calibrated continuous score.[1] This is where two familiar AI shortcuts break in opposite directions. A small fine-tuned encoder may optimise the loss by predicting safe middle values. A prompted LLM may understand the case but output chunky, over-rounded numbers. Brilliant. One model refuses to see the slope; the other sees it and then reports it in five decorative buckets.
The authors test this problem across four tasks: mathematical error detection, instruction following, pairwise RAG comparison, and essay grading. They argue that CCC, or Concordance Correlation Coefficient, is often more revealing than NMSE because it tests whether predictions preserve ranking, calibration, and distributional spread. That matters in operations because a scoring system that always hovers near the mean may look numerically polite while being operationally useless.
Their proposed method, MENTAT, is deliberately modest. It evolves prompts through batched error reflection, then generates multiple LLM rollouts and trains a tiny MLP aggregator over those scores. The LLM supplies reasoning; the neural head supplies calibration. This hybrid improves most configurations, but not magically. It needs three rollouts per input at inference in the paper’s setup (roughly 3× the token cost), can still inherit dataset biases, and does not prove universal deployment readiness.
The business reading is simple: if your organisation uses LLMs as judges, graders, reviewers, or evaluators, do not treat “ask a bigger model” as the architecture. Build a scoring system that separates reasoning, calibration, aggregation, monitoring, and cost control. Numbers need narration. They also need a measuring instrument that does not wobble every time the prompt gets philosophical.
The problem starts when the output is a number, not a label
Most enterprise AI evaluation conversations are still too fond of categories.
Pass or fail. Relevant or irrelevant. Safe or unsafe. Good or bad. Escalate or ignore. These are convenient for dashboards and comforting for procurement decks. They are also often too blunt for actual operations.
A customer-support call is not merely compliant or non-compliant. It may be 0.62 compliant, with one missed disclosure and one excellent recovery. A RAG answer is not simply correct or incorrect. It may be more complete than the baseline but slightly less faithful to the retrieved evidence. A student essay, policy memo, legal answer, code review, or audit note may require a score that reflects degrees, not boxes.
This is the territory of regression: predicting a real-valued score. But the paper’s useful move is to show that not all text regression is the same kind of animal.
Some regression tasks are feature-based. Predicting a house price from bedrooms, bathrooms, and square footage does not require a philosophical seminar. Some are semantic. Sentiment scoring, similarity scoring, and basic essay grading require language understanding, but not necessarily deep sequential reasoning. The difficult class is what the authors call reasoning-intensive regression, or RiR: the model must reason through the instance before it can score it.
The difference is not cosmetic. It changes what failure looks like.
In RiR, the system must answer three questions at once:
- Did it understand the instance?
- Did it reason through the relevant structure?
- Did it map that reasoning into a well-calibrated number?
Most systems are decent at one or two. The trouble begins when we ask for all three and pretend this is just another eval prompt.
The failure triangle: reasoning, calibration, and spread
The paper’s central mechanism is a triangle. Each corner matters.
| Requirement | What it means | Typical failure |
|---|---|---|
| Reasoning | The model must inspect the content deeply enough to know what happened. | A shallow encoder misses the actual logical or contextual structure. |
| Calibration | The predicted number must match the scale and meaning of the label. | A prompted LLM gives plausible but poorly calibrated scores. |
| Distributional spread | Predictions must preserve variance, ranking, and extremes where the data requires them. | A model collapses toward the mean and looks deceptively stable. |
The last corner is the one many business teams miss. A scoring model can have a tolerable average error while still being useless for decisions. If the task is to identify which cases deserve escalation, which answers deserve replacement, or which agents need coaching, ranking and spread matter. A model that predicts every case near the middle is not cautious. It is asleep with good posture.
This is why the paper argues for using Concordance Correlation Coefficient alongside NMSE. NMSE measures squared error relative to the variance of the target. That is useful, but it can be fooled by middle-hugging predictions. CCC, by contrast, rewards agreement in both correlation and calibration. It asks whether the prediction distribution behaves like the target distribution, not merely whether the average miss is tolerable.
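To make the contrast concrete, here is a minimal sketch of both metrics in plain Python. It uses the standard definitions: NMSE is mean squared error divided by the label variance, and CCC is Lin's coefficient, 2·cov / (var_true + var_pred + (mean_true − mean_pred)²). The labels and the two sets of predictions are made up for illustration; they are not data from the paper.

```python
from statistics import fmean, pvariance

def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the labels."""
    mse = fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])
    return mse / pvariance(y_true)

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population moments."""
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    cov = fmean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    return 2 * cov / (pvariance(y_true) + pvariance(y_pred) + (mu_t - mu_p) ** 2)

# Made-up labels on a 0-10 scale, for illustration only.
labels = [0.5, 2.0, 3.5, 5.0, 6.5, 8.0, 9.5]

# A collapsed scorer: hugs the label mean (5.0) with uninformative wiggles.
collapsed = [5.1, 4.9, 5.1, 4.9, 5.1, 4.9, 5.1]

# A scorer that spreads out and mostly tracks the ordering, with noise.
spread = [1.5, 1.0, 5.0, 4.0, 8.0, 6.5, 10.0]

for name, preds in [("collapsed", collapsed), ("spread", spread)]:
    print(f"{name}: NMSE={nmse(labels, preds):.2f}  CCC={ccc(labels, preds):.2f}")
```

On this toy data the collapsed scorer lands close to NMSE 1 with CCC close to 0, which is the same shape as the NeoBERT pattern the paper reports: a number that looks like a mediocre error and is in fact no ranking at all.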
The paper’s mathematical-error example makes the point sharply. In Figure 1, a fine-tuned NeoBERT model gets NMSE of about 1.014 and CCC of only 0.008. GPT-5 with detailed prompting has NMSE around 0.809 and CCC of 0.685. MENTAT with GPT-5 improves to NMSE around 0.402 and CCC of 0.792. If one looked only at point error, the gap between a collapsed encoder and a reasoning model would be understated. CCC reveals the real operational difference: one system barely ranks the cases; the other preserves meaningful variation.
That is the first business lesson. Evaluation systems should not be judged only by whether they produce a number close enough on average. They should be judged by whether their numbers remain useful for sorting, thresholding, routing, and escalation.
The benchmark is useful because it is awkward
The authors construct four RiR tasks. They are not meant to exhaust the universe. They are meant to expose different kinds of scoring difficulty.
| Task | Score being predicted | Why it is awkward |
|---|---|---|
| Mathematical error detection | How far a mathematical solution gets before the first wrong step, on a 0–10 scale. | Requires formal stepwise reasoning and calibrated estimation of where the error occurs. |
| Instruction following | A continuous score from 0 to 1 for how well a generated answer satisfies a task. | Requires inferring compliance without being shown the hidden requirement decomposition used to generate the label. |
| Pairwise RAG comparison | A score from −2 to 2 indicating how much one answer beats another. | Requires judging helpfulness, truthfulness, and completeness, while avoiding shallow preference cues. |
| Essay grading | A 1–5 overall essay score. | Serves as a more semantic reference point where small encoders can do better with enough data. |
This mix is important. The paper is not merely asking whether MENTAT wins a leaderboard. It is trying to diagnose where different scoring mechanisms break.
Mathematical error detection stresses deep sequential reasoning. Pairwise RAG comparison tests preference judgment and calibration. Instruction following tests whether a model can infer satisfaction of many constraints from the final answer. Essay grading is deliberately less exotic: it asks whether the same machinery still helps when the task is closer to traditional semantic scoring.
A weaker article would list the four tasks and then recite results. The more useful reading is to see the tasks as diagnostic probes. Each one asks: does your scoring system fail because it cannot reason, because it cannot calibrate, because it cannot preserve variance, or because the task itself contains confounds?
That diagnosis matters more than the brand name of the model.
Prompting can reason, then ruin the measurement
Prompted LLMs have an obvious appeal for RiR. They can read the case, explain themselves, and adapt across tasks without a large labelled dataset. This is exactly why teams use LLM-as-judge setups in the first place. One model, many rubrics, minimal engineering, maximum hope. Very modern. Very dangerous.
The paper’s evidence shows that prompting often brings reasoning but not precision.
The appendix quantization analysis is particularly telling. On the mathematical-error task, GPT-4.1 predictions cluster at .00 or .50 decimal endings 63.1% of the time. GPT-5 does so 86.5% of the time. Ground-truth labels are approximately uniform. That means the model is not merely uncertain; it is discretising a continuous task into a coarse grid.
This is a subtle failure because the reasoning text may look sophisticated. The model may identify the relevant mistake, compare the relevant answers, or discuss the relevant rubric. Then it outputs a number that behaves like a multiple-choice answer wearing a decimal costume.
For business scoring, this matters directly. Suppose a company uses an LLM to score compliance quality from 0 to 1. If most values land on 0.0, 0.5, 0.75, and 1.0, the system may be operationally too coarse. It will create artificial cliffs where the real process needs smooth thresholds. Cases just above and below a review cutoff may be separated by formatting habits rather than evidence.
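A check of this kind is cheap to run on your own judge. The sketch below counts how often scores end in .00 or .50 after rounding to two decimals. The function name and both score lists are hypothetical, and the paper's exact counting procedure may differ.

```python
def round_ending_rate(scores, endings=(0, 50)):
    """Fraction of scores whose two-decimal fractional part is in `endings`.

    `endings` is given in hundredths, so (0, 50) means .00 and .50.
    """
    hits = sum(1 for s in scores if round(abs(s) * 100) % 100 in endings)
    return hits / len(scores)

# Hypothetical scores on a 0-10 scale, invented for illustration.
llm_scores = [7.0, 7.5, 3.0, 5.5, 8.0, 6.5, 4.0, 2.5, 9.0, 6.25]
human_scores = [7.13, 7.58, 2.91, 5.5, 8.34, 6.07, 4.42, 2.66, 9.0, 6.25]

print("LLM round-ending rate:  ", round_ending_rate(llm_scores))
print("Human round-ending rate:", round_ending_rate(human_scores))
```

If the judge's rate sits far above the rate in your labelled data, thresholds placed near those round values deserve extra suspicion.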
The paper also finds centre-seeking behaviour. In pairwise RAG comparison, GPT-5, despite stronger reasoning capability, compresses scores toward the centre of the −2 to 2 range. The authors describe this as overthinking: the model generates cautious, under-dispersed judgments where shorter, more decisive reasoning would preserve the target distribution better.
This is the second business lesson. More reasoning is not always better scoring. Sometimes the model thinks itself into mush.
Fine-tuning can optimise the loss while missing the task
The opposite baseline is a small Transformer encoder, here NeoBERT with about 250 million parameters. This is the sensible, economical option: format the input, fine-tune a regression head, and avoid expensive LLM inference. For ordinary text regression, that can be competitive.
For RiR, the paper shows a more brittle picture.
On mathematical error detection, NeoBERT’s CCC is near zero in both low-data configurations. On pairwise RAG comparison, it also produces very low CCC while appearing less disastrous under NMSE. The mechanism is familiar: when labels are hard and data is scarce, a model can reduce squared error by predicting a narrow band near the centre. It has not learned the task. It has learned how not to embarrass the loss function.
The essay-grading result complicates the story in a good way. NeoBERT improves substantially with more data on essay grading, reaching CCC 0.65 at the 500-sample configuration. The paper also notes that on pairwise RAG, NeoBERT can improve with a much larger regime of 1,500 train/validation examples, reaching NMSE 0.60 and CCC 0.66. So the conclusion is not “small encoders are useless.” That would be convenient and false, a classic two-for-one.
The better conclusion is:
- small encoders can work when the task is closer to semantic scoring or enough labelled data exists;
- they struggle when the score depends on deep reasoning that is not easily inferable from shallow representation learning;
- NMSE alone can hide this struggle by rewarding safe collapse.
For operators, this suggests a practical test. Before deploying a cheap fine-tuned scorer, inspect its prediction distribution. If it is too narrow, if variance is missing, or if CCC is weak while NMSE looks acceptable, the model may be gaming your metric rather than learning your judgment.
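One way to run that inspection, sketched under assumptions: compare the spread of predictions with the spread of labels and flag a scorer whose standard deviation is a small fraction of the target's. The function names and the 0.5 cutoff are hypothetical choices for illustration, not values from the paper.

```python
from statistics import fmean, pstdev

def spread_report(y_true, y_pred):
    """Compare how widely predictions vary against how widely labels vary."""
    label_range = max(y_true) - min(y_true)
    return {
        "std_ratio": pstdev(y_pred) / pstdev(y_true),
        "range_coverage": (max(y_pred) - min(y_pred)) / label_range,
        "mean_shift": fmean(y_pred) - fmean(y_true),
    }

def looks_collapsed(report, min_std_ratio=0.5):
    """Flag a scorer whose spread is well below the labels' spread.

    The 0.5 cutoff is an arbitrary illustration; tune it on your own data.
    """
    return report["std_ratio"] < min_std_ratio

# Made-up validation labels and two candidate scorers.
labels = [0.5, 2.0, 3.5, 5.0, 6.5, 8.0, 9.5]
narrow = [5.1, 4.9, 5.1, 4.9, 5.1, 4.9, 5.1]
healthy = [1.5, 1.0, 5.0, 4.0, 8.0, 6.5, 10.0]

for name, preds in [("narrow", narrow), ("healthy", healthy)]:
    report = spread_report(labels, preds)
    print(name, {k: round(v, 2) for k, v in report.items()}, looks_collapsed(report))
```

A histogram does the same job visually. The ratio is simply easier to put in a CI check.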
MENTAT separates the thinking from the measuring
MENTAT is not an enormous architecture. That is part of the charm. It has two phases.
First, the model evolves its prompt through batched error reflection. It runs on labelled examples, identifies the worst-performing cases, analyses systematic mistakes, and proposes improved instructions. The authors use a small number of iterations (three in the main setup) and select the best prompt on validation performance.
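The loop described above can be sketched as follows. This is a structural illustration only, not the authors' implementation: `score_fn`, `reflect_fn`, and `metric_fn` are hypothetical callables standing in for the LLM scorer, the LLM reflection step, and a higher-is-better validation metric, and `worst_k` is an assumed parameter. The toy stand-ins at the bottom exist purely so the loop runs without an LLM.

```python
def evolve_prompt(prompt, train, val, score_fn, reflect_fn, metric_fn,
                  iterations=3, worst_k=5):
    """Batched error reflection: score, gather the worst misses, revise, validate."""
    def evaluate(p, data):
        preds = [score_fn(p, x) for x, _ in data]
        return metric_fn([y for _, y in data], preds)  # higher is better

    best_prompt, best_val = prompt, evaluate(prompt, val)
    current = prompt
    for _ in range(iterations):
        # Rank training examples by absolute error under the current prompt.
        errors = sorted(((abs(score_fn(current, x) - y), x, y) for x, y in train),
                        key=lambda e: e[0], reverse=True)
        worst = [(x, y) for _, x, y in errors[:worst_k]]
        # Ask the (stubbed) reflector for revised instructions.
        current = reflect_fn(current, worst)
        val_score = evaluate(current, val)
        if val_score > best_val:
            best_prompt, best_val = current, val_score
    return best_prompt

# Toy stand-ins: each "!" appended to the prompt removes one unit of bias.
def toy_score(p, x):
    return x + (3 - p.count("!"))

def toy_reflect(p, worst_cases):
    return p + "!"

def neg_mse(y_true, y_pred):
    return -sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / len(y_true)

data = [(float(i), float(i)) for i in range(10)]
best = evolve_prompt("base", data[:7], data[7:], toy_score, toy_reflect, neg_mse)
print(best)
```

Selecting on validation rather than keeping the last prompt matters: a reflection step can overshoot, and the loop should be allowed to ignore it.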
Second, MENTAT generates multiple stochastic rollouts using the evolved prompt. In the experiments, it uses three rollouts per input. Those rollout scores are sorted, supplemented with simple statistics such as mean, standard deviation, minimum, and maximum, and passed into a small MLP aggregator trained with a combined CCC–NMSE objective.
That separation is the mechanism.
The LLM is asked to do what LLMs are relatively good at: reading, decomposing, comparing, and explaining. The small neural aggregator is asked to do what direct text generation is bad at: turning noisy, quantized, correlated numeric outputs into a calibrated continuous prediction.
This is not glamorous. It is engineering. The sort that usually works better than buying a bigger model and hoping the invoice contains calibration.
A simplified view:
Input text
↓
LLM reasoning under evolved prompt
↓
Three numeric rollouts
↓
Sorted scores + summary statistics
↓
Small MLP aggregator
↓
Final calibrated score
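The "sorted scores + summary statistics" and "small MLP aggregator" steps can be sketched in a few lines. The feature layout follows the description above (sorted rollouts, then mean, standard deviation, minimum, maximum). The tanh activation, the random untrained weights, and the function names are assumptions for illustration; training against the combined CCC–NMSE objective is omitted.

```python
import math
import random
from statistics import fmean, pstdev

def rollout_features(scores):
    """Sorted rollout scores followed by mean, std, min, and max."""
    s = sorted(scores)
    return s + [fmean(s), pstdev(s), s[0], s[-1]]

def mlp_forward(features, w1, b1, w2, b2):
    """One tanh hidden layer followed by a linear output (assumed architecture)."""
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

HIDDEN = 8  # the article reports eight hidden units in the described setup
rng = random.Random(0)

# Three hypothetical rollout scores for one input.
feats = rollout_features([7.5, 6.0, 7.0])

# Untrained placeholder weights; a real aggregator would learn these.
w1 = [[rng.uniform(-0.1, 0.1) for _ in feats] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]

score = mlp_forward(feats, w1, b1, w2, b2=fmean(feats[:3]))
print(len(feats), round(score, 3))
```

Sorting the rollouts makes the features order-independent, so the aggregator sees "lowest, middle, highest" rather than an arbitrary sampling order.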
The important point is that MENTAT is not just “prompt engineering plus averaging.” The ablations show why.
Prompt evolution alone helps in some places but is not enough. Simple averaging improves some outputs but cannot fully learn how to de-quantize or reweight rollout patterns. The learned aggregator gives the system a small calibration layer, which is often where the business value lives.
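A toy example shows one part of the gap between averaging and a learned layer. Below, the rollout means are deliberately compressed toward the centre of a −2 to 2 scale. A least-squares affine fit, used here as the simplest possible stand-in for a learned aggregator (it is not the paper's MLP), stretches them back out. All numbers are invented, and the fit is evaluated on the same data it was trained on, so treat the result as a mechanism demo rather than evidence.

```python
from statistics import fmean, pvariance

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population moments."""
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    cov = fmean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    return 2 * cov / (pvariance(y_true) + pvariance(y_pred) + (mu_t - mu_p) ** 2)

def fit_affine(x, y):
    """Least-squares slope and intercept for y ~ a * x + b."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    a = cov / pvariance(x)
    return a, my - a * mx

# Invented labels and compressed, slightly shifted rollout averages.
labels = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
rollout_means = [-0.55, -0.45, -0.2, 0.05, 0.15, 0.4, 0.65, 0.75, 1.0]

a, b = fit_affine(rollout_means, labels)
calibrated = [a * m + b for m in rollout_means]

print("CCC, plain average:   ", round(ccc(labels, rollout_means), 3))
print("CCC, affine-calibrated:", round(ccc(labels, calibrated), 3))
```

Averaging keeps the ordering but inherits the compression. Any trained layer, even a straight line, can repair the scale; a small MLP over sorted rollouts has room to do more than that.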
What the results actually support
The headline is that MENTAT improves most configurations, but the more useful story is where and why.
| Evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main comparisons across math errors, pairwise RAG, and essay grading with GPT-4.1/GPT-5 | Main evidence | Hybrid prompt evolution plus aggregation often improves over prompting and small-encoder baselines. | Universal superiority across all scoring tasks or all models. |
| Instruction-following results with gpt-oss-20b | Generalisation check | The method is not limited to proprietary models, at least for one open-source setting. | On-premise readiness for smaller regulated deployments. |
| MENTAT Prompt and MENTAT-Avg ablations | Ablation | Both prompt refinement and learned aggregation matter; averaging is not the whole method. | That this exact MLP design is optimal. |
| GEPA comparison | Comparison with prompt optimisation | Stronger prompt search can outperform MENTAT on some metrics, especially instruction following CCC. | That prompt optimisation alone solves calibration. |
| RM-R1-Qwen-14B on pairwise RAG | Comparison with reward-model prior work | Binary preference reward models do not automatically become continuous scorers. | That all reward models fail at RiR. |
| Quantization and length-bias analyses | Failure-mode analysis | LLM scores can be coarse; pairwise RAG judgments can amplify verbosity bias. | That MENTAT fully removes these biases. |
On mathematical error detection, MENTAT with GPT-5 is the clearest success. Detailed GPT-5 prompting already performs strongly, with CCC 0.69 and NMSE 0.78. MENTAT improves CCC to 0.72 in the 100-sample setup and 0.78 in the 500-sample setup, while reducing NMSE to 0.52 and 0.42 respectively. This is the cleanest example of the paper’s thesis: reasoning was present, but calibration improved once the system stopped relying on a single textual number.
On instruction following, the result is more mixed and therefore more interesting. NeoBERT achieves CCC 0.36 and NMSE 1.08. Basic and detailed gpt-oss-20b prompting sit around CCC 0.32–0.33. RL fine-tuning improves CCC to 0.37 but worsens NMSE to 1.51, suggesting better relative discrimination but worse calibration. MENTAT reaches CCC 0.42–0.43 and NMSE 0.90–0.95. GEPA beats MENTAT on CCC at 0.46, but has worse NMSE at 1.06. This is not a coronation. It is a trade-off map.
On pairwise RAG comparison, the paper gives the most useful warning. GPT-4.1 outperforms GPT-5 on CCC. Detailed GPT-4.1 prompting reaches CCC 0.47, while detailed GPT-5 prompting reaches only 0.31. MENTAT with GPT-4.1 reduces NMSE substantially and nudges CCC to 0.50–0.52. GPT-5 remains under-dispersed. The authors’ interpretation is that GPT-5 over-deliberates: it clusters near the centre and produces repeated, coarse scores across rollouts. In business terms, the “more intelligent” judge may be less decisive where the task requires calibrated preference strength.
On essay grading, the story returns to a more familiar pattern. NeoBERT improves with more data. GPT-4.1 is strong. MENTAT improves GPT-4.1 detailed prompting from CCC 0.65 to 0.70 in the 100-sample setup and reduces NMSE from 0.73 to 0.54. GPT-5 is weaker than GPT-4.1 here as well, again supporting the idea that heavy reasoning can be unnecessary or even harmful for simpler semantic scoring.
The results support a sober claim: MENTAT is a promising lightweight pattern for RiR. They do not support the claim that every scoring pipeline should immediately use three rollouts and an MLP. That would be replacing blind faith in prompts with blind faith in a small neural net, which is at least cheaper but still blind.
The reward-model result is a useful slap on the wrist
One of the paper’s more operationally relevant comparisons is RM-R1-Qwen-14B on pairwise RAG comparison. Since reward models are trained for preference judgment, one might expect them to adapt naturally to continuous pairwise scoring.
They do not, at least not here.
The paper maps the model’s preference-token log-probabilities into the −2 to 2 regression scale. RM-R1-Qwen-14B achieves NMSE 5.66 and CCC 0.15, worse than basic GPT-5 prompting. Its binary classification accuracy is only 55.7%, marginally above chance.
The authors attribute this to objective mismatch. A reward model trained to choose a winner is not necessarily trained to express magnitude. By the time it produces a preference token, its probability distribution is often highly concentrated. That gives little signal about whether answer A is slightly better, substantially better, or merely longer with better manners.
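The saturation effect is easy to see with a toy mapping. The sketch below assumes a linear map from the probability that A is preferred onto the −2 to 2 scale; the paper's exact mapping is not described in this article, so the function and the log-probabilities are illustrative assumptions.

```python
import math

def preference_to_score(logp_a, logp_b, lo=-2.0, hi=2.0):
    """Map two preference-token log-probs to a score in [lo, hi].

    Assumed mapping: softmax over the two options, then a linear rescale.
    """
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    p_a = ea / (ea + eb)
    return lo + (hi - lo) * p_a

# Hypothetical log-probs for the tokens "A" and "B".
confident_but_slight = preference_to_score(-0.02, -4.0)  # A is only a bit better
confident_and_large = preference_to_score(-0.001, -7.0)  # A is far better
undecided = preference_to_score(-0.69, -0.69)

print(round(confident_but_slight, 3), round(confident_and_large, 3), undecided)
```

Both confident cases land near the +2 ceiling even though the underlying margins differ, which is the objective mismatch in miniature: the token distribution encodes certainty about the winner, not the size of the win.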
This matters for companies building RAG evaluation systems. A preference model may help choose between two responses. It may not provide a reliable continuous score for routing, thresholding, or trend monitoring. The distance between “A wins” and “A wins by 1.3 on a calibrated scale” is not a rounding error. It is a product requirement.
The business pattern: build a scoring instrument, not a judging prompt
The business relevance is not that every company should reproduce MENTAT exactly. The relevance is that many AI evaluation workflows are already RiR workflows without naming them.
Examples include:
- support-call quality scoring;
- RAG answer grading;
- compliance review;
- legal or policy memo assessment;
- essay or training-response grading;
- code review scoring;
- AI-agent trajectory evaluation;
- internal risk-assessment triage.
In each case, the organisation wants a score that supports action. A score is not a decoration. It drives thresholds, escalations, coaching, model selection, monitoring, and sometimes compensation. So the scoring mechanism needs to be tested like an instrument.
A practical RiR deployment pattern would look like this:
| Layer | Operational question | Design implication |
|---|---|---|
| Rubric | What exactly does the number mean? | Define the scale with examples, not just adjectives. |
| Reasoning | Can the system inspect the case properly? | Use prompts or models capable of decomposing the input. |
| Calibration | Do scores match the label scale? | Track NMSE, bias, and calibration drift. |
| Concordance | Does the system preserve ranking and spread? | Track CCC or similar agreement metrics, not point error alone. |
| Aggregation | Are multiple judgments being combined intelligently? | Use learned or validated aggregation, not blind averaging by default. |
| Bias checks | Are scores responding to irrelevant cues? | Test length bias, position bias, source bias, and demographic leakage where relevant. |
| Cost | Does the added precision justify inference cost? | Compare 1-rollout, 3-rollout, and adaptive-rollout policies. |
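The length-bias check in the table can start as a simple correlation audit: how strongly do scores track the length difference between two answers, in the labels and in the model? The sketch below uses a hand-written Pearson correlation and invented numbers; it is a starting probe, not the paper's analysis.

```python
import math
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented pairwise data: token-length difference (answer A minus answer B),
# human preference scores, and model preference scores on a -2 to 2 scale.
length_diff = [-300, -150, -50, 0, 50, 150, 300]
human_scores = [-1.0, 0.5, -0.5, 0.0, 1.0, -0.5, 1.0]
model_scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

human_r = pearson(length_diff, human_scores)
model_r = pearson(length_diff, model_scores)

print(f"length vs human: r={human_r:.2f}")
print(f"length vs model: r={model_r:.2f}")
```

If the model's correlation with length is clearly higher than the annotators', the scorer is probably amplifying verbosity rather than measuring quality. A correlation alone does not prove that, so follow up with length-matched pairs.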
Cognaptus inference: the highest ROI use case is not casual grading. It is evaluation infrastructure where better continuous scoring improves downstream decisions. If a calibrated score helps catch bad RAG answers, route uncertain cases, reduce manual review load, or tune agent behaviour, the extra complexity may be justified.
But the deployment threshold should be explicit. If a binary classifier is enough, use one. If ordinal buckets are enough, use them. RiR is valuable where the difference between 0.42 and 0.68 changes the decision.
The hidden cost is not only tokens
MENTAT requires three rollouts at inference in the paper’s setup. The authors note that these can be parallelised, so wall-clock latency may remain close to a single rollout in well-provisioned systems. Token cost still rises roughly 3×.
That is the visible cost.
The less visible cost is evaluation maintenance. Once a business uses continuous scoring, it must monitor whether the score distribution drifts, whether reviewers still agree with the rubric, whether models begin overusing safe middle values, and whether new input types break calibration. The system becomes an eval product, not a prompt snippet.
That is not a reason to avoid it. It is a reason to budget for it.
The paper’s own cost comparison is useful here. MENTAT uses a fixed three-iteration prompt-evolution design. In the 500-sample setup, Phase 1 involves roughly 2,003 LLM calls across four sequential stages, while Phase 2 adds 1,000–1,500 calls depending on reuse. The MLP itself is tiny, with only eight hidden units in the described setup. Compared with broader evolutionary prompt search, MENTAT is designed to reduce sequential optimisation rounds.
In short: the extra cost is not the neural aggregator. The extra cost is asking the LLM to reason multiple times and then keeping the scoring system honest.
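The arithmetic in this section fits in a few lines. The call counts come from the figures quoted above; the tokens-per-rollout value and the unit latency are hypothetical placeholders, and the parallel case assumes latency equals one rollout.

```python
def inference_cost(tokens_per_rollout, rollouts=3, parallel=True,
                   latency_per_rollout=1.0):
    """Return (total tokens, wall-clock latency in rollout units)."""
    tokens = tokens_per_rollout * rollouts
    # Parallel rollouts overlap in time; sequential rollouts add up.
    latency = latency_per_rollout if parallel else latency_per_rollout * rollouts
    return tokens, latency

# Call budget for the 500-sample setup, using the figures quoted in the text.
phase1_calls = 2003
phase2_calls = (1000, 1500)  # range depends on reuse
total_calls = tuple(phase1_calls + c for c in phase2_calls)

# 800 tokens per rollout is a made-up figure for illustration.
print("Setup calls (low, high):", total_calls)
print("Parallel inference:  ", inference_cost(800))
print("Sequential inference:", inference_cost(800, parallel=False))
```

The one-off setup budget lands at roughly 3,000 to 3,500 calls. The recurring bill is the 3× token multiplier, which is the number to take to whoever owns the inference budget.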
Where the paper is careful, operators should be even more careful
The paper’s limitations are not boilerplate. They affect practical interpretation.
First, the benchmark has four tasks. They are well-chosen but not exhaustive. The authors themselves point to clinical scoring, financial risk assessment, and code review as future domains. Those are precisely the places where business readers may be tempted to overgeneralise. Please resist. A good result on math-error localization and RAG comparison is not a validation study for loan underwriting or medical triage.
Second, the open-source evidence is limited. Instruction following uses gpt-oss-20b, but the paper notes that validation on smaller open-source models would matter for on-premise deployment. That is important in regulated industries where proprietary API access may be restricted.
Third, MENTAT does not erase bias. In pairwise RAG comparison, the appendix shows that human annotations already correlate with response length. Prompted models can amplify that verbosity bias. MENTAT mitigates it but does not eliminate it. A learned scorer trained on biased labels can become a slightly more elegant bias delivery mechanism. Progress, of a sort.
Fourth, the method’s inference cost is real. A 3× rollout multiplier may be trivial for offline evaluation and unacceptable for real-time routing. Adaptive rollout strategies (one rollout for easy cases, more for ambiguous ones) are a natural next step, but not established by this paper.
Fifth, stronger reasoning models do not dominate. GPT-5 does very well on mathematical error detection but underperforms GPT-4.1 on pairwise RAG and essay grading. This should permanently damage the lazy sentence “we will just use the strongest model as judge.” One can hope.
What to do Monday morning
For teams building AI scoring systems, the paper suggests five concrete practices.
First, decide whether the task is actually RiR. If the score depends on deep inspection of the instance, treat it as RiR. If it is mostly semantic or feature-based, a cheaper model may be enough.
Second, evaluate with both point error and concordance. NMSE answers one question. CCC answers another. Production scoring needs both.
Third, inspect prediction distributions. Look for mean collapse, under-dispersion, overuse of round numbers, and missing extremes. Histograms are not glamorous, but neither is a broken escalation queue.
Fourth, test model size against task complexity. A reasoning model may help on hard reasoning tasks and hurt on simpler judgment tasks. Benchmark it. Do not worship it.
Fifth, separate reasoning from calibration. Whether you use MENTAT, a variant, or another architecture, avoid making a single prompted number carry the entire burden of understanding, scoring, calibration, and distribution matching.
The conclusion: scoring is an architecture
The neatest phrase in this paper is not a model name. It is the problem class: reasoning-intensive regression.
Once a team names the problem, the architecture changes. The question stops being, “Which LLM should judge this?” It becomes, “How do we build a scoring instrument that reasons through the case, preserves the distribution, calibrates the number, and stays affordable?”
MENTAT is one answer. It is light, clever, and empirically useful across most tested configurations. It is also not the final word. The stronger contribution is the diagnostic frame: LLM scoring fails not because models cannot talk, but because talking is not measuring.
That is the operator’s takeaway. When AI systems turn messy text into operational numbers, narration is necessary but insufficient. The score still has to behave like a score.
[1] Diane Tchuindjo and Omar Khattab, “Reasoning-Intensive Regression,” arXiv:2508.21762v4, 2026. The paper is listed for CAIS ’26 and reports an ACM DOI: 10.1145/3786335.3813139.