Thesis: When the job is to read text, reason carefully, and return a precise number (not just a label), ordinary regression heads and vanilla prompting often fail in opposite ways. The paper introduces MENTAT, a lightweight recipe that marries batch‑reflective prompt evolution with a small MLP aggregator over multiple LLM rollouts. The result: tighter calibration and better ranking on tasks where each example demands real reasoning, not surface features.

What counts as “Reasoning‑Intensive Regression” (RiR)?

RiR tasks look like this: the model must (1) think through the input with step‑wise analysis, and then (2) score it on a real‑valued scale. The paper frames three such tasks:

  • Detecting Mathematical Errors: predict how far a solution progressed (0–10) before the first wrong step.
  • Pairwise RAG Comparison: score how much answer A beats answer B (−2 to +2) on helpfulness/truthfulness/completeness.
  • Essay Grading: 1–5 holistic score for student essays.

Why this matters to business users: these are isomorphic to rubric‑based LLM QA, call/agent scoring, content quality grading, A/B answer arbitration, and continuous reward shaping in evaluation pipelines.

Why the usual tools break

  • Fine‑tuned encoders (e.g., BERT‑style) trained on tiny datasets often collapse to narrow predictions that look good on NMSE but fail at ranking (low CCC). Think “safe middle.”
  • Prompted LLMs reason well but quantize numbers (many predictions end in .0 or .5), hedging toward coarse grids and center‑seeking behaviors—great narratives, shaky calibration.

Translation: one approach “games the loss,” the other “rounds reality.” Neither gives the spread we need to reflect ground‑truth variance.

The MENTAT recipe (simple, but sneaky‑effective)

Phase 1 — Batch‑Reflective Prompt Evolution

  1. Start with a plain prompt.
  2. Run it over a batch of training examples; surface the worst cases.
  3. Ask the same LLM to analyze its own failure patterns across many items at once and rewrite the instructions.
  4. Keep the best prompt on a dev split; iterate a few times (the paper used ~3 iterations).
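
A minimal sketch of this loop in Python, assuming a hypothetical `call_llm(prompt) -> str` helper for your model provider; the prompts, worst‑case batch size, and dev metric below are illustrative choices, not the paper's exact implementation.

```python
# Minimal sketch of batch-reflective prompt evolution (Phase 1).
# `call_llm` is a hypothetical callable that sends a prompt string to your
# LLM provider and returns the completion text; plug in your own client.

def score_item(instructions: str, item: dict, call_llm) -> float:
    """Apply the current instructions to one example and parse a number back."""
    reply = call_llm(f"{instructions}\n\nInput:\n{item['text']}\n\nScore:")
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 1e9  # treat unparsable output as a large error downstream

def dev_error(instructions: str, dev: list, call_llm) -> float:
    """Mean absolute error on the dev split (any regression metric works here)."""
    return sum(abs(score_item(instructions, x, call_llm) - x["label"]) for x in dev) / len(dev)

def evolve_prompt(seed_prompt: str, train: list, dev: list, call_llm,
                  iters: int = 3, worst_n: int = 8) -> str:
    best_prompt, best_err = seed_prompt, dev_error(seed_prompt, dev, call_llm)
    prompt = seed_prompt
    for _ in range(iters):
        # 1) Score a training batch and surface the worst cases.
        errs = [(abs(score_item(prompt, x, call_llm) - x["label"]), x) for x in train]
        worst = [x for _, x in sorted(errs, key=lambda e: -e[0])[:worst_n]]
        # 2) Ask the same LLM to diagnose failure patterns across the whole
        #    batch at once and rewrite the instructions.
        failures = "\n".join(f"- input: {x['text'][:200]} | gold: {x['label']}" for x in worst)
        prompt = call_llm(
            "Below are grading instructions and examples they scored badly.\n"
            f"Instructions:\n{prompt}\n\nFailures:\n{failures}\n\n"
            "Describe the recurring mistakes, then output improved instructions only."
        )
        # 3) Keep whichever prompt does best on the held-out dev split.
        err = dev_error(prompt, dev, call_llm)
        if err < best_err:
            best_prompt, best_err = prompt, err
        prompt = best_prompt  # always evolve from the best prompt so far
    return best_prompt
```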

Phase 2 — Multi‑Rollout + Tiny Aggregator

  1. With the improved prompt, sample K rollouts per input (K≈3).
  2. Feed their sorted scores + simple stats (mean, stdev, min, max) into a small MLP.
  3. Train the MLP with a joint CCC+NMSE loss to learn a calibrated mapping from noisy rollouts to one number.
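
For the aggregator, a compact PyTorch sketch: the features are the sorted rollout scores plus summary stats, and training uses a joint CCC+NMSE objective. The layer sizes, 0.5/0.5 loss weighting, and training schedule are assumptions for illustration, not the paper's reported hyperparameters.

```python
import torch
import torch.nn as nn

def rollout_features(scores: torch.Tensor) -> torch.Tensor:
    """Turn (N, K) rollout scores into features: sorted scores + mean/std/min/max."""
    sorted_scores, _ = torch.sort(scores, dim=1)
    mean = scores.mean(dim=1, keepdim=True)
    std = scores.std(dim=1, keepdim=True)
    mn = scores.min(dim=1, keepdim=True).values
    mx = scores.max(dim=1, keepdim=True).values
    return torch.cat([sorted_scores, mean, std, mn, mx], dim=1)

class Aggregator(nn.Module):
    """Tiny 2-layer MLP mapping rollout features to one calibrated score."""
    def __init__(self, k: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(k + 4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        return self.net(rollout_features(scores)).squeeze(-1)

def ccc(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Lin's concordance correlation coefficient (rewards agreement *and* spread)."""
    cov = ((pred - pred.mean()) * (target - target.mean())).mean()
    return 2 * cov / (pred.var(unbiased=False) + target.var(unbiased=False)
                      + (pred.mean() - target.mean()) ** 2 + eps)

def joint_loss(pred: torch.Tensor, target: torch.Tensor, w: float = 0.5) -> torch.Tensor:
    """Joint objective: normalized MSE plus (1 - CCC)."""
    nmse = ((pred - target) ** 2).mean() / (target.var(unbiased=False) + 1e-8)
    return w * nmse + (1 - w) * (1 - ccc(pred, target))

def train_aggregator(rollouts: torch.Tensor, labels: torch.Tensor,
                     epochs: int = 200, lr: float = 1e-2) -> Aggregator:
    """`rollouts` is an (N, K) tensor of LLM scores; `labels` is the (N,) gold vector."""
    model = Aggregator(k=rollouts.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        joint_loss(model(rollouts), labels).backward()
        opt.step()
    return model
```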

Think of it as self‑aware instruction tuning + learned ensembling. The LLM brings the reasoning; the MLP tunes the calibration.

What moved the needle

  • On Math Error detection, a naive fine‑tune produced near‑zero CCC (collapsed predictions). A detailed prompt helped, but MENTAT + a reasoning LLM materially improved both CCC and NMSE.
  • On Pairwise RAG, a surprise: a smaller non‑reasoning LLM was more decisive (better CCC) than a heavier reasoning LLM, which tended to overthink and center its scores. MENTAT mitigated—but did not entirely remove—this under‑dispersion.
  • On Essay Grading, encoders improved with more data, but MENTAT still delivered lower NMSE and higher CCC in low‑data settings.

Practitioner’s cheat‑sheet

| Symptom in your pipeline | Likely cause | What MENTAT piece helps |
|---|---|---|
| Predictions cling to the mean; ranking is weak (low CCC) | Small encoder fine‑tune “games” NMSE | Phase 1: batched error reflection adds reasoning scaffolds; Phase 2 adds distributional spread |
| Numbers look chunky (.0/.5), little use of extremes | LLM hedging/quantization | Phase 2: aggregator learns de‑hedging from multi‑rollout patterns |
| Bigger “reasoning” model underperforms a smaller one on simple judgments | Over‑deliberation → center‑seeking | Keep prompts short; cap CoT; rely more on the aggregator |

A deployment‑oriented blueprint (Cognaptus edition)

  1. Data budget: Aim for 100–500 labeled items per task variant (that’s realistic in ops).

  2. Prompt evolution loop (3 passes):

    • Score all items; collect bottom‑N cases per pass.
    • Ask the model to produce explicit rules it kept breaking (e.g., “don’t award completeness if source cites are missing”).
    • Bake these rules into the system prompt; freeze the best‑performing version on the dev split.
  3. Multi‑rollout: Keep K=3 in production; more is often overkill.

  4. Aggregator: a 2‑layer MLP with inputs [sorted_scores, mean, sd, min, max] and a CCC+NMSE loss is enough.

  5. Guardrails: measure both NMSE (point accuracy) and CCC (agreement + spread), and track predicted variance against ground‑truth variance (a metric sketch follows this list).

  6. Cost control: the loop is parallelizable; the MLP trains on CPU in seconds; inference cost ≈ K×LLM calls + one tiny MLP pass.
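
For the guardrails in step 5, a small NumPy sketch of the two metrics plus a variance‑ratio check; the toy numbers are only there to illustrate the failure mode, and alert thresholds are your own call.

```python
import numpy as np

def nmse(pred, gold):
    """Normalized MSE: mean squared error divided by the variance of the gold labels."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return np.mean((pred - gold) ** 2) / np.var(gold)

def ccc(pred, gold):
    """Lin's concordance correlation coefficient: agreement plus spread."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    cov = np.mean((pred - pred.mean()) * (gold - gold.mean()))
    return 2 * cov / (pred.var() + gold.var() + (pred.mean() - gold.mean()) ** 2)

def variance_ratio(pred, gold):
    """Predicted vs. ground-truth variance; ~1.0 is healthy, far below 1.0 means collapse."""
    return np.var(pred) / np.var(gold)

# Predictions hugging the mean: NMSE beats a mean-only baseline (< 1.0),
# but CCC and the variance ratio expose the collapse.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [2.8, 2.9, 3.0, 3.1, 3.2]
print(nmse(pred, gold), ccc(pred, gold), variance_ratio(pred, gold))
# ≈ 0.81, ≈ 0.20, ≈ 0.01
```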

Where this fits in your stack

  • QA & Support: continuous QA scores for call transcripts, with calibrated thresholds for escalation.
  • Content Ops: editorial quality meters that reflect rubrics, not just grammar.
  • RAG Arbitration: numeric “A beats B by X” to route picks, trigger follow‑ups, or blend answers.
  • RL‑Lite: soft rewards to shape agent behaviors without heavyweight RL training.

Limits & open questions

  • Human label noise (especially in pairwise comparisons) caps achievable CCC—design rubrics that expose objective checks.
  • Over‑deliberation is real: large reasoning models can center outputs on easy tasks. Detect and shorten their scaffolds.
  • Quantization bias: even after MENTAT, monitor decimal‑ending histograms; keep pushing for distributional fidelity.
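
One way to watch for that quantization bias: histogram the first decimal of your model's outputs and alert when .0/.5 dominate. An illustrative snippet, not from the paper:

```python
from collections import Counter

def decimal_ending_histogram(preds, ndigits=1):
    """Count how often predictions end in each first decimal digit.
    A spike at 0.0 and 0.5 suggests the model is snapping to a coarse grid."""
    return Counter(round(abs(p) % 1, ndigits) for p in preds)

# A suspiciously chunky prediction set:
print(decimal_ending_histogram([2.0, 3.5, 4.0, 2.5, 3.0, 4.5, 1.37]))
# Counter({0.0: 3, 0.5: 3, 0.4: 1})
```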

Cognaptus: Automate the Present, Incubate the Future