Review is not a democracy.
That sounds unpleasant, which is why it is useful. In many business settings, we like consensus because it feels stable. Three analysts agree, five reviewers approve, the dashboard turns green, and everyone can pretend the risk has been domesticated. Mathematics is less polite. One invalid theorem application, one hidden assumption, one algebraic step that does not follow, and the whole proof may collapse. The majority does not get to vote a contradiction out of existence.
That is the simple but powerful intuition behind the paper Pessimistic Verification for Open Ended Math Questions.1 The authors study a family of workflows for LLM-based proof checking where a proposed solution is rejected if any reviewer finds a critical error. Not if most reviewers object. Not if the average confidence score drops below a threshold. One serious objection is enough.
At first glance, this sounds like an obvious recipe for false alarms. Ask enough reviewers to be suspicious, and someone will eventually accuse an innocent proof of misconduct. Academic HR departments have seen this movie. But the paper’s results are more interesting than that. In open-ended mathematical verification, the hard part is not usually recognizing correct proofs. The hard part is detecting rare but fatal errors in incorrect ones. If that is the bottleneck, then majority voting is badly matched to the task. It smooths away the very signal we need.
The paper’s real contribution is therefore not “LLMs can verify math now.” That would be too broad, too convenient, and suspiciously suitable for a conference keynote. The sharper lesson is this: verification workflows should match the failure logic of the domain. For tasks where one critical flaw invalidates the whole output, pessimism is not a mood. It is an operating principle.
The mechanism shift: from averaging reviewers to hunting fatal objections
Most test-time scaling methods treat repeated model calls as a way to average out noise. Generate multiple answers, ask multiple judges, vote, aggregate, smooth. This makes sense when errors are roughly symmetric and independent. If several plausible answers compete, majority voting can push the system toward the most stable answer.
Proof verification is different. The verifier is not choosing the prettiest answer. It is checking whether a chain of reasoning survives attack.
The authors frame mathematical proof verification as a binary classification task. A verifier must decide whether an entire proposed solution is correct. They track three metrics:
- True Negative Rate (TNR): how often the verifier catches incorrect proofs;
- True Positive Rate (TPR): how often it accepts correct proofs;
- Balanced F1: the harmonic mean of TPR and TNR, used because both kinds of failure matter.
The paper writes balanced F1 as:
This choice of metric matters. If a verifier accepts everything, it may look friendly to correct solutions, but it is useless as a safety layer. If it rejects everything, it catches all errors, but it destroys usefulness. A good verifier needs both: preserve valid work and catch invalid work.
The paper argues that the differentiator among strong and weak verifiers is mostly TNR. Across the tested models and datasets, TPR is comparatively similar, while TNR varies much more. In plain English: many models can recognize a clean proof well enough; fewer can reliably find the hidden landmine.
That explains why pessimistic verification helps. If only a small number of verification attempts notice the fatal mistake, then majority voting can bury the objection. Pessimistic verification does the opposite. It treats a well-explained critical objection as decisive.
This is the main conceptual move. The workflow is not trying to make the model more agreeable. It is trying to make the model harder to fool.
Three ways to be productively suspicious
The authors test three variants: simple, vertical, and progressive pessimistic verification. They are not complex architectures. That is part of the point. The work is mostly about arranging model calls so that the verifier’s scarce error-detection ability is used better.
| Method | What it does | Operational intuition | Likely purpose in the paper |
|---|---|---|---|
| Simple pessimistic verification | Runs multiple whole-proof reviews and rejects the proof if any review finds a critical error | Repeat the full audit several times because one pass may miss the flaw | Main mechanism baseline against majority voting |
| Vertical pessimistic verification | Splits the proof into chunks and asks the model to inspect a specific chunk while keeping the full proof as context | Make the reviewer stop waving at the whole argument and look at the suspicious paragraph | Ablation-like mechanism extension: does focused local inspection help? |
| Progressive pessimistic verification | Starts with whole-proof checks, then subdivides surviving proofs into finer chunks, pruning proofs once an error is found | Catch obvious failures cheaply, then spend attention on harder survivors | Main proposed workflow and efficiency argument |
Simple pessimistic verification is the blunt version. Ask several independent reviewers to judge the whole proof. If any says “incorrect” and provides a critical reason, reject the proof. The paper compares this with majority voting and finds that pessimistic aggregation improves as review count increases, while majority voting has little effect. The lesson is not that repetition itself is magical. The aggregation rule is doing the work.
Vertical pessimistic verification changes the unit of attention. Instead of repeatedly checking the whole proof, the verifier checks chunks. The prompt tells the model to inspect only a given segment while considering the full proof as context. This is important because many proof errors are local but context-dependent. A theorem is misapplied in one line; a condition is smuggled into one paragraph; a calculation quietly stops being legal. Whole-proof review can glide over these errors because the global story sounds plausible.
Progressive pessimistic verification combines the two. It first checks the whole proof. If no error is found, it splits the proof into smaller parts and checks them. It can continue subdividing up to a chosen depth. The paper denotes this kind of setup as something like prog@n/l, where the parameters control iteration and minimum split length. The maximum number of checks grows as $2^n - 1$, but pruning matters: once a proof is rejected, the workflow does not need to keep inspecting it.
This gives progressive verification a useful asymmetry. Incorrect proofs can be filtered early if the error is easy to detect. Correct proofs may require more checks because there is no disqualifying objection to stop the process. Annoying, yes. Also exactly how serious review often works.
Why majority voting is the wrong comfort blanket
The tempting belief is that multiple reviewers should vote. That is the familiar template: if one model call is noisy, call it many times and trust the crowd.
The paper’s results challenge that template for verification. The authors report that majority voting has almost no effect in their comparison, while simple pessimistic verification improves steadily with more reviews. The mechanism is easy to miss, so let us slow down.
Suppose a proof contains one subtle invalid step. Out of twelve model reviews, perhaps only two notice it. The other ten say the proof is fine because the argument is mostly coherent. Majority voting returns “correct.” Pessimistic verification returns “incorrect,” assuming the detected error is substantive.
That looks harsh only if we think verification is about measuring average sentiment. It is not. Verification is closer to security screening. Most officers may not notice the forged document. One officer who does notice is enough.
There is a caveat: the objection must be explained. The paper’s prompts ask the verifier to provide a concise description of the most harmful error when it rejects a proof. This matters because pessimistic verification without explanation would be a paranoia machine. With explanation, the system can be audited. A downstream human, stronger model, or policy layer can inspect the objection.
For business use, this is the key design pattern:
| Reader assumption | Paper’s correction | Business consequence |
|---|---|---|
| More reviewers should mean majority voting | In proof verification, rare error detection is the scarce signal | Preserve minority objections instead of averaging them away |
| Longer chain-of-thought is the natural way to improve verification | Multiple targeted checks can be more token-efficient and parallelizable | Reliability may improve through workflow design, not only bigger reasoning traces |
| False negatives always mean the verifier wrongly rejected a correct proof | Some “false negatives” reveal annotation errors or rigor-boundary problems | QA metrics can underestimate strong reviewers when labels are imperfect |
| A stricter verifier is always better | Weaker models may produce many unwarranted rejections | Pessimism needs model quality, calibration, and escalation rules |
This table is where the paper starts to matter outside math. Many AI governance discussions still treat verification as a confidence-scoring problem. The paper suggests a different framing: some domains need objection capture.
The experiments test workflow scaling, not just model strength
The paper evaluates the methods on three datasets.
First, IMO-GradingBench, using a 300-sample subset from a human-graded IMO-level proof benchmark. For this dataset, only full-score solutions are treated as correct; everything else is classified as false. That makes the task strict, which is appropriate for proof verification but important when interpreting TPR and TNR.
Second, Hard2Verify, a benchmark of 200 challenging competition-style problems and model-generated solutions, annotated by humans.
Third, QiuZhen-Bench, introduced by the authors. It contains 1,054 advanced mathematics questions collected from the S.-T. Yau College Student Mathematics Contest and doctoral qualifying exams from QiuZhen College, Tsinghua University. The dataset covers areas such as analysis, algebra, machine learning theory, geometry, topology, probability, statistics, and theoretical physics. For experiments, the authors use a 300-problem subset answered by GPT-5-mini and labeled using GPT-5 with progressive verification.
The paper’s figures serve different evidentiary roles, and mixing them together would produce the usual soup of benchmark commentary. So here is the cleaner reading.
| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Simple pessimistic verification vs majority voting | Direct mechanism comparison | The aggregation rule matters; majority voting is poorly suited to rare error detection | It does not prove pessimism is safe with weak verifiers |
| Token-efficiency curves on GPT-5-mini | Test-time scaling comparison | Pessimistic workflows can outperform longer CoT in balanced F1 per equivalent output token | It does not give a universal cost model for all APIs or deployments |
| Qwen3-30B-A3B scaling experiments | Robustness/sensitivity across a weaker model | The trend is not limited to one frontier model | It also shows weaker models face larger TPR degradation |
| Main results across models and datasets | Main evidence | Progressive verification improves TNR and balanced F1 for most tested models | It does not eliminate dataset-label uncertainty |
| False-negative case study | Error analysis and label-quality check | Some apparent false negatives are real objections, especially for stronger models | It is not a full relabeling of every benchmark |
This distinction matters because the paper is not simply saying “progressive is always best.” Its more careful claim is that progressive pessimistic verification tends to provide the best performance-efficiency tradeoff among the tested workflows, especially because it combines whole-proof and chunk-level inspection.
The paper reports that all three pessimistic methods benefit from more computational budget. Vertical verification is more dataset-sensitive, which is plausible: chunking helps when errors are local and the segmentation still preserves enough context. Progressive verification is more robust because it does not bet entirely on one level of granularity.
For GPT-5-mini, the paper states that simple and progressive pessimistic methods are more token-efficient than extended long-CoT test-time scaling. This is an important result, but it should not be flattened into “shorter is always better.” The reason is structural. Multiple shorter checks can run in parallel and can be stopped early when an error is found. Long-CoT puts more effort into a single reasoning trajectory. If that trajectory does not look in the right place, it can become an eloquent wrong turn. A very scenic wrong turn, naturally, but still wrong.
The most interesting table is not the benchmark chart
The paper’s false-negative case study is where the argument becomes more subtle.
In ordinary classification, a false negative here means the verifier rejects a proof labeled as correct. That sounds bad. But the authors manually inspect false negatives from pessimistic verification and classify the verifier’s objection into three categories:
- Critical: the verifier found a real serious flaw or difficult unresolved skip;
- Minor: the verifier found a small issue whose severity depends on grading standards;
- Nonsense: the verifier misunderstood the proof and rejected it for a bad reason.
The reported table is small but revealing:
| Model | Dataset | Critical | Minor | Nonsense | Total |
|---|---|---|---|---|---|
| GPT-5 | IMO-GradingBench | 9 | 9 | 2 | 20 |
| GPT-5-mini | IMO-GradingBench | 5 | 5 | 3 | 13 |
| Qwen3-30B-A3B | IMO-GradingBench | 1 | 1 | 13 | 15 |
For stronger models, many apparent false negatives are not nonsense. GPT-5’s false negatives include 9 critical and 9 minor objections out of 20 reviewed cases, with only 2 labeled nonsense. GPT-5-mini also has a substantial share of critical or minor objections. Qwen3-30B-A3B, by contrast, produces mostly nonsense objections in this sample.
This table should prevent two opposite mistakes.
The first mistake is to dismiss TPR loss too quickly. If a verifier rejects correct work for bad reasons, that is harmful. In business terms, it creates unnecessary escalation, blocks valid outputs, and annoys users who were not wrong. Smaller or weaker models are more vulnerable to this failure mode.
The second mistake is to treat benchmark labels as sacred. The paper’s appendix gives examples where the dataset label says a proof is correct, but the verifier identifies a serious gap. One example involves a solution that assumes periodicity and range restrictions in functional-equation cases without deriving them. Another involves a perfect-power counting argument where the proof improperly uses real interval length to bound integer points. A further example is brutally simple: a constructed polynomial $Q(x)$ is linear even though the problem requires $\deg(Q) \ge 2$. That is not subtle metaphysics. That is the proof equivalent of showing up to a formal dinner wearing one shoe.
This is why the case study matters. It does not merely defend the method against accusations of over-rejection. It exposes a measurement problem: stronger verifiers may be penalized for noticing errors that human annotation missed or tolerated.
QiuZhen-Bench expands the difficulty regime
The release of QiuZhen-Bench is the paper’s second contribution, but it should not be treated as a decorative dataset note. It changes the test environment.
IMO-style proof benchmarks are already difficult, but they are still contest-math flavored. QiuZhen-Bench moves into advanced undergraduate and qualifying-exam material. The appendix examples include graph balance, one-loop photon propagator correction in QED with dimensional regularization, nonlinear PDE existence questions, and algebraic geometry problems about rank-constrained matrix varieties.
This matters because open-ended verification becomes nastier as the proof domain becomes more specialized. A model must not only follow symbolic manipulation; it must know whether a theorem’s hypotheses are satisfied, whether a regularization step is legitimate, whether a uniqueness claim is being smuggled in, and whether an argument actually reaches the requested conclusion.
The dataset construction itself also deserves cautious interpretation. The authors use GPT-5 to extract and reformat problems from PDF exam papers, and they report manual spot checks that did not reveal extraction errors. For the evaluated subset, GPT-5-mini generates answers, while GPT-5 with progressive verification provides labels. This is a practical way to build a hard benchmark, but it also means QiuZhen-Bench is not a pure human-annotated gold standard in the same sense as traditional grading data. The paper is transparent enough about this; readers should keep it in mind.
For business readers, the point is not “use QiuZhen-Bench in your procurement process.” Please do not ask your vendor whether their insurance-claims chatbot can solve QED vacuum polarization. The point is that advanced-domain verification needs workflows that can inspect reasoning at multiple scales. Whole-document coherence is not enough.
The business value is better objection routing, not magical correctness
The paper directly shows that pessimistic verification improves error detection and balanced verification performance on open-ended math proof benchmarks. It also shows that progressive workflows can be more token-efficient than simply extending long-CoT, at least in the tested settings.
Cognaptus’ business inference is narrower and more useful: high-stakes AI systems should separate answer generation from adversarial verification, and the verification layer should preserve credible minority objections.
This applies most naturally to domains with proof-like failure structure:
| Domain | What counts as the “fatal error” | How pessimistic verification could help | Boundary |
|---|---|---|---|
| Code review | Security flaw, incorrect API use, unhandled edge case | Multiple reviewers inspect different files, functions, or failure modes; any critical objection triggers escalation | Static review cannot replace execution, tests, or formal methods |
| Compliance reasoning | Missing condition, wrong regulation scope, unsupported exemption | Separate reviewers check jurisdiction, rule applicability, evidence sufficiency, and exception logic | Ambiguous policy still needs human legal judgment |
| Financial analysis | Broken assumption, arithmetic inconsistency, unsupported causal claim | Reviewers attack model inputs, formulas, scenario logic, and conclusion validity | Market uncertainty is not the same as proof correctness |
| Technical reports | Unsupported derivation, invalid data transformation, inconsistent metric | Chunk-level review catches local errors hidden inside polished narrative | Subjective editorial quality should not be judged by the same rejection rule |
| Agentic workflows | Bad intermediate step that contaminates later actions | Pessimistic checkpoints stop the chain before error propagation | Too much pessimism can freeze the agent |
The important phrase is credible minority objection. A pessimistic system should not reject an output because one cheap model hallucinated a complaint. The objection must be specific, inspectable, and severe enough to matter. The paper’s prompts enforce this by asking for the most harmful error and distinguishing minor issues from critical ones.
In enterprise systems, that suggests a practical architecture:
- Generate the answer or plan.
- Run several targeted verifiers, each with a different inspection scope.
- Preserve any severe objection.
- Route the objection to a stronger model, rule-based checker, or human reviewer.
- Accept only after the objection is resolved or downgraded.
This is not glamorous. It is QA with better plumbing. Conveniently, QA with better plumbing is often what businesses actually need.
Where pessimism breaks
The paper’s limitations are not generic “more research is needed” wallpaper. They affect deployment.
First, pessimistic verification depends on verifier quality. The manual false-negative table shows the difference clearly. GPT-5 and GPT-5-mini often produce meaningful objections. Qwen3-30B-A3B produces many unwarranted ones in the inspected sample. A weak pessimistic verifier is not a skeptical expert. It is a nervous intern with a red pen.
Second, the method is best suited to objective correctness tasks. Mathematical proofs are unusually clean because correctness is not a matter of taste, even when proof rigor standards vary. Business domains are messier. A compliance memo, investment thesis, or product strategy may contain assumptions that are arguable rather than simply false. Pessimistic verification can still help, but the rejection rule must be softened into escalation rather than automatic denial.
Third, the cost profile depends on data distribution. Progressive verification prunes efficiently when many examples are wrong and errors are easy to catch early. If most candidate outputs are correct, the system may spend more effort trying to prove the absence of objections. That may be acceptable in safety-critical tasks, but not for every chatbot reply about refund policies. There are only so many times one should interrogate a shipping-date answer.
Fourth, benchmark interpretation remains delicate. If labels contain errors, apparent TPR loss can partly reflect the verifier’s superior rigor. But that does not mean every rejection is secretly correct. The paper’s own table shows a real split among critical, minor, and nonsense objections. The right operational response is not blind trust. It is evidence-preserving escalation.
The real lesson: match the verifier to the shape of failure
The paper is valuable because it shifts attention from bigger reasoning to better checking. Long-CoT asks the model to think longer. Pessimistic verification asks the system to look in more places and keep the first serious objection alive.
That is a different theory of reliability.
For open-ended math, the theory makes sense because one critical flaw invalidates the proof. For code, compliance, technical analytics, and agentic workflows, the same logic often appears in less formal clothing. One unsafe function call, one missing regulatory condition, one wrong denominator, one unsupported assumption can contaminate an otherwise polished output.
The lesson is not that every AI system should become permanently pessimistic. That would be a wonderful way to build software that refuses to do anything. The lesson is that when the task has disqualifying errors, verification should be designed to find disqualifying errors—not to average them away.
Majority voting is comforting. Pessimistic verification is useful. In serious systems, comfort has always been a poor substitute for finding the bug.
Cognaptus: Automate the Present, Incubate the Future.
-
Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, and Yang Liu, “Pessimistic Verification for Open Ended Math Questions,” arXiv:2511.21522, 2025. https://arxiv.org/abs/2511.21522 ↩︎