Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong.
That is the uncomfortable center of “Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight,” a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives.[1] The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores.
That sounds sensible. It is also exactly where the trouble begins.
A benchmark label is often treated as a little piece of truth: fixed, objective, and safely reusable. In practice, this paper argues, it is better understood as an estimate. Sometimes a good estimate. Sometimes a bad one. And once that estimate becomes a public benchmark label, a leaderboard target, or a reinforcement-learning reward, the error stops being clerical. The yardstick becomes the teacher. If the yardstick is bent, the student learns to lean.
The failure starts before the model is evaluated
MedCalc-Bench tests a workflow that clinicians perform constantly: read a messy clinical note, identify the variables required by a score, and apply the score’s rules. A LACE score, for example, requires features such as length of stay, acuity of admission, comorbidity burden, and recent emergency department visits. Other calculators depend on lab values, symptoms, timing, comorbidities, medication doses, or population-specific assumptions.
The paper formalizes this as a two-stage process:
- Extract clinical features from the patient context.
- Aggregate those features using the relevant scoring rule.
This decomposition matters because it tells us where error can enter. A model can misread the note. A rule implementation can encode the wrong formula. The task itself can be underspecified, so that no responsible clinician should produce a single numeric answer.
That last category is the one static benchmarks handle especially badly. Benchmarks like clean answers. Clinical reality enjoys being rude.
A patient note may describe a multi-day trajectory, while the question does not say whether the score should be computed at admission, peak severity, discharge, or some other reference point. A calculator may have different versions across time or jurisdictions. Required inputs may simply not appear in the note. In those cases, forcing a numeric label does not make the task more rigorous. It trains the system to pretend the missing information exists. Very impressive, if your goal is automated overconfidence.
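The two-stage decomposition, including the incomputable case, can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: body mass index stands in for a calculator only because its formula is uncontroversial, and the `Features` class and `aggregate_bmi` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Features:
    """Stage 1 output: values pulled from the note (None = not documented)."""
    weight_kg: Optional[float]
    height_m: Optional[float]


def aggregate_bmi(features: Features) -> Optional[float]:
    """Stage 2: apply the scoring rule, or abstain when an input is missing."""
    if features.weight_kg is None or features.height_m is None:
        return None  # incomputable: do not invent a number
    if features.height_m <= 0:
        return None  # an impossible extracted value is also not scoreable
    return features.weight_kg / features.height_m ** 2


# Hypothetical extraction results for two notes.
complete = Features(weight_kg=81.0, height_m=1.80)
missing_height = Features(weight_kg=81.0, height_m=None)

bmi = aggregate_bmi(complete)              # a number close to 25
abstained = aggregate_bmi(missing_height)  # None, i.e. "cannot be computed"
```

Keeping the stages separate means an extraction error and a rule error can be audited independently, and a missing input surfaces as an abstention instead of a guess.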
The paper’s contribution is a stewardship mechanism, not just a relabeling exercise
The authors do not propose the heroic fantasy version of benchmark governance: “let physicians manually relabel everything.” That would be expensive, slow, and, in many settings, operationally dead on arrival. Their more useful idea is phased stewardship: use automated systems to screen broadly, then spend expert attention only where it is most likely to matter.
The pipeline has three phases. The table below also lists the two downstream experiments the authors build on top of them.
| Phase | What happens | Likely purpose | What it supports |
|---|---|---|---|
| Phase 1: automated audit | Five independent Gemini 2.5 Pro auditor runs review original labels, metadata, and derivations; a label is flagged only if at least four runs mark it clinically suspicious. | Main evidence and triage signal | Estimates whether the benchmark has enough label-quality problems to justify deeper review. |
| Phase 2: independent recomputation | A separate tool-augmented Gemini 2.5 Pro pipeline recomputes labels from the patient note and question alone, blind to the original label. | Main relabeling mechanism | Produces high-confidence candidate labels and identifies large disagreements with the original benchmark. |
| Phase 3: physician adjudication | Physicians independently recompute 50 high-disagreement cases, single-blind to reduce anchoring. | Targeted validation | Tests whether recomputed labels are closer to physician judgment than the original labels. |
| Controlled RL experiment | Two Qwen3-8B models are trained with the same setup, except one uses original labels as rewards and the other uses recomputed labels. | Causal downstream test | Tests whether label quality affects model training, not only benchmark scores. |
| Additional medical evaluations | The RL-trained models are tested on MedQA and newer MedCalc-related datasets. | Generalization check | Tests whether gains are narrowly memorized or at least directionally transferable. |
This structure is the paper’s practical contribution. It recognizes the binding constraint: physician time. GPU hours and API calls scale badly in budgets, but physician review scales badly in calendars. The pipeline therefore treats LLMs not as replacement clinicians, but as screening machinery for deciding where clinician judgment is worth spending.
That distinction is not cosmetic. “Human-in-the-loop” is often used as a soft pillow placed on top of an otherwise automated system. Here, the loop has an operational role: reduce the search space before expert review.
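The Phase 1 flagging rule is simple enough to write down. The sketch below assumes each auditor run returns a boolean "clinically suspicious" verdict; the function names and case ids are hypothetical, and only the four-of-five threshold comes from the paper.

```python
def is_flagged(votes: list[bool], min_agree: int = 4) -> bool:
    """Flag a label only if at least `min_agree` auditor runs call it suspicious."""
    return sum(votes) >= min_agree


def triage(audit_votes: dict[str, list[bool]]) -> list[str]:
    """Return the instance ids that should move on to closer review."""
    return [iid for iid, votes in audit_votes.items() if is_flagged(votes)]


# Five hypothetical auditor verdicts per instance.
audit_votes = {
    "case-001": [True, True, True, True, False],    # 4 of 5: flagged
    "case-002": [True, False, True, False, False],  # 2 of 5: not flagged
    "case-003": [True, True, True, True, True],     # 5 of 5: flagged
}
review_queue = triage(audit_votes)
```

A supermajority threshold trades recall for precision: it dampens run-to-run noise from a single auditor model, but it cannot remove an error that every run makes the same way.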
The first warning sign: more than a quarter of the test set is suspicious
Phase 1 audits all 1,047 MedCalc-Bench test instances. The result is not a few edge cases hiding in appendix dust. The auditor flags 279 instances, or 26.6%, as likely errors. These flags span 40 of the benchmark’s 55 calculator questions, which makes the issue look broad rather than isolated.
Manual inspection reveals three recurring failure modes.
| Failure mode | What breaks | Why it matters |
|---|---|---|
| Feature extraction error | The label pipeline misreads a clinical value, history item, symptom, or lab result. | The aggregation rule may be correct, but it receives the wrong input. |
| Aggregation logic error | The Python implementation diverges from the clinically appropriate scoring rule. | The label can become systematically wrong for a calculator or subgroup. |
| Incomputable task | The patient note does not contain enough information, or the calculator is clinically mismatched to the case. | The right answer may be abstention, not a forced number. |
The physician spot-check is small but revealing. A physician author reviewed seven flagged cases in his clinical domain and agreed with the error identification in all seven. That is not enough to validate the whole benchmark. It is enough to say the audit was not merely hallucinating suspicion.
Phase 2 then recomputes labels independently. The pipeline reaches high-confidence labels for 887 of the 1,047 test instances, or 85%. Among those, 220 differ from the original label by more than the benchmark’s tolerance threshold, and another 66 are flagged as not computable. Taken together, the authors estimate that at least 27.3% of the original test instances are likely mislabeled or incomputable.
“At least” is doing real work here. The 160 deferred cases are not random easy leftovers. They are cases where the relabeling agents could not reach supermajority agreement, which likely means they are harder, more ambiguous, or more failure-prone. Treating them as clean would be generous. Generous to the benchmark, not necessarily to the truth.
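The arithmetic behind the lower bound is worth making explicit. The counts below are the ones reported above; the final line is a hypothetical worst case added here for illustration, not a figure from the paper.

```python
TEST_SET = 1047          # all MedCalc-Bench test instances
HIGH_CONFIDENCE = 887    # instances where relabeling reached agreement
CHANGED = 220            # recomputed label outside tolerance of the original
NOT_COMPUTABLE = 66      # recomputation says no score can be derived

deferred = TEST_SET - HIGH_CONFIDENCE
coverage = HIGH_CONFIDENCE / TEST_SET

# Lower bound: deferred cases are counted as if they were all clean.
lower_bound = (CHANGED + NOT_COMPUTABLE) / TEST_SET

# Hypothetical worst case: every deferred case also turns out to be problematic.
worst_case = (CHANGED + NOT_COMPUTABLE + deferred) / TEST_SET

print(f"deferred={deferred}, coverage={coverage:.1%}")
print(f"lower bound={lower_bound:.1%}, worst case={worst_case:.1%}")
```

The true share sits somewhere between the two numbers, which is why the deferred cases deserve attention instead of the benefit of the doubt.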
The important interpretation is not “MedCalc-Bench is bad.” The original benchmark was a meaningful contribution because it created a difficult public test bed for clinical score computation. The sharper point is that a benchmark can be useful at release and still require maintenance later. Public datasets do not become sacred just because many papers cite them. They become more consequential, which is exactly why they need stewardship.
Physician adjudication shows the relabeling is not just model-on-model disagreement
A skeptic could reasonably object: one LLM-assisted pipeline produced labels, another LLM-assisted pipeline disagreed, and now we are supposed to trust the second one? Excellent suspicion. Keep it. It is the correct instinct.
That is why Phase 3 matters. The authors select 50 high-disagreement cases and have physician authors recompute labels independently. This is not a random sample of the whole test set. It is a deliberately contentious subset, chosen because the original and recomputed labels diverged most. So the numbers should not be naively extrapolated to all cases. But for testing whether the new labels are meaningfully closer to clinical judgment in disputed cases, the design is appropriate.
The result is large.
| Metric on 50 physician-adjudicated cases | Original labels | Recomputed labels |
|---|---|---|
| Agreement with physician labels | 10/50, or 20% | 37/50, or 74% |
| sMAPE against physician labels | 72.7% | 20.1% |
The agreement result is easy to understand. The sMAPE result needs one sentence more. Symmetric mean absolute percentage error measures relative numerical distance from the physician-computed label; lower is better. Moving from 72.7% to 20.1% means the recomputed labels are not merely winning a few threshold cases. They are much closer numerically to physician judgment on the answerable cases.
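For readers who want the metric in code, here is one common definition of sMAPE. Several variants exist, and the exact one used in the paper is an assumption here; the toy numbers are invented to show the behavior and are not study data.

```python
def smape(predicted: list[float], reference: list[float]) -> float:
    """Symmetric MAPE in percent: mean of |p - r| / ((|p| + |r|) / 2)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for p, r in zip(predicted, reference):
        denom = (abs(p) + abs(r)) / 2
        # Both values zero means a perfect match; count it as zero error.
        total += 0.0 if denom == 0 else abs(p - r) / denom
    return 100.0 * total / len(predicted)


physician = [10.0, 20.0]           # hypothetical reference labels
original_labels = [30.0, 20.0]     # one label far off
recomputed_labels = [10.0, 22.0]   # one label slightly off

err_original = smape(original_labels, physician)
err_recomputed = smape(recomputed_labels, physician)
```

Unlike a pass/fail agreement count, this metric keeps shrinking as a label gets numerically closer to the reference, even when it still misses the tolerance threshold.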
This is the point where the paper moves beyond benchmark hygiene. A mislabeled test set does not only punish good models. It can create the illusion that the wrong behavior is correct.
The paper also evaluates several frontier models against the curated labels. On the 887 high-confidence recomputed test cases, the reported accuracies against curated labels are 86.6% for GPT-5.2, 85.6% for Opus 4.6, 85.1% for Grok 4.1, and 89.6% for Gemini 3.1. Grading the same model outputs against the original labels produces much lower values: 65.7%, 69.4%, 62.5%, and 67.8%, respectively.
The model did not change. The grading key changed.
That is the quiet horror of benchmark rot: it can make capable systems look worse, less capable systems look better, or leaderboard gaps appear where the main difference is the answer key. In a sales deck, that becomes a procurement narrative. In a research paper, it becomes a “state of the art.” In a product roadmap, it becomes a resource allocation decision. Charming little spreadsheet, large downstream blast radius.
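The effect is easy to reproduce in miniature: grade one fixed set of model outputs against two answer keys. The sketch assumes a 5% relative tolerance for numeric answers; that threshold, the function names, and the four toy instances are illustrative assumptions, not values taken from the benchmark.

```python
def within_tolerance(answer: float, label: float, rel_tol: float = 0.05) -> bool:
    """True when the answer is within rel_tol of the label (exact match if label is 0)."""
    return abs(answer - label) <= rel_tol * abs(label)


def accuracy(outputs: dict[str, float], key: dict[str, float], rel_tol: float = 0.05) -> float:
    """Share of instances in `key` that the outputs answer within tolerance."""
    graded = [within_tolerance(outputs[i], key[i], rel_tol) for i in key if i in outputs]
    if not graded:
        raise ValueError("no overlapping instances to grade")
    return sum(graded) / len(graded)


# The same model outputs, graded twice.
outputs = {"a": 12.0, "b": 3.0, "c": 7.5, "d": 40.0}
original_key = {"a": 12.0, "b": 5.0, "c": 9.0, "d": 40.0}
curated_key = {"a": 12.0, "b": 3.0, "c": 7.5, "d": 44.0}

acc_original = accuracy(outputs, original_key)
acc_curated = accuracy(outputs, curated_key)
```

Nothing about the model differs between the two calls, so any gap between the scores is a property of the answer key.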
The most important experiment: the wrong label becomes the wrong reward
The paper’s strongest mechanism is the reinforcement-learning experiment. This is where the benchmark stops being only a measurement device and becomes a training signal.
The authors train two identical Qwen3-8B models using Group Relative Policy Optimization. Everything else is held constant: base model, prompts, hyperparameters, compute budget, training instances, and evaluation setup. The only thing that changes is the reward label.
One model is rewarded for matching the original MedCalc-Bench labels. The other is rewarded for matching the recomputed labels.
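To see why the label matters so much here, consider a simplified version of the training signal. The sketch assumes a binary reward with a 5% tolerance and the group-normalized advantage that GRPO is built around; the authors' actual reward shaping and hyperparameters are not reproduced, and all values are invented.

```python
from statistics import mean, pstdev


def reward(answer: float, label: float, rel_tol: float = 0.05) -> float:
    """1.0 when the sampled answer matches the label within tolerance, else 0.0."""
    return 1.0 if abs(answer - label) <= rel_tol * abs(label) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four answers sampled for one prompt; two competing labels for that prompt.
sampled_answers = [14.0, 14.2, 9.0, 9.1]
original_label, recomputed_label = 9.0, 14.0

adv_original = group_relative_advantages(
    [reward(a, original_label) for a in sampled_answers]
)
adv_recomputed = group_relative_advantages(
    [reward(a, recomputed_label) for a in sampled_answers]
)
```

The same four answers receive advantages of opposite sign under the two labels, so the policy update pushes in opposite directions. A wrong label is not noise in this setup; it is an instruction.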
On the 50 physician-labeled cases, both models begin at the same baseline accuracy: 28%. During the final training phase, the model trained on recomputed labels reaches 51.9% mean accuracy, compared with 38.4% for the model trained on original labels. That is a 13.5 percentage-point difference attributable to the reward labels.
This is the paper’s central business-relevant lesson: benchmark errors do not merely distort evaluation after the fact. If reused for post-training, they can actively shape model behavior.
On the larger 887-instance held-out test set, evaluated against recomputed labels, the gap narrows to 8.7 percentage points: 71.4% versus 62.6%. The smaller effect is expected because the 50 physician-labeled cases were selected from high-disagreement instances. On cases where original and recomputed labels mostly agree, reward choice matters less.
The additional medical evaluations are best read as a robustness and generalization check, not a second thesis. Across MedQA and two newer MedCalc-related datasets, the model trained on recomputed labels still performs slightly better, with raw gains from 0.2 to 1.9 percentage points and a regression-estimated 0.93 percentage-point advantage after controls. This is not a dramatic cross-domain transfer story. It is more restrained and more useful: higher-quality labels improved the target task substantially and did not damage related medical-task performance.
The business issue is benchmark governance, not medical trivia
For business readers, the temptation is to file this under “clinical AI.” That is too narrow.
The paper studies medical score computation because it is a clean high-stakes setting: the task has structured rules, expert labor is costly, and wrong labels matter. But the governance pattern applies anywhere organizations use expert-labeled benchmarks for evaluation, procurement, compliance, or model training.
Legal AI has comparable issues: jurisdiction, precedent date, procedural posture, and missing facts can change the right answer. Financial-risk AI has them too: definitions, accounting standards, market regime, and data vintage matter. Materials science, insurance underwriting, cybersecurity, tax advisory, and regulated operations all have the same uncomfortable structure: labels look like truth until you ask which expert, under which assumptions, at which time, using which rulebook.
The operational implication is straightforward. If a benchmark influences product decisions, it should be treated as infrastructure.
| Governance question | Weak answer | Better answer |
|---|---|---|
| Are labels assumed correct after release? | Yes, unless someone complains. | No. Labels are estimates with versioned confidence and review history. |
| What happens when the task is not computable? | Force a numeric or categorical answer. | Allow abstention and score it as correct when information is insufficient. |
| How is expert review allocated? | Random samples or ad hoc escalation. | Automated triage sends high-disagreement, high-impact cases to experts. |
| How are updates handled? | Silent replacement or undocumented revision. | Transparent versioning, changelogs, and compatibility notes. |
| Can labels be reused for training? | Yes, because they are “gold.” | Only after label-quality risk is assessed, especially for reward models or RL. |
The ROI story is not “buy more annotation.” It is cheaper diagnosis. A company does not need physicians, lawyers, or analysts to relabel every item upfront. It needs a system that separates likely-clean cases, high-disagreement cases, incomputable cases, and assumption-dependent cases. Expert review then becomes a targeted intervention rather than a ceremonial checkbox.
This also changes how procurement teams should read AI benchmark claims. A vendor score is not only about the model. It is also about the reference labels, the dataset version, the answer-extraction logic, the abstention policy, and the scoring tolerance. If those are not documented, the score is less a measurement than a performance-themed anecdote.
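The routing step behind targeted expert review can be sketched as a small classifier over recomputation results. This is a simplification under stated assumptions: runs vote by exact value, `None` means a run judged the task incomputable, the four-run threshold and 5% tolerance are illustrative, and assumption-dependent cases are approximated by the deferred bucket.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Route(Enum):
    LIKELY_CLEAN = "likely_clean"
    HIGH_DISAGREEMENT = "high_disagreement"
    INCOMPUTABLE = "incomputable"
    DEFERRED = "deferred"


def route(original: float, runs: list[Optional[float]],
          min_agree: int = 4, rel_tol: float = 0.05) -> Route:
    """Decide where one benchmark instance goes after blind recomputation."""
    if not runs:
        raise ValueError("need at least one recomputation run")
    value, count = Counter(runs).most_common(1)[0]
    if count < min_agree:
        return Route.DEFERRED            # no supermajority: harder or ambiguous
    if value is None:
        return Route.INCOMPUTABLE        # runs agree that no score can be derived
    if abs(value - original) <= rel_tol * abs(original):
        return Route.LIKELY_CLEAN        # recomputation confirms the original
    return Route.HIGH_DISAGREEMENT       # candidate for expert adjudication


routes = {
    "confirmed": route(5.0, [5.0, 5.0, 5.0, 5.0, 5.0]),
    "disputed": route(5.0, [8.0, 8.0, 8.0, 8.0, 5.0]),
    "missing_inputs": route(5.0, [None, None, None, None, 5.0]),
    "split": route(5.0, [5.0, 5.0, 8.0, 8.0, None]),
}
```

Only the disputed and split buckets need an expert's calendar, which is the whole economic argument.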
Static gold labels fail because clinical truth is sometimes conditional
One of the paper’s most useful observations is that some disagreements are not ordinary mistakes. They arise because the task itself is underspecified.
The appendix discusses examples where patient notes describe events across time but the question does not specify the scoring timepoint. It also notes that scoring rules can vary across guidelines and jurisdictions. In these cases, multiple answers may be defensible. Treating one number as universal ground truth is not precision. It is administrative impatience wearing a lab coat.
This is where abstention becomes a first-class design feature.
In many AI benchmarks, abstention is treated as a failure because the benchmark expects an answer. In safety-critical domains, that framing can be backwards. If the note lacks required inputs, the clinically aligned behavior is to say the score cannot be computed from the provided information. Rewarding a forced answer trains exactly the behavior organizations claim they want to avoid: confident completion under uncertainty.
A mature benchmark therefore needs at least three label states:
- A computable answer.
- An explicitly incomputable or “unknown” label.
- An ambiguity record describing which assumptions would change the answer.
That third state matters. Some ambiguity can be resolved by better prompting or more complete context. Other ambiguity reflects genuine disagreement across guidelines or clinical conventions. These should not be flattened into a single number unless the benchmark documents the assumption being tested.
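The three label states translate naturally into a small data structure with a scoring rule attached. Everything here is a hypothetical design, not the paper's schema: in particular, accepting any documented alternative for an ambiguous label is one possible policy among several, and the 5% tolerance is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LabelState(Enum):
    COMPUTABLE = "computable"
    INCOMPUTABLE = "incomputable"
    AMBIGUOUS = "ambiguous"


@dataclass
class BenchmarkLabel:
    state: LabelState
    value: Optional[float] = None
    # For ambiguous labels: assumption description -> answer under that assumption.
    alternatives: dict[str, float] = field(default_factory=dict)


def is_correct(prediction: Optional[float], label: BenchmarkLabel,
               rel_tol: float = 0.05) -> bool:
    """Score a prediction; None means the model abstained."""
    def close(a: float, b: float) -> bool:
        return abs(a - b) <= rel_tol * abs(b)

    if label.state is LabelState.INCOMPUTABLE:
        return prediction is None        # abstention is the right answer
    if prediction is None:
        return False                     # abstained on an answerable task
    if label.state is LabelState.AMBIGUOUS:
        return any(close(prediction, v) for v in label.alternatives.values())
    return label.value is not None and close(prediction, label.value)


fixed = BenchmarkLabel(LabelState.COMPUTABLE, value=12.0)
unknown = BenchmarkLabel(LabelState.INCOMPUTABLE)
conditional = BenchmarkLabel(
    LabelState.AMBIGUOUS,
    alternatives={"scored at admission": 3.0, "scored at discharge": 6.0},
)
```

Storing the assumption next to each alternative keeps the ambiguity auditable: a reviewer can see which reading of the note produces which number.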
What the paper directly shows, and what Cognaptus infers
The paper directly shows four things.
First, in MedCalc-Bench v1.0, a substantial share of test labels are likely wrong or incomputable. The conservative estimate is at least 27.3%.
Second, the authors’ recomputed labels align much better with physicians on the targeted 50-case adjudication set: 74% agreement versus 20% for original labels.
Third, using original labels can substantially understate frontier-model performance on this benchmark. The same model outputs receive much lower accuracy when graded against the original labels than when graded against curated labels.
Fourth, label choice changes training outcomes. In the controlled RL experiment, training against recomputed labels beats training against original labels by 13.5 percentage points on physician-labeled high-disagreement cases, with a smaller but still meaningful 8.7-point advantage on the 887-instance held-out set.
Cognaptus infers a broader operating rule: any organization using expert-labeled benchmarks as evaluation or training infrastructure needs a benchmark stewardship process. Not an annual spreadsheet cleanup. A real process: audit, recompute, adjudicate, version, document, and decide when abstention is the correct answer.
What remains uncertain is scope. This is one clinical benchmark. The physician-adjudicated set has only 50 cases, each reviewed by a single physician. The LLM agents in Phases 1 and 2 share the same underlying model and prompt strategy, so supermajority voting reduces stochastic variation but not necessarily systematic bias. The RL experiment uses one 8B open model; larger models or different post-training regimes may respond differently.
Those limitations do not weaken the central mechanism. They define its current evidence boundary. The paper does not prove that every LLM-assisted benchmark is rotting at the same rate. It proves that a widely used, high-stakes benchmark can contain enough label error to distort both measurement and training. That is already plenty. We do not need every bridge to fail before deciding inspection is useful.
The practical benchmark-maintenance stack
A useful benchmark in 2026 needs more than a dataset card and a leaderboard. It needs maintenance machinery.
For high-stakes domains, the minimum stack should look something like this:
| Layer | Function | Practical output |
|---|---|---|
| Label provenance | Records how each label was produced. | Human, LLM, script, source document, rule version, timestamp. |
| Automated audit | Detects suspicious labels at scale. | Flag type, confidence, explanation, affected calculator or rule. |
| Independent recomputation | Recomputes labels without seeing the original. | Candidate replacement label, abstention, disagreement score. |
| Expert adjudication | Resolves high-impact disagreements. | Final label, rationale, uncertainty note. |
| Version control | Makes benchmark changes traceable. | Dataset version, changelog, compatibility warning. |
| Training-use policy | Separates evaluation labels from reward labels. | Rules for when labels may be reused in RL or fine-tuning. |
This is not bureaucracy for its own sake. It is the price of using benchmarks as infrastructure. Once a label can affect procurement, publication, regulatory narratives, or model weights, the organization has already accepted governance cost. The only choice is whether to pay it explicitly or let hidden label debt accumulate.
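As a rough illustration of the provenance, versioning, and training-use layers, a label can carry its own revision history. The classes, field names, and example values below are hypothetical; a real benchmark would likely keep this in a dataset repository or database, not in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class LabelRevision:
    dataset_version: str
    value: Optional[float]   # None records an abstention / incomputable label
    produced_by: str         # e.g. "llm", "script", "physician"
    rationale: str


@dataclass
class LabelHistory:
    instance_id: str
    revisions: list[LabelRevision] = field(default_factory=list)
    allowed_as_reward: bool = False   # training-use policy, off by default

    def revise(self, revision: LabelRevision) -> None:
        """Append a revision; earlier ones are never overwritten."""
        self.revisions.append(revision)

    @property
    def current(self) -> LabelRevision:
        """Latest revision (raises IndexError if none has been recorded)."""
        return self.revisions[-1]

    def changelog(self) -> list[str]:
        """One human-readable line per change between consecutive revisions."""
        return [
            f"{old.dataset_version} -> {new.dataset_version}: "
            f"{old.value} -> {new.value} ({new.produced_by}: {new.rationale})"
            for old, new in zip(self.revisions, self.revisions[1:])
        ]


history = LabelHistory("case-001")
history.revise(LabelRevision("v1.0", 9.0, "script", "initial release"))
history.revise(LabelRevision("v1.1", 14.0, "physician", "lab value misread"))
```

Appending instead of overwriting is the design choice that matters: a score reported against an older version stays interpretable after the label changes.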
The conclusion: gold labels tarnish faster than leaderboards admit
The paper’s title uses “stewardship,” which sounds gentle. The implication is not gentle at all.
Static gold labels are becoming a liability in domains where benchmarks are created with LLM assistance and then reused to judge or train later LLMs. The system becomes circular: models help produce labels, labels evaluate models, labels train models, and errors quietly travel through the loop with institutional authority attached.
The answer is not to ban LLM-assisted benchmark creation. That would be unrealistic and, in many domains, counterproductive. Expert-only labeling at sufficient scale is often too expensive. The better answer is to stop pretending that benchmark release is the end of the process.
A benchmark should behave less like a monument and more like a maintained public utility. It needs inspection, repair, versioning, and escalation paths. Sometimes the correct repair is a new number. Sometimes it is a documented assumption. Sometimes it is the permission to say: this cannot be computed from the available evidence.
That last option may be the least glamorous one. It may also be the most clinically intelligent.
Cognaptus: Automate the Present, Incubate the Future.
[1] Junze (Tony) Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, and Mohsen Bayati, “Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight,” arXiv:2512.19691, version 3, 2026. https://arxiv.org/abs/2512.19691