Opening — Why this matters now

Clinical AI has entered an uncomfortable phase of maturity. Models are no longer failing loudly; they are failing quietly. They produce fluent answers, pass public benchmarks, and even outperform physicians on narrowly defined tasks — until you look closely at what those benchmarks are actually measuring.

The paper at hand dissects one such case: MedCalc‑Bench, the de facto evaluation standard for automated medical risk‑score computation. The uncomfortable conclusion is simple: when benchmarks are treated as static truth, they slowly drift away from clinical reality — and when those same labels are reused as reinforcement‑learning rewards, that drift actively teaches models the wrong thing.

This is not a measurement problem. It is an infrastructure problem.

Background — From calculators to canonical truth

Medical risk scores (CURB‑65, CHA₂DS₂‑VASc, MELD, GCS, etc.) are not trivia questions. They are operational shortcuts that compress messy clinical context into actionable decisions: admit vs discharge, anticoagulate vs wait, escalate vs observe.
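
To make the shape of these scores concrete, here is a minimal sketch of CURB‑65 using its standard published criteria; it is illustrative only, not code from the paper or from MedCalc‑Bench.

```python
# Illustrative only: the textbook CURB-65 criteria, not MedCalc-Bench's implementation.
def curb65(confusion: bool, urea_mmol_per_l: float, resp_rate: float,
           systolic_bp: float, diastolic_bp: float, age: int) -> int:
    """One point per criterion; higher totals push toward admission and ICU-level care."""
    return sum([
        confusion,                                # new-onset confusion
        urea_mmol_per_l > 7,                      # urea > 7 mmol/L (BUN > ~19 mg/dL)
        resp_rate >= 30,                          # respiratory rate >= 30 breaths/min
        systolic_bp < 90 or diastolic_bp <= 60,   # hypotension
        age >= 65,                                # age 65 or older
    ])
```

The arithmetic is trivial. The hard part, and the part this paper shows going wrong, is extracting those five inputs reliably from an unstructured note.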

Historically, these scores have been computed manually or with tools like MDCalc. MedCalc‑Bench attempted something ambitious and genuinely useful: turn this everyday clinical workflow into a scalable benchmark by pairing real (de‑identified) patient notes with 55 popular calculators.

To generate labels at scale, the benchmark adopted a two‑stage pipeline (sketched in code below):

  1. Feature extraction from clinical notes using GPT‑4
  2. Rule‑based aggregation via Python implementations of calculator logic
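
In code, that pipeline looks roughly like the sketch below. The function names are hypothetical placeholders for the benchmark's components, not its actual API.

```python
# Hypothetical sketch of the two-stage labeling pipeline; names are illustrative.
def extract_features(note: str, calculator: str) -> dict:
    """Stage 1 (f_theta): an LLM (GPT-4 in MedCalc-Bench) parses the free-text note
    into the structured inputs the calculator needs, e.g. {"age": 71, "sbp": 84}."""
    raise NotImplementedError("stands in for an LLM extraction call")

def aggregate(features: dict, calculator: str) -> float:
    """Stage 2 (g_phi): deterministic Python rules map those features to a score."""
    raise NotImplementedError("stands in for the per-calculator rule code")

def make_gold_label(note: str, calculator: str) -> float:
    # Any error introduced in either stage is frozen into the "gold" label.
    return aggregate(extract_features(note, calculator), calculator)
```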

Those outputs became the benchmark’s “gold labels.” Over time, they also became training rewards for reinforcement‑learning fine‑tuning.

That is where things started to rot.

Analysis — How label pipelines leak error

The paper formalizes medical score computation as a compositional pipeline (written out below):

  • $f_\theta$: extract clinical features from text
  • $g_\phi$: aggregate features into a scalar score
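
Written out, with starred symbols as my shorthand for the clinically correct extractor and rule set (not notation quoted from the paper):

$$
\hat{s}(x) \;=\; \bigl(g_\phi \circ f_\theta\bigr)(x), \qquad s^{*}(x) \;=\; \bigl(g^{*} \circ f^{*}\bigr)(x)
$$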

Errors can enter at either stage — or at the task definition itself.

Three failure modes surfaced

A systematic audit of the MedCalc‑Bench test set reveals three recurring issues:

| Failure mode | What breaks | Why it matters |
| --- | --- | --- |
| Feature extraction errors | GPT‑4 misreads labs, history, or timing | Garbage in silently propagates |
| Aggregation logic mismatches | Python rules diverge from clinical calculators | Scores are consistently wrong |
| Task underspecification | The question cannot be answered from the note | Correct behavior should be abstention |

Notably, many errors were clinically obvious (e.g., physiologically impossible lab values) but had been canonized as ground truth.
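
Values like these can be flagged even by a trivial sanity filter; the ranges below are approximate and chosen purely for illustration, not the paper's audit criteria.

```python
# Approximate, illustrative plausibility bounds; not taken from the paper.
PLAUSIBLE_RANGES = {
    "heart_rate_bpm":   (20, 300),
    "sodium_mmol_l":    (100, 185),
    "creatinine_mg_dl": (0.1, 25),
    "temperature_c":    (25, 45),
}

def implausible_features(features: dict[str, float]) -> list[str]:
    """Return the names of extracted values that are physiologically impossible."""
    flagged = []
    for name, value in features.items():
        low, high = PLAUSIBLE_RANGES.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            flagged.append(name)
    return flagged
```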

Implementation — Physician‑in‑the‑loop, but scalable

Instead of brute‑force relabeling (expensive, slow, unrealistic), the authors propose something more operationally honest: allocate physician attention where it matters most.

The system has two automated stages (a minimal sketch of the resulting triage follows the list):

  1. Audit pipeline

    • A reasoning‑capable LLM (Gemini‑2.5‑Pro) independently critiques existing labels
    • Five runs per instance; only supermajority flags count
  2. Independent recomputation pipeline

    • Same LLM recomputes scores from scratch, blind to original labels
    • Supermajority agreement becomes a candidate new label
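
A minimal sketch of that triage logic, under my reading of the two stages; the 4‑of‑5 supermajority threshold, the escalation rule, and all function names are assumptions rather than details confirmed by the paper.

```python
from collections import Counter

def triage_label(instance: dict, critique_fn, recompute_fn,
                 runs: int = 5, supermajority: int = 4):
    """Hypothetical triage: keep the label, propose a replacement, or escalate.
    critique_fn(instance) -> bool, True if the LLM judges the existing label wrong;
    recompute_fn(instance) -> float, a from-scratch score blind to the original label."""
    # Stage 1: audit the existing label across independent runs.
    flags = sum(bool(critique_fn(instance)) for _ in range(runs))
    if flags < supermajority:
        return "keep", instance["label"]

    # Stage 2: independent recomputation, blind to the original label.
    scores = [recompute_fn(instance) for _ in range(runs)]
    value, count = Counter(scores).most_common(1)[0]
    if count >= supermajority:
        return "candidate_relabel", value

    # No consensus: the kind of divergent case routed to physicians.
    return "physician_review", None
```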

Only after these steps do physicians step in — reviewing the most divergent and clinically relevant cases, not random samples.

This is not human‑in‑the‑loop as a buzzword. It is human attention as a scarce resource, deployed deliberately.

Findings — The numbers that actually matter

How bad was the drift?

  • 26.6% of MedCalc‑Bench test labels flagged as likely incorrect in Phase 1 audit
  • 32%+ estimated mislabel rate after independent recomputation

This is not noise. This is structural error.

Physician reality check

On a single‑blind physician validation subset:

| Metric | Original labels | Maintained labels |
| --- | --- | --- |
| Agreement with physicians | 20% | 74% |
| sMAPE (lower is better) | 72.7% | 20.1% |
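
For reference, sMAPE here presumably follows the standard symmetric mean absolute percentage error (the exact variant is not spelled out in this summary):

$$
\mathrm{sMAPE} \;=\; \frac{100\%}{n}\sum_{i=1}^{n}\frac{\lvert \hat{s}_i - s_i \rvert}{\bigl(\lvert \hat{s}_i \rvert + \lvert s_i \rvert\bigr)/2}
$$

where $s_i$ is the physician‑aligned reference score and $\hat{s}_i$ the label being evaluated.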

The “improved” labels are not marginally better. They are qualitatively different.

When labels become rewards

The most important experiment comes last.

The authors fine‑tune the same base model (Qwen3‑8B) using the same RL algorithm (GRPO), differing only in which labels define the reward (sketched below):

  • Run A: original MedCalc labels
  • Run B: physician‑maintained labels

Evaluation is always against physician‑aligned labels.
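
To make the comparison concrete, a label set becomes a reward roughly as sketched below; the exact‑match tolerance and the abstention handling are my assumptions, not the paper's reward definition.

```python
# Hypothetical reward used during RL fine-tuning; details are assumptions, not the paper's.
def score_reward(prediction: str, label: float | None, tol: float = 1e-2) -> float:
    """Run A plugs in the original MedCalc-Bench label; Run B the maintained one."""
    if label is None:                       # task judged unanswerable: credit abstention
        return 1.0 if prediction.strip().lower() == "abstain" else 0.0
    try:
        value = float(prediction)
    except ValueError:
        return 0.0                          # non-numeric answer to a numeric task
    return 1.0 if abs(value - label) <= tol else 0.0
```

Everything else in the two runs is identical, so any accuracy gap at evaluation time is attributable to the labels alone.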

Result

Training on the maintained labels (Run B) yields +8.7% absolute test accuracy over training on the original labels (Run A), purely from fixing the reward signal.

This refutes a dangerous implicit assumption in modern alignment work: that label noise simply washes out at scale. In RL, it compounds.

Implications — Benchmark stewardship is infrastructure

Three conclusions matter for anyone building or deploying AI in safety‑critical domains:

  1. Benchmarks are living systems. If labels are partially generated by models, they inherit model failure modes and age badly.

  2. Evaluation and training cannot share broken oracles. Once labels become rewards, misalignment stops being diagnostic and starts being causal.

  3. Abstention is a first‑class outcome. Forcing a number when the task is ill‑posed trains hallucination, not intelligence.

This paper reframes benchmark maintenance as a governance problem, not a clerical one. Auditing, versioning, documentation, and expert oversight are not optional add‑ons; they are prerequisites.

Conclusion — Gold labels tarnish faster than models improve

Static benchmarks give the illusion of progress while quietly steering models away from reality. In medicine, that is not an academic inconvenience — it is an alignment failure with operational consequences.

The uncomfortable takeaway is that model quality is now bounded by benchmark quality, and the latter degrades unless actively maintained.

If AI is to earn trust in clinical workflows, we will need fewer leaderboards — and more stewardship.

Cognaptus: Automate the Present, Incubate the Future.