Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful.

In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context?

That second question is where most of the real work lives. A medical LLM can sound fluent while quietly violating several clinical expectations. Multiple-choice exams do not catch enough of this. Generic “helpful and harmless” principles are too vague. Physician-authored instance-level rubrics catch more, but they are expensive, slow, and tied to specific cases. Excellent judgment, unfortunately, does not arrive pre-packaged as cloud infrastructure.

The paper behind Health-SCORE tries to solve that bottleneck: not by building another medical chatbot, but by making medical judgment more reusable.¹ The important move is not merely “LLM-as-judge,” which has already become the duct tape of modern AI evaluation. Health-SCORE asks a sharper question: can expert-written criteria be compressed into a smaller, reusable rubric system, selected adaptively for each task, and then used not only to evaluate outputs but also to steer generation and reinforcement learning?

That is a more interesting proposition. It turns rubrics from a grading sheet into a control surface.

The real problem is not evaluation; it is reusable judgment

Healthcare LLM evaluation has been moving away from exam-style benchmarks for a simple reason: real clinical work is not a four-option quiz. A model may need to summarize a chart, interpret lab panels, identify contradictions, suggest next steps, communicate uncertainty, and avoid crossing the line into unsafe advice. There may be several acceptable answers, and the difference between “good” and “dangerous” often sits in missing qualifiers rather than wrong facts.

Rubric-based evaluation is the natural response. Instead of asking whether the answer matches one reference, a rubric asks whether the answer satisfies a set of criteria: accuracy, completeness, context awareness, uncertainty handling, safety, communication quality, and so on.

The catch is scale.

At one extreme, broad principles are cheap but weak. “Be helpful” is not enough to catch hallucinated creatinine values. At the other extreme, instance-level rubrics are precise but expensive. HealthBench, the source benchmark used by the paper, contains thousands of medical conversations and tens of thousands of physician-authored criteria. That is valuable, but it is not a workflow most healthcare organizations can casually reproduce on a Tuesday afternoon.

Health-SCORE sits between these extremes.

Rubric type	What it captures	Main advantage	Main weakness
Generic principles	Broad response quality	Cheap and reusable	Too coarse for clinical failure modes
Instance-specific expert rubrics	Case-level correctness and nuance	High precision	Expensive, hard to transfer, poor for inference-time use
Health-SCORE-style reusable criteria	Recurrent medical evaluation patterns	Scalable and task-aware	Depends on abstraction quality and selector reliability

This middle position is the whole paper. Health-SCORE is not claiming that 29 criteria can replace doctors. It is claiming that many physician-written criteria contain recurring judgment patterns, and those patterns can be abstracted, reused, and selected more intelligently than a fixed checklist.

A small but important distinction. Also known as the difference between “we automated medicine” and “we made one part of medical AI governance less artisanal.” The second claim is less flashy. It is also more plausible.

Step one: compress expert rubrics without flattening them into slogans

The paper builds Health-SCORE from HealthBench rubrics, focusing on the health data task subset. That focus matters. Health data tasks are difficult because they require models to interpret structured numerical and categorical clinical data, such as labs, medications, comorbidities, and clinical notes. This is exactly where polished language is most likely to hide brittle reasoning.

The construction process is simple in outline:

1. Take existing physician-authored rubric criteria.
2. Embed them semantically using text embeddings.
3. Cluster criteria that point to similar failure modes.
4. Manually inspect and refine clusters.
5. Derive a compact set of reusable Health-SCORE criteria.
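
To make the embed-and-cluster steps concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the bag-of-words `embed` function is a toy stand-in for a real text-embedding model, and the greedy grouping rule, the `threshold` value, and the four example criteria are all assumptions chosen for illustration. The manual refinement step is deliberately left out, because it is the part that does not fit in a function.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(criteria, threshold=0.5):
    """Greedy grouping: a criterion joins the first cluster whose seed
    (first member) is at least `threshold` similar; otherwise it starts
    a new cluster. The threshold is an illustrative assumption."""
    clusters = []
    for text in criteria:
        vec = embed(text)
        for group in clusters:
            if cosine(vec, embed(group[0])) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])
    return clusters


# Hypothetical instance-level criteria, invented for this example.
criteria = [
    "Does not fabricate lab values",
    "Does not fabricate lab values or medications",
    "Recommends urgent referral for chest pain",
    "Recommends urgent referral for severe chest pain",
]
clusters = cluster(criteria)
for group in clusters:
    print(len(group), "criteria, seed:", group[0])
```

Each resulting group is a candidate for one reusable criterion. Deciding whether the group really describes a single failure mode is still a human call.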

The final result is 29 criteria. They cover issues such as following instructions, not fabricating clinical information, asking for clarification when information is missing, avoiding unsafe advice, flagging high-risk conditions, using appropriate SOAP formatting where relevant, not missing critical symptoms, considering contraindications, following guidelines, interpreting reports correctly, recommending follow-up, and identifying contradictions in the prompt.

The mechanism is not pure automation. The authors use embeddings and clustering, but they also include manual quality assurance to remove noise and refine clusters. That is not a weakness to hide. It is the point. The goal is not to pretend expert judgment emerges magically from vector space; it is to reduce repeated expert work by finding common evaluative patterns.

The business version is straightforward: if a healthcare organization already has expert review data, incident reviews, compliance checklists, or clinician-authored evaluation criteria, some of that material may be reusable across tasks. The operational asset is not merely the dataset. It is the taxonomy of repeated failure modes.

Step two: select the right criteria instead of dumping all 29 into every task

A reusable rubric set creates a new problem: not every criterion applies to every prompt.

A SOAP-note formatting criterion is useful when the user asks for structured documentation. It is noise when the user asks for a patient-friendly explanation. A criterion about urgent action matters for high-risk symptoms. It is less relevant for routine preventive screening. A contradiction-detection criterion matters when the prompt contains conflicting information. Otherwise, it can become yet another decorative instruction for the model to ignore.

Health-SCORE handles this through adaptive selection. For each medical conversation, an LLM-based selector scores each criterion for relevance on a 1–5 scale. Criteria that are sufficiently relevant are retained. The selected subset can then be used in evaluation, prompting, or reward computation.
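
A rough sketch of that selection step, under stated assumptions: the article only says criteria that are "sufficiently relevant" are kept, so the cutoff `min_score=4` is a guess, and `keyword_scorer` is a hypothetical stub that stands in for the LLM-based selector so the example runs offline. The criterion texts are paraphrases, not the paper's wording.

```python
def select_criteria(conversation, criteria, score_relevance, min_score=4):
    """Keep criteria whose 1-5 relevance score reaches `min_score`.

    `score_relevance` is any callable (conversation, criterion) -> int.
    The cutoff of 4 is an assumption, not a value from the paper.
    """
    selected = []
    for criterion in criteria:
        score = score_relevance(conversation, criterion)
        if not 1 <= score <= 5:
            raise ValueError(f"relevance score out of range: {score}")
        if score >= min_score:
            selected.append(criterion)
    return selected


# Hypothetical trigger words per criterion; an empty list means
# "treat as always relevant".
TRIGGERS = {
    "Does not fabricate clinical information": [],
    "Uses SOAP format when documentation is requested": ["soap"],
    "Flags high-risk conditions that need urgent action": ["chest pain", "stroke"],
}


def keyword_scorer(conversation, criterion):
    """Stub scorer standing in for an LLM call: 5 if relevant, else 1."""
    triggers = TRIGGERS[criterion]
    if not triggers:
        return 5
    text = conversation.lower()
    return 5 if any(trigger in text for trigger in triggers) else 1


conversation = "Patient reports chest pain radiating to the left arm. What now?"
selected = select_criteria(conversation, list(TRIGGERS), keyword_scorer)
print(selected)
```

Passing the scorer in as a callable keeps the thresholding logic testable on its own, which matters once the selector is treated as something that needs validation.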

This is the paper’s most practically important design choice.

Non-adaptive rubrics often fail quietly because they over-specify the task. A model that receives too many irrelevant constraints may become more verbose, more cautious in useless places, or less focused on the actual clinical issue. In other words, a bigger checklist is not always a better checklist. Anyone who has worked near compliance documentation is now nodding, possibly with mild trauma.

The appendix gives a useful size comparison, reporting the average number of criteria and rubric tokens each method attaches to a conversation:

Method	Avg. criteria per conversation	Avg. rubric tokens	Adaptive?
Single-axis	1	15	No
Multi-axis	5	158	No
LLM-generated	7.8	242.5	Yes
Instance-specific	10.5	452.7	Yes
Health-SCORE, non-adaptive	29	1,117	No
Health-SCORE, adaptive	11.5	431.4	Yes

That table explains why adaptive selection is not a cosmetic add-on. It cuts the full Health-SCORE checklist down to a size similar to expert instance-level rubrics, while keeping the criteria reusable. The target is not minimal prompting. The target is relevance density.
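
The arithmetic behind that claim is short enough to check directly. The snippet below uses only the figures from the size comparison; the variable names are ours.

```python
# Figures from the appendix size comparison.
full_criteria, full_tokens = 29, 1117
adaptive_criteria, adaptive_tokens = 11.5, 431.4
instance_criteria, instance_tokens = 10.5, 452.7

# Share of the full checklist that adaptive selection keeps, on average.
criteria_kept = adaptive_criteria / full_criteria
tokens_kept = adaptive_tokens / full_tokens

# Adaptive rubric length relative to expert instance-specific rubrics.
tokens_vs_instance = adaptive_tokens / instance_tokens

print(f"criteria kept: {criteria_kept:.1%}")
print(f"tokens kept: {tokens_kept:.1%}")
print(f"adaptive vs instance-specific tokens: {tokens_vs_instance:.1%}")
```

Adaptive selection keeps roughly two fifths of the checklist by either measure, and ends up slightly shorter than the average instance-specific rubric.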

Step three: use rubrics as both judge and steering wheel

Health-SCORE is tested in two main roles.

First, it is used as a reinforcement learning reward signal. For each prompt, the adaptive selector chooses relevant criteria. Candidate model outputs are then judged against those selected criteria. A satisfied positive criterion receives +1. A satisfied negative criterion receives -1. An unsatisfied criterion receives 0. The total is normalized into a sequence-level reward used during Group Relative Policy Optimization.
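
A minimal sketch of that scoring rule follows. The +1, -1, and 0 values come from the description above; the normalization is an assumption, since the article does not say how the total is normalized. Here the total is simply divided by the number of selected criteria, which gives a reward between -1 and 1. The `Criterion` class and the example criteria are hypothetical, and the `satisfied` flags would come from a judge model in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    positive: bool  # True: desirable behaviour; False: behaviour to avoid


def rubric_reward(criteria, satisfied):
    """Score one response against its selected criteria.

    satisfied[i] says whether the judge found criteria[i] met.
    Met positive -> +1, met negative -> -1, unmet -> 0.
    Dividing by the number of criteria is an assumed normalization.
    """
    if len(criteria) != len(satisfied):
        raise ValueError("one judgment is required per criterion")
    if not criteria:
        return 0.0
    total = 0
    for criterion, met in zip(criteria, satisfied):
        if met:
            total += 1 if criterion.positive else -1
    return total / len(criteria)


selected = [
    Criterion("Interprets the lab report correctly", positive=True),
    Criterion("Recommends appropriate follow-up", positive=True),
    Criterion("Asks for missing context", positive=True),
    Criterion("Fabricates clinical information", positive=False),
]

clean = rubric_reward(selected, [True, True, False, False])
hallucinated = rubric_reward(selected, [True, True, False, True])
print(clean, hallucinated)
```

Under this scheme a fabricated lab value costs exactly as much as one satisfied criterion earns, which is the equal-weight simplification in miniature.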

Second, Health-SCORE is inserted into prompts at inference time. In this setting, the model is not retrained. The selected criteria are simply placed into the context as guidance, giving the model a live checklist for what a good answer should satisfy.
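
In code, the inference-time use is little more than string assembly. The sketch below shows the idea; the template wording is an assumption, not the paper's prompt, and `build_guided_prompt` is a name invented for this example.

```python
def build_guided_prompt(question, selected_criteria):
    """Prepend the selected criteria to the user's question as a checklist.

    The instruction wording is illustrative; with no criteria selected,
    the question is passed through unchanged.
    """
    if not selected_criteria:
        return question
    checklist = "\n".join(f"- {criterion}" for criterion in selected_criteria)
    return (
        "When answering, make sure the response satisfies these criteria:\n"
        f"{checklist}\n\n"
        f"Question:\n{question}"
    )


prompt = build_guided_prompt(
    "My potassium came back at 6.1 mmol/L. Should I be worried?",
    [
        "Does not fabricate clinical information",
        "Flags high-risk conditions that need urgent action",
    ],
)
print(prompt)
```

Because only the selected subset is inserted, the prompt grows with the number of relevant criteria rather than with the size of the full checklist.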

This dual use is the paper’s strongest conceptual contribution. Evaluation and generation are usually treated as separate phases: generate first, grade later, complain forever. Health-SCORE collapses part of that separation. The same reusable criteria can define what quality means, guide the model while it answers, and provide the reward signal during post-training.

That does not mean the same rubric is automatically valid for every purpose. A criterion that works as a post-hoc evaluator may behave differently when placed inside a system prompt. A reward signal can also be gamed more aggressively than an evaluation checklist. Still, the paper’s architecture is useful because it treats expert criteria as operational objects, not static benchmark annotations.

Use of Health-SCORE	Technical role	Likely business meaning	Boundary
Evaluation	Scores answers against selected criteria	Lower-cost quality review	Still relies on automated judging
Prompting	Inserts relevant criteria into context	No-training improvement path	Adds prompt length and depends on model compliance
RL reward	Converts rubric satisfaction into reward	Structured post-training signal	Equal-weight, discrete rewards may miss severity differences
Adaptive selection	Filters criteria per task	Less noise, better relevance	Selector quality becomes a governance dependency

That final row deserves attention. Once criteria selection is automated, the selector itself becomes part of the safety system. If it misses the criterion that matters, the downstream model can behave “well” according to the wrong checklist. Governance does not disappear. It moves upstream.

The main evidence: Health-SCORE works best when it is adaptive

The authors evaluate models using independent, human-authored instance-level rubrics from HealthBench and CSEDB, rather than scoring them with Health-SCORE itself. This avoids the most obvious circularity problem: declaring victory because the model optimized for a rubric and then was graded by the same rubric family.

The experiments and analyses fall into five groups:

Test	Likely purpose	What it supports	What it does not prove
RL reward experiments	Main evidence	Health-SCORE can serve as a scalable surrogate reward signal	It does not prove full clinical deployment readiness
In-context prompting experiments	Main evidence / deployment proxy	Selected criteria can improve outputs without retraining	It does not prove gains persist across all workflows
Training dynamics analysis	Efficiency evidence	Rubric prompting can speed early learning and stabilize KL behavior	It does not isolate every cost factor in production training
Adaptive vs non-adaptive ablation	Ablation	Relevance filtering matters; all criteria at once can dilute signal	It does not prove the selector is optimal
OOD evaluations	Robustness test	Gains are not limited to the exact in-domain split	It does not prove universal generalization across healthcare

For reinforcement learning, Health-SCORE consistently outperforms single-axis, fixed multi-axis, and LLM-generated rubric baselines across in-domain and out-of-distribution evaluations. It also approaches the performance of training directly with HealthBench instance-specific rubrics. That comparison matters because instance-specific rubrics are the expensive upper-bound style of supervision. Health-SCORE’s claim is not “better than expert rubrics.” It is closer to “much cheaper reusable criteria can approximate part of the expert signal.”

The out-of-distribution tests are also meaningful, but they should be read carefully. HealthBench-Hard tests harder examples within the broader HealthBench world. CSEDB tests a different dataset with a different rubric ontology, originally in Chinese and translated into English. These are useful robustness checks, not a passport to every medical specialty, language, hospital policy, and regulatory setting.

The ablation table is the cleanest evidence for the adaptive mechanism. Adaptive Health-SCORE beats non-adaptive Health-SCORE in every reported setup and model:

Setup	Model	Non-adaptive	Adaptive
HealthBench: Health Data	Qwen3-8B	0.051	0.345
HealthBench: Health Data	Qwen3-32B	0.279	0.416
HealthBench: Health Data	GPT-4.1	0.328	0.445
HealthBench: Health Data	o3	0.391	0.509
HealthBench: Health Data	GPT-5	0.486	0.597
HealthBench: Hard (OOD)	Qwen3-8B	0.000	0.102
HealthBench: Hard (OOD)	Qwen3-32B	0.073	0.196
HealthBench: Hard (OOD)	GPT-4.1	0.136	0.198
HealthBench: Hard (OOD)	o3	0.291	0.368
HealthBench: Hard (OOD)	GPT-5	0.397	0.429
CSEDB (OOD)	Qwen3-8B	0.263	0.418
CSEDB (OOD)	Qwen3-32B	0.393	0.476
CSEDB (OOD)	GPT-4.1	0.244	0.388
CSEDB (OOD)	o3	0.445	0.615
CSEDB (OOD)	GPT-5	0.491	0.568

The pattern is especially striking for weaker models. Qwen3-8B jumps from 0.051 to 0.345 on HealthBench Health Data, and from 0.000 to 0.102 on HealthBench-Hard. This suggests that smaller models may be more vulnerable to irrelevant rubric overload and benefit more from focused criteria. Stronger models also improve, but the relative story is less dramatic.
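
The size of each jump is easy to recompute from the ablation table. The snippet below only subtracts the reported scores; nothing in it is new data.

```python
# (non-adaptive, adaptive) scores copied from the ablation table.
ABLATION = {
    "HealthBench: Health Data": {
        "Qwen3-8B": (0.051, 0.345), "Qwen3-32B": (0.279, 0.416),
        "GPT-4.1": (0.328, 0.445), "o3": (0.391, 0.509),
        "GPT-5": (0.486, 0.597),
    },
    "HealthBench: Hard (OOD)": {
        "Qwen3-8B": (0.000, 0.102), "Qwen3-32B": (0.073, 0.196),
        "GPT-4.1": (0.136, 0.198), "o3": (0.291, 0.368),
        "GPT-5": (0.397, 0.429),
    },
    "CSEDB (OOD)": {
        "Qwen3-8B": (0.263, 0.418), "Qwen3-32B": (0.393, 0.476),
        "GPT-4.1": (0.244, 0.388), "o3": (0.445, 0.615),
        "GPT-5": (0.491, 0.568),
    },
}

# Absolute gain from adaptive selection, per setup and model.
gains = {
    setup: {model: round(adaptive - fixed, 3)
            for model, (fixed, adaptive) in scores.items()}
    for setup, scores in ABLATION.items()
}

# Model with the largest absolute gain in each setup.
largest = {setup: max(g, key=g.get) for setup, g in gains.items()}

for setup, g in gains.items():
    print(setup, g, "largest:", largest[setup])
```

Read this way, the largest absolute gain belongs to Qwen3-8B on Health Data, Qwen3-32B on HealthBench-Hard, and o3 on CSEDB. The "smaller models gain most" pattern is clearest in-domain and less tidy out of distribution.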

The interpretation is not “small models become doctors if you give them a checklist.” Please do not put that on a slide. The interpretation is narrower and more useful: adaptive criteria can reduce task confusion, and that reduction matters more when the base model has less capacity to absorb irrelevant constraints.

The appendix tests robustness, not a second thesis

The paper’s appendix adds two useful pieces: per-axis analysis and implementation details.

The per-axis analysis shows that Health-SCORE improves performance across most expert-defined HealthBench dimensions, including Accuracy, Instruction Following, and Completeness. Gains are not uniform. Communication Quality appears similar across methods. This is useful because it prevents an over-broad reading of the results. Health-SCORE is not simply making responses “sound better.” Its stronger value seems closer to factual, task-completion, and clinical-coverage dimensions.

That is the right kind of improvement for medical AI. Communication style matters, but style is not the hard part when the model invents patient data or misses a contraindication.

The implementation details also clarify what kind of engineering system is being tested. The authors use GRPO, generate eight rollouts per prompt, apply adaptive KL control, and use GPT-4.1 as a judge for expert-authored rubrics during final evaluation. For reward computation, rubrics are applied to the final answer after the last thinking token, not to the intermediate chain-of-thought content. That choice matters: the system grades the answer the user receives, not the hidden reasoning performance theater behind it.
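
Two of those details can be sketched in a few lines. Both parts rest on assumptions. The `</think>` delimiter is a guess at how the end of the thinking segment is marked. The advantage formula is the commonly described GRPO form, each reward minus the group mean, divided by the group standard deviation; the article does not confirm the exact variant the authors used. The eight reward values are invented.

```python
import statistics

THINK_END = "</think>"  # assumed marker for the end of the thinking segment


def final_answer(rollout):
    """Return the text after the last thinking marker.

    If the marker is absent, the whole rollout is treated as the answer.
    """
    return rollout.rpartition(THINK_END)[2].strip()


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    This is the commonly described GRPO advantage; `eps` guards against
    a zero standard deviation when all rollouts score the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


answer = final_answer("<think>Check potassium range.</think> Seek urgent care.")

# Hypothetical rubric rewards for eight rollouts of one prompt.
rewards = [0.5, 0.25, 0.0, 0.75, 0.5, 0.25, 1.0, 0.75]
advantages = group_relative_advantages(rewards)
print(answer)
print([round(a, 2) for a in advantages])
```

Only the text returned by `final_answer` would be shown to the judge, so a rollout cannot earn reward for reasoning it never turns into an answer.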

Training and evaluation used approximately 30 GPU-hours on a node with 8 NVIDIA A100 GPUs. That number is not a complete production cost estimate, but it does suggest the experiment is not in the “only sovereign-scale labs may apply” category. For enterprises, the larger cost may not be GPUs. It may be building and maintaining the expert rubric corpus, validating selectors, and integrating the system into clinical governance workflows.

As usual, the expensive part is not always the part with the invoice from NVIDIA.

The business value is cheaper diagnosis of model behavior

For healthcare AI teams, Health-SCORE points to a practical architecture.

Start with expert judgment that already exists: physician review comments, benchmark rubrics, safety taxonomies, compliance failure categories, escalation policies, and incident analyses. Abstract repeated criteria into reusable categories. Use adaptive selection to choose task-relevant criteria. Apply them in three places: pre-deployment evaluation, inference-time prompting, and post-training reward design. Then validate outputs against independent expert-authored criteria, not the same rubric set used for optimization.

That architecture is more important than the specific list of 29 criteria.

The direct paper result is that Health-SCORE improves training, prompting, and robustness under the tested healthcare settings. The Cognaptus inference is broader: regulated AI systems need reusable judgment layers, not just bigger base models or longer prompt templates. In domains where errors are multidimensional and asymmetric, organizations need a way to define, select, measure, and update quality criteria.

This has obvious relevance beyond healthcare: legal review, insurance claims, financial advice, compliance monitoring, enterprise support, and any workflow where “correct” means satisfying several professional constraints at once.

But the transfer is not automatic. A finance-SCORE or law-SCORE would need domain-specific expert criteria, abstraction, selector validation, and independent evaluation. Copying the healthcare rubric list into a loan-underwriting assistant would be an act of corporate poetry, not governance.

The boundary: scalable rubrics are not scalable responsibility

Health-SCORE has four main limitations that matter for practical use.

First, the pipeline relies on LLMs at multiple stages: embedding rubrics, selecting relevant criteria, and judging rubric satisfaction during reward computation. Final evaluation uses expert-authored instance-level rubrics, which is good, but automated judging remains part of the training loop. Bias, inconsistency, and judge-model blind spots do not disappear because the word “rubric” sounds sober.

Second, the abstraction process assumes that expert-written criteria contain reusable higher-level patterns. That assumption is reasonable, and the experiments support it in the tested settings. It is still an assumption. Some clinical judgments may be too case-specific, institution-specific, or specialty-specific to compress cleanly.

Third, the reward formulation is simple. Criteria are equal-weight and discrete: each one contributes +1, 0, or -1, with no notion of how much a given miss matters. In medicine, severity matters. Missing a formatting preference and missing a life-threatening symptom should not carry the same operational weight. Future systems will likely need graded satisfaction, severity weighting, or risk-sensitive reward functions.

Fourth, the paper focuses mainly on HealthBench health data tasks, with OOD tests on HealthBench-Hard and CSEDB. That is meaningful evidence, not universal validation. Clinical deployment would still require workflow-specific testing, human oversight, bias audits, monitoring, and clear boundaries on what the model is allowed to do.

The authors state a similar risk directly: abstracted rubrics may be overgeneralized beyond contexts that require instance-specific or expert judgment. That warning should be taken seriously. A reusable rubric is a tool for scaling review, not a license to scale trust faster than evidence.

From benchmark score to control layer

The easiest way to misunderstand Health-SCORE is to treat it as another benchmark paper. That misses the mechanism.

The paper’s real contribution is a pipeline:

Physician-authored criteria
        ↓
Semantic clustering and expert refinement
        ↓
Reusable Health-SCORE criteria
        ↓
Adaptive selection per task
        ↓
Prompt guidance, RL reward, and evaluation support
        ↓
Independent expert-rubric validation

That pipeline is valuable because it gives organizations a way to convert expensive expert judgment into reusable AI infrastructure. Not perfectly. Not automatically. Not without governance. But concretely enough to be operationally interesting.

In healthcare AI, the next bottleneck may not be whether models can produce more fluent answers. They can. We have noticed. The bottleneck is whether organizations can define, apply, and update the standards by which those answers are judged.

Health-SCORE is one attempt to make that judgment scalable. The clever part is not that it grades the doctor. The clever part is that it remembers how the doctor graded, distills the pattern, and applies it where it actually belongs.

That is less dramatic than replacing clinicians. Good. Medicine has enough drama already.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhichao Yang, Sepehr Janghorbani, Dongxu Zhang, Jun Han, Qian Qian, Andrew Ressler II, Gregory D. Lyng, Sanjit Singh Batra, and Robert E. Tillman, “Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs,” arXiv:2601.18706, 2026.
