Rubrics

Move the Goalposts on Purpose

TL;DR for operators: A fixed rubric is a depreciating training asset. Early in reinforcement learning, it may be too demanding to distinguish one weak answer from another. Later, once the model learns to satisfy it, the same rubric becomes too easy. The score survives; the information content does not. EvoRubrics trains the evaluator alongside the model.1 A Policy LLM produces candidate answers, a Rubric Generator produces candidate evaluation criteria, and an external judge scores every answer against every rubric. The policy is rewarded for satisfying the evolving criteria. The rubric generator is rewarded for producing criteria that separate stronger from weaker answers, cover different dimensions, remain anchored to desired preferences, and help the policy revise its responses. ...
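
The "too hard, then too easy" problem can be made concrete with a small sketch. It assumes the judge returns a score in [0, 1] for every answer under every rubric, and it uses the variance of a rubric's scores across answers as a stand-in for how well that rubric separates stronger from weaker answers. The function names and the variance measure are illustrative assumptions, not the reward defined in the paper, and the coverage, preference-anchoring, and revision terms mentioned above are not modeled.

```python
from statistics import mean, pvariance

def policy_rewards(scores):
    """Hypothetical policy reward: mean judge score of each answer over all rubrics.

    scores[i][j] is the judge's score for answer i under rubric j, in [0, 1].
    """
    return [mean(row) for row in scores]

def rubric_discrimination(scores):
    """Hypothetical separation signal: variance of each rubric's scores across answers.

    A value of 0 means every answer received the same score under that rubric,
    so the rubric carries no information for ranking the answers.
    """
    columns = list(zip(*scores))  # transpose: one tuple of scores per rubric
    return [pvariance(col) for col in columns]

# Four answers scored against three rubrics (made-up numbers for illustration).
scores = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

print(policy_rewards(scores))
print(rubric_discrimination(scores))
```

In this toy matrix the first rubric is satisfied by every answer and the second by none, so both get a discrimination of zero even though their scores look very different. Only the third rubric splits the answers, which is the property the rubric generator is described as being rewarded for.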

Thinking About Thinking: When LLMs Start Writing Their Own Report Cards

Report cards are usually written by teachers, managers, examiners, auditors, or other people with the institutional privilege of saying, “Nice effort, but no.” The paper Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics asks a stranger question: what if the model helps write the report card for its own reasoning process?1 That sounds like the kind of governance idea that would make a compliance officer reach for coffee. A model evaluating itself is not automatically trustworthy. Sometimes it is self-reflection. Sometimes it is theatre with JSON brackets. ...

Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful. In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context? ...