Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators

AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.¹²

The useful pattern is not “replace the evaluator.” That is the brochure version, and brochures have committed many crimes against operational reality. The useful pattern is: build an early signal, anchor it to human evidence, test whether it engages with the right mechanism, and route uncertain cases back to people.

For businesses, this matters beyond education and academia. The same structure appears in hiring screens, compliance reviews, underwriting, procurement, technical due diligence, medical triage, R&D evaluation, grant scoring, and customer-risk workflows. Anywhere an organisation needs judgement before complete evidence is available, AI can help. But only if it is treated as a calibrated instrument, not a small oracle in a cardigan.

Operating question	Bad AI answer	Better AI answer
Can AI evaluate this before full human evidence exists?	“Yes, automate it.”	“Yes, as a provisional signal with human-calibrated targets.”
Is the output expert-like?	“It sounds expert-like.”	“It matches expert behaviour on the relevant diagnostics.”
What should humans still own?	“Exceptions.”	“Final judgement, ground truth refinement, factual verification, and policy accountability.”
Where is the business value?	“Lower headcount.”	“Earlier triage, smaller expert bottlenecks, faster calibration, measurable escalation.”

The problem: judgement arrives too late

Modern organisations have a timing problem. The moment when a judgement is most useful is often before the evidence is fully available.

A test publisher wants to know whether a new item is too easy or too hard before thousands of students have answered it. A conference wants help managing thousands of submissions before enough qualified reviewers can give each one the attention it deserves. A bank wants a first read on a risky file before senior analysts spend an afternoon on it. A hiring team wants to know which applications deserve expert review before the calendar becomes a small bonfire.

That is the opening AI keeps walking into: not because humans have become useless, but because human evidence is expensive, slow, and unevenly distributed.

The two papers sit at different points in that workflow. The first paper asks whether transformer models can provide response-free predictions of multiple-choice item difficulty, using only the wording of the passage, question, and options before student response patterns exist.¹ The second paper asks whether LLM-generated peer reviews behave like human expert reviews, or merely produce review-looking text with academic manners and suspiciously symmetrical paragraphs.²

Together, they support a sharper thesis: AI is useful in high-stakes evaluation when it becomes a calibrated layer in the judgement workflow. It is dangerous when polished output is mistaken for grounded judgement.

The chain: from early signal to behavioural audit

These papers are not saying the same thing. That is why they are useful together.

The item-difficulty paper is a mechanism paper. It studies how to construct an early predictive signal when the usual response-based calibration data are missing. The peer-review paper is a governance paper. It studies how to diagnose whether AI-generated evaluation actually resembles expert behaviour across meaningful dimensions.

The logic chain looks like this:

Step	What the chain requires	What the papers contribute
1	Acknowledge the evidence gap	New test items and overloaded review pipelines need evaluative signals before full human evidence is available.
2	Generate a provisional AI signal	Transformers can estimate item difficulty from wording; LLMs can generate review text.
3	Anchor the signal	Item-difficulty predictions are tied back to Rasch-calibrated human response patterns; peer-review diagnostics compare LLM reviews against human reviews.
4	Measure behaviour, not polish	MCQA supervision trains the model toward option-level discrimination; PRAIB tests readability, specificity, mathematical engagement, agreement, citation validity, and atomic coverage.
5	Route, do not replace	AI becomes a triage and calibration layer; humans retain final judgement and verification.
6	Close the loop	Later response data or human review outcomes refine the system.

This is the difference between “AI as evaluator” and “AI as judgement infrastructure.” The first is a sales pitch. The second is an operating model.

What the item-difficulty paper shows: useful prediction needs a process signal

The first paper studies response-free item difficulty modelling for reading-comprehension multiple-choice questions. The practical setting is familiar: item difficulty is usually estimated from response patterns, often through psychometric models such as Rasch difficulty. But newly written items may not have response data yet. Pre-testing may be expensive, slow, or risky because it exposes secure items. Automatically generated item pools may need provisional difficulty estimates before any student has touched them.
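For readers who have not met the target before, the Rasch model in its standard textbook form gives the probability that person $p$ answers item $i$ correctly as a function of person ability $\theta_p$ and item difficulty $b_i$. The notation is the conventional one and is not taken from the paper:

$$ P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)} $$

Response-based calibration estimates $b_i$ from observed answers $X_{pi}$. Response-free modelling tries to predict $b_i$ from the item text alone, before any answer has been observed.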

The paper compares three transformer-based approaches. The joint-encoding model concatenates the passage, question, and answer options into one sequence and fine-tunes an encoder to predict item difficulty. The component-wise model separately encodes the passage, question, and options before combining the representations. The multi-task model keeps joint encoding but adds an auxiliary multiple-choice question-answering objective, so the shared encoder is also trained to distinguish the correct option from distractors.

That auxiliary task is the interesting part. It is not just extra training for the sake of extra training. It is a process-aligned signal. A human test-taker experiences difficulty by reading the passage, understanding the question, and comparing options. The MCQA task pushes the model to learn representations related to that same option-discrimination process.
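A minimal sketch of what a multi-task objective of this kind can look like, assuming a squared-error loss on difficulty, a cross-entropy loss on option choice, and a fixed mixing weight. The loss forms, the function names, and `aux_weight` are illustrative assumptions, not details confirmed by the paper.

```python
import math


def difficulty_loss(predicted: float, target: float) -> float:
    """Squared error between predicted and human-calibrated difficulty."""
    return (predicted - target) ** 2


def mcqa_loss(option_logits: list[float], correct_index: int) -> float:
    """Cross-entropy for picking the correct option among the distractors."""
    max_logit = max(option_logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in option_logits)
    )
    return log_sum - option_logits[correct_index]


def multitask_loss(
    predicted: float,
    target: float,
    option_logits: list[float],
    correct_index: int,
    aux_weight: float = 0.5,  # hypothetical mixing weight
) -> float:
    """Main loss plus a weighted auxiliary loss from the same shared encoder."""
    return difficulty_loss(predicted, target) + aux_weight * mcqa_loss(
        option_logits, correct_index
    )
```

The design point is that both terms backpropagate into one encoder, so the option-discrimination signal shapes the representation even when difficulty labels are few.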

The paper’s findings are careful. Joint encoding is a viable end-to-end alternative to feature-engineered pipelines. Component-wise encoding does not show a detectable advantage over joint encoding; the authors interpret this as evidence that self-attention may already be capturing useful cross-component relationships without the researcher manually separating the parts. The multi-task version yields statistically significant improvements in paired comparisons in the smallest training regime, with the effect diminishing as more labelled difficulty data become available.

That last point is the operator’s lesson. When labelled judgement is scarce, a process-aligned auxiliary signal can help. When labelled judgement is abundant, the auxiliary signal matters less. This is not magic. It is regularisation with a job description.

The paper also keeps the boundary conditions visible. Purely text-based prediction has a ceiling because some difficulty-relevant factors do not live in the wording: test-taking strategies, item-position effects, visual layout, administration context, and other nuisances that quietly ruin clean abstractions. The target itself is noisy, because it is linked to finite response samples and pseudo-labelling. So the best role for response-free prediction is not final truth. It is a first-pass estimate, a prior, or a way to reduce the amount of later response-based calibration required.
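One way to picture "a prior" being refined by later response data is a precision-weighted average under a normal approximation. This is a standard textbook update offered as an illustration, not the paper's procedure; the function name and the variances are hypothetical.

```python
def combine_estimates(
    prior_mean: float,
    prior_var: float,
    data_mean: float,
    data_var: float,
) -> tuple[float, float]:
    """Normal-normal update: weight each estimate by its precision (1 / variance)."""
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)
    return post_mean, post_var


# Text-based prediction as the prior, a small pilot sample as the data.
mean, var = combine_estimates(prior_mean=0.0, prior_var=1.0, data_mean=1.0, data_var=1.0)
```

With equal variances the combined estimate lands halfway between the two, and its variance is smaller than either input. A confident text-based prior therefore lets a smaller response sample reach the same precision, which is the practical saving.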

That is exactly how a sensible AI judgement layer should behave: useful early, explicit about uncertainty, and designed to be refined by human-grounded evidence.

What the peer-review paper shows: expert-shaped text is not expert judgement

The PRAIB paper starts from a more treacherous problem. LLMs can generate peer reviews that look polished, detailed, and appropriately solemn. Unfortunately, peer review is not a writing contest. It is an evaluative act: a reviewer must identify strengths, weaknesses, technical risks, evidence gaps, methodological problems, mathematical issues, and citation concerns.

PRAIB asks whether LLM reviews behave like human reviews across several dimensions. The benchmark measures surface properties such as length, readability, and vocabulary diversity, but it does not stop there. It also checks content specificity through references to figures, tables, sections, equations, theorems, page numbers, and similar manuscript anchors. It evaluates mathematical engagement, rating agreement, confidence patterns, citation validity, and coverage of atomic strengths and weaknesses from human reviews.

The results are not a clean “LLMs fail” story. That would be too easy, and therefore probably wrong. The paper finds systematic divergence. LLM-generated reviews tend to be longer and more complex than human reviews. General-purpose models often produce lower readability and lower vocabulary diversity. Ratings can be positively biased, less variable, and overconfident. Prompt phrasing can shift model behaviour. Formal citations are a weak point, especially for general-purpose models. Most importantly, models can produce a lot of review text while still missing human-identified weaknesses.

That is the word “coverage” doing serious work. A model can mention many things and still miss the things human experts considered decisive. Verbosity is not density. Length is not engagement. Academic tone is not evidence. The robe does not make the judge.
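A rough sketch of the coverage idea, assuming the hard part is already done: each atomic point from the human and model reviews has been matched to a shared label. The labels, the function name, and the exact-match rule are hypothetical simplifications, and PRAIB's own matching procedure may differ.

```python
def weakness_coverage(human_points: set[str], model_points: set[str]) -> float:
    """Fraction of human-identified weaknesses that the model review also raises."""
    if not human_points:
        raise ValueError("coverage is undefined without human reference points")
    return len(human_points & model_points) / len(human_points)


# Hypothetical labels: the model raises six points but hits only one of three.
human = {"no_ablation", "weak_baseline", "unclear_proof"}
model = {
    "no_ablation",
    "typos",
    "long_intro",
    "more_datasets",
    "related_work_gap",
    "figure_quality",
}
coverage = weakness_coverage(human, model)
```

The denominator is the human set, not the model set, so adding more model comments cannot raise the score unless they hit something the experts flagged.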

PRAIB’s strongest business contribution is the behavioural audit. It treats human reviews not simply as writing samples, but as behavioural reference points. Does the model cite verifiable sources? Does it engage with equations where human reviewers do? Does it identify the same weaknesses? Does it agree with human ratings in a way that is meaningful for ordinal judgement? Does it become more grounded under better prompts, or merely longer?
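For ordinal ratings, plain accuracy treats a one-point miss and a five-point miss as equally wrong. Quadratic weighted kappa is one common statistic that penalises larger gaps more heavily; the sketch below implements it under the assumption of integer ratings on a fixed scale. It is shown as an example of an ordinal-aware measure, not as the metric PRAIB necessarily uses.

```python
def quadratic_weighted_kappa(
    human: list[int], model: list[int], min_rating: int, max_rating: int
) -> float:
    """Agreement on an ordinal scale: 1 is perfect, 0 is chance level."""
    n_cat = max_rating - min_rating + 1
    if n_cat < 2 or len(human) != len(model) or not human:
        raise ValueError("need paired ratings on a scale with at least two levels")
    n = len(human)

    # Confusion matrix of human rating (rows) against model rating (columns).
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_rating][m - min_rating] += 1
    human_hist = [sum(row) for row in observed]
    model_hist = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * human_hist[i] * model_hist[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected
```

A model that compresses every rating toward the upper-middle of the scale can keep a small average error while scoring poorly here, which is the kind of behaviour the audit is meant to surface.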

The authors conclude that LLMs are not suitable as wholesale replacements for human expertise. They propose a synergistic pipeline: frontier models may support mathematical analysis, fine-tuned models may help with stylistic drafting, but humans should retain final qualitative judgement and citation verification.

That conclusion is modest. It is also much more useful than the usual theatre of “AI will disrupt peer review,” delivered with the confidence of someone who has never chased a hallucinated citation through three databases.

The shared operating principle: anchor, diagnose, escalate

The business interpretation is broader than the papers’ domains.

The item-difficulty paper shows how to build a provisional AI signal when direct human evidence is missing. The PRAIB paper shows how to audit whether AI-generated evaluation deserves operational trust. Put together, they suggest a framework for high-stakes AI judgement systems:

Anchor the target. Tie the model’s output to a human-derived quantity or behavioural reference. In the item paper, that is response-pattern-based Rasch difficulty. In PRAIB, it is human review behaviour and human-identified strengths and weaknesses.
Align the mechanism. Add signals that reflect the process generating the judgement. MCQA supervision matters because difficulty is partly created by option discrimination. Peer-review diagnostics matter because review quality depends on manuscript-specific engagement, not merely fluent critique.
Measure behavioural fit. Do not ask only whether the output looks plausible. Ask whether it agrees, covers, verifies, discriminates, cites, escalates, and fails in measurable ways.
Use AI before the bottleneck, not instead of the bottleneck. The best place for AI is often upstream: ranking, triage, drafting, pre-calibration, anomaly detection, and prioritisation.
Close the loop with later evidence. Response data refine item estimates. Human review outcomes refine AI review support. Business decisions refine risk models. The loop is the product.

Here is the compact operating model:

Layer	Purpose	Example in the papers	Business analogue
Early signal	Produce a provisional estimate before full evidence exists	Response-free item difficulty prediction	First-pass risk, quality, or complexity score
Human anchor	Ground the model in evidence from people or expert behaviour	Rasch difficulty; human peer reviews	Expert labels, audit outcomes, committee decisions
Process signal	Push the model toward the real mechanism	MCQA auxiliary task; math and citation engagement checks	Evidence checks, policy criteria, domain-specific rubrics
Behavioural audit	Test whether output behaves correctly	PRAIB metrics and atomic coverage	QA dashboards, calibration studies, escalation tests
Escalation path	Reserve final authority for scarce experts	Response-based follow-up; human reviewers	Senior review, compliance sign-off, clinical or legal oversight

This framework is deliberately unglamorous. That is its charm. It turns AI from a theatrical judge into a calibrated instrument.
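The escalation layer can be sketched as a routing rule. Everything here is a hypothetical illustration: the field names, the queue names, and the thresholds are assumptions that a real workflow would need to calibrate against its own human outcomes.

```python
from dataclasses import dataclass


@dataclass
class CaseSignal:
    risk_score: float        # provisional AI estimate in [0, 1]
    uncertainty: float       # e.g. spread across repeated runs, in [0, 1]
    evidence_verified: bool  # whether cited evidence passed a check


def route(
    signal: CaseSignal,
    risk_limit: float = 0.7,         # hypothetical threshold
    uncertainty_limit: float = 0.2,  # hypothetical threshold
) -> str:
    """Pick the human queue for a case; the model never issues the final decision."""
    if not signal.evidence_verified:
        return "expert_review"
    if signal.uncertainty > uncertainty_limit:
        return "expert_review"
    if signal.risk_score >= risk_limit:
        return "expert_review"
    return "standard_queue"
```

Note that both outcomes are human queues. The rule allocates scarce expert attention rather than closing cases, and the share of cases escalated becomes a number that can be tracked over time.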

Why this matters for business workflows

Most companies do not run educational testing programs or academic conferences. They do, however, run evaluation pipelines. They review loan files, supplier proposals, insurance claims, job applications, customer complaints, legal clauses, research ideas, product defects, model outputs, safety incidents, and strategic options.

These workflows have three recurring problems.

First, evidence arrives unevenly. Some cases have rich history; others are new, sparse, or ambiguous. Second, expert attention is scarce. Senior people should not spend equal time on every file just because the workflow software has the imagination of a filing cabinet. Third, superficial quality is misleading. A polished memo can be wrong. A long review can miss the key risk. A confident score can be poorly calibrated. Organisations know this already, usually after paying for it.

The combined lesson from these papers is to design AI evaluation systems around judgement logistics:

Use AI to identify which cases need attention.
Train or prompt models with signals that resemble the real decision process.
Compare AI behaviour against human experts on specific dimensions, not just aggregate scores.
Measure weakness coverage, evidence grounding, and confidence calibration.
Track where prompts change outputs too much.
Treat unsupported citations, missing evidence, and overconfident ratings as operational defects, not charming personality traits.
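Prompt sensitivity, one item on the list above, can be tracked with very little machinery. The sketch below assumes the same case has been scored under several prompt variants; the variant names, the scores, and the tolerance are hypothetical.

```python
from statistics import pstdev


def prompt_sensitivity(scores_by_prompt: dict[str, float]) -> float:
    """Population standard deviation of one case's scores across prompt variants."""
    if len(scores_by_prompt) < 2:
        raise ValueError("need at least two prompt variants to measure spread")
    return pstdev(list(scores_by_prompt.values()))


def is_unstable(scores_by_prompt: dict[str, float], tolerance: float = 0.5) -> bool:
    """Flag a case when rephrasing the prompt moves the score more than the tolerance."""
    return prompt_sensitivity(scores_by_prompt) > tolerance


# Hypothetical ratings for one submission under three phrasings.
scores = {"neutral": 6.0, "strict": 4.0, "encouraging": 8.0}
flagged = is_unstable(scores)
```

A flagged case is a natural candidate for human review, since the score reflects the wording of the request at least as much as the case itself.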

The payoff is not merely lower cost. The larger payoff is a better allocation of human judgement. A compliance team can focus on cases where evidence is thin or contradictions are high. A hiring team can ask human reviewers to inspect applications where the model’s signal is unstable. A technical due diligence team can use an LLM to surface mathematical or architectural issues, while humans verify whether those issues are material. A test developer can use response-free difficulty estimates to assemble provisional forms, then refine them with smaller response samples.

This is automation as triage. It is less exciting than “fully autonomous evaluation.” It is also less likely to set the building on fire.

The misconception to avoid

The dangerous misconception is simple: if AI can generate expert-looking judgement, the judgement is expert-like.

Both papers undermine that assumption. The item-difficulty paper does not say that a transformer simply “understands difficulty.” It shows that useful prediction depends on the relationship between text, human-calibrated difficulty, architecture, sample size, and auxiliary supervision. The PRAIB paper does not ask whether LLM reviews sound acceptable. It asks whether they behave like human reviews across dimensions that matter for evaluation.

This distinction should shape procurement and deployment decisions. A vendor demo that shows fluent scoring is not enough. A dashboard that reports average agreement is not enough. A review agent that produces many comments is not enough. The question is whether the system is anchored to the right evidence, aligned with the real mechanism, and audited for the failure modes that would matter if a human made the same mistake.

A practical governance checklist would ask:

Question	Why it matters
What human-derived target anchors the model?	Prevents free-floating judgement.
What evidence does the model actually inspect?	Separates grounded evaluation from plausible language.
Which expert behaviours are being measured?	Avoids judging the model by style alone.
How does the model handle uncertainty?	Detects overconfidence and false precision.
What kinds of weaknesses does it miss?	Reveals whether the model fails on decisive negatives.
Which outputs require human escalation?	Preserves accountability where stakes are high.
How is later evidence fed back?	Turns one-off automation into a learning system.

The checklist is not glamorous. Neither is a seatbelt. Both exist because impact has a habit of arriving suddenly.
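The uncertainty row of the checklist can be made concrete with a very simple diagnostic: compare how confident the system says it is with how often it turns out to be right. This is a minimal sketch with hypothetical names and data. It assumes each output carries a stated confidence in [0, 1] and is later marked correct or not by a human.

```python
def overconfidence_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one correctness label per confidence value")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical audit sample: high stated confidence, right only half the time.
gap = overconfidence_gap([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
```

A single aggregate gap hides where the miscalibration sits, so a fuller audit would repeat the comparison within confidence bands. Even the crude version, though, turns "seems sure of itself" into a number someone can be held to.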

What the papers do not prove

A good synthesis should also avoid overclaiming. The item-difficulty study uses pseudo-labelled targets anchored to Rasch estimates, so the modelling task inherits noise from both finite human response samples and the pseudo-labelling process. Its results support response-free prediction as a useful early layer, not as a replacement for response-based calibration when that calibration is feasible.

PRAIB focuses on ICLR and NeurIPS data and uses a zero-shot setting. Its citation checks depend on structured citation formats and public bibliographic databases. Its mathematical engagement classifier is itself model-based, although validated on expert annotations. The benchmark is useful because it makes behavioural divergence measurable, not because it settles every possible question about AI-assisted review.

For business readers, this boundary matters. The takeaway is not that every workflow needs a custom benchmark of PRAIB-level sophistication before using AI. The takeaway is that every serious AI evaluation workflow needs some equivalent of three things: a human anchor, a process-aware diagnostic, and an escalation policy.

Without those, the organisation is not deploying AI judgement. It is deploying confident formatting.

The real product is the calibration loop

The most useful way to read these papers is as a design pattern for AI operations.

In the old workflow, human judgement arrives after enough evidence accumulates. In the naive AI workflow, model output replaces that judgement too early. In the better workflow, AI produces provisional signals, humans provide anchors and corrections, and the system learns where it is useful, where it is brittle, and where it must defer.

That loop can be written simply:

$$ \text{Useful AI Evaluation} = \text{Early Signal} + \text{Human Anchor} + \text{Behavioural Diagnostic} + \text{Escalation Loop} $$

The formula is not mathematically profound. It is operationally unforgiving.

The item-difficulty paper gives the constructive half: build the early signal and align it with the process that creates the target. The peer-review paper gives the audit half: check whether the generated judgement behaves like expert judgement, especially where weaknesses, citations, mathematical engagement, and confidence matter.

Together, they make a strong case for AI as a judgement sidecar. It watches early, ranks attention, drafts structure, surfaces risks, and reduces the cost of getting to the real decision. It does not get to wear the wig and bang the gavel just because it writes in complete sentences.

That is the mature version of AI evaluation: not artificial authority, but calibrated assistance.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jan Netík and Patrícia Martinková, “Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning,” arXiv:2605.16991, 2026. https://arxiv.org/html/2605.16991
² Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, and Tomasz Jan Kajdanowicz, “PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing,” arXiv:2605.29815, 2026. https://arxiv.org/html/2605.29815
