When AI Grades Itself: The Quiet Failure of LLM-as-a-Judge in Clinical Translation

Translation is one of those AI use cases that sounds almost too reasonable to argue with. English medical data exist in large quantities. Many healthcare systems, researchers, and educators need non-English clinical text. Large language models are fluent, cheap, and obedient enough to produce thousands of translated reports before lunch. The spreadsheet smiles. The budget owner relaxes. The governance team is told that quality will be checked by another LLM.

That last sentence is where the trouble begins.

A recent comparative study on Japanese translations of chest CT reports does not simply ask whether an LLM can translate radiology reports well. It asks the more operationally dangerous question: can LLMs judge those translations the way radiologists do?¹ The answer is not a clean “no.” It is worse, and therefore more useful. The LLM judges were not random. They were systematic. They strongly preferred the LLM-generated translations across criteria, even when radiologists were far more divided. A machine grading another machine did not collapse into chaos. It collapsed into confidence.

For businesses building AI-assisted translation, dataset localization, clinical content workflows, or automated quality assurance, this is the part worth reading slowly. The paper is not an anti-LLM sermon. The LLM translations often looked good. One radiologist even preferred them on readability and overall quality. The failure is narrower and sharper: automated evaluation looked aligned with expert review because it used expert-sounding criteria, but its choices followed a different value function.

In other words, the judge wore a white coat. It still graded like a language model.

The study compares three things that should not be collapsed

The paper uses 150 chest CT reports from the validation set of CT-RATE-JPN. For each English report, the authors compared two Japanese versions: a human-edited translation from a multi-stage pipeline and a new LLM-generated translation produced by DeepSeek-V3.2 at temperature zero.

The human-edited version was not merely “human translation” in the casual sense. It began with machine translation by GPT-4o mini, then received review and correction by a radiology resident, and then refinement by a board-certified radiologist. The LLM-generated version was produced by DeepSeek-V3.2 using a prompt asking for natural, clinically accurate Japanese radiology-report language, with specific terminology constraints.

The evaluation had two tracks. Two radiologists, one board-certified radiologist and one postgraduate-year-5 resident, independently performed blinded pairwise evaluation. They did not know which translation was human-edited and which came from DeepSeek. Three LLM judges—DeepSeek-V3.2, Mistral Large 3, and GPT-5—evaluated the same translation pairs using the same four criteria: terminology accuracy, readability and fluency, overall clinical suitability, and radiologist-style authenticity.

This gives the paper its real structure. It is not one comparison. It is three comparisons stacked on top of one another:

Comparison	What it tests	Why it matters operationally
Human-edited vs. LLM-generated translation	Whether LLM output is competitive with a curated translation pipeline	Determines whether translation generation can scale low-risk corpora
Board-certified radiologist vs. resident	Whether expert judgment is stable even inside the same domain	Determines whether one human reviewer is enough to define “ground truth”
Radiologists vs. LLM judges	Whether automated judging can substitute for expert review	Determines whether quality assurance can be automated safely

A weaker article would summarize the study as “LLM judges are biased.” True, but too easy. The more useful lesson is that translation quality, expert preference, and automated evaluation alignment are separate layers. Treating them as one layer is how an AI workflow becomes very efficient at approving the wrong thing.

The LLM translations were shorter, smoother, and not obviously inferior

The first comparison is almost reassuring. DeepSeek’s translations were systematically different from the human-edited translations, but not in a way that screams failure.

The LLM-generated translations were shorter: median 481.5 characters versus 517.5 for the human-edited versions. They also averaged more sentences per report (20.3 versus 19.6) and shorter sentences (24.7 versus 27.3 characters). The text therefore became more segmented and compact. Anyone who has read enough LLM output can recognize the shape: smoother, cleaner, slightly more modular, and often a little too pleased with its own readability.

Lexical diversity complicates the story. Type-token ratio was higher for the DeepSeek translations when all parts of speech were counted, but that is partly expected because shorter texts tend to inflate this measure. When restricted to content words, the difference disappeared: 0.550 for DeepSeek versus 0.549 for the human-edited translations. The clinically meaningful vocabulary breadth, at least under this metric, was essentially comparable.
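The length effect on type-token ratio is easy to see on a toy example. The sketch below is illustrative only: the token list and the `FUNCTION_WORDS` set are made up, and it assumes the text has already been split into tokens (Japanese would first need a morphological analyzer, which is not shown here and is not taken from the paper).

```python
# Hypothetical, pre-tokenized toy "report"; not data from the study.
TOKENS = ["no", "pleural", "effusion", "no", "pneumothorax",
          "no", "nodule", "no", "mass"]
# Hypothetical stand-in for a part-of-speech filter.
FUNCTION_WORDS = {"no"}


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def content_words(tokens):
    """Drop function words so only content-bearing tokens remain."""
    return [t for t in tokens if t not in FUNCTION_WORDS]


short_ttr = type_token_ratio(TOKENS[:5])   # first 5 tokens only
full_ttr = type_token_ratio(TOKENS)        # all 9 tokens
content_ttr = type_token_ratio(content_words(TOKENS))

print(f"short text TTR:   {short_ttr:.3f}")
print(f"full text TTR:    {full_ttr:.3f}")
print(f"content-word TTR: {content_ttr:.3f}")
```

The truncated text scores higher than the full one simply because the repeated function word has had less room to pile up. That is why an all-words comparison between shorter and longer translations deserves caution, and why the content-word figure is the more informative one.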

So the paper does not present LLM translation as crude machine output wearing a fake badge. The LLM translation often had surface strengths. It was concise. It was natural. It was readable. Those qualities matter in education, research, and dataset construction. The question is whether they matter more than fidelity, convention, and clinical nuance.

That is where the two radiologists begin to diverge.

The two radiologists saw different versions of “good”

Radiologist 1 was relatively favorable toward the LLM translation. On terminology accuracy, this evaluator saw most pairs as equivalent: 59% ties, with 23% favoring the LLM translation and 17% favoring the human-edited translation. On readability, radiologist 1 favored the LLM in 51% of cases. On overall report quality, again 51% favored the LLM. For radiologist-style authenticity, radiologist 1 chose the LLM translation in 64% of definitive responses.

Radiologist 2 saw the world differently. On terminology accuracy, radiologist 2 also found many pairs equivalent, but favored the human-edited version more often: 35% human-edited versus 14% LLM. On readability, the dominant answer was equivalence: 75% ties. On overall quality, radiologist 2 favored the human-edited translation in 40% of cases, found 39% equivalent, and favored the LLM in only 21%. For radiologist-style authenticity, radiologist 2 chose the human-edited translation in 66% of definitive responses.
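Some of these percentages are shares of all pairs and others are shares of definitive responses, so the denominator matters when reading them. The sketch below shows the difference on invented votes. It assumes a "definitive" response means any response other than a tie, which is this article's reading rather than a definition quoted from the paper.

```python
from collections import Counter


def preference_shares(votes):
    """Return the LLM share among all votes and among non-tie votes.

    `votes` is a list of "llm", "human", or "tie" labels (hypothetical coding).
    """
    counts = Counter(votes)
    total = len(votes)
    definitive = counts["llm"] + counts["human"]
    return {
        "llm_share_of_all": counts["llm"] / total if total else 0.0,
        "llm_share_of_definitive": counts["llm"] / definitive if definitive else 0.0,
    }


# Invented votes for illustration; not the study's data.
toy_votes = ["llm"] * 9 + ["human"] * 5 + ["tie"] * 6
shares = preference_shares(toy_votes)
print(shares)
```

With many ties, the same reviewer can look mildly favorable on one denominator and strongly favorable on the other, so any comparison across criteria should state which one is in use.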

The disagreement is not a footnote. It is one of the paper’s most important findings. Quadratic weighted kappa (QWK) between the two radiologists ranged only from 0.012 to 0.059 across the four criteria, with raw agreement from 28% to 37%. That is slight-to-no agreement by conventional interpretation.
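For readers who want to see what quadratic weighted kappa measures, here is a minimal pure-Python sketch. The ordinal coding (human-edited preferred = 0, tie = 1, LLM preferred = 2) is an assumption made for illustration, and the two rating lists are invented; neither comes from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories=3):
    """QWK for two equal-length lists of integer labels in range(n_categories)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k, n = n_categories, len(rater_a)
    # Confusion matrix: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: distant categories are penalized more.
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[i][j]
            chance_disagreement += weight * row_totals[i] * col_totals[j] / n
    if chance_disagreement == 0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return 1.0 - disagreement / chance_disagreement


def raw_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


HUMAN, TIE, LLM = 0, 1, 2  # assumed ordinal coding
reader_1 = [LLM, LLM, TIE, HUMAN, LLM, TIE]      # invented ratings
reader_2 = [TIE, HUMAN, TIE, HUMAN, TIE, HUMAN]  # invented ratings
print("QWK:", quadratic_weighted_kappa(reader_1, reader_2))
print("raw agreement:", raw_agreement(reader_1, reader_2))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance given each rater's label distribution, and negative values mean systematic opposition. Because the statistic corrects for chance, two raters can share a fair number of identical labels and still land near zero.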

This looks alarming, but it should not be read as “humans failed, therefore use machines.” That would be the kind of conclusion an automation vendor writes on a slide at 11:47 p.m.

The authors note that in post hoc debriefing, the radiologists often found the translations nearly indistinguishable in overall quality, with preferences driven by subtle wording differences. In that setting, disagreement is not surprising. When two outputs are both plausible, a pairwise preference task can magnify small stylistic or interpretive differences. One evaluator may reward concise naturalness. Another may be more sensitive to conventional phrasing, uncertainty marking, or standard Japanese radiological register.

The business implication is uncomfortable but important: expert review is not a magic oracle. It is a professional judgment process. If a company wants to validate translated clinical materials, one expert reviewer may not define a stable universal standard. But expert variability is still variability within the right problem. The reviewers are arguing about clinical language. The LLM judges, as the next comparison shows, appear to be rewarding something else.

The LLM judges preferred the LLM output with suspicious enthusiasm

The third comparison is the quiet failure in the title.

All three LLM judges strongly favored the DeepSeek-generated translations across all criteria. For terminology accuracy, LLM preference rates ranged from 79% to 91%. For readability and fluency, they ranged from 70% to 95%. For overall report quality, 83% to 95%. For radiologist-style authenticity, the result was almost theatrical: DeepSeek favored the LLM translation in 99% of cases, Mistral in 93%, and GPT-5 in 95%.

This is not merely stronger confidence than the humans showed. It is a different distribution of judgment.

Criterion	Radiologist 1 pattern	Radiologist 2 pattern	LLM judge pattern
Terminology accuracy	Mostly equivalent (59% ties); slight LLM preference	Many equivalent; human-edited preferred over LLM, 35% vs. 14%	Strong LLM preference, 79–91%
Readability / fluency	LLM preferred in 51%	Equivalent in 75%	Strong LLM preference, 70–95%
Overall quality	LLM preferred in 51%	Human-edited 40%, equivalent 39%, LLM 21%	Strong LLM preference, 83–95%
Radiologist-style authenticity	LLM preferred in 64% of definitive responses	Human-edited preferred in 66% of definitive responses	LLM preferred in 93–99%

The agreement statistics make the misalignment explicit. Agreement between either radiologist and any LLM judge was near zero, with QWK ranging from -0.038 to 0.148. The LLMs sometimes agreed more with one another, with LLM-LLM QWK values reaching 0.286 in some pairings, but that internal agreement did not translate into meaningful agreement with radiologists.

This is the governance problem in one sentence: a group of automated judges can be mutually consistent while collectively misaligned.

Consistency is attractive because it is easy to measure, easy to dashboard, and easy to sell. But consistency only matters after the objective function is right. A spam filter that consistently labels invoices as spam is not “stable.” It is just a reliable nuisance with a graph.

The qualitative cases show why fluency is a dangerous proxy

The paper’s qualitative disagreement analysis is especially useful because it moves the discussion from aggregate bias to failure mechanism.

The authors identified cases where both radiologists preferred the human-edited translation while all three LLM judges unanimously preferred the LLM-generated version for the same criterion. These were not the majority of cases, and the paper is careful about that. They occurred in 5 cases for terminology accuracy, 4 for readability and fluency, 10 for overall quality, and 12 for radiologist-style authenticity. No case met this condition across all four domains simultaneously.

That makes this section exploratory rather than the main statistical proof. Its purpose is not to show that every LLM preference was wrong. Its purpose is to inspect what kinds of errors survived automated judgment.
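The selection rule for these cases is simple enough to write down. The sketch below assumes a hypothetical record layout (`id`, `radiologists`, `judges`) and invented case data; it only illustrates the filter described above, not the authors' actual tooling.

```python
HUMAN, TIE, LLM = "human", "tie", "llm"  # hypothetical label names


def expert_vs_judge_conflicts(cases):
    """Return ids of cases where every radiologist chose the human-edited
    translation and every LLM judge chose the LLM-generated one."""
    flagged = []
    for case in cases:
        radiologists = case["radiologists"]
        judges = case["judges"]
        if not radiologists or not judges:
            continue  # skip incomplete records rather than guess
        experts_prefer_human = all(v == HUMAN for v in radiologists)
        judges_prefer_llm = all(v == LLM for v in judges)
        if experts_prefer_human and judges_prefer_llm:
            flagged.append(case["id"])
    return flagged


# Invented ratings for one criterion; not the study's data.
toy_cases = [
    {"id": "case-01", "radiologists": [HUMAN, HUMAN], "judges": [LLM, LLM, LLM]},
    {"id": "case-02", "radiologists": [HUMAN, TIE], "judges": [LLM, LLM, LLM]},
    {"id": "case-03", "radiologists": [HUMAN, HUMAN], "judges": [LLM, TIE, LLM]},
    {"id": "case-04", "radiologists": [LLM, HUMAN], "judges": [LLM, LLM, LLM]},
]
conflicts = expert_vs_judge_conflicts(toy_cases)
print(conflicts)
```

The rule is deliberately strict: a single tie on either side removes the case. That keeps the flagged set small and unambiguous, which suits a review queue for human inspection more than it suits an error-rate estimate.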

The examples are not comforting. In one case, “sequela fibrotic changes” was rendered in a way approximating “prognostic symptom,” a clinically meaningful mistranslation. GPT-5 did not acknowledge the error in its rationale, while DeepSeek and Mistral explicitly praised the terminology as accurate. In another case, the Japanese term for the right thyroid lobe used reversed word order, deviating from standard anatomical nomenclature. All three LLM judges missed it. In a third case, the radiological term “tree-in-bud appearance” was translated literally into Japanese, even though in Japanese clinical radiology the term is conventionally retained in English. One LLM judge favored the literal rendering, and the others did not explicitly penalize it. Both radiologists preferred the human-edited version.

This is not a generic “AI makes mistakes” story. The sharper point is that the LLM judges’ rationales repeatedly leaned on words such as “concise” and “natural,” especially in readability and radiologist-style judgments. Those are not bad qualities. They are just not sufficient qualities. A translation can be natural and wrong. It can be concise because it removed a convention that mattered. It can sound like a report to a model because it resembles model-preferred report style, not because it matches local professional practice.
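A rationale term check of this kind can be approximated with a few lines of code. The sketch below counts how many rationales mention each term at least once. The rationale strings are invented for illustration and are not quotations from any judge in the study, and exact word matching is a simplifying assumption (it would miss "naturally" or "conciseness").

```python
import re
from collections import Counter


def count_rationales_mentioning(rationales, terms=("concise", "natural")):
    """Count rationales that contain each term as a whole word, case-insensitively."""
    counts = Counter({term: 0 for term in terms})
    for text in rationales:
        words = set(re.findall(r"[a-z]+", text.lower()))
        for term in terms:
            if term in words:
                counts[term] += 1
    return counts


# Invented judge rationales; not quotations from the paper.
toy_rationales = [
    "Translation B is more concise and reads as natural report language.",
    "B uses natural phrasing; A is wordy.",
    "A preserves the conventional anatomical term order.",
]
term_counts = count_rationales_mentioning(toy_rationales)
print(dict(term_counts))
```

A count like this says nothing about whether a given judgment was right. Its use is as a cheap warning light: if style words dominate the rationales on a terminology criterion, the judge is probably scoring something other than terminology.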

In clinical translation, style is not decoration. Register carries domain convention. Uncertainty wording, anatomical naming, loanword usage, and phrase order are part of professional meaning. A model that rewards surface fluency may accidentally punish the very awkwardness that preserves clinical fidelity.

What each test in the paper actually supports

One reason this paper is useful for business readers is that it separates evidence types. Not every number in a paper carries the same burden. Some results show the main effect. Some explain possible mechanisms. Some define boundaries. Mixing them together produces the familiar executive-summary soup: “LLMs good but risky.” Deliciously vague. Operationally useless.

Paper component	Likely purpose	What it supports	What it does not prove
Linguistic comparison of translation pairs	Implementation/context analysis	DeepSeek outputs were structurally shorter, more segmented, and comparable in content-word lexical diversity	That the LLM translation is clinically superior
Blinded radiologist pairwise ratings	Main expert-evaluation evidence	Human experts did not converge on a single preference; one favored LLM more than the other	That either radiologist alone represents universal clinical truth
LLM-as-a-judge ratings	Main automated-evaluation evidence	LLM judges systematically favored LLM-generated translations	That every LLM preference was wrong
QWK and percent agreement table	Alignment test	LLM judges had near-zero agreement with radiologists, despite some LLM-LLM agreement	That LLM judges are useless in all domains or all prompt settings
Representative disagreement cases	Exploratory mechanism evidence	LLM judges can miss clinically meaningful translation errors while praising fluency	The overall error rate of all LLM-generated translations
Rationale term analysis	Exploratory bias signal	“Concise” and “natural” appeared repeatedly when LLM judges favored LLM outputs in disagreement cases	A complete causal explanation of all judging behavior

This distinction matters. The paper’s strongest claim is not that DeepSeek translations are unsafe. It is that LLM-based judging, under this prompt framework and domain setting, was not aligned with radiologist assessment. That is already enough to change how a serious organization should design quality control.

The business lesson is not “never use LLMs”; it is “separate scaling from certification”

The practical temptation is obvious. If LLMs can translate medical reports and LLMs can judge translations, then a company can build a closed loop:

LLM translates → LLM evaluates → accepted outputs enter the dataset

That loop is cheap, fast, and terrifyingly clean. It also creates a self-reinforcing approval system where the evaluator may prefer the linguistic fingerprints of the generator. Even when the judge is not the same model, the paper suggests that different LLM judges can share similar stylistic priors. The bias is not merely “DeepSeek likes DeepSeek.” Mistral and GPT-5 also strongly favored the LLM translations.

A better workflow separates scaling from certification.

Use case	Reasonable LLM role	Required quality gate
Low-risk parallel corpus expansion	Generate translations and flag obvious defects	Sampling-based expert review
Internal education search or language-learning support	Produce readable translated materials	Expert review for canonical teaching files
Multilingual pre-training or fine-tuning corpus enrichment	Increase breadth and fluency at scale	Representative audits for terminology and uncertainty preservation
Clinical AI training data for downstream report generation	Draft or augment candidate data	Domain-expert adjudication before high-stakes use
Client-facing clinical, legal, or regulatory translation	Assist human professionals	Human professional approval; LLM judge cannot be final gate

The economic message is not anti-automation. It is about where automation belongs in the control stack. LLMs are useful for generating candidates, comparing obvious alternatives, detecting gross inconsistencies, and prioritizing review queues. They are less suitable as the final authority in a domain where the cost of being “smoothly wrong” is high.

For Cognaptus-style business process design, the architecture should be boring in exactly the right way:

Use LLMs to create volume.
Use automated checks to reduce noise.
Use domain experts to anchor the scoring rubric.
Feed expert disagreements back into rubric design.
Audit whether the automated judge agrees with experts before treating it as a control.

The expensive step does not disappear. It becomes more targeted. That is still automation. It is just automation with adult supervision, a rare but underrated product category.
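The last step in that list, auditing the judge against experts before trusting it, can be sketched as a small gate. Everything here is an assumption for illustration: the function name, the accept/reject framing, and especially the 5% threshold, which a real team would set from its own risk tolerance. The sample decisions are invented.

```python
def audit_judge(expert_accepts, judge_accepts, max_false_accept_rate=0.05):
    """Compare judge accept/reject decisions with expert decisions on an audit sample.

    Both arguments are equal-length lists of booleans (True = accepted).
    The default threshold is a hypothetical placeholder, not a recommendation.
    """
    if len(expert_accepts) != len(judge_accepts) or not expert_accepts:
        raise ValueError("need two non-empty decision lists of equal length")
    pairs = list(zip(expert_accepts, judge_accepts))
    agreement = sum(e == j for e, j in pairs) / len(pairs)
    # Judge decisions on the items that experts rejected.
    judge_on_rejected = [j for e, j in pairs if not e]
    if judge_on_rejected:
        false_accept_rate = sum(judge_on_rejected) / len(judge_on_rejected)
    else:
        false_accept_rate = None  # no expert rejections in sample; cannot assess
    usable = (false_accept_rate is not None
              and false_accept_rate <= max_false_accept_rate)
    return {
        "agreement": agreement,
        "false_accept_rate": false_accept_rate,
        "judge_usable_as_filter": usable,
    }


# Invented audit sample: experts rejected three items, the judge passed two of them.
expert = [True, True, False, False, True, False]
judge = [True, True, True, False, True, True]
report = audit_judge(expert, judge)
print(report)
```

The gate looks at false accepts separately from overall agreement because the two can diverge: a judge that approves nearly everything will agree with experts whenever most items are fine, while still waving through the few that matter.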

The hardest finding is that expert disagreement does not rescue automated judging

A lazy objection is available here: if the two radiologists had poor agreement, why privilege radiologists over LLM judges at all?

Because disagreement is not the same as irrelevance.

The two radiologists disagreed partly because the translation pairs were often close, and because professional style can be plural. That tells us the evaluation problem is hard. It does not imply that any confident judge is acceptable. The LLM judges did not merely disagree with one radiologist and align with the other. They produced a much stronger, more directional preference pattern than either human evaluator. On radiologist-style authenticity, for example, all LLM judges favored LLM output in at least 93% of cases, while the two human evaluators split in opposite directions.

That pattern suggests the LLM judges may be over-weighting surface polish. The paper links this to possible self-preference bias, fluency bias, and limited grounding in local radiology-report conventions. These are plausible mechanisms, not fully isolated causal proof. Still, they are sufficient to reject the operational fantasy that an LLM judge is a cheap stand-in for expert review.

The better conclusion is this: when experts disagree, evaluation design needs more expert structure, not less. Companies need clearer rubrics, adjudication samples, multiple reviewers for critical data, and measurements of judge alignment. Replacing a noisy expert process with a smooth automated process may improve the dashboard while degrading the truth.

Boundaries: where this evidence applies and where it does not

The study has a narrow and valuable scope. It used 150 chest CT reports, English-to-Japanese translation, one LLM generator, three LLM judges, two human evaluators, and one judge-prompt framework. It did not test every language pair, every clinical specialty, every model, or every possible prompt design. It also did not measure downstream educational outcomes, such as whether trainees learn differently from LLM-generated versus human-edited translations.

Those boundaries matter. A different language pair might behave differently. A stronger judge prompt with explicit terminology checklists might reduce some bias. A domain-specific evaluator trained or calibrated on radiologist adjudication could perform better. A larger panel of radiologists could reveal more stable preference clusters. None of those possibilities is ruled out.

But these limitations do not weaken the article’s central business takeaway. They make it more precise.

The paper shows that, in a realistic specialized translation task, fluent LLM-generated text can receive overwhelming approval from LLM judges while expert radiologists remain divided and sometimes prefer the curated human-edited version. Therefore, LLM-as-a-judge should not be treated as a validated quality-control substitute merely because it uses a medical-sounding rubric and produces structured JSON. The JSON may be strict. The judgment may still be loose.

Conclusion: the failure is quiet because the output looks good

The most dangerous AI failures are not always dramatic hallucinations. Sometimes the text is readable. The formatting is clean. The terminology is mostly plausible. The judge gives a confident rationale. The workflow passes.

That is what makes this paper worth business attention. It does not tell us that LLM translation is useless. It tells us that quality assurance is the fragile part of the system. Generation can scale faster than evaluation, and when evaluation is handed to another model without validation, the whole pipeline may begin optimizing for fluency, naturalness, and model-familiar style.

For low-risk corpus enrichment and broad educational support, LLM translation can be genuinely useful. For high-stakes teaching files, clinical AI training data, and any workflow where uncertainty wording or local professional convention matters, expert review remains a necessary quality gate. Not because humans are perfectly consistent. They are not. The paper makes that clear. But because they are at least contesting the right criteria.

The quiet failure of LLM-as-a-judge is not that it cannot explain itself. It explains itself beautifully. That is the problem.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yosuke Yamagishi et al., “Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study,” arXiv:2604.02207, 2026. https://arxiv.org/pdf/2604.02207
