When AI Grades Itself: The Quiet Failure of LLM-as-a-Judge in Clinical Translation
Translation is one of those AI use cases that sounds almost too reasonable to argue with. English medical data exist in large quantities. Many healthcare systems, researchers, and educators need non-English clinical text. Large language models are fluent, cheap, and obedient enough to produce thousands of translated reports before lunch. The spreadsheet smiles. The budget owner relaxes. The governance team is told that quality will be checked by another LLM. ...