Medical-Ai

The Mole Is Not the Model: Dermoscopy AI Needs a Chain of Custody

TL;DR for operators: This paper is not trying to win a skin-lesion classification leaderboard. Good. We have enough leaderboards already, many of them decorated with the usual confetti of optimistic AUCs and conveniently unexamined data provenance. The real contribution is a reproducible mechanism for constructing a clinically verified dermoscopic image dataset: standardized mobile-image acquisition, a structured 16-field metadata model, multi-stage diagnostic label verification, deduplication by cryptographic hash, and normalized diagnostic categories.[1] The authors then demonstrate the method by building a dataset of 1026 unique dermoscopic images from 443 patients collected in Russian outpatient practice between June 2025 and May 2026. The malignant cases are few (39 images), but all are histologically verified. ...
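As a rough illustration of the deduplication step named above, here is a minimal sketch. The excerpt says only "cryptographic hash"; the choice of SHA-256, the function names `sha256_digest` and `deduplicate`, and the (name, bytes) input shape are assumptions made for this example, not details taken from the paper.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw image bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(images: Iterable[Tuple[str, bytes]]) -> Tuple[List[str], Dict[str, str]]:
    """Keep the first image seen for each digest.

    Returns (kept_names, duplicates), where duplicates maps each dropped
    name to the name of the earlier image it duplicates.
    """
    first_seen: Dict[str, str] = {}  # digest -> first name seen with that digest
    kept: List[str] = []
    duplicates: Dict[str, str] = {}
    for name, data in images:
        digest = sha256_digest(data)
        if digest in first_seen:
            # Byte-identical to an earlier image: record it and drop it.
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
            kept.append(name)
    return kept, duplicates
```

A hash of this kind only catches byte-identical files. A re-encoded, resized, or cropped copy of the same lesion produces a different digest, so any real pipeline would need a separate check for near-duplicates.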

Pretty in Pink Is Not Enough: Virtual 3D H&E Needs Structural Proof

TL;DR for operators: The useful part of this paper is not that it makes label-free microscopy look like H&E. That is the easy headline, and also the easiest way to misunderstand the work. The paper introduces HistoBIT3D, a dataset that pairs phase-contrast Back-illumination Interference Tomography, or BIT, with voxel-wise registered fluorescence-labelled nuclei in 3D tissue volumes.[1] That matters because virtual staining has a basic governance problem: a generated image can look histological while quietly moving, deleting, or inventing cellular structure. In pathology, that is not a charming hallucination. It is the sort of thing that gets written up after the incident review. ...

Sink or Skill: Why Agent Experience Needs Governance

TL;DR for operators: AI agents do not become useful by remembering everything. That is not intelligence; it is a data landfill with a chatbot interface. Two recent arXiv papers, one on medical reasoning agents and one on physically based swimming control, make a shared operational point from very different directions. SkeMex shows how a medical agent can improve after deployment by converting interaction trajectories into structured, evaluated, and governed clinical skills.[1] SWIM shows how a simulated swimmer can learn robust control from a single reference motion when body-fluid interaction is represented at the right level and scarce experience is sampled efficiently.[2] ...

Label Me Twice, Generate Me Once: The New Discipline of Data-Efficient AI

In enterprise AI, the glamorous part is still the model. Bigger context windows, better agents, faster inference, shinier demos—the usual fireworks display. But for many real deployments, especially in healthcare, legal review, insurance, industrial inspection, and compliance, the real bottleneck is less theatrical: labeled data. Not just data. Labeled data. Not just labeled data. Correct labeled data. ...

Heart of Scale: Why Bigger ECG Models Don’t Always Beat Better Biases

A hospital does not buy an ECG model because it enjoys leaderboard furniture. It buys one because somebody wants a cheap, reliable signal from a noisy waveform: rhythm abnormality, structural heart disease, ICU risk, mortality risk, maybe a demographic or physiological clue that was not explicitly labeled during pre-training. ...

Scan You Believe It? Why RadAgent Makes Medical AI Show Its Work

Hospitals do not merely need an AI that can write a radiology report. They need an AI whose work can be checked before the report becomes somebody else’s problem. That sounds obvious, which is exactly why it is often ignored. A chest CT is a dense three-dimensional diagnostic object. A radiologist does not just glance at it, produce prose, and walk away. They inspect anatomy, compare regions, test impressions, look for omissions, and decide whether a finding is actually supported by the scan. Many vision-language models, by contrast, still behave like a polished black box: scan in, report out, confidence implied by typography. ...

Process Reward Agents — When Reasoning Learns to Judge Itself (Before It’s Too Late)

Reasoning systems have a familiar failure mode: they can sound calm while quietly walking off a cliff. A model begins with a plausible assumption, adds a second plausible sentence, then a third. By the time the final answer arrives, the mistake is no longer obvious because it has been wrapped in a competent-looking explanation. In low-stakes writing, this is annoying. In medicine, finance, compliance, or legal reasoning, it is a process failure masquerading as intelligence. ...

When Models Learn… or Just Get Easier: Decoding Adaptive AI Evaluation

Update Day Is Where Evaluation Gets Weird
Update day is usually presented as a clean managerial ritual. A model gets retrained. A validation report arrives. The new AUROC is higher, or at least not embarrassing. Everyone is invited to believe that the system has improved. That belief is comfortable. It is also incomplete. ...

When AI Grades Itself: The Quiet Failure of LLM-as-a-Judge in Clinical Translation

Translation is one of those AI use cases that sounds almost too reasonable to argue with. English medical data exist in large quantities. Many healthcare systems, researchers, and educators need non-English clinical text. Large language models are fluent, cheap, and obedient enough to produce thousands of translated reports before lunch. The spreadsheet smiles. The budget owner relaxes. The governance team is told that quality will be checked by another LLM. ...

When AI Starts Writing Papers: The Rise of the Medical AI Scientist

Papers used to have a useful quality: they were difficult to produce. Not always good, unfortunately, but difficult. Someone had to identify a problem, read the literature, design the method, write the code, run the experiment, repair the code, compare the result, draw the figures, write the manuscript, and then survive peer review with only minor emotional damage. ...