The Mole Is Not the Model: Dermoscopy AI Needs a Chain of Custody

TL;DR for operators

This paper is not trying to win a skin-lesion classification leaderboard. Good. We have enough leaderboards already, many of them decorated with the usual confetti of optimistic AUCs and conveniently unexamined data provenance.

The real contribution is a reproducible mechanism for constructing a clinically verified dermoscopic image dataset: standardized mobile-image acquisition, a structured 16-field metadata model, multi-stage diagnostic label verification, deduplication by cryptographic hash, and normalized diagnostic categories.¹ The authors then demonstrate the method by building a dataset of 1026 unique dermoscopic images from 443 patients collected in Russian outpatient practice between June 2025 and May 2026. The malignant cases are small in number—39 images—but all are histologically verified.

For operators, the practical lesson is simple but frequently ignored: in medical AI, the image is not the asset. The auditable record is the asset. A skin-lesion photo without acquisition context, patient-level linkage, verification status, and category policy is not a foundation for clinical AI. It is a JPEG with aspirations.

The paper’s strongest business relevance is not “train a better melanoma classifier.” At this size and imbalance, that would be a bold way to misunderstand the paper. The stronger pathway is external validation, domain-shift analysis, model interpretability studies, and dataset-governance design for dermatology decision-support tools.

The main boundary is also clear. “Clinically verified” does not mean every lesion is histologically confirmed. All malignant lesions are histology-confirmed, while benign and borderline lesions are consensus-confirmed because biopsy is not clinically indicated for many such cases. That distinction is not a footnote. It determines what this dataset can and cannot prove.

The dataset begins before the image exists

A common business mistake in medical AI is to treat a dataset as a container of labeled examples. Count the images, count the labels, split the rows, train the model, produce the deck. The dermatology version is especially tempting because skin lesions are visual. Surely the photo is the data.

The paper quietly dismantles that assumption. Its core object is not an image file. It is an image–patient pair converted into a de-identified dataset record containing the image, structured metadata, a diagnostic label, and a recorded verification level.

That distinction changes the entire evaluation problem. A dermoscopic image may look technically adequate, but if the acquisition conditions are inconsistent, the label pathway is opaque, the same patient leaks across train and test, and the diagnostic categories are normalized after the fact by someone’s private spreadsheet theology, then the model evaluation is already compromised. The neural network merely arrives later to formalize the confusion.

The authors frame dataset construction as a medical informatics process with three interlocking components:

Component	What it controls	Why it matters operationally
Image acquisition SOP	Equipment, optical conditions, framing, illumination, focus, quality scoring	Reduces avoidable input variation and makes image quality auditable
Metadata information model	16 structured fields across six clinical blocks	Turns images into clinically interpretable records rather than isolated pixels
Multi-stage expert verification	Initial annotation, three-specialist consensus, histology for malignant lesions	Separates clinical hypothesis from confirmed label and records reliability level

This is mechanism-first work. The paper’s importance lies less in the final count of 1026 images than in the repeatable route by which those images become usable records. In regulated or quasi-regulated AI workflows, that route is often more valuable than another thousand unloved training examples.

SOP is variance control wearing a lab coat

The first mechanism is the standard operating procedure for mobile dermoscopy. The paper focuses on outpatient conditions where an optical dermatoscope is coupled with a smartphone. That matters because the acquisition setup is not a stationary, tightly controlled imaging environment. It is closer to real clinical practice: distributed, variable, and therefore dangerous if treated casually.

The SOP regulates four groups of acquisition conditions: optical setup, acquisition parameters, illumination, and technical quality control. It specifies the use of polarized dermoscopy, standardization of dermatoscope tilt and contact pressure, lesion centering, focus checking, minimum resolution and file format, and avoidance of mixed illumination that could distort color balance.

The most operationally interesting part is the image quality score. Images are scored from 1 to 5. Scores 1 and 2 are excluded at acquisition; scores 3 to 5 proceed and retain the quality score as metadata. During construction, 32 images were excluded at this SOP stage.

That is not just cleaning. It is a design decision about variance. A model that performs badly on low-quality images may be revealing a real deployment weakness, or it may be punished by garbage acquisition that no clinician should have accepted in the first place. By documenting image quality and excluding unusable images early, the dataset separates acquisition failure from model failure.
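The quality gate described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `CapturedImage` type, the `image_id` values, and the `apply_quality_gate` helper are hypothetical, and only the 1 to 5 scale and the "exclude 1 and 2" rule come from the paper as summarized here.

```python
from dataclasses import dataclass

# Scores 1 and 2 are excluded at acquisition; 3 to 5 proceed.
MIN_ACCEPTED_SCORE = 3


@dataclass(frozen=True)
class CapturedImage:
    image_id: str             # hypothetical identifier
    image_quality_score: int  # 1 to 5, retained as metadata


def apply_quality_gate(images):
    """Split captures into (accepted, excluded) by quality score."""
    accepted, excluded = [], []
    for img in images:
        if not 1 <= img.image_quality_score <= 5:
            raise ValueError(f"score out of range for {img.image_id}")
        if img.image_quality_score >= MIN_ACCEPTED_SCORE:
            accepted.append(img)
        else:
            excluded.append(img)
    return accepted, excluded


captures = [
    CapturedImage("img-001", 5),
    CapturedImage("img-002", 2),
    CapturedImage("img-003", 3),
    CapturedImage("img-004", 1),
]
accepted, excluded = apply_quality_gate(captures)
print([i.image_id for i in accepted])  # ['img-001', 'img-003']
print([i.image_id for i in excluded])  # ['img-002', 'img-004']
```

Keeping the excluded list, rather than silently dropping it, is what makes the exclusion count reportable later.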

For a business building or validating dermatology AI, this matters because performance diagnostics need source attribution. Did sensitivity fall because the model missed subtle malignant structure, because the lighting shifted, because focus degraded, or because the input distribution no longer matches the training environment? Without acquisition controls, everyone gets to speculate. Conveniently, everyone will also blame the model.

The SOP turns at least part of that blame game into observable process metadata.

Metadata is where the photo becomes a clinical record

The second mechanism is the 16-field metadata model. The paper organizes metadata into six blocks: demographic, anamnestic, morphological, dermoscopic, diagnostic, and service.

Block	Fields included	Practical use
Demographic	`age_at_exam`, `sex`, `fitzpatrick_type`	Subgroup analysis and representativeness checks
Anamnestic	`sunburn_history`, `personal_ca_history`	Risk-context analysis
Morphological	`anatomical_site`, `lesion_diameter_mm`, `dominant_colors`, `elevation`, `border_regularity`	Clinical feature stratification
Dermoscopic	`dermoscopic_structures`	Interpretability and clinical-pattern comparison
Diagnostic	`clinical_diagnosis`, `histopathology_result`, `verification_stage`	Label provenance and reliability filtering
Service	`examiner_id`, `image_quality_score`	Acquisition and annotation auditability
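As a rough sketch, the 16 fields could be carried as a typed record. The field names come from the table above; the Python types, the `LesionRecord` name, and the example values are assumptions for illustration, since the paper's actual encodings are not described here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LesionRecord:
    # Demographic
    age_at_exam: Optional[int] = None
    sex: Optional[str] = None
    fitzpatrick_type: Optional[int] = None
    # Anamnestic
    sunburn_history: Optional[bool] = None
    personal_ca_history: Optional[bool] = None
    # Morphological
    anatomical_site: Optional[str] = None
    lesion_diameter_mm: Optional[float] = None
    dominant_colors: Optional[list] = None
    elevation: Optional[str] = None
    border_regularity: Optional[str] = None
    # Dermoscopic
    dermoscopic_structures: Optional[list] = None
    # Diagnostic
    clinical_diagnosis: Optional[str] = None
    histopathology_result: Optional[str] = None
    verification_stage: Optional[str] = None
    # Service
    examiner_id: Optional[str] = None
    image_quality_score: Optional[int] = None


FIELD_NAMES = [f.name for f in fields(LesionRecord)]

# Illustrative values only, not taken from the dataset.
record = LesionRecord(
    age_at_exam=47,
    anatomical_site="back",
    dermoscopic_structures=["network", "globules"],
    clinical_diagnosis="nevus",
    verification_stage="consensus",
    image_quality_score=4,
)
print(len(FIELD_NAMES))  # 16
```

Every field defaults to `None` so that missing metadata stays visible as missing instead of being filled with a plausible-looking default.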

The dermoscopic structures field is especially important. It records expert-annotated patterns such as network, globules, pseudopods, blue veil, and vascular structures. That makes the dataset useful not only for classification but also for interpretability analysis: comparing where a model “looks” against clinically meaningful structures.

This is where the paper becomes more relevant to AI governance than to pure computer vision. Business users often ask whether a clinical AI system can be trusted. That question is usually too vague to be useful. A better question is: when the model predicts melanoma, does its attention or explanation align with dermoscopic features a specialist would recognize as relevant?

The paper does not perform that interpretability analysis. It creates the conditions for it. That distinction matters. The contribution is not an explanation result; it is infrastructure for future explanation testing.

A less charitable summary would be: it adds fields. But that misses the point. In clinical AI, fields are where many of the actual controls live. The difference between a photo archive and a validation resource is not glamour. It is metadata discipline.

Verification_stage prevents the classic “gold standard” category error

The most important small field in the paper is `verification_stage`.

A reader may see “clinically verified dataset” and assume every lesion has histological confirmation. That is not what the paper says, and the difference matters. The authors distinguish between clinical confirmation and histological confirmation directly in the dataset model.

The verification chain has three stages:

Stage	What happens	What it supports
Initial clinical annotation	Dermatologist records diagnosis and dermoscopic structures	Preserves the initial clinical hypothesis
Consensus review	Three specialists independently review and resolve disagreements	Produces a confirmed consensus label
Histological verification	Suspected malignant cases receive a histopathology result	Provides morphological confirmation for malignant classes

All 1026 unique images underwent consensus review by three specialists. The 39 malignant images—18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas—also have histological verification. The remaining 987 records are consensus-verified, not histology-confirmed.

That is not a flaw by itself. Biopsying every benign lesion simply to make an AI dataset tidier would be clinically odd and ethically unsatisfying. The useful move is to preserve the verification level rather than pretending all labels have the same evidential status.

This has a direct consequence for evaluation design. The dataset supports stronger claims about malignant-class sensitivity, where every positive label is histologically confirmed, than about benign-class specificity, where negatives rest on specialist consensus rather than histology. If an operator wants a subset with stricter verification, `verification_stage` becomes a filter. If the operator ignores it, the evaluation becomes a small ritual performed for the comfort of the slide deck.
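Using the field as a filter might look like the sketch below. The stage values `"histology"` and `"consensus"` are assumed encodings, and the records and `select_by_verification` helper are hypothetical; the paper's actual value vocabulary is not given here.

```python
from collections import Counter

# Hypothetical records; stage values are assumed encodings.
records = [
    {"id": "r1", "clinical_diagnosis": "melanoma", "verification_stage": "histology"},
    {"id": "r2", "clinical_diagnosis": "nevus", "verification_stage": "consensus"},
    {"id": "r3", "clinical_diagnosis": "basal cell carcinoma", "verification_stage": "histology"},
    {"id": "r4", "clinical_diagnosis": "dysplastic nevus", "verification_stage": "consensus"},
]


def select_by_verification(records, allowed_stages):
    """Return only records whose verification stage is in allowed_stages."""
    allowed = set(allowed_stages)
    return [r for r in records if r.get("verification_stage") in allowed]


stage_counts = Counter(r["verification_stage"] for r in records)
histology_only = select_by_verification(records, {"histology"})

print(dict(stage_counts))                # {'histology': 2, 'consensus': 2}
print([r["id"] for r in histology_only])  # ['r1', 'r3']
```

Reporting `stage_counts` next to any metric keeps the evidential basis of the labels in the same place as the number that depends on it.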

Deduplication and category policy are not housekeeping

The paper’s cross-cutting procedures deserve attention because they are exactly the kind of boring controls that prevent expensive nonsense later.

The authors start with 1044 records after SOP quality control. They remove 18 duplicate records using a cryptographic hash of image file content, leaving 1026 unique images. The duplicate exclusion rate is 1.7%.

A 1.7% duplicate rate may sound small. In machine learning evaluation, small leaks can still be poisonous if duplicates or near-duplicates cross the train-test boundary. A hash of file content catches byte-identical duplicates regardless of filename or re-upload path. It does not catch a copy that has been re-encoded, resized, or saved in a different container format, because the bytes change. The authors note that perceptual hashing could later detect visually similar images that are not byte-identical.
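A minimal sketch of content-hash deduplication follows. SHA-256 is an assumption, since this summary says only "cryptographic hash", and the in-memory byte strings stand in for real image files. A byte-level hash matches only byte-identical content, which is why perceptual hashing is a separate, later step.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(files):
    """Keep the first file seen for each content hash.

    `files` is an iterable of (filename, bytes) pairs.
    Returns (unique_names, duplicate_names).
    """
    seen = set()
    unique, duplicates = [], []
    for name, data in files:
        digest = content_hash(data)
        if digest in seen:
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates


# Placeholder bytes standing in for image files.
files = [
    ("a.jpg", b"lesion-A-bytes"),
    ("renamed_copy.jpg", b"lesion-A-bytes"),  # same bytes, new name
    ("b.jpg", b"lesion-B-bytes"),
]
unique, duplicates = deduplicate(files)
print(unique)      # ['a.jpg', 'b.jpg']
print(duplicates)  # ['renamed_copy.jpg']

# Reported figures: 18 duplicates among 1044 records.
duplicate_rate_pct = round(100 * 18 / 1044, 1)
print(duplicate_rate_pct)  # 1.7
```

For real files, the same function applies to the bytes read from disk in binary mode; the filename never enters the hash.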

The second control is diagnostic category normalization. The final dataset uses nine classes: nevus, dysplastic nevus, melanoma, seborrheic keratosis, hemangioma, dermatofibroma, basal cell carcinoma, squamous cell carcinoma, and papilloma.

Dysplastic nevi are handled carefully. They are designated as a borderline group: melanocytic and potentially atypical, but not malignant by definition. In a binary benign/malignant screening task, the authors assign dysplastic nevi to the benign group by default, while preserving fields that allow users to change that policy or treat them separately.

This is a quiet but important governance point. Category policy is part of the dataset, not an afterthought. In medical AI, the same lesion category can play different roles depending on whether the task is screening, differential diagnosis, triage, or clinician education. If the original annotation is preserved, the downstream policy can be explicit. If it is collapsed too early, the policy is hidden inside the label. Naturally, hidden policy later reappears as “model behavior.”
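One way to keep category policy explicit is to derive the binary label at analysis time from the preserved nine-class label. The sketch below follows the default described above, with dysplastic nevi treated as benign; the `to_binary` function, its `borderline_policy` parameter, and the `"exclude"` option are hypothetical illustrations.

```python
MALIGNANT = {"melanoma", "basal cell carcinoma", "squamous cell carcinoma"}
BORDERLINE = {"dysplastic nevus"}
BENIGN = {
    "nevus",
    "seborrheic keratosis",
    "hemangioma",
    "dermatofibroma",
    "papilloma",
}


def to_binary(category, borderline_policy="benign"):
    """Map a nine-class label to 'benign' or 'malignant'.

    borderline_policy controls dysplastic nevi:
    'benign' (the default described in the article), 'malignant',
    or 'exclude' (returns None so the caller can drop the record).
    """
    if category in MALIGNANT:
        return "malignant"
    if category in BENIGN:
        return "benign"
    if category in BORDERLINE:
        if borderline_policy in ("benign", "malignant"):
            return borderline_policy
        if borderline_policy == "exclude":
            return None
        raise ValueError(f"unknown policy: {borderline_policy}")
    raise ValueError(f"unknown category: {category}")


print(to_binary("dysplastic nevus"))               # benign
print(to_binary("dysplastic nevus", "malignant"))  # malignant
print(to_binary("melanoma"))                       # malignant
```

Because the original label is never overwritten, changing the policy is a parameter change rather than a relabeling project.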

The evidence is descriptive, not experimental

The paper contains no model benchmark, no ablation study, and no performance comparison of algorithms. That is not an omission; it reflects the paper’s actual thesis.

The evidence consists mainly of methodology architecture, dataset construction statistics, descriptive distributions, and comparison with prior dataset documentation practices. The figures are descriptive: architecture, category distribution, age distribution, sex distribution, and images per patient. Their purpose is to characterize the constructed resource, not to prove classifier superiority.

Paper element	Likely purpose	What it supports	What it does not prove
Figure 1 methodology architecture	Implementation detail / main mechanism	Shows how SOP, metadata, verification, de-identification, deduplication, and normalization connect	Does not validate a model
Table 1 dataset comparison	Comparison with prior work	Positions the dataset against PH2, HAM10000, ISIC, BCN20000, PAD-UFES-20 by metadata, verification, SOP, and size	Does not rank datasets by clinical utility
Table 2 metadata model	Main mechanism	Defines the 16-field structure that makes records clinically interpretable	Does not guarantee metadata completeness for every field
Table 3 verification stages	Main mechanism	Defines the label reliability chain	Does not make benign labels histological gold standards
Figures 2–5 distributions	Main evidence / descriptive characterization	Shows class imbalance, demographics, and patient-level multiplicity	Does not establish general population prevalence
Table 5 summary characteristics	Main evidence	Consolidates dataset size, exclusions, duplicates, verification levels, and metadata count	Does not establish suitability as a full training base

This distinction is not pedantry. If an executive asks, “How accurate is the model?” after reading this paper, the correct answer is: wrong paper. The better question is: “What would a dataset need to contain before model accuracy claims become interpretable?” This paper answers that.

The constructed dataset is small, local, and more useful because it admits both

The final dataset contains 1026 unique dermoscopic images from 443 patients. Patients contributed between 1 and 10 images, with a median of 2 and mean of 2.32 images per patient. That patient-level multiplicity matters because splitting by image rather than by patient could inflate measured performance if lesions from the same person appear in both training and test sets.
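A patient-level split can be sketched with the standard library alone. The `patient_id` key is a hypothetical de-identified linkage field, not one of the 16 metadata fields, and the records are synthetic.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient spans both."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_patients]
    test = [r for r in records if r["patient_id"] in test_patients]
    return train, test


# Synthetic records: five patients contributing 1 to 4 images each.
records = [
    {"patient_id": f"P{p:02d}", "image_id": f"P{p:02d}-{i}"}
    for p, n in [(1, 3), (2, 1), (3, 2), (4, 4), (5, 1)]
    for i in range(n)
]

train, test = split_by_patient(records, test_fraction=0.2, seed=42)
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
print(len(train) + len(test) == len(records))  # True
print(overlap)                                 # set()
```

Teams already using scikit-learn may prefer its grouped splitters such as `GroupShuffleSplit`. The point in either case is that the unit of randomization is the patient, not the image.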

The class distribution is heavily imbalanced:

Category	Images	Share
Nevus	733	71.4%
Seborrheic keratosis	118	11.5%
Hemangioma	87	8.5%
Dermatofibroma	31	3.0%
Melanoma	18	1.8%
Dysplastic nevus	16	1.6%
Basal cell carcinoma	15	1.5%
Squamous cell carcinoma	6	0.6%
Papilloma	2	0.2%

This is not balanced benchmark material. It reflects outpatient practice, where benign pigmented lesions dominate preventive visits and differential diagnosis workflows. The 39 malignant images make up 3.8% of the dataset.
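The shares in the table can be re-derived from the counts, which is a cheap consistency check worth running on any dataset summary. The counts below are copied from the table; the variable names are illustrative.

```python
counts = {
    "Nevus": 733,
    "Seborrheic keratosis": 118,
    "Hemangioma": 87,
    "Dermatofibroma": 31,
    "Melanoma": 18,
    "Dysplastic nevus": 16,
    "Basal cell carcinoma": 15,
    "Squamous cell carcinoma": 6,
    "Papilloma": 2,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

malignant_classes = ("Melanoma", "Basal cell carcinoma", "Squamous cell carcinoma")
malignant_total = sum(counts[c] for c in malignant_classes)
malignant_share = round(100 * malignant_total / total, 1)

print(total)            # 1026
print(shares["Nevus"])  # 71.4
print(malignant_total)  # 39
print(malignant_share)  # 3.8
```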

The paper is explicit that this resource is not a complete training base for challenging multi-class diagnosis. That admission is refreshing, if only because medical AI literature does not always resist the temptation to make a small dataset sound like a universal engine of clinical transformation.

Its better role is as a pilot clinical resource: independent testing, domain adaptation analysis, interpretability study design, and method demonstration under Russian outpatient mobile-dermoscopy conditions.

That local specificity is not a weakness in the abstract. It is a feature if the business question is local deployment. A model trained on large international datasets may perform differently when imaging devices, skin phototype distributions, referral patterns, clinical workflows, and metadata practices shift. The paper’s dataset gives operators a way to examine that shift in a documented environment.

The boundary is that local validation is not global validation. A dataset collected in Russian outpatient practice should not be marketed as proof of worldwide generalization. This should be obvious. It often is not.

What Cognaptus infers for business use

The paper directly shows a methodology and a pilot dataset. It does not show clinical deployment impact, diagnostic accuracy improvement, cost reduction, or physician workflow adoption.

From that direct evidence, Cognaptus infers four practical business uses.

First, the methodology can inform dataset governance. The combination of SOP, structured metadata, verification-stage recording, and deduplication creates a template for documenting how medical image records are produced. This is useful for AI teams that need auditable data lineage rather than a loose folder of images with optimistic filenames.

Second, the dataset can support external validation of dermatology models trained elsewhere. Because the source environment differs from large international datasets, it can help reveal domain-shift behavior under mobile dermoscopy and Russian outpatient conditions. The strength here is not sample size. It is controlled description of the target environment.

Third, the `dermoscopic_structures` field can support interpretability studies. If model attention or explanation methods are compared against expert-annotated patterns, the evaluation becomes more clinically grounded. This still requires careful design; attention maps are not automatically explanations, despite the industry’s heroic effort to pretend otherwise.

Fourth, the `verification_stage` field enables risk-aware evaluation subsets. A team can evaluate malignant-class sensitivity on histologically verified cases while treating benign-class specificity claims more cautiously. This is exactly the sort of nuance that should appear in model validation reports and rarely survives contact with marketing.

Business question	What the paper enables	What remains uncertain
Can we build a better local validation set?	Yes, by replicating the three-component construction methodology	Whether other clinics can implement the SOP consistently
Can this dataset train a production classifier?	Not as a complete base; the paper positions it for pilot analysis and validation	Performance would depend on expansion, balancing, and external validation
Can we audit model behavior by subgroup?	Metadata fields create the possibility	Field completeness and sample sizes may limit subgroup power
Can we claim histological gold standard across all labels?	No	Only malignant cases are histologically confirmed
Can we study domain shift from international datasets?	Yes, at pilot scale	Generalization beyond Russian outpatient mobile dermoscopy remains unproven

The operational lesson is chain of custody

The useful business abstraction is chain of custody. Every dataset record should answer five questions:

How was the image acquired?
What patient and lesion context is available?
Who annotated it, and under what controlled vocabulary?
How was the diagnosis verified?
Could the same patient, lesion, or image leak across evaluation boundaries?
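The five questions can be turned into a rough per-record audit. The mapping of questions to fields below is one possible reading, not something the paper specifies, and `patient_id` and `content_hash` are hypothetical keys added to cover the leakage question.

```python
# Which fields must be present for each custody question to be answerable.
# This mapping is an assumption made for illustration.
CUSTODY_CHECKS = {
    "acquisition": ["image_quality_score"],
    "context": ["age_at_exam", "anatomical_site"],
    "annotation": ["examiner_id", "clinical_diagnosis"],
    "verification": ["verification_stage"],
    "leakage": ["patient_id", "content_hash"],  # hypothetical keys
}


def audit_record(record):
    """Return {question: [missing fields]} for unanswered questions."""
    gaps = {}
    for question, required in CUSTODY_CHECKS.items():
        missing = [f for f in required if record.get(f) in (None, "")]
        if missing:
            gaps[question] = missing
    return gaps


complete = {
    "image_quality_score": 4,
    "age_at_exam": 52,
    "anatomical_site": "back",
    "examiner_id": "E07",
    "clinical_diagnosis": "nevus",
    "verification_stage": "consensus",
    "patient_id": "P0123",
    "content_hash": "placeholder-digest",
}
partial = {**complete, "verification_stage": None, "content_hash": ""}

print(audit_record(complete))  # {}
print(audit_record(partial))
# {'verification': ['verification_stage'], 'leakage': ['content_hash']}
```

Field presence is a necessary check, not a sufficient one: whether leakage actually occurred is a dataset-level question that depends on how the splits are drawn.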

This paper implements those questions in a specific dermatology setting. The wider lesson extends to other clinical imaging domains. Radiology, pathology, ophthalmology, ultrasound, wound imaging—the modalities differ, but the data-governance problem is familiar. Models inherit the sins of the dataset. They merely express them at scale.

For AI vendors, the paper suggests a practical procurement question: when a dataset provider says “verified,” ask for the verification field, not the adjective. Ask whether labels distinguish initial clinical impression, consensus review, and histological or morphological confirmation. Ask whether patient-stratified splitting is possible. Ask whether acquisition quality is stored as metadata. Ask whether duplicates were removed by content hash, and whether a near-duplicate policy exists.

For healthcare organizations, the paper suggests that data readiness is not achieved by exporting images from clinical systems. The organization needs capture protocols, metadata forms, specialist review workflows, de-identification rules, and maintenance procedures. The fashionable term is “AI readiness.” The unfashionable reality is administrative discipline.

Boundaries that materially change interpretation

The paper’s limitations are not ceremonial. They determine how the resource should be used.

The first boundary is class imbalance. Nevi dominate the dataset, while malignant classes are sparse: 18 melanoma, 15 basal cell carcinoma, and 6 squamous cell carcinoma images. Per-class metrics on so few cases carry wide uncertainty, and naïve multi-class training is unattractive. Any evaluation should split by patient and report rare-class metrics with their uncertainty rather than as confident point estimates.
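To see how little 18 cases can support, consider a Wilson score interval for sensitivity. The "16 of 18 detected" figure below is purely hypothetical, since the paper reports no model results; only the count of 18 melanoma images comes from the dataset, and the 95% level (z = 1.96) is a conventional choice.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: a model detects 16 of the 18 melanoma images.
low, high = wilson_interval(16, 18)
print(f"point estimate: {16 / 18:.3f}")
print(f"95% interval:   {low:.3f} to {high:.3f}")
```

Under these assumptions, a point estimate near 0.89 comes with an interval running from roughly 0.67 to 0.97. A validation report that quotes only the point estimate is leaving out most of the finding.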

The second boundary is regional specificity. The dataset reflects Russian outpatient practice and mobile dermoscopy. That is valuable for local relevance but insufficient for global claims. A model performing well here might still behave differently elsewhere; a model performing poorly here might reveal local domain mismatch rather than universal inadequacy.

The third boundary is metadata completeness. The information model defines the fields, but some real-world records may have incomplete metadata. Retaining such records with explicit completeness labeling is sensible, but downstream analyses must account for missingness rather than quietly dropping inconvenient rows until the results improve.
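Accounting for missingness starts with measuring it. The sketch below computes per-field completeness and attaches an explicit flag to each record; the `metadata_complete` key, the helper names, and the sample records are hypothetical, and only three of the 16 fields are shown for brevity.

```python
def completeness_report(records, field_names):
    """Return the fraction of records with a non-missing value per field."""
    n = len(records)
    if n == 0:
        raise ValueError("no records")
    return {
        f: sum(1 for r in records if r.get(f) is not None) / n
        for f in field_names
    }


def label_completeness(records, field_names):
    """Return copies of records with an explicit completeness flag."""
    return [
        {**r, "metadata_complete": all(r.get(f) is not None for f in field_names)}
        for r in records
    ]


field_names = ["age_at_exam", "sex", "fitzpatrick_type"]
records = [
    {"age_at_exam": 54, "sex": "F", "fitzpatrick_type": 2},
    {"age_at_exam": 37, "sex": None, "fitzpatrick_type": 3},
    {"age_at_exam": 61, "sex": "M"},  # fitzpatrick_type absent
]

report = completeness_report(records, field_names)
labeled = label_completeness(records, field_names)

for name, fraction in report.items():
    print(f"{name}: {fraction:.2f}")
print([r["metadata_complete"] for r in labeled])  # [True, False, False]
```

Flagging incomplete records instead of deleting them keeps the decision to exclude visible in the analysis where it is made.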

The fourth boundary is verification asymmetry. Malignant lesions are histologically verified; benign and borderline lesions are consensus-confirmed. This supports stronger claims about malignant-label reliability than about benign histological specificity. It is not a scandal. It is medicine. The scandal would be pretending otherwise.

The fifth boundary is access. The dataset is non-public and used within a closed research environment. That limits immediate external reproducibility of analyses on the data itself. The methodology is therefore more transferable than the dataset.

Conclusion: reliable medical AI starts before training

The paper’s message is not glamorous, which is part of its value. It says that medical image AI begins with controlled acquisition, structured metadata, explicit label provenance, patient-level linkage, deduplication, and category policy. Only then do model claims become interpretable.

This is not the kind of work that produces an instantly viral benchmark result. It is the kind of work that prevents expensive benchmark results from being meaningless. In clinical AI, that is a useful trade.

The broader industry should read this as a reminder that “data quality” is too vague to guide action. The actionable version is: document the acquisition procedure, store the clinical context, preserve the verification level, prevent patient leakage, normalize categories without destroying source meaning, and admit where the dataset stops.

The mole is not the model. The model is not the product. And the product is not clinically credible unless the data record has a chain of custody. Annoying, perhaps. But considerably less annoying than deploying a confident classifier whose labels were built on vibes and resized thumbnails.

Cognaptus: Automate the Present, Incubate the Future.

¹ Elena S. Kozachok, “Methodology for Creating a Clinically Verified Dermoscopic Image Dataset,” arXiv:2605.25168v1, 24 May 2026, https://arxiv.org/abs/2605.25168.
