Symbols are easy to digitize and surprisingly hard to respect.

A business team sees two product names, two supplier records, two compliance clauses, or two scanned forms that look related. The lazy engineering answer is: “label the matches, label the non-matches, train a contrastive model.” That answer often works. It is also how many embedding systems quietly turn uncertainty into false certainty, then call the result “semantic similarity.” Very tidy. Very confident. Occasionally very wrong.

The paper “Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning” studies a more delicate version of the same problem: how to learn visual similarity between glyphs and writing systems when some labels are reliable but historical relationships between scripts are uncertain.[1] The application is ancient and historical scripts. The operational lesson is broader: when supervision is asymmetric, the training pipeline should also be asymmetric.

The paper’s key move is simple enough to explain, but easy to miss. It does not say “use contrastive learning everywhere.” It says: use contrastive learning only where negative examples are epistemically safe, then switch to self-supervised adaptation where negative examples would smuggle in assumptions.

That is the mechanism worth caring about.

The real problem is not missing labels; it is unsafe negatives

In ordinary contrastive learning, positive pairs are pulled together and negative pairs are pushed apart. For handwritten character recognition, this is natural. Two drawings of the same glyph should be close. Two different fictional glyphs from invented alphabets can be treated as different classes. The model learns invariance to handwriting, rotation, shear, translation, and other variation without needing a lecture on archaeology.

Historical scripts are less cooperative.

A Greek character, a Phoenician form, a Brahmi-derived sign, or a later script variant may be visually similar because of lineage, borrowing, shared graphic conventions, independent convergence, or merely because humans keep making strokes with hands rather than tentacles. Some relationships are known. Some are debated. Some are incomplete. Treating every cross-script pair as a negative example means the model is not just learning visual structure; it is being told a theory of history.

That is the paper’s central warning. The danger is not that the model lacks enough examples. The danger is that the “negative” label may be a disguised historical claim.

For business readers, this is the important translation. In many enterprise datasets, the risky label is not the obvious positive label. It is the negative label: “not the same customer,” “not the same vendor,” “not the same risk category,” “not the same policy intent,” “not a related complaint,” “not a duplicate product.” When the organization is unsure, forcing negatives can make the embedding space look clean while making the decision logic brittle.

The paper’s answer is a two-stage training design:

| Stage | Data where it is used | What supervision is allowed to do | What it avoids |
| --- | --- | --- | --- |
| Stage 1: supervised contrastive learning | Invented or modern fictional scripts with reliable glyph identities | Build a discriminative teacher model using safe positives and safe negatives | Avoid using uncertain historical relationships as training labels |
| Stage 2: teacher-initialized self-distillation | Unlabeled historical scripts | Adapt the representation to real historical glyph variation without cross-script negatives | Avoid pushing potentially related scripts apart |
| Evaluation | Held-out historical scripts and curated script relationships | Test whether the resulting geometry ranks related scripts closer | Avoid claiming full evolutionary reconstruction from visual similarity alone |

This is not a new trick pasted onto an old problem. The split between “what we can supervise” and “what we should leave exploratory” is the actual contribution.

Stage 1 builds a clean prior where history cannot complain

The first stage trains an encoder using supervised contrastive learning on invented or modern alphabets in Omniglot. These include scripts where character identity is unambiguous and historical dependence is not the target problem.

The paper reports 15 fictional or modern scripts for supervised training, with $G = 350$ distinct glyph classes. Each class has multiple handwritten instances, and the authors add affine augmentations such as rotation, shear, zoom, and translation. This matters because the teacher is not being trained to memorize a single clean symbol. It is trained to recognize a character despite deformation.

The supervised contrastive loss has the usual structure: embeddings of the same glyph class are attracted; embeddings from different glyph classes are treated as negatives. In this controlled setting, that is acceptable. A Futurama alien glyph and a Tolkien Tengwar glyph do not carry the same historical uncertainty as, say, a Greek-Latin-Cyrillic comparison. The model can be strict because the labels are not pretending to settle an archaeological debate.
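To make the attract-and-repel structure concrete, here is a minimal sketch of a supervised contrastive loss in plain Python. It follows the commonly used formulation in which each anchor is scored against every other sample in the batch; the function names and the temperature of 0.1 are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive.

    temperature=0.1 is an illustrative default, not a value from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Scaled similarity of the anchor to every other sample in the batch.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp over all non-anchor samples (positives and negatives).
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        # Average negative log-probability assigned to the true positives.
        total -= sum(logits[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D "glyph embeddings": two tight pairs pointing in different directions.
glyphs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tight = supcon_loss(glyphs, ["a", "a", "b", "b"])  # labels agree with geometry
mixed = supcon_loss(glyphs, ["a", "b", "a", "b"])  # labels fight the geometry
```

The loss is lower when the labels agree with the geometry. Note what the denominator does: every different-class sample in the batch is treated as a negative, which is exactly the assumption that is harmless for invented scripts and risky for historical ones.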

The output of this stage is a teacher encoder: a representation space with strong intra-class clustering and inter-class separation. In business language, Stage 1 learns a disciplined visual prior from a domain where labels are clean enough to justify discipline.

That prior is useful. It is also not enough.

A frozen teacher trained only on safe invented scripts may distinguish characters well, but the downstream target is script similarity among historical systems. The paper therefore asks whether the clean prior can be adapted without imposing false separations. This is where Stage 2 becomes more interesting than yet another contrastive-learning recipe.

Stage 2 adapts without turning uncertainty into a negative label

The second stage uses a BYOL-inspired teacher-student setup. BYOL is a self-supervised method that learns by making an online network predict the representation of a target network under different views, with the target updated by exponential moving average. Crucially, it does not require explicit negative pairs.
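The two moving parts of that description can be sketched in a few lines. This is a simplified illustration of standard BYOL mechanics, not the paper's implementation: real training applies the loss symmetrically over both views, blocks gradients through the target, and operates on network weights rather than flat lists. The decay value of 0.99 is an assumed placeholder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def byol_loss(online_prediction, target_embedding):
    """2 - 2*cos: zero when the two views agree. There is no negative term."""
    return 2.0 - 2.0 * cosine_similarity(online_prediction, target_embedding)


def ema_update(target_weights, online_weights, decay=0.99):
    """Move each target weight a small step toward the matching online weight.

    decay=0.99 is an assumed placeholder, not a value from the paper.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_weights, online_weights)]


agreeing = byol_loss([1.0, 0.0], [1.0, 0.0])   # 0.0: both views map to the same direction
opposed = byol_loss([1.0, 0.0], [-1.0, 0.0])   # 4.0: the worst possible disagreement
new_target = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

The loss only ever compares two views of the same item. Nothing in it pushes two different items apart, which is the property the rest of the argument depends on.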

The authors adapt this idea in three ways.

First, both the student and target network are initialized from the supervised teacher produced in Stage 1. Standard BYOL typically trains from random initialization; here, the model begins with a structured representation already learned from safe labels.

Second, the model omits the projection MLP normally used in BYOL. The paper argues that the backbones already produce compact low-dimensional embeddings, so an extra projector would add complexity and overfitting risk.

Third, instead of relying only on two synthetic augmentations of the same image, the method uses genuine handwritten variants of the same character, plus geometric augmentations. That is a good fit for glyphs: handwriting variation is not noise added for machine learning convenience; it is part of the phenomenon.
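The first and third adaptations are easy to picture as code. The sketch below is a hypothetical illustration under stated assumptions: weights are stood in for by a plain dictionary, images by file names, and the helper names `init_student_and_target` and `sample_view_pair` are invented for this example. The geometric augmentations the paper mentions would be applied to each view afterwards and are omitted here.

```python
import copy
import random


def init_student_and_target(teacher_weights):
    """Give both networks their own independent copy of the Stage 1 teacher weights."""
    return copy.deepcopy(teacher_weights), copy.deepcopy(teacher_weights)


def sample_view_pair(variants_by_glyph, rng):
    """Pick two different handwritten instances of one glyph as the two views."""
    # Only glyphs with at least two instances can supply a genuine pair.
    eligible = sorted(g for g, items in variants_by_glyph.items() if len(items) >= 2)
    glyph = rng.choice(eligible)
    view_a, view_b = rng.sample(variants_by_glyph[glyph], 2)
    return glyph, view_a, view_b


# Hypothetical data: file names standing in for handwritten glyph images.
teacher = {"conv1": [0.1, 0.2]}
student, target = init_student_and_target(teacher)

variants = {
    "alpha": ["a1.png", "a2.png", "a3.png"],
    "beta": ["b1.png"],            # a single instance cannot form a pair
    "gamma": ["g1.png", "g2.png"],
}
glyph, view_a, view_b = sample_view_pair(variants, random.Random(0))
```

Every pair produced this way is a positive pair by construction. No pair is ever drawn across glyphs or across scripts, so no cross-script judgment enters the training signal.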

The operational logic is clean:

  1. Learn discrimination where discrimination is legitimate.
  2. Transfer that structure into the uncertain domain.
  3. Let the representation reorganize under real target-domain variation.
  4. Do not define cross-script negatives where the relationship may be historically unresolved.

This is the paper’s mechanism-first insight. It is less glamorous than declaring that AI can “read history,” but considerably more useful. The model is not discovering human civilization in vector space. It is avoiding a common training error: using the absence of certainty as if it were certainty of absence.

Script similarity is built from glyph similarity, not declared directly

The paper does not train a model with direct script-lineage labels. Instead, it learns glyph embeddings and then derives script-level distances from them.

At the glyph level, each image is mapped into a normalized embedding space. Similarity is cosine similarity; dissimilarity is $1 - \text{cosine similarity}$.

At the script level, each script is treated as a set of glyphs. To compare two scripts, the method uses nearest-neighbor matching: for every glyph in script A, find the closest glyph in script B, average those distances, then symmetrize the result by doing the comparison in both directions.

This design has an important historical intuition. Multiple glyphs from one script may map to the same glyph in another. That is not necessarily a bug. In real writing-system evolution, one form may split, merge, simplify, or be reused. A rigid one-to-one matching would make the model look mathematically tidy while treating historical transformation like a spreadsheet join. History, inconsiderately, did not optimize for clean database schemas.
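The nearest-neighbor matching described above can be sketched as follows. The glyph-level distance is the cosine dissimilarity from the text; averaging the two directed distances is one plausible way to symmetrize and is an assumption here, since the exact rule is not spelled out in this article. The function names are illustrative.

```python
import math


def cosine_distance(u, v):
    """Glyph-level dissimilarity: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def directed_distance(script_a, script_b):
    """For each glyph in A, take its closest glyph in B, then average."""
    nearest = [min(cosine_distance(g, h) for h in script_b) for g in script_a]
    return sum(nearest) / len(nearest)


def script_distance(script_a, script_b):
    """Symmetrize by averaging both directions (an assumed choice)."""
    return 0.5 * (directed_distance(script_a, script_b)
                  + directed_distance(script_b, script_a))


# Toy scripts as lists of 2-D glyph embeddings.
script_a = [[1.0, 0.0], [0.0, 1.0]]
script_b = [[1.0, 0.0]]
distance = script_distance(script_a, script_b)  # 0.25
```

In the toy example both glyphs of `script_a` are matched to the single glyph of `script_b`. That many-to-one behavior is permitted by design rather than penalized.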

The paper then evaluates whether these induced script distances align with curated linguistic and historical similarity levels.

Two script-level metrics matter most:

| Metric | What it asks | Why it matters |
| --- | --- | --- |
| NDCG@10 | Are historically related scripts placed near the top of the nearest-neighbor ranking? | Measures top-ranked retrieval quality, which is often what users inspect first |
| Spearman correlation | Does the global ranking of distances increase as historical similarity decreases? | Measures broad monotonic alignment across script pairs |

This distinction becomes important in the results. The hybrid method is strongest at improving top-neighbor ranking, not uniformly dominant on every global ordering metric.
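For readers who want to see what the two metrics compute, here is a small stdlib-only sketch. It uses the linear-gain form of NDCG and rank-based Spearman with average ranks for ties; whether the paper uses exactly these variants is an assumption. The sample values are invented for illustration.

```python
import math


def ndcg_at_k(relevance_in_ranked_order, k=10):
    """NDCG@k with linear gain; input is relevance listed in the model's ranked order."""
    def dcg(values):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(values[:k]))
    ideal = dcg(sorted(relevance_in_ranked_order, reverse=True))
    return dcg(relevance_in_ranked_order) / ideal if ideal > 0 else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented example: relevance of the neighbors a model returned, best-ranked first.
perfect_top = ndcg_at_k([3, 2, 1, 0])   # 1.0: ideal ordering
inverted_top = ndcg_at_k([0, 1, 2, 3])  # below 1.0: related scripts ranked last

# Invented example: model distances against curated dissimilarity levels.
rho = spearman([0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4])
```

NDCG discounts each position logarithmically, so mistakes near the top cost the most. Spearman weighs every pair in the ordering equally, which is why the two can disagree about which model is better.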

The main evidence supports top-neighbor quality, not universal superiority

The paper compares the hybrid framework against supervised contrastive learning alone, BYOL, Barlow Twins, and DINOv2-ViT-S/14. It tests several backbones: Simple CNN, Siamese CNN, ResNet-18, ResNet-34, and ResNet-50. The reported metrics include 20-way 1-shot glyph retrieval accuracy, NDCG@10, and Spearman correlation.

The headline result is narrower and more useful than “our model wins.”

The hybrid method achieves the best NDCG@10 on three of the five backbones: Simple CNN, ResNet-34, and ResNet-50. It does not win NDCG@10 on Siamese CNN, where SupCon is best, or on ResNet-18, where Barlow Twins is best. On ResNet-50, the hybrid method reaches NDCG@10 of 0.3178, compared with 0.2997 for Barlow Twins and 0.2708 for BYOL. That is the cleanest numerical support for the claim that teacher-initialized self-distillation improves top-ranked script-neighbor quality.

Spearman correlation tells a different story. The hybrid method achieves the best Spearman result on Simple CNN, with $\rho = 0.640$, and remains competitive on Siamese CNN at $\rho = 0.594$. But BYOL performs better on Spearman for the ResNet-based architectures. This is not a contradiction. NDCG@10 rewards getting the closest neighbors right; Spearman rewards the full global ordering across pairs. A model can be better at the top of the list while less smooth across the entire ranked universe.

That distinction is not a footnote. It is the difference between two product use cases.

If the application is “show me the most plausible neighboring scripts,” NDCG@10 is the more relevant signal. If the application is “produce a full ranked map of all script distances,” Spearman deserves more attention. The paper’s hybrid method looks most convincing for the first case.

A compact reading of the evidence looks like this:

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| NDCG@10 benchmark across backbones | Main evidence | The hybrid method often improves top-ranked script-neighbor quality, especially on ResNet-50 | It does not prove dominance on every architecture |
| Spearman correlation | Complementary global ranking test | The learned distances can align with curated historical similarity levels | It does not show the hybrid method is always best globally |
| 20-way 1-shot glyph retrieval | Secondary discriminative test | The representation remains useful for glyph recognition | It is not the central evidence for script-level historical similarity |
| SupCon-only comparison | Ablation-like test for Stage 2 | Tests whether supervised pretraining alone is enough | It cannot isolate every architectural interaction |
| Target vs student inference results | Implementation/stability comparison | Checks whether EMA target or trained student gives better embeddings | It is not a separate conceptual method |
| DINOv2 frozen and fine-tuned baseline | Comparison with general vision foundation features | Generic natural-image features transfer weakly to this glyph domain | It does not imply foundation models are useless after serious domain adaptation |
| Separability ratio with Greek, Latin, and CJK | Geometric diagnostic / exploratory support | Stage 2 can sharpen historically plausible proximities in a local example | It is not a full phylogenetic validation |

The separability ratio is especially useful, but only if read correctly. The paper compares the distance between Greek and Latin against their average distance to CJK scripts. A lower ratio means Greek and Latin are proportionally closer to each other than to CJK. The teacher model yields $\mathcal{R} = 0.323$; the student model yields $\mathcal{R} = 0.210$, a 35% reduction. That supports the claim that Stage 2 does not merely compress all distances. It selectively sharpens a historically plausible neighborhood.
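The arithmetic behind that 35% figure is easy to check. The sketch below also writes out the ratio as described in the text, with the caveat that the exact averaging over CJK scripts is an assumption and the function names are illustrative.

```python
def separability_ratio(d_greek_latin, d_greek_cjk, d_latin_cjk):
    """Greek-Latin distance relative to their mean distance to CJK (assumed form)."""
    return d_greek_latin / (0.5 * (d_greek_cjk + d_latin_cjk))


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Ratios reported in the paper for the teacher and the student.
teacher_ratio, student_ratio = 0.323, 0.210
drop = relative_reduction(teacher_ratio, student_ratio)  # roughly 0.35
```

Because the quantity is a ratio, shrinking every distance by the same factor would leave it unchanged. A drop therefore indicates a change in relative geometry, not uniform compression.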

But this is a diagnostic, not a declaration that the model has reconstructed writing-system evolution. The paper itself points toward future phylogenetic analyses rather than claiming to have already carried them out. This restraint is welcome. We should enjoy it. It is becoming rare.

DINOv2 is the useful disappointment

One of the more business-relevant results is not the hybrid model’s win. It is DINOv2’s underwhelming transfer.

DINOv2-ViT-S/14 is the small Vision Transformer variant of DINOv2, a self-supervised model family pretrained on a large corpus of natural images. The paper evaluates it both as frozen features and with fine-tuning through BYOL on the unlabeled data. The results are modest. Frozen DINOv2 reports glyph-retrieval scores of 43.75 on N20R1 and 74.75 on N20R5, NDCG@10 of 0.2612, and Spearman of 0.477. Fine-tuning improves glyph retrieval to 61 and 90.5, and Spearman to 0.609, but NDCG@10 falls to 0.2366.

This does not mean general foundation models are bad. It means “large and general” is not a magic bridge into narrow visual domains where the relevant structure is not the structure of natural photographs. Glyphs are sparse, symbolic, high-contrast, and historically patterned. The discriminative features that help a model understand dogs, cars, and kitchen counters may not be the right features for separating script geometry.

The business translation is straightforward. Off-the-shelf embeddings are often a good first baseline. They are not a governance strategy. When the task depends on subtle domain-specific similarity, especially where false negatives matter, a domain-adapted training path may beat a bigger generic encoder. Yes, even if the bigger model has a more impressive model card and better conference aura.

The business pattern: reliable supervision first, exploratory adaptation second

The obvious business use of this paper is not “ancient script analysis as a service.” That market exists, but it is not what most companies will care about.

The reusable pattern is for domains with asymmetric certainty:

  • You know some within-class identities confidently.
  • You have clean labels in a safer subdomain.
  • You suspect cross-category relationships, but cannot label negatives reliably.
  • You need a similarity space that supports exploration rather than premature judgment.

This pattern appears in several business workflows.

In document intelligence, a firm may have reliable labels for standardized forms but uncertain relationships among legacy documents, scanned templates, or regional variants. A two-stage approach could learn clean document features from standardized examples, then adapt on unlabeled legacy material without forcing uncertain document families apart.

In compliance and legal taxonomy, some clauses are clearly equivalent or clearly distinct within a controlled policy library. But across jurisdictions, vendors, and historical versions, relationships become ambiguous. Treating every unlabeled cross-policy pair as unrelated can damage retrieval and risk analysis.

In product and entity deduplication, a company may know that certain catalog items are the same SKU and certain controlled examples are distinct. But marketplace data, multilingual labels, abbreviations, supplier aliases, and regional spellings make negative pairs dangerous. A model that over-separates uncertain items will create clean dashboards and messy operations.

In knowledge-graph construction, relation extraction often suffers from the same problem. Positive edges may be known; missing edges are not necessarily negative edges. Training as if “not observed” means “false” is a surprisingly efficient way to build a graph that confuses ignorance with knowledge. A classic enterprise achievement.

The paper’s principle can be expressed as a design rule:

| Business condition | Training design implication |
| --- | --- |
| Positive identity is reliable in a controlled subdomain | Use supervised or contrastive learning to build a strong prior |
| Negative relationships are uncertain in the target domain | Avoid cross-category negatives during adaptation |
| Target data has real variation but weak labels | Use self-supervised or teacher-student adaptation |
| Users care about nearest useful neighbors | Evaluate top-ranked retrieval, not only global correlation |
| Downstream decisions are high-stakes | Treat embedding output as evidence for review, not automatic truth |
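One way to carry the first two rows of that rule into a data pipeline is to give pair labels three states instead of two. The sketch below is a hypothetical illustration for an entity-deduplication setting: the record fields `domain` and `entity_id`, the notion of a "safe domain" set, and all function names are assumptions made for this example, not anything specified in the paper.

```python
from enum import Enum
from itertools import combinations


class PairLabel(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNKNOWN = "unknown"   # excluded from contrastive training, not treated as a negative


def label_pair(a, b, safe_domains):
    """Label a pair of records; negatives are only trusted inside safe domains."""
    known = a["entity_id"] is not None and b["entity_id"] is not None
    if known and a["entity_id"] == b["entity_id"]:
        return PairLabel.POSITIVE
    both_safe = a["domain"] in safe_domains and b["domain"] in safe_domains
    if known and both_safe:
        return PairLabel.NEGATIVE
    return PairLabel.UNKNOWN


def supervised_pairs(records, safe_domains):
    """Keep only the pairs whose label is safe enough to shape the geometry."""
    labelled = []
    for i, j in combinations(range(len(records)), 2):
        label = label_pair(records[i], records[j], safe_domains)
        if label is not PairLabel.UNKNOWN:
            labelled.append((i, j, label))
    return labelled


# Hypothetical records: a curated catalog (safe) and scraped marketplace data (not safe).
records = [
    {"domain": "catalog", "entity_id": "sku-1"},
    {"domain": "catalog", "entity_id": "sku-1"},
    {"domain": "catalog", "entity_id": "sku-2"},
    {"domain": "marketplace", "entity_id": None},
    {"domain": "marketplace", "entity_id": "sku-2"},
]
pairs = supervised_pairs(records, safe_domains={"catalog"})
```

Of the ten possible pairs, only four survive as supervision. The other six stay unlabeled, which is the point: they are left for self-supervised adaptation and human review rather than declared non-matches.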

The central ROI argument is not “better accuracy” in the abstract. It is cheaper and safer similarity discovery. The model reduces the manual search space without pretending that every non-match is known. For many enterprise workflows, that is more valuable than squeezing another point out of a classification benchmark.

Where the evidence should not be overextended

The paper is careful about the nature of its evidence, and this article should be too.

First, visual similarity is not historical proof. A glyph embedding can suggest proximity; it cannot by itself establish cultural transmission, lineage, borrowing, or independent invention. The model produces evidence that can inform expert analysis. It does not replace that analysis, unless one’s expert standard is “the scatterplot looked persuasive,” in which case archaeology is the least of our problems.

Second, the results vary by backbone and metric. The hybrid method’s strongest claim is about NDCG@10, especially on certain architectures. BYOL performs better on Spearman for ResNet-based models. Barlow Twins is strong on glyph-level retrieval in several settings. Any business implementation should choose metrics based on the actual decision task.

Third, the Unicode benchmark is valuable as a broadened dataset construction effort, but rendered Unicode glyphs are not the same as messy historical inscriptions, manuscripts, damaged artifacts, or stylized administrative documents. Font-normalized glyphs control variation; real-world archives enjoy no such manners.

Fourth, the method depends on having a safe supervised source domain. The invented-script setup works because the authors can identify a domain where contrastive negatives are defensible. In business settings, that safe domain must be selected carefully. A bad “clean” source domain simply gives the second stage a well-organized mistake.

Finally, the paper does not solve general similarity learning. It offers a good design pattern for a specific class of problems: reliable within-class supervision, uncertain cross-category relationships, and a need for exploratory neighborhood geometry.

That class is smaller than “all AI.” It is also larger than ancient scripts.

The mechanism is the message

The paper’s most useful idea is not that AI can cluster glyphs. It is that a training pipeline can respect the boundary between what is known and what is merely unproven.

Stage 1 uses contrastive learning where negatives are safe. Stage 2 uses teacher-initialized self-distillation where negatives would be speculative. The evaluation then checks whether the resulting space retrieves historically plausible neighbors and preserves meaningful script-level structure.

This is a mature pattern for AI system design. Do not force the model to learn from labels your organization does not actually possess. Do not confuse unlabeled pairs with negative pairs. Do not reward a model for making uncertain relationships look clean. Clean wrongness is still wrongness. It just looks better in a quarterly review.

For Cognaptus readers, the practical lesson is direct: when building similarity systems for documents, entities, risks, policies, or knowledge graphs, ask not only “what labels do we have?” but “which labels are safe enough to shape the geometry?” The answer may lead to a two-stage design: discriminative learning where truth is reliable, exploratory adaptation where relationships remain open.

That is a more honest architecture. It may also be a better one.

Cognaptus: Automate the Present, Incubate the Future.


  1. Claire Roman and Philippe Meyer, “Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning,” arXiv:2603.06180, 2026. https://arxiv.org/html/2603.06180