Opening — Why this matters now

Human writing systems are historical artifacts as much as they are tools of communication. Latin letters, Greek symbols, Brahmic scripts, and Chinese characters all carry traces of cultural transmission, migration, and design conventions spanning millennia.

The problem is simple to state but notoriously difficult to solve: how do you measure similarity between writing systems when historians themselves disagree about their relationships?

Most machine learning methods assume the world is neatly labeled. Ancient scripts are not. A model trained with incorrect assumptions about which glyphs are “different” risks encoding speculative historical claims into its training objective.

A recent research effort proposes an elegant compromise: teach the model what we know with certainty, and let it discover the rest.

Instead of forcing AI to guess historical relationships, the framework separates reliable supervision from uncertain knowledge. The result is a hybrid learning strategy that may not only help historians analyze writing systems but also offer a broader blueprint for AI training in domains where ground truth is incomplete.


Background — The Limits of Conventional Representation Learning

Most modern visual representation learning techniques fall into two camps.

| Approach | Core Idea | Hidden Assumption |
| --- | --- | --- |
| Contrastive learning | Pull similar samples together and push others apart | All non‑matching samples are unrelated |
| Self‑supervised learning | Learn invariances without labels | Structure must emerge purely from data |

Both assumptions become problematic when studying ancient scripts.

Consider two characters that look similar across alphabets. They might:

  • share a historical ancestor
  • reflect aesthetic conventions
  • or simply resemble each other by coincidence

If a model explicitly treats them as negative examples, it may erase meaningful relationships.

This is the central insight of the research: script evolution creates an asymmetric supervision problem.

Certain facts are reliable:

  • Different drawings of the same glyph are equivalent.
  • Characters in invented alphabets are distinct classes by design.

But historical relationships between scripts remain uncertain.

The proposed solution is a training pipeline that respects this asymmetry.


Analysis — The Two‑Stage Learning Strategy

The framework combines supervised contrastive learning and self‑supervised distillation into a single pipeline.

Stage 1 — Learning Reliable Structure

The first stage trains a model using invented alphabets whose character identities are unambiguous.

Examples include fictional scripts such as those from literature or modern designed alphabets.

These datasets provide clean supervision:

  • each glyph belongs to a well‑defined class
  • characters from different alphabets are guaranteed independent

Using supervised contrastive learning, the encoder learns a geometric embedding space where:

  • instances of the same glyph cluster together
  • different glyph classes remain clearly separated

The training objective is the supervised contrastive loss, averaged over the anchors $I$ in a batch:

$$ \mathcal{L}_{sup} = \frac{1}{|I|} \sum_{i \in I} \ell_i $$

where $\ell_i$ is the per‑anchor term that pulls instances of anchor $i$'s glyph class together and pushes all other classes in the batch away.

This stage produces a teacher model that encodes a robust visual prior about how glyphs differ.
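As an illustration, here is a minimal NumPy sketch of a supervised contrastive loss of this form. It is a generic implementation, not the paper's code, and the batch layout and temperature value are placeholder assumptions:

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch.

    embeddings: (N, d) glyph embeddings; labels: (N,) glyph-class ids.
    Each anchor is pulled toward other instances of its class (positives)
    and pushed away from every other sample in the batch.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                  # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)               # exclude self-comparisons
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = labels[:, None] == labels[None, :]
    np.fill_diagonal(positives, False)
    n_pos = positives.sum(axis=1)
    # per-anchor loss: mean negative log-probability over the positives
    loss_i = -np.where(positives, log_prob, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return loss_i[n_pos > 0].mean()
```

With perfectly clustered embeddings the loss approaches zero; with random embeddings it sits near the log of the batch size, which is what gives the teacher its discriminative geometry.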

Stage 2 — Discovering Historical Relationships

Once the teacher is trained, the model is adapted to real historical scripts.

This stage uses a self‑supervised teacher–student setup inspired by BYOL (Bootstrap Your Own Latent).

Key components include:

  • Student network: learns updated representations
  • Target network: momentum‑updated teacher
  • Stop‑gradient mechanism: prevents collapse

The loss is a symmetric cosine distance $D$ between the student's predictions $p$ and the target network's projections $z$ for two augmented views of each of the $B'$ samples in a batch:

$$ \mathcal{L}_{BYOL} = \frac{1}{B'} \sum_i \left[ D(p_i^1, z_i^2) + D(p_i^2, z_i^1) \right] $$

Unlike contrastive methods, this framework avoids negative pairs entirely during the historical adaptation phase.

This allows the model to reorganize the embedding space based on real data rather than pre‑imposed assumptions.
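To make the mechanics concrete, here is a minimal NumPy sketch of the BYOL-style setup, using toy linear maps in place of real encoders. The names `f`, `q`, and `f_t` are illustrative, not from the paper:

```python
import numpy as np

def normalize(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def byol_loss(view1, view2, f, q, f_t):
    """Symmetric BYOL loss: D(p1, sg(z2)) + D(p2, sg(z1)).

    f:   online (student) encoder weights
    q:   predictor weights (exists only on the student side)
    f_t: target encoder weights -- treated as constants here, which is
         exactly what the stop-gradient enforces in a real framework
    """
    p1, p2 = normalize(view1 @ f @ q), normalize(view2 @ f @ q)
    z1, z2 = normalize(view1 @ f_t), normalize(view2 @ f_t)
    d = lambda p, z: (2 - 2 * (p * z).sum(axis=1)).mean()  # cosine distance
    return d(p1, z2) + d(p2, z1)

def momentum_update(f_t, f, tau=0.996):
    """EMA update: the target drifts slowly toward the student."""
    return tau * f_t + (1 - tau) * f
```

After each student gradient step, `f_t = momentum_update(f_t, f)` keeps the target a slow-moving average of the student, and because no negative pairs appear anywhere in the loss, nothing actively pushes historically related glyphs apart.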

Why This Matters

The two stages play complementary roles.

| Stage | Purpose | Data Type |
| --- | --- | --- |
| Stage 1 | Build a discriminative visual prior | Labeled fictional alphabets |
| Stage 2 | Discover cross‑script similarities | Unlabeled historical scripts |

In short:

Stage 1 teaches structure. Stage 2 learns history.


Findings — What the Experiments Show

The framework was evaluated on two datasets:

  • Omniglot: a dataset of handwritten characters spanning many scripts
  • a Unicode‑derived glyph dataset rendered with the Noto font family

Performance was assessed at two levels: individual glyph recognition and script‑level structure.

Glyph Recognition

The system was tested using a 20‑way 1‑shot retrieval task.

| Backbone | Top‑1 Accuracy | Top‑5 Accuracy |
| --- | --- | --- |
| Simple CNN | 88% | 98.75% |
| ResNet‑50 | 93% | 98.75% |

Results show the hybrid model remains competitive with leading self‑supervised methods for glyph recognition.
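The 20‑way 1‑shot protocol above can be sketched as nearest‑prototype retrieval over embeddings. This is illustrative code under the assumption that each class contributes one support example; the real evaluation samples many random episodes:

```python
import numpy as np

def one_shot_topk_accuracy(support, query, query_labels, k=1):
    """N-way 1-shot retrieval (N = 20 in the paper's setup).

    support:      (N, d) one embedded example per class, class c at row c
    query:        (Q, d) embedded test glyphs
    query_labels: (Q,)  true class of each query
    A query counts as correct if its true class is among its k nearest
    class prototypes by cosine similarity.
    """
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    sim = q @ s.T                                # (Q, N) similarities
    topk = np.argsort(-sim, axis=1)[:, :k]       # k most similar prototypes
    hits = [query_labels[i] in topk[i] for i in range(len(query_labels))]
    return float(np.mean(hits))
```

Top‑1 and Top‑5 accuracy in the table correspond to `k=1` and `k=5`.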

Script‑Level Similarity Ranking

More interesting is the ability to reconstruct historical relationships between writing systems.

This was measured using NDCG@10, a ranking metric that evaluates whether historically related scripts appear near each other.

| Method | NDCG@10 |
| --- | --- |
| BYOL | 0.2708 |
| Barlow Twins | 0.2997 |
| Proposed framework | 0.3178 |

The hybrid method consistently produced better script similarity rankings.

In other words, it captured relationships between writing systems more effectively.
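For reference, NDCG@k itself is straightforward to compute. This is a generic implementation assuming binary relevance (related / not related), not the paper's evaluation code:

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k for one query script.

    ranked_relevance: relevance of each retrieved script in the model's
    ranked order, e.g. 1 if historically related to the query, else 0.
    DCG discounts gains logarithmically by rank; the score is normalized
    by the DCG of a perfect ordering (IDCG), so 1.0 is a perfect ranking.
    """
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(ranked_relevance, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that places every related script first scores 1.0, so the reported values around 0.27–0.32 show how hard reconstructing historical relationships from glyph images alone remains.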

Geometric Separability

Another metric compares distances between related and unrelated scripts.

Example comparison:

| Relationship | Distance Ratio |
| --- | --- |
| Greek ↔ Latin | small |
| Greek ↔ CJK | large |

The proposed method reduced the separability ratio by 35%, indicating a more historically coherent embedding space.
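One simple way to compute such a ratio is shown below. This is an illustrative definition (mean related-pair distance over mean unrelated-pair distance); the paper's exact formulation may differ:

```python
import numpy as np

def mean_cross_distance(a, b):
    """Mean Euclidean distance between every glyph embedding in script a
    (shape (Na, d)) and every glyph embedding in script b (shape (Nb, d))."""
    diff = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())

def separability_ratio(related_pairs, unrelated_pairs):
    """Mean distance over historically related script pairs divided by the
    mean distance over unrelated pairs. Values well below 1 mean related
    scripts (Greek/Latin) sit much closer than unrelated ones (Greek/CJK)."""
    rel = np.mean([mean_cross_distance(a, b) for a, b in related_pairs])
    unrel = np.mean([mean_cross_distance(a, b) for a, b in unrelated_pairs])
    return float(rel / unrel)
```

Under this reading, lowering the ratio means the embedding space pulls historically related scripts together relative to unrelated ones.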

Visualizations of the learned embeddings show clearer clustering of related scripts.


Implications — Beyond Ancient Scripts

At first glance, this work appears narrowly focused on paleography.

It is not.

The deeper idea is methodological: AI training should respect the reliability of knowledge sources.

Many domains share the same structure:

| Domain | Reliable Knowledge | Uncertain Relationships |
| --- | --- | --- |
| Biology | species morphology | evolutionary lineage |
| Finance | transaction data | causal market dynamics |
| Law | case facts | precedent interpretation |

Forcing models to treat uncertain relationships as negatives can distort the learned representation space.

The two‑stage strategy offers a general pattern:

  1. Learn discriminative structure from reliable labels
  2. Adapt representations without imposing speculative constraints

This approach could influence future work in:

  • historical linguistics
  • cultural artifact analysis
  • scientific discovery datasets
  • AI systems trained on incomplete knowledge

In short, it provides a practical architecture for learning under epistemic uncertainty.


Conclusion — Let the Model Discover the Past

Machine learning often assumes the world is labeled and neatly structured.

History rarely cooperates.

By separating reliable supervision from uncertain knowledge, this two‑stage framework shows that AI can still extract meaningful structure without embedding questionable assumptions.

The result is a model that not only recognizes glyphs but begins to map the geometry of human writing itself.

If scaled further, such systems could help scholars explore the evolutionary networks of scripts worldwide—revealing patterns that have remained hidden across centuries of human writing.

And perhaps more importantly, the method reminds us of something AI research occasionally forgets:

not every relationship should be hard‑coded into a loss function.

Sometimes the most powerful models are those allowed to discover history rather than assume it.

Cognaptus: Automate the Present, Incubate the Future.