## Opening — Why This Matters Now
Human writing systems are historical artifacts as much as they are tools of communication. Latin letters, Greek symbols, Brahmi scripts, and Chinese characters all carry traces of cultural transmission, migration, and design conventions spanning millennia.
The problem is simple to state but notoriously difficult to solve: how do you measure similarity between writing systems when historians themselves disagree about their relationships?
Most machine learning methods assume the world is neatly labeled. Ancient scripts are not. A model trained with incorrect assumptions about which glyphs are “different” risks encoding speculative historical claims into its training objective.
A recent research effort proposes an elegant compromise: teach the model what we know with certainty, and let it discover the rest.
Instead of forcing AI to guess historical relationships, the framework separates reliable supervision from uncertain knowledge. The result is a hybrid learning strategy that may not only help historians analyze writing systems but also offer a broader blueprint for AI training in domains where ground truth is incomplete.
## Background — The Limits of Conventional Representation Learning
Most modern visual representation learning techniques fall into two camps.
| Approach | Core Idea | Hidden Assumption |
|---|---|---|
| Contrastive Learning | Pull similar samples together and push others apart | All non‑matching samples are unrelated |
| Self‑Supervised Learning | Learn invariances without labels | Structure must emerge purely from data |
Both assumptions become problematic when studying ancient scripts.
Consider two characters that look similar across alphabets. They might:
- share a historical ancestor
- reflect aesthetic conventions
- or simply resemble each other by coincidence
If a model explicitly treats them as negative examples, it may erase meaningful relationships.
This is the central insight of the research: script evolution creates an asymmetric supervision problem.
Certain facts are reliable:
- Different drawings of the same glyph are equivalent.
- Characters within invented alphabets are distinct by design, with no hidden historical ties.
But historical relationships between scripts remain uncertain.
The proposed solution is a training pipeline that respects this asymmetry.
## Analysis — The Two‑Stage Learning Strategy
The framework combines supervised contrastive learning and self‑supervised distillation into a single pipeline.
### Stage 1 — Learning Reliable Structure
The first stage trains a model using invented alphabets whose character identities are unambiguous.
Examples include fictional scripts such as those from literature or modern designed alphabets.
These datasets provide clean supervision:
- each glyph belongs to a well‑defined class
- characters from different alphabets are guaranteed independent
Using supervised contrastive learning, the encoder learns a geometric embedding space where:
- instances of the same glyph cluster together
- different glyph classes remain clearly separated
The training objective is the supervised contrastive loss:
$$ \mathcal{L}_{sup} = \frac{1}{|I|} \sum_{i \in I} \ell_i $$
where $I$ indexes the anchors in a batch and $\ell_i$ is the per‑anchor contrastive term.
This stage produces a teacher model that encodes a robust visual prior about how glyphs differ.
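Concretely, each per‑anchor term $\ell_i$ averages log‑probabilities over that anchor's positives. A minimal NumPy sketch of this loss (the temperature value and the exact positive‑set convention are assumptions, following the standard supervised contrastive formulation):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over embeddings z of shape (N, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    # Exclude self-similarity from the softmax denominator
    sim = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if labels[p] == labels[i] and p != i]
        if positives:  # per-anchor term l_i: mean over the anchor's positives
            losses.append(-log_prob[i, positives].mean())
    return float(np.mean(losses))  # L_sup = (1/|I|) * sum_i l_i
```

Same‑class glyphs that cluster tightly yield a lower loss than mixed‑class arrangements, which is exactly the geometry Stage 1 aims for.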
### Stage 2 — Discovering Historical Relationships
Once the teacher is trained, the model is adapted to real historical scripts.
This stage uses a self‑supervised teacher–student setup inspired by BYOL (Bootstrap Your Own Latent).
Key components include:
- Student network: learns updated representations
- Target network: momentum‑updated teacher
- Stop‑gradient mechanism: prevents collapse
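The target network's momentum update can be sketched as an exponential moving average over parameters (the coefficient here is illustrative; BYOL‑style setups typically use values near 0.99):

```python
def ema_update(target_params, student_params, m=0.996):
    """Momentum update: target <- m * target + (1 - m) * student.
    Gradients never flow into the target network; it only tracks the student."""
    return [m * t + (1 - m) * s for t, s in zip(target_params, student_params)]
```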
The loss minimizes the cosine distance between the student's predictions and the target's projections across two augmented views:
$$ \mathcal{L}_{BYOL} = \frac{1}{B'} \sum_i \left[ D(p_i^1, z_i^2) + D(p_i^2, z_i^1) \right] $$
where $D$ denotes cosine distance, $p_i^v$ and $z_i^v$ are the student's prediction and the target's projection for view $v$ of sample $i$, and the sum runs over the batch of size $B'$.
Unlike contrastive methods, this framework avoids negative pairs entirely during the historical adaptation phase.
This allows the model to reorganize the embedding space based on real data rather than pre‑imposed assumptions.
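A minimal sketch of the symmetrized loss, with $D$ implemented as $2 - 2\cos(p, z)$ and the target outputs treated as constants (the stop‑gradient):

```python
import numpy as np

def byol_pair_loss(p, z):
    """D(p, z) = 2 - 2 * cosine_similarity: one term of the BYOL loss.
    z comes from the target network and is treated as a constant (stop-gradient)."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 2.0 - 2.0 * (p * z).sum(axis=1)

def byol_loss(p1, z2, p2, z1):
    """Symmetrized batch loss: mean of D(p1, z2) + D(p2, z1)."""
    return float(np.mean(byol_pair_loss(p1, z2) + byol_pair_loss(p2, z1)))
```

When student and target agree perfectly, the loss is zero; no negative pairs appear anywhere in the objective.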
### Why This Matters
The two stages play complementary roles.
| Stage | Purpose | Data Type |
|---|---|---|
| Stage 1 | Build a discriminative visual prior | Labeled fictional alphabets |
| Stage 2 | Discover cross‑script similarities | Unlabeled historical scripts |
In short:
Stage 1 teaches structure. Stage 2 learns history.
## Findings — What the Experiments Show
The framework was evaluated on two datasets:
- Omniglot: a dataset of handwritten characters across many scripts
- a Unicode‑derived glyph dataset rendered with the Noto font family
Two types of performance metrics were used.
### Glyph Recognition
The system was tested using a 20‑way 1‑shot retrieval task.
| Backbone | Top‑1 Accuracy | Top‑5 Accuracy |
|---|---|---|
| Simple CNN | 88% | 98.75% |
| ResNet‑50 | 93% | 98.75% |
Results show the hybrid model remains competitive with leading self‑supervised methods for glyph recognition.
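Under a standard nearest‑neighbor protocol (an assumption about the exact setup), a 20‑way 1‑shot retrieval evaluation can be sketched as:

```python
import numpy as np

def one_shot_accuracy(support, support_labels, queries, query_labels):
    """N-way 1-shot retrieval: each query takes the label of its nearest
    support embedding by cosine similarity (one support example per class)."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    nearest = np.argmax(q @ s.T, axis=1)          # nearest support per query
    predicted = np.asarray(support_labels)[nearest]
    return float(np.mean(predicted == np.asarray(query_labels)))
```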
### Script‑Level Similarity Ranking
More interesting is the ability to reconstruct historical relationships between writing systems.
This was measured using NDCG@10, a ranking metric that evaluates whether historically related scripts appear near each other.
| Method | NDCG@10 |
|---|---|
| BYOL | 0.2708 |
| Barlow Twins | 0.2997 |
| Proposed Framework | 0.3178 |
The hybrid method consistently produced better script similarity rankings.
In other words, it captured relationships between writing systems more effectively.
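For reference, NDCG@10 for a single query script can be computed as follows, assuming binary relevance labels (related vs. unrelated):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k: relevances are graded relevance scores of the retrieved
    scripts, listed in ranked order. Returns DCG normalized by ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # log2(rank + 1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing related scripts down the list lowers the score, which is what the table's differences measure.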
### Geometric Separability
Another metric compares distances between related and unrelated scripts.
Example comparison:
| Relationship | Embedding Distance |
|---|---|
| Greek ↔ Latin | small |
| Greek ↔ CJK | large |
The proposed method reduced the separability ratio by 35%, indicating a more historically coherent embedding space.
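One plausible way to compute such a ratio, assuming script‑level distances are measured between centroids of glyph embeddings (the paper's exact definition may differ):

```python
import numpy as np

def script_distance(emb_a, emb_b):
    """Distance between two scripts: Euclidean distance between the
    centroids of their glyph embeddings (assumed definition)."""
    return float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))

def separability_ratio(related_pairs, unrelated_pairs):
    """Mean related-script distance over mean unrelated-script distance.
    Lower values indicate a more historically coherent embedding space."""
    related = np.mean([script_distance(a, b) for a, b in related_pairs])
    unrelated = np.mean([script_distance(a, b) for a, b in unrelated_pairs])
    return float(related / unrelated)
```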
Visualizations of the learned embeddings show clearer clustering of related scripts.
## Implications — Beyond Ancient Scripts
At first glance, this work appears narrowly focused on paleography.
It is not.
The deeper idea is methodological: AI training should respect the reliability of knowledge sources.
Many domains share the same structure:
| Domain | Reliable Knowledge | Uncertain Relationships |
|---|---|---|
| Biology | species morphology | evolutionary lineage |
| Finance | transaction data | causal market dynamics |
| Law | case facts | precedent interpretation |
Forcing models to treat uncertain relationships as negatives can distort the learned representation space.
The two‑stage strategy offers a general pattern:
- Learn discriminative structure from reliable labels
- Adapt representations without imposing speculative constraints
This approach could influence future work in:
- historical linguistics
- cultural artifact analysis
- scientific discovery datasets
- AI systems trained on incomplete knowledge
In short, it provides a practical architecture for learning under epistemic uncertainty.
## Conclusion — Let the Model Discover the Past
Machine learning often assumes the world is labeled and neatly structured.
History rarely cooperates.
By separating reliable supervision from uncertain knowledge, this two‑stage framework shows that AI can still extract meaningful structure without embedding questionable assumptions.
The result is a model that not only recognizes glyphs but begins to map the geometry of human writing itself.
If scaled further, such systems could help scholars explore the evolutionary networks of scripts worldwide—revealing patterns that have remained hidden across centuries of human writing.
And perhaps more importantly, the method reminds us of something AI research occasionally forgets:
not every relationship should be hard‑coded into a loss function.
Sometimes the most powerful models are those allowed to discover history rather than assume it.
Cognaptus: Automate the Present, Incubate the Future.