## Opening — Why This Matters Now
Human writing systems are historical artifacts as much as they are tools of communication. Latin letters, Greek symbols, Brahmi scripts, and Chinese characters all carry traces of cultural transmission, migration, and design conventions spanning millennia.
The problem is simple to state but notoriously difficult to solve: how do you measure similarity between writing systems when historians themselves disagree about their relationships?
Most machine learning methods assume the world is neatly labeled. Ancient scripts are not. A model trained with incorrect assumptions about which glyphs are “different” risks encoding speculative historical claims into its training objective.
A recent research effort proposes an elegant compromise: teach the model what we know with certainty, and let it discover the rest.
Instead of forcing AI to guess historical relationships, the framework separates reliable supervision from uncertain knowledge. The result is a hybrid learning strategy that may not only help historians analyze writing systems but also offer a broader blueprint for AI training in domains where ground truth is incomplete.
## Background — The Limits of Conventional Representation Learning
Most modern visual representation learning techniques fall into two camps.
| Approach | Core Idea | Hidden Assumption |
|---|---|---|
| Contrastive Learning | Pull similar samples together and push others apart | All non‑matching samples are unrelated |
| Self‑Supervised Learning | Learn invariances without labels | Structure must emerge purely from data |
Both assumptions become problematic when studying ancient scripts.
Consider two characters that look similar across alphabets. They might:
- share a historical ancestor
- reflect aesthetic conventions
- or simply resemble each other by coincidence
If a model explicitly treats them as negative examples, it may erase meaningful relationships.
This is the central insight of the research: script evolution creates an asymmetric supervision problem.
Certain facts are reliable:
- Different drawings of the same glyph are equivalent.
- Characters within invented alphabets are distinct by design, with no hidden historical ties.
But historical relationships between scripts remain uncertain.
The proposed solution is a training pipeline that respects this asymmetry.
## Analysis — The Two‑Stage Learning Strategy
The framework combines supervised contrastive learning and self‑supervised distillation into a single pipeline.
### Stage 1 — Learning Reliable Structure
The first stage trains a model using invented alphabets whose character identities are unambiguous.
Examples include fictional scripts such as those from literature or modern designed alphabets.
These datasets provide clean supervision:
- each glyph belongs to a well‑defined class
- characters from different alphabets are guaranteed independent
Using supervised contrastive learning, the encoder learns a geometric embedding space where:
- instances of the same glyph cluster together
- different glyph classes remain clearly separated
The training objective is the supervised contrastive loss:
$$ \mathcal{L}_{sup} = \frac{1}{|I|} \sum_{i \in I} \ell_i $$
where $I$ indexes the anchors in a batch and $\ell_i$ is the per‑anchor contrastive term.
This stage produces a teacher model that encodes a robust visual prior about how glyphs differ.
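Concretely, each per‑anchor term $\ell_i$ averages log‑probabilities over that anchor's positives. A minimal NumPy sketch of this loss (the temperature value and the exact positive‑set convention are assumptions, following the standard supervised contrastive formulation):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over embeddings z of shape (N, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    # Exclude self-similarity from the softmax denominator
    sim = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if labels[p] == labels[i] and p != i]
        if positives:  # per-anchor term l_i: mean over the anchor's positives
            losses.append(-log_prob[i, positives].mean())
    return float(np.mean(losses))  # L_sup = (1/|I|) * sum_i l_i
```

Same‑class glyphs that cluster tightly yield a lower loss than mixed‑class arrangements, which is exactly the geometry Stage 1 aims for.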
### Stage 2 — Discovering Historical Relationships
Once the teacher is trained, the model is adapted to real historical scripts.
This stage uses a self‑supervised teacher–student setup inspired by BYOL (Bootstrap Your Own Latent).
Key components include:
- Student network: learns updated representations
- Target network: momentum‑updated teacher
- Stop‑gradient mechanism: prevents collapse
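The target network's momentum update can be sketched as an exponential moving average over parameters (the coefficient here is illustrative; BYOL‑style setups typically use values near 0.99):

```python
def ema_update(target_params, student_params, m=0.996):
    """Momentum update: target <- m * target + (1 - m) * student.
    Gradients never flow into the target network; it only tracks the student."""
    return [m * t + (1 - m) * s for t, s in zip(target_params, student_params)]
```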
The loss minimizes the cosine distance between the student's predictions and the target's projections across two augmented views:
$$ \mathcal{L}_{BYOL} = \frac{1}{B'} \sum_i \left[ D(p_i^1, z_i^2) + D(p_i^2, z_i^1) \right] $$
where $D$ denotes cosine distance, $p_i^v$ and $z_i^v$ are the student's prediction and the target's projection for view $v$ of sample $i$, and the sum runs over the batch of size $B'$.
Unlike contrastive methods, this framework avoids negative pairs entirely during the historical adaptation phase.
This allows the model to reorganize the embedding space based on real data rather than pre‑imposed assumptions.
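A minimal sketch of the symmetrized loss, with $D$ implemented as $2 - 2\cos(p, z)$ and the target outputs treated as constants (the stop‑gradient):

```python
import numpy as np

def byol_pair_loss(p, z):
    """D(p, z) = 2 - 2 * cosine_similarity: one term of the BYOL loss.
    z comes from the target network and is treated as a constant (stop-gradient)."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 2.0 - 2.0 * (p * z).sum(axis=1)

def byol_loss(p1, z2, p2, z1):
    """Symmetrized batch loss: mean of D(p1, z2) + D(p2, z1)."""
    return float(np.mean(byol_pair_loss(p1, z2) + byol_pair_loss(p2, z1)))
```

When student and target agree perfectly, the loss is zero; no negative pairs appear anywhere in the objective.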
### Why This Matters
The two stages play complementary roles.
| Stage | Purpose | Data Type |
|---|---|---|
| Stage 1 | Build a discriminative visual prior | Labeled fictional alphabets |
| Stage 2 | Discover cross‑script similarities | Unlabeled historical scripts |
In short:
Stage 1 teaches structure. Stage 2 learns history.
## Findings — What the Experiments Show
The framework was evaluated on two datasets:
- Omniglot: a dataset of handwritten characters across many scripts
- a Unicode‑derived glyph dataset rendered with the Noto font family
Two types of performance metrics were used.
### Glyph Recognition
The system was tested using a 20‑way 1‑shot retrieval task.
| Backbone | Top‑1 Accuracy | Top‑5 Accuracy |
|---|---|---|
| Simple CNN | 88% | 98.75% |
| ResNet‑50 | 93% | 98.75% |
Results show the hybrid model remains competitive with leading self‑supervised methods for glyph recognition.
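Under a standard nearest‑neighbor protocol (an assumption about the exact setup), a 20‑way 1‑shot retrieval evaluation can be sketched as:

```python
import numpy as np

def one_shot_accuracy(support, support_labels, queries, query_labels):
    """N-way 1-shot retrieval: each query takes the label of its nearest
    support embedding by cosine similarity (one support example per class)."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    nearest = np.argmax(q @ s.T, axis=1)          # nearest support per query
    predicted = np.asarray(support_labels)[nearest]
    return float(np.mean(predicted == np.asarray(query_labels)))
```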
### Script‑Level Similarity Ranking
More interesting is the ability to reconstruct historical relationships between writing systems.
This was measured using NDCG@10, a ranking metric that evaluates whether historically related scripts appear near each other.
| Method | NDCG@10 |
|---|---|
| BYOL | 0.2708 |
| Barlow Twins | 0.2997 |
| Proposed Framework | 0.3178 |
The hybrid method consistently produced better script similarity rankings.
In other words, it captured relationships between writing systems more effectively.
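For reference, NDCG@10 for a single query script can be computed as follows, assuming binary relevance labels (related vs. unrelated):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k: relevances are graded relevance scores of the retrieved
    scripts, listed in ranked order. Returns DCG normalized by ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # log2(rank + 1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing related scripts down the list lowers the score, which is what the table's differences measure.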
### Geometric Separability
Another metric compares distances between related and unrelated scripts.
Example comparison:
| Relationship | Embedding Distance |
|---|---|
| Greek ↔ Latin | small |
| Greek ↔ CJK | large |
The proposed method reduced the separability ratio by 35%, indicating a more historically coherent embedding space.
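One plausible way to compute such a ratio, assuming script‑level distances are measured between centroids of glyph embeddings (the paper's exact definition may differ):

```python
import numpy as np

def script_distance(emb_a, emb_b):
    """Distance between two scripts: Euclidean distance between the
    centroids of their glyph embeddings (assumed definition)."""
    return float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))

def separability_ratio(related_pairs, unrelated_pairs):
    """Mean related-script distance over mean unrelated-script distance.
    Lower values indicate a more historically coherent embedding space."""
    related = np.mean([script_distance(a, b) for a, b in related_pairs])
    unrelated = np.mean([script_distance(a, b) for a, b in unrelated_pairs])
    return float(related / unrelated)
```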
Visualizations of the learned embeddings show clearer clustering of related scripts.
## Implications — Beyond Ancient Scripts
At first glance, this work appears narrowly focused on paleography.
It is not.
The deeper idea is methodological: AI training should respect the reliability of knowledge sources.
Many domains share the same structure:
| Domain | Reliable Knowledge | Uncertain Relationships |
|---|---|---|
| Biology | species morphology | evolutionary lineage |
| Finance | transaction data | causal market dynamics |
| Law | case facts | precedent interpretation |
Forcing models to treat uncertain relationships as negatives can distort the learned representation space.
The two‑stage strategy offers a general pattern:
- Learn discriminative structure from reliable labels
- Adapt representations without imposing speculative constraints
This approach could influence future work in:
- historical linguistics
- cultural artifact analysis
- scientific discovery datasets
- AI systems trained on incomplete knowledge
In short, it provides a practical architecture for learning under epistemic uncertainty.
## Conclusion — Let the Model Discover the Past
Machine learning often assumes the world is labeled and neatly structured.
History rarely cooperates.
By separating reliable supervision from uncertain knowledge, this two‑stage framework shows that AI can still extract meaningful structure without embedding questionable assumptions.
The result is a model that not only recognizes glyphs but begins to map the geometry of human writing itself.
If scaled further, such systems could help scholars explore the evolutionary networks of scripts worldwide—revealing patterns that have remained hidden across centuries of human writing.
And perhaps more importantly, the method reminds us of something AI research occasionally forgets:
not every relationship should be hard‑coded into a loss function.
Sometimes the most powerful models are those allowed to discover history rather than assume it.
Cognaptus: Automate the Present, Incubate the Future.