Numbers look simple until you ask a model to continue them.

That is the quiet trap in “Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees.”[1] The paper does not ask whether a transformer can chat about prime numbers, recite factorization facts, or hallucinate Euclid with confidence. It asks a cleaner question: if we translate the natural numbers into a symbolic language whose grammar is generated by prime factorization, can a GPT-2-style transformer learn that grammar from sequence data alone?

This sounds like a “machines learn arithmetic” story. It is not. Thankfully.

The useful reading is more precise: the authors build a controlled learnability test. They convert integers into rooted-tree structures, convert those trees into Dyck words, concatenate those words into a deterministic arithmetic text, and train a transformer on that text. The model learns some real structure. It beats a simple Markov-chain baseline. It almost never emits illegal Dyck words. It performs better on recurring composite-like motifs than on prime positions. And when asked to identify primes, its precision, recall, and F1 sit only around 0.3.

So the punchline is not “LLMs can factor numbers.” The punchline is more interesting: when the data-generating rule is known, we can watch a transformer succeed exactly where local grammar is exploitable—and fail where prediction requires something closer to global arithmetic reasoning.

That is a much better test than another leaderboard with a heroic name and suspiciously convenient examples.

The paper starts by changing the object, not the model

The model is not given raw integers. That matters.

Raw integers are a terrible input for this kind of experiment because the numerical symbol “55340232221128654848” does not expose its multiplicative anatomy to a sequence model. To the model, the decimal digits are mostly surface. The factorization structure is hidden behind notation.

The authors instead use a representation based on iterated prime factorization. For an integer $n \geq 2$ with $\omega(n)$ distinct prime factors $p_k$, write its prime factorization as:

$$ n = \prod_{k=1}^{\omega(n)} p_k^{a_k}. $$

Then build a rooted structure by attaching nodes for the distinct prime factors and recursively applying the same construction to the exponents $a_k$. If an exponent is one, the recursion stops.

This turns arithmetic into topology. The integer is no longer primarily a magnitude. It becomes a rooted planar tree.

Then comes the crucial simplification: the authors ignore the prime labels and retain the undecorated tree topology. That makes the representation lossy. Different integers can collapse into the same tree shape. For example, every product of two distinct primes, such as 6, 10, or 15, has the same tree. All prime numbers map to the same minimal tree.

Finally, each rooted planar tree is translated into a Dyck word: a balanced binary string generated by walking around the tree, writing 1 when moving upward along an edge and 0 when moving downward. In this encoding, prime numbers correspond to the word 10.

That last sentence is easy to overread. Predicting 10 at the right position means predicting that the integer at that position is prime in this tree-language representation. It does not mean the model has identified the prime’s value, discovered a new prime, or produced a full factorization. The labels have already been discarded. The model is learning a compressed grammar of factorization topology, not operating as a number-theory oracle.

Still, the setup is clever. The sequence of natural numbers becomes an “arithmetic text,” denoted $N_T$, whose words are Dyck encodings of rooted trees. This text is deterministic. It has recurring motifs. It has directionality: some phrases occur while their reversals may never occur. It has a countably infinite dictionary of possible Dyck words. It also carries arithmetic meaning, even after losing prime labels.

In other words, the authors do not merely feed numbers to a transformer. They build a language in which the hidden structure of numbers becomes visible enough to test.
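The construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes trial-division factorization (fine for small $n$, hopeless at the scale of the paper's corpus), assumes children are ordered by increasing prime, and writes 1 on entering a child and 0 on leaving it, which matches the statement that primes encode as 10. The function names `factorize`, `dyck_word`, and `arithmetic_text` are hypothetical.

```python
def factorize(n: int) -> list[tuple[int, int]]:
    """Return [(prime, exponent), ...] in increasing prime order (trial division)."""
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            e = 0
            while n % p == 0:
                n //= p
                e += 1
            factors.append((p, e))
        p += 1
    if n > 1:  # whatever is left is a single prime factor
        factors.append((n, 1))
    return factors


def dyck_word(n: int) -> str:
    """Dyck word of the undecorated tree of n; n = 1 gives the empty word."""
    # Assumption: each distinct prime becomes one child of the current node.
    # "1" enters the child, "0" leaves it, and the subtree under the child
    # is the tree of the exponent (empty when the exponent is 1).
    return "".join("1" + dyck_word(e) + "0" for _, e in factorize(n))


def arithmetic_text(start: int, stop: int) -> list[str]:
    """Dyck words for the integers start..stop inclusive, in order."""
    return [dyck_word(n) for n in range(start, stop + 1)]


print(arithmetic_text(2, 10))
# -> ['10', '10', '1100', '10', '1010', '10', '1100', '1100', '1010']
```

Under these assumptions, 6 and 15 both encode as 1010, and 4 and 8 both encode as 1100, which is the lossy collapse described earlier.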

The translation pipeline is the real benchmark

The mechanism can be read as a four-step pipeline:

| Step | What happens | Why it matters |
| --- | --- | --- |
| Integer | Start from $n$ and its prime factorization | The generating rule is exact, not empirical |
| Rooted tree | Recursively encode factorization and exponent structure | Multiplicative structure becomes hierarchical |
| Dyck word | Convert tree topology into a balanced binary string | The model sees symbolic “words,” not raw numbers |
| Arithmetic text | Concatenate the Dyck words across integers | The sequence becomes language-model training data |

This pipeline is the paper’s main contribution. The transformer experiment matters because this representation makes the experiment interpretable.

In ordinary enterprise AI evaluation, the data-generating process is often messy. A customer-support log mixes policy, human mood, product defects, reporting incentives, and operational noise. A compliance file mixes regulation, workflow, spreadsheet habits, and someone’s Friday afternoon energy level. When a model succeeds, we often do not know whether it learned a rule, a shortcut, a formatting convention, or the office politics hidden in the labels.

Here, the generative structure is known. The arithmetic text has a rule underneath it. That makes failure informative. It also makes partial success more meaningful.

The authors train a GPT-2-style decoder architecture from scratch: 12 layers, 12 attention heads, embedding dimension 768, and roughly $8.7 \times 10^7$ trainable parameters. The corpus is built from the Dyck-word sequence for integers from 2 up to $10^{11}$. The text is tokenized with Byte-Pair Encoding using vocabularies of size 64, 256, and 1024, while sentence length is kept at 1024 tokens.

The dataset split also matters. The ordered corpus is divided into 10 chunks. The first nine chunks contribute training and validation segments. The final chunk is held out for testing, giving the model a later interval of the arithmetic text rather than merely shuffled near-neighbors. This is not perfect “future reasoning,” but it is stricter than randomizing away the sequence’s order and then celebrating generalization. A low bar, yes, but a bar nonetheless.
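An order-preserving split of this kind can be sketched as follows. The ten-chunk layout follows the description above; how training and validation segments are drawn inside each chunk is not specified here, so the tail-of-chunk validation slice and the `val_fraction` parameter are assumptions for illustration only.

```python
def chunked_split(words, n_chunks=10, val_fraction=0.1):
    """Split an ordered sequence into train/val/test without shuffling."""
    chunk_len = len(words) // n_chunks
    chunks = [words[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks - 1)]
    chunks.append(words[(n_chunks - 1) * chunk_len:])  # last chunk takes the remainder

    train, val = [], []
    for chunk in chunks[:-1]:
        # Assumption: the tail of each early chunk is reserved for validation.
        cut = len(chunk) - int(len(chunk) * val_fraction)
        train.extend(chunk[:cut])
        val.extend(chunk[cut:])

    test = chunks[-1]  # the final chunk is held out entirely
    return train, val, test


train, val, test = chunked_split(list(range(100)))
print(len(train), len(val), len(test))  # -> 81 9 10
```

The point of the design is that every test item comes from later in the sequence than every training item, so the model cannot score well by interpolating between shuffled neighbors.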

The two tasks ask different questions

The paper evaluates two self-supervised tasks.

The first is next-word prediction. Given 1024 tokens of prior context, the model generates the continuation. This is the natural GPT-style task: predict what comes next in the arithmetic text.

The second is masked language modeling. Some tokens are masked, replaced, or left unchanged according to a masking procedure, and the model tries to reconstruct the missing content from surrounding context.
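A corruption step of this shape might look like the sketch below. The 80/10/10 split between masking, random replacement, and leaving the token unchanged is the convention popularized by BERT and is assumed here; the article does not state the proportions the authors used. The function name, `mask_id`, and the toy token values are hypothetical.

```python
import random


def corrupt_for_mlm(tokens, p_m, vocab_size, mask_id, rng):
    """Return (corrupted, targets); targets[i] is None where no prediction is asked."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() >= p_m:  # position not selected for prediction
            corrupted.append(tok)
            targets.append(None)
            continue
        targets.append(tok)  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted.append(mask_id)  # masked
        elif r < 0.9:
            corrupted.append(rng.randrange(vocab_size))  # replaced by a random token
        else:
            corrupted.append(tok)  # left unchanged
    return corrupted, targets


rng = random.Random(0)
tokens = [3, 17, 42, 8, 25, 60, 11, 5]
corrupted, targets = corrupt_for_mlm(tokens, p_m=0.3, vocab_size=64, mask_id=64, rng=rng)
```

Raising `p_m` removes more context, which is the first of the two knobs the masked experiment turns.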

These tasks are not interchangeable. They probe different forms of structure.

| Experiment component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Training and validation loss curves | Main training evidence plus learning-dynamics observation | The model learns compressible structure beyond uniform guessing | That it has recovered arithmetic laws |
| Markov-chain baseline | Minimal comparison | The transformer beats local one-step transition statistics | That it beats strong symbolic or arithmetic baselines |
| Temperature sweep in next-word prediction | Sensitivity test | Generation quality depends strongly on sampling temperature | That one temperature reveals stable reasoning ability |
| Per-word precision, recall, and F1 | Main evidence for arithmetic classes | Prime and square-free motifs can be evaluated as specific classes | That rare or global arithmetic events are solved |
| Prime confusion analysis | Diagnostic error analysis | Errors are structured, especially confusion with square-free words | That the model has a complete theory of the error classes |
| Masking-rate and temperature sweep in MLM | Robustness/sensitivity test for reconstruction | Context helps recover missing structure under moderate corruption | That bidirectional reconstruction equals causal prediction |

This distinction is important because the strongest business lesson is not “the model learned the task.” It is “different tests expose different kinds of learnability.”

A global word-accuracy score can look decent while hiding weak performance on the class you actually care about. Anyone who has deployed fraud detection, compliance screening, or anomaly triage has met this problem in less mathematical clothing. The system performs nicely on aggregate and then develops sudden philosophical uncertainty precisely when the expensive cases appear.

The model learns local grammar before it learns hard arithmetic—if it ever does

The next-word prediction results show a clear pattern.

The model’s loss decreases during training. The curves display an initial descent, then a plateau, then a second descent. The authors interpret this as a possible sign that the model first learns basic sequence properties and later learns more complex relations. That is plausible, but it should be treated as an exploratory learning-dynamics observation rather than a second thesis of the paper. The main result is not the shape of the loss curve. The main result is what the trained model can and cannot predict.

On word-level accuracy, the transformer outperforms the Markov-chain baseline across temperature settings. Accuracy is best at low temperature, roughly $0.1 \leq T \leq 0.3$. For Kullback–Leibler divergence between generated and true word-frequency distributions, the best range is broader, roughly $0.3 \leq T \leq 0.7$.
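For readers who have not met the temperature knob, the standard mechanism is to divide the model's logits by $T$ before the softmax. The sketch below shows that generic technique with made-up logits; it is not taken from the paper's code, and the function names are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # toy values, not model output
cold = temperature_probs(logits, 0.1)  # nearly all mass on the top token
warm = temperature_probs(logits, 1.0)  # mass spread across all three tokens
token = sample_token(logits, 0.3, random.Random(0))
```

Low $T$ pushes sampling toward the single most likely token, which helps exact-match accuracy. Higher $T$ lets less likely words appear at all, which is what a distribution-matching metric needs.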

That split is revealing. Accuracy rewards getting common words exactly right at the right positions. KL divergence rewards matching the distribution of words. The paper notes that word accuracy can be dominated by high-frequency motifs. In plain language: a model can look competent by correctly handling common structures while still being bad at rare or strategically important ones.
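The difference between the two metrics is easy to see on toy data. The sketch below assumes KL is taken from the true word-frequency distribution to the generated one and adds a small smoothing constant `eps`; both choices are assumptions, and the word lists are invented for illustration rather than drawn from the paper.

```python
import math
from collections import Counter


def word_accuracy(predicted, true):
    """Fraction of positions where the predicted word equals the true word."""
    matches = sum(p == t for p, t in zip(predicted, true))
    return matches / len(true)


def kl_divergence(true, predicted, eps=1e-9):
    """KL(P_true || P_pred) between smoothed word-frequency distributions."""
    vocab = set(true) | set(predicted)
    ct, cp = Counter(true), Counter(predicted)
    kl = 0.0
    for w in vocab:
        p = (ct[w] + eps) / (len(true) + eps * len(vocab))
        q = (cp[w] + eps) / (len(predicted) + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl


true = ["10", "1010", "10", "1100"]
shuffled = ["1010", "10", "1100", "10"]  # right frequencies, wrong positions
constant = ["10", "10", "10", "10"]      # always guess the most common word

print(word_accuracy(shuffled, true), kl_divergence(true, shuffled))  # -> 0.0 0.0
print(word_accuracy(constant, true))                                 # -> 0.5
```

The constant guesser wins on accuracy and loses badly on KL; the shuffled output does the reverse. Neither has learned the sequence.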

The authors therefore inspect precision, recall, and F1 for individual Dyck words. This is where the arithmetic interpretation becomes sharper.

For the Dyck word 10, corresponding to prime numbers, precision, recall, and F1 are all close to 0.3. The paper explains this as roughly detecting one prime for every three existing primes and correctly predicting one prime for every three predicted primes. That is not trivial. It is also not a triumph. It is the uncomfortable middle ground where something is being learned, but not enough to deserve the victory music.
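The "one in three" reading follows directly from the definitions. A minimal sketch of per-word precision, recall, and F1 is shown below, using an invented six-word example (not data from the paper) chosen so that the model predicts three primes, the text contains three primes, and only one position agrees.

```python
def word_prf(predicted, true, target="10"):
    """Precision, recall, and F1 for one Dyck word over aligned positions."""
    pairs = list(zip(predicted, true))
    tp = sum(p == target and t == target for p, t in pairs)
    fp = sum(p == target and t != target for p, t in pairs)
    fn = sum(p != target and t == target for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sequences for illustration only.
true = ["10", "1100", "10", "1010", "10", "1100"]
pred = ["10", "10", "1010", "10", "1010", "1100"]

precision, recall, f1 = word_prf(pred, true)  # each equals 1/3 here
```

Note that two of the three misses in this toy example are primes predicted as 1010, which is the kind of structured confusion the error analysis looks for.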

For square-free-related structures such as 1010, 101010, and 10101010, the metrics are higher, around 0.4 to 0.5. This makes sense. These motifs have more visible local structure in the tree-language representation. The model is better when the grammar gives it more handles.

Then the error analysis adds the important detail: the model frequently confuses primes with other square-free numbers. That is not random failure. It suggests the model has learned a neighborhood of structural similarity but does not fully separate the arithmetic class boundary.

This is the boundary that matters. The transformer captures enough grammar to avoid behaving like a toy Markov model. It learns recurring motifs. It often produces valid Dyck-word structure. But primes are hard because primality is not simply a local continuation pattern in the preceding text. At prime boundaries, the model’s statistical comfort begins to run out. As usual, the interesting part is where the machine stops sounding fluent.

Masked reconstruction confirms structure, but under controlled damage

The masked-language-modeling experiment gives complementary evidence.

Here the model is not asked to generate the future from the past. It is asked to recover missing tokens from context. The authors vary both the mask probability $p_m$ and the temperature $T$. Performance worsens when both increase, which is exactly what one would expect. More missing information and more sampling randomness make reconstruction harder. We did not need a theorem for that; reality still occasionally behaves.

The useful result is the magnitude under moderate settings. In the low-temperature regime, roughly $0.1 \leq T \leq 0.3$, token accuracy is above 0.4. The paper also reports that accuracy remains above 0.3 even as masking probability and temperature rise to about 0.4 and 0.5, respectively.

This supports the claim that the model has learned contextual regularities in the arithmetic text. But it should not be inflated into a claim about robust mathematical reasoning. MLM gives the model bidirectional context. That makes it a reconstruction test, not the same challenge as causally predicting unseen continuation.

For business readers, the distinction is familiar. Filling a missing field in a semi-structured document is easier than forecasting the next anomalous event in a process stream. Both are useful. Only one deserves to be called forward prediction.

The real finding is partial grammar learning, not prime prediction

The paper’s results are best summarized as follows:

| Paper result | Direct interpretation | Business interpretation | Boundary |
| --- | --- | --- | --- |
| GPT-2 beats a Markov-chain baseline | The model learns more than immediate token transitions | Transformers can capture nontrivial process grammar from symbolic logs | Baseline is intentionally simple |
| Generated outputs almost never violate Dyck-word validity | The model internalizes basic formal constraints | Sequence models can learn structural validity without explicit rule engines | Valid syntax is not full semantic correctness |
| Prime 10 metrics are near 0.3 | Prime-position prediction is weak but nonrandom | Rare boundary cases remain difficult even when aggregate metrics look good | Not a factoring or cryptography result |
| Square-free motifs reach higher metrics | Repeated local structures are easier to learn | Recurring operational patterns are more learnable than exceptional cases | Similarity can produce systematic confusion |
| MLM token accuracy stays above 0.4 at low temperature | Context supports partial reconstruction | Missing-data repair may be easier than causal event prediction | Reconstruction does not equal reasoning |

This is why the paper is valuable as a benchmark-design study. It does not give us a production system. It gives us a controlled chamber where model behavior can be inspected against a known rule.

That is rare. And useful.

In enterprise AI, many evaluations ask whether a model gives the “right” answer on a set of examples. The weakness is that the examples often do not reveal why the answer is right. The model may rely on shallow correlations. It may exploit formatting artifacts. It may learn the labeler’s habits. It may simply memorize. The result is a beautiful dashboard attached to a fog machine.

A deterministic symbolic corpus like this changes the testing logic. Because the underlying process is known, we can ask sharper diagnostic questions:

  • Does the model learn syntax-level validity?
  • Does it learn local transition structure?
  • Does it distinguish common motifs from rare classes?
  • Does it confuse structurally similar categories?
  • Does performance degrade smoothly under masking, temperature, or vocabulary changes?
  • Does a larger context window actually help with the dependencies that matter?

Those questions transfer well beyond number theory. A bank can encode transaction workflows. A manufacturer can encode process states. A compliance team can encode rule sequences. A software company can encode API-call traces. The goal is not to pretend these domains are arithmetic. The goal is to build evaluation corpora where the rule is known, the failure modes are classifiable, and aggregate accuracy cannot hide the costliest mistakes.

That is the business pathway: not “use transformers to discover primes,” but “use known-rule symbolic environments to test whether AI systems learn process structure or merely imitate surface frequency.”

Less glamorous. More useful. A tragic fate for good research.

Where the paper’s boundary should be drawn

The limitations are not footnotes to be sprinkled nervously across every paragraph. They define how the result should be used.

First, the representation is lossy. The undecorated tree keeps topology but discards prime labels. Predicting the Dyck word 10 is predicting a prime-shaped tree, not identifying the prime’s numerical value. This is the biggest reason the paper should not be sold as neural factorization.

Second, the model is GPT-2-style and relatively compact by today’s frontier-model standards. Its 1024-token context window and 768-dimensional embedding space are part of the experiment. A larger model might do better. It might also merely overfit more elegantly. The paper points to larger models and wider context windows as future work, not as settled evidence.

Third, the Markov-chain comparison is a minimal baseline. Beating it proves that the transformer captures more than immediate transition probabilities. It does not prove superiority over specialized arithmetic algorithms, symbolic methods, or stronger sequence baselines.

Fourth, temperature matters. The best settings are not incidental. Low temperature improves word accuracy because it reduces randomness in generation. But a model that performs well only under careful sampling control is still a system whose behavior depends heavily on inference configuration.

Fifth, the dataset is specialized. Rooted-tree Dyck words form a beautiful arithmetic language, but they are not natural language, legal text, source code, or market data. The transferable lesson lies in evaluation design, not in pretending every business process secretly wants to become a binary tree.

What Cognaptus infers for AI practice

The paper directly shows that a transformer can partially learn the grammar of a deterministic arithmetic text encoded through rooted-tree Dyck words. It also shows that this learning is uneven: stronger for recurring structural motifs, weaker for primes, and meaningfully better than a simple Markov baseline.

Cognaptus would infer three practical lessons.

First, known-rule synthetic corpora are underrated. They let teams test whether a model learns structure, not just correlations in messy historical data. This is valuable for AI assurance, process automation, and agent evaluation.

Second, class-level diagnostics matter more than global scores. The model’s aggregate performance improves, but prime-level F1 remains around 0.3. In business settings, the equivalent is an AI system that handles routine tickets well but fails on the rare compliance exception. The average score smiles. The lawyer does not.

Third, syntax validity is not semantic mastery. The model almost never generates invalid Dyck words, which is impressive. But producing valid structure is not the same as recovering the full arithmetic rule. This distinction should be tattooed on every enterprise AI dashboard, preferably next to the ROI estimate.
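The syntax side of that distinction is cheap to check, which is part of why it is a weak guarantee. A validity test for a single generated word needs only a running depth counter, as in the sketch below. It assumes generated text has already been segmented into words and that 1 opens and 0 closes; the function names and the sample words are hypothetical.

```python
def is_dyck_word(word: str) -> bool:
    """True if word is a nonempty balanced string over {1, 0} with 1 as the opener."""
    depth = 0
    for ch in word:
        if ch == "1":
            depth += 1
        elif ch == "0":
            depth -= 1
            if depth < 0:  # closes an edge that was never opened
                return False
        else:
            return False  # symbol outside the binary alphabet
    return depth == 0 and len(word) > 0


def invalid_rate(words):
    """Share of words that fail the Dyck check."""
    return sum(not is_dyck_word(w) for w in words) / len(words)


print(invalid_rate(["10", "01", "1100", "110"]))  # -> 0.5
```

A model can drive this rate to nearly zero and still place the valid words at the wrong integers, which is exactly the gap between structure and rule.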

What remains uncertain is scaling. A larger architecture, richer tokenizer, longer context, or different training objective may improve prime-level prediction. But the paper does not demonstrate that scaling alone crosses the boundary from local statistical grammar to global arithmetic reasoning. That remains the research question.

The language of numbers is learnable, but not obedient

The most interesting thing about this paper is that it refuses to give either camp an easy slogan.

For AI optimists, it shows that transformers can learn meaningful structure from a deterministic non-human language. They do not need messy human text to discover hierarchy, recurrence, and formal constraints.

For AI skeptics, it shows that this learning has a sharp boundary. The model is not “understanding arithmetic” in the strong sense. It is approximating the grammar exposed by the representation, doing well where the grammar is locally learnable, and stumbling where the sequence demands more global information.

That makes the study valuable precisely because it is neither hype nor dismissal. It gives us a measuring instrument.

And that is the deeper business relevance. The future of enterprise AI will not be decided by whether models sound intelligent in demos. It will be decided by whether we can diagnose what kind of structure they have learned, what kind they have merely imitated, and where the expensive boundary lies.

This paper gives one clean way to draw that boundary: translate a known system into a symbolic language, train the model, and inspect where fluency breaks.

Sometimes the model learns the grammar. Sometimes it learns the accent. The bill depends on knowing the difference.

Cognaptus: Automate the Present, Incubate the Future.


  1. Alessandro Breccia, Federica Gerace, Marco Lippi, Gabriele Sicuro, and Pierluigi Contucci, “Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees,” arXiv:2512.01870, version dated December 2, 2025.