## Opening — Why this matters now
Large language models can write essays, generate code, and even explain quantum physics. Yet ask them a deceptively simple question involving numbers—which value is larger, 9000 or 12000?—and things occasionally fall apart.
The problem is structural. Most language models treat numbers as if they were ordinary words. The token “42” is just another symbol, not something that carries magnitude, units, or measurement semantics.
For business applications, this limitation is more than an academic curiosity. Financial models, medical records, engineering datasets, and operational dashboards all rely heavily on numerical data. When AI systems misinterpret numbers—or fail to distinguish 30 years from 30 months—automation pipelines quietly accumulate errors.
A recent research paper proposes an elegant fix: CONE (Composite Numerical Embeddings)—a framework that teaches language models to represent numbers the way humans implicitly understand them: through value, units, and attributes.
In other words, the model finally learns that 5 kg and 5 km are not remotely the same thing.
## Background — Context and prior art
Traditional embedding systems such as BERT, BioBERT, and similar transformer encoders excel at capturing relationships between words. They operate on tokenized text, where each token is mapped into a vector in semantic space.
That works well for language.
Numbers, however, obey very different rules.
| Property | Words | Numbers |
|---|---|---|
| Ordering | Usually none | Strict order |
| Distance meaning | Semantic similarity | Quantitative magnitude |
| Units | Rare | Essential |
| Representations | Discrete tokens | Scalar, range, distribution |
Most existing models ignore these differences. A number like 28,600 may be split into subword tokens such as "28", ",", and "600", destroying its mathematical meaning before the model ever sees it.
The situation becomes even worse in structured data, where numbers are tied to attributes.
Consider the following simplified table:
| Attribute | Value |
|---|---|
| Age | 30 |
| Follow‑up (months) | 30 |
A conventional embedding model might treat these two columns as nearly identical because the distributions overlap.
But semantically, they represent completely different quantities.
This gap—between linguistic representation and numerical semantics—is what CONE attempts to close.
## Analysis — What the paper proposes
CONE introduces a composite embedding architecture designed specifically for structured numerical data.
Instead of encoding a number as a single token embedding, the system decomposes it into three independent components:
- Attribute context – what the number describes
- Numerical value – the magnitude
- Unit semantics – the measurement scale
The final representation concatenates these components, then applies a learned linear projection and layer normalization to produce a unified embedding vector.
Conceptually:
$$ E_{\text{comp}} = \mathrm{LayerNorm}\big(W\,[E_a \oplus E_v \oplus E_u]\big) $$
Where:
- $E_a$ = attribute embedding
- $E_v$ = numeric value embedding
- $E_u$ = unit embedding
This design ensures that identical values become distinct when their context differs.
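A minimal sketch of this composition step, using NumPy with assumed dimensions and random matrices standing in for learned parameters (the paper does not specify sizes, so `d = 32` is purely illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the vector to zero mean and unit variance.
    return (x - x.mean()) / (x.std() + eps)

rng = np.random.default_rng(0)
d = 32  # per-component embedding size (assumed, not from the paper)

# Hypothetical component embeddings for "follow-up: 30 months"
E_a = rng.normal(size=d)  # attribute context ("follow-up")
E_v = rng.normal(size=d)  # numeric value (30)
E_u = rng.normal(size=d)  # unit ("months")

# A learned projection W (random here) maps the concatenation back to size d.
W = rng.normal(size=(d, 3 * d))
E_comp = layer_norm(W @ np.concatenate([E_a, E_v, E_u]))
print(E_comp.shape)  # (32,)
```

Because all three components feed the final vector, changing any one of them (say, the unit from months to years) moves the whole composite embedding.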
Example:
| Input | Composite representation meaning |
|---|---|
| 5 km | distance measurement |
| 5 kg | weight measurement |
| 5 years | temporal duration |
Under standard embeddings, these may cluster closely. Under CONE, they occupy different regions of vector space.
### Handling ranges and uncertainty
Real-world datasets rarely contain simple scalars.
Medical or engineering data often appear as ranges or distributions:
| Example | Meaning |
|---|---|
| 18–24 kg/m² | BMI range |
| 1302 ± 0.25 nm | particle size distribution |
CONE represents these using additional components.
Ranges are encoded using center and length:
$$ center = \frac{a + b}{2}, \quad length = |b-a| $$
Gaussian measurements are represented using:
- mean − SD
- mean
- mean + SD
This preserves the statistical structure of the measurement inside the embedding space.
In practical terms, similar ranges cluster together while dissimilar ranges remain separated.
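These encodings are simple enough to sketch directly. A hedged illustration of the two formulas above in plain Python (not the paper's code), using the examples from the table:

```python
def encode_range(a: float, b: float) -> tuple:
    # Range -> (center, length), per the formulas above.
    return ((a + b) / 2.0, abs(b - a))

def encode_gaussian(mean: float, sd: float) -> tuple:
    # Gaussian measurement -> three anchor points: mean-SD, mean, mean+SD.
    return (mean - sd, mean, mean + sd)

print(encode_range(18, 24))         # BMI range 18–24  -> (21.0, 6)
print(encode_gaussian(1302, 0.25))  # 1302 ± 0.25 nm -> (1301.75, 1302, 1302.25)
```

The (center, length) form makes overlap between ranges directly comparable: two ranges with close centers and similar lengths land near each other in embedding space.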
### Numerical reasoning training
To teach the model numeric understanding, the authors introduce masked numeral prediction.
This is analogous to masked language modeling but for numbers.
The training objective combines two losses:
- Magnitude regression (predicting numeric value)
- Classification (predicting numeric category)
$$ L_{\text{num}} = (\log y - \log \hat{y})^2 + \mathcal{L}_{\text{CE}} $$
This hybrid loss ensures the model captures both:
- precise magnitude
- contextual usage
A subtle but important detail: the logarithmic term stabilizes learning across different numeric scales.
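A sketch of this hybrid objective for a single prediction (NumPy; the paper's exact term weighting and category scheme are not given here, so the two losses are simply summed):

```python
import numpy as np

def numeral_loss(y_true, y_pred, logits, class_idx):
    # Magnitude regression in log space: errors are scale-relative,
    # so 30 vs 28 and 3000 vs 2800 incur the same penalty.
    reg = (np.log(y_true) - np.log(y_pred)) ** 2
    # Cross-entropy over numeric-category logits (softmax + negative log-likelihood).
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    ce = -np.log(probs[class_idx])
    return float(reg + ce)

# Hypothetical example: true value 30, predicted 28, three numeric categories.
loss = numeral_loss(30.0, 28.0, np.array([2.0, 0.5, -1.0]), class_idx=0)
```

The log transform is what makes the regression term scale-robust: the penalty depends on the ratio of predicted to true value, not their absolute difference.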
## Findings — Results with visualization
Across multiple benchmarks, CONE significantly improves numerical reasoning.
### Numerical probing tasks
| Task | Metric | Baseline trend | CONE result |
|---|---|---|---|
| List maximum | Accuracy | Degrades with scale | ~0.95 stable |
| Number decoding | RMSE | Large error growth | Lower error |
| Addition | RMSE | Increasing drift | Improved accuracy |
These tests confirm that CONE embeddings preserve numeric magnitude more reliably than standard language embeddings.
### Question answering benchmark (DROP)
| Model | EM | F1 |
|---|---|---|
| NumNet | 83.40 | 86.42 |
| AeNER | 83.69 | 86.98 |
| CONE | 83.74 | 87.28 |
The improvement may appear modest, but in numerical reasoning benchmarks even a fraction of a point is meaningful.
### Structured data retrieval
CONE truly shines in table matching tasks.
| Dataset | Recall@10 improvement |
|---|---|
| WebTables | +25% vs NumNet |
| CancerKG | Highest retrieval accuracy |
| CovidKG | Consistent gains |
In other words, when searching for similar columns or records in large datasets, CONE retrieves more relevant matches.
That capability has clear implications for data integration, knowledge graphs, and automated analytics pipelines.
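To make the retrieval intuition concrete, here is a toy cosine-similarity example with hand-built composite vectors (purely illustrative; these are not the paper's learned embeddings, and the five-slot layout is an assumption):

```python
import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy layout: [attr: age, attr: duration, log-value, unit: years, unit: months]
query      = np.array([0.0, 1.0, np.log(30), 0.0, 1.0])  # "follow-up: 30 months"
col_months = np.array([0.0, 1.0, np.log(30), 0.0, 1.0])  # a duration-in-months column
col_age    = np.array([1.0, 0.0, np.log(30), 1.0, 0.0])  # an "age: 30 years" column

# Despite identical numeric values, the attribute and unit components
# rank the months column above the age column.
print(cosine(query, col_months) > cosine(query, col_age))  # True
```

A value-only embedding would score both columns identically; the attribute and unit components are what break the tie.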
## Implications — What this means for business and AI systems
CONE highlights a broader lesson for applied AI:
Language models are not naturally numerical models.
When companies deploy AI on financial, operational, or scientific data, ignoring numeric semantics creates subtle but dangerous errors.
Three practical implications emerge.
### 1. AI data pipelines need numeric‑aware embeddings
Most enterprise AI stacks rely on general-purpose embeddings for retrieval or RAG pipelines.
If those embeddings misrepresent numbers, downstream reasoning becomes unreliable.
### 2. Structured data requires composite representations
Tabular datasets encode meaning through multiple axes:
- attribute
- value
- units
Treating numbers as tokens discards that structure.
### 3. Numerical reasoning is still an open frontier
Despite rapid progress in LLM capabilities, numeric reasoning remains one of the most fragile aspects of AI systems.
Architectures like CONE demonstrate that improvements may come not from scaling models—but from better representation design.
## Conclusion — The quiet revolution in embeddings
CONE does something deceptively simple.
Instead of forcing numbers to behave like words, it treats them as what they actually are: measurements embedded in context.
The result is a model that preserves magnitude, units, and attribute semantics simultaneously—unlocking more reliable reasoning across scientific, financial, and operational datasets.
For organizations building AI systems around structured data, that distinction may prove decisive.
Because when machines finally understand numbers properly, a surprising amount of “AI hallucination” disappears.
And sometimes, the difference between 30 months and 30 years matters quite a lot.
Cognaptus: Automate the Present, Incubate the Future.