Mind the Units: Why LLMs Still Can't Count (And How CONE Fixes It)

Numbers look harmless until they enter a business database.

A revenue field says 50. A dosage field says 50. An age field says 50. A follow-up period says 50. A unit may be present, missing, abbreviated, buried in the column header, or inconsistently written as ml, mL, or something the spreadsheet inherited from a PDF extraction pipeline during its villain era.

A human reader sees the trap immediately. 50 years, 50 kg, 50 mg, and 50 months are not four stylistic variants of the same idea. They are different measurements. They live in different semantic neighborhoods. They should not collapse into one generic “fifty-ish” representation.

Many language-model embedding systems, however, are still too casual here. They are excellent at text similarity, but numerical data is not merely text with digits sprinkled in. This is the central point of CONE, a 2026 paper by Gyanendra Shrestha, Anna Pyayt, and Michael Gubanov titled “CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics.”¹ The paper is not simply saying, “LLMs are bad at arithmetic.” That would be the easy version, and naturally the less useful one.

The sharper argument is this: many AI systems fail with numbers before reasoning even begins, because their embedding space does not preserve what numbers mean.

The real failure is semantic collapse, not just weak arithmetic

When people complain that language models “cannot count,” they usually point to calculation errors. The model adds badly, compares badly, or produces a confident answer that has the same relationship to arithmetic as a weather forecast has to astrology. Entertaining, but not ideal.

CONE focuses on a more basic failure. Before a model can reason over numerical data, it must first represent that data in a way that preserves three things:

| Component | What it captures | Why it matters |
| --- | --- | --- |
| Numerical value | Magnitude and distance | `60` should be closer to `61` than to `600` |
| Unit | Measurement scale | `5 kg` and `5 km` are not interchangeable |
| Attribute | Variable meaning | `Age: 30` and `Follow-up: 30 months` should not match just because the numbers overlap |

This matters especially in structured data. Tables do not encode meaning only through cell values. Meaning is distributed across headers, units, row context, column context, and sometimes conventions that are obvious to domain experts but invisible to a generic tokenizer.

The paper gives a useful example: an Age column and a Follow-up (months) column may contain overlapping numerical distributions. A text-based or number-naive embedding model may place them close together because the values look similar. But operationally they are different variables. Matching them is not “almost right.” It is wrong in a way that can poison downstream analytics.

For enterprise AI, this is not a philosophical inconvenience. RAG systems, schema matching tools, medical knowledge graphs, financial data pipelines, and automated data-cleaning workflows often depend on vector similarity. If the vector space treats numerically similar but semantically different fields as close neighbors, retrieval becomes a quiet source of error.

Quiet errors are the expensive kind. They do not crash the system. They simply make the system look competent while being wrong.

CONE changes the representation before asking the model to reason

The mechanism is refreshingly direct. CONE does not ask a general-purpose encoder to magically infer measurement semantics from raw text. It builds a composite embedding that separates the parts of numerical meaning and then recombines them.

For a scalar numerical entry, CONE constructs a representation from:

$$ \text{attribute} \oplus \text{value} \oplus \text{unit} $$

The model uses a transformer encoder initialized from BioBERT, but it changes how numerical tokens are handled. Instead of letting the standard tokenizer split numbers as if they were ordinary subword fragments, CONE treats numerical values as single numerical tokens when needed and augments them with value-aware embeddings. The paper uses DICE-style numerical embeddings as one component to preserve magnitude, then fuses these with contextual token embeddings through a number-specific transformer block.

That is the first mechanism: numbers get magnitude-aware treatment.
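To make "magnitude-aware" concrete, here is a minimal sketch of the angle-mapping idea behind DICE-style embeddings, reduced to two dimensions. It is an illustration under stated assumptions, not CONE's implementation: the function names and the `[0, 1000]` value range are hypothetical, and the real method works in a higher-dimensional space and is fused with contextual token embeddings.

```python
import math

def angle_embedding(x: float, lo: float, hi: float) -> tuple[float, float]:
    """Map a number in [lo, hi] to a point on the upper unit half-circle."""
    x = min(max(x, lo), hi)                  # clamp out-of-range values
    theta = math.pi * (x - lo) / (hi - lo)   # numeric distance becomes an angle
    return (math.cos(theta), math.sin(theta))

def cosine_distance(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Assumed value range [0, 1000], chosen only for this example.
e60, e61, e600 = (angle_embedding(v, 0.0, 1000.0) for v in (60, 61, 600))

# 60 sits closer to 61 than to 600 in the embedding space.
print(cosine_distance(e60, e61) < cosine_distance(e60, e600))  # True
```

Because the angle grows linearly with the value, cosine distance grows monotonically with the numeric gap inside the chosen range. A subword tokenizer offers no such guarantee.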

The second mechanism is the composite structure. CONE concatenates embeddings for the attribute, value, and unit, then uses a slot-based structure with zero padding and masking so that the resulting representation has fixed dimensionality. A linear autoencoder projection compresses the concatenated slots into the final embedding, followed by layer normalization.

The point is not architectural decoration. The point is to keep separate sources of meaning separate long enough for the model not to confuse them. Attribute, value, and unit each contribute to the final distance. If the unit and attribute are fixed, distance should mostly reflect the value. If the value is the same but the unit changes, the embedding should move. If the value and unit look similar but the attribute changes, it should move again.
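The slot idea can be sketched in a few lines. This is a toy version under loud assumptions: the slot width, output width, and random projection matrix `W` are hypothetical stand-ins, and the slot vectors are random rather than produced by a trained encoder. It shows only the shape of the computation: concatenate, zero-pad and mask what is missing, project linearly, then normalize.

```python
import math
import random

SLOT_DIM = 4   # hypothetical width of each slot
OUT_DIM = 6    # hypothetical width of the final embedding
_rng = random.Random(0)
# Stand-in for a learned linear projection (random here, trained in practice).
W = [[_rng.gauss(0, 1) for _ in range(OUT_DIM)] for _ in range(3 * SLOT_DIM)]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def composite(attribute, value, unit=None):
    """Concatenate attribute | value | unit slots; a missing unit is zero-padded and masked."""
    mask = [1, 1, 0 if unit is None else 1]
    slots = [attribute, value, unit if unit is not None else [0.0] * SLOT_DIM]
    x = [v * m for slot, m in zip(slots, mask) for v in slot]
    projected = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(OUT_DIM)]
    return layer_norm(projected), mask

def random_slot():
    """Random stand-in for an encoder output."""
    return [_rng.gauss(0, 1) for _ in range(SLOT_DIM)]

dose, fifty, mg, ml = random_slot(), random_slot(), random_slot(), random_slot()
e_mg, _ = composite(dose, fifty, mg)
e_ml, _ = composite(dose, fifty, ml)
e_none, mask = composite(dose, fifty)   # no unit available

print(len(e_mg), mask, e_mg != e_ml)
```

Every entry ends up with the same dimensionality whether or not a unit was found, and swapping only the unit slot changes the final vector.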

That is the boring engineering sentence. It is also the whole business case.

Ranges and Gaussians are not exotic edge cases in real data

The paper also handles a problem that many AI demos politely ignore: real numerical data often does not arrive as clean scalar values.

A medical table may contain blood pressure ranges. A materials-science table may report particle size as a mean with standard deviation. A finance table may contain intervals, estimates, or thresholds. Treating all of these as flat text strings is a good way to make the retrieval layer look modern and behave like a filing cabinet that had a nervous breakdown.

CONE represents ranges by decomposing them into center and length:

$$ \text{center} = \frac{a+b}{2}, \quad \text{length} = |b-a| $$

So a range is not merely the string "18-24". It becomes a structure with location and spread. Two nearby ranges can be close. Wider or shifted ranges can be farther away.
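The decomposition is easy to try by hand. The sketch below assumes simple non-negative ranges written as `a-b`; the function names are illustrative, and the Euclidean distance over (center, length) is the analytic reference used in the paper's geometry tests, not the embedding itself.

```python
import math
import re

def parse_range(text: str) -> tuple[float, float]:
    """Turn a string like '18-24' into (center, length). Non-negative bounds only."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"not a simple numeric range: {text!r}")
    a, b = float(match.group(1)), float(match.group(2))
    return (a + b) / 2, abs(b - a)

def range_distance(r1: str, r2: str) -> float:
    """Euclidean distance between two ranges in (center, length) space."""
    return math.dist(parse_range(r1), parse_range(r2))

print(parse_range("18-24"))                # (21.0, 6.0)
print(range_distance("18-24", "19-25"))    # 1.0: same width, shifted by one
print(range_distance("18-24", "18-60"))    # about 40.25: wider and shifted
```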

For Gaussian-style values, the paper represents distributional information through components based on the mean and standard deviation. The implementation uses a composite structure analogous to the range representation, so the distribution is encoded together with its context rather than flattened into text.
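One way to see why mean and standard deviation are a natural pair of components: for two one-dimensional Gaussians, the 2-Wasserstein distance (the analytic reference the paper's geometry tests use for Gaussians) has a closed form that is simply the Euclidean distance in (mean, standard deviation) space. The sketch below computes that reference distance. Whether CONE's Gaussian slots are built exactly this way is not something this article claims, and the function name is illustrative.

```python
import math

def w2_gaussian(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """2-Wasserstein distance between N(mu1, sigma1^2) and N(mu2, sigma2^2) in one dimension."""
    return math.hypot(mu1 - mu2, sigma1 - sigma2)

# Illustrative particle sizes reported as mean ± standard deviation.
print(w2_gaussian(100.0, 5.0, 103.0, 9.0))   # 5.0
print(w2_gaussian(100.0, 5.0, 100.0, 5.0))   # 0.0
```

The parallel with ranges is direct: center and length there, mean and spread here.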

This matters because business datasets rarely wait until they are perfectly normalized before causing trouble. Procurement data, lab measurements, operational dashboards, and financial reports routinely mix exact values, ranges, and estimates. A retrieval system that only behaves well on clean scalar values is not robust. It is rehearsed.

The distance tests show whether the embedding space learned the right geometry

The paper’s first major evidence block is not a downstream leaderboard. It is a geometry check.

That is the right order. If the claim is that CONE preserves numerical meaning, then we should first ask whether embedding distances behave like numerical distances. Otherwise, downstream gains may be accidental.

The authors compare analytic distances against embedding cosine distances for scalars, ranges, and Gaussians. The results are stark:

| Object type | Distance notion | BioBERT correlation | CONE correlation | Interpretation |
| --- | --- | --- | --- | --- |
| Scalars | Absolute difference | Pearson 0.067 / Spearman 0.064 | Pearson 0.989 / Spearman 0.798 | CONE strongly restores magnitude geometry |
| Ranges | Euclidean over center/length | Pearson 0.398 / Spearman 0.355 | Pearson 0.997 / Spearman 0.786 | CONE makes nearby ranges embed nearby |
| Ranges | IoU/Jaccard-style distance | Pearson 0.267 / Spearman 0.248 | Pearson 0.498 / Spearman 0.465 | Improvement is real, though weaker than center/length geometry |
| Gaussians | 2-Wasserstein distance | Pearson 0.038 / Spearman 0.039 | Pearson 0.689 / Spearman 0.663 | CONE captures distributional distance better, but this is the hardest case |

The scalar result is the cleanest. BioBERT’s embedding distances barely align with numeric distance. CONE’s scalar Pearson correlation reaches 0.989. In plain English: the vector space starts to behave as if numbers have magnitude. Revolutionary stuff, apparently.

The Gaussian result is more modest and more interesting. CONE improves substantially over BioBERT, but it does not reach the same strength as scalar and range representations. That boundary matters. The paper supports the claim that CONE improves distributional numerical representation; it does not prove that all uncertainty-bearing measurements are now solved.

The mechanism-first reading helps here. The distance analysis is not just “nice supporting evidence.” It checks whether the internal representation has the geometry that downstream retrieval will later depend on.
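The same kind of geometry check can be run against any embedding function. The sketch below is an assumption-laden miniature, not the paper's evaluation code: `toy_embed` is a hypothetical stand-in for a real model, the value list is arbitrary, and only the scalar case (absolute difference versus cosine distance) is covered.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            out[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def geometry_check(values, embed):
    """Correlate |a - b| with embedding cosine distance over all value pairs."""
    pairs = list(itertools.combinations(values, 2))
    analytic = [abs(a - b) for a, b in pairs]
    embedded = [cosine_distance(embed(a), embed(b)) for a, b in pairs]
    return pearson(analytic, embedded), spearman(analytic, embedded)

def toy_embed(x):
    """Hypothetical stand-in for a model: an angle on a half-circle over [0, 1000]."""
    theta = math.pi * x / 1000.0
    return (math.cos(theta), math.sin(theta))

VALUES = [3, 10, 60, 61, 250, 600, 990]
p, s = geometry_check(VALUES, toy_embed)
print(f"Pearson={p:.3f} Spearman={s:.3f}")
```

Swapping `toy_embed` for a production embedding call turns this into a quick sanity test: if the correlations hover near zero, downstream retrieval over numeric fields is running on luck.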

The downstream tests ask whether better geometry becomes better retrieval

After the geometry checks, the paper moves to practical tasks: numerical reasoning, column matching, tuple matching, and schema matching.

The DROP benchmark gives a useful but limited signal. CONE reports a test F1 of 87.28, compared with 86.42 for NumNet, 77.91 for NC-BERT, and 86.98 for AeNER. This is an improvement, but not the most important part of the paper. CONE is only slightly ahead of strong numerical-reasoning baselines: the margin over AeNER, the closest of them, is 0.30 F1 points. Useful, yes. A revolution in question answering, no.

The stronger business-relevant evidence appears in structured retrieval.

For column and tuple matching across datasets such as CancerKG, CovidKG, WebTables, CIUS, and SAUS, CONE outperforms TAPAS, NumNet, NC-BERT, Magneto, CARTE, and several general-purpose retrieval embedding models including BGE-M3, Stella, Qwen3, and KaLM.

A few numbers are worth keeping:

| Task | Reported result | Business reading |
| --- | --- | --- |
| WebTables column matching | CONE Recall@10 reaches 0.950 | Better candidate retrieval for heterogeneous tables |
| WebTables tuple matching | CONE Recall@10 reaches 0.900 | Better matching of records with mixed text/numeric values |
| WebTables vs NumNet | +25 percentage points Recall@10 in column matching | Numeric-aware retrieval matters when table semantics are noisy |
| WebTables vs NumNet | +17.7 percentage points Recall@10 in tuple matching | Gains are not limited to column headers |
| Vector lookup | Top-10 retrieval from 200K indexed vectors in 8.57 ms | The approach is not merely theoretical; it can sit inside a retrieval pipeline |

The paper also reports schema matching results on GDC, Magellan, WikiData, Open Data, ChEMBL, and TPC-DI. CONE is comparable to or better than the reported baselines on recall. Magneto sometimes has slightly better MRR because it uses LLM-based re-ranking, but CONE achieves competitive retrieval without LLM calls.
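For readers who want to reproduce this style of comparison on their own tables, the recall and MRR metrics are short to write down. The sketch uses one common definition of each; the paper's exact evaluation protocol is not reproduced here, and the column names and rankings are made up for illustration.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant, k=10):
    """1 / position of the first relevant item within the top-k, else 0."""
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked list, relevant set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

# Hypothetical rankings returned for two query columns.
runs = [
    (["age_years", "followup_months", "dob"], {"age_years"}),
    (["weight_kg", "dose_mg", "dose_ml"], {"dose_mg"}),
]

print(mean_over_queries(recall_at_k, runs, k=1))       # 0.5
print(mean_over_queries(recall_at_k, runs, k=10))      # 1.0
print(mean_over_queries(reciprocal_rank, runs, k=10))  # 0.75
```

Recall@k asks whether the right candidate made the shortlist at all; MRR rewards putting it first. A re-ranker can improve the second without changing the first, which is how the Magneto comparison reads.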

That distinction is operationally important. In production systems, “use an LLM to rerank everything” is often the most expensive possible way to discover that mg and mL are different. It may be acceptable for high-value workflows. It is not always acceptable for high-throughput data infrastructure.

CONE’s practical promise is cheaper first-stage retrieval: make the embedding space less foolish before asking more expensive models to reason over candidates.
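The two-stage pattern looks roughly like this. It is a sketch under assumptions: the index is a tiny in-memory dictionary with made-up vectors, the search is brute force (a production system would use an approximate nearest-neighbor index), and `expensive_scorer` is a hypothetical placeholder for an LLM re-ranking call.

```python
import heapq
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def first_stage(query_vec, index, k=10):
    """Brute-force top-k by cosine similarity; index maps item id -> vector."""
    return heapq.nlargest(
        k, index, key=lambda item_id: cosine_similarity(query_vec, index[item_id])
    )

def retrieve(query_vec, index, k=10, rerank=None):
    """Cheap vector retrieval first; the optional re-ranker only sees k candidates."""
    candidates = first_stage(query_vec, index, k)
    if rerank is None:
        return candidates
    return sorted(candidates, key=rerank, reverse=True)

# Made-up 2-D vectors standing in for real embeddings.
index = {
    "dose_mg": (1.0, 0.0),
    "dose_mg_alt": (0.9, 0.1),
    "dose_ml": (0.0, 1.0),
    "age_years": (-1.0, 0.0),
}

calls = []
def expensive_scorer(item_id):
    """Placeholder for a costly re-ranking call; it just records being invoked."""
    calls.append(item_id)
    return 1.0 if item_id == "dose_mg" else 0.0

print(retrieve((1.0, 0.0), index, k=2))   # ['dose_mg', 'dose_mg_alt']
retrieve((1.0, 0.0), index, k=2, rerank=expensive_scorer)
print(len(calls))                          # 2: only the shortlist was re-ranked
```

The better the first stage separates `dose_mg` from `dose_ml`, the smaller `k` can be, and the fewer expensive calls the pipeline needs.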

The ablations show that the composite design is doing real work

The ablation study is not a second thesis. It answers a narrow but important question: which parts of CONE matter?

The authors remove four components in separate variants:

| Variant | Removed component | Likely purpose of test |
| --- | --- | --- |
| CONE1 | Numerical value embedding module | Tests whether magnitude-aware number representation matters |
| CONE2 | Composite embedding structure | Tests whether attribute-unit-value composition matters |
| CONE3 | Unit component | Tests whether units add independent value |
| CONE4 | Range and Gaussian encoding | Tests whether complex numerical forms matter |

The results are consistent: removing any major component reduces Recall@10, MAP@10, and MRR@10 across column and tuple matching tasks.

The largest practical warning comes from removing the composite structure. CONE2 shows a Recall decrease of 16.7 percentage points for tuple matching on CancerKG. Removing the numerical module also hurts sharply, including a 10-point Recall drop in reported cases. Removing units reduces Recall by up to 8 points. Removing range and Gaussian handling causes a smaller but still visible reduction, up to 4.8 points.

This is useful because it prevents a lazy interpretation: “Just add a numeric embedding and the job is done.” No. The paper suggests that numeric magnitude helps, but business-grade structured data also needs attribute and unit context. A number-aware model that ignores units is still only partially house-trained.

The business value is better data infrastructure, not a smarter chatbot

The natural but wrong framing is to treat CONE as another attempt to make chatbots better at math. The more relevant framing is data infrastructure.

CONE is about embedding structured numerical data so that retrieval and matching behave correctly. That maps to several enterprise use cases:

| Enterprise problem | Where CONE-like embeddings help | Boundary |
| --- | --- | --- |
| RAG over financial or operational tables | Retrieves fields with similar measurement meaning, not merely similar tokens | Needs reliable parsing of units and attributes |
| Data lake search | Finds semantically equivalent columns across messy sources | Rare or ambiguous units remain difficult |
| Schema matching | Improves candidate matching before rule-based or LLM-based validation | Does not replace governance or human approval |
| Medical or scientific knowledge graphs | Distinguishes measurements with overlapping values but different variables | Domain-specific evaluation is still needed |
| Automated analytics pipelines | Reduces silent mismatches in joins, search, and feature discovery | Quality depends on table extraction and metadata quality |

Cognaptus’ inference is straightforward: numeric-aware embeddings should be treated as a specialized infrastructure layer for structured AI systems. They are not a decorative add-on. They shape what the system can retrieve, compare, and reason over before the final model ever sees the prompt.

This also changes how organizations should evaluate AI retrieval systems. A generic semantic-search benchmark is not enough. If the business domain contains measurements, the evaluation set should include adversarial numeric cases:

- same value, different unit;
- same unit, different attribute;
- same attribute, incompatible unit;
- overlapping distributions, different business meaning;
- abbreviations and synonyms in column names;
- scalar versus range versus distributional value;
- rare units and rare attribute-unit pairs.

A RAG system that passes ordinary text retrieval tests may still fail these cases. That is not an edge-case failure. That is a sign that the benchmark was too polite.

The parser is part of the model’s fate

One implementation detail deserves more attention than it usually gets: CONE extracts attribute names, numerical values, and unit symbols using rule-based and regular-expression parsing adapted from prior work on unit identification. It also applies unit canonicalization, so variants like ml and mL can map to a consistent representation. Missing units may be inferred from surrounding column or tuple context; when no unit information exists, the unit component is zero-padded.

This is practical. It is also a dependency.

The representation can only preserve semantics that the preprocessing layer successfully identifies. If a PDF table extraction step mangles the unit, if a column header is ambiguous, if a local business abbreviation is undocumented, or if a rare unit appears once in the dataset, the embedding may still be wrong.
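A minimal sketch of this kind of preprocessing shows how much rides on it. The pattern and the canonicalization table below are hypothetical and far smaller than anything usable in production; they are not the rules from the paper or from the prior work it adapts. The `header_unit` parameter stands in for a unit inferred from column context.

```python
import re

# Hypothetical canonicalization table; a real one would be much larger.
CANONICAL_UNITS = {
    "ml": "mL", "mg": "mg", "kg": "kg",
    "month": "month", "months": "month",
    "year": "year", "years": "year", "yrs": "year",
}

# "<attribute>: <number> <optional unit>", for example "Dose: 40 ml".
ENTRY = re.compile(
    r"^\s*(?P<attribute>[^:]+?)\s*:\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-zµ%]+)?\s*$"
)

def parse_entry(text, header_unit=None):
    """Split an entry into attribute, value, and canonical unit (None if unknown)."""
    match = ENTRY.match(text)
    if match is None:
        return None                      # unparseable: nothing for the embedding to preserve
    raw_unit = match.group("unit") or header_unit
    # Unknown units pass through unchanged, which is where rare units start to hurt.
    unit = CANONICAL_UNITS.get(raw_unit.lower(), raw_unit) if raw_unit else None
    return {
        "attribute": match.group("attribute"),
        "value": float(match.group("value")),
        "unit": unit,
    }

print(parse_entry("Dose: 40 ml"))                           # unit canonicalized to 'mL'
print(parse_entry("Follow-up: 30", header_unit="months"))   # unit taken from the header
print(parse_entry("Age: 50"))                               # unit is None
print(parse_entry("see note 3"))                            # None
```

Each `None` in that output is a slot the embedding has to zero-pad or a row it never sees correctly. The model's geometry can only be as good as what this layer hands over.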

The paper’s challenging-case discussion makes this point clearly. In one CovidKG example, columns with highly similar names and identical units—such as systolic versus diastolic blood pressure variants—can remain close despite meaningful differences in values. The authors attribute this partly to rare occurrence: some attribute-unit combinations appear only once in the relevant dataset. Sparse context makes embedding formation harder.

This limitation is not a footnote to be waved away. It tells businesses where the implementation risk sits. CONE improves representation once the system can identify value, unit, and attribute. It does not magically repair every upstream metadata failure.

What the paper directly shows, and what businesses should infer

The paper directly shows three things.

First, CONE’s embedding space better preserves numerical distance for scalars, ranges, and Gaussians than BioBERT, with especially strong results for scalar values and center/length-based range distances.

Second, CONE improves numerical reasoning performance on DROP, reaching 87.28 F1 on the test set and slightly exceeding the reported AeNER result.

Third, CONE improves retrieval-oriented structured data tasks, including column matching, tuple matching, and schema matching, across several large-scale datasets. Its strongest practical evidence is in Recall@10, MAP@10, and MRR@10 gains for table matching.

The business inference is narrower but valuable: when structured numerical data matters, companies should not rely blindly on general-purpose text embeddings. Numeric-aware, unit-aware, attribute-aware representations can reduce retrieval errors before expensive reasoning layers are invoked.

What remains uncertain is deployment breadth. The paper evaluates large and diverse datasets, including medical, web, government, finance-related, and schema-matching benchmarks. But production data has its own charming habits: local abbreviations, broken headers, OCR scars, inconsistent measurement conventions, and old spreadsheets with names like “final_v7_REAL.xlsx.” CONE-like methods need domain-specific validation before being trusted inside high-stakes workflows.

The practical lesson: fix the vector space before blaming the model

CONE’s deeper lesson is not that every business should immediately rebuild its embedding stack around this exact architecture. The lesson is that representation design still matters.

The current AI industry often treats embeddings as a solved utility layer: choose a strong general model, index everything, retrieve chunks, and let the LLM sort out the mess. That works surprisingly well for prose. It works less well when the “chunk” is a measurement, a column, a tuple, or a distribution.

For numerical data, the model needs a vector space where distance means something. Age: 50 years should be close to age-like fields, not to follow-up periods that merely share similar values. Dose: 40 mg should not be confused with Dose: 40 mL. A range should preserve both position and width. A Gaussian should preserve at least some distributional structure.

CONE does not make numerical AI perfect. It does something more useful: it identifies where the failure begins and provides a mechanism for reducing it.

The old mistake was pretending that numbers are just words with digits.

The better approach is to treat numbers as measurements embedded in context. A small distinction, perhaps. But in business systems, small distinctions are often where the invoice, the diagnosis, or the risk model quietly goes to die.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gyanendra Shrestha, Anna Pyayt, and Michael Gubanov, “CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics,” arXiv:2603.04741, 2026, https://arxiv.org/html/2603.04741.
