Numbers look harmless until they enter a business database.
A revenue field says 50. A dosage field says 50. An age field says 50. A follow-up period says 50. A unit may be present, missing, abbreviated, buried in the column header, or inconsistently written as ml, mL, or something the spreadsheet inherited from a PDF extraction pipeline during its villain era.
A human reader sees the trap immediately. 50 years, 50 kg, 50 mg, and 50 months are not four stylistic variants of the same idea. They are different measurements. They live in different semantic neighborhoods. They should not collapse into one generic “fifty-ish” representation.
Many language-model embedding systems, however, are still too casual here. They are excellent at text similarity, but numerical data is not merely text with digits sprinkled in. This is the central point of CONE, a 2026 paper by Gyanendra Shrestha, Anna Pyayt, and Michael Gubanov titled “CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics.” [1] The paper is not simply saying, “LLMs are bad at arithmetic.” That would be the easy version, and naturally the less useful one.
The sharper argument is this: many AI systems fail with numbers before reasoning even begins, because their embedding space does not preserve what numbers mean.
The real failure is semantic collapse, not just weak arithmetic
When people complain that language models “cannot count,” they usually point to calculation errors. The model adds badly, compares badly, or produces a confident answer that has the same relationship to arithmetic as a weather forecast has to astrology. Entertaining, but not ideal.
CONE focuses on a more basic failure. Before a model can reason over numerical data, it must first represent that data in a way that preserves three things:
| Component | What it captures | Why it matters |
|---|---|---|
| Numerical value | Magnitude and distance | 60 should be closer to 61 than to 600 |
| Unit | Measurement scale | 5 kg and 5 km are not interchangeable |
| Attribute | Variable meaning | Age: 30 and Follow-up: 30 months should not match just because the numbers overlap |
This matters especially in structured data. Tables do not encode meaning only through cell values. Meaning is distributed across headers, units, row context, column context, and sometimes conventions that are obvious to domain experts but invisible to a generic tokenizer.
The paper gives a useful example: an Age column and a Follow-up (months) column may contain overlapping numerical distributions. A text-based or number-naive embedding model may place them close together because the values look similar. But operationally they are different variables. Matching them is not “almost right.” It is wrong in a way that can poison downstream analytics.
For enterprise AI, this is not a philosophical inconvenience. RAG systems, schema matching tools, medical knowledge graphs, financial data pipelines, and automated data-cleaning workflows often depend on vector similarity. If the vector space treats numerically similar but semantically different fields as close neighbors, retrieval becomes a quiet source of error.
Quiet errors are the expensive kind. They do not crash the system. They simply make the system look competent while being wrong.
CONE changes the representation before asking the model to reason
The mechanism is refreshingly direct. CONE does not ask a general-purpose encoder to magically infer measurement semantics from raw text. It builds a composite embedding that separates the parts of numerical meaning and then recombines them.
For a scalar numerical entry, CONE constructs a representation from:
$$ \text{attribute} \oplus \text{value} \oplus \text{unit} $$
The model uses a transformer encoder initialized from BioBERT, but it changes how numerical tokens are handled. Instead of letting the standard tokenizer split numbers as if they were ordinary subword fragments, CONE treats numerical values as single numerical tokens when needed and augments them with value-aware embeddings. The paper uses DICE-style numerical embeddings as one component to preserve magnitude, then fuses these with contextual token embeddings through a number-specific transformer block.
That is the first mechanism: numbers get magnitude-aware treatment.
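To make "magnitude-aware" concrete, here is a deliberately simplified two-dimensional sketch in the spirit of angle-based numerical embeddings. It is not the paper's module: the function names, the assumed value range `[lo, hi]`, and the choice of two dimensions are all illustrative assumptions.

```python
import math

def angle_embedding(x: float, lo: float, hi: float) -> tuple[float, float]:
    """Map x in [lo, hi] to a unit vector whose angle grows linearly with x.

    Assumption: the value range is known in advance. This is a toy stand-in,
    not the numerical embedding module described in the paper.
    """
    if not lo <= x <= hi:
        raise ValueError("x outside the assumed value range")
    theta = math.pi * (x - lo) / (hi - lo)
    return (math.cos(theta), math.sin(theta))

def cosine_distance(u, v) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

e60, e61, e600 = (angle_embedding(x, 0.0, 1000.0) for x in (60, 61, 600))

# 60 sits closer to 61 than to 600, as the component table requires.
print(cosine_distance(e60, e61) < cosine_distance(e60, e600))  # True
```

Because the angle between two embeddings is proportional to the difference between the values, cosine distance grows monotonically with numeric distance inside the assumed range. That is the property a subword tokenizer does not give you for free.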
The second mechanism is the composite structure. CONE concatenates embeddings for the attribute, value, and unit, then uses a slot-based structure with zero padding and masking so that the resulting representation has fixed dimensionality. A linear autoencoder projection compresses the concatenated slots into the final embedding, followed by layer normalization.
The point is not architectural decoration. The point is to keep separate sources of meaning separate long enough for the model not to confuse them. Attribute, value, and unit each contribute to the final distance. If the unit and attribute are fixed, distance should mostly reflect the value. If the value is the same but the unit changes, the embedding should move. If the value and unit look similar but the attribute changes, it should move again.
That is the boring engineering sentence. It is also the whole business case.
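The slot idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the slot width `SLOT_DIM`, the helper names, and the toy vectors are hypothetical, and the linear autoencoder projection and layer normalization described above are left out.

```python
SLOT_DIM = 4  # assumed per-slot width, chosen only for illustration

def pad_slot(vec, dim=SLOT_DIM):
    """Return (padded vector, mask); the mask marks which entries are real."""
    if vec is None:  # missing component, e.g. no unit available
        return [0.0] * dim, [0] * dim
    if len(vec) > dim:
        raise ValueError("slot vector longer than slot width")
    pad = dim - len(vec)
    return list(vec) + [0.0] * pad, [1] * len(vec) + [0] * pad

def compose(attribute, value, unit):
    """Concatenate attribute, value, and unit slots into one fixed-length vector."""
    vector, mask = [], []
    for slot in (attribute, value, unit):
        padded, slot_mask = pad_slot(slot)
        vector += padded
        mask += slot_mask
    return vector, mask

# A scalar entry whose unit is unknown: the unit slot is zero-padded and masked.
vec, mask = compose([0.2, 0.1, 0.7, 0.3], [0.9, 0.3], None)
print(len(vec))  # 12
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The output length is the same whether or not a unit is present, which is what lets entries with missing components live in the same index as complete ones.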
Ranges and Gaussians are not exotic edge cases in real data
The paper also handles a problem that many AI demos politely ignore: real numerical data often does not arrive as clean scalar values.
A medical table may contain blood pressure ranges. A materials-science table may report particle size as a mean with standard deviation. A finance table may contain intervals, estimates, or thresholds. Treating all of these as flat text strings is a good way to make the retrieval layer look modern and behave like a filing cabinet that had a nervous breakdown.
CONE represents ranges by decomposing them into center and length:
$$ \text{center} = \frac{a+b}{2}, \quad \text{length} = |b-a| $$
So a range is not merely the string "18-24". It becomes a structure with location and spread. Two nearby ranges can be close. Wider or shifted ranges can be farther away.
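The decomposition is small enough to check by hand. A minimal sketch, with a hypothetical function name and example ranges chosen only for illustration:

```python
import math

def range_to_center_length(a: float, b: float) -> tuple[float, float]:
    """Decompose the range [a, b] into (center, length)."""
    return ((a + b) / 2.0, float(abs(b - a)))

r1 = range_to_center_length(18, 24)
r2 = range_to_center_length(20, 26)  # slightly shifted, same width
r3 = range_to_center_length(18, 60)  # same start, much wider

print(r1)  # (21.0, 6.0)
print(r2)  # (23.0, 6.0)
print(r3)  # (39.0, 42.0)

# The shifted range stays near; the wide one moves away.
print(math.dist(r1, r2) < math.dist(r1, r3))  # True
```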
For Gaussian-style values, the paper represents distributional information through components based on the mean and standard deviation. The implementation uses a composite structure analogous to the range representation, so the distribution is encoded together with its context rather than flattened into text.
This matters because business datasets rarely wait until they are perfectly normalized before causing trouble. Procurement data, lab measurements, operational dashboards, and financial reports routinely mix exact values, ranges, and estimates. A retrieval system that only behaves well on clean scalar values is not robust. It is rehearsed.
The distance tests show whether the embedding space learned the right geometry
The paper’s first major evidence block is not a downstream leaderboard. It is a geometry check.
That is the right order. If the claim is that CONE preserves numerical meaning, then we should first ask whether embedding distances behave like numerical distances. Otherwise, downstream gains may be accidental.
The authors compare analytic distances against embedding cosine distances for scalars, ranges, and Gaussians. The results are stark:
| Object type | Distance notion | BioBERT correlation | CONE correlation | Interpretation |
|---|---|---|---|---|
| Scalars | Absolute difference | Pearson 0.067 / Spearman 0.064 | Pearson 0.989 / Spearman 0.798 | CONE strongly restores magnitude geometry |
| Ranges | Euclidean over center/length | Pearson 0.398 / Spearman 0.355 | Pearson 0.997 / Spearman 0.786 | CONE makes nearby ranges embed nearby |
| Ranges | IoU/Jaccard-style distance | Pearson 0.267 / Spearman 0.248 | Pearson 0.498 / Spearman 0.465 | Improvement is real, though weaker than center/length geometry |
| Gaussians | 2-Wasserstein distance | Pearson 0.038 / Spearman 0.039 | Pearson 0.689 / Spearman 0.663 | CONE captures distributional distance better, but this is the hardest case |
The scalar result is the cleanest. BioBERT’s embedding distances barely align with numeric distance. CONE’s scalar Pearson correlation reaches 0.989. In plain English: the vector space starts to behave as if numbers have magnitude. Revolutionary stuff, apparently.
The Gaussian result is more modest and more interesting. CONE improves substantially over BioBERT, but it does not reach the same strength as scalar and range representations. That boundary matters. The paper supports the claim that CONE improves distributional numerical representation; it does not prove that all uncertainty-bearing measurements are now solved.
The mechanism-first reading helps here. The distance analysis is not just “nice supporting evidence.” It checks whether the internal representation has the geometry that downstream retrieval will later depend on.
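The shape of such a geometry check can be sketched as follows. The analytic distances mirror the table's rows; the closed form for the 2-Wasserstein distance between one-dimensional Gaussians is standard, while the interval IoU definition is an assumption about what "IoU/Jaccard-style" means. `toy_embed` is a hypothetical stand-in for whatever embedding model is under test, and only Pearson correlation is computed here.

```python
import math
from itertools import combinations

def center_length_distance(a1, b1, a2, b2):
    """Euclidean distance between the (center, length) pairs of two ranges."""
    c1, l1 = (a1 + b1) / 2.0, abs(b1 - a1)
    c2, l2 = (a2 + b2) / 2.0, abs(b2 - a2)
    return math.hypot(c1 - c2, l1 - l2)

def iou_distance(a1, b1, a2, b2):
    """1 - intersection/union for intervals [a1, b1], [a2, b2] with a <= b."""
    inter = max(0.0, min(b1, b2) - max(a1, a2))
    union = (b1 - a1) + (b2 - a2) - inter
    if union <= 0:  # both intervals are single points
        return 0.0 if (a1, b1) == (a2, b2) else 1.0
    return 1.0 - inter / union

def w2_gaussian(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between two one-dimensional Gaussians."""
    return math.hypot(mu1 - mu2, sigma1 - sigma2)

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def toy_embed(x, lo=0.0, hi=100.0):
    """Hypothetical embedding under test: angle proportional to the value."""
    theta = math.pi * (x - lo) / (hi - lo)
    return (math.cos(theta), math.sin(theta))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Scalar check: do embedding distances track absolute differences?
values = [10, 20, 35, 60, 61, 90]
pairs = list(combinations(values, 2))
analytic = [abs(x - y) for x, y in pairs]
embedded = [cosine_distance(toy_embed(x), toy_embed(y)) for x, y in pairs]
r = pearson(analytic, embedded)
print(round(r, 3))
```

Swapping `toy_embed` for a real encoder and the scalar pairs for ranges or Gaussians gives the same style of test the table summarizes: one analytic distance, one embedding distance, one correlation.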
The downstream tests ask whether better geometry becomes better retrieval
After the geometry checks, the paper moves to practical tasks: numerical reasoning, column matching, tuple matching, and schema matching.
The DROP benchmark gives a useful but limited signal. CONE reports a test F1 of 87.28, compared with 86.42 for NumNet, 77.91 for NC-BERT, and 86.98 for AeNER. This is an improvement, but not the most important part of the paper. On DROP, CONE is only slightly ahead of strong numerical-reasoning baselines: the margin over AeNER is 0.30 F1 points. Useful, yes. A revolution in question answering, no.
The stronger business-relevant evidence appears in structured retrieval.
For column and tuple matching across datasets such as CancerKG, CovidKG, WebTables, CIUS, and SAUS, CONE outperforms TAPAS, NumNet, NC-BERT, Magneto, CARTE, and several general-purpose retrieval embedding models including BGE-M3, Stella, Qwen3, and KaLM.
A few numbers are worth keeping:
| Task | Reported result | Business reading |
|---|---|---|
| WebTables column matching | CONE Recall@10 reaches 0.950 | Better candidate retrieval for heterogeneous tables |
| WebTables tuple matching | CONE Recall@10 reaches 0.900 | Better matching of records with mixed text/numeric values |
| WebTables vs NumNet | +25 percentage points Recall@10 in column matching | Numeric-aware retrieval matters when table semantics are noisy |
| WebTables vs NumNet | +17.7 percentage points Recall@10 in tuple matching | Gains are not limited to column headers |
| Vector lookup | Top-10 retrieval from 200K indexed vectors in 8.57 ms | The approach is not merely theoretical; it can sit inside a retrieval pipeline |
The paper also reports schema matching results on GDC, Magellan, WikiData, Open Data, ChEMBL, and TPC-DI. CONE is comparable to or better than the reported baselines on recall. Magneto sometimes has slightly better MRR because it uses LLM-based re-ranking, but CONE achieves competitive retrieval without LLM calls.
That distinction is operationally important. In production systems, “use an LLM to rerank everything” is often the most expensive possible way to discover that mg and mL are different. It may be acceptable for high-value workflows. It is not always acceptable for high-throughput data infrastructure.
CONE’s practical promise is cheaper first-stage retrieval: make the embedding space less foolish before asking more expensive models to reason over candidates.
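For readers who want to reproduce this style of evaluation on their own tables, the metrics can be sketched per query as below. Recall@10 and MRR are the figures quoted above; average precision is included because the paper also reports MAP@10. Exact definitions vary between papers, so treat these as common conventions rather than the authors' evaluation code. The column names in the example are hypothetical.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ap_at_k(ranked, relevant, k=10):
    """Average precision over the top k (one common normalization)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

# One query: the retriever returns four candidate columns, two are correct.
ranked = ["dose_mg", "age_years", "dosage_mg", "weight_kg"]
relevant = {"dose_mg", "dosage_mg"}

print(recall_at_k(ranked, relevant, k=10))  # 1.0
print(recall_at_k(ranked, relevant, k=1))   # 0.5
print(mrr_at_k(ranked, relevant))           # 1.0
```

MAP@10 and MRR@10 are then the means of `ap_at_k` and `mrr_at_k` over all queries.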
The ablations show that the composite design is doing real work
The ablation study is not a second thesis. It answers a narrow but important question: which parts of CONE matter?
The authors remove four components in separate variants:
| Variant | Removed component | Likely purpose of test |
|---|---|---|
| CONE1 | Numerical value embedding module | Tests whether magnitude-aware number representation matters |
| CONE2 | Composite embedding structure | Tests whether attribute-value-unit composition matters |
| CONE3 | Unit component | Tests whether units add independent value |
| CONE4 | Range and Gaussian encoding | Tests whether complex numerical forms matter |
The results are consistent: removing any major component reduces Recall@10, MAP@10, and MRR@10 across column and tuple matching tasks.
The largest practical warning comes from removing the composite structure. CONE2 shows a Recall decrease of 16.7 percentage points for tuple matching on CancerKG. Removing the numerical module also hurts sharply, including a 10-point Recall drop in reported cases. Removing units reduces Recall by up to 8 points. Removing range and Gaussian handling causes a smaller but still visible reduction, up to 4.8 points.
This is useful because it prevents a lazy interpretation: “Just add a numeric embedding and the job is done.” No. The paper suggests that numeric magnitude helps, but business-grade structured data also needs attribute and unit context. A number-aware model that ignores units is still only partially house-trained.
The business value is better data infrastructure, not a smarter chatbot
The natural but wrong framing is to treat CONE as another attempt to make chatbots better at math. The more relevant framing is data infrastructure.
CONE is about embedding structured numerical data so that retrieval and matching behave correctly. That maps to several enterprise use cases:
| Enterprise problem | Where CONE-like embeddings help | Boundary |
|---|---|---|
| RAG over financial or operational tables | Retrieves fields with similar measurement meaning, not merely similar tokens | Needs reliable parsing of units and attributes |
| Data lake search | Finds semantically equivalent columns across messy sources | Rare or ambiguous units remain difficult |
| Schema matching | Improves candidate matching before rule-based or LLM-based validation | Does not replace governance or human approval |
| Medical or scientific knowledge graphs | Distinguishes measurements with overlapping values but different variables | Domain-specific evaluation is still needed |
| Automated analytics pipelines | Reduces silent mismatches in joins, search, and feature discovery | Quality depends on table extraction and metadata quality |
Cognaptus’ inference is straightforward: numeric-aware embeddings should be treated as a specialized infrastructure layer for structured AI systems. They are not a decorative add-on. They shape what the system can retrieve, compare, and reason over before the final model ever sees the prompt.
This also changes how organizations should evaluate AI retrieval systems. A generic semantic-search benchmark is not enough. If the business domain contains measurements, the evaluation set should include adversarial numeric cases:
- same value, different unit;
- same unit, different attribute;
- same attribute, incompatible unit;
- overlapping distributions, different business meaning;
- abbreviations and synonyms in column names;
- scalar versus range versus distributional value;
- rare units and rare attribute-unit pairs.
A RAG system that passes ordinary text retrieval tests may still fail these cases. That is not an edge-case failure. That is a sign that the benchmark was too polite.
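A less polite benchmark can start small. The sketch below encodes three of the cases listed above as labeled pairs and reports which ones a matcher gets wrong. Everything here is hypothetical scaffolding: the `Field` and `Case` types, the example values, and both matchers, where `structured_match` merely stands in for a thresholded similarity from a unit-aware embedding model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    attribute: str
    value: float
    unit: str

@dataclass(frozen=True)
class Case:
    name: str
    left: Field
    right: Field
    should_match: bool

CASES = [
    Case("same value, different unit",
         Field("dose", 40, "mg"), Field("dose", 40, "mL"), False),
    Case("same unit, different attribute",
         Field("age", 30, "months"), Field("follow-up", 30, "months"), False),
    Case("unit spelling variant",
         Field("dose", 40, "ml"), Field("dose", 40, "mL"), True),
]

UNIT_ALIASES = {"ml": "mL"}  # assumed canonicalization table

def evaluate(match_fn, cases=CASES):
    """Return the names of cases where match_fn disagrees with the label."""
    return [c.name for c in cases if match_fn(c.left, c.right) != c.should_match]

def value_only_match(a: Field, b: Field) -> bool:
    """Number-naive baseline: looks at the value and nothing else."""
    return a.value == b.value

def structured_match(a: Field, b: Field) -> bool:
    """Stand-in for a matcher that respects attribute, unit, and value."""
    unit_a = UNIT_ALIASES.get(a.unit, a.unit)
    unit_b = UNIT_ALIASES.get(b.unit, b.unit)
    return (a.attribute, unit_a, a.value) == (b.attribute, unit_b, b.value)

print(evaluate(value_only_match))
# ['same value, different unit', 'same unit, different attribute']
print(evaluate(structured_match))  # []
```

In practice the matcher would be the retrieval system under test, and the case list would grow to cover ranges, distributions, abbreviated column names, and rare attribute-unit pairs.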
The parser is part of the model’s fate
One implementation detail deserves more attention than it usually gets: CONE extracts attribute names, numerical values, and unit symbols using rule-based and regular-expression parsing adapted from prior work on unit identification. It also applies unit canonicalization, so variants like ml and mL can map to a consistent representation. Missing units may be inferred from surrounding column or tuple context; when no unit information exists, the unit component is zero-padded.
This is practical. It is also a dependency.
The representation can only preserve semantics that the preprocessing layer successfully identifies. If a PDF table extraction step mangles the unit, if a column header is ambiguous, if a local business abbreviation is undocumented, or if a rare unit appears once in the dataset, the embedding may still be wrong.
The paper’s challenging-case discussion makes this point clearly. In one CovidKG example, columns with highly similar names and identical units—such as systolic versus diastolic blood pressure variants—can remain close despite meaningful differences in values. The authors attribute this partly to rare occurrence: some attribute-unit combinations appear only once in the relevant dataset. Sparse context makes embedding formation harder.
This limitation is not a footnote to be waved away. It tells businesses where the implementation risk sits. CONE improves representation once the system can identify value, unit, and attribute. It does not magically repair every upstream metadata failure.
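To show where that risk sits, here is a minimal sketch of the kind of extraction and canonicalization step described in this section. It is not the paper's parser: the regular expression, the alias table, and the `default_unit` fallback (standing in for inference from column context) are assumptions made for illustration.

```python
import re

# Assumed alias table; a real system would need a much larger, curated one.
UNIT_ALIASES = {"ml": "mL", "mls": "mL", "kgs": "kg", "yr": "years", "yrs": "years"}

# Matches entries shaped like "Attribute: 40 mg"; the unit is optional.
ENTRY = re.compile(
    r"^\s*(?P<attribute>[^:]+?)\s*:\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-zµ%/]+)?\s*$"
)

def canonical_unit(raw):
    """Map known spelling variants to one form; leave unknown units as-is."""
    if raw is None:
        return None
    return UNIT_ALIASES.get(raw.lower(), raw)

def parse_entry(text, default_unit=None):
    """Split an entry into attribute, value, and unit, or return None."""
    match = ENTRY.match(text)
    if match is None:
        return None
    unit = canonical_unit(match.group("unit")) or default_unit
    return {
        "attribute": match.group("attribute").strip().lower(),
        "value": float(match.group("value")),
        "unit": unit,  # None means the unit slot would be zero-padded
    }

print(parse_entry("Dose: 40 ml"))
# {'attribute': 'dose', 'value': 40.0, 'unit': 'mL'}
print(parse_entry("Follow-up: 30", default_unit="months"))
# {'attribute': 'follow-up', 'value': 30.0, 'unit': 'months'}
print(parse_entry("see appendix"))  # None
```

Every entry the pattern fails to match, and every unit the alias table does not know, reaches the embedding layer with less information than a human reader would have. That is the dependency in concrete form.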
What the paper directly shows, and what businesses should infer
The paper directly shows three things.
First, CONE’s embedding space better preserves numerical distance for scalars, ranges, and Gaussians than BioBERT, with especially strong results for scalar values and center/length-based range distances.
Second, CONE improves numerical reasoning performance on DROP, reaching 87.28 F1 on the test set and slightly exceeding the reported AeNER result.
Third, CONE improves retrieval-oriented structured data tasks, including column matching, tuple matching, and schema matching, across several large-scale datasets. Its strongest practical evidence is in Recall@10, MAP@10, and MRR@10 gains for table matching.
The business inference is narrower but valuable: when structured numerical data matters, companies should not rely blindly on general-purpose text embeddings. Numeric-aware, unit-aware, attribute-aware representations can reduce retrieval errors before expensive reasoning layers are invoked.
What remains uncertain is deployment breadth. The paper evaluates large and diverse datasets, including medical, web, government, finance-related, and schema-matching benchmarks. But production data has its own charming habits: local abbreviations, broken headers, OCR scars, inconsistent measurement conventions, and old spreadsheets with names like “final_v7_REAL.xlsx.” CONE-like methods need domain-specific validation before being trusted inside high-stakes workflows.
The practical lesson: fix the vector space before blaming the model
CONE’s deeper lesson is not that every business should immediately rebuild its embedding stack around this exact architecture. The lesson is that representation design still matters.
The current AI industry often treats embeddings as a solved utility layer: choose a strong general model, index everything, retrieve chunks, and let the LLM sort out the mess. That works surprisingly well for prose. It works less well when the “chunk” is a measurement, a column, a tuple, or a distribution.
For numerical data, the model needs a vector space where distance means something. Age: 50 years should be close to age-like fields, not follow-up periods that merely share similar values. Dose: 40 mg should not be confused with Dose: 40 mL. A range should preserve both position and width. A Gaussian should preserve at least some distributional structure.
CONE does not make numerical AI perfect. It does something more useful: it identifies where the failure begins and provides a mechanism for reducing it.
The old mistake was pretending that numbers are just words with digits.
The better approach is to treat numbers as measurements embedded in context. A small distinction, perhaps. But in business systems, small distinctions are often where the invoice, the diagnosis, or the risk model quietly goes to die.
Cognaptus: Automate the Present, Incubate the Future.
[1] Gyanendra Shrestha, Anna Pyayt, and Michael Gubanov, “CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics,” arXiv:2603.04741, 2026, https://arxiv.org/html/2603.04741.