Beyond Cosine: When Order Beats Angle in Embedding Similarity

Search has a small ritual. Take two embeddings, compute cosine similarity, rank the results, and move on. The ritual is fast, familiar, and usually good enough. It is also so deeply embedded in AI infrastructure that many teams treat it less like a modeling choice and more like plumbing.

That is convenient. It is not always innocent.

The paper Beyond Cosine Similarity proposes a direct challenge to that default: a new metric called recos, short for Rearrangement-inequality-based Cosine Similarity.¹ The argument is not that embeddings are broken, nor that cosine similarity is useless. The argument is narrower and more interesting: cosine assumes that semantic similarity should be captured by angular alignment, while some embedding spaces may also encode meaningful ordinal structure across dimensions. In plainer terms, two vectors may not point in exactly the same direction, but their components may still rise and fall in the same order. Cosine may mark that relationship as imperfect. recos asks whether that imperfection is sometimes the wrong penalty.

This matters because many production AI systems are built on one quiet assumption: once the embedding model is selected, similarity scoring is an implementation detail. The paper says, politely enough, that this is a little too comfortable. One might even call it “cosine maximalism,” if one wanted to ruin a perfectly good engineering meeting.

Cosine measures angle; recos measures ordered agreement

Cosine similarity comes from the Cauchy–Schwarz inequality. For vectors $x$ and $y$, cosine normalizes the dot product by the product of their Euclidean norms:

$$ \cos(x,y)=\frac{x \cdot y}{\lVert x\rVert \lVert y\rVert} $$

Its maximum value is reached when the two vectors are linearly dependent. That is a strong condition. It says that one vector must be a scalar multiple of the other. This is mathematically clean, computationally cheap, and geometrically intuitive. It is also a particular theory of what similarity means.
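For reference, the bound behind that normalization is the Cauchy–Schwarz inequality, written here in its standard textbook form with the same symbols as the formula above:

$$ \lvert x \cdot y \rvert \le \lVert x\rVert \lVert y\rVert, \qquad \text{with equality if and only if } x \text{ and } y \text{ are linearly dependent} $$

For nonzero vectors this keeps $\cos(x,y)$ in $[-1, 1]$, and the score reaches 1 only when $y = c\,x$ for some $c > 0$.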

The paper’s alternative starts from a different inequality: the Rearrangement Inequality. Instead of using vector norms as the upper bound for the dot product, it sorts vector components and uses the best possible ordered alignment as the normalizing denominator.
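The Rearrangement Inequality itself, in its standard form, can be written as follows. The symbol $y^{\downarrow}$ (the components of $y$ sorted in descending order) is notation introduced here for illustration; $x^{\uparrow}$ and $y^{\uparrow}$ denote ascending sorts:

$$ x^{\uparrow}\cdot y^{\downarrow} \;\le\; x \cdot y \;\le\; x^{\uparrow}\cdot y^{\uparrow} $$

Sorting does not change a vector's Euclidean norm, so applying Cauchy–Schwarz to the sorted vectors gives $x^{\uparrow}\cdot y^{\uparrow} \le \lVert x\rVert \lVert y\rVert$. The sorted dot product is therefore never a looser upper bound than the cosine denominator.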

For a positive dot product, the paper’s implementation can be read as:

$$ \mathrm{recos}(x,y)=\frac{x \cdot y}{x^{\uparrow}\cdot y^{\uparrow}} $$

where $x^{\uparrow}$ and $y^{\uparrow}$ are the components of $x$ and $y$ sorted in ascending order. For a negative dot product, the denominator uses the oppositely sorted arrangement so that the score remains bounded. The practical implementation sorts both vectors, computes the ordinary dot product, then normalizes by the appropriate rearranged bound.
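The description above can be sketched in a few lines of plain Python. This is a minimal reading of the text, not the paper's reference code: the function names `dot` and `recos` are made up here, the negative branch assumes the score is normalized by the absolute value of the oppositely sorted dot product (so it stays in $[-1, 0)$), and returning 0 for a zero dot product is also an assumption.

```python
def dot(x, y):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def recos(x, y):
    """Sketch of recos: dot product normalized by a rearrangement bound."""
    d = dot(x, y)
    if d == 0:
        # Assumption: a zero dot product is scored as zero similarity.
        return 0.0
    x_up = sorted(x)
    if d > 0:
        # Largest dot product over all reorderings: both vectors ascending.
        bound = dot(x_up, sorted(y))
    else:
        # Smallest dot product over all reorderings: ascending against descending.
        # Assumption: divide by its absolute value so the sign of d is kept.
        bound = abs(dot(x_up, sorted(y, reverse=True)))
    return d / bound


# Components in a different order: the score falls below 1.
print(recos([1, 2, 3], [3, 1, 2]))
# Negative dot product: the score stays in [-1, 0).
print(recos([1, -2, 3], [-3, 2, -1]))
```

Because the bound is the extreme dot product over all reorderings, the ratio cannot leave $[-1, 1]$ under these assumptions.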

The important conceptual shift is this: cosine reaches perfect similarity under linear dependence; recos can reach perfect similarity under ordinal concordance. If two vectors preserve the same ordering of components, recos treats that as a stronger kind of similarity than cosine would.

This is the mechanism-first reading of the paper. Without it, the empirical section looks like a table of small score gains. With it, the table becomes a test of a sharper question: do modern embeddings contain useful rank-order information that cosine leaves partially unused?

The “expert ratings” example explains the metric better than the formula

The paper uses an intuitive example: experts rating candidates. Two experts may assign different absolute scores but still rank the candidates in the same order. One expert might give scores like 2, 8, 5, 7; another might give 1, 10, 4, 9. These are not identical. They are not necessarily proportional. But they express the same preference order.

Cosine is designed to reward angular alignment. It is comfortable when the second vector is basically a scaled version of the first. recos is designed to reward ordered agreement. It is comfortable when the relationship is monotonic but not necessarily linear.
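Working the two experts through both formulas makes the difference concrete. With $x = (2, 8, 5, 7)$ and $y = (1, 10, 4, 9)$, and using the positive-dot-product form of recos given earlier:

$$ x \cdot y = 2 + 80 + 20 + 63 = 165, \qquad \lVert x\rVert \lVert y\rVert = \sqrt{142}\,\sqrt{198} \approx 167.68, \qquad \cos(x,y) \approx 0.984 $$

$$ x^{\uparrow} = (2, 5, 7, 8), \quad y^{\uparrow} = (1, 4, 9, 10), \quad x^{\uparrow}\cdot y^{\uparrow} = 2 + 20 + 63 + 80 = 165, \qquad \mathrm{recos}(x,y) = \frac{165}{165} = 1 $$

Cosine deducts a little because the scores are not proportional. recos gives full marks because the two experts rank the candidates identically.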

That distinction is easy to dismiss in low dimensions. In a real embedding space, however, each dimension is not a neatly labeled business variable. The paper is careful about this point near the end: the semantic interpretation of exact ordinal concordance in high-dimensional, non-axis-aligned embedding spaces is not straightforward. That is the right caution. A dimension in an embedding vector is not “candidate leadership quality” or “invoice urgency.” It is a learned coordinate in a representational system.

Still, the business question survives. We do not need each dimension to be human-readable. We only need to know whether ordered patterns across dimensions carry signal that improves downstream ranking. The paper tests exactly that.

The empirical test is simple: same embeddings, different similarity metrics

The paper evaluates recos on Semantic Textual Similarity benchmarks. These datasets ask whether a similarity score computed from sentence embeddings correlates with human judgments of sentence similarity. The evaluation uses Spearman rank correlation, which is appropriate because the task is fundamentally about ranking pairs by similarity.
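As a rough illustration of that evaluation loop, the sketch below computes Spearman correlation between similarity scores and human ratings in plain Python. The helper names and all numbers are hypothetical toy values, not data from the paper; a real evaluation would more likely call an existing statistics library.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(scores, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(scores), average_ranks(gold))


# Hypothetical toy data: five sentence pairs rated by humans and by two metrics.
human = [4.8, 1.2, 3.5, 0.4, 2.9]
metric_a = [0.91, 0.35, 0.80, 0.30, 0.62]  # same ordering as the human ratings
metric_b = [0.91, 0.35, 0.60, 0.30, 0.62]  # swaps two neighbouring pairs

print(spearman(metric_a, human), spearman(metric_b, human))
```

Only the ordering of the scores matters here, which is why a metric can change its absolute values substantially and still be judged purely on how well it ranks pairs.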

The experimental design is useful because it isolates the metric. The models are not retrained. There is no fine-tuning, no prompt trick, no extra supervised layer, no heroic benchmark grooming. The paper computes embeddings with existing models and compares three similarity measures:

Metric	What it emphasizes	Role in the paper
decos	Near-identity relationships	A stricter baseline derived from a looser bound
cos	Angular / linear dependence	The industry default
recos	Ordinal concordance	The proposed metric

The evaluation spans 11 embedding models across 7 STS datasets, producing 77 model-dataset comparisons. The model set includes static embeddings such as Word2Vec, FastText, and GloVe; contextualized embeddings such as BERT, SGPT, and DPR; and broader universal or specialized embeddings such as E5, BGE, GTE, SPECTER, and CLIP-ViT.

This makes the test reasonably broad. It does not prove recos is best for every retrieval system, but it does reduce the chance that the result is merely one model behaving oddly after too much coffee.

The main result is consistency, not drama

The headline result is not that recos produces a huge average jump. It does not. The paper reports a micro-average score of 66.12 for recos, compared with 65.83 for cosine and 65.65 for decos. The average improvement over cosine is 0.29 points.

That is modest. In many business dashboards, a 0.29-point benchmark improvement would be just large enough to start an argument and too small to end one.

The stronger evidence is consistency. Across the 77 settings, recos beats cosine in 71 cases, ties in 5, and loses in 1. Excluding ties, that is a 98.6% win rate. The appendix reinforces the point with statistical tests: the mean difference is 0.292, the median difference is 0.160, the minimum difference is -0.310, and the maximum improvement is 1.360. The paper also reports that the direction of improvement remains robust under leave-one-dataset-out analysis.
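The win rate, and a feel for how lopsided the split is, can be checked with a back-of-envelope sign test. This is an illustrative calculation from the counts quoted above, not a reproduction of the paper's reported statistics. It also treats the 77 settings as independent, which they are not, since they share models and datasets.

```python
from math import comb

# Counts quoted in the text above.
wins, ties, losses = 71, 5, 1

decided = wins + losses  # ties are excluded
win_rate = wins / decided


def sign_test_two_sided(wins, losses):
    """Exact two-sided sign test under a fair-coin null hypothesis."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(wins, losses)
print(f"win rate excluding ties: {win_rate:.1%}")
print(f"two-sided sign test p-value: {p_value:.2e}")
```

Under the independence assumption the p-value is far below any conventional threshold. The dependence between settings is the reason to lean on the paper's mixed-effects and leave-one-dataset-out analyses rather than on this shortcut.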

So the result should be read carefully:

Evidence	What it supports	What it does not prove
71 wins, 5 ties, 1 loss over 77 settings	recos is consistently better than cosine on these STS comparisons	recos will dominate in every production retrieval workload
Average improvement of 0.29 points	the absolute gain is real but small	the gain is automatically worth latency or infrastructure cost
Larger gains for CLIP-ViT, DPR, and SPECTER	specialized representation spaces may benefit more from ordinal scoring	all multimodal or domain models will benefit equally
Zero-shot evaluation, no fine-tuning	the metric itself contributes the improvement	a trained reranker would not outperform both

That last distinction matters. recos is not competing with a full reranking stack, human feedback, domain-specific labels, query rewriting, or task-tuned retrieval. It is competing with cosine as a drop-in scoring rule.

That makes the paper operationally interesting. A small, consistent improvement that requires no retraining can be attractive. But only if the cost of computing it is acceptable.

The bigger gains appear where cosine’s assumptions look less natural

The model-level pattern is more informative than the overall average.

For static embeddings, the reported gains are small: Word2Vec improves only from 64.91 to 64.93 on average, FastText from 64.94 to 65.19, and GloVe from 61.43 to 61.80. That is useful, but not exactly a fireworks display.

For contextualized and specialized models, the gains become more visible. DPR moves from 60.01 to 60.66. SPECTER moves from 60.08 to 60.56. CLIP-ViT moves from 67.61 to 68.57, with the paper noting peak gains of +1.36 on STS14 and STS15.

This pattern is plausible. Specialized models are trained for objectives that may produce representation spaces less naturally aligned with the assumptions behind ordinary textual cosine similarity. CLIP-ViT, for example, comes from a visual-language alignment setting; SPECTER is designed around scientific-document representations; DPR is trained for dense passage retrieval. Their vector spaces may carry structure that is useful but not fully captured by angular similarity alone.

The paper interprets this as evidence that recos is especially helpful when representation spaces diverge from standard textual similarity assumptions. That is a reasonable inference from the presented results, though it should stay at the level of inference. The experiments show better STS correlation. They do not directly map the internal geometry of each model in a causal way.

Still, for business users, this is the practical clue: the more specialized or oddly trained your embeddings are, the less safe it is to assume cosine is the obvious final answer.

The appendix is mostly robustness, not a second thesis

The appendix is worth reading because it clarifies the character of the evidence.

The experimental configuration is an implementation detail: Linux environment, Python 3.11, ModelScope, PyTorch, Transformers, SentenceTransformers, and Gensim. The point is reproducibility, not conceptual novelty.

The model specification table is also implementation detail. It lists the exact checkpoints used, which matters if someone wants to reproduce the benchmark or challenge the results.

The statistical analysis serves several purposes:

Appendix component	Likely purpose	How to use it in interpretation
Descriptive statistics	Main evidence support	Shows gains are small in average size but consistent in direction
Shapiro-Wilk normality check	Method selection	Justifies using non-parametric tests instead of relying only on a paired t-test
Wilcoxon signed-rank and sign test	Statistical confirmation	Supports that the improvement is unlikely to be random across paired settings
Mixed-effects model	Dependency control	Accounts for model and dataset variation
Leave-one-dataset-out analysis	Robustness test	Checks that one dataset is not carrying the result
Benjamini-Hochberg correction	Multiple-comparison control	Reduces the chance that reported significance is a testing artifact

One subtle point: the appendix reports both a negligible Cohen’s d and a large Wilcoxon effect size. That is not necessarily contradictory. The paper explains that Cohen’s d reflects modest absolute improvement relative to between-model variability, while the Wilcoxon statistic reflects the consistency of improvement direction. In business language: the uplift is usually small, but it keeps showing up. That is very different from a large but fragile gain.
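A small synthetic example shows how both readings can hold at once. Every number below is made up for illustration and none comes from the paper. Cohen's d is computed against the pooled between-model standard deviation, and the rank-based effect size is the matched-pairs rank-biserial correlation; both are common choices, but the paper's exact formulas may differ.

```python
import math
import statistics

# Synthetic per-model scores, for illustration only (NOT the paper's numbers).
cos_scores = [45.0, 52.0, 58.0, 61.0, 66.0, 70.0, 74.0, 79.0]
gains = [0.10, 0.25, 0.15, 0.40, 0.20, 0.30, 0.05, 0.35]
recos_scores = [c + g for c, g in zip(cos_scores, gains)]

diffs = [r - c for r, c in zip(recos_scores, cos_scores)]

# Cohen's d: mean gain relative to the spread of scores across models.
pooled_sd = math.sqrt(
    (statistics.variance(cos_scores) + statistics.variance(recos_scores)) / 2
)
cohens_d = statistics.mean(diffs) / pooled_sd

# Matched-pairs rank-biserial correlation: rank the nonzero |differences|,
# then compare the rank sums of positive and negative differences.
# (The synthetic gains are all distinct, so no tie handling is needed.)
ranked = sorted((d for d in diffs if d != 0), key=abs)
w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
w_minus = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
rank_biserial = (w_plus - w_minus) / (w_plus + w_minus)

print(f"Cohen's d: {cohens_d:.3f}")
print(f"rank-biserial correlation: {rank_biserial:.3f}")
```

The gains are tiny compared with how much the models differ from one another, so d is negligible. Every gain points the same way, so the rank-based effect size is at its maximum.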

What changes for RAG and semantic search teams

For production AI teams, the most useful interpretation is not “replace cosine everywhere.” That would be the kind of conclusion that sounds decisive because it skipped the hard part.

The better interpretation is:

recos is a low-disruption candidate for retrieval evaluation when the embedding model is fixed but ranking quality has plateaued.

That applies to several common workflows.

In RAG systems, recos can be tested as a first-stage similarity function or as a candidate feature before reranking. The possible benefit is improved retrieval of semantically similar passages whose embedding relationship is order-preserving but not strongly angular. The uncertainty is whether STS correlation translates into answer quality, citation accuracy, and reduced hallucination. Those must be measured directly.

In semantic search, recos may improve ranking when the current embedding model works but produces annoying near-misses. The operational test is straightforward: compare cosine and recos on held-out query-document relevance judgments, especially in ambiguous queries and domain-specific terminology.

In deduplication and clustering, recos may help when documents share structure but differ in wording or scale. However, clustering behavior can change in non-obvious ways when similarity scores shift. Teams should inspect cluster stability, not only pairwise accuracy.

In recommendation systems, recos is more speculative. If embeddings are learned from user-item behavior rather than pure semantic text, ordinal concordance may or may not correspond to useful preference similarity. The metric deserves an offline test, not immediate deployment.

In cross-modal retrieval, the CLIP-ViT result is the most intriguing. The paper’s STS benchmark does not directly evaluate image-text retrieval, but the larger gain for CLIP-ViT suggests that mixed-modality representation spaces may be a promising place to test recos. That is a Cognaptus inference, not something the paper directly proves.

The cost is sorting, and sorting is not free at industrial scale

The practical downside is computational. Cosine similarity is cheap: compute a dot product, often after pre-normalizing vectors. recos requires sorting vector components. The paper states the complexity difference clearly: recos adds an $O(n \log n)$ sorting step, while cosine is $O(n)$.

For a single comparison or a small batch, this overhead is usually manageable. For billion-scale approximate nearest neighbor search, it is not a rounding error. Large retrieval systems are engineered around fast vector indexing, cache behavior, precomputed norms, approximate search structures, and hardware-friendly dot products. Adding per-comparison sorting can disturb that architecture.

There are possible mitigations. Sorted representations could be precomputed for stored documents. Approximate versions might use partial sorting, quantization, or learned shortcuts. recos might be applied only to a candidate set returned by cosine, functioning as a lightweight reranking step. The paper itself suggests future work on efficient approximations for billion-scale applications.

That last design may be the most realistic:

1. Use standard ANN search with cosine to retrieve a candidate set.
2. Apply recos to the top $k$ candidates.
3. Compare downstream ranking quality, latency, and cost.
4. Keep it only where the quality gain survives business evaluation.
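The first two steps can be sketched as below. This is a toy illustration under stated assumptions, not a production design: exhaustive cosine scoring stands in for a real ANN index, the function names and the three document vectors are hypothetical, and the negative branch of recos assumes normalization by the absolute value of the oppositely sorted dot product. Document vectors are sorted once ahead of time, which is one of the mitigations mentioned above.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def recos_presorted(x, y, x_sorted, y_sorted):
    """recos using ascending-sorted copies computed ahead of time."""
    d = dot(x, y)
    if d == 0:
        return 0.0  # assumption: zero dot product scores zero
    if d > 0:
        bound = dot(x_sorted, y_sorted)
    else:
        # Assumption: normalize by |ascending . descending| to keep the sign.
        bound = abs(dot(x_sorted, y_sorted[::-1]))
    return d / bound


def cosine_top_k(query, docs, k):
    """Stage 1: exhaustive cosine search, standing in for an ANN index."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]


def recos_rerank(query, docs, candidates, sorted_docs):
    """Stage 2: reorder only the candidates by recos."""
    q_sorted = sorted(query)  # the query is sorted once per request
    return sorted(
        candidates,
        key=lambda doc_id: recos_presorted(query, docs[doc_id], q_sorted, sorted_docs[doc_id]),
        reverse=True,
    )


# Hypothetical toy corpus.
docs = {
    "a": [1.0, 10.0, 4.0, 9.0],  # same component order as the query
    "b": [2.0, 7.0, 5.0, 8.0],   # closer in angle, but two components swap order
    "c": [9.0, 1.0, 8.0, 2.0],   # poor match
}
sorted_docs = {doc_id: sorted(vec) for doc_id, vec in docs.items()}  # precomputed offline

query = [2.0, 8.0, 5.0, 7.0]
candidates = cosine_top_k(query, docs, k=2)
reranked = recos_rerank(query, docs, candidates, sorted_docs)
print(candidates, reranked)
```

In this toy case cosine puts "b" ahead of "a", and recos reverses them because "a" preserves the query's component order exactly. Whether that kind of reordering helps on real data is what the third and fourth steps are for.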

That is less glamorous than announcing a new universal similarity metric. It is also how production systems avoid becoming benchmark-themed art installations.

What the paper directly shows, and what businesses should infer

The paper directly shows that recos improves Spearman correlation with human STS judgments across a broad set of tested embedding models and datasets, with especially consistent directional gains over cosine. It also provides a mathematical derivation showing that recos normalizes the dot product using a tighter Rearrangement Inequality-based bound and relaxes maximal similarity from linear dependence to ordinal concordance.

Cognaptus infers that recos is most relevant where teams already rely on embeddings but have reason to suspect cosine is under-ranking certain relevant pairs. The likely business use is not replacing the embedding model; it is extracting slightly more ranking signal from embeddings already in use.

What remains uncertain is equally important. The paper does not evaluate full RAG answer quality. It does not test large-scale ANN infrastructure. It does not show user-level business metrics such as conversion, analyst time saved, support-ticket resolution, or retrieval latency under production load. It does not prove that ordinal concordance has a stable semantic interpretation in every embedding space.

So the business takeaway is bounded but useful: recos is not a new AI platform strategy. It is a metric-level intervention that may improve retrieval quality without retraining. In mature AI systems, that kind of small lever can matter, especially when every larger lever is expensive, politically tangled, or already overused.

A practical evaluation checklist

For teams considering recos, the paper’s results suggest a disciplined evaluation path:

Question	Recommended test	Decision signal
Does recos improve ranking on our data?	Compare cosine vs. recos on labeled query-document pairs	Higher NDCG, MRR, recall@k, or task-specific relevance
Is the gain concentrated in certain models?	Segment by embedding model and content domain	Use recos only where gains are material
Does the STS-style gain transfer to RAG?	Run answer-quality evaluation with retrieved context	Better groundedness, fewer missing citations, fewer irrelevant chunks
Is latency acceptable?	Benchmark reranking top-$k$ candidates	Keep recos after first-stage retrieval if full search is too costly
Does the metric destabilize retrieval?	Inspect nearest-neighbor changes and cluster shifts	Avoid if improvements come with unpredictable ranking artifacts
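For the first row of the checklist, a minimal offline comparison might look like the sketch below. It uses the standard definitions of recall@k and mean reciprocal rank; the query IDs, document IDs, rankings, and relevance judgments are hypothetical placeholders for a team's own labeled data.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant document, or 0 if none is retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def evaluate(run, judgments, k):
    """Average recall@k and MRR over all judged queries."""
    queries = list(judgments)
    recall = sum(recall_at_k(run[q], judgments[q], k) for q in queries) / len(queries)
    mrr = sum(reciprocal_rank(run[q], judgments[q]) for q in queries) / len(queries)
    return {"recall@k": recall, "mrr": mrr}


# Hypothetical relevance judgments and rankings from two scoring rules.
judgments = {"q1": {"d1", "d4"}, "q2": {"d7"}}
cosine_run = {"q1": ["d2", "d1", "d3", "d4"], "q2": ["d7", "d8", "d9"]}
recos_run = {"q1": ["d1", "d2", "d4", "d3"], "q2": ["d7", "d9", "d8"]}

cosine_metrics = evaluate(cosine_run, judgments, k=3)
recos_metrics = evaluate(recos_run, judgments, k=3)
print("cosine:", cosine_metrics)
print("recos: ", recos_metrics)
```

The same harness can be segmented by embedding model or content domain to cover the second row, with the caveat that a handful of queries like this says nothing about statistical reliability.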

The important habit is to treat similarity scoring as a model choice. Not a sacred default. Not plumbing. Not something hidden in a vector database setting and forgotten until the demo fails.

Cosine is still useful; defaults are just not evidence

The paper’s conclusion is balanced: recos is not presented as a wholesale replacement for cosine. Cosine remains efficient, interpretable, and deeply supported by existing retrieval systems. That is not a small advantage. Infrastructure inertia is not always stupidity; sometimes it is the accumulated wisdom of latency budgets.

But the paper does weaken one lazy assumption: that better embeddings automatically make the similarity metric less important. In fact, the result hints at the opposite. As embedding spaces become more specialized, multimodal, and objective-shaped, the final similarity rule may matter more, not less.

Cosine asks whether two vectors point in the same direction. recos asks whether their components agree in order. In many systems, angle will remain enough. In some systems, order may quietly recover signal that angle leaves on the table.

That is the useful lesson. Not “cosine is dead.” Dead defaults rarely run this much infrastructure. The sharper lesson is that even boring defaults deserve periodic cross-examination.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xinbo Ai, “Beyond Cosine Similarity,” arXiv:2602.05266, 2026. https://arxiv.org/html/2602.05266
