Opening — Why this matters now
Search systems have grown fluent, but not necessarily intelligent. As enterprises drown in text—contracts, filings, emails, reports—the gap between what users mean and what systems match has become painfully visible. Keyword search still dominates operational systems, while embedding-based similarity often settles for crude averages. This paper challenges that quiet compromise.
The core question is deceptively simple: if words carry meaning individually, why do we insist on averaging them into silence?
Background — From bags of words to semantic geometry
Classical information retrieval leaned heavily on lexical overlap. TF‑IDF, BM25, and related vector-space models rewarded exact matches and punished linguistic creativity. Latent Semantic Analysis (LSA) tried to smooth this gap by projecting text into lower-dimensional semantic spaces, but its global matrix factorization struggled with short queries and nuanced rephrasings.
Neural embeddings—Word2Vec, GloVe, FastText—shifted the field from counting words to placing them. Meaning became geometry. Yet most production systems still compress these geometries into centroids and compare them with cosine similarity, a convenient shortcut that quietly discards structure.
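To see what that shortcut looks like in practice, here is a minimal sketch of the centroid-and-cosine pattern the paper pushes back against. It assumes gensim and a pre-trained embedding file in word2vec text format; the file path and example sentences are illustrative, not from the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumption: any pre-trained embedding file in word2vec text format; path is illustrative.
kv = KeyedVectors.load_word2vec_format("embeddings.w2v.txt")

def centroid(tokens):
    """Average the in-vocabulary word vectors into a single document vector."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

query = "who runs the company".split()
sentence = "the chief executive officer leads the firm".split()
score = cosine(centroid(query), centroid(sentence))
```

Every word's position is collapsed into one vector before the comparison ever happens, which is exactly the structure loss the next section addresses.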
Analysis — Letting words keep their distance
The paper’s central move is methodological, not architectural. Instead of representing a sentence or paragraph as the average of its word vectors, it measures similarity by computing how much effort it takes to move one set of words onto another. This is Word Mover’s Distance (WMD), adapted from Earth Mover’s Distance in computer vision.
In practical terms:
- Each document becomes a distribution of word embeddings.
- Similarity is the minimum cumulative distance required to align one distribution with another.
- Individual words matter again—especially in short or partially specified queries.
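To make this concrete, here is a minimal retrieval sketch using gensim's built-in `wmdistance`, which solves the underlying transport problem for a pair of token lists. It assumes GloVe vectors have already been converted to word2vec text format; the path and sentences are illustrative.

```python
from gensim.models import KeyedVectors

# Assumption: GloVe vectors converted to word2vec text format beforehand
# (e.g. with gensim's glove2word2vec utility); the path is illustrative.
kv = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt")

query = "who manages the company".lower().split()
candidates = [
    "the board of directors oversees corporate governance".lower().split(),
    "the offering consists of five million ordinary shares".lower().split(),
]

# Lower WMD means less "travel" is needed to move the query's words
# onto the candidate's words, so we rank in ascending order.
ranked = sorted(candidates, key=lambda sent: kv.wmdistance(query, sent))
print(" ".join(ranked[0]))
```

In a real pipeline the candidates would come from splitting the corpus into sentences or passages, with the same ranking applied to each query.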
The study evaluates this approach across multiple embedding backbones (Word2Vec, FastText, GloVe) and compares them against Doc2Vec variants and an LSA baseline, using a real-world IPO prospectus as the retrieval corpus.
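The rankings are then scored by whether each query's labeled target sentence appears near the top of the list. A small helper of this shape (hypothetical, not the authors' code) is enough to compute that style of metric:

```python
def top_k_recall(rankings, targets, k):
    """Fraction of queries whose labeled target sentence id appears in the top-k ranking.

    rankings: list of ranked sentence-id lists, one per query.
    targets:  list of the correct sentence id for each query.
    """
    hits = sum(1 for ranked, gold in zip(rankings, targets) if gold in ranked[:k])
    return hits / len(targets)

# recall_at_3 = top_k_recall(all_rankings, all_targets, k=3)
# rank1_accuracy = top_k_recall(all_rankings, all_targets, k=1)
```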
Findings — Accuracy without supervision
The results are unambiguous and slightly uncomfortable for the status quo.
| Model | Top‑20 Recall | Top‑3 Recall | #1 Rank Accuracy |
|---|---|---|---|
| WMD + GloVe | 100% | 100% | 89.83% |
| WMD + FastText | 100% | 98.31% | 84.73% |
| WMD + Word2Vec | 100% | 94.92% | 58.98% |
| Doc2Vec (PV+DBOW) | 89.83% | 79.66% | 66.10% |
| LSA | 18.64% | 8.47% | ~0% |
Several patterns stand out:
- Centroid collapse is real: averaging embeddings consistently underperforms distance-aware matching.
- GloVe quietly dominates: its global co‑occurrence structure pairs exceptionally well with WMD.
- Short queries benefit most: exactly where traditional models fail hardest.
- No supervision required: pre-trained embeddings generalize cleanly across domains.
Perhaps most striking is the qualitative behavior: WMD + GloVe doesn’t just retrieve the right sentence; it also surfaces neighboring ideas—management structure, governance, related financial context—suggesting genuine semantic clustering rather than brittle matching.
Implications — Practical search, not academic elegance
This work is not proposing yet another representation model. It is pointing out a blind spot in how we use the ones we already trust.
For business systems—legal discovery, compliance search, financial analysis—the implications are direct:
- Better recall without ontology engineering.
- Robust performance on partial, reordered, or underspecified queries.
- Domain portability without retraining large models.
The trade-off is computational. Full WMD solves an optimal-transport problem for every query–document pair, which gets expensive at corpus scale, but relaxed lower bounds and hybrid indexing strategies (prune with a cheap bound, re-rank the shortlist exactly) keep the cost manageable. In exchange, systems gain something users already assume exists: semantic understanding.
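As one concrete shape such a relaxation can take, here is a sketch of the classic relaxed lower bound, assuming `emb` maps each token to a NumPy vector and uniform word weights: each word is allowed to travel only to its single nearest counterpart, which is cheap to compute and safe for pruning candidates before running exact WMD on the survivors.

```python
import numpy as np

def relaxed_wmd(doc_a, doc_b, emb):
    """One-directional relaxed WMD: each word in doc_a moves all of its mass
    to the nearest word in doc_b. Always a lower bound on the exact distance."""
    A = np.stack([emb[w] for w in doc_a if w in emb])
    B = np.stack([emb[w] for w in doc_b if w in emb])
    # Pairwise Euclidean distances between the two sets of word vectors.
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())  # uniform word weights for simplicity

def symmetric_rwmd(doc_a, doc_b, emb):
    """Tighter bound: keep the larger of the two one-directional relaxations."""
    return max(relaxed_wmd(doc_a, doc_b, emb), relaxed_wmd(doc_b, doc_a, emb))
```

In a hybrid index, a bound like this filters most of the corpus, leaving exact WMD to re-rank only a short list.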
Conclusion — Stop averaging away meaning
Embedding-based search did not fail. We simply asked it to be fast instead of faithful. This paper shows that when words are allowed to keep their identities—and their distances—semantic retrieval becomes both sharper and more intuitive.
The lesson is understated but decisive: meaning lives in structure, not in averages.
Cognaptus: Automate the Present, Incubate the Future.