When Words Start Walking: Rethinking Semantic Search Beyond Averages

Search fails in a very ordinary way.

A lawyer looks for a clause without remembering the exact wording. A finance analyst searches a prospectus for an operating-profit statement, but types only the economic idea. A compliance officer remembers a person’s role, not the sentence where the role was declared. The system returns either too much, too little, or the wrong thing wearing the right keywords. Everyone then calls it “semantic search,” because apparently disappointment sounds better in Greek.

The paper behind this article, “Evaluating the impact of word embeddings on similarity scoring in practical information retrieval,” studies a deceptively small but useful question: when two pieces of text mean roughly the same thing, is it enough to average their word embeddings and compare the averages?¹

Its answer is: usually no. More precisely, the paper shows that the similarity mechanism matters as much as the embedding model. Word embeddings are not magic dust. If a system compresses a sentence into a single centroid too early, it may average away the very details the user needs.

The paper compares several semantic retrieval strategies on a practical statement-retrieval task. The strongest result comes from pairing Word Mover’s Distance (WMD) with GloVe embeddings. In the authors’ experiment, WMD + GloVe returns all 59 query-statement matches within the top three results and places 53 of 59 at rank one. That is not a universal law of search. It is, however, a useful warning against one of the laziest habits in embedding-based retrieval: treating the average of words as if it were the meaning of the sentence.

The real comparison is not old search versus new AI

The easy version of this article would say: keyword search is old, embeddings are new, semantic search wins. That story is comfortable and mostly useless.

The more interesting comparison is inside semantic search itself. The paper does not merely ask whether vector representations help. It asks how different representations and ranking methods behave when the same information is reworded, shortened, reordered, or partially queried.

The study compares:

Model family	What it represents	How similarity is judged	Practical risk
LSA	Terms projected into a lower-dimensional semantic space	Cosine similarity in the reduced space	Weak on short, specific statement retrieval
Doc2Vec PV-DM	Paragraph vectors with word-order/context sensitivity	Centroid-style nearest-neighbor comparison	Can become rigid when wording is rearranged
Doc2Vec PV-DBOW	Paragraph vectors with bag-of-words training	Centroid-style nearest-neighbor comparison	More flexible, but less precise
Doc2Vec PV + DBOW	Combined paragraph-vector approach	Centroid-style nearest-neighbor comparison	Stronger than single Doc2Vec variants, still behind WMD in this task
WMD + Word2Vec / FastText / GloVe	Individual word vectors inside each query and statement	Minimum “movement cost” between word distributions	Accurate but computationally heavier

This is why the comparison matters more than the roster of methods. The paper’s value is not that it name-drops Word2Vec, FastText, GloVe, Doc2Vec, LSA, and WMD in one place. A shopping list is not analysis. The value is the contrast between two design instincts.

One instinct says: convert every text object into a single vector, then compare vectors.

The other says: keep the words as separate points in semantic space, then measure how much effort is needed to move one set of meanings onto another.

The second instinct is more expensive. It is also harder to fool with rearranged wording.

What Word Mover’s Distance changes

A centroid-based method compresses a query or paragraph into an average vector. That can work for broad topic clustering. It is less reliable when the user searches for a specific statement and only remembers part of it.

A simple example: suppose the target statement says:

“Michael Brown is the Chief Executive Officer of the Company.”

A user may search:

“Chief Executive Officer of the Company”

or simply:

“Michael Brown”

A centroid method has to decide whether the average vector of the query is close enough to the average vector of the statement. WMD does something more granular. It treats each text as a distribution of word embeddings and calculates the minimum distance required to move the words in one text to align with the words in the other.
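For readers who prefer symbols, the optimisation usually written for WMD is sketched below. The notation is an assumption of this article rather than a quotation from the paper: $d_i$ and $d'_j$ are the normalised word weights of the two texts, $x_i$ and $x'_j$ are the corresponding word embeddings, and $T_{ij}$ is the amount of weight moved from word $i$ in the first text to word $j$ in the second.

$$
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, \lVert x_i - x'_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{m} T_{ij} = d_i \ \ \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \ \ \forall j
$$

The two constraints say that every word must send out all of its weight and every word on the other side must receive exactly its share. Nothing is averaged before the distance is computed.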

That is the “walking” in the title. Words are allowed to keep their individual positions before the system decides whether two pieces of text are close.
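A toy illustration of what averaging can hide, offered as a minimal sketch rather than the paper's implementation. The two-dimensional vectors and the four words are invented for the example, and `wmd_equal_length` is a hypothetical helper that only handles the special case of two texts with the same number of words and uniform weights, where an optimal transport plan reduces to a one-to-one assignment.

```python
import itertools
import math

# Toy 2-D "embeddings": words and coordinates are invented for illustration only.
VECTORS = {
    "profit": (1.0, 0.0),
    "loss": (-1.0, 0.0),
    "director": (0.0, 1.0),
    "tenant": (0.0, -1.0),
}


def centroid(words):
    """Average the word vectors of a text into a single point."""
    columns = zip(*(VECTORS[w] for w in words))
    return tuple(sum(col) / len(words) for col in columns)


def centroid_distance(a, b):
    """Euclidean distance between the two averaged vectors."""
    return math.dist(centroid(a), centroid(b))


def wmd_equal_length(a, b):
    """Exact WMD for two texts of equal length with uniform word weights.

    In this special case an optimal transport plan is a one-to-one assignment,
    so trying every permutation is enough for tiny inputs.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles texts of equal length")
    best_total = min(
        sum(math.dist(VECTORS[x], VECTORS[y]) for x, y in zip(a, perm))
        for perm in itertools.permutations(b)
    )
    return best_total / len(a)


text_a = ["profit", "loss"]
text_b = ["director", "tenant"]

print(centroid_distance(text_a, text_b))  # 0.0: the averages coincide
print(wmd_equal_length(text_a, text_b))   # about 1.414: the words are still far apart
```

The two texts share no words and no nearby words, yet their centroids are identical. Real embeddings are rarely this symmetrical, but the example shows the failure mode in its purest form. Brute-force permutation is only viable for a handful of words; production WMD implementations solve the transport problem with a dedicated solver.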

This matters because statement retrieval is often not about broad theme recognition. It is about finding the right paragraph when the user gives an incomplete, reordered, or semantically similar query. In other words, the system is not trying to understand “governance” as a topic. It is trying to recover the sentence where a specific governance fact lives.

The experiment is practical, but narrow

The dataset comes from the 2013 IPO prospectus for Foxtons Estate Agents of London. The prospectus contains 223 pages, 141,171 words, and 8,127 individual statements or paragraphs. The authors selected 12 target statements and developed query variations for each, producing 59 search statements across the trials.

The query variations are the important part. They are not random paraphrases generated to look fashionable. They test common retrieval pain points:

Test pattern	Example of what changes	Likely purpose
Exact query	The query is the original statement	Main evidence: can the model recover the known target?
Reordered statement	Subsections of a financial sentence are swapped	Robustness/sensitivity test: does word order break retrieval?
Partial query	Only the name, title, figures, or phrase is searched	Robustness/sensitivity test: does the model handle incomplete user memory?
Short query versus longer statement	A few words must retrieve a fuller paragraph	Practical retrieval test: does semantic matching survive query length mismatch?
Multi-sentence target	Longer target statements up to 215 words	Sensitivity test: does retrieval degrade with statement length?

This setup is more realistic than a toy synonym test, but it is still a bounded experiment. It is one document collection, one business genre, and 59 query-statement comparisons. The right lesson is not “WMD + GloVe wins enterprise search.” The right lesson is more specific: for statement-level retrieval over business documents, especially where users submit partial or reworded queries, WMD-style ranking can substantially outperform centroid-style semantic matching.

That is already useful. It is just not a license to tattoo “GloVe” on the architecture diagram and call procurement.

The results separate recall from rank quality

The paper evaluates each method at three levels:

Whether the correct statement appears in the top 20 results.
Whether the correct statement appears in the top three results.
Whether the correct statement appears at rank one.

This distinction is important. Top-20 retrieval is useful for recall, but users rarely celebrate being asked to inspect twenty candidates. In an enterprise workflow, rank quality changes user behavior. A search system that places the right clause first feels intelligent. One that hides it at position 17 feels like a filing cabinet with Wi-Fi.
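The three levels amount to counting hits at different cut-offs. A minimal sketch follows, assuming each query has been reduced to the 1-based rank of its correct statement, with `None` when the statement was not retrieved at all. The ranks in `example_ranks` are hypothetical and are not taken from the paper.

```python
def hits_at_k(ranks, k):
    """Count queries whose correct statement appears at rank k or better (1-based)."""
    return sum(1 for r in ranks if r is not None and r <= k)


# Hypothetical ranks of the correct statement for six queries; None = not retrieved.
example_ranks = [1, 1, 3, 17, 2, None]

for k in (20, 3, 1):
    hits = hits_at_k(example_ranks, k)
    share = hits / len(example_ranks)
    print(f"top-{k}: {hits}/{len(example_ranks)} ({share:.2%})")
```

The query ranked 17th counts as a success at top 20 and as a failure at top three, which is exactly the gap between recall and rank quality.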

The main results are:

System	Correct matches in top 20	Correct matches in top 3	Correct matches at rank 1
WMD + GloVe	59/59, 100%	59/59, 100%	53/59, 89.83%
WMD + FastText	59/59, 100%	58/59, 98.31%	50/59, 84.75%
WMD + Word2Vec	59/59, 100%	56/59, 94.92%	18/59, percentage inconsistent in paper
Doc2Vec PV + DBOW	53/59, 89.83%	47/59, 79.66%	39/59, 66.10%
Doc2Vec PV-DBOW	40/59, 67.80%	23/59, 38.98%	16/59, 27.12%
Doc2Vec PV-DM	33/59, 55.93%	25/59, 42.37%	15/59, 25.42%
LSA	11/59, 18.64%	5/59, 8.47%	2/59, percentage inconsistent in paper

Two technical notes are worth making without turning this into a spreadsheet autopsy. First, the paper’s rank-one table appears to contain percentage inconsistencies for WMD + Word2Vec and LSA: 18/59 is not 58.98%, and 2/59 is not 0.03%. The counts are safer to interpret than those printed percentages. Second, the inconsistency does not change the main comparison. WMD + GloVe and WMD + FastText clearly dominate the rank-one results; WMD + GloVe is strongest overall.
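The arithmetic behind that first note is easy to check. The sketch below recomputes the two rank-one percentages from the counts reported above; the variable names are ours, and the only inputs are the counts and the 59-query total.

```python
TOTAL_QUERIES = 59

# Rank-one counts as reported for the two rows with inconsistent percentages.
rank_one_counts = {"WMD + Word2Vec": 18, "LSA": 2}

recomputed = {
    name: round(100 * count / TOTAL_QUERIES, 2)
    for name, count in rank_one_counts.items()
}

for name, pct in recomputed.items():
    print(f"{name}: {rank_one_counts[name]}/{TOTAL_QUERIES} = {pct}%")
# WMD + Word2Vec: 18/59 = 30.51%
# LSA: 2/59 = 3.39%
```

If the counts are right, the rank-one share is roughly 30.5% for WMD + Word2Vec and roughly 3.4% for LSA.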

The top-20 result says all WMD variants can recover the target somewhere in a manageable result set. The top-three result says WMD + GloVe is especially reliable when users expect a small candidate list. The rank-one result says GloVe paired with WMD is not merely finding the answer; it is often placing it where the user wants it.

That is the operational difference between “our search engine has recall” and “our employees trust the search box.”

Why Doc2Vec underperforms where it should have looked strong

Doc2Vec seems like it should be a natural fit. It represents sentences, paragraphs, or documents rather than only individual words. Business documents often require paragraph-level meaning, so paragraph vectors sound appealing.

The experiment gives a more complicated answer.

The combined PV + DBOW version performs much better than the single Doc2Vec variants. It retrieves 53 of 59 correct statements in the top 20 and 47 of 59 in the top three. That is not failure. It is a respectable result. But it still trails all three WMD combinations at the top-three level, and it trails WMD + GloVe and WMD + FastText at rank one.

The likely reason is not that paragraph vectors are useless. It is that this task punishes the wrong kind of compression. When queries are short, reordered, or partial, the system needs to preserve enough word-level structure to connect fragments to full statements. Doc2Vec’s paragraph representation may help with longer text segments, but the paper’s task repeatedly asks models to match smaller fragments against specific statements. In that setting, the “paragraph meaning” can become too smooth.

The comparison between PV-DM and PV-DBOW is also revealing. PV-DBOW outperforms PV-DM in top-20 recall, suggesting that ignoring word order can actually help when the test includes reordered statements. This is a nice little insult to our intuition. Sometimes preserving structure helps; sometimes it preserves the wrong structure.

The practical point is simple: do not choose a representation because its name sounds closer to your document type. Choose it because its failure mode matches your user behavior.

If users search with long, coherent paragraphs, paragraph-vector methods may be reasonable. If users search with fragments, aliases, reordered clauses, and half-remembered entities, a distance-aware word-level method may be better as a reranking layer.

Why GloVe wins this contest

The paper tests WMD with three embedding sources: Word2Vec, FastText, and GloVe. All three perform strongly in top-20 retrieval. The separation appears when ranking becomes stricter.

WMD + GloVe returns all 59 matches in the top three and 53 at rank one. WMD + FastText is close, with 58 in the top three and 50 at rank one. WMD + Word2Vec retrieves all 59 in the top 20 and 56 in the top three, but its rank-one count is much weaker in the reported table.

The paper interprets GloVe’s strength partly through its global co-occurrence structure. GloVe is trained from large co-occurrence counts, while Word2Vec is a local context-window model and FastText extends Word2Vec with subword information. In this experiment, the Common Crawl-trained GloVe vectors paired well with WMD’s word-level movement calculation.

A cautious business reading is better than a grand one. The result does not prove that GloVe is always superior to Word2Vec or FastText. It suggests that for this kind of statement-level retrieval over a business prospectus, GloVe’s semantic geometry gives WMD a cleaner distance landscape.

That distinction matters. The paper’s result is not “use GloVe everywhere.” The result is closer to: if you are building a semantic reranker for short-to-medium business statements, WMD + GloVe is a strong candidate worth testing before you default to averaged embeddings.

Tiny difference. Large procurement consequences.

The qualitative result is about semantic neighborhoods, not just exact hits

One of the paper’s more useful sections is not a table. It is the informal qualitative analysis of the “Michael Brown” query.

For that query, the WMD-based systems achieved full recall. In the WMD + GloVe results, 10 of the top 20 statements directly mention Michael Brown as CEO or executive director. That is the expected syntactic success. More interestingly, the remaining 10 statements concern related concepts: management structure, key management personnel, the board of directors, and company management.

This is where the paper becomes relevant to query expansion.

A search system should not only retrieve the exact target statement. It should help the user explore adjacent meaning. If someone searches for a CEO, related governance material may be useful. If someone searches for operating profit, related financial-performance statements may be useful. If someone searches for a compliance obligation, related definitions and exceptions may be useful.

The business value is not only better ranking. It is better navigation through unfamiliar document spaces.

That said, semantic neighborhoods are dangerous when left unlabelled. A related statement is not the same as an answer. In legal, financial, or medical contexts, “nearby” can become “misleading” with impressive speed. A well-designed interface should distinguish exact target matches from semantically adjacent suggestions. The model can expand the map, but the product must still mark the roads.

The business lesson is reranking, not replacement

The most practical way to use this paper is not to rebuild enterprise search around full WMD. That would be dramatic, and drama is usually where budgets go to become smoke.

A better architecture is hybrid:

Use a fast retrieval layer to generate candidates.
Use semantic methods to rerank those candidates.
Use query expansion or related-concept suggestions to improve exploration.
Use domain-specific tuning where general embeddings fail.

In that design, WMD is not the whole search engine. It is a precision instrument applied after candidate generation.
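The shape of that pipeline can be sketched in a few lines. Everything here is an illustrative assumption: `first_stage` uses shared-token counts as a stand-in for BM25 or an approximate nearest-neighbour index, and `jaccard_distance` is a cheap placeholder for the distance function, which in a real system would be WMD or a relaxed variant. The four-statement corpus is invented.

```python
def tokenize(text):
    """Lower-case whitespace tokenizer; real systems would do more."""
    return text.lower().split()


def first_stage(query, corpus, limit):
    """Cheap candidate generation: rank by number of shared tokens."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(doc))), idx) for idx, doc in enumerate(corpus)]
    scored = [(score, idx) for score, idx in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:limit]]


def jaccard_distance(a, b):
    """Placeholder distance; swap in a WMD-style function here."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


def rerank(query, corpus, candidates, distance):
    """Expensive step, applied only to the candidate set. Lower distance ranks first."""
    q = tokenize(query)
    return sorted(candidates, key=lambda idx: distance(q, tokenize(corpus[idx])))


corpus = [
    "Michael Brown is the Chief Executive Officer of the Company",
    "The Company reported operating profit for the year",
    "The Chief Financial Officer of the Company resigned",
    "Tenants pay fees to the letting agent",
]
query = "Chief Executive Officer of the Company"

candidates = first_stage(query, corpus, limit=3)
ranked = rerank(query, corpus, candidates, jaccard_distance)
print(ranked)  # [0, 2, 1]
```

The design point is the `limit` argument: the costly distance function runs on three candidates here, not on the whole corpus. In the paper's setting that would mean scoring a short list rather than all 8,127 statements.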

Business use case	Where WMD-style ranking helps	What it should not be asked to do alone
Legal document search	Reordered clauses, partial clause memory, synonym-heavy queries	Replace jurisdiction-specific legal interpretation
Financial filings	Locate specific statements with figures or partial phrases	Validate financial meaning or reconcile accounting context
Compliance repositories	Match policy questions to relevant policy statements	Decide obligation status without rule logic
Internal knowledge bases	Retrieve relevant paragraphs when employees use imprecise wording	Solve taxonomy, permissions, and document-quality problems
Medical or technical archives	Bridge wording differences across similar concepts	Substitute for domain-specific ontology where terminology is highly specialized

This is the right level of ambition. The paper directly shows improved statement retrieval on a prospectus-derived test set. Cognaptus can infer that WMD-style reranking may improve enterprise search workflows where users search with fragments and paraphrases. What remains uncertain is how far the result travels across larger corpora, noisier documents, domain-specific jargon, and modern embedding baselines.

The inference is promising. It is not a product guarantee wearing a lab coat.

The scalability boundary is real

The paper is explicit about WMD’s computational cost. Full WMD scales poorly: its time complexity is often described as $O(n^3 \log n)$, where $n$ is the number of unique words in the texts being compared. That makes it unattractive for full-text retrieval over very large corpora if used naively.

This limitation is not a footnote. It changes deployment design.

The authors point toward Relaxed Word Mover’s Distance and hybrid search strategies that index dense vectors through traditional inverted-index search engines. The broader idea is sensible: use fast mechanisms to narrow the candidate set, then use more precise semantic distance where it matters.
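To make the relaxed variant concrete, here is a minimal sketch of the idea as it is commonly described: drop one of the two transport constraints, let every word send all of its weight to its nearest counterpart, do this in both directions, and keep the larger value as a lower bound on full WMD. The two-dimensional vectors and the tiny vocabulary are invented for illustration, and uniform word weights are assumed.

```python
import math

# Toy 2-D "embeddings": words and coordinates are invented for illustration only.
VECTORS = {
    "ceo": (1.0, 1.0),
    "chief": (1.0, 0.0),
    "executive": (0.0, 1.0),
    "profit": (4.0, 5.0),
}


def one_sided(a, b):
    """Each word in `a` sends all its weight to its nearest word in `b`."""
    total = sum(
        min(math.dist(VECTORS[x], VECTORS[y]) for y in b)
        for x in a
    )
    return total / len(a)


def relaxed_wmd(a, b):
    """Larger of the two one-sided relaxations; a lower bound on full WMD."""
    return max(one_sided(a, b), one_sided(b, a))


query = ["ceo"]
statement_a = ["chief", "executive"]
statement_b = ["profit"]

print(relaxed_wmd(query, statement_a))  # 1.0
print(relaxed_wmd(query, statement_b))  # 5.0
```

Each one-sided pass needs a number of distance evaluations proportional to the product of the two text lengths, with no transport problem to solve. That is why the relaxed form is a plausible first filter before full WMD is spent on the survivors.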

The paper also notes a second boundary. Pre-trained embeddings can work well across domains, but some business domains use highly nuanced language. In those cases, being detached from domain ontology is not always an advantage. A general embedding model may know that “charge,” “claim,” “security,” and “instrument” are meaningful words. It may not know which meaning matters in a loan agreement, regulatory filing, or clinical protocol.

So the implementation rule should be:

Decision point	Sensible choice
Large corpus, broad discovery	Fast first-stage retrieval, possibly BM25 or vector ANN
Small candidate set requiring precision	WMD or relaxed WMD reranking
Reworded and partial business statements	Test WMD + GloVe or comparable distance-aware methods
Specialized domain language	Add domain-enriched embeddings, ontology signals, or supervised feedback
User-facing exploration	Separate exact matches from semantic suggestions

Averages are cheap. Movement is expensive. The trick is not to choose one forever. The trick is to spend computation where the user is most likely to notice.

The paper is useful because it is modest

The study has limitations: one prospectus, 12 target statements, 59 query variations, and a pre-transformer embedding stack. It does not compare against modern transformer-based dense retrieval, cross-encoders, or LLM rerankers. It does not prove universal superiority for WMD + GloVe. It does not solve enterprise search.

But its modesty is exactly why it remains useful.

It isolates a design problem that still appears inside modern systems: the loss caused by premature averaging. Even if today’s production stack uses newer embeddings, the same architectural question survives. Are we preserving the structure needed for the retrieval task, or are we compressing it away because the resulting vector is convenient?

The paper’s answer is grounded in a practical retrieval setting. It shows that the path from word embeddings to useful search is not automatic. A representation must be paired with a similarity function that respects the user’s search behavior.

For Cognaptus readers building AI-enabled search or document automation systems, that is the lesson worth keeping.

Do not ask “which embedding model is best?” too early.

Ask first:

What does the user remember: the exact phrase, the entity, the role, the number, or only the idea?
Is the target a document, a paragraph, a sentence, or a clause?
Does ranking position matter more than broad recall?
Are related concepts useful, or dangerous?
Can expensive reranking be applied only after fast candidate retrieval?

Those questions lead to better systems than another round of embedding-model astrology.

Conclusion: meaning is not the average of its parts

The old keyword system fails when the user says the right thing with the wrong words. The naive embedding system fails more quietly: it turns the words into vectors, averages them, and hopes the average still means something.

Sometimes it does. Often enough, it does not.

This paper shows that for practical statement retrieval, letting words keep their individual semantic positions can produce sharper rankings. WMD + GloVe performs especially well in the authors’ Foxtons prospectus experiment, returning every target match in the top three and most at rank one. More importantly, the comparison reveals why the gain appears: not because embeddings exist, but because the ranking method preserves word-level movement instead of collapsing meaning into a centroid.

The business takeaway is not to worship WMD. It is to stop treating semantic search as a one-vector problem.

When users search business documents, they rarely hand the system a perfect sentence. They bring fragments, substitutions, reordered clauses, and partial memory. A useful retrieval system must meet them there.

Meaning does not sit still inside an average. Sometimes the words have to walk.

Cognaptus: Automate the Present, Incubate the Future.

¹ Niall McCarroll, Kevin Curran, Eugene McNamee, Angela Clist, and Andrew Brammer, “Evaluating the impact of word embeddings on similarity scoring in practical information retrieval,” arXiv:2602.05734, accessed via PDF fallback because the arXiv HTML page was unavailable.
