Search.
That is the unglamorous part of AI work. The demo asks a model to solve a clean problem. The enterprise system asks a model to find the right prior case, retrieve the relevant precedent, avoid the misleading near-match, and then adapt the answer without making a confident mess of it.
MathNet is interesting because it puts that distinction under pressure. The paper introduces a large multilingual, multimodal Olympiad mathematics benchmark, but the more useful business lesson is not merely that frontier models can solve hard math. We already have enough leaderboards wearing medals. The sharper finding is that models and embedding systems can still fail at recognizing when two problems are mathematically the same, or when one problem is structurally useful for another.[1]
That sounds like a niche mathematical complaint until one translates it into business language. A legal assistant that retrieves the wrong clause because the wording is similar has not “almost succeeded.” A compliance bot that finds a policy from the same department but not the same control logic has not helped. A support copilot that retrieves a customer ticket with matching keywords but a different root cause has increased the blast radius of mediocrity.
MathNet’s useful equation is therefore simple:
$$ \text{AI workflow value} \neq \text{reasoning score alone} $$
A system also needs to find the right analogy. That is where the paper becomes more than another math benchmark.
MathNet separates solving from finding the right analogy
MathNet is built from official Olympiad materials rather than informal web collections. The core dataset, MathNet-Solve, contains 30,676 expert-written problems with solutions, spanning 47 countries, 17 languages, and 143 competitions. The authors also describe a collection pipeline that starts with document ingestion and OCR, then uses LLM-assisted extraction, and finally applies rule-based checks, GPT-4.1 verification, and human review before retaining problem-solution pairs.
That pipeline is not the conceptual punchline, but it matters. Retrieval evaluation is only useful when the ground truth is not a pile of scraped uncertainty wearing a dataset badge.
The benchmark is organized into three layers:
| Layer | What it tests | Evidence type in the paper | Why it matters operationally |
|---|---|---|---|
| MathNet-Solve | Can the model solve Olympiad problems? | Main evidence for generative reasoning | Measures answer generation under hard conditions |
| MathNet-Retrieve | Can an embedding system retrieve mathematically equivalent problems? | Main evidence for structure-aware retrieval | Measures whether search can identify the right prior example |
| MathNet-RAG | Can retrieved examples improve solving? | Main evidence isolating retrieval quality | Measures whether context helps or merely decorates the prompt |
This structure is better than the usual benchmark format because it does not confuse three different abilities. A model may solve a problem directly. A retriever may find an equivalent or useful problem. A RAG system may use that retrieved material to improve a solution.
Those are not the same capability. In enterprise AI, pretending they are the same is how a company spends six months improving prompt templates while the retrieval layer quietly keeps handing the model the wrong files.
The solver scoreboard is impressive, but it is not the workflow
On MathNet-Solve, the strongest reasoning-oriented multimodal models perform well by Olympiad benchmark standards. The paper reports Gemini-3.1-Pro at 78.4% overall accuracy on the 6,400-problem test set, Gemini-3-Flash at 70.4%, and GPT-5 at 69.3%. The strongest models are especially good in algebra and number theory. Geometry and discrete mathematics remain harder.
This is the headline-friendly part. It tells us that frontier models are becoming genuinely capable on difficult structured reasoning tasks. It also shows useful domain variation: a model that looks strong in algebra may still struggle when diagrams, configurations, or combinatorial structure become central.
But for workflow design, this is only half the story. A business process is rarely a clean exam question handed to a model with all relevant assumptions already packaged inside the prompt. Most high-value work begins with retrieval:
- Has this contract clause appeared before under different wording?
- Which previous customer incident used the same root-cause pattern?
- Is this risk-control exception structurally equivalent to another exception already approved?
- Which internal memo contains the method, not merely the same vocabulary?
The model that solves the exam may still fail the filing cabinet.
That is the central comparison MathNet makes possible. Solving measures whether the model can generate a correct answer after receiving the problem. Retrieval measures whether the system can locate the right related problem in the first place. If retrieval fails, the generator may never get the useful context. It may instead receive a plausible distraction, which is worse than silence because it arrives with the emotional confidence of a helpful assistant.
The retriever scoreboard is the rude part
MathNet-Retrieve is designed around mathematical equivalence. For each of 10,000 anchor problems, the authors create one equivalent positive and three hard negatives. The positives preserve the underlying mathematical structure through transformations such as variable renaming, algebraic reformulation, and paraphrase. The hard negatives preserve surface similarity while changing the actual mathematics.
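The anchor, positive, and hard-negative layout described above can be sketched as a small data structure. This is an illustrative sketch only: the class name `RetrievalCase`, its fields, and the toy problems are assumptions made for this example, not the paper's schema or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievalCase:
    """One evaluation item: an anchor, its equivalent rewrite, and look-alike distractors."""
    anchor: str
    positive: str               # same mathematics, different surface form
    hard_negatives: List[str]   # similar surface form, different mathematics

    def candidates(self) -> List[str]:
        """Items this case contributes to the retrieval pool (positive listed first)."""
        return [self.positive, *self.hard_negatives]


# Toy example written for this sketch; it is not taken from the dataset.
case = RetrievalCase(
    anchor="Find all real x such that x^2 - 5x + 6 = 0.",
    positive="Determine every real t satisfying t^2 + 6 = 5t.",
    hard_negatives=[
        "Find all real x such that x^2 - 5x - 6 = 0.",
        "Find all real x such that x^2 + 5x + 6 = 0.",
        "Find all integers x such that x^2 - 5x + 6 < 0.",
    ],
)
```

Note how the positive shares almost no wording with the anchor, while each negative differs from it by a sign or a single condition. That asymmetry is the whole point of the test.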
This is not ordinary semantic search. It is a trap for systems that mistake shared vocabulary for shared structure.
The results are blunt. On the full retrieval benchmark, the best Recall@1 scores are around 5%. Qwen3-embedding-4B reaches 4.96% overall Recall@1, while Gemini-embedding-001 reaches 4.83%. At Recall@5 both models recover substantially: Gemini-embedding-001 reaches 68.88% and Qwen3-embedding-4B reaches 64.95%.
| Embedding model | Overall Recall@1 | Overall Recall@5 | Practical reading |
|---|---|---|---|
| Gemini-embedding-001 | 4.83% | 68.88% | Often finds the answer somewhere nearby, but rarely first |
| Qwen3-embedding-4B | 4.96% | 64.95% | Similar pattern: better candidate pool than top hit |
| text-embedding-3-large | 2.74% | 54.23% | Useful signal, weak structural precision |
| text-embedding-3-small | 1.98% | 35.49% | Surface semantics are not enough |
The difference between Recall@1 and Recall@5 is the business story. If the correct item appears somewhere in the top five, a human expert or a reranker may still salvage the workflow. If the production system blindly feeds the top result into the model, it may confidently ground the answer in the wrong analogy.
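For readers who want the metric pinned down, here is a minimal sketch of Recall@k, assuming one correct item per query. The function name, the dictionary layout, and the toy rankings are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose single gold item appears in the top-k results."""
    if not rankings:
        return 0.0
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


# Hypothetical ranked lists for four queries (best match first).
rankings = {
    "q1": ["neg_a", "pos_1", "neg_b", "neg_c", "neg_d"],   # gold at rank 2
    "q2": ["pos_2", "neg_e", "neg_f", "neg_g", "neg_h"],   # gold at rank 1
    "q3": ["neg_i", "neg_j", "neg_k", "pos_3", "neg_l"],   # gold at rank 4
    "q4": ["neg_m", "neg_n", "neg_o", "neg_p", "neg_q"],   # gold never retrieved
}
gold = {"q1": "pos_1", "q2": "pos_2", "q3": "pos_3", "q4": "pos_4"}

recall_1 = recall_at_k(rankings, gold, k=1)   # 1 of 4 queries
recall_5 = recall_at_k(rankings, gold, k=5)   # 3 of 4 queries
```

In this toy set, Recall@1 is 0.25 and Recall@5 is 0.75. A pipeline that passes only the top hit to the generator would ground three of the four answers in the wrong item.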
The paper’s cosine similarity analysis explains why. Equivalent pairs are not always ranked above non-equivalent or near-miss pairs. In plain English: embeddings can think the wrong problem is more similar than the right one because the wrong problem looks closer on the surface.
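The mechanism is easy to reproduce in miniature. The sketch below uses a bag-of-words vector as a deliberately crude stand-in for an embedding. That substitution is an assumption made for illustration: neural embeddings are far richer, and the three problems are invented for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


anchor = "Find all positive integers n such that n squared plus one is divisible by five"
# Same mathematics, different wording and variable name.
equivalent = "Determine every natural number m for which five divides m times m plus one"
# Nearly identical wording, different mathematics ("minus" instead of "plus").
near_miss = "Find all positive integers n such that n squared minus one is divisible by five"

sim_equivalent = cosine(bag_of_words(anchor), bag_of_words(equivalent))
sim_near_miss = cosine(bag_of_words(anchor), bag_of_words(near_miss))
```

Here the near-miss scores roughly 0.94 against the anchor while the true equivalent scores roughly 0.16. A representation that rewards shared tokens ranks the wrong problem first by a wide margin.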
That is not a minor ranking inconvenience. It is the failure mode of generic RAG in domains where structure matters.
Similarity has three levels, and businesses usually flatten them
One of MathNet’s more useful contributions is its taxonomy of mathematical similarity. The authors distinguish between invariance, structural resonance, and affinity.
This taxonomy is worth stealing for enterprise AI evaluation, with appropriate shame and attribution.
| MathNet similarity type | Meaning in the paper | Business analogue | Common retrieval failure |
|---|---|---|---|
| Invariance | Same underlying structure under reformulation | Same obligation, same control, same issue, different wording | The system misses equivalent cases because the phrasing changed |
| Structural resonance | Different problem, reusable method or lemma | Different situation, same playbook or causal mechanism | The system retrieves same-topic material but misses the reusable method |
| Affinity | Broad topical relatedness | Same department, same product, same category | The system mistakes topical neighborhood for operational relevance |
Most enterprise search systems are comfortable with affinity. They can find documents about “vendor risk,” “refund dispute,” or “machine vibration.” That is useful, but it is the lowest bar.
The harder question is whether the system can find a prior case that has the same decision logic. That requires invariance or resonance, not just topical clustering. A policy memo and a support ticket may share no vocabulary yet express the same operational pattern. Conversely, two documents may share many terms while requiring opposite actions.
MathNet makes this visible in mathematics because mathematics is unforgiving. A small symbolic change can destroy equivalence. Business documents are less formal, but the same principle applies. A small change in jurisdiction, contract role, payment timing, sensor regime, or customer segment can change the answer.
The annoying part is that generic embeddings are often optimized for smooth semantic similarity. Business risk is often hiding in discontinuities.
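One way to keep the three levels from being flattened in an internal evaluation is to label each retrieved item with the relation it actually has to the query, and to grant credit only when that relation meets what the workflow needs. The sketch below assumes the levels can be treated as ordered (invariance above resonance above affinity). That ordering is a simplification introduced here, not a claim from the paper, and all names are hypothetical.

```python
from enum import IntEnum


class Similarity(IntEnum):
    """Relation between a retrieved item and the query, weakest to strongest."""
    UNRELATED = 0
    AFFINITY = 1      # same topic, department, or product
    RESONANCE = 2     # different case, reusable method or mechanism
    INVARIANCE = 3    # same underlying structure, different wording


def earns_credit(found: Similarity, required: Similarity) -> bool:
    """Give retrieval credit only if the item meets the level the workflow needs."""
    return found >= required


# A same-department policy does not satisfy a workflow that needs the same control.
topical_hit = earns_credit(Similarity.AFFINITY, required=Similarity.INVARIANCE)
# A reworded but equivalent clause does.
equivalent_hit = earns_credit(Similarity.INVARIANCE, required=Similarity.INVARIANCE)
```

The labels themselves still have to come from domain experts. The code only makes sure a topical hit cannot quietly count as a structural one.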
RAG helps only when the retrieved example is actually useful
MathNet-RAG is the most directly relevant part for AI workflow design. The authors evaluate seven models under three settings:
- Zero-shot: the model receives only the target problem.
- Embed-RAG: the model receives the target problem plus a problem and solution retrieved by Gemini-embedding-001.
- Expert-RAG: the model receives the target problem plus an expert-paired, structurally similar problem and its solution.
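The three settings differ only in what accompanies the target problem, which a short sketch makes concrete. The prompt wording, the function name `build_prompt`, and the placeholder strings are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
from typing import Optional, Tuple


def build_prompt(target_problem: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Assemble a prompt; `example` is an optional (problem, solution) pair."""
    parts = []
    if example is not None:
        example_problem, example_solution = example
        parts.append("Reference problem:\n" + example_problem)
        parts.append("Reference solution:\n" + example_solution)
    parts.append("Problem to solve:\n" + target_problem)
    return "\n\n".join(parts)


target = "<target problem text>"
retrieved_pair = ("<problem chosen by the embedding model>", "<its solution>")
expert_pair = ("<problem chosen by a human expert>", "<its solution>")

zero_shot_prompt = build_prompt(target)
embed_rag_prompt = build_prompt(target, retrieved_pair)
expert_rag_prompt = build_prompt(target, expert_pair)
```

Because the template is identical in the two RAG settings, any performance gap between them is attributable to which example was selected, not to how it was presented.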
The comparison is elegant because it isolates retrieval quality. If Expert-RAG improves performance but Embed-RAG is inconsistent, the bottleneck is not simply whether context exists. The bottleneck is whether the context is the right context.
The results support exactly that reading.
Under human grading, DeepSeek-V3.2-Speciale rises from 84.8% zero-shot to 89.5% with Embed-RAG and 97.3% with Expert-RAG. GPT-5 rises from 76.8% zero-shot to 86.6% with Expert-RAG, although its Embed-RAG result, 75.2%, is slightly below its zero-shot score. Gemini-3-Pro performs strongly zero-shot at 89.1% and improves to 92.9% with Embed-RAG, but drops to 87.5% with Expert-RAG.
So the result is not “retrieval always helps.” That would be too convenient, and therefore suspicious.
The better interpretation is narrower and more useful: structure-aligned examples can improve reasoning, but retrieved examples are not automatically structure-aligned. A retrieved near-match may help, do nothing, or interfere. This is why enterprise RAG systems need retrieval evaluation, not just a larger context window and a cheerful product page.
A 200-page context window does not solve the problem if the wrong 200 pages are invited to the meeting.
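The mixed picture is easier to see as changes relative to zero-shot. The sketch below uses only the human-graded figures quoted above; the dictionary layout and the helper name are illustrative.

```python
# Human-graded accuracy (%) as quoted in the text above.
scores = {
    "DeepSeek-V3.2-Speciale": {"zero_shot": 84.8, "embed_rag": 89.5, "expert_rag": 97.3},
    "GPT-5": {"zero_shot": 76.8, "embed_rag": 75.2, "expert_rag": 86.6},
    "Gemini-3-Pro": {"zero_shot": 89.1, "embed_rag": 92.9, "expert_rag": 87.5},
}


def delta_vs_zero_shot(row: dict) -> dict:
    """Percentage-point change of each RAG setting relative to zero-shot."""
    base = row["zero_shot"]
    return {name: round(value - base, 1)
            for name, value in row.items() if name != "zero_shot"}


deltas = {model: delta_vs_zero_shot(row) for model, row in scores.items()}
for model, change in deltas.items():
    print(f"{model}: Embed-RAG {change['embed_rag']:+.1f}, "
          f"Expert-RAG {change['expert_rag']:+.1f}")
```

Two of the six deltas are negative: GPT-5 with Embed-RAG and Gemini-3-Pro with Expert-RAG, each down 1.6 points. Added context is not a free upgrade.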
What each experiment should be allowed to prove
The paper contains several kinds of evidence. Treating them all as one generic “benchmark result” would flatten the argument. The useful reading is more disciplined:
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| MathNet-Solve model scores | Main evidence | Frontier models can solve many hard Olympiad problems, with domain-specific weaknesses | That models can manage real enterprise workflows end-to-end |
| MathNet-Retrieve Recall@k | Main evidence | Generic embeddings struggle with mathematical equivalence at top-1 retrieval | That all retrieval in all domains will fail at the same rate |
| Cosine similarity distribution | Diagnostic analysis | Surface similarity can outrank true equivalence | A complete causal theory of embedding internals |
| MathNet-RAG zero-shot vs Embed-RAG vs Expert-RAG | Main evidence isolating retrieval quality | RAG depends heavily on the quality of retrieved examples | That expert examples always help every model |
| Extraction and verification pipeline | Implementation detail and quality control | The dataset attempts to control OCR, alignment, and hallucinated extraction errors | That every retained item is perfect |
| Competition and topic coverage appendix | Dataset scope documentation | MathNet is broader than many prior Olympiad benchmarks | That it fully represents all mathematical reasoning styles |
This distinction matters because business readers often want to jump from one number to one procurement conclusion. MathNet does not say “buy symbolic AI” or “abandon embeddings” or “RAG is dead.” It says the gap between generation and retrieval is real enough to deserve separate measurement.
That is already a practical warning.
The business translation is retrieval governance, not benchmark theater
For Cognaptus-style business automation, the paper’s implication is not that companies should benchmark their assistants on Olympiad math. Please do not make the procurement team solve geometry problems unless morale is already beyond repair.
The implication is that enterprise AI systems need structure-aware retrieval tests in their own domain.
A useful internal benchmark should pair each weak, surface-level test with a stronger structural one:
| Enterprise object | Weak test | Stronger MathNet-inspired test |
|---|---|---|
| Legal clauses | Retrieve documents containing the same keywords | Retrieve clauses with the same obligation under different wording |
| Customer support tickets | Retrieve tickets from the same product category | Retrieve tickets with the same root-cause pattern |
| Compliance controls | Retrieve policies from the same department | Retrieve controls with equivalent risk logic |
| Financial analysis notes | Retrieve reports mentioning the same company | Retrieve cases using the same valuation or risk mechanism |
| Operations incidents | Retrieve incidents with similar labels | Retrieve incidents solved by the same diagnostic pathway |
The difference is not cosmetic. A support bot that retrieves same-category tickets may sound informed while recommending the wrong fix. A compliance assistant that retrieves policies by topic may miss the control that actually governs the case. A finance copilot that retrieves reports mentioning the same sector may ignore the scenario with the same balance-sheet mechanism.
The MathNet lesson is that the retrieval layer should be evaluated by the structure of the task, not by the apparent semantic friendliness of the returned text.
A practical diagnostic for enterprise RAG systems
MathNet suggests a simple diagnostic pattern for enterprise RAG projects.
First, define equivalence and usefulness separately. Invariance means the same decision or obligation under different expression. Resonance means a different case with a reusable method. Affinity means topical neighborhood. Do not let the search system receive full credit for affinity when the workflow needed invariance.
Second, build hard negatives. The most revealing test cases are not irrelevant documents. They are documents that look relevant but change the answer. This is where many demos quietly collapse, because demos are usually designed with friendly examples. Friendly examples are how bad systems get promoted.
Third, measure top-1 and candidate-pool quality separately. If the right source is often in the top five but not first, the system may need reranking, metadata filters, citation-aware UX, or human review. If the right source is not in the candidate pool at all, the embedding layer itself is weak.
Fourth, test whether retrieved context changes the final answer. Retrieval accuracy is not the final business metric. The final metric is whether retrieval improves the downstream task: a better answer, faster review, fewer escalations, more consistent decisions, or reduced analyst time.
Fifth, record when retrieval hurts. This is the uncomfortable one. A RAG system can degrade performance by injecting a misleading precedent. Production evaluations should count harmful retrieval explicitly, not bury it under average answer quality.
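The third and fifth steps can be combined into a small reporting harness. This is a sketch under stated assumptions: the record fields, the function names, and the choice of k=5 are hypothetical, and "correct" stands in for whatever acceptance criterion the workflow uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalRecord:
    gold_rank: Optional[int]   # 1-based rank of the right source; None if never retrieved
    correct_without: bool      # answer acceptable with no retrieved context
    correct_with: bool         # answer acceptable with the retrieved context


def retrieval_profile(records: List[EvalRecord], k: int = 5) -> Dict[str, float]:
    """Separate top-1 precision from candidate-pool coverage."""
    if not records:
        raise ValueError("no evaluation records")
    n = len(records)
    top1 = sum(1 for r in records if r.gold_rank == 1)
    in_pool = sum(1 for r in records if r.gold_rank is not None and r.gold_rank <= k)
    return {"top1": top1 / n, "top_k": in_pool / n, "outside_pool": (n - in_pool) / n}


def retrieval_effect(records: List[EvalRecord]) -> Counter:
    """Count cases where retrieval helped, hurt, or left the outcome unchanged."""
    def label(r: EvalRecord) -> str:
        if r.correct_with and not r.correct_without:
            return "helped"
        if r.correct_without and not r.correct_with:
            return "hurt"
        return "unchanged"
    return Counter(label(r) for r in records)


# Hypothetical results for four test cases.
records = [
    EvalRecord(gold_rank=1, correct_without=False, correct_with=True),
    EvalRecord(gold_rank=3, correct_without=True, correct_with=False),
    EvalRecord(gold_rank=4, correct_without=True, correct_with=True),
    EvalRecord(gold_rank=None, correct_without=False, correct_with=False),
]
profile = retrieval_profile(records)
effect = retrieval_effect(records)
```

A wide gap between `top1` and `top_k` points toward reranking or review. A high `outside_pool` share points at the embedding layer itself. The `hurt` count is reported on its own line so it cannot be averaged away.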
Where the paper’s boundary should stay
MathNet is a mechanism warning, not a universal enterprise performance forecast.
The direct evidence is from Olympiad mathematics. MathNet-Retrieve uses synthetic equivalent positives and hard negatives generated from anchor problems. MathNet-RAG is built from 35 anchor problems and 35 expert-paired real problems, which is valuable for human grading but small. The paper’s RAG results are therefore better read as a controlled stress test of retrieval quality than as a stable estimate of how much RAG will improve every model in every domain.
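To make "small" concrete, here is a rough back-of-envelope calculation. It assumes, purely for illustration, that each of the 35 problems is graded all-or-nothing and that outcomes are independent. The reported percentages suggest the actual grading is finer-grained, so treat this as an order-of-magnitude guide only.

$$ \frac{100\%}{35} \approx 2.86 \text{ percentage points per problem}, \qquad \text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.85 \times 0.15}{35}} \approx 0.060 $$

Under those assumptions, a single problem moves a score by almost three points, and the sampling uncertainty around an 85% score is on the order of six points. Differences of one or two points between settings sit well inside that noise.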
There is also a domain difference. Mathematical equivalence is more formal than legal, operational, or financial similarity. That cuts both ways. Formal equivalence makes the benchmark cleaner. Business similarity is messier, context-dependent, and often negotiated by policy rather than proof. A structurally similar customer incident may still require a different answer because the customer tier, contract promise, or local regulation changed.
So the business inference should be precise:
| Claim type | Safe statement |
|---|---|
| What the paper directly shows | Frontier models can solve many MathNet problems, while embedding retrieval struggles badly at top-1 mathematical equivalence. |
| What Cognaptus infers | Enterprise RAG systems in structure-heavy domains should evaluate retrieval by task logic, not just semantic similarity. |
| What remains uncertain | The exact failure rates and best technical fixes will differ by domain, data quality, and workflow design. |
This boundary does not weaken the article’s point. It makes the point usable.
The MathNet equation for AI copilots
MathNet’s most important contribution is not the largest number in the solver table. It is the comparison between solving and searching.
A model that can solve a hard problem is valuable. A system that can find the right prior structure before solving is more valuable in real work. Enterprise AI does not operate in a vacuum; it operates inside archives, policies, tickets, contracts, dashboards, emails, and past decisions. The system must retrieve the right thing before it can reason with the right thing.
That is why “just add RAG” has always been an incomplete instruction. RAG is not magic. It is a dependency chain:
$$ \text{Better answer} \leftarrow \text{right context} \leftarrow \text{right retrieval} \leftarrow \text{right similarity definition} $$
Break the similarity definition, and the rest of the chain becomes theater.
MathNet gives the AI industry a useful irritation. It shows that reasoning ability and retrieval ability can diverge sharply, even in a domain where structure is explicit and correctness is not a matter of vibes. For businesses, that means the next generation of AI copilots should not be evaluated only by how elegantly they answer a question. They should be evaluated by whether they can find the right analogy before they start talking.
Reasoning gets the applause. Retrieval decides whether the applause was deserved.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, and Antonio Torralba, “MathNet: a global multimodal benchmark for mathematical reasoning and retrieval,” arXiv:2604.18584v1, 20 April 2026, https://arxiv.org/abs/2604.18584.