Database.
That is where many enterprise AI demos quietly go to die.
The user asks one clean natural-language question: “How many customers are in California?” The AI assistant smiles politely, searches something, finds a table that looks relevant, and returns a confident answer. The problem is not that the model cannot understand English. The problem is that five internal databases may all contain customers, states, locations, stores, loans, accounts, or sales regions. Some can answer the question. Some can almost answer it. Some merely smell like they can answer it.
That “almost” is where enterprise AI becomes expensive.
The paper Routing End User Queries to Enterprise Databases tackles exactly this failure mode: given a natural-language query and a repository of databases, rank the databases by whether they can actually answer the query.1 Not generate SQL yet. Not summarize the result. Not build a dashboard. First, choose the right database.
That sounds small until you remember how enterprise data actually looks: overlapping domains, reused column names, inconsistent schema conventions, legacy systems, and table names that were probably clear to someone in 2014. Charming, in the way a warehouse full of unlabeled cables is charming.
The paper’s useful contribution is not merely “better benchmark, better score.” The important point is mechanistic: embeddings confuse semantically similar databases; prompt-only LLM rerankers can confuse lexical overlap with answerability; and reliable routing needs verifiable checks for phrase coverage, schema connectivity, and semantic alignment.
In other words, the system should not only ask, “Does this database sound relevant?” It should ask, “Can the pieces needed to answer this question be grounded in this schema, and can those pieces be joined into an executable path?”
That is a very different engineering question.
The routing problem begins before SQL generation
Most discussion around natural-language data access jumps too quickly to text-to-SQL. The user asks a question; the model writes SQL; the database returns an answer. Nice diagram. Very reassuring. Also incomplete.
In a real enterprise, the first question is often not “What SQL should we write?” but “Which database should we even ask?”
The paper formalizes this as a database routing task. A system receives:
- a natural-language question;
- a repository of candidate databases;
- schema information for each database;
- and the goal of ranking the databases by answerability.
Answerability is the key word. A database is not relevant because it shares vocabulary with the query. It is relevant if it can provide the information needed to answer the query correctly and comprehensively.
This distinction matters because business users do not usually ask database-specific questions. They ask operational questions:
“Which product category had the highest return rate last month?”
“Show average customer balance by state.”
“How many books are there?”
The system then has to infer whether “customer” belongs to a retail database, loan database, CRM database, store database, or some charmingly named customer_v2_final_revised database hiding under a team folder. Semantic similarity can narrow the search. It cannot reliably prove answerability.
The paper’s framing is useful because it treats routing as a reliability layer before downstream analytics. If that layer fails, every later component may behave perfectly and still produce the wrong answer. The SQL generator can be beautiful. The chart can be elegant. The business decision can still be built on the wrong database. Elegant nonsense remains nonsense; it just has nicer typography.
The benchmark repair is not housekeeping; it changes the difficulty of the task
The first major contribution of the paper is benchmark construction. The authors extend two existing text-to-SQL datasets into database-routing benchmarks:
| Benchmark | Source dataset | Total databases | Total questions | Train questions | Test questions |
|---|---|---|---|---|---|
| Spider-Route | Spider | 206 | 11,831 | 5,892 | 5,939 |
| BIRD-Route | BIRD-SQL | 80 | 10,962 | 5,461 | 5,501 |
The design choice matters. Prior routing benchmarks inherited train/test splits from the original text-to-SQL datasets. That made the routing repository small, uneven, and easier than a realistic enterprise environment. The authors instead merge databases into a unified repository and split queries within each database 50/50 into train and test. They also define in-domain and cross-domain settings while keeping the repository consistent.
That sounds like dataset plumbing. It is not.
Routing difficulty depends heavily on the number and similarity of candidate databases. If the repository is small or the test setting is artificially clean, a method can appear strong simply because the wrong choices are not very tempting. In an enterprise repository, wrong choices are extremely tempting. Many databases are semantically close, structurally partial, and lexically misleading.
The paper’s reconstructed benchmarks make this harder setting visible. The authors show that errors are not random wanderings into irrelevant domains. They are concentrated within semantic neighborhoods. For Spider-Route, 82% of questions have top-5 embedding-ranked databases from the same cluster; for BIRD-Route, the figure is 76%. Among cases where the correct database is not ranked first by the embedding method, 86.2% of Spider-Route errors and 75.0% of BIRD-Route errors still put a same-domain database first.
That is the important diagnosis. The router is usually not mistaking a healthcare database for a sports database. It is mistaking one plausible book database for another plausible book database, or one customer-like database for another customer-like database.
For business systems, this is exactly the uncomfortable zone. The model is not obviously wrong. It is plausibly wrong. Plausibly wrong systems are harder to debug, harder to monitor, and much better at surviving demos.
Embeddings retrieve the neighborhood, not the answer
The paper begins with a standard retrieval idea: embed the user question, embed database schemas, compute cosine similarity, and rank candidate databases. This is reasonable as a first-pass filter. It is also insufficient.
Embedding-based retrieval is good at finding semantic neighborhoods. If the user asks about books, it can retrieve book-related databases. If the user asks about products, it can retrieve product-related databases. That is useful. But routing requires choosing the database that can answer the specific question, not merely one that lives in the same semantic suburb.
The paper’s results show this clearly. Using all-mpnet-base-v2 embeddings alone, recall@1 is 0.4747 on BIRD-Route and 0.4365 on Spider-Route. A stronger embedding model, gte-Qwen2-7B-instruct, raises recall@1 to 0.6249 on BIRD-Route and 0.6033 on Spider-Route. Better embeddings help. They do not solve the core problem.
| Method | BIRD R@1 | BIRD MAP | Spider R@1 | Spider MAP |
|---|---|---|---|---|
Cosine retrieval with all-mpnet-base-v2 |
0.4747 | 0.5751 | 0.4365 | 0.5281 |
Cosine retrieval with gte-Qwen2-7B-instruct |
0.6249 | 0.7471 | 0.6033 | 0.6780 |
| Direct LLM reranking with Gemini | 0.7120 | 0.7890 | 0.6799 | 0.7193 |
| Modular reasoning reranking | 0.7962 | 0.8210 | 0.7865 | 0.7964 |
Recall@1 asks whether the correct database is ranked first. MAP, or mean average precision, captures ranking quality beyond the first position. These are not abstract leaderboard decorations. In enterprise routing, rank one matters because many assistants will pass only the top database to the next stage. A wrong rank one becomes a wrong SQL query, then a wrong answer, then possibly a wrong decision with a meeting attached.
The improvement from stronger embeddings is meaningful, but the remaining gap tells us something more interesting: semantic similarity is a retrieval primitive, not an answerability test. It can produce a shortlist. It cannot reliably distinguish among databases with overlapping vocabulary, parallel structures, or ambiguous table names.
That is why the paper’s method keeps embeddings as stage one, but refuses to let them make the final decision alone.
Prompt-only LLM reranking still confuses words with roles
A natural response is: fine, use embeddings to retrieve top candidates, then ask an LLM to rerank them.
The paper tests this. Direct LLM reranking improves over embeddings alone, but still loses to the modular method. On BIRD-Route, Gemini-based reranking reaches 0.7120 recall@1, while the modular approach reaches 0.7962. On Spider-Route, direct reranking reaches 0.6799, while the modular approach reaches 0.7865.
The reason is not that the LLM is useless. The reason is that a single broad prompt asks the model to perform several different operations at once:
- interpret query intent;
- match phrases to schema entities;
- distinguish values from attributes;
- infer whether tables can be joined;
- compare several similar schemas;
- and produce a clean ranking.
That is a lot to ask from a prompt. Naturally, the model sometimes takes shortcuts. The paper gives a useful example: the query asks for “the attribute data type for the attribute named ‘Green’.” A direct LLM reranker incorrectly prefers a database where “Green” appears to match a color field. The correct database is product_catalog, where “Green” is treated as an attribute name, not merely a color value.
This is the kind of error that looks small but reveals the deeper failure. The model recognizes the word. It misreads the role of the word.
Enterprise databases are full of such traps. “Gold” may be a product tier, a material, a customer segment, a color, a loyalty status, or a fund category. “CA” may be California, Canada, current assets, certificate authority, or an abbreviation someone invented during a migration and never documented. Asking a general LLM to “rank the most relevant database” compresses too many distinctions into one soft judgment.
The paper’s answer is to decompose the judgment.
The modular method turns routing into a sequence of smaller checks
The proposed method is training-free. That is not just a convenience; it fits the enterprise setting. Database schemas change. New tables appear. Legacy systems are renamed. Fine-tuning a router every time a schema shifts is not exactly the dream of operational efficiency.
The method works in two stages.
First, embeddings retrieve the top-$k$ candidate databases. In the main experiments, $k=5$. This stage is allowed to be broad. Its job is to make sure the right database is in the candidate set.
Second, the method reranks those candidates using modular reasoning. Instead of asking the LLM to solve the entire routing problem in one step, it breaks the decision into several sub-tasks:
| Module | What it does | Why it matters for routing |
|---|---|---|
| Schema graph construction | Builds an adjacency list of joinable tables based on schema structure | Prevents choosing databases where needed entities cannot be connected |
| Query phrase-schema mapping | Maps query spans to columns, values, or schema entities | Forces the system to identify what each phrase actually refers to |
| Phrase coverage scoring | Checks whether all important query phrases are grounded in the schema | Penalizes databases that only partially match the question |
| Connectivity validation | Tests whether mapped entities form a connected subgraph | Separates answerable databases from merely relevant-looking ones |
| Semantic tie-breaking | Uses fine-grained embedding similarity between phrases and mapped schema entities | Resolves cases where multiple databases pass structural checks |
The clever part is not that an LLM appears somewhere in the pipeline. Everyone can sprinkle an LLM into a diagram. The useful part is where the LLM is used.
The LLM handles localized semantic interpretation: extracting query phrases and mapping them to possible schema entities. The algorithm handles structural checks: coverage, connectivity, graph traversal, scoring, and tie-breaking. This division of labor is more reliable than asking a model to behave like a database administrator, semantic parser, graph engine, and ranking evaluator all at once.
The paper’s scoring logic reflects this. Coverage checks whether query phrases are mapped. Missing phrases are penalized sharply. Connectivity checks whether the mapped entities can be joined through the database schema. If not, the database is structurally invalid for that query. The total score combines coverage and connectivity. Semantic score then breaks ties when multiple databases appear structurally capable.
This is boring in the best possible way. Enterprise reliability often comes from boring checks placed in the right location.
Connectivity is the part prompt engineering usually forgets
The most business-relevant piece of the paper is the connectivity check.
A database can contain all the right-looking words and still fail to answer the query if the relevant columns live in tables that cannot be joined. This is not a language problem. It is a schema graph problem.
The paper’s example is useful: for the query “What is the transmitter of the radio with the largest ERP_kW?”, direct LLM reranking selects databases whose mapped entities are distributed across tables without valid joining paths. The modular method ranks the correct tv_shows database first because the mapped schema entities can be connected. Other candidates receive a connectivity score of zero.
This is the difference between semantic relevance and operational answerability.
A human analyst would understand this immediately. If two needed fields are in unrelated tables, you cannot write a valid SQL query without inventing a bridge. LLMs, however, can be seduced by names. They see words that match the question and infer relevance. The modular method adds a hard question: can the schema support the join?
That single check changes the nature of the system. It turns routing from “which database sounds right?” into “which database can support the computation implied by the question?”
For enterprise AI, that distinction is worth money. It reduces wasted queries, wrong answers, manual debugging, and the slow erosion of trust that happens when users learn that the assistant is impressive only when it is lucky.
Same-domain confusion is the real enemy
The paper’s cluster analysis should be read as a diagnosis, not a decorative experiment.
The authors cluster database schemas using embeddings and constrained K-means to analyze whether routing errors happen across unrelated domains or within similar domains. The result is clear: top candidates are often from the same semantic cluster, and errors frequently place a same-domain database above the ground truth.
This matters because it explains why naive improvements are limited.
A stronger embedding model can retrieve better candidates, but if the top five are all plausible databases in the same business area, ranking still requires finer evidence. A larger LLM can inspect more schema text, but without explicit coverage and connectivity checks, it may still overweight lexical similarity. More prompt instructions can tell the model to be careful, which is adorable, but not the same as enforcing structural validation.
The paper’s examples show three recurring mechanisms of failure and repair:
| Failure pattern | What goes wrong | What the modular method adds |
|---|---|---|
| Attribute/value confusion | The model treats a query term as a value when it should be an attribute name, or vice versa | Phrase-schema mapping forces role-sensitive interpretation |
| Structural impossibility | A database contains relevant-looking entities, but they cannot be joined | Connectivity scoring invalidates disconnected mappings |
| Multiple plausible databases | Several databases can partially or fully answer a generic query | Semantic tie-breaking compares phrase-level alignment after structural checks |
This table is more than a technical summary. It is a design pattern. If an enterprise AI system touches structured data, it should have explicit tests for grounding, feasibility, and ambiguity. Otherwise, it is doing vibes-based data engineering. Vibes are not a control framework.
The result table is main evidence; the error analysis explains why it works
The paper includes several kinds of evidence. They should not be read as if they all prove the same thing.
| Evidence type | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark results on Spider-Route and BIRD-Route | Main evidence | Modular reasoning beats embedding-only and direct LLM reranking on R@1 and MAP | It does not prove universal performance on every enterprise schema |
| Same-domain cluster analysis | Diagnostic comparison | Routing errors concentrate among semantically similar databases | It does not prove the clusters correspond perfectly to real business domains |
| Qualitative examples such as “Green,” “ERP_kW,” and “How many books are there?” | Mechanism explanation | The method fixes role confusion, disconnected schema mappings, and tie cases | Individual examples are illustrative, not standalone statistical proof |
| Error analysis adding the correct DB into top-5 | Boundary test | If retrieval includes the correct database, the reranker can often select it | It depends on artificially inserting the correct database for analysis |
| Gemini 2.5 with top-20 candidates on 500 erroneous queries | Exploratory robustness/sensitivity test | Larger candidate sets and a stronger LLM may recover many failures | It is a subset experiment and has higher cost |
| Removing BIRD metadata | Ablation-like test | Domain metadata improves routing when DDL alone is ambiguous | It does not show that all metadata is equally useful or cheap to maintain |
This distinction matters because the paper’s strongest claim is not “the method solves database routing.” The stronger, better-supported claim is more precise: when the correct database appears in the retrieved candidate set, modular reasoning can substantially improve rank-one selection among semantically similar candidates.
The error analysis is especially revealing. The authors report that among remaining failures, 11% of Spider-Route and 16% of BIRD-Route errors occur because the correct database is not in the top-5 candidates produced by the embedding method. When they add the correct database into the top-5 and rerun modular reranking, the method ranks it first for 98% of such Spider-Route cases and 96% of such BIRD-Route cases.
This is not the main operating condition; it is a diagnostic test. Its purpose is to separate retrieval failure from reranking failure. If the correct database never enters the candidate set, the reranker cannot rescue it. No amount of beautiful reasoning helps if the right option is not on the menu.
The paper also tests Gemini 2.5 with top-20 candidates on a subset of 500 erroneous questions. Recall@1 rises to 92% on Spider-Route and 89% on BIRD-Route. This is best read as a sensitivity or exploratory extension: increasing candidate breadth and using a stronger model can help, but cost rises. In enterprise systems, that cost-performance trade-off is not a footnote. It is the architecture.
Metadata is not decoration; it is schema meaning in prose form
For BIRD-Route, the paper uses database-level metadata formed from the union of question-specific evidence sentences in the original BIRD-SQL dataset. The authors also run an experiment without this metadata and observe a 5% decrease in recall@1 and a 7% decrease in recall@3.
This is a practical result hiding inside a technical paper.
DDL tells you table names, column names, data types, and relationships. It does not always tell you what a field means in business language. A column named status may refer to order status, customer status, shipment status, loan status, compliance status, or emotional status of the engineering team after the last migration. DDL alone often cannot disambiguate this.
Metadata gives the router more semantic surface area. Table descriptions, column descriptions, sample values, business definitions, and domain notes can help the system map user phrases to schema entities. The paper’s result suggests that metadata improves routing when schema names are ambiguous or polysemous.
The business implication is straightforward but often ignored: if a company wants reliable AI access to structured data, it cannot treat data documentation as optional bureaucracy. Metadata quality becomes model performance infrastructure.
That does not mean every enterprise needs a giant ontology before building anything. It means there should be a practical metadata layer: table purpose, key columns, entity definitions, join notes, ownership, freshness, and known ambiguity. Without that, the system is asking an LLM to infer business meaning from abbreviations and hope. Hope is a poor schema management strategy.
What Cognaptus would infer for enterprise architecture
The paper directly shows benchmark and reranking performance. The business architecture implications require interpretation.
Here is the clean separation.
| Layer | What the paper directly shows | Cognaptus inference for business systems | Boundary |
|---|---|---|---|
| Retrieval | Embedding retrieval improves with stronger models but still misranks same-domain databases | Use embeddings as a candidate generator, not as the final routing decision | Retrieval quality limits downstream reranking |
| Reranking | Modular reasoning beats direct LLM reranking on tested benchmarks | Decompose routing into auditable sub-checks instead of relying on a monolithic prompt | Benchmarks are derived from NL-to-SQL datasets, not live enterprise logs |
| Schema structure | Connectivity checks eliminate candidates that cannot support joins | Treat schema graphs as first-class routing infrastructure | Requires accurate schema relationships |
| Metadata | Removing BIRD metadata reduces performance | Invest in lightweight semantic documentation for tables and columns | Metadata must be maintained, not dumped once and forgotten |
| Ambiguity | Some queries are answerable by multiple databases | Router output should include confidence, alternatives, and clarification triggers | Ground-truth labels may underrepresent legitimate alternatives |
For implementation, the practical architecture would look something like this:
- Schema registry: store database schemas, table descriptions, column descriptions, primary/foreign key relationships, ownership, freshness, and access constraints.
- Candidate retrieval: use embeddings to retrieve top-$k$ candidate databases quickly.
- Phrase grounding: map important query spans to schema entities and possible values.
- Coverage scoring: check whether all necessary query components are grounded.
- Connectivity validation: verify whether grounded entities can be joined.
- Semantic tie-breaking: resolve structurally valid candidates using fine-grained similarity.
- Uncertainty handling: return clarification requests or multiple candidates when the query is genuinely ambiguous.
- Audit trace: log why a database was selected, which phrases were mapped, and which candidates failed coverage or connectivity.
The audit trace is not cosmetic. In enterprise settings, users will ask, “Why did the assistant use this database?” The system needs a better answer than “because the embedding score was high” or “because the LLM said so.” Coverage and connectivity create inspectable reasons.
That is the difference between an AI feature and an enterprise system.
The ROI is fewer silent failures, not magical automation
The obvious business story is that better routing improves enterprise AI accuracy. True, but too vague.
The more useful ROI story is failure reduction.
Wrong database routing creates downstream costs:
- analysts spend time verifying answers;
- data teams investigate false complaints about SQL generation;
- business users lose trust in AI assistants;
- compliance teams worry about untraceable data access;
- dashboards or reports may be generated from the wrong source;
- and prompt engineers are summoned, as if the schema graph can be healed with adjectives.
The paper’s modular design reduces a particular class of silent failure: a system selecting a plausible but structurally wrong database. That has operational value because silent wrong answers are worse than visible failures. A visible failure asks for help. A silent wrong answer enters the meeting deck.
The business value is therefore not simply “higher R@1.” It is cheaper diagnosis. When a router fails, a modular system can reveal whether the failure came from retrieval, missing metadata, phrase mapping, disconnected tables, or query ambiguity. That makes improvement targeted.
An embedding-only router gives you a score. A prompt-only reranker gives you a ranking. A modular router gives you a partial explanation of the decision path. In enterprise AI, explanation is not moral decoration. It is maintenance infrastructure.
The boundaries are real, and they matter
The paper is useful, but its boundaries should be kept clear.
First, the benchmarks are constructed from existing NL-to-SQL datasets. That is reasonable and valuable, but enterprise queries may be messier. Real users ask incomplete, context-dependent, permission-sensitive, and conversational questions. They may say “last quarter” without specifying fiscal calendars. They may ask “our premium customers” when the definition lives in a policy document, not a database schema.
Second, the method depends on candidate retrieval. In the main experiment, reranking operates over the top five databases from embedding retrieval. The paper’s own error analysis shows that some failures occur because the correct database is not in that top-five set. Increasing $k$ can help but raises cost, latency, and prompt complexity.
Third, schema quality matters. Connectivity validation assumes useful schema relationships. Many enterprise databases have incomplete foreign keys, implicit joins, denormalized tables, legacy conventions, or relationships known only to senior analysts and possibly one retired contractor. If the schema graph is wrong, the router’s structural confidence may be misplaced.
Fourth, some queries are genuinely ambiguous. The paper notes cases where the predicted database can also answer the query even if it is not the benchmark’s labeled ground truth. For example, “How many customers in state of CA?” may be answerable from more than one database containing customer and state information. In business settings, the right response may not be a single database. It may be a clarification question: “Do you mean retail customers, loan customers, or all customer records?”
This last point is important. Perfect routing is sometimes the wrong goal. Good routing should know when not to pretend the question is well-specified.
The sharper lesson: enterprise AI needs routing governance
The paper’s contribution fits a broader pattern: enterprise AI does not become reliable by making the model more eloquent. It becomes reliable by adding the missing control layers around the model.
For database access, those layers include:
- candidate retrieval;
- schema-aware grounding;
- structural validation;
- metadata management;
- ambiguity detection;
- audit logging;
- and fallback or clarification policies.
This is less glamorous than “let the AI talk to your data.” It is also more likely to work.
The misconception to retire is that database routing is just semantic search over schema descriptions, with a large model added at the end for refinement. That approach treats routing as a ranking problem. The paper shows why it should be treated as an answerability problem.
Ranking asks: which database looks most relevant?
Answerability asks: which database can support the entities, relationships, and computations implied by the question?
The second question is harder. It is also the one business users thought they were asking.
Conclusion: ask once, but route with proof
The paper quietly dismantles a popular enterprise AI fantasy: that embeddings plus a clever prompt are enough to connect business users to structured data. They are not enough. They retrieve neighborhoods. They produce plausible rankings. They do not reliably prove that a database can answer a question.
The stronger architecture is modular. Use embeddings to find candidates. Use LLMs for localized semantic mapping. Use schema graphs to test connectivity. Use coverage scores to penalize missing concepts. Use semantic tie-breaking only after the structural checks are satisfied.
That is the real lesson: enterprise AI should not merely sound like it understands the question. It should show that the selected database can carry the question all the way to an answer.
Because when a user asks once, the system has one job: query right.
Cognaptus: Automate the Present, Incubate the Future.
-
Saikrishna Sudarshan, Tanay Kulkarni, Manasi Patwardhan, Lovekesh Vig, Ashwin Srinivasan, and Tanmay Tulsidas Verlekar, “Routing End User Queries to Enterprise Databases,” arXiv:2601.19825, 2026. https://arxiv.org/abs/2601.19825 ↩︎