Opening — Why this matters now

Enterprises love to say they are “data‑driven.” In practice, they are database‑fragmented. A single natural‑language question — How many customers in California? — may be answerable by five internal databases, all structurally different, semantically overlapping, and owned by different teams. Routing that question to the right database is no longer a UX problem. It is an architectural one.

Large Language Models promised to solve this. Just embed the question, embed the schemas, rank by similarity, and let the LLM do the rest. The paper “Routing End User Queries to Enterprise Databases” shows — quite calmly and convincingly — why this optimism is misplaced.
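As a concrete picture of that recipe, here is a minimal sketch of the embed-and-rank baseline. TfidfVectorizer stands in for a neural embedding model, and the schema summaries are invented for illustration:

```python
# Naive baseline: embed the question and each schema summary,
# then rank databases by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented schema summaries; a real system would serialize table/column names.
schemas = {
    "crm":     "customers: id, name, state, signup_date",
    "orders":  "orders: order_id, customer_id, total, ship_state",
    "support": "tickets: ticket_id, customer_id, issue, status",
}
question = "How many customers in California?"

vec = TfidfVectorizer()
matrix = vec.fit_transform([question] + list(schemas.values()))
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

# Every schema that mentions "customers" lands in the same neighborhood.
print(sorted(zip(schemas, scores), key=lambda kv: -kv[1]))
```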

Background — Context and prior art

The DB‑routing task was formalized only recently: given a natural‑language query and a repository of enterprise databases, rank the candidate databases by whether they can actually answer the question. Early benchmarks reused NL‑to‑SQL datasets like Spider and BIRD‑SQL, but with a critical flaw: train/test splits were too clean, too small, and too forgiving.

As a result, prior systems reported surprisingly high accuracy in cross‑domain settings — a red flag. In real enterprises, databases do not politely stay in separate semantic lanes. They overlap, collide, and reuse column names with reckless abandon.

Analysis — What the paper actually does

The authors fix the benchmark first — a decision many papers postpone until reviewers force it.

1. A more realistic benchmark

Instead of assigning whole databases to separate train and test splits, they:

  • Merge all databases into a single repository
  • Split queries within each database (50/50 train–test)
  • Keep the repository identical across in‑domain and cross‑domain settings

This produces two datasets — Spider‑Route and BIRD‑Route — with hundreds of databases competing simultaneously. Routing becomes meaningfully hard again.
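A minimal sketch of that construction, with illustrative names rather than the authors' code:

```python
# Merge every database into one candidate pool and split *queries*
# 50/50 within each database, so the repository is identical at
# train and test time.
import random

def build_route_benchmark(datasets, seed=0):
    """datasets: {db_name: [nl_query, ...]} from Spider/BIRD-style data."""
    rng = random.Random(seed)
    repository = sorted(datasets)            # all databases compete at once
    train, test = [], []
    for db, queries in datasets.items():
        queries = list(queries)
        rng.shuffle(queries)
        half = len(queries) // 2
        train += [(q, db) for q in queries[:half]]
        test  += [(q, db) for q in queries[half:]]
    return repository, train, test
```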

2. Why embeddings + LLM prompts fail

Embedding similarity works — until it doesn’t. When several databases live in the same semantic neighborhood (books, customers, products), cosine similarity collapses into tie‑break chaos.

Prompt‑only LLM re‑ranking fares no better. The paper documents systematic failures where LLMs:

  • Confuse attribute names with attribute values
  • Prefer lexical overlap over structural feasibility
  • Select databases that cannot even be joined to answer the query

In short: fluent, confident, wrong.
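For contrast with what follows, this is roughly the shape of the prompt-only baseline. The prompt wording and the call_llm placeholder are assumptions, not the paper's setup:

```python
def prompt_only_rerank(question, candidates, call_llm):
    """candidates: {db_name: schema_summary}; call_llm: any text-in, text-out API."""
    listing = "\n".join(f"- {db}: {schema}" for db, schema in candidates.items())
    prompt = (
        "Which database can answer the question below? "
        "Reply with one database name.\n"
        f"Question: {question}\nCandidates:\n{listing}"
    )
    answer = call_llm(prompt).strip()
    # Nothing here checks joinability or value grounding, so the model is
    # free to latch onto lexical overlap -- exactly the failure modes above.
    return answer if answer in candidates else next(iter(candidates))
```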

3. Modular reasoning instead of monolithic prompting

The core contribution is a training‑free, modular re‑ranking framework that decomposes “Can this DB answer the question?” into verifiable sub‑tasks:

| Step | What it checks | Why it matters |
| --- | --- | --- |
| Schema graph construction | Which tables can join | Prevents impossible SQL |
| Query‑phrase extraction | What the question really refers to | Avoids token‑level hallucination |
| Entity coverage | Are all query parts grounded? | Penalizes missing semantics |
| Connectivity validation | Can mapped entities connect? | Structural correctness |
| Semantic tie‑breaking | Fine‑grained similarity | Resolves same‑domain confusion |

LLMs are used only where they are strong: localized semantic interpretation. Everything structural is handled algorithmically.
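The structural steps are simple enough to sketch. Below is a hedged illustration of steps 1, 3, 4, and 5; the function names, data structures, and tie-break weight are assumptions, not the authors' implementation:

```python
from collections import defaultdict, deque

def schema_graph(foreign_keys):
    """Step 1: tables are nodes, FK links are edges -> which tables can join."""
    graph = defaultdict(set)
    for src, dst in foreign_keys:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph

def connected(graph, tables):
    """Step 4: can all tables mapped from the query reach each other via joins?"""
    tables = set(tables)
    if len(tables) <= 1:
        return True
    start = next(iter(tables))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return tables <= seen

def score_database(phrases, grounding, graph, semantic_score):
    """Combine coverage (step 3), connectivity (step 4), and tie-breaking (step 5).

    phrases:        query phrases extracted by the LLM (step 2)
    grounding:      {phrase: table} for phrases this DB can ground
    semantic_score: fine-grained similarity in [0, 1]
    """
    coverage = len(grounding) / max(len(phrases), 1)   # missing semantics penalized
    if not connected(graph, grounding.values()):       # impossible joins filtered out
        return 0.0
    return coverage + 0.1 * semantic_score             # tie-break weight is assumed
```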

Findings — Results that actually mean something

The results are blunt.

| Method | R@1 (Spider) | R@1 (BIRD) | mAP (Spider) | mAP (BIRD) |
| --- | --- | --- | --- | --- |
| Embeddings only | 0.44 | 0.47 | 0.53 | 0.58 |
| LLM re‑ranking | 0.68 | 0.71 | 0.72 | 0.79 |
| Modular reasoning (Ours) | 0.79 | 0.80 | 0.80 | 0.82 |
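To read the table: R@1 is the fraction of queries whose top-ranked database is correct, and mAP averages precision over each ranked list. A small sketch, assuming one gold database per query (in which case average precision reduces to the reciprocal rank of the gold database):

```python
def recall_at_1(rankings, gold):
    """rankings: one ranked list of DB names per query; gold: correct DB per query."""
    return sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)

def mean_average_precision(rankings, gold):
    # Assumes the gold database appears somewhere in every ranked list.
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)
```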

More importantly, error analysis shows why improvements occur:

  • Most failures happen within the same semantic domain, not across domains
  • Structural validation eliminates entire classes of false positives
  • When the correct DB appears in top‑k, modular reasoning recovers it ~98% of the time

This is not prompt tuning. It is system design.

Implications — What this means for enterprise AI

Three uncomfortable lessons emerge:

  1. LLMs are not database routers. They are semantic assistants that need scaffolding.
  2. Enterprise AI failures are structural, not linguistic. You cannot prompt your way out of graph constraints.
  3. Interpretability is a feature, not a luxury. Modular scoring explains why a database was chosen — critical for trust, debugging, and compliance.

For practitioners building AI search, RAG pipelines, or agentic data systems, the takeaway is clear: routing must be reasoned, not merely ranked.

Conclusion — Routing is reasoning

This paper quietly dismantles a popular illusion: that embeddings plus a clever prompt are sufficient for enterprise intelligence. They are not.

By rebuilding the benchmark and rethinking the architecture, the authors show that correctness emerges from coverage, connectivity, and constraint‑aware reasoning — not from ever‑larger language models.

Enterprise AI does not fail because models are weak. It fails because systems are lazy.

Cognaptus: Automate the Present, Incubate the Future.