Search Me: Why PIPER Makes Tables Findable When Metadata Goes Missing

A data catalog is supposed to answer a simple question: where is the dataset I need?

In practice, it often answers a different question: which dataset owner bothered to write a decent title, description, and tag list?

That distinction matters. A table may contain exactly the columns, ranges, patient attributes, locations, dates, or transaction variables a team needs, while its metadata says something thrilling like “export_final_v3.csv.” The dataset is technically present. It is not findable. This is a familiar enterprise tragedy: the data lake is full, the catalog exists, and still everyone asks the same analyst where the useful files are. Excellent digital transformation, naturally.

The paper behind PIPER addresses this specific failure mode: tabular dataset search when metadata is incomplete, unreliable, or too shallow to represent what a table actually contains [1]. Its main move is not simply “use an LLM to summarize tables.” That would be the obvious version, and the obvious version is usually where the costs hide. PIPER instead builds a content-based retrieval pipeline around statistical table profiles, LLM-generated pseudoqueries, dense retrieval, query optimization, and listwise reranking.

The important business lesson is comparative, not promotional. PIPER does not show that metadata is dead. It shows when metadata-first search is fragile, why content-derived search can help, and why the likely production answer is hybrid retrieval rather than another heroic “one model to search them all” story.

A weak reading of the paper would say: traditional search uses metadata; PIPER uses LLMs; therefore LLMs are better. This is tidy, quotable, and not quite what the evidence says.

The more useful comparison sets three search approaches side by side:

| Search approach | What it indexes | When it works | When it fails |
| --- | --- | --- | --- |
| Metadata-first catalog search | Titles, descriptions, tags, publisher metadata, sometimes schema labels | Metadata is accurate, specific, and aligned with user language | Metadata is missing, vague, stale, or written for governance rather than discovery |
| Content-derived semantic search | Signals extracted from the table itself, then represented in retrieval-friendly language | Table content carries the real meaning and metadata is weak | The table profile loses crucial row-level context or metadata already gives stronger cues |
| Hybrid search | Metadata plus content profiles plus generated access paths | Real enterprise collections where some datasets are well documented and many are not | Requires orchestration, evaluation, cost control, and governance around generated representations |

PIPER sits mostly in the second approach, content-derived semantic search. Its design deliberately reduces dependence on metadata. Each candidate dataset is treated as a single table. The system builds searchable representations from the table content itself, not from a publisher’s description.

That choice is not merely technical taste. It reflects a common condition in data lakes, open-data portals, and cross-organizational data spaces: the data often exists before the documentation culture matures. The catalog becomes a thin wrapper around poorly described objects. Search quality then depends less on retrieval algorithms and more on whether someone wrote good descriptions at ingestion time. Spoiler: many did not.

PIPER asks whether retrieval can expose what the table can support, even when the catalog text does not.

PIPER indexes likely user intentions, not just table summaries

The paper’s most important mechanism is the offline transformation from a raw table into multiple pseudoqueries.

First, PIPER creates a statistical profile for each table. The profile is computed over the full table, not just a small sample. For each column, it records attributes such as datatype, number of unique values, missing-value information, value coverage, and type-specific statistics. For numerical columns, this can include minimum, maximum, mean, and median.

This is a modest but important design choice. Passing a full real-world table into an LLM is often impractical. Passing only a few rows may make the model hallucinate the table’s meaning from an unrepresentative slice. A statistical profile is a compromise: compressed enough for LLM processing, but still grounded in the whole table.
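To make the idea of a full-table profile concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the paper's implementation: the function names, the choice of fields, and the reading of "coverage" as the share of non-missing values are all assumptions made for this example.

```python
import statistics
from collections import Counter
from typing import Any


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Summarize one column over ALL of its values (no row sampling)."""
    present = [v for v in values if v is not None and v != ""]
    profile: dict[str, Any] = {
        "n_rows": len(values),
        "n_missing": len(values) - len(present),
        "n_unique": len(set(present)),
        # Assumption: "coverage" is read here as the share of non-missing values.
        "coverage": len(present) / len(values) if values else 0.0,
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        # Type-specific statistics for numerical columns.
        profile["dtype"] = "numeric"
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.fmean(numeric)
        profile["median"] = statistics.median(numeric)
    else:
        # For everything else, keep a few frequent values as evidence.
        profile["dtype"] = "text"
        counts = Counter(str(v) for v in present)
        profile["top_values"] = [v for v, _ in counts.most_common(3)]
    return profile


def profile_table(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Profile every column of a table given as a list of row dicts."""
    columns = list(rows[0].keys()) if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}
```

The output is a small dictionary per column, which is the kind of compact, whole-table evidence that can be passed to an LLM in place of raw rows.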

Then comes the more interesting part. Instead of indexing the profile as one dense vector, PIPER asks an LLM to generate a fixed set of synthetic natural-language pseudoqueries from the profile. These pseudoqueries are meant to resemble what a user might type when searching for such a dataset. A diabetes-related table, for example, may yield a pseudoquery such as searching for a diabetes dataset with patient attributes like age and BMI.

This changes the retrieval object. The dataset is no longer represented only as a table, a title, or a blob of metadata. It is represented as several possible access paths into the dataset.

That matters because users rarely search in the language of schemas. They do not necessarily know the column names. They ask for “hospital readmission data with treatment outcomes,” “regional housing affordability indicators,” or “customer transactions with product category and timestamp.” If the dataset only exposes region_id, income_band, and sale_dt, the retrieval system must bridge a language gap. PIPER tries to build that bridge before the user ever arrives.

The offline phase is therefore not a decorative summarization step. It is an indexing strategy: turn table profiles into multiple user-style retrieval hooks, embed each hook, and store the embeddings in a vector database together with dataset identifiers and profiles.
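The indexing strategy can be sketched in a few lines. The example below is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real dense encoder, the in-memory list stands in for a vector database, and the pseudoqueries are passed in by hand rather than generated by an LLM. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def toy_embed(text: str) -> Counter:
    """Stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class PseudoqueryIndex:
    """In-memory stand-in for a vector database of pseudoqueries."""
    # Each entry: (dataset_id, pseudoquery text, embedding)
    entries: list = field(default_factory=list)
    # Profiles are stored alongside so a later reranker can read them.
    profiles: dict = field(default_factory=dict)

    def add_dataset(self, dataset_id: str, profile: dict,
                    pseudoqueries: list[str]) -> None:
        self.profiles[dataset_id] = profile
        for pq in pseudoqueries:
            self.entries.append((dataset_id, pq, toy_embed(pq)))

    def search(self, query: str, k: int = 5) -> list[tuple[float, str, str]]:
        """Return the k most similar pseudoqueries as (score, dataset_id, text)."""
        qv = toy_embed(query)
        scored = [(cosine(qv, vec), ds, pq) for ds, pq, vec in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]
```

The design point worth noticing is that one dataset contributes several entries, so a single table can be reached through several differently worded hooks.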

The online pipeline turns messy questions into retrievable subqueries

At query time, PIPER adds another layer: query optimization.

A user query may be short, ambiguous, multi-faceted, or written in vocabulary that does not match the generated pseudoqueries. PIPER uses an LLM in two steps. First, it creates a short internal background document containing terminology and contextual concepts related to the query. Second, it decomposes the original query into optimized subqueries designed for retrieval.

Only these optimized subqueries are used for the initial search. The original user query is preserved for final reranking.

That division is sensible. Retrieval wants explicit, decomposed, recall-friendly language. Final ranking wants alignment with the original intent. If the optimized subqueries drift too far, the reranker can still evaluate candidates against the initial request.
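A sketch of that two-step split, with the LLM injected as a plain callable so the example runs without any provider SDK. The prompt wording, the function name, and the fallback to the original query when the model returns nothing are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
from typing import Callable


def optimize_query(query: str, llm: Callable[[str], str],
                   max_subqueries: int = 3) -> dict:
    """Two LLM calls: background note first, then retrieval subqueries.

    `llm` is any function mapping a prompt string to a response string.
    """
    # Step 1: short internal background document (terminology, context).
    background = llm(
        "Write a short background note with terminology and related "
        f"concepts for this dataset search request: {query}"
    )
    # Step 2: decompose the request into retrieval-friendly subqueries.
    raw = llm(
        "Using the background below, rewrite the request as up to "
        f"{max_subqueries} short retrieval subqueries, one per line.\n"
        f"Background: {background}\nRequest: {query}"
    )
    cleaned = (line.strip("-* \t") for line in raw.splitlines())
    subqueries = [line for line in cleaned if line][:max_subqueries]
    return {
        # Kept untouched so the final reranker can judge against it.
        "original_query": query,
        "background": background,
        # Assumption: fall back to the original query if nothing usable returns.
        "subqueries": subqueries or [query],
    }
```

Only `subqueries` would go to the vector search; `original_query` travels alongside so the reranking stage can pull candidates back toward what the user actually asked.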

The candidate scoring is intentionally simple. For each optimized subquery, PIPER retrieves the top similar pseudoqueries. Since every pseudoquery belongs to a dataset, the system aggregates matches at dataset level. A dataset is promoted when its pseudoqueries appear repeatedly across retrieved results. In plain terms: if several user-intent fragments point toward the same table, that table deserves attention.
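The dataset-level aggregation is simple enough to sketch directly. The version below counts how often each dataset's pseudoqueries appear across the per-subquery result lists; breaking ties by summed similarity and then by identifier is an assumption added for this example, not something the paper specifies.

```python
from collections import defaultdict


def aggregate_hits(
    hits_per_subquery: list[list[tuple[str, float]]],
) -> list[tuple[str, int, float]]:
    """Promote datasets whose pseudoqueries recur across subquery results.

    Each inner list holds (dataset_id, similarity) pairs for one subquery.
    Returns (dataset_id, votes, summed_similarity), best candidate first.
    """
    votes: dict[str, int] = defaultdict(int)
    similarity: dict[str, float] = defaultdict(float)
    for hits in hits_per_subquery:
        for dataset_id, score in hits:
            votes[dataset_id] += 1          # one vote per retrieved pseudoquery
            similarity[dataset_id] += score
    return sorted(
        ((ds, votes[ds], similarity[ds]) for ds in votes),
        # Assumed tie-break: more votes, then higher total similarity, then id.
        key=lambda row: (-row[1], -row[2], row[0]),
    )
```

A dataset that is hit by three pseudoqueries across two subqueries outranks one that is hit twice, even if the latter has the single highest similarity.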

Finally, a listwise LLM reranker evaluates the candidate datasets jointly. It receives the original query together with each candidate's dataset profile and retrieval score, then produces a refined ranking. This reranking stage is where the system can use richer profile evidence than the vector similarity step alone can capture.
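One way to picture listwise reranking is as a single prompt that lists every candidate, followed by defensive parsing of the model's answer. The sketch below assumes a hypothetical prompt format and a JSON list as the reply format; the LLM is again an injected callable, and candidates the model omits are appended in their original order.

```python
import json
from typing import Callable


def listwise_rerank(query: str, candidates: list[dict],
                    llm: Callable[[str], str]) -> list[dict]:
    """Rerank candidates jointly.

    Each candidate is a dict with 'dataset_id', 'profile', 'retrieval_score'.
    """
    listing = "\n".join(
        f"[{i}] id={c['dataset_id']} score={c['retrieval_score']:.3f} "
        f"profile={json.dumps(c['profile'], sort_keys=True)}"
        for i, c in enumerate(candidates)
    )
    prompt = (
        "Rank ALL candidate datasets for the request, best first. "
        "Reply with a JSON list of candidate numbers only.\n"
        f"Request: {query}\n{listing}"
    )
    try:
        order = json.loads(llm(prompt))
    except (ValueError, TypeError):
        order = []  # unparseable reply: keep the retrieval order
    if not isinstance(order, list):
        order = []
    ranked: list[int] = []
    for i in order:
        is_index = isinstance(i, int) and not isinstance(i, bool)
        if is_index and 0 <= i < len(candidates) and i not in ranked:
            ranked.append(i)
    # Candidates the model skipped keep their original relative order.
    ranked += [i for i in range(len(candidates)) if i not in ranked]
    return [candidates[i] for i in ranked]
```

The fallback behavior matters in practice: a malformed reply degrades to the retrieval ranking instead of dropping candidates.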

The architecture is easy to summarize, but the business implication is more precise: PIPER separates ingestion-time enrichment from query-time interpretation. That is a useful pattern for enterprise data systems. Expensive or slow LLM work can be moved partly into an offline indexing workflow, while online retrieval focuses on matching and ranking.

The TARGET benchmark shows the boundary: metadata can still win

The paper evaluates PIPER in two stages. The first uses TARGET, a controlled table retrieval benchmark that includes FetaQA and OTT-QA. These are originally table question-answering-style settings adapted for retrieval. They are useful tests, but they are not perfect replicas of messy enterprise dataset search.

The results are mixed in the correct way.

On FetaQA, PIPER performs strongly. Its Recall@10 is 0.784, higher than dense table embedding at 0.741 and much higher than the QGpT variant reported at 0.586. This supports the paper’s claim that profile-derived pseudoqueries can expose table meaning effectively when metadata is not the main advantage.
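For readers who want the metric pinned down, Recall@10 can be computed as below. This is the standard definition, averaged over queries; whether the benchmark harness computes it in exactly this form is an assumption, and the function names are chosen for this example.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the relevant tables that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average Recall@k over (ranked_results, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)
```

When every query has exactly one gold table, this reduces to the fraction of queries whose gold table shows up in the top ten.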

On OTT-QA, the picture changes. PIPER reaches 0.729 Recall@10, while BM25 with table titles reaches 0.967, TF-IDF with table titles reaches 0.963, dense table embedding reaches 0.963, and QGpT reaches 0.915. Even PIPER without query optimization performs better than full PIPER on OTT-QA, at 0.780.

That is not a fatal flaw. It is the point.

OTT-QA appears to reward title and metadata cues more strongly. When titles and metadata provide direct lexical signals, metadata-aware methods can dominate. In that environment, PIPER’s content-first stance becomes less advantageous. It may even leave useful title-level clues unused.

| Benchmark result | Likely purpose in the paper | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Strong FetaQA performance | Main evidence under a controlled retrieval benchmark | PIPER can outperform several baselines when content-derived representation matters | Universal superiority across table retrieval tasks |
| Weak relative OTT-QA performance | Boundary evidence, not an embarrassment to hide in a cupboard | Metadata and titles remain powerful when they are informative | That content-based retrieval is unnecessary in metadata-poor collections |
| Query optimization ablation on TARGET | Ablation | Query expansion is not automatically beneficial when benchmark queries are already aligned | That query optimization should be removed from realistic search systems |

The business reading is straightforward. If your catalog already has accurate titles, strong descriptions, stable tags, and users search in language close to those descriptions, do not rip out metadata search because a paper has the letters “LLM” in it. That would be expensive theater.

But if your catalog contains thousands of weakly described tables, then metadata-first retrieval is solving the wrong proxy problem. It finds good descriptions. It does not necessarily find good datasets.

NTCIR-15 is where PIPER looks more like a data-catalog tool

The second evaluation is closer to dataset search. The authors use a tabular subset of NTCIR-15 Data Search, restricted to 111 tabular datasets and 10 queries for which most relevant results are tabular. This is much smaller than a real enterprise data lake, but it is more heterogeneous than the TARGET tasks. The tables are also much larger on average: the NTCIR subset averages 73.7K rows and 25.5 columns per table, compared with the small tables in FetaQA and OTT-QA.

Here PIPER performs much more convincingly.

For complex natural-language queries, PIPER achieves MAP 0.560, Precision@10 0.480, Recall@10 0.647, and nDCG@10 0.676. The closest strong baseline in the table, Dense-BGE, reports MAP 0.364, Precision@10 0.360, Recall@10 0.468, and nDCG@10 0.510.

For keyword queries, PIPER also leads: MAP 0.483, Precision@10 0.430, Recall@10 0.563, and nDCG@10 0.578. Dense-BGE again trails at MAP 0.342, Precision@10 0.360, Recall@10 0.480, and nDCG@10 0.487.
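The three ranking metrics quoted here alongside Recall@10 have standard definitions, sketched below for a single query. Note the assumptions: nDCG is computed with linear gains and a log2 discount, which is one common variant (another uses 2^gain - 1), and the paper's evaluation script may differ in such details. MAP is the mean of `average_precision` over all queries.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, dataset_id in enumerate(ranked, start=1):
        if dataset_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int = 10) -> float:
    """nDCG@k with linear gains; `gains` maps dataset_id to graded relevance."""
    dcg = sum(
        gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

nDCG is the one to watch for dataset search because it rewards placing the most relevant datasets near the top, not merely including them somewhere in the first ten.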

| NTCIR-15 setting | PIPER nDCG@10 | Best listed baseline nDCG@10 | Interpretation |
| --- | --- | --- | --- |
| Complex natural-language queries | 0.676 | 0.510 (Dense-BGE) | PIPER’s query optimization plus profile-based pseudoquery retrieval helps when user intent is freely expressed |
| Keyword queries | 0.578 | 0.487 (Dense-BGE) | PIPER still helps, but the advantage is smaller than in the complex NL setting |
| Without query optimization, complex NL | 0.233 | | Query optimization is a major component in this setting |
| Without query optimization, keyword | 0.327 | | Removing query optimization substantially weakens the full system |

This is the result that should interest data-platform teams. Natural-language dataset search is not the same as matching a question to a small table. The user may ask for an analytical object, not a known file. They may describe what they want to study rather than what the table is called. In such conditions, PIPER’s generated pseudoqueries become useful because they precompute possible user-facing interpretations of each table.

The authors also compute 95% bootstrap confidence intervals for NTCIR-15 nDCG@10. This is best read as a robustness or uncertainty check around the ranking results, not as a separate thesis. The intervals matter because the NTCIR subset is small: 111 datasets and 10 queries. The figure supports the direction of the reported advantage, but it should not be stretched into a broad claim about all data lakes, all domains, or all enterprise retrieval tasks. Ten queries can teach a lesson. They cannot certify a product category.
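A percentile bootstrap over per-query scores is one common way to obtain such intervals, sketched below. The paper's exact resampling procedure is not described in this article, so treat the method, the resample count, and the function name as assumptions.

```python
import random
import statistics


def bootstrap_ci(per_query_scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(per_query_scores)
    # Resample queries with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(per_query_scores, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With only ten queries, each resample is built from at most ten distinct values, which is why such intervals tend to be wide and why a gap between systems should be read together with the interval rather than from the point estimate alone.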

Query optimization is powerful only when the query needs help

One of the better parts of the paper is that its ablation does not pretend every module is magic.

On TARGET, removing query optimization has almost no effect on FetaQA: PIPER moves from 0.784 Recall@10 to 0.783. On OTT-QA, removing query optimization improves Recall@10 from 0.729 to 0.780. The authors interpret this as possible query drift: if a benchmark query is already tightly aligned with the target table, extra reformulation can introduce noise.

On NTCIR-15, the story reverses. Removing query optimization sharply reduces performance. For complex natural-language queries, nDCG@10 falls from 0.676 to 0.233. For keyword queries, it falls from 0.578 to 0.327.

This is not a contradiction. It is segmentation.

Query optimization is valuable when the user’s request is under-specified, multi-dimensional, or linguistically distant from the indexed content. It is less useful when the query already says exactly what the retrieval system needs. In practical terms, query optimization should probably be conditional. A production system could decide whether to expand, decompose, or preserve a query based on query length, ambiguity, domain specificity, and prior retrieval confidence.
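A conditional policy of that kind could start as a handful of cheap rules. The sketch below is purely illustrative: the thresholds, the list of connective words, and the three strategy labels are hypothetical choices, not values from the paper, and any real deployment would need to tune them against its own query logs.

```python
from typing import Optional

# Hypothetical connective words hinting that a request has several facets.
FACET_WORDS = {"and", "with", "by", "across", "versus", "vs"}


def choose_query_strategy(query: str, top_similarity: Optional[float] = None,
                          min_similarity: float = 0.75,
                          long_query_tokens: int = 12) -> str:
    """Return 'preserve', 'decompose', or 'expand' for a user query.

    `top_similarity` is the best score from a cheap first retrieval pass,
    if one has been run; all thresholds are illustrative assumptions.
    """
    tokens = [t.lower().strip(",;.?") for t in query.split()]
    facets = sum(1 for t in tokens if t in FACET_WORDS)
    # A confident first-pass match suggests the query already aligns.
    if top_similarity is not None and top_similarity >= min_similarity:
        return "preserve"
    # Long or multi-faceted requests are candidates for decomposition.
    if len(tokens) >= long_query_tokens or facets >= 2:
        return "decompose"
    # Very short queries are likely under-specified.
    if len(tokens) <= 3:
        return "expand"
    return "preserve"
```

The ordering of the rules encodes the lesson from the ablation: check first whether the query already works before spending an LLM call to rewrite it.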

Blind expansion is not intelligence. Sometimes it is just a very expensive way to wander.

The operational value is better data reuse, not prettier search boxes

For businesses, the immediate temptation is to imagine PIPER as a better search bar for a data catalog. That is correct but incomplete.

The deeper value is reuse. Data assets produce returns only when teams can discover, evaluate, and apply them. A table that cannot be found becomes dead inventory. A table that can be found only through one employee’s memory becomes operational dependency disguised as expertise.

PIPER suggests a practical workflow for improving dataset findability:

| Technical step | Operational consequence | ROI relevance |
| --- | --- | --- |
| Full-table profiling | Captures content signals even when catalog descriptions are poor | Reduces dependence on manual metadata cleanup before search improves |
| Pseudoquery generation | Creates multiple natural-language access paths into the same dataset | Helps non-expert users find datasets without knowing schema names |
| Dense retrieval over pseudoqueries | Matches user intent to content-derived dataset facets | Improves recall for heterogeneous queries |
| Query optimization | Converts vague or complex requests into retrieval-friendly subqueries | Supports realistic search behavior, especially for exploratory analytics |
| Listwise reranking | Uses profiles to compare candidate datasets jointly | Improves final ranking quality when candidate sets contain near-matches |

The enterprise use case is not hard to imagine. A procurement team wants supplier-level delivery delay data across regions. A finance team searches for datasets linking payment terms, invoice aging, and customer segment. A healthcare analytics group looks for patient datasets with longitudinal treatment outcomes. In each case, the useful table may not be titled in the exact language of the request. The table content may know more than the catalog entry says.

PIPER’s approach is also relevant to cross-organizational data spaces, where metadata standards are uneven. In a single company, governance teams can at least try to enforce catalog discipline. Across partners, agencies, subsidiaries, or public portals, that fantasy becomes charmingly optimistic. Content-derived search becomes more attractive when metadata quality cannot be centrally controlled.

A production system should be hybrid by design

The paper is careful about this, and the point deserves emphasis: PIPER should not be treated as a universal replacement for metadata-driven retrieval.

A production architecture should likely combine:

  1. Metadata search, because good titles, descriptions, tags, owners, lineage, and business glossary terms are still valuable.
  2. Schema-aware retrieval, because column names and types often carry direct meaning.
  3. Content profiling, because values, distributions, missingness, ranges, and coverage reveal what metadata omits.
  4. Generated pseudoqueries, because users search in natural language and often describe analytical needs rather than table structures.
  5. Reranking and filtering, because retrieval scores alone do not answer governance, freshness, permission, or fitness-for-use questions.

The decision is not whether to choose metadata or content. The decision is when each signal deserves weight.

For example, a curated finance mart with strong documentation may benefit more from metadata and lineage-aware search. A raw operational data lake with inconsistent file naming may benefit more from profiling and pseudoqueries. An open data portal may need both, because some publishers document datasets properly and others apparently believe “miscellaneous” is a data strategy.

The most promising direction is adaptive retrieval. If metadata is rich and matches the query, use it. If metadata is thin or query results are weak, lean more heavily on content-derived profiles. If the query is broad, decompose it. If the query is precise, avoid expansion. This is less glamorous than a single benchmark-winning model, but much closer to how retrieval systems survive contact with reality.
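One simple way to express "lean on metadata when it is rich" is a weighted blend whose weight depends on a metadata richness score. The sketch below is not from the paper: the richness checks, the word-count thresholds, and the 0.2 to 0.8 weight range are all hypothetical, and both input scores are assumed to be normalized to the same 0..1 scale.

```python
def metadata_richness(meta: dict) -> float:
    """Crude 0..1 score: share of key metadata fields that look non-trivial."""
    title = meta.get("title") or ""
    description = meta.get("description") or ""
    tags = meta.get("tags") or []
    checks = [
        len(title.split()) >= 3,          # more than a bare file name
        len(description.split()) >= 15,   # at least a sentence or two
        len(tags) >= 2,
    ]
    return sum(checks) / len(checks)


def hybrid_score(metadata_score: float, content_score: float,
                 richness: float) -> float:
    """Blend metadata and content relevance, weighting metadata by richness.

    Assumes both scores are already normalized to the 0..1 range.
    """
    # Metadata weight moves between 0.2 (thin metadata) and 0.8 (rich).
    weight = 0.2 + 0.6 * richness
    return weight * metadata_score + (1.0 - weight) * content_score
```

Under this scheme a table titled “export_final_v3.csv” with no description is ranked almost entirely on its content-derived score, while a well-documented finance mart is ranked mostly on its catalog text.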

The limitations are small-sample evidence, content compression, and cost

PIPER’s limitations are not generic “more work is needed” fog. They affect how the result should be used.

First, the NTCIR-15 evaluation is small: 111 tabular datasets and 10 queries. The results are encouraging, especially because the benchmark is more dataset-search-like than TARGET, but they are not enough to estimate performance across a large enterprise lake with thousands or millions of tables, many domains, access controls, multilingual metadata, and messy file formats.

Second, profiling compresses the table. That is the point, but it is also the risk. Statistical profiles capture datatypes, coverage, missingness, and summary statistics. They may miss row-level patterns, rare but important categories, temporal structures, relational dependencies, or domain semantics that require deeper inspection. A profile can tell the LLM useful things. It cannot make the table fully present.

Third, pseudoqueries are generated representations. They can be useful, but they need quality controls. If pseudoqueries overstate what a dataset supports, retrieval may become confidently misleading. In regulated or high-stakes domains, generated access paths should be auditable, reproducible, and tied back to concrete profile evidence.
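A first-pass audit can be as blunt as checking which content words in a pseudoquery trace back to column names or frequent values in the profile. The sketch below is a hypothetical lexical check, not a method from the paper: the stopword list, the `top_values` profile field, and the 0.5 support threshold are assumptions, and exact word matching will flag legitimate paraphrases, so flagged items are for human review rather than automatic deletion.

```python
import re

# Hypothetical list of words that carry no dataset-specific evidence.
STOPWORDS = {
    "a", "an", "the", "with", "and", "of", "for", "like", "such", "as",
    "dataset", "data", "table", "in", "on", "by", "to",
}


def profile_vocabulary(profile: dict) -> set[str]:
    """Collect tokens from column names and any stored frequent values."""
    vocab: set[str] = set()
    for column, stats in profile.items():
        vocab.update(re.findall(r"[a-z0-9]+", column.lower()))
        for value in stats.get("top_values", []):
            vocab.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return vocab


def audit_pseudoquery(pseudoquery: str, profile: dict,
                      min_support: float = 0.5) -> dict:
    """Report which pseudoquery terms are backed by profile evidence."""
    terms = [
        t for t in re.findall(r"[a-z0-9]+", pseudoquery.lower())
        if t not in STOPWORDS
    ]
    vocab = profile_vocabulary(profile)
    supported = [t for t in terms if t in vocab]
    support = len(supported) / len(terms) if terms else 0.0
    return {
        "pseudoquery": pseudoquery,
        "supported_terms": supported,
        "unsupported_terms": [t for t in terms if t not in vocab],
        "support": support,
        "needs_review": support < min_support,
    }
```

Storing the audit record next to each pseudoquery also gives reviewers the evidence trail that regulated settings tend to ask for.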

Fourth, the cost structure needs evaluation. PIPER moves much of the LLM work offline, which is sensible, but large-scale profiling and pseudoquery generation across a constantly changing data lake is not free. Updates, refresh schedules, embedding storage, reranking latency, and permission-aware retrieval all matter in production.
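One inexpensive lever on refresh cost is to regenerate pseudoqueries only when a table's profile actually changes. The sketch below fingerprints each profile with a hash and compares it with the stored value; this change-detection scheme is an assumption added for illustration and is not part of the paper.

```python
import hashlib
import json


def profile_fingerprint(profile: dict) -> str:
    """Stable hash of a profile; key order does not affect the result."""
    canonical = json.dumps(profile, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def datasets_needing_refresh(current_profiles: dict[str, dict],
                             stored_fingerprints: dict[str, str]) -> list[str]:
    """Return ids of datasets that are new or whose profile has changed."""
    return sorted(
        dataset_id
        for dataset_id, profile in current_profiles.items()
        if stored_fingerprints.get(dataset_id) != profile_fingerprint(profile)
    )
```

Profiling still has to run on every refresh cycle, but the LLM generation and embedding steps, which are usually the more expensive part, are then limited to tables whose profiles moved.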

Finally, PIPER currently focuses on tabular datasets treated as single tables. Real enterprise datasets may be multi-table relational structures, nested files, document-table hybrids, event streams, or semantic-layer models. Extending the approach beyond single-table retrieval is plausible, but not shown by this paper.

What Cognaptus would take from PIPER

The practical lesson is not “install PIPER and your catalog is fixed.” The better lesson is that dataset search should expose what the data can answer, not merely what someone wrote about it.

That implies a different implementation philosophy for business data platforms:

| Business question | Metadata-first answer | PIPER-style extension |
| --- | --- | --- |
| How do users find datasets? | Search titles, descriptions, and tags | Search generated user-intent pseudoqueries derived from table content |
| How is weak documentation handled? | Ask owners to improve metadata | Use content profiles as a second evidence layer while metadata improves |
| How are natural-language requests supported? | Embed the query and match catalog text | Expand and decompose the query, then match against pseudoqueries |
| How is relevance judged? | Keyword score, embedding similarity, or filters | Candidate aggregation plus profile-aware reranking |
| What is the safest production strategy? | Standardize metadata and hope | Combine metadata, schema, profiles, pseudoqueries, and governance filters |

For Cognaptus-style automation work, this points to a concrete opportunity: build data-catalog search pipelines that treat metadata quality as a variable, not an assumption. Many AI systems fail not because the model cannot reason, but because the relevant data asset is invisible to the workflow. Better retrieval is therefore not a nice interface feature. It is infrastructure for reuse.

The strongest use cases are not glamorous: internal analytics teams looking for reusable datasets, data product marketplaces, open-data portals, cross-subsidiary data spaces, compliance-aware data discovery, and AI agents that need to locate valid structured data before answering business questions. In all of these settings, findability determines whether automation has raw material to work with.

PIPER contributes a useful pattern: profile first, generate multiple user-facing search intents, retrieve semantically, then rerank with the original need in view. It is not the whole data-catalog future. It is a serious component of it.

Conclusion: metadata is still useful, but it should not be the only witness

PIPER’s best contribution is not that it uses an LLM. Plenty of systems now do, with varying degrees of ceremony. Its useful contribution is the way it reframes tabular dataset search around content-derived access paths.

Metadata-first search asks: does the catalog description match the query?

PIPER asks: does the table’s content suggest that this dataset could satisfy the user’s information need?

Those are different questions. The second one is harder, but it is often closer to what users actually want.

The evidence supports a conditional conclusion. When metadata and titles are strong, metadata-aware methods can still win, as the OTT-QA results show. When metadata is weak, queries are natural, and table content carries the real semantics, PIPER’s profile-and-pseudoquery design becomes much more attractive, especially in the NTCIR-15 dataset-search setting.

The likely future is not metadata versus content. It is retrieval systems that know when metadata is enough, when content must speak for itself, and when both should be cross-examined before the user wastes another afternoon looking for a table that was there all along.

Cognaptus: Automate the Present, Incubate the Future.


  1. Riccardo Terrenzi, Matteo Falconi, Serkan Ayvaz, and Pierluigi Plebani, “PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries,” arXiv:2605.18199, 2026. https://arxiv.org/abs/2605.18199