A manager asks the analytics copilot, “Which regions are underperforming this quarter?”

This sounds like a normal business question. It is also, technically, a small swamp.

Which regions? Sales regions, operating regions, logistics regions, or customer billing regions? Underperforming against what: forecast, last quarter, budget, peers, margin, revenue, retention, or some executive’s private sense of disappointment? And “this quarter” may mean calendar quarter, fiscal quarter, quarter-to-date, or the latest complete quarter if the finance team has not closed the books yet.

The conventional AI answer is to call this ambiguity a problem. Better prompt engineering, more schema linking, more clarification, more guardrails. There, fixed. Humans will now write questions like tiny SQL compilers in business casual.

A new paper by Daniel Gomm, Cornelius Wolff, and Madelon Hulsebos argues for a more useful view: ambiguity is often not a bug in the user’s question, but part of the collaboration between user and system.[1] Natural-language analytics works because users do not specify everything. They rely on a capable partner to infer conventions, choose reasonable methods, and expose assumptions when the choice matters.

That distinction sounds philosophical. It is not. It changes how enterprise data assistants should be designed, tested, and trusted.

The paper’s central move is to separate three things that current conversations about data agents often blend together: what the user explicitly grounds, what the system may reasonably infer, and what remains too vague to answer responsibly. Once those are separated, a lot of today’s evaluation practice starts looking less like rigorous measurement and more like judging a restaurant by whether it guessed the customer’s childhood nickname correctly. Charming when it works. Not exactly a metric.

The real unit of work is not the query, but the interpretation

The paper begins with a simple but important mechanism: an analytical question becomes executable only when it has an actionable query interpretation. That means the system has pinned down both the analytical procedure and the data scope.

A question such as “What is the average summer temperature in Copenhagen?” is not fully specified. The user has named a place and an approximate concept, but the system still needs to decide what counts as summer, what time period to use, and what “average” means in this context. Mean? Median? Daily highs? Monthly averages? Weather station readings? Climate normals?

A brittle system treats this as a missing-parameter problem. A better system sees a cooperative interaction. The user is not necessarily failing to specify. The user may be delegating.

The paper identifies two kinds of grounding labor:

| Grounding mechanism | What happens | Example | Business interpretation |
| --- | --- | --- | --- |
| User-provided grounding | The user supplies the relevant entity, time frame, method, or constraint directly or contextually. | “Mean temperature in June-August in Copenhagen over the last 20 years.” | The user controls the analytical definition; the system should execute faithfully. |
| System-inferred grounding | The user leaves something implicit and expects the system to infer it. | “Average summer temperature in Copenhagen.” | The system must choose reasonable defaults, disclose them, and allow correction. |
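
One way to make the two grounding mechanisms concrete is to record, for every part of an interpretation, who supplied it. The sketch below is illustrative only: the class names, the three slot names, and the default values chosen for the Copenhagen example are assumptions made for this article, not a schema taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Source(Enum):
    USER = "user"      # supplied by the user, directly or through context
    SYSTEM = "system"  # inferred by the system as a default


@dataclass(frozen=True)
class Grounding:
    value: str
    source: Source


def is_actionable(interp: Dict[str, Grounding], required: List[str]) -> bool:
    """True when every required slot has been grounded by someone."""
    return all(slot in interp for slot in required)


def inferred_slots(interp: Dict[str, Grounding]) -> List[str]:
    """Slots the system chose, which are the ones worth disclosing."""
    return [slot for slot, g in interp.items() if g.source is Source.SYSTEM]


# Hypothetical slot names covering data scope and analytical procedure.
REQUIRED = ["place", "period", "method"]

# "Average summer temperature in Copenhagen": only the place is user-provided.
copenhagen = {
    "place": Grounding("Copenhagen", Source.USER),
    "period": Grounding("June to August", Source.SYSTEM),       # assumed default
    "method": Grounding("mean of daily means", Source.SYSTEM),  # assumed default
}

print(is_actionable(copenhagen, REQUIRED))  # True
print(inferred_slots(copenhagen))           # ['period', 'method']
```

Keeping the source next to each value means the same object can drive both execution and disclosure: anything marked as system-inferred is a candidate for the assumptions shown to the user.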

The second row is where most useful analytics products live. Executives, analysts, sales managers, finance teams, and operations leads do not always know the structure of the data. Often they do not know the exact analysis they need until they see the first answer. This is not laziness. It is the point of analysis.

The paper grounds this in cooperative communication theory: speakers usually give enough information for the listener to understand, but not excessive information. In normal conversation, leaving things out is efficient. If someone asks for the “highest mountain,” we usually infer “in the world” unless the context says otherwise. If someone asks for “the relationship between marketing spend and conversions,” a data system may reasonably choose a correlation or regression approach, but it should not pretend the choice was handed down from Mount SQL.

The practical implication is sharp: the system should not merely parse the query. It should manage the division of labor.

Three query types, three product behaviours

The paper distinguishes among unambiguous queries, ambiguous-but-cooperative queries, and uncooperative queries. These are not just academic labels. They imply different system behaviours.

| Query type | What makes it different | Proper system behaviour | Bad system behaviour |
| --- | --- | --- | --- |
| Unambiguous | The query maps to one actionable interpretation through explicit or conventional grounding. | Execute and report the result. | Ask unnecessary clarification questions to look careful. Very sophisticated, very annoying. |
| Ambiguous but cooperative | The query leaves choices open, but reasonable interpretations exist. | Infer, disclose assumptions, and let the user adjust. | Freeze because the prompt was not perfect, or silently choose without disclosure. |
| Uncooperative | The query lacks enough information to identify any valid actionable interpretation. | Ask a clarifying question or explain what is missing. | Guess confidently and generate a number with the emotional energy of an audit liability. |

This is the core of the paper. Ambiguity is not one thing. Some ambiguity is resolvable through convention. Some is productive delegation. Some is irresolvable underspecification.

A good data assistant needs to know which is which.
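
As a rough sketch of that triage, suppose each grounding choice in a query has already been labelled as specified, inferable, or missing. Producing those labels is the hard part and is not shown here; the status names, the type labels, and the hand-labelled examples are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    SPECIFIED = "specified"  # grounded explicitly or by strong convention
    INFERABLE = "inferable"  # left open, but reasonable defaults exist
    MISSING = "missing"      # no defensible way to fill the gap


def classify(choices: Dict[str, Status]) -> str:
    """Map per-choice grounding status to one of the three query types."""
    statuses = set(choices.values())
    if Status.MISSING in statuses:
        return "uncooperative"
    if Status.INFERABLE in statuses:
        return "ambiguous_but_cooperative"
    return "unambiguous"


# Behaviour per query type, following the table above.
BEHAVIOUR = {
    "unambiguous": "execute",
    "ambiguous_but_cooperative": "infer_and_disclose",
    "uncooperative": "clarify",
}

# Hand-labelled examples (the labels are assumptions, not model output).
q2_revenue = {"metric": Status.SPECIFIED, "period": Status.SPECIFIED}
summer_temp = {"place": Status.SPECIFIED, "period": Status.INFERABLE}
risky_accounts = {"entity": Status.SPECIFIED, "metric": Status.MISSING}

for choices in (q2_revenue, summer_temp, risky_accounts):
    print(BEHAVIOUR[classify(choices)])
# execute
# infer_and_disclose
# clarify
```

A single missing choice outranks any number of inferable ones here, which mirrors the table: one gap with no defensible default is enough to make guessing the wrong behaviour.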

For business intelligence copilots, that distinction matters more than another percentage point on a benchmark leaderboard. In practice, the failure modes are different. A system that cannot execute an unambiguous query has an engineering problem. A system that cannot handle cooperative ambiguity has a product problem. A system that answers uncooperative queries without clarification has a governance problem.

Those should not be measured as if they were the same defect wearing different hats.

Why benchmark accuracy can become a fog machine

The paper then turns from mechanism to evaluation. This is where the argument becomes uncomfortable for anyone selling “AI analyst” performance as a single score.

If a benchmark gives a natural-language query and expects one gold answer, it assumes that the query has one intended interpretation. That works for unambiguous queries. It becomes messy when the query is ambiguous but cooperative, because several interpretations may be reasonable. It becomes invalid for uncooperative queries, because there may be no defensible answer at all.

The authors audit 15 datasets spanning tabular question answering, text-to-SQL, and data-analysis benchmarks: WikiTableQuestions, TabMWP, CRT-QA, HiTab, Open-WikiTable, OTT-QA, FeTaQA, TableBench, QTSumm, MMQA, Spider, BIRD, DA-Code, KramaBench, and DA-Eval. The datasets vary widely: some focus on factual retrieval, others on aggregation, trend analysis, correlation, regression, classification, SQL generation, or broader data-analysis tasks.

The audit asks two main questions.

First, are queries data-independent? In an open-domain setting, a user does not know the hidden table schema. So a realistic query should not sound as if it was copied from a column header or crafted around a particular dataset. The paper calls queries that rely on hidden table knowledge data-privileged. Examples include references to structural elements such as first_name, obscure internal values such as a private order ID, or explicit container references such as “the provided dataset.”
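
To get a feel for the three reference types, here is a deliberately crude pattern-matching pass. It is a stand-in, not the paper's method: the authors use LLM-based classifiers, and regular expressions like these will miss privileged references that sound natural and flag innocent ones. The patterns and the sample order ID are invented for illustration.

```python
import re
from typing import List

# Hypothetical patterns, one per reference type described above.
STRUCTURAL = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")  # snake_case names such as first_name
VALUE = re.compile(r"\b[A-Z]{2,}-\d{3,}\b")            # opaque IDs such as ORD-00417
CONTAINER = re.compile(
    r"\b(?:the|this) (?:provided |given )?(?:dataset|table|csv|file)\b", re.I
)


def privileged_signals(query: str) -> List[str]:
    """Return which kinds of hidden-table knowledge the query appears to rely on."""
    signals = []
    if STRUCTURAL.search(query):
        signals.append("structural")
    if VALUE.search(query):
        signals.append("value")
    if CONTAINER.search(query):
        signals.append("container")
    return signals


print(privileged_signals("What is the mean of order_total in the provided dataset?"))
# ['structural', 'container']
print(privileged_signals("What happened to order ORD-00417?"))
# ['value']
print(privileged_signals("Which regions are underperforming this quarter?"))
# []
```

A filter like this is at best a cheap first pass before a judged classification; it cannot tell whether a value is genuinely obscure or simply well known.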

Second, are queries unambiguous? A query is unambiguous only if both the data scope and analytical procedure are sufficiently specified. The paper decomposes this into dimensions such as entity specification, temporal bounds, domain bounds, intent specification, and methodological specification.

The evidence is not a model-training result. It is a benchmark audit. The authors sample queries from each dataset and classify them using LLM-based classifiers validated against expert annotations. The validation matters because the task is interpretive: deciding whether a phrase is natural, schema-dependent, conventionally resolvable, or underspecified is not a mechanical string-matching exercise.

The main finding is that many existing benchmarks mix query types without controlling for what is being measured. The paper’s Figure 2 shows that several datasets, especially complex data-analysis benchmarks, contain substantial shares of data-privileged queries. It also shows that unambiguous queries are often a small fraction of the benchmark queries, particularly in complex tabular-analysis settings such as DA-Eval and DA-Code.

The interpretation is not “all these benchmarks are useless.” That would be satisfyingly dramatic and mostly wrong.

The better interpretation is that a single benchmark score may combine at least four capabilities: retrieving or identifying the right data, interpreting ambiguous intent, choosing reasonable grounding assumptions, and executing the analysis. If the dataset expects one gold answer while the query permits multiple valid interpretations, accuracy becomes a muddied signal. It may reward systems for guessing the benchmark author’s intention rather than serving the user’s actual analytical need.

A leaderboard can still be useful. But only after we know what race is being run.
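
A small sketch shows how a single score can hide the race being run. The results below are made up for illustration, and the query-type labels are assumed to come from an audit of the kind the paper describes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_accuracy(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy against the gold answer, reported separately per query type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for query_type, is_correct in results:
        totals[query_type] += 1
        correct[query_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Invented results: (query type, matched the single gold answer?)
results = (
    [("unambiguous", True)] * 4
    + [("cooperative", True)] + [("cooperative", False)] * 3
    + [("uncooperative", False)] * 2
)

overall = sum(ok for _, ok in results) / len(results)
print(overall)                       # 0.5
print(stratified_accuracy(results))
# {'unambiguous': 1.0, 'cooperative': 0.25, 'uncooperative': 0.0}
```

In this toy case the headline 50% describes a system that executes every precise question correctly. The misses sit entirely in strata where matching a single gold answer is a questionable target to begin with.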

The appendix is not decorative; it tells us what the audit can and cannot support

The paper’s appendices are worth treating carefully because they define the measurement apparatus.

Appendix A decomposes query specification into procedural and data dimensions. Procedural specification concerns whether the user gives a clear analytical goal and method. Data specification concerns whether the query defines the entities, temporal boundaries, domain boundaries, and analytical structure needed for execution.

Appendix B lists the 15 datasets and their characteristics. This is mostly implementation context: it tells us what kinds of benchmark tasks are being audited and how broad the comparison is.

Appendix C is the important validation section. The authors use LLM judges to classify data-independence and ambiguity. For data-independence, they evaluate structural references, value references, and container references, using self-consistency over multiple classifications. They validate these labels against two expert human annotators. For query ambiguity, they classify specification dimensions and validate against expert-corrected labels.
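
The self-consistency step can be sketched as a majority vote over repeated classifier calls, followed by a simple agreement check against expert labels. The number of runs, the label names, and the use of raw agreement as the comparison metric are assumptions here; the paper's exact settings are not reproduced. The classifier is a scripted stand-in so the example runs without any model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def self_consistent_label(classify_once: Callable[[str], str], query: str, runs: int = 5) -> str:
    """Call a possibly stochastic classifier several times and keep the majority label."""
    votes = Counter(classify_once(query) for _ in range(runs))
    return votes.most_common(1)[0][0]


def agreement_rate(predicted: List[str], expert: List[str]) -> float:
    """Share of items where the classifier label equals the expert label."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)


# Scripted stand-in for an LLM judge: returns labels in a fixed rotation.
fake_votes = cycle(["privileged", "independent", "privileged"])
label = self_consistent_label(lambda q: next(fake_votes), "What is the average of col_3?")
print(label)  # privileged

print(agreement_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"]))  # 0.75
```

An odd number of runs avoids ties when there are two labels; with more labels, ties are possible and would need an explicit rule.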

That means the appendix is not a second thesis. It is the support structure for the audit.

| Component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 conceptual framework | Mechanism illustration | Shows how cooperative, unambiguous, and uncooperative queries relate to actionable interpretations. | Does not prove user behaviour empirically. |
| Figure 2 benchmark audit | Main evidence | Shows that benchmark datasets mix data-privileged, ambiguous, and unambiguous queries. | Does not measure production system ROI or user satisfaction. |
| Appendix A specification dimensions | Operationalisation | Defines how ambiguity is decomposed into data and procedural grounding requirements. | Does not claim these dimensions exhaust every possible business context. |
| Appendix C classifier validation | Robustness and measurement support | Shows the LLM-based classification aligns strongly with expert labels for this audit task. | Does not make the labels absolute truth; judgement remains context-sensitive. |

This matters because the business interpretation should be calibrated. The paper does not deliver a new enterprise analytics system. It does not show that a specific vendor’s BI copilot will improve retention, reduce analyst workload, or save three hours every Friday. Very inconsiderate of it, but academically respectable.

What it does provide is a framework for diagnosing whether evaluation protocols and product behaviours match the kind of question being asked.

The product lesson is controlled agency, not endless clarification

The obvious product response to ambiguity is to ask more questions.

That is also how one builds an AI assistant that users abandon by Wednesday.

The paper’s framework suggests a more subtle rule: ask only when the ambiguity is uncooperative or when the grounding decision materially affects the result and cannot be safely inferred. Otherwise, infer and disclose.
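
Read as a decision rule, that could look like the sketch below. The three flags are hypothetical inputs that a real system would have to estimate from context, domain risk, and business conventions; the paper does not prescribe them or any threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OpenChoice:
    name: str
    has_reasonable_default: bool  # can the system defend a default here?
    changes_result: bool          # would a different choice materially change the answer?
    safe_to_infer: bool           # is inference acceptable in this domain and context?


def choices_to_ask_about(choices: List[OpenChoice]) -> List[str]:
    """Names of open choices that warrant a clarifying question; the rest are inferred and disclosed."""
    return [
        c.name
        for c in choices
        if not c.has_reasonable_default or (c.changes_result and not c.safe_to_infer)
    ]


# Hand-set flags for two illustrative queries.
churn_trends = [
    OpenChoice("churn definition", True, True, True),
    OpenChoice("time grain", True, False, True),
]
risky_accounts = [
    OpenChoice("risk definition", False, True, False),
]

print(choices_to_ask_about(churn_trends))    # []
print(choices_to_ask_about(risky_accounts))  # ['risk definition']
```

An empty list means the system proceeds and discloses what it assumed. A non-empty list tells it which question to ask, rather than asking about everything.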

For example, if a user asks, “Show customer churn trends this year,” a system connected to a company data warehouse can likely infer the customer entity, the active fiscal year, the churn definition used by the business, and a monthly trend chart. It should still disclose: “Using the standard churn metric from the subscriptions table, grouped monthly, fiscal year-to-date.” That is useful. It creates a correction surface.

If the user asks, “Which accounts are risky?” the system probably needs clarification unless “risk” is a defined business metric in the environment. Risk could mean credit risk, churn risk, fraud risk, compliance risk, implementation risk, or the probability that sales promised something impossible again.

The key is not whether ambiguity exists. It always exists. The key is whether the system has a defensible grounding path.

This creates a practical product design pattern:

  1. Interpret first. Convert the user’s question into an explicit analytical interpretation: data scope, method, assumptions, and output format.
  2. Classify the ambiguity. Decide whether the query is unambiguous, cooperatively underspecified, or uncooperative.
  3. Choose the interaction mode. Execute, infer-and-disclose, or clarify.
  4. Expose grounding choices. Show the assumptions that materially shape the result.
  5. Evaluate by query type. Do not mix execution accuracy, interpretation alignment, and robustness to unanswerable queries into one flattering number.

This is less glamorous than “autonomous data scientist.” It is also closer to how useful enterprise software survives contact with finance, operations, and legal teams.

Business value starts with separating failure modes

For enterprise buyers, the paper’s most useful implication is not that ambiguity is beautiful. Please do not put that on a slide.

The useful implication is that analytics systems should be evaluated by capability layer.

| What the paper directly shows | Cognaptus business inference | Remaining uncertainty |
| --- | --- | --- |
| Natural-language tabular queries require grounding of both procedure and data scope. | BI copilots should make assumptions explicit instead of treating the answer as self-evident. | The best disclosure format will vary by user role and workflow. |
| Cooperative ambiguity can be a reasonable division of labor between user and system. | Products should support infer-and-disclose behaviour, not only clarify-or-fail behaviour. | The threshold for “reasonable inference” depends on domain risk and organisational conventions. |
| Benchmarks often mix unambiguous, cooperative, and uncooperative queries. | Procurement teams should be cautious about single-number benchmark claims for data agents. | Existing benchmarks may still be useful for narrower tasks if stratified properly. |
| Some benchmark queries are data-privileged in open-domain settings. | Evaluation should distinguish closed-schema assistants from open-domain table-analysis agents. | Enterprise deployments often sit between those extremes, with partial schema knowledge and business context. |
| LLM classifiers can support large-scale query auditing when validated against expert judgement. | Companies can audit their own analytics logs to identify ambiguity patterns and product failure modes. | Privacy, sampling bias, and domain-specific definitions must be handled carefully. |

The procurement version is simple: do not ask only, “What is the benchmark accuracy?” Ask, “On what kind of queries?”

A vendor may perform well on data-privileged questions that resemble table-specific test cases. That can be relevant for closed-schema SQL copilots. It says less about an open-domain analyst that must find the right table, infer the right metric, and decide whether to ask a human before proceeding.

Similarly, a model may look weak on a benchmark because it chose a reasonable interpretation that differs from the gold answer. That is not necessarily failure. It may be evidence that the benchmark is measuring agreement with hidden intent rather than analytical competence.

The annoying answer is the correct one: evaluation must become more stratified.

Why this matters more as analytics gets more complex

For simple lookup questions, users can often be explicit. “What was Q2 revenue for the UK region?” is reasonably direct if the organisation has stable definitions for revenue, quarter, and region.

But as analytics becomes more diagnostic or predictive, full specification becomes harder. “Why did margin decline?” is not a single query. It implies decomposition, segmentation, causal hypotheses, time comparisons, outlier detection, and possibly multiple datasets. “Which customers are likely to churn?” implies a modelling task, feature selection, historical labels, a prediction window, and an action threshold.

The more valuable the analysis, the more likely it is that the user cannot fully specify the procedure upfront.

That is precisely where ambiguity becomes productive. It lets a system take initiative. But initiative without disclosure becomes unaccountable automation. The system must be allowed to choose, but not to smuggle choices into the result as if they were facts.

This is where the paper’s mechanism-first framing pays off. The issue is not whether AI can answer vague questions. It is whether it can distinguish among conventional defaults, discretionary analytical choices, and missing information that blocks a responsible answer.

A useful data agent should sound less like an oracle and more like a competent analyst: “I interpreted underperforming as revenue below forecast, used fiscal quarter-to-date, and ranked regions by percentage variance. I can switch to margin or year-over-year growth.”

That sentence is not glamorous. It is governance in plain clothes.
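
Generating such a disclosure does not need to be elaborate. The following is a minimal sketch, assuming the system already tracks which assumptions it made and which alternatives it can offer; the function name and message wording are invented for illustration.

```python
from typing import Dict, List


def disclose(assumptions: Dict[str, str], alternatives: List[str]) -> str:
    """Render system-chosen assumptions as one sentence plus a correction offer."""
    parts = [f"{name}: {value}" for name, value in assumptions.items()]
    message = "Assumptions used: " + "; ".join(parts) + "."
    if alternatives:
        message += " I can switch to " + " or ".join(alternatives) + "."
    return message


print(disclose(
    {
        "underperforming": "revenue below forecast",
        "period": "fiscal quarter-to-date",
        "ranking": "percentage variance",
    },
    ["margin", "year-over-year growth"],
))
# Assumptions used: underperforming: revenue below forecast; period: fiscal quarter-to-date; ranking: percentage variance. I can switch to margin or year-over-year growth.
```

The point of the structure is that the disclosure is derived from the same record that drove the analysis, so what the user reads cannot drift from what the system actually did.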

Boundaries: what the paper does not settle

The paper is conceptual and evaluative, not a deployment study. Its benchmark audit shows measurement misalignment, but it does not prove how much better a cooperative data agent will perform in a live enterprise environment. It also does not prescribe one universal threshold for when a system should infer versus ask.

That threshold will vary. In marketing analytics, choosing a reasonable default chart may be harmless. In credit risk, healthcare operations, tax, compliance, or workforce decisions, the same inferential freedom may be unacceptable unless tightly governed.

The framework also depends on context. A query that is underspecified in an open-domain setting may be perfectly cooperative inside a company’s internal analytics system. “Show churn by segment” is vague to the internet. Inside a SaaS company with a standard churn metric and customer segmentation model, it may be clear enough.

Finally, the classification work relies on expert-validated LLM judges. The validation is important and reasonably strong, but these are still judgements about language, convention, and context. The right conclusion is not that the labels are eternal truth. The right conclusion is that benchmark design needs to acknowledge these distinctions instead of pretending they do not exist.

The future data assistant is an interpreter, not just an executor

The common fantasy of natural-language analytics is that business users will ask questions and AI will return answers. Clean, fast, democratic. Lovely brochure copy.

The paper makes that fantasy more realistic by complicating it.

A data assistant is not just translating words into code. It is negotiating grounding: what data, what method, what scope, what assumption, what default, what clarification. Ambiguity is not merely noise in that process. Sometimes it is the interface through which the user delegates work.

That changes the standard for good systems. They must execute precisely when the question is precise. They must infer responsibly when the question is cooperatively underspecified. They must ask when the question is not answerable. And they must make those modes visible enough that users can correct the machine before its assumptions become organisational folklore.

The business lesson is equally blunt. If you evaluate an AI analyst with a single accuracy score, you may not know whether it understands data, executes code, guesses benchmark intent, exploits schema leakage, or simply benefits from questions that no real user would ask.

Ambiguity is not always the enemy. Unexamined ambiguity is.

That is the difference between a data assistant that helps people think and one that merely produces confident spreadsheets at industrial scale. We already have enough of the latter. They are called quarterly reports.

Cognaptus: Automate the Present, Incubate the Future.


  1. Daniel Gomm, Cornelius Wolff, and Madelon Hulsebos, “Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis,” arXiv:2511.04584.