CQ or Consequences: What This LLM Benchmark Reveals About AI Requirements Work

Requirements work has a reputation problem.

It is rarely the part of an AI project that receives the keynote slide, the demo video, or the executive applause. Nobody opens a budget meeting by saying, “What we really need is a better way to ask the system what it must know.” They should, but apparently civilization still has limits.

In ontology engineering, that question has a formal name: the competency question, or CQ. A CQ is a natural-language question that an ontology should be able to answer. For a knowledge graph, compliance engine, product recommender, clinical decision-support layer, or enterprise AI memory system, CQs translate vague intent into testable scope. “Which national parks match this traveler’s weather and crowd preferences?” is not just a question. It is a boundary around the concepts, relationships, and data the system must represent.

A recent paper, Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models, studies how different large language models generate these questions across multiple domains.¹ The paper introduces CompCQ, a framework for comparing LLM-generated CQs by readability, complexity, relevance, diversity, and semantic overlap.

The business temptation is obvious: if LLMs can generate requirement questions at scale, perhaps firms can automate a painful upstream bottleneck in knowledge engineering. The paper’s answer is more useful and less convenient: yes, LLMs can help, but the “best model” is not the one with the prettiest average score. Different models expose different parts of the requirement space. Some are clear but narrow. Some are diverse but messy. Some are cheap but suspiciously quiet. A requirements pipeline built around one model may look efficient while quietly omitting the questions that matter.

That is not automation. That is a blind spot with a progress bar.

The paper is not asking whether LLMs can generate CQs; it asks what kind of CQs they generate

A weaker article could stop at the familiar claim: LLMs can accelerate requirements work. True, but not interesting enough. The paper moves the question from feasibility to characterisation.

The authors compare five LLMs:

Access type	Model	Role in the comparison
Closed	Gemini 2.5 Pro	Strong proprietary baseline; concise and readable profile
Closed	GPT-4.1	Stable proprietary baseline; generally balanced output
Open / open-weight access	KimiK2-1T	Large open model; often more complex and exploratory
Open	Llama 3.1 8B	Smaller local-style option; lower CQ volume in several cases
Open	Llama 3.2 3B	Smaller model; occasionally diverse but unstable

They test these models across five ontology requirement settings:

Requirement setting	Domain type	Why it matters for interpretation
Music Meta	Broad metadata / integration user story	Leaves room for models to explore different facets
British Music Experience (BME)	Cultural heritage user story	Narrative requirements with stakeholder goals
When To Go Where (WTGW)	Tourism recommendation use case	Narrower, more constrained recommendation task
Political Journalism Ontology (PJO)	Media analysis use case	Complex, socially loaded, concept-rich domain
Personalized Depression Treatment Ontology (PDTO)	Healthcare use case	Technical, relation-heavy, clinically complex domain

The prompting design is intentionally plain. The models receive the requirement specification and a minimal instruction to generate competency questions. No examples. No elaborate prompt engineering. No comforting theatrical ritual where the model is first asked to “think like a world-class ontology engineer.”

That design matters. The study is not benchmarking how clever a prompt engineer can be. It is asking how models behave under a neutral, reproducible generation setup.

CompCQ turns “good requirement questions” into measurable trade-offs

CompCQ evaluates CQs at two levels.

At the question level, it looks at readability, complexity, and relevance. Readability is measured with Flesch-Kincaid Grade Level and Dale-Chall readability. Complexity is split into requirement complexity, linguistic complexity, syntactic complexity, and length. Relevance is scored against the input requirements using a four-point scale, with Gemini 2.5 Pro used as the relevance judge and a small subset manually checked during prompt engineering.
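For readers who want the arithmetic behind the readability column, here is a minimal sketch of the Flesch-Kincaid Grade Level formula. The constants are the standard published ones. The syllable counter is a crude vowel-group heuristic assumed for illustration, and the names `count_syllables` and `fkgl` are hypothetical rather than taken from the paper's tooling, which may well differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59
```

A CQ is usually a single sentence, so the syllables-per-word term dominates and one long clinical term can move the grade level by several points. That is part of why such scores read better as comparisons between models than as literal reading ages.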

At the set level, it looks at diversity and overlap. The authors embed CQ sets using Sentence-BERT, then compare internal diversity and inter-model semantic coverage. The important idea is simple: two models can both generate relevant questions while still covering different concepts. In requirements work, difference is not noise. Difference can be where the missing scope lives.

CompCQ dimension	What it captures	Business translation	Boundary
Readability	Whether stakeholders can easily understand the question	Lower review friction; easier workshops	Readability formulas were designed for prose, not short interrogative questions
Requirement complexity	Concepts, properties, relationships, filters, cardinality, aggregation	Likely ontology and query implementation burden	Extracted using an LLM, so it is a proxy rather than a formal proof
Linguistic / syntactic complexity	Wording, noun phrases, verbs, dependencies, nesting	Risk of ambiguity and review fatigue	Complex phrasing is not always bad; some domains are genuinely complex
Relevance	Alignment with explicit or necessary requirements	Lower hallucination risk	LLM-assisted relevance judgment needs human governance in production
Internal diversity	Breadth within one model’s CQ set	Helps reveal whether a model explores enough of the domain	Diversity can include useful novelty or distracting sprawl
Inter-model overlap	Whether model sets cover the same semantic ground	Helps decide whether multiple models add coverage	Low overlap does not automatically mean higher quality

This framework is the paper’s first contribution. Its value is not that each metric is perfect. The authors are careful about that. Readability scores are comparative indicators, not absolute truth. Embedding-based semantic overlap depends on representation choices. LLM relevance scoring is scalable, but not the same as expert validation.

Still, the framework gives teams a vocabulary for a problem that is usually handled by taste: “This model feels better.” Taste is a poor control system. It does not survive procurement.

Comparison 1: domain structure often matters more than model branding

The cleanest result is not “Gemini wins” or “open models lose” or “GPT is safest.” It is that domain structure strongly shapes model behavior.

The Personalized Depression Treatment Ontology is the paper’s stress test in plain sight. It involves patient demographics, genetic information, treatment options, clinical trial outcomes, and relationships among them. Across the models, this domain tends to produce more complex and less readable CQs. The paper reports that Llama 3.2-3B reaches an FKGL of 18.21 on PDTO, a very high reading level in this comparative setup. KimiK2 also shows high requirement complexity in this domain, reflecting questions that demand more concepts, properties, and relationships.

That result should feel familiar to anyone who has watched AI vendors demo on toy workflows. A travel recommender and a clinical ontology are not the same problem wearing different labels. Domain complexity is not washed away by model size. The model may generate fluent text, but the underlying semantic burden remains.

The When To Go Where use case behaves differently. It is a clearer recommendation task: suggest national parks based on user preferences such as weather, crowds, and location. In this constrained setting, models tend to converge on similar core requirements. The paper reports high centroid similarities among several model pairs, including Gemini-GPT at 0.91 and Gemini-KimiK2 at 0.88. Gemini and GPT also reach 56.2% bidirectional coverage, meaning a substantial share of one model’s semantic content is represented by the other.
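To make the two numbers above less abstract, the following is a hedged sketch of how centroid similarity and coverage between two CQ sets might be computed. It is one plausible reading, not the paper's exact definition. The 0.8 threshold, the averaging of the two directions, and the toy two-dimensional vectors standing in for Sentence-BERT embeddings are all assumptions, as are the function names.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors: list[Vector]) -> Vector:
    """Component-wise mean of a set of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def coverage(source: list[Vector], target: list[Vector], threshold: float) -> float:
    """Share of source vectors whose nearest target vector reaches the threshold."""
    hits = sum(1 for s in source if max(cosine(s, t) for t in target) >= threshold)
    return hits / len(source)


def bidirectional_coverage(a: list[Vector], b: list[Vector], threshold: float = 0.8) -> float:
    """Assumed definition: mean of the two directional coverages."""
    return (coverage(a, b, threshold) + coverage(b, a, threshold)) / 2


# Toy stand-ins for embedded CQ sets from two models (hypothetical data).
set_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
set_b = [[1.0, 0.1], [1.0, 0.9]]
print(cosine(centroid(set_a), centroid(set_b)))  # centroids point the same way
print(coverage(set_a, set_b, 0.8))               # yet part of set_a has no match in set_b
```

The toy data shows why the two measures are worth reporting separately: centroids can align closely while individual questions in one set still have no counterpart in the other.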

The lesson is not that constrained domains are easy. It is that they create agreement. When the requirement surface is narrow, models are pulled toward the same obvious questions. This is useful for consistency, but not necessarily for discovery.

By contrast, Music Meta and BME are broad user-story settings. There, models may be thematically aligned but still produce almost no semantic overlap across many pairings. The paper reports near-zero bidirectional coverage between many model pairs in these broad domains. In business terms: the models are all looking at the same room, but each one notices different furniture.

That is the part executives should not miss. In broad or early-stage requirements work, single-model generation can create the illusion of completeness. The output is long, relevant, and grammatically convincing. It is also partial.

Comparison 2: readability and richness are not the same objective

A common mistake is to treat relevance as the scoreboard. If the generated questions are relevant, the model has done the job. The paper makes that assumption uncomfortable.

Across the experiments, mean relevance scores are generally high, often above 3 on the four-point scale. Closed models, especially Gemini and GPT, show the most consistent relevance. That is good news. It suggests the models are usually not inventing wildly unrelated requirements.

But relevance does not equal coverage. It also does not equal usability.

Gemini’s profile is the clearest example. The paper finds that Gemini often produces concise, simple, readable CQs. In Music Meta, BME, and When To Go Where, Gemini registers low scores in length, requirement complexity, linguistic complexity, and syntactic complexity, with strong readability. For workshops involving business stakeholders, this matters. A requirement question that nobody can comfortably discuss is not a requirement artifact. It is a small monument to technical self-harm.

GPT sits closer to the middle: generally more complex than Gemini, less verbose than KimiK2 or Llama 3.2-3B, and stable enough for structured enterprise workflows.

KimiK2 is more interesting because its weakness is adjacent to its strength. It often generates more complex and verbose questions, with higher requirement complexity in several domains. That can raise review cost. It can also surface richer candidate primitives and relationships. In early ontology design, especially when the team is still discovering the domain boundary, that extra complexity may be useful. Not because the output should be accepted as final, but because it expands the search space.

The smaller Llama models are more fragile in the paper’s results. Llama 3.1-8B and Llama 3.2-3B often generate fewer CQs. In several domains, the Llama models produce only 8 to 16 questions where Gemini, GPT, and KimiK2 produce larger sets. Low output volume matters because CQ generation is partly a coverage task. A short list may be elegant. It may also be incomplete.

Model profile	What the paper directly shows	Practical use	Practical risk
Gemini 2.5 Pro	Often concise, readable, consistently relevant	First-pass core requirements; stakeholder review	May under-explore broad requirement space
GPT-4.1	Balanced, stable, often clustered around core topics	Structured enterprise workflows; baseline comparison	Low diversity can miss peripheral concepts
KimiK2-1T	Often more complex; can add diversity	Discovery workshops; exploratory generation	Verbosity and complexity increase review burden
Llama 3.1 8B	Smaller model; lower CQ counts in several cases	Low-cost experimentation, internal tooling	Coverage risk
Llama 3.2 3B	Sometimes highly diverse, sometimes outlier behavior	Controlled prototyping only	Erratic semantic alignment and low reliability

The business decision is therefore not “which model is best?” It is “which failure mode can this stage tolerate?”

A discovery stage may tolerate verbosity if it reveals missing concepts. A compliance validation stage may prefer conservative clarity. A stakeholder workshop may need readability more than novelty. A test-generation pipeline may need relevance and coverage, with human review before formalization.

Anyone selling one model as a universal requirements analyst is not selling a method. They are selling a shortcut. The market enjoys those. Reality invoices later.

Comparison 3: diversity is useful only when it expands coverage, not when it creates decorative variation

The set-level analysis is where the paper becomes most relevant for enterprise AI design.

Internal semantic diversity measures whether a model’s own CQ set spreads across different concepts or clusters tightly around a few themes. The authors use average pairwise cosine similarity, average centroid distance, and Shannon entropy. Lower average cosine similarity and higher centroid distance or entropy generally indicate greater diversity.
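A minimal sketch of those three measures, under stated assumptions, looks like this. Centroid distance is taken as Euclidean and entropy is computed over cluster or topic labels supplied by the caller. The paper may define either differently, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_pairwise_similarity(vectors: list[Vector]) -> float:
    """Mean cosine similarity over all unordered pairs; lower means more diverse."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def avg_centroid_distance(vectors: list[Vector]) -> float:
    """Mean Euclidean distance to the centroid (assumed metric); higher means more spread."""
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


def shannon_entropy(labels: list[str]) -> float:
    """Entropy in bits of a label distribution, e.g. cluster assignments of CQs."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())
```

Because entropy here depends only on how evenly questions fall across labels, a set can score high on entropy while its embeddings remain close together, which matches the GPT pattern described next.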

The findings are not neat enough for a leaderboard, which is precisely why they are useful. No model is always the most diverse. Llama 3.2-3B is described as “boom-or-bust”: least diverse on Music Meta, but highly diverse on When To Go Where and Political Journalism. GPT tends to produce lower-diversity sets, often clustered around core topics, though its entropy can still be high, meaning its questions may distribute across subtopics even if semantically close. KimiK2 often matches or exceeds closed models on diversity while still producing a reasonable number of CQs.

This distinction matters because diversity has three faces.

Useful diversity reveals missing areas of the requirement space. Decorative diversity paraphrases the same issue in several ways. Dangerous diversity wanders into irrelevant or ungovernable requirements. CompCQ does not magically solve that distinction, but it helps teams see when outputs are too clustered or too isolated.

The pairwise overlap results sharpen the point. In constrained WTGW, several models overlap substantially. In broad Music Meta and BME, many pairings have near-zero bidirectional coverage. In complex technical domains such as PDTO and Political Journalism, the closed models overlap more with each other than with many open-model pairings. For example, Gemini and GPT reach 23.8% bidirectional coverage in PDTO and 17.6% in Political Journalism, while many other pairings involving KimiK2 or Llama 3.2-3B show 0% overlap.

That pattern suggests a practical design principle:

Constrained task       → expect convergence; use overlap to validate consensus.
Broad user story       → expect novelty; use multiple models to reveal missing facets.
Complex technical case → expect partial convergence; use stable models for core scope and exploratory models for edge discovery.

This is the most important business implication in the paper. The value of multiple models is not ideological diversity. It is coverage engineering.

The experiment types are doing different jobs

The paper’s evidence is easiest to misread if every table is treated as the same kind of proof. They are not.

Paper component	Likely purpose	What it supports	What it does not prove
Neutral CQ generation across five models	Main evidence	Model profiles under comparable prompting	Best possible performance after prompt optimization
Five-domain comparison	Main evidence	Domain-dependent behavior	Universal behavior across all enterprise domains
Readability and complexity metrics	Main evidence / diagnostic framework	Comparative cognitive and implementation burden	Absolute quality of each CQ
LLM-based relevance scoring	Scalable evaluation method	Approximate alignment with source requirements	Expert-validated correctness in production
Sentence-BERT diversity and overlap	Main evidence for set-level behavior	Novelty, convergence, and semantic coverage patterns	Whether novel questions are all useful or correct
Data-leakage checks for source materials	Implementation safeguard	Reduces concern that models reproduced known CQs	Complete proof of no memorization

The paper is strongest when it compares observable output properties. It is more tentative when moving from those properties to final ontology quality. That is not a flaw; it is the proper boundary of the study.

A CQ set with low readability may still contain valuable expert-level questions. A highly diverse set may include distracting lines of inquiry. A highly relevant set may still omit important edge cases. The paper does not claim to solve ontology evaluation end-to-end. It gives teams a way to inspect the generation stage before downstream modeling hardens soft omissions into expensive architecture.

What this means for business teams building knowledge systems

For firms building knowledge graphs, compliance ontologies, data catalogs, AI memory layers, or domain-specific reasoning systems, the paper points toward a staged workflow.

Start with a clarity-first model. Gemini or GPT-like systems are useful for generating a readable core set of CQs that stakeholders can actually discuss. This creates the shared baseline: the obvious requirements, the core entities, the main relationships, and the questions everyone expects the system to answer.

Then add a diversity-first model. KimiK2-like behavior may be valuable here, not because every output is clean, but because messy breadth can reveal concepts the baseline missed. In early requirements work, controlled messiness is not a bug. It is a search strategy.

Next, run overlap and novelty review. Questions covered by multiple models may indicate consensus. Questions produced by only one model may indicate novelty, misunderstanding, or hidden domain scope. The human analyst’s job is to classify those cases, not merely proofread grammar.
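The triage step can be sketched as follows. This is an illustration, not a method from the paper. Token-level Jaccard overlap is swapped in for embedding similarity so the example runs without a model download, and the 0.5 threshold, model names, and labels are assumptions.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Token-overlap stand-in for embedding similarity (assumption for this sketch)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def triage(
    cq_sets: dict[str, list[str]],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> list[dict]:
    """Label each CQ as 'consensus' (matched by another model) or 'single-model'."""
    rows = []
    for model, questions in cq_sets.items():
        for question in questions:
            # Other models that produced at least one sufficiently similar question.
            supporters = sorted(
                other
                for other, other_questions in cq_sets.items()
                if other != model
                and any(similarity(question, o) >= threshold for o in other_questions)
            )
            rows.append({
                "model": model,
                "question": question,
                "supported_by": supporters,
                "label": "consensus" if supporters else "single-model",
            })
    return rows
```

The "single-model" label is deliberately neutral. Whether such a question is novelty, misunderstanding, or hidden scope is the analyst's call, and the code only makes sure the question reaches the analyst.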

Finally, convert accepted CQs into tests, ontology design tasks, and governance checks. A CQ is only useful if it becomes operational: a data requirement, a SPARQL query, a validation test, a workflow rule, or a review criterion.

A practical pipeline could look like this:

Stage	Main question	Recommended model posture	Human role
Baseline generation	What must the system obviously answer?	Stable, readable model	Confirm core scope
Expansion	What might the baseline miss?	Diverse / exploratory model	Separate useful novelty from noise
Coverage comparison	Which questions overlap or diverge?	Embedding-based or analyst-assisted comparison	Identify omissions and conflicts
Formalization	Which CQs become design constraints?	Tool-assisted conversion	Approve ontology/query/test implications
Governance	What is still uncertain?	Periodic re-generation and comparison	Maintain audit trail and update scope
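The formalization and governance rows above imply some record-keeping. A minimal sketch, assuming a hypothetical `CQRecord` structure with an illustrative status vocabulary and field names, might track which accepted CQs still lack an operational artifact and what was rejected and why.

```python
from dataclasses import dataclass, field


@dataclass
class CQRecord:
    question: str
    status: str = "proposed"  # assumed vocabulary: proposed | accepted | rejected
    rationale: str = ""       # reviewer's reason for the decision
    artifacts: list[str] = field(default_factory=list)  # e.g. query or test identifiers


def review(record: CQRecord, accept: bool, rationale: str) -> None:
    """Record a human decision; a decision without a stated reason is refused."""
    if not rationale.strip():
        raise ValueError("a review decision needs a recorded rationale")
    record.status = "accepted" if accept else "rejected"
    record.rationale = rationale


def unoperationalized(records: list[CQRecord]) -> list[CQRecord]:
    """Accepted CQs that have not yet become a query, test, or rule."""
    return [r for r in records if r.status == "accepted" and not r.artifacts]


def rejection_log(records: list[CQRecord]) -> list[tuple[str, str]]:
    """What was rejected and why, for the audit trail."""
    return [(r.question, r.rationale) for r in records if r.status == "rejected"]
```

Refusing a decision without a rationale is a design choice rather than a requirement from the paper. It keeps the audit trail from filling up with silent deletions.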

This is where the paper becomes relevant beyond ontology engineering. Many enterprise AI projects now depend on hidden ontologies whether teams call them that or not. A customer-support agent has an implicit ontology of products, policies, exceptions, and escalation paths. A compliance assistant has an implicit ontology of jurisdictions, obligations, evidence, controls, and violations. A data catalog has an implicit ontology of assets, owners, lineage, definitions, and permissions.

If those systems are built from incomplete requirement questions, they inherit incomplete worlds.

The expensive error is not a bad question; it is the question never asked

A bad generated CQ is visible. Someone can mark it irrelevant, rewrite it, or delete it.

A missing CQ is quieter. It leaves no red flag in the document. It becomes visible later as a failed query, an unsupported edge case, a compliance exception, or a user asking, “Why can’t the system answer this?”

The paper’s low-overlap results are therefore more than an academic curiosity. In broad domains, different models can generate relevant but distinct CQ sets. If a team uses only one model, it may never see the alternative questions. The output will still look professional. That is the danger.

This is a general pattern in AI automation. Generated artifacts often look complete because they are fluent. Requirements artifacts are especially vulnerable because completeness is not visually obvious. A polished list of twenty questions can still omit the one concept that later breaks the system.

That is why the business value is not simply cheaper CQ generation. It is cheaper diagnosis of missing scope.

Boundaries: what CompCQ can and cannot promise

CompCQ is a benchmarking and diagnostic framework, not an oracle.

First, several metrics are proxies. Readability formulas were designed for longer prose, so the paper treats them comparatively. That is reasonable, but teams should not interpret an FKGL score as a literal user research result.

Second, relevance is LLM-assisted. Using Gemini 2.5 Pro as a judge makes the study scalable, but production governance should not replace expert validation with another model and call it independence. That would be very 2026, and not in a good way.

Third, embedding-based overlap captures semantic similarity, not business importance. A unique CQ may be critical, irrelevant, or merely oddly phrased. Novelty requires interpretation.

Fourth, the study uses neutral zero-shot prompting. This helps comparability, but it does not show the best achievable performance of each model under carefully engineered prompts, retrieval context, or domain-specific instructions. In a real Cognaptus-style pipeline, prompt design, domain documents, review rubrics, and feedback loops would matter.

Finally, the paper evaluates generated CQs, not the final ontologies built from them. The downstream question remains open: which combinations of model-generated and human-refined CQs lead to better ontology coverage, lower implementation cost, fewer defects, or higher ROI?

These boundaries do not weaken the paper’s practical value. They locate it. The study is most useful at the point where teams must decide how to generate, compare, and review candidate requirements before committing them to architecture.

A better procurement question: not “which model,” but “which model mix?”

The lazy procurement question is: which LLM should we use for requirements generation?

The better question is: what model mix gives us readable core requirements, useful novelty, measurable coverage, and controlled review cost?

That shift changes the buying conversation. A vendor claiming “we use GPT” or “we use an open model” has not answered the operational question. Ask instead:

Procurement question	Why it matters
How do you measure coverage across generated requirements?	Prevents fluent but partial outputs
Do you compare outputs from multiple models or prompts?	Reveals hidden omissions and model-specific blind spots
How do you distinguish useful novelty from irrelevant expansion?	Controls review cost
Are CQs converted into tests or only stored as text?	Separates documentation from operational governance
Where does human review enter the loop?	Keeps domain accountability where it belongs
Can the process show what was rejected and why?	Creates auditability for regulated or high-stakes domains

This is the commercial opening. Not another generic AI wrapper. Not another “requirements copilot” that produces neat bullets and calls it transformation. The opportunity is a disciplined requirements intelligence layer: generation, comparison, coverage mapping, expert review, and traceable conversion into operational artifacts.

Margin does not live in the prompt box. It lives in the control system around the prompt box.

The practical conclusion: automate generation, not judgment

The paper’s main contribution is not that LLMs can write competency questions. We knew they could produce plausible text. The contribution is showing that the generated questions have measurable profiles, and those profiles change across domains and models.

For constrained tasks, models converge. For broad user stories, models explore different facets. For complex technical domains, closed models may share a stable core while exploratory models add divergent coverage. Across all of this, the best workflow is not single-model automation. It is multi-model generation, measured comparison, and human curation.

That is a less glamorous message than “AI replaces requirements work.” It is also the message serious operators should prefer.

Machines can draft the questions.

Professionals still decide which questions matter.

Cognaptus: Automate the Present, Incubate the Future.

¹ Reham Alharbi, Valentina Tamma, Terry R. Payne, and Jacopo de Berardinis, “Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models,” arXiv:2604.16258, 2026. https://arxiv.org/abs/2604.16258
