Questions look cheap.

That is why they are dangerous.

In most enterprise AI projects, the visible work arrives late: dashboards, RAG demos, knowledge graphs, compliance assistants, workflow copilots, and executive slides with arrows pointing to a “semantic layer.” The invisible work arrives earlier and is less glamorous: deciding what the system must actually know, answer, retrieve, distinguish, reject, and explain.

In ontology engineering, one of the oldest tools for doing that work is the Competency Question, or CQ. A CQ is a natural-language question that a future ontology, knowledge graph, or semantic model should be able to answer. “Which collection items are on loan?” is a CQ. So is “What format is each multimedia file?” or “Which instruments belong to this family?” The question is not decoration. It is a requirement wearing a readable costume.

A recent paper by Reham Alharbi, Valentina Tamma, Terry Payne, and Jacopo de Berardinis tests a timely version of an old problem: when the same ontology requirements are given to human experts, reusable templates, GPT-4.1, and Gemini 2.5 Pro, who writes better competency questions? [1]

The tempting answer is: the LLMs, because they are fast and fluent. The paper’s answer is more annoying, and therefore more useful: LLMs can produce relevant first drafts, but human ontology engineers still produce the most suitable, readable, compact, and inferentially useful CQs. Once again, the hard part is not producing language. It is knowing what the language should commit the system to.

Competency questions are requirements, not brainstorming prompts

A CQ sits between messy stakeholder language and formal knowledge modelling. It turns a desired capability into a testable question. If the model cannot answer the CQ, the model is incomplete, irrelevant, or both.
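
One way to make "testable" concrete is to treat each CQ as a small check against the model. The sketch below is purely illustrative: the triples, the `onLoan` and `hasFormat` properties, and the `check_cq` helper are hypothetical names invented for this example, not anything taken from the paper or from a real ontology.

```python
# Toy knowledge model as (subject, predicate, object) triples.
# Every identifier here is hypothetical and exists only for illustration.
TRIPLES = {
    ("guitar_01", "type", "CollectionItem"),
    ("guitar_01", "onLoan", "true"),
    ("poster_07", "type", "CollectionItem"),
    ("poster_07", "onLoan", "false"),
    ("clip_12", "type", "MultimediaFile"),
}


def subjects_with(predicate, obj, triples=TRIPLES):
    """Return every subject that has the given predicate/object pair."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def has_predicate(predicate, triples=TRIPLES):
    """True if the model uses this predicate anywhere."""
    return any(p == predicate for (_, p, _) in triples)


# Each CQ is paired with the vocabulary it needs and a query that answers it.
COMPETENCY_QUESTIONS = {
    "Which collection items are on loan?": {
        "requires": ["type", "onLoan"],
        "query": lambda: subjects_with("type", "CollectionItem")
        & subjects_with("onLoan", "true"),
    },
    "What format is each multimedia file?": {
        "requires": ["type", "hasFormat"],
        "query": lambda: {(s, o) for (s, p, o) in TRIPLES if p == "hasFormat"},
    },
}


def check_cq(question):
    """Report whether the model has the vocabulary to answer a CQ at all."""
    spec = COMPETENCY_QUESTIONS[question]
    missing = [p for p in spec["requires"] if not has_predicate(p)]
    if missing:
        return {"answerable": False, "missing": missing, "answer": None}
    return {"answerable": True, "missing": [], "answer": spec["query"]()}
```

In this toy model the loan question can be answered, while the format question fails because no `hasFormat` relation exists yet. That failure is the useful part: the CQ has pointed at a gap in the model rather than at a gap in the prose.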

That makes CQs especially important for business systems that depend on structured meaning rather than keyword matching. A data-governance knowledge graph needs CQs to define what counts as lineage, ownership, consent, policy scope, or exception handling. A compliance assistant needs CQs to distinguish “What policy applies?” from “What evidence proves compliance?” A RAG system built over enterprise documents needs CQs to prevent a common disease: impressive retrieval over poorly specified tasks.

The usual business failure is not that nobody asks questions. Everyone asks questions. The failure is that questions remain too vague, too late, or too disconnected from the model that must eventually answer them. A CQ forces the uncomfortable connection: if this question matters, where in the ontology will the required concepts, relations, attributes, and constraints live?

That is why the paper’s comparison is worth reading as more than a niche ontology study. It is also a small but sharp test of AI-assisted requirements elicitation.

The experiment compares five CQ sets from one controlled source

The study introduces AskCQ, a dataset of 204 competency questions generated from the same cultural-heritage user story. The story involves a music archivist and a collection curator working with museum collection data: loaned music memorabilia, metadata, multimedia files, display requirements, and related management needs.

The controlled setup matters. If one method received a clean requirements document and another received a vague stakeholder interview, any comparison would be theatre. Here, every method starts from the same source material.

The five CQ sets are:

| CQ set | Generation method | Number of CQs | What it represents |
|---|---|---|---|
| HA-1 | Human ontology engineer | 44 | Manual expert formulation |
| HA-2 | Human ontology engineer | 54 | A second independent manual expert formulation |
| Pattern | Manually instantiated CQ templates | 38 | Semi-automated pattern-based elicitation |
| GPT-4.1 | LLM-generated from the user story | 26 | Automated generation by OpenAI’s model |
| Gemini 2.5 Pro | LLM-generated from the user story | 42 | Automated generation by Google’s model |

The human annotators had at least five years of ontology-engineering experience. The pattern-based set was produced by an ontology engineer using predefined CQ archetypes. The LLM sets were generated from a Markdown version of the same user story, without giving the models a CQ definition, desired number of questions, or subjective quality criteria. That design choice avoids steering the models toward the authors’ preferred answer. It also means the LLM result should be read as an off-the-shelf baseline, not as the best possible prompt-engineered workflow.

The CQs were anonymized before evaluation. Three ontology-engineering evaluators then judged whether each CQ was suitable for guiding ontology-engineering work in the context of the user story. The paper also analyzes ambiguity, relevance, readability, complexity, and semantic overlap across CQ sets.

This is not one experiment pretending to answer everything. It is a set of complementary tests. The expert suitability evaluation is the main evidence. The readability and complexity metrics explain part of why some CQs are easier to use. The semantic-overlap analysis asks whether different methods are converging on the same requirements or merely staying in the same broad topic area. Figure 1 in the paper consolidates the feature profiles, but it is best treated as a summary visualization, not a separate proof.

The first comparison: experts still write the most acceptable CQs

The clearest result is the expert suitability evaluation.

Each CQ received three independent accept/reject judgments. The paper sums these ratings into a suitability score from -3 to +3, where a score above zero means majority acceptance. Inter-annotator agreement was only fair, with Fleiss’ Kappa at $\kappa = 0.35$. That is well short of perfect agreement, and it is a useful reminder that CQ quality is partly judgment-based. Even experts do not behave like a checksum.
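
The scoring arithmetic is simple enough to sketch. The code below assumes each accept counts as +1 and each reject as -1, which is consistent with the -3 to +3 range but is an assumption about the paper's encoding. The Fleiss' Kappa function follows the standard textbook formula for two categories; the function names and the toy votes are made up here and will not reproduce the paper's 0.35.

```python
def suitability_score(votes):
    """Sum accept/reject votes, counting accept as +1 and reject as -1 (assumed encoding)."""
    return sum(1 if v else -1 for v in votes)


def majority_accepted(votes):
    """A CQ is majority-accepted when its summed score is above zero."""
    return suitability_score(votes) > 0


def fleiss_kappa(ratings):
    """Fleiss' Kappa for binary votes.

    ratings: list of per-item vote lists (True = accept, False = reject),
    each with the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of raters")

    # Per-item counts for the two categories: [accepts, rejects].
    counts = [[sum(1 for v in item if v), sum(1 for v in item if not v)]
              for item in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(2)]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every vote is identical")
    return (p_bar - p_e) / (1 - p_e)


# Toy votes from three hypothetical evaluators on four CQs.
toy_votes = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [False, False, False],
]
scores = [suitability_score(v) for v in toy_votes]
kappa = fleiss_kappa(toy_votes)
```

With three evaluators the summed score can only take the values -3, -1, +1, or +3 under this encoding, so "above zero" and "at least two accepts" are the same condition.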

| CQ set | Commented CQs | Mean suitability score | Majority-accepted CQs |
|---|---|---|---|
| HA-2 | 19% | 2.87 ± 0.62 | 98% |
| HA-1 | 27% | 2.39 ± 1.26 | 91% |
| GPT-4.1 | 35% | 1.85 ± 1.52 | 85% |
| Gemini 2.5 Pro | 31% | 1.52 ± 1.88 | 67% |
| Pattern | 37% | 0.11 ± 2.12 | 50% |

The result is not “LLMs are useless.” GPT-4.1 reached an 85% acceptance rate, which is not a trivial result. A business team would be foolish to ignore that kind of first-pass productivity.

But the ranking matters. Both human sets outperform the LLM sets, and HA-2 is especially strong: 98% majority acceptance, the highest mean score, and the lowest share of evaluator comments. Gemini is weaker than GPT in this setup, and the pattern-based method performs worst despite being manually instantiated by an experienced engineer.

This last point is important because it prevents a lazy “humans versus machines” reading. Patterns are not machines in the generative-AI sense; they are reusable structures filled in by a human. Yet they scored lowest. The problem is not simply whether a person is involved. The problem is whether the method preserves the relationship between the source requirement and the question being asked.

Templates can enforce consistency, but they can also impose structure before the requirement has revealed its shape. Reality, as usual, declines to fit the form.

The second comparison: LLMs ask longer, heavier questions

Suitability tells us which CQs experts accepted. Readability and complexity help explain what the CQs feel like to use.

The paper uses Flesch-Kincaid Grade Level and Dale-Chall readability scores as comparative indicators. The authors are careful here: these formulas were designed for continuous prose, not short interrogative sentences. So the scores should not be read as literal comprehension truth. They are still useful for comparing the five sets under the same measurement conditions.
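
For readers who want to see what is inside the grade-level number, here is a minimal sketch of the standard Flesch-Kincaid Grade Level formula. The coefficients are the published ones; the syllable counter is a crude vowel-run heuristic chosen for brevity, and it is an assumption of this example. Real readability tools count syllables more carefully, so values from this sketch will not match the paper's table.

```python
import re


def count_syllables(word):
    """Crude heuristic: count runs of vowels, with a floor of one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


short_cq = "Which collection items are on loan?"
short_grade = fkgl(short_cq)
```

Because a CQ is usually a single sentence, the words-per-sentence term is just the word count. That is one reason long LLM-generated questions climb the scale quickly, and one reason the authors treat the scores as comparative rather than literal.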

| CQ set | FKGL | Dale-Chall | Average length (characters) | Requirement complexity | Syntactic complexity |
|---|---|---|---|---|---|
| HA-1 | 5.63 ± 2.80 | 9.59 ± 1.62 | 42.57 ± 12.94 | 4.52 ± 1.50 | 16.76 ± 5.27 |
| HA-2 | 6.88 ± 3.42 | 8.76 ± 2.00 | 46.93 ± 12.01 | 4.17 ± 1.09 | 18.04 ± 3.87 |
| Pattern | 7.66 ± 2.81 | 10.94 ± 2.63 | 51.69 ± 15.53 | 4.94 ± 1.34 | 16.79 ± 4.03 |
| GPT-4.1 | 11.64 ± 2.69 | 12.67 ± 1.89 | 111.15 ± 17.18 | 8.12 ± 2.14 | 37.91 ± 7.20 |
| Gemini 2.5 Pro | 9.72 ± 2.67 | 12.90 ± 2.64 | 93.50 ± 27.52 | 5.60 ± 2.40 | 31.96 ± 9.06 |

The gap is not subtle. GPT-4.1 produced CQs averaging about 111 characters, more than double the HA-1 and HA-2 averages of roughly 43 and 47. Gemini averaged 93.5 characters. The LLM questions were also more complex across requirement, linguistic, and syntactic dimensions.

In some contexts, complexity is not bad. A richer question may capture a richer requirement. The paper’s complexity measures include ontological primitives such as concepts, properties, relationships, filters, cardinality, and aggregation. A question that genuinely needs more of these primitives should be complex.

But here the LLMs are not merely adding useful modelling richness. They are also producing longer and syntactically heavier language. That creates a practical problem. A CQ is supposed to help engineers, stakeholders, and reviewers converge on what the system must answer. When the question itself becomes a small legal contract, the review burden moves from ontology design to question interpretation. Congratulations: the team has automated the creation of things people now need meetings to understand.

The human CQs are shorter and easier to read. That matters operationally. A requirement artifact is not valuable because it sounds comprehensive. It is valuable because it can be reviewed, challenged, tested, and translated into model structure.

The third comparison: experts infer what stakeholders forgot to say

One of the paper’s most business-relevant findings is hidden inside the relevance analysis.

The authors rate relevance on a four-point scale. A score of 4 means the CQ addresses a requirement explicitly stated in the user story. A score of 3 means the CQ addresses something not explicit but inferable from the story using domain knowledge and functionally necessary for the story’s goals. In this section, the paper reports the proportion of score-3 CQs; the rest received score 4.

That distinction is small, but it is the heart of requirements work.

| CQ set | Inferential relevance: score-3 CQs |
|---|---|
| HA-2 | 27.8% |
| HA-1 | 18.2% |
| Pattern | 13.5% |
| GPT-4.1 | 12.0% |
| Gemini 2.5 Pro | 4.8% |

The human experts produced the highest proportion of inferential CQs. The paper gives examples such as asking for an instrument family or the format of each multimedia file. These needs may not be spelled out directly, but they are functionally necessary if the ontology is to support the archivist and curator’s work.

This is where the “LLMs replace requirements engineers” story starts to wobble.

An LLM can read a user story and produce plausible questions about what is explicitly present. That is useful. But much of enterprise requirements work consists of detecting what is missing, implied, assumed, or operationally necessary. Stakeholders rarely provide a complete specification. They provide fragments, examples, complaints, habits, and the occasional sentence that sounds clear until someone tries to implement it.

The expert contribution is not just better phrasing. It is abductive judgment: seeing a goal and inferring the questions that must exist for the goal to be achievable.

For business AI projects, that distinction maps directly to risk. If AI-generated requirements stay close to explicit text, they may look faithful while missing operational dependencies. A compliance assistant may list the policy but miss the required evidence. A product knowledge graph may capture product names but miss lifecycle states. A procurement RAG system may retrieve supplier documents but fail to represent approval authority.

The problem is not hallucination. It is under-inference.

The fourth comparison: same topic does not mean same requirements

The semantic-overlap analysis is the part of the paper that should make governance teams sit up.

The authors compute sentence embeddings for each CQ using Sentence-BERT and compare CQ sets through centroid similarity, directional coverage, and bidirectional coverage. The threshold for coverage is $\tau = 0.75$.

The centroid similarities are generally high. Many CQ sets are broadly about the same thing. For example, HA-1 and GPT have the highest centroid similarity at 0.85. HA-2 and Pattern reach 0.84. HA-1 and Pattern reach 0.83. At the thematic level, the methods understand the domain.

Then the coverage results ruin the party.

| Comparison | Centroid similarity | Bidirectional coverage |
|---|---|---|
| HA-1 ↔ HA-2 | 0.82 | 15.3% |
| HA-1 ↔ Pattern | 0.83 | 14.6% |
| HA-1 ↔ GPT-4.1 | 0.85 | 10.0% |
| HA-1 ↔ Gemini | 0.73 | 0.0% |
| HA-2 ↔ GPT-4.1 | 0.74 | 2.5% |
| GPT-4.1 ↔ Gemini | 0.80 | 2.9% |

The key lesson: two CQ sets can be thematically aligned while still identifying different specific requirements.

This is not a contradiction. A model can ask many museum-collection questions and still not ask the same museum-collection questions that matter for a particular ontology. High-level semantic similarity says, “Yes, everyone is talking about the museum collection.” Low coverage says, “No, they did not converge on the same requirement baseline.”
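
A toy example shows how both readings can hold at once. The sketch below takes embeddings as plain lists of floats and skips the Sentence-BERT step entirely. The definitions are assumptions made for illustration: directional coverage is taken to be the share of CQs in one set that have at least one CQ in the other set at cosine similarity of $\tau$ or higher, and bidirectional coverage is taken to be the share of CQs across both sets that have such a match. The paper's exact formulas may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_similarity(set_a, set_b):
    """How close the two sets are on average, ignoring individual questions."""
    return cosine(centroid(set_a), centroid(set_b))


def matched(set_a, set_b, tau):
    """Count vectors in set_a with at least one neighbour in set_b at cosine >= tau."""
    return sum(1 for a in set_a if any(cosine(a, b) >= tau for b in set_b))


def directional_coverage(set_a, set_b, tau=0.75):
    """Assumed definition: share of set_a that is matched by something in set_b."""
    return matched(set_a, set_b, tau) / len(set_a)


def bidirectional_coverage(set_a, set_b, tau=0.75):
    """Assumed definition: share of all CQs in both sets that have a match in the other set."""
    total = len(set_a) + len(set_b)
    return (matched(set_a, set_b, tau) + matched(set_b, set_a, tau)) / total


# Two hypothetical CQ sets in a 2-D embedding space.
set_a = [[1.0, 0.0], [0.0, 1.0]]   # two questions pointing in different directions
set_b = [[1.0, 1.0]]               # one question sitting between them

topic_level = centroid_similarity(set_a, set_b)
question_level = bidirectional_coverage(set_a, set_b)
```

Here the centroids point in the same direction, so the sets look identical at the topic level, yet no individual pair clears the 0.75 threshold and coverage is zero. Averages can agree while the questions themselves do not.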

The two human annotators had the highest bidirectional coverage, at 15.3%, and the strongest directional overlap. That number is still low in absolute terms, which tells us that even expert CQ formulation is diverse. But compared with GPT-Gemini overlap at 2.9%, or Gemini’s zero bidirectional coverage with both human sets in some comparisons, the human sets appear more aligned in identifying core requirements.

For enterprise teams, this matters because requirements baselines are supposed to be stable. If changing the model from GPT-4.1 to Gemini silently changes the generated CQ set, then the downstream ontology, data model, RAG evaluation suite, or compliance workflow may also drift. Model substitution becomes requirements substitution. A delightful feature, provided nobody is accountable for the system afterward.

Ambiguity is not the whole story

A lazy interpretation would be: choose the method with the lowest ambiguity.

That would be wrong.

In the paper, ambiguity does not map cleanly to overall quality. HA-2 has very low ambiguity at 3.7%, and GPT is almost identical at 3.8%. Yet HA-2 receives a 98% majority acceptance rate while GPT receives 85%. HA-1 has the highest ambiguity score at 20.5%, but still reaches 91% majority acceptance. The authors note that 66% of HA-1’s ambiguous CQs were resolved as suitable after discussion, while only 33% of Gemini’s ambiguous CQs were resolved as suitable.

This tells us something useful about review workflows. Ambiguity can be benign or harmful depending on what it hides. A human-generated CQ may contain a wording issue that experts can resolve because the underlying requirement is sound. An LLM-generated CQ may be grammatically clear but still less useful because it is verbose, shallowly explicit, or misaligned with the functional requirement.

Clarity is necessary. It is not sufficient.

That is a surprisingly important point for AI evaluation. Many enterprise teams over-weight surface quality because it is easy to inspect. Does the output read well? Is it grammatically clean? Does it mention the right domain terms? Fine. But requirement artifacts also need coverage, inferential value, and suitability for downstream modelling. Smooth language can still be operationally thin.

What this means for enterprise AI teams

The paper does not say “never use LLMs for CQs.” It says something more practical: use them where they are strong, and do not pretend those strengths cover the whole job.

A business workflow inspired by this paper would look like this:

| Step | AI role | Human role | Governance check |
|---|---|---|---|
| 1. Requirement intake | Convert source material into candidate CQs | Clarify stakeholder goals and domain scope | Store source version and assumptions |
| 2. First-pass generation | Produce a broad candidate CQ set | Remove irrelevant or verbose questions | Record model, prompt, parameters, and date |
| 3. Expert refinement | Suggest alternative phrasings and missing angles | Add inferential CQs and normalize wording | Mark accepted, rejected, rewritten, and added CQs |
| 4. Coverage review | Cluster and compare CQ sets | Decide which requirements are essential | Track semantic gaps and unresolved ambiguity |
| 5. Downstream use | Translate accepted CQs into tests, retrieval checks, or ontology tasks | Validate against operational workflows | Version the CQ baseline as part of model governance |
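
The governance column can be sketched as a small record structure. Everything below is a hypothetical design for illustration, not a schema from the paper: the `CQRecord` fields mirror the checks in the table (source version, model, prompt, parameters, date, review status), and `baseline_fingerprint` is an invented helper that hashes the CQs that survived review so a changed baseline is easy to spot.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Review outcomes taken from the "Governance check" column, plus a starting state.
STATUSES = {"candidate", "accepted", "rejected", "rewritten", "added"}
IN_BASELINE = {"accepted", "rewritten", "added"}


@dataclass
class CQRecord:
    """One competency question plus the provenance needed to audit it."""
    text: str
    status: str = "candidate"
    generator: str = "human"        # "human", "pattern", or a model name
    prompt: str = ""                # prompt used, if the CQ was generated
    parameters: dict = field(default_factory=dict)
    generated_on: str = ""          # ISO date string
    source_version: str = ""        # version of the user story or source document

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def baseline_fingerprint(records):
    """Stable hash over the CQs that are part of the reviewed baseline."""
    kept = sorted(r.text for r in records if r.status in IN_BASELINE)
    payload = json.dumps(kept, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical usage: one generated CQ accepted by review, one added by an expert.
records = [
    CQRecord(
        text="Which collection items are on loan?",
        status="accepted",
        generator="example-model",          # placeholder, not a real model name
        prompt="Generate competency questions from the user story.",
        parameters={"temperature": 0.0},
        generated_on="2025-01-01",
        source_version="user-story-v1",
    ),
    CQRecord(
        text="What format is each multimedia file?",
        status="added",
        generator="human",
        source_version="user-story-v1",
    ),
]
fingerprint = baseline_fingerprint(records)
```

Rejected and still-pending candidates do not affect the fingerprint, so swapping the generating model only changes the baseline if the reviewed set itself changes. That is the property a versioned CQ baseline is meant to protect.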

The ROI is not “replace the ontology engineer.” That is the usual PowerPoint fantasy, briefly profitable and operationally expensive.

The more plausible ROI is faster diagnosis. LLMs can produce enough candidate questions to start a review session earlier. They can widen the search space. They can make missing areas visible by contrast. They can generate draft evaluation items for a RAG or knowledge-graph project before the team has finished turning stakeholder interviews into formal requirements.

But the final CQ set should not be accepted because it sounds plausible. It should be accepted because it is readable, concise, tied to explicit or functionally necessary requirements, and stable enough to govern downstream design.

What the paper directly shows, and what we should infer carefully

The business interpretation needs clean boundaries.

| Level | What can be said | What should not be overclaimed |
|---|---|---|
| Direct paper finding | Human ontology engineers produced the highest suitability and readability, and the lowest complexity, in this controlled AskCQ study. | Human experts will always outperform LLMs across all domains and all prompts. |
| Direct paper finding | LLM-generated CQs were relevant but more complex, less readable, and less semantically aligned with human sets. | LLMs are bad at CQ generation. GPT-4.1 still achieved 85% acceptance here. |
| Direct paper finding | Human experts captured more inferential, functionally necessary requirements. | The paper proves a general theory of human reasoning superiority. It does not need that grand opera. |
| Cognaptus inference | LLMs are useful for first-pass CQ brainstorming in enterprise knowledge projects. | LLM-generated CQ sets should become requirements baselines without expert review. |
| Cognaptus inference | Model and prompt versioning matter because generated CQ sets may vary substantially. | Semantic embedding metrics alone can certify coverage quality. |

This is the sensible reading: LLMs lower the cost of producing candidate requirement questions. They do not remove the need to decide which questions deserve to become requirements.

The boundaries are narrow, but useful

The paper’s limitations are not decorative; they affect how the result should be used.

First, the dataset comes from one user story in the cultural-heritage domain. That controlled design is a strength for isolating method differences, but it limits generalization. A banking compliance ontology, manufacturing maintenance graph, or clinical data model may produce different results.

Second, some metrics are proxies. Relevance is scored by Gemini 2.5 Pro with partial manual validation. Requirement complexity also relies on LLM-extracted primitives. The authors frame these as comparative indicators rather than absolute measures, which is the right level of confidence.

Third, the LLM setup is intentionally not a heavily optimized CQ-generation pipeline. The models were not given detailed CQ definitions, examples of high-quality CQs, or target properties. This makes the comparison cleaner as a baseline, but it leaves room for better prompts, domain-specific examples, retrieval-augmented context, multi-agent critique, or expert-in-the-loop iteration.

Fourth, the paper evaluates generated CQs, not the full downstream ontology built from them. Suitability is a strong proxy for ontology-engineering usefulness, but the next research step would be to measure how different CQ sets affect actual ontology design quality, maintenance cost, validation coverage, and stakeholder satisfaction.

Those boundaries do not weaken the practical message. They sharpen it. The result is not a universal ranking of humans, templates, and LLMs. It is evidence for designing reviewable AI-assisted requirements workflows.

The question of questions is still a human question

The paper’s most useful contribution is not that humans beat LLMs. That headline is too easy, and slightly boring.

The more useful contribution is a diagnostic map of how methods fail differently.

Pattern-based generation is consistent but rigid. LLM generation is fast and relevant but verbose, complex, and unstable across models. Human experts are slower, but they are better at producing readable questions that capture implicit functional needs. The business answer is not to worship any one method. It is to assign each method to the part of the workflow where its failure mode is tolerable.

Use LLMs to generate candidates. Use patterns to enforce certain recurring structures where the domain genuinely supports them. Use experts to select, rewrite, infer, and govern the final baseline.

That baseline is where the real value sits. A CQ is small, but it decides what the model must know. When the question is wrong, the ontology can be technically elegant and still useless. When the question is right, even a modest knowledge model has a fighting chance of serving the business rather than impressing the architecture committee.

AI can help us ask more questions. Good. We needed that.

But the question of which questions matter remains stubbornly human.

Cognaptus: Automate the Present, Incubate the Future.


1. Reham Alharbi, Valentina Tamma, Terry R. Payne, and Jacopo de Berardinis, “A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements,” arXiv:2507.02989v2, 2026. https://arxiv.org/abs/2507.02989