Opening — Why this matters now

Everyone wants AI systems that are explainable, reliable, and aligned to business needs. Few want to do the tedious work required to get there.

That work often begins with asking the right questions.

In knowledge engineering, those questions are called Competency Questions (CQs): natural-language prompts that define what an ontology or knowledge model must be able to answer. Think: "Which assets are on loan?" "Who created this artifact?" "What metadata is missing?"

They sound simple. They are not.

A recent comparative study tested whether humans, templates, or modern LLMs are best at generating these questions. The result is a familiar story in AI: automation is impressive, but judgment remains stubbornly human.

Background — Context and prior art

Ontologies are structured representations of concepts, entities, and relationships. They power search, recommendation, governance, interoperability, enterprise data mapping, and every executive deck that says “knowledge graph.”

But ontologies fail when they are built around technical elegance instead of operational need.

CQs help prevent that by translating stakeholder requirements into testable questions. If the ontology cannot answer them, it is missing something important.

Historically, CQs are produced in three ways:

Method          How it works                                       Strength              Weakness
Human expert    Engineers derive questions manually                Contextual judgment   Slow, expensive
Pattern-based   Reusable templates generate questions              Consistency           Rigid phrasing
LLM-based       Models generate questions from requirements text   Fast, scalable        Variable quality

The paper compares all three using the same cultural-heritage user story, ensuring a fair test rather than a methodological food fight.

Analysis — What the paper does

Researchers created AskCQ, a dataset of 204 competency questions generated from identical source requirements, organized into five sets (question counts in parentheses):

  • Human Annotator 1 (44)
  • Human Annotator 2 (54)
  • Pattern-Based (38)
  • GPT-4.1 (26)
  • Gemini 2.5 Pro (42)

They then evaluated outputs across four dimensions:

  1. Suitability — Would ontology engineers accept this CQ?
  2. Readability — Is it easy to understand?
  3. Complexity — How structurally dense or demanding is it?
  4. Semantic Overlap — Do different methods identify similar requirements?

Which is another way of asking: Can machines ask useful business questions, or merely longer ones?
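Two of those dimensions can be sketched as simple computable proxies. The metric definitions below (average character length as a crude readability/complexity stand-in, token-level Jaccard similarity as a semantic-overlap stand-in) are my own simplifications for illustration, not the study's actual scoring protocol:

```python
def avg_length(cqs: list[str]) -> float:
    """Average character length of a CQ set (crude complexity proxy)."""
    return sum(len(q) for q in cqs) / len(cqs)

def jaccard_overlap(set_a: list[str], set_b: list[str]) -> float:
    """Token-level Jaccard similarity between two CQ sets
    (a rough semantic-overlap proxy)."""
    tokens_a = {t for q in set_a for t in q.lower().rstrip("?").split()}
    tokens_b = {t for q in set_b for t in q.lower().rstrip("?").split()}
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical CQ sets, echoing the paper's pattern: terse human
# questions versus a longer, more qualified LLM phrasing.
human = ["Which assets are on loan?", "Who created this artifact?"]
llm = ["Which assets in the collection are currently on loan to partner institutions?"]

print(avg_length(human))                  # short, readable questions
print(round(jaccard_overlap(human, llm), 2))
```

Real semantic-overlap measurement would need embeddings or human judgment; the point is only that each dimension is operationalizable and therefore auditable.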

Findings — Results with visualization

1. Human experts dominated suitability

CQ set              Acceptance rate
Human Annotator 2   98%
Human Annotator 1   91%
GPT-4.1             85%
Gemini 2.5 Pro      67%
Pattern-based       50%

Humans produced the most consistently useful questions. Pattern systems performed worst, proving that templates age poorly when reality refuses to be templated.

2. LLMs were far more complex

CQ set       Avg length (chars)   Structural complexity
Human 1      42.6                 16.8
Human 2      46.9                 18.0
GPT-4.1      111.2                37.9
Gemini 2.5   93.5                 32.0

LLMs generated questions roughly twice as long and substantially more complex.

In enterprise terms: more tokens, less clarity.

3. Human experts inferred unstated needs

Humans more often produced questions tied to implicit but necessary requirements—the things stakeholders forget to mention but later insist were obvious.

Examples include:

  • What family does this instrument belong to?
  • What format is each multimedia file?

LLMs mostly stayed closer to explicitly stated requirements.

This matters because real requirements gathering is rarely explicit. It is archaeology.

4. LLM agreement was low

Different models produced meaningfully different CQ sets from the same source material.

That means swapping one model for another may silently change your requirements baseline.

A charming feature if you enjoy governance risk.

Implications — Next steps and significance

For Business Leaders

If you use AI to gather requirements, do not treat first-pass outputs as final specifications. Treat them as brainstorming drafts.

For Data & AI Teams

Use LLMs for speed, then route outputs through subject-matter experts who can:

  • remove ambiguity
  • simplify wording
  • detect missing assumptions
  • add implicit operational constraints
  • prioritize what actually matters
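That routing step can be made concrete as a lightweight review queue. This is a minimal sketch, assuming a simple draft/accepted/rewritten status model; the CQ examples and the length-based flagging rule are hypothetical placeholders (a real pass needs a human reviewer, not a character count):

```python
from dataclasses import dataclass, field

@dataclass
class CQDraft:
    text: str
    source: str                    # e.g. "gpt-4.1", "human-annotator-1"
    status: str = "draft"          # draft -> accepted / rewritten / rejected
    notes: list[str] = field(default_factory=list)

def sme_review(draft: CQDraft, max_len: int = 80) -> CQDraft:
    """Toy review pass: flag overlong questions for simplification,
    accept the rest. Stands in for actual subject-matter judgment."""
    if len(draft.text) > max_len:
        draft.status = "rewritten"
        draft.notes.append("simplify wording; split compound question")
    else:
        draft.status = "accepted"
    return draft

drafts = [
    CQDraft("Which assets are on loan?", "human-annotator-1"),
    CQDraft("Which multimedia files associated with each catalogued artifact "
            "are missing mandatory descriptive metadata fields?", "gpt-4.1"),
]
reviewed = [sme_review(d) for d in drafts]
print([d.status for d in reviewed])
```

The design choice that matters is the explicit status field: nothing generated by a model reaches the specification without passing through a recorded review state.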

For Governance Programs

Requirements volatility is an under-discussed AI risk. If two models generate different questions from the same prompt, downstream systems may be designed against different realities.

Versioning prompts, models, and outputs should become standard practice.
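One minimal way to implement that practice is to content-address each requirements baseline. The sketch below hashes the (prompt, model, outputs) triple so that swapping any component, including the model, visibly changes the baseline identifier; the field names are illustrative, not from the paper:

```python
import hashlib
import json

def baseline_id(prompt: str, model: str, outputs: list[str]) -> str:
    """Derive a stable identifier for a requirements baseline from the
    prompt, the model that produced it, and its generated CQs. Any change
    to any component yields a different identifier."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "outputs": sorted(outputs)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

prompt = "Generate competency questions for the museum user story."
v1 = baseline_id(prompt, "gpt-4.1", ["Which assets are on loan?"])
v2 = baseline_id(prompt, "gemini-2.5-pro", ["Which assets are on loan?"])
print(v1 != v2)  # same prompt, different model: a different baseline
```

A model swap that would otherwise be silent now shows up as a diff in the baseline identifier, which is exactly the governance signal the section argues for.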

For Product Builders

The best future workflow is likely hybrid:

Human judgment + AI ideation + formal review loops

Not because humans are nostalgic, but because accountability still lacks an API.

Conclusion — Wrap-up

This paper does not show that LLMs fail at requirement elicitation. It shows they are useful junior collaborators: fast, energetic, occasionally brilliant, and in need of supervision.

Human experts still write clearer, more relevant, more strategically valuable competency questions. Machines help widen the search space; humans decide what deserves to remain inside it.

As usual, the hard part was never generating language. It was knowing what mattered.

Cognaptus: Automate the Present, Incubate the Future.