# Opening — Why this matters now
Everyone wants AI to automate the expensive, slow, deeply human parts of work. Requirements gathering is high on that list. It is also where many software and data projects quietly fail.
A recent paper, *Characterising LLM-Generated Competency Questions*, examines whether large language models can reliably generate competency questions (CQs): the structured questions used in ontology engineering to define what a knowledge system must know, answer, or reason about. In simpler terms, if you are building a knowledge graph, compliance engine, recommendation system, or enterprise AI layer, CQs translate vague business intent into testable requirements.
This is not glamorous work. It is, however, the work that determines whether your expensive AI project becomes infrastructure or decoration.
# Background — Context and prior art
Ontology engineering has long relied on human experts to ask questions like:
- Which suppliers are approved for regulated materials?
- Which patients match this treatment profile?
- Which parks meet weather and crowd preferences?
Those questions define the scope of the system.
Traditionally, creating them is manual, slow, and dependent on scarce specialists. LLMs appear to offer relief: generate dozens of requirement questions instantly, then move to implementation.
The problem is familiar. AI can produce output at scale. Whether that output is useful, complete, and trustworthy is another matter entirely.
So the researchers built CompCQ, a framework to evaluate generated questions across multiple dimensions:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Readability | Ease of understanding | Stakeholders must interpret requirements clearly |
| Complexity | Linguistic + structural sophistication | Impacts implementation effort and ambiguity |
| Relevance | Alignment to source requirements | Reduces hallucinated requirements |
| Diversity | Breadth of concepts covered | Avoids blind spots |
| Overlap | Similarity between models | Reveals consensus vs novelty |
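The paper's actual metrics are not reproduced here, but simple lexical proxies give a feel for how dimensions like readability and diversity can be operationalized. The functions below are illustrative assumptions, a minimal sketch rather than CompCQ's implementation.

```python
# Illustrative proxies for two CompCQ-style dimensions.
# These are NOT the paper's metrics -- just simple lexical stand-ins.

def readability_proxy(question: str) -> float:
    """Average word length; lower roughly means easier to read."""
    words = question.split()
    return sum(len(w) for w in words) / len(words)

def diversity_proxy(questions: list) -> float:
    """Share of unique content words across a question set (0..1)."""
    all_words = [w.lower().strip("?,.") for q in questions for w in q.split()]
    return len(set(all_words)) / len(all_words)

cqs = [
    "Which suppliers are approved for regulated materials?",
    "Which patients match this treatment profile?",
]
print(readability_proxy(cqs[0]))  # single-question readability score
print(diversity_proxy(cqs))       # lexical breadth across the small set
```

Real evaluations would use established readability formulas and embedding-based diversity, but the shape of the computation is the same: score each question, then aggregate per model and per domain.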
# Analysis — What the paper does
The study compares five models across five domains (healthcare, tourism, political journalism, cultural heritage, and music metadata):
- Gemini 2.5 Pro
- GPT-4.1
- Kimi K2
- Llama 3.1 8B
- Llama 3.2 3B
The prompting method was intentionally plain: give the requirement text, ask the model to generate competency questions, and avoid examples that might bias results. Sensible. If a model needs hand-holding to perform, that is already a result.
## Core Finding #1: Domain complexity matters more than model branding
In the healthcare use case (Personalized Depression Treatment Ontology), every model produced harder-to-read and more complex questions.
Translation: difficult business domains remain difficult even when wrapped in generative AI.
## Core Finding #2: Closed models were steadier, open models more adventurous
Gemini and GPT generally produced:
- clearer wording
- more stable relevance scores
- simpler question structures
Open models, especially Kimi K2, often produced broader and more varied outputs — useful for brainstorming, less ideal for clean first drafts.
## Core Finding #3: No single model covered the full requirement space
This is the headline executives should care about.
Different models surfaced different questions. In broad domains, overlap between model outputs was often near zero. That means relying on one model can create hidden omissions while still looking productive.
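A back-of-envelope way to see this effect is to measure set overlap between two models' outputs. The Jaccard index over normalized question strings (an assumption for illustration, not the paper's exact overlap measure) makes "near zero" concrete.

```python
# Jaccard overlap between two models' generated question sets.
# Normalizing to lowercase word tuples is an illustrative choice,
# not the paper's actual similarity measure.

def normalize(question: str) -> tuple:
    return tuple(question.lower().strip(" ?").split())

def jaccard_overlap(set_a: list, set_b: list) -> float:
    a = {normalize(q) for q in set_a}
    b = {normalize(q) for q in set_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

model_a = ["Which parks allow dogs?", "Which parks are crowded on weekends?"]
model_b = ["Which parks have low entry fees?", "Which parks allow dogs?"]
print(jaccard_overlap(model_a, model_b))  # 1 shared question of 3 unique -> ~0.33
```

Exact string matching understates semantic overlap, of course; a production check would compare embeddings. But even a crude measure like this exposes how little two models' question sets can have in common.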
Classic automation theater.
# Findings — Results with visualization
## Model Behavior Snapshot
| Model | Typical Strength | Typical Risk | Best Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | Clear, concise, readable outputs | Can be conservative | Baseline requirements drafting |
| GPT-4.1 | Balanced and reliable | Moderate topic clustering | Structured enterprise workflows |
| Kimi K2 | High novelty and breadth | Verbosity / complexity | Discovery workshops |
| Llama 3.1 8B | Useful low-cost option | Lower output volume | Lightweight internal experimentation |
| Llama 3.2 3B | Occasionally diverse | Inconsistent coverage | Controlled prototyping only |
## Strategic Pattern
Need coverage? Use multiple models.
Need clarity? Use stronger closed models first.
Need exploration? Add open models for idea spread.
Need trust? Keep humans in review.
Need shortcuts? That remains fictional.
---
## The Hidden Metric: Cost of Missing Questions
A bad generated requirement is annoying.
A missing requirement is expensive.
If your system forgets to ask:
* which jurisdiction applies,
* which edge cases break policy,
* which data source overrides another,
…you do not have an AI issue. You have an operations issue with AI branding.
# Implications — Next steps and significance
## For Business Owners
Do not buy a single-model “requirements automation” story. Ask vendors how they measure completeness, contradiction detection, and coverage gaps.
## For Product Teams
Use an ensemble workflow:
1. Generate requirements with Model A (clarity-first)
2. Challenge gaps with Model B (diversity-first)
3. Human analyst consolidates final set
4. Convert outputs into tests and governance controls
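The four steps above can be sketched as a thin pipeline. Everything here is hypothetical: `generate_clear` and `generate_diverse` are stubs standing in for real API clients, and the human review step is reduced to a filter callback.

```python
# Hypothetical ensemble workflow: clarity-first draft, diversity-first
# challenge, then human consolidation. Model calls are stubbed.

def generate_clear(requirement: str) -> list:
    # Stand-in for a clarity-first model (e.g. a strong closed model).
    return [f"Which entities satisfy: {requirement}?"]

def generate_diverse(requirement: str) -> list:
    # Stand-in for a diversity-first model probing for gaps.
    return [f"Which edge cases break: {requirement}?",
            f"Which jurisdiction governs: {requirement}?"]

def ensemble_cqs(requirement: str, human_review) -> list:
    draft = generate_clear(requirement)               # step 1: baseline draft
    challenges = generate_diverse(requirement)        # step 2: widen coverage
    merged = list(dict.fromkeys(draft + challenges))  # dedupe, keep order
    return [q for q in merged if human_review(q)]     # step 3: human curation

# Step 4 (tests and governance controls) would consume the approved list.
approved = ensemble_cqs("supplier approval for regulated materials",
                        human_review=lambda q: True)
print(approved)
```

The design point is that the human filter sits after the merge, not after each model: the analyst sees the union of both models' views before deciding what survives.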
## For AI Governance Leaders
This paper quietly supports a broader truth: **AI assurance is not just about outputs. It is about omitted outputs.**
Many governance failures come from what the system never considered.
## For Cognaptus-style Operators
There is commercial value in becoming the layer between raw models and reliable operations:
* requirement generation pipelines
* multi-model validation workflows
* domain-specific knowledge structuring
* human review orchestration
* ROI-linked automation governance
That is where margin lives. Not in yet another wrapper with a gradient background.
# Conclusion — Wrap-up and tagline
This study is useful because it avoids magical thinking. LLMs can absolutely accelerate requirements engineering. They can surface ideas, broaden scope, and reduce manual effort.
But they do not eliminate judgment. They redistribute it.
The winning operating model is not human *or* AI. It is multi-model generation, human curation, and measurable controls.
Machines can draft the questions.
Professionals still decide which questions matter.
**Cognaptus: Automate the Present, Incubate the Future.**