A server goes down. Not a poetic metaphor. An actual server.
In the paper’s SAP scenario, Server 003 is offline. At first, this sounds like a routine IT incident: check connectivity, inspect logs, restart services, escalate if necessary. The sort of answer a general LLM can produce in tidy bullet points before congratulating itself for being helpful. The problem is that the server is not just “a server.” It runs the LE-DEL module for Logistics Execution — Delivery and Returns. Its failure brings down Dispatching Bay 17. The bay handles high-value shipments. In one prompt variant, downtime can cost $2.4 million in three hours. In another, chemical product containers may pile up against regulatory limits.
That is where the paper becomes interesting. Not because it says “add more context,” which is now the enterprise AI equivalent of saying “drink more water.” It argues that agentic AI needs a validated semantic layer: a structured representation of what things mean inside the organisation, how they relate, which constraints matter, and why a decision can be justified before or after action.[1]
The key shift is subtle but important. The authors are not merely asking whether an LLM can explain itself. They are asking whether an enterprise agent can make decisions that are *justifiable* against explicit evidence, domain concepts, operational rules, and human-validated knowledge. Explanation tells a story after the fact. Justification makes the decision answerable to a structured record. One is a press release. The other is closer to an audit trail.
The Server 003 case shows why generic “helpfulness” is not enough
The paper’s experiment uses three models: ChatGPT 4o, Gemini 2.0 Flash Thinking, and Gemma3 27B. Each receives a sequence of eight prompts across five testing cycles. Tests 1 to 6 gradually add context. Tests 7 and 8 are “super prompts,” adding substantially richer operational detail all at once.
The prompt sequence begins with the simplest question:
What do I do if Server 003 is down?
Then it adds role context: act as an SAP monitoring expert. Then it adds that SAP runs Logistics Execution — Delivery and Returns on the server. Then it adds Dispatching Bay 17. Then financial exposure. Then the preventive clue: increasing the ID range on Server 003 would have prevented the downtime. The super prompts go further, describing the chemical manufacturing setting, downstream effects, compliance risks, dispatch rates, warehouse constraints, and recommended mitigation categories.
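The shape of Tests 1 to 6 can be sketched as a ladder of prompts, each adding one more piece of context to the same question. This is only an illustration: the increment wording is paraphrased from the description above rather than taken from the paper's exact prompts, the strictly cumulative stacking is an assumption, and `build_prompt_ladder` is a hypothetical helper.

```python
# Paraphrased context increments for Tests 2-6; NOT the paper's exact prompt text.
BASE_QUESTION = "What do I do if Server 003 is down?"

CONTEXT_INCREMENTS = [
    "Act as an SAP monitoring expert.",
    "SAP runs Logistics Execution (Delivery and Returns) on Server 003.",
    "The outage brings down Dispatching Bay 17.",
    "Downtime carries significant financial exposure.",
    "Increasing the ID range on Server 003 would have prevented the downtime.",
]


def build_prompt_ladder(base: str, increments: list[str]) -> list[str]:
    """Return one prompt per test, each adding one more piece of context."""
    prompts = [base]  # Test 1: the bare question
    for i in range(1, len(increments) + 1):
        context = " ".join(increments[:i])
        prompts.append(f"{context} {base}")
    return prompts


ladder = build_prompt_ladder(BASE_QUESTION, CONTEXT_INCREMENTS)
for test_number, prompt in enumerate(ladder, start=1):
    print(f"Test {test_number}: {prompt}")
```

Keeping the base question fixed and varying only the context is what lets a later comparison attribute rating changes to the added context rather than to a reworded question.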
This progression matters because it separates three things that business readers often blur together:
| Prompt condition | What the model can do | What it still lacks |
|---|---|---|
| Generic server outage | Produce standard IT troubleshooting advice | Operational meaning of the asset |
| SAP expert role | Use more domain-appropriate vocabulary | Specific business impact and constraints |
| Logistics + dispatch context | Connect the outage to process disruption | Full risk profile and decision priorities |
| Rich ontological context | Relate systems, processes, costs, constraints, and stakeholders | Still depends on quality of provided knowledge |
The first answer can be relevant. It can even be coherent. But it does not know why this server matters. “Check the logs” is not wrong. It is just thin. Enterprise failure rarely comes from not knowing that logs exist. It comes from not knowing which dependency, exception, process owner, cost exposure, or compliance rule turns an ordinary alert into a business-critical decision.
This is the paper’s strongest practical intuition: LLMs are already good at staying on topic. The experiment reports perfect average relevance scores of 5.0 for all three models. That sounds impressive until you notice the trap. Relevance is the low bar. A model can remain perfectly relevant while still being operationally under-informed.
In business terms, the LLM knows what meeting it is in. It does not necessarily know who can sign the purchase order, which plant is blocked, or why “just restart it” might be a dangerous sentence.
The proposed architecture is not “ontology magic”; it is a three-loop governance system
The paper describes an integration between OntoKai, a knowledge orchestration and ontology platform, and Avantra AIR, an AIOps-style agentic system for enterprise operations such as SAP monitoring. The architecture is neuro-symbolic in the practical sense: neural models help process and generate candidate structures, while symbolic knowledge models provide explicit concepts, relationships, rules, and evidence paths.
The authors organise the system into three improvement cycles.
First, the **Knowledge Graph cycle** turns enterprise knowledge into structured models. Raw documents, datasets, and institutional knowledge are ingested. LLMs may propose candidate knowledge architectures. Domain experts then validate, correct, and publish those structures. This matters because the hard work is not simply data extraction. The hard work is deciding what the business means by “server,” “dispatch bay,” “container,” “downtime,” “regulatory limit,” or “high-value shipment” when those terms live across multiple systems and departments.
Second, the **Insight cycle** uses the validated semantic context to improve prompts, responses, and follow-up questions. Better context produces better answers. Better answers can improve human inquiry. Better inquiry creates more useful context. That feedback loop sounds almost embarrassingly sensible, which is probably why many enterprise AI programmes skip it and then wonder why their chatbot behaves like a consultant who read only the first page of the ERP manual.
Third, the **Governance cycle** introduces the agentic justification loop. Here the semantic model becomes more than background context. It becomes an inspectable instruction layer and a record against which agent decisions can be reviewed.
The interesting part is that the authors frame explainability through **justification**, using a Toulmin-style argument structure. Instead of asking only “which features influenced the model output?”, the system is asked to record elements such as:
| Justification element | Operational question it answers |
|---|---|
| Claim / decision | What action or conclusion is being proposed? |
| Grounds / evidence | What facts support the decision? |
| Warrant | Why do those facts justify this action? |
| Backing | Why should the warrant be trusted? |
| Rebuttals | Under what conditions might this reasoning fail? |
| Qualifiers | What limits should be attached to the decision? |
This is a better fit for enterprise governance than classic post-hoc explainability alone. SHAP values and saliency maps may help technical teams inspect model behaviour, but they usually do not tell an operations manager why a decision is acceptable under a specific business rule. The paper is blunt on this distinction: its justification approach is less about explaining how the AI internally produced a result and more about whether the result can be justified to a human in the loop, given evidence, reasoning, semantic context, and alternatives.
That is a large conceptual move. It changes the governance object from “the model’s mind,” which we cannot fully inspect, to “the decision record,” which we can structure, challenge, and audit.
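To make "the decision record" concrete, here is a minimal sketch of the six Toulmin elements as a data structure. The class name, field types, and the completeness check are assumptions for illustration, not the schema used by OntoKai or Avantra AIR. The example values reuse the Server 003 scenario, and the rebuttal is hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class JustificationRecord:
    """Toulmin-style record of one agent decision (illustrative schema)."""
    claim: str
    grounds: list[str]
    warrant: str
    backing: str
    rebuttals: list[str] = field(default_factory=list)
    qualifiers: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Names of elements left empty, so a reviewer can reject thin records."""
        return [name for name, value in asdict(self).items() if not value]


record = JustificationRecord(
    claim="Escalate the Server 003 outage as a business-critical incident.",
    grounds=[
        "Server 003 is offline.",
        "Server 003 runs the LE-DEL module.",
        "Dispatching Bay 17 depends on LE-DEL.",
    ],
    warrant="An outage that blocks a high-value dispatch bay has impact beyond IT.",
    backing="Dependency recorded in the human-validated knowledge model.",
    rebuttals=["A manual dispatch workaround is already active."],  # hypothetical
    qualifiers=[],  # deliberately empty to show the completeness check
)

print(record.missing_elements())
print(json.dumps(asdict(record), indent=2))
```

The point of the sketch is that a record like this can be serialised, stored, and challenged element by element, which is something a fluent paragraph of explanation does not offer.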
The evidence supports a threshold effect, not a fairy tale about ontologies
The empirical section is useful, but it needs careful reading.
The study evaluates accuracy, coherence, and relevance on 0–5 scales. Accuracy measures factual correctness and completeness. Coherence measures logical structure and internal consistency. Relevance measures whether the response addresses the query without drifting into unrelated content.
Across all models and tests, average accuracy is already high: ChatGPT 4o averages 4.65, Gemini 2.0 Flash Thinking 4.63, and Gemma3 27B 4.60. Average coherence is similarly high: Gemini 2.0 Flash Thinking averages 4.65, while ChatGPT 4o and Gemma3 27B each average 4.60. Relevance is perfect across the board, at 5.0 for all three models.
So the result is not “LLMs are useless until ontologies save them.” The baseline systems are already competent. The question is whether richer context moves them from competent generic response to more complete operational reasoning.
The headline result is the shift from Test 1 to Test 8. For average accuracy, all three models move from 4.0 to 5.0. The same pattern appears for coherence. The authors interpret this as a 25% improvement from low context to rich super-prompt context.
The statistical analysis is more informative than the averages. The authors run sign tests on paired comparisons across 15 model-cycle combinations: three models times five cycles. They compare Test 1 vs Test 8, Test 1 vs Test 6, and Test 6 vs Test 8.
| Comparison | Likely purpose | Result | What it supports | What it does not prove |
|---|---|---|---|---|
| Test 1 vs Test 8 | Main evidence for full context enhancement | 15/15 combinations improved for both accuracy and coherence; p < 0.0001 | Rich context is associated with consistently better ratings | That ontology structure alone caused the gain |
| Test 1 vs Test 6 | Sensitivity test for gradual context addition | 3/15 improved for accuracy and 3/15 for coherence; not statistically significant | Small incremental additions did not reliably move performance | That gradual context is useless in every setting |
| Test 6 vs Test 8 | Main evidence for threshold / super-prompt effect | 12/15 improved for both metrics; p = 0.0005 | Large context jumps produced meaningful improvement | That longer prompts are always better |
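The p-values in the table can be sanity-checked with an exact sign test. The sketch below assumes a two-sided test in which pairs that did not improve were ties and were dropped, as is standard for sign tests. That assumption is mine, not something the summary above states, but it is the reading under which the reported figures line up. `sign_test_p` is a hypothetical helper.

```python
from math import comb


def sign_test_p(improved: int, worsened: int) -> float:
    """Two-sided exact sign test p-value; ties must already be excluded."""
    n = improved + worsened
    if n == 0:
        return 1.0  # no informative pairs at all
    k = max(improved, worsened)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


comparisons = {
    "Test 1 vs Test 8": (15, 0),
    "Test 1 vs Test 6": (3, 0),   # assumes the other 12 pairs were ties
    "Test 6 vs Test 8": (12, 0),  # assumes the other 3 pairs were ties
}
for label, (improved, worsened) in comparisons.items():
    print(f"{label}: p = {sign_test_p(improved, worsened):.6f}")
```

Under these assumptions the three values come out at roughly 0.00006, 0.25, and 0.0005, matching the table. If the three non-improving pairs in Test 6 vs Test 8 had been declines rather than ties, the same function would give about 0.035, so the tie reading matters.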
This is where the paper is most valuable for business readers. The result is not “add a sentence of context and enjoy governance.” The result is closer to: shallow context may make the answer sound more domain-aware without changing much; rich, structured, operationally specific context can change the quality of response.
That distinction matters for implementation budgets. Many firms try to improve AI workflows by adding a role instruction, a few policy excerpts, or a longer prompt template. The paper suggests that the real lift comes from representing operational knowledge at the level where the model can connect systems, processes, constraints, losses, responsibilities, and exceptions.
The annoying part, naturally, is that this is exactly the work organisations hoped LLMs would let them avoid.
What changed in the answer: from troubleshooting checklist to business decision
The Server 003 sequence is a useful case because the problem becomes progressively less technical and more organisational.
At the generic level, the model gives reasonable IT support advice: confirm the outage, check connectivity, inspect service status, gather error messages, escalate if necessary. This is not bad. It is just generic.
Once SAP and logistics context enter the prompt, the answer begins to recognise affected modules, business processes, and stakeholder groups. With Dispatching Bay 17 and financial exposure, the response shifts toward prioritisation: restore the server, activate workarounds, communicate with operations, reduce shipment backlog, and consider customer and compliance implications.
With the richer super-prompt context, the model can calculate backlog-style implications, reason about warehouse constraints, and identify that the incident is no longer merely an uptime problem. It is a decision problem across IT, logistics, finance, compliance, and customer operations.
That is the point of an ontology in this paper. Not a decorative graph. Not a taxonomy built so a consultant can bill for rectangles. A working ontology is a memory structure for the organisation: what exists, what depends on what, what counts as evidence, what constraints govern action, and which exceptions matter.
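The "what depends on what" part can be sketched as a small dependency graph walked from the failed asset outward. The node names come from the Server 003 scenario, but the flat dictionary layout and the `downstream_impact` helper are assumptions for illustration, not OntoKai's data model.

```python
from collections import deque

# Illustrative "supports" edges taken from the Server 003 scenario.
# A real ontology would carry typed relations, constraints, and evidence links.
SUPPORTS = {
    "Server 003": ["LE-DEL module"],
    "LE-DEL module": ["Dispatching Bay 17"],
    "Dispatching Bay 17": ["High-value shipments", "Regulatory container limits"],
}


def downstream_impact(graph: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk returning everything that depends on `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:  # guard against cycles and duplicates
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


print(downstream_impact(SUPPORTS, "Server 003"))
```

Even this toy version shows the difference in kind: "check the logs" needs no graph, while "which business consequences follow from this alert" is a traversal over validated relationships.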
The authors also include examples beyond the SAP incident. OntoKai is described as supporting taxonomy and ontology work for National Highways and knowledge integration at Howdens Joinery. Avantra AIR is described across SAP monitoring, chemical manufacturing, manufacturing environments, and managed service provider contexts. These examples are not controlled experiments. They are deployment context: useful for understanding where the architecture is meant to operate, not proof that every sector gets the same performance gain.
The controlled evidence remains the prompt experiment. The broader examples explain why the architecture is commercially plausible.
The business value is not “better answers”; it is cheaper escalation and cleaner accountability
A conventional reading of the paper would say: ontology-derived context improves LLM accuracy and coherence. True, but a little too polite.
The sharper business interpretation is this: structured context reduces the gap between an alert and an accountable decision.
In an incident workflow, time is lost in translation. A monitoring tool emits an alert. An engineer interprets it. Someone searches documentation. Someone asks what business process is affected. Someone finds out which team owns the workaround. Someone checks whether a compliance constraint applies. Someone writes a status update. Someone later reconstructs what happened and why.
Agentic AI promises to compress that chain. But without a semantic layer, it compresses the wrong thing. It produces fluent recommendations while still forcing humans to verify the hidden context. That is not automation; that is a beautifully formatted liability.
The paper’s pathway is more operational:
- Convert tacit operational knowledge into inspectable knowledge models.
- Let AI propose structures, but require human validation before publication.
- Feed validated context into the agentic decision process.
- Require decisions to expose evidence, warrants, rebuttals, and qualifiers.
- Store decisions and justification records for review, audit, and continuous improvement.
The ROI logic, therefore, is not only model performance. It is reduced escalation time, fewer repeated investigations, better continuity when experienced staff leave, clearer audit records, and safer delegation of routine actions.
That last phrase is important: *routine actions*. The paper does not prove that enterprises should hand over high-risk decisions to autonomous agents. It argues for a system in which some lower-risk actions can be recorded and reviewed, while higher-risk actions can remain human-in-the-loop. That is a governance design, not an autonomy fantasy. How refreshing. Almost suspiciously adult.
The misconception to avoid: this does not isolate “ontology” from “more information”
There is one trap in reading the evidence.
It would be tempting to say the experiment proves that ontologies outperform ordinary prompting. The authors do not quite show that. They explicitly note that the experimental design did not isolate ontological structuring from information quantity or relevance. In plain English: Test 8 is not only more structured; it is also much richer. It contains more domain details, more constraints, more business consequences, and more operational specificity.
A stricter experiment would compare at least three conditions:
| Condition | What it would test |
|---|---|
| Minimal prompt | Baseline model capability |
| Long unstructured domain context | Effect of information volume and relevance |
| Structured ontology-derived context | Added value of semantic organisation itself |
The paper mostly compares minimal or gradually expanded prompts against rich context. That is still useful. It tells us that serious context changes output quality. But it does not fully separate whether the improvement comes from structure, quantity, relevance, or all three together.
For business adoption, this limitation is not fatal. It simply changes the claim. The safe claim is:
Rich, operationally specific, validated context appears to improve LLM response accuracy and coherence in this SAP/AIOps case study.
The stronger claim would be:
Ontological structure independently causes the improvement.
The paper makes the first claim much more securely than the second. Business readers should care because procurement decks love causal certainty. Reality, inconsiderately, prefers experimental controls.
The boundary conditions are practical, not cosmetic
The limitations section is unusually important because it tells us where the architecture may become expensive or fragile.
First, the evaluations use subjective 0–5 rating scales, and inter-rater reliability is not reported. This means the ratings are interpretable as structured expert judgement, not as a large-scale benchmark with independently verified scoring. The sign-test results are still useful, but the measurement foundation is not as strong as it could be.
Second, the case study does not provide full security architecture details, formal compliance assessments, or detailed data protection protocols. The paper says enterprise deployment would require further evaluation of network security, encryption, backup procedures, and formal risk assessment. That matters because a semantic layer can become a high-value map of the organisation. If compromised, it does not merely leak data; it leaks how the business understands itself.
Third, governance consensus is not automatic. Ontologies force organisations to make definitions explicit. That is good. It is also politically inconvenient. Different teams may disagree about categories, responsibilities, exceptions, or acceptable trade-offs. The paper recognises that tooling must allow conflicting positions, definitions, and intentions to be represented and audited, not magically resolved by a graph database wearing a halo.
Fourth, scale introduces a coordination problem. If many agents operate across a shared semantic substrate, they need explicit rules for consensus, tolerance, delegation, and review. Too much human review creates bottlenecks. Too little turns governance into theatre. The useful design question is not “human-in-the-loop or fully autonomous?” It is “which classes of action require which level of justification and approval?”
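One way to picture that design question is a policy table that maps each class of action to an approval level and to the justification elements it must carry. Everything in this sketch is hypothetical: the two action classes, the required-element sets, and the `route` function are illustrative assumptions, not rules taken from the paper.

```python
from enum import Enum


class Approval(Enum):
    AUTO_WITH_RECORD = "act, then log the justification for later review"
    HUMAN_APPROVAL = "propose, then wait for a named approver"


ALL_ELEMENTS = {"claim", "grounds", "warrant", "backing", "rebuttals", "qualifiers"}

# Hypothetical policy: action class -> (approval level, required elements).
POLICY = {
    "routine": (Approval.AUTO_WITH_RECORD, {"claim", "grounds"}),
    "high_risk": (Approval.HUMAN_APPROVAL, ALL_ELEMENTS),
}


def route(action_class: str, provided: set[str]) -> Approval:
    """Pick an approval level; unknown classes get the strictest tier."""
    approval, required = POLICY.get(action_class, POLICY["high_risk"])
    if required - provided:
        # An incomplete justification is never allowed to run unattended.
        return Approval.HUMAN_APPROVAL
    return approval


print(route("routine", {"claim", "grounds"}))
print(route("routine", {"claim"}))
print(route("unclassified", ALL_ELEMENTS))
```

The design choice worth noting is the default: anything unclassified or under-justified falls back to human approval, so the bottleneck risk is paid only where the policy has not yet been made explicit.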
That is where the paper’s proposed evidence-framework extension becomes relevant. The authors suggest adding evidence quality dimensions such as face, criterion, construct, and content validity, along with profiles for comprehensiveness, relevance, objectivity, quantity, and consistency. This would make the justification loop less dependent on verbal plausibility and more grounded in evidence quality. In enterprise AI, this is the boring work that eventually saves you from expensive nonsense.
A practical adoption map for enterprise teams
For a firm considering agentic AI in ERP, AIOps, compliance-heavy operations, or knowledge-retention workflows, the paper suggests a staged adoption path.
| Stage | Concrete work | Main business benefit | Main risk |
|---|---|---|---|
| Knowledge capture | Extract processes, entities, constraints, dependencies, and exceptions | Less institutional amnesia | Political fights over definitions |
| Human validation | Review and correct AI-generated candidate models | Higher trust in knowledge layer | Slow expert bottlenecks |
| Contextual prompting | Feed validated semantic context into LLM workflows | More accurate and coherent responses | Overlong prompts without prioritisation |
| Justification loop | Record claims, evidence, warrants, rebuttals, and qualifiers | Auditability and accountable delegation | Plausible but weak justifications |
| Governance scaling | Classify which decisions need approval, review, or automation | Faster operations with controlled autonomy | Agent coordination failures |
This map also shows why the paper is relevant beyond SAP. The same pattern appears wherever operational decisions depend on local meanings: insurance claims, hospital administration, logistics, energy operations, finance controls, public-sector case handling, and regulated manufacturing. The model’s general intelligence is not the scarce ingredient. The scarce ingredient is validated organisational context.
Of course, that also means the solution is not plug-and-play. A vendor cannot arrive on Monday, ingest SharePoint by Tuesday, and deliver accountable autonomy by Friday unless the organisation’s hidden knowledge was already miraculously clean. It was not.
Conclusion: governance begins before the agent acts
The paper’s best contribution is not that it celebrates ontologies. Ontologies have been around long enough to have earned both respect and eye-rolls. Its stronger contribution is showing how ontology-derived context can support a justification layer for agentic AI decisions: evidence, reasoning, rebuttals, qualifiers, and auditability embedded into the workflow rather than pasted on afterward.
The Server 003 case makes the point plainly. A generic LLM can tell you to check whether the server is down. A context-rich agent can understand that the server supports a logistics module, that a dispatch bay is blocked, that downtime has financial and regulatory consequences, and that the decision record must remain inspectable.
That is the difference between chatbot assistance and operational governance.
The paper does not prove that ontologies alone are the magic ingredient. It does not prove universal ROI. It does not eliminate security, scaling, or organisational politics. What it does show is more grounded and more useful: enterprise agents need structured context if their decisions are to be not only fluent, but answerable.
And answerable is the word that matters. In business, an AI system that cannot justify its action is not autonomous. It is just unsupervised.
**Cognaptus: Automate the Present, Incubate the Future.**
[1] Liam McGee, James Harvey, Lucy Cull, Andreas Vermeulen, Bart-Floris Visscher, and Malvika Sharan, “Enabling Ethical AI: A case study in using Ontological Context for Justified Agentic AI Decisions,” arXiv:2512.04822, 2025.