From Black-Box to Boarding Gate: When LLMs Finally Learn to Show Their Work

Airports are where ordinary corporate coordination problems go to become expensive.

A delayed data update is not just an “alignment issue.” A vague handoff is not just “cross-functional friction.” A misunderstood phrase can move aircraft, ground crews, gates, passengers, baggage, and regulatory responsibility in the wrong order. Aviation has a talent for making management consultants’ favorite words suddenly physical. Very inconsiderate of it.

That is why the useful question is not whether an LLM can read an airport operations manual. Of course it can. The useful question is whether the system can show exactly which source sentence produced which operational claim, which stakeholder owns which step, and which dependency comes before another when the document itself describes the process out of order.

A new paper, Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management, studies precisely this problem using the EUROCONTROL Airport Collaborative Decision-Making, or A-CDM, milestone manual as the test case.¹ Its answer is not “make the model smarter.” The paper’s answer is more disciplined and more useful: put the model inside a scaffold of domain ontology, force it to output structured triples, anchor every extraction to source text, and only then convert the result into operational artifacts such as swimlane diagrams.

That is the right order. First constrain meaning. Then extract. Then visualize. Do not ask the chatbot to become an airport operations analyst, as though enterprise AI demos had taught us nothing.

The real problem is not reading documents; it is making extracted knowledge accountable

Most enterprise AI discussions still treat documentation as a retrieval problem. The company has manuals, PDFs, tickets, SOPs, call transcripts, and spreadsheets. The LLM searches them. A user asks a question. The system returns an answer with a pleasant tone and an optional citation badge. Everyone briefly feels modern.

In airport operations, that is not enough.

The paper starts from a stricter premise: airport knowledge is fragmented across stakeholders, terminology, systems, and incentives. Airlines, air traffic control, ground handlers, airport operators, and service providers may all participate in a single operational sequence while using partially different vocabularies and holding different slices of the process. The result is not merely messy documentation. It is semantic risk.

A knowledge graph matters here because it is not just another storage format. It gives the system a controlled vocabulary of entities and relationships. In this paper, the authors focus on a deliberately small operational slice: three classes — Procedure, Sequenced_Item, and Stakeholder — and two object properties — hasNext and hasStakeholder. That narrow schema choice is important. The goal is not to model the entire airport universe in one heroic ontology sprint. The goal is to test whether an LLM can populate a disciplined procedural graph from an operational manual.

The distinction matters because ordinary LLM extraction produces plausible text. This pipeline tries to produce accountable structure.
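The narrow schema described above is small enough to write down and enforce in a few lines. The following is a minimal sketch, not the authors' code: the class and property names come from the paper's schema, while the domain and range assignments, the `Triple` dataclass, the `validate` helper, and the example entities are assumptions made for illustration.

```python
from dataclasses import dataclass

# Class and property names come from the schema described above. The domain and
# range assignments below are assumptions made for this sketch.
CLASSES = {"Procedure", "Sequenced_Item", "Stakeholder"}
PROPERTIES = {
    # property: (allowed subject classes, allowed object classes)
    "hasNext": ({"Procedure", "Sequenced_Item"}, {"Procedure", "Sequenced_Item"}),
    "hasStakeholder": ({"Procedure", "Sequenced_Item"}, {"Stakeholder"}),
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def validate(triple, entity_types):
    """Return the schema violations for one triple (empty list if it is valid)."""
    if triple.predicate not in PROPERTIES:
        return [f"unknown property {triple.predicate!r}"]
    domain, range_ = PROPERTIES[triple.predicate]
    problems = []
    for role, name, allowed in (("subject", triple.subject, domain),
                                ("object", triple.obj, range_)):
        found = entity_types.get(name)
        if found not in allowed:
            problems.append(f"{role} {name!r} is {found}, expected {sorted(allowed)}")
    return problems


# Hypothetical entities, typed by hand for illustration.
entity_types = {"Issue TSAT": "Procedure", "Start-up": "Procedure", "ATC": "Stakeholder"}
assert set(entity_types.values()) <= CLASSES

ok = Triple("Issue TSAT", "hasStakeholder", "ATC")
bad = Triple("ATC", "hasNext", "Issue TSAT")
print(validate(ok, entity_types))   # no violations
print(validate(bad, entity_types))  # one violation: a Stakeholder cannot be the subject of hasNext
```

A check like this is what lets a reviewer reject a malformed triple before anyone debates whether it is operationally true.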

Usual document AI question	This paper’s stricter question
Can the model summarize the procedure?	Can it extract the procedure as graph triples?
Can it answer a user’s query?	Can every extracted relationship be traced to source text?
Can it read long documents?	Does longer context recover procedural dependencies rather than burying them?
Can it produce a diagram?	Can the diagram inherit stakeholder attribution and provenance from the graph?

This is the paper’s central move: it shifts the unit of trust from the model’s response to the extracted, source-anchored relationship. That is less glamorous than a conversational copilot. It is also much closer to how operational systems actually earn trust.

The mechanism is symbolic scaffolding first, LLM extraction second

The architecture is best understood as a control system around the LLM.

The pipeline begins with operational data ingestion: the A-CDM milestone manual is converted from PDF into text. Then the system applies what the authors call scaffolded symbolic fusion. An expert-curated knowledge graph and ontology provide the formal structure for prompts and few-shot examples. LangExtract then performs structured triple extraction. Finally, the extracted triples are integrated into a knowledge graph and converted into stakeholder-attributed swimlane diagrams.

That sequence looks simple, but the ordering is the point.

Operational manual
      ↓
Ontology and curated KG scaffold
      ↓
Schema-guided LangExtract triple extraction
      ↓
Source-anchored operational KG
      ↓
Stakeholder-attributed swimlane diagram

The LLM is not being asked to improvise the domain model. It is being asked to fill a structure already shaped by domain knowledge. That is why the paper is more interesting than another “LLMs extract information from PDFs” experiment. The authors are not treating the model as a genius intern. They are treating it as a probabilistic extraction component inside a governed pipeline.

The symbolic part supplies constraints: what classes exist, which relationships are expected, and what a valid output should look like. The LLM supplies scalability: it reads unstructured procedural text and proposes candidate entities and relationships. LangExtract then adds a crucial practical feature: source grounding. Each extracted item is linked back to the text span that supports it.

This hybrid design is the paper’s most transferable idea.

Pipeline component	Technical role	Operational consequence	ROI relevance
Expert ontology / seed KG	Defines valid entities and relationships	Reduces semantic drift across stakeholders	Less manual reconciliation during process mapping
Few-shot, schema-guided extraction	Constrains LLM outputs into triples	Makes outputs machine-readable without heavy cleanup	Faster conversion from manuals to structured workflows
Deterministic source anchoring	Links triples to supporting text	Makes audits and reviews possible	Lower review cost and stronger compliance evidence
KG-to-swimlane generation	Converts graph dependencies into visual lanes	Makes ownership and sequence visible	Better training, incident review, and handoff diagnosis

The quiet lesson: governance is not a policy document attached after the AI system is built. Governance is in the shape of the extraction task.

Long context wins because the process is not written in the order it happens

The paper’s most counterintuitive empirical result is that long-context inference performs better than page-level inference.

That matters because many readers now carry a default suspicion about long context: yes, the model can technically accept the tokens, but it may lose information in the middle, confuse dependencies, or produce structurally weaker outputs. Chunking feels safer. Smaller context means denser attention. Simple story. Unfortunately, operations manuals do not always cooperate with simple stories.

The authors compare two configurations on a 16-page A-CDM manual of roughly 10,000 tokens:

Short-context inference: process the document page by page.
Long-context inference: process the entire manual in one pass.

They evaluate extracted triples against a manually curated ground-truth knowledge graph. The metrics are standard:

$$ \text{Precision} = \frac{TP}{TP + FP} $$

$$ \text{Recall} = \frac{TP}{TP + FN} $$

$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

The reported results are high in both settings, but the long-context setting is better.

Metric	Short context	Long context	Interpretation
True positives	440	442	Long context recovers slightly more correct triples
False positives	18	15	Long context also reduces erroneous extractions
False negatives	13	8	The biggest gain is fewer omissions
Precision	0.961	0.967	Both are strong; long context is marginally cleaner
Recall	0.971	0.982	Long context captures more required relationships
F1 score	0.966	0.975	Overall extraction quality improves

The false-negative reduction is the most meaningful number: from 13 missed triples to 8, a 38.5% reduction. That is not a revolution, but it is a useful signal. In procedural systems, omissions are often more dangerous than extra noise because a missing dependency can disappear from the downstream process map.
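The reported figures can be reproduced directly from the true-positive, false-positive, and false-negative counts in the table. The short sketch below does only that arithmetic; the function name `prf` is an illustrative choice, and the counts are the ones reported above.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw triple counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


short_context = prf(tp=440, fp=18, fn=13)
long_context = prf(tp=442, fp=15, fn=8)
fn_reduction = (13 - 8) / 13  # relative drop in missed triples

print([round(x, 3) for x in short_context])  # [0.961, 0.971, 0.966]
print([round(x, 3) for x in long_context])   # [0.967, 0.982, 0.975]
print(f"{fn_reduction:.1%}")                 # 38.5%
```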

The paper’s explanation is practical. A-CDM procedures often contain logical inversions: the effect may be described before the procedure that causes it, or a dependency may be spread across nearby sections. Page-level extraction can break those dependencies at arbitrary boundaries. Long-context extraction lets the model see enough surrounding material to reorder cause and effect correctly.

So the lesson is not “long context is always better.” That would be too convenient, and therefore suspicious. The lesson is narrower: when the document describes a tightly coupled process whose dependencies cross local boundaries, chunking can damage the very structure the system is supposed to extract.

In this case, the cost of fragmentation was higher than the cost of a longer context window.
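A toy count makes the fragmentation argument concrete. The sketch below is purely illustrative: the sentence identifiers, page assignments, and dependency pairs are invented, not taken from the A-CDM manual. It shows only that any dependency whose two ends sit on different pages is invisible to a page-by-page extractor.

```python
# Hypothetical toy data: which page each sentence sits on.
sentence_page = {"s1": 1, "s2": 1, "s3": 2, "s4": 2, "s5": 3}

# Hypothetical dependencies as (sentence describing the effect, sentence
# describing the cause). The effect may appear first in the text.
dependencies = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s5")]


def severed_by_paging(dependencies, sentence_page):
    """Dependencies whose two sentences fall on different pages."""
    return [(a, b) for a, b in dependencies if sentence_page[a] != sentence_page[b]]


lost = severed_by_paging(dependencies, sentence_page)
print(lost)                                   # [('s2', 's3'), ('s4', 's5')]
print(f"{len(lost)}/{len(dependencies)} dependencies cross a page boundary")
```

Running the same count on a real manual, with dependencies taken from a curated ground truth, would be one way to decide whether chunking is safe before committing to it.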

The experiment is the main evidence; the diagrams are operationalization, not decoration

The paper includes several figures and tables, but they do not all serve the same evidentiary purpose. Treating them as one flat set of “results” would blur the argument.

Paper element	Likely purpose	What it supports	What it does not prove
Architecture diagram	Implementation detail / mechanism explanation	The pipeline integrates ingestion, symbolic scaffolding, extraction, and artifact synthesis	It does not independently validate extraction accuracy
Protégé schema	Implementation detail / scope definition	The KG is grounded in a formal airport operations schema	It does not prove the schema covers all airport operations
LangExtract prompt and example output	Implementation detail / reproducibility support	Extraction is schema-guided and source-grounded	It does not prove generalization across manuals
Table I extraction metrics	Main evidence	Long-context and short-context extraction both achieve high precision, recall, and F1 on the A-CDM manual	It does not prove production readiness across airports
Table II provenance alignment	Main evidence for traceability	Extracted triples are mapped back to source sentences; fuzzy matches carry more false-positive risk	It does not eliminate the need for expert review
Swimlane diagram	Downstream operational artifact	KG triples can be transformed into stakeholder-attributed workflow visuals	It does not prove the diagram improves operational performance
Future video analytics figure	Exploratory extension	Shows a possible path toward real-time deviation monitoring	It is not evidence that the current system works on live multimodal data

This matters for business interpretation. The strongest demonstrated result is not “the system can manage airport operations.” The strongest demonstrated result is more precise: from one A-CDM manual, using a manually curated ground truth and Gemini-2.5-flash through LangExtract, the pipeline can extract procedural triples with high fidelity and preserve sentence-level provenance.

That is already useful. Inflating it would only make it less useful.

The swimlane diagram contribution should be read as operationalization. Once the graph contains hasNext and hasStakeholder relationships, the system can place procedures into lanes by responsible stakeholder and sequence them by dependency. The algorithm uses graph traversal logic — a modified topological sorting approach with breadth-first search — to position nodes and arrows.
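The lane-and-sequence placement can be sketched with a standard breadth-first topological sort (Kahn's algorithm). This is a generic version, not the authors' modified algorithm: the `layout` function, the "Unassigned" fallback lane, and the example procedures and stakeholders are all assumptions for illustration. Each node gets a lane from its hasStakeholder triple and a column equal to its longest distance from a starting node.

```python
from collections import defaultdict, deque


def layout(has_next, has_stakeholder):
    """Map each procedure to (lane, column) via a BFS topological sort."""
    nodes = set(has_stakeholder)
    for a, b in has_next:
        nodes.update((a, b))

    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for a, b in has_next:
        successors[a].append(b)
        indegree[b] += 1

    column = {n: 0 for n in nodes}
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            # A step sits one column right of its latest predecessor.
            column[nxt] = max(column[nxt], column[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("hasNext edges contain a cycle")
    return {n: (has_stakeholder.get(n, "Unassigned"), column[n]) for n in order}


# Hypothetical triples for illustration only.
has_next = [("Update TOBT", "Issue TSAT"),
            ("Issue TSAT", "Pushback"),
            ("Complete boarding", "Pushback")]
has_stakeholder = {"Update TOBT": "Ground Handler", "Issue TSAT": "ATC",
                   "Pushback": "Ground Handler", "Complete boarding": "Airline"}

positions = layout(has_next, has_stakeholder)
for name, (lane, col) in positions.items():
    print(f"{lane:15} column {col}: {name}")
```

The cycle check matters in practice: a hasNext loop usually signals an extraction error or a genuinely iterative procedure, and either case deserves a reviewer's eye before a diagram is drawn.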

For managers, the value is not that the diagram is pretty. The value is that the diagram is derived from the same traceable graph that came from the source text. A conventional process map often becomes a separate artifact that slowly diverges from the manual it was supposed to represent. Here, the process map is downstream of the knowledge layer.

That changes the review workflow. Instead of arguing over a manually drawn diagram, teams can inspect the triple, the source sentence, the stakeholder assignment, and the dependency. The diagram becomes an interface to audit the process, not just a slide for the weekly meeting. A noble upgrade, considering what weekly meetings have done to civilization.

Provenance is where the system stops being a chatbot

The provenance result deserves its own attention because it is the difference between an extraction demo and an auditable workflow system.

LangExtract classifies alignment between extracted triples and source text into three categories: exact matches, fuzzy matches, and lesser (partial) matches. The paper reports that all extracted triples were mapped to their respective source sentences. False positives appeared more often in fuzzy alignments than exact alignments, which is exactly where a reviewer should expect risk to concentrate.

Alignment category	Short-context FP / TP	Long-context FP / TP	Practical reading
MATCH_EXACT	4 / 138	3 / 149	Verbatim grounding is comparatively safer
MATCH_FUZZY	14 / 273	12 / 276	Semantic mapping is useful but needs more review attention
MATCH_LESSER	0 / 22	0 / 17	No false positives were observed, but the category is small

This is a useful risk map. It says reviewers should not treat every extracted triple equally. A fuzzy source alignment is not automatically wrong, but it is a better candidate for inspection than an exact match. That opens the door to review prioritization: spend human time where the extraction system itself signals weaker textual anchoring.
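Review prioritization of this kind amounts to ranking alignment categories by their observed false-positive share. The sketch below uses the long-context counts from the table above; the `fp_rate` helper and the ranking rule are illustrative assumptions, not a procedure described in the paper.

```python
# (false positives, true positives) per alignment category, long-context counts.
long_context = {
    "MATCH_EXACT": (3, 149),
    "MATCH_FUZZY": (12, 276),
    "MATCH_LESSER": (0, 17),
}


def fp_rate(fp, tp):
    """Share of extracted triples in a category that were false positives."""
    return fp / (fp + tp)


# Highest observed error rate first: review these triples earliest.
ranked = sorted(long_context, key=lambda c: fp_rate(*long_context[c]), reverse=True)
for category in ranked:
    fp, tp = long_context[category]
    print(f"{category:13} {fp_rate(fp, tp):.1%} of {fp + tp} triples")
```

On these counts, fuzzy matches carry roughly twice the false-positive rate of exact matches. The zero for MATCH_LESSER rests on only 17 triples, so it is weak evidence that the category is safe.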

The paper’s deterministic anchoring relies on methods such as text.find() and difflib.SequenceMatcher() alongside the probabilistic extraction process. That combination is not glamorous. It is also exactly the kind of unglamorous engineering that makes AI systems deployable.
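The two standard-library calls named above are enough to sketch the idea of deterministic anchoring. The following is a simplified illustration, not LangExtract's internal logic: the `anchor` function, the status labels, the sentence splitting on ". ", and the 0.75 threshold are all assumptions. It tries an exact substring match first and falls back to the most similar sentence.

```python
from difflib import SequenceMatcher


def anchor(extraction, source, threshold=0.75):
    """Locate an extraction in the source: exact substring first, fuzzy second."""
    start = source.find(extraction)
    if start != -1:
        return {"status": "exact", "start": start, "end": start + len(extraction)}

    best_ratio, best_span = 0.0, None
    pos = 0
    for sentence in source.split(". "):
        ratio = SequenceMatcher(None, extraction.lower(), sentence.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, (pos, pos + len(sentence))
        pos += len(sentence) + 2  # skip past the ". " separator

    if best_span is not None and best_ratio >= threshold:
        return {"status": "fuzzy", "start": best_span[0], "end": best_span[1],
                "ratio": round(best_ratio, 3)}
    return {"status": "unaligned"}


# Hypothetical source text for illustration.
source = "The airline updates the TOBT. Air traffic control issues the TSAT."
print(anchor("issues the TSAT", source))                     # exact substring
print(anchor("Air traffic control issue the TSAT", source))  # near match
print(anchor("Baggage belts are inspected weekly", source))  # nothing similar
```

Because both steps are deterministic, the same extraction always maps to the same span, which is what makes the resulting provenance reviewable.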

A black-box answer asks the organization to trust the model. A provenance-preserving triple asks the organization to inspect the evidence. That is a healthier relationship. Less romantic, perhaps, but software procurement has suffered enough romance already.

The business value is cheaper operational alignment, not magical autonomy

The immediate business pathway is not “replace airport operations experts.” The paper does not show that, and no serious reader should want that.

The more realistic pathway is this:

Unstructured operational manuals
      ↓
Traceable procedural triples
      ↓
Airport operations knowledge graph
      ↓
Stakeholder-attributed process maps
      ↓
Training, compliance review, incident analysis, and simulation inputs
      ↓
Future integration with live operational monitoring

For airports, logistics networks, hospitals, manufacturing plants, utilities, and other operationally dense organizations, the same pattern is familiar. The organization already has the knowledge. It is just trapped in documents, local terminology, and stakeholder memory. The cost is not only search. The cost is reconciliation: getting teams to agree what the process actually says, who owns each step, and which dependency matters when conditions change.

This paper suggests a way to reduce that reconciliation cost. It does not remove expert judgment; it gives expert judgment better objects to inspect.

What the paper directly shows	What Cognaptus infers for business use	What remains uncertain
LangExtract can extract A-CDM procedural triples with high precision and recall in this setting	Similar ontology-scaffolded pipelines may accelerate SOP-to-KG conversion in other structured operations domains	Performance may change with longer manuals, messier documents, more stakeholders, or weaker ontologies
Long-context extraction outperforms page-level extraction on this manual	Chunking strategy should be chosen based on procedural dependency structure, not habit	The result may not generalize to documents with more dispersed dependencies or larger context demands
Every extracted triple is mapped to source sentences	Review workflows can prioritize low-confidence or fuzzy-grounded triples	Source grounding does not guarantee the extracted relationship is operationally complete
KG triples can generate stakeholder swimlane diagrams	Process maps can become auditable interfaces rather than manually maintained artifacts	The paper does not measure whether generated diagrams improve real-world training or coordination

The strongest business implication is therefore governance-by-design. Instead of bolting audit controls onto an LLM after it generates answers, the workflow produces auditability as part of extraction. That is the difference between an AI assistant and an operational knowledge system.

For management teams, this also changes the procurement question. Asking “Which model is best?” is too shallow. The better question is: “What structure constrains the model, how is each output grounded, and how will reviewers know where to look first?”

The model matters. The surrounding system matters more.

The boundary is one manual, one model setup, and offline evaluation

The limitations are not fatal, but they are material.

First, the evidence comes from one 16-page A-CDM manual. It is a meaningful test because the workflow is procedural, multi-stakeholder, and dependency-heavy. Still, one document is not a deployment benchmark. The pipeline may behave differently on larger airport documentation sets, local airport variants, multilingual manuals, contradictory SOPs, or documents with diagrams and tables that were not cleanly converted into text.

Second, the ground truth is manually curated. That is appropriate for evaluation, but it means the system still depends on expert knowledge at the schema and validation layers. The paper’s contribution is semi-automation, not full autonomy. This is not a weakness; it is honesty wearing sensible shoes.

Third, the extraction setup uses Gemini-2.5-flash through LangExtract with fixed default-style parameters and max_workers=1. The paper therefore evaluates a specific implementation path, not an abstract property of all LLMs. Another model, prompt design, ontology, or document conversion pipeline could change the results.

Fourth, the future-work vision — real-time operations verification using video analytics and transponder data against KG-defined procedures — is an exploratory extension. It is a plausible next step, but it is not established by the current experiment. Mapping low-level sensor detections to high-level KG entities is its own hard problem, involving entity resolution, data association, and temporal uncertainty. In other words: still very much engineering, not fairy dust.

These boundaries do not weaken the paper’s core message. They define where it is useful.

The paper is strongest as a prototype architecture for turning authoritative operational documents into auditable knowledge structures. It is weakest if read as proof that airports can now plug manuals into an LLM and obtain production-grade operational control systems. Anyone selling the latter has probably discovered PowerPoint again.

The better mental model: LLMs as constrained translators between text and systems

The paper’s most useful conceptual shift is to stop treating LLMs as answer machines.

In this architecture, the LLM is a translator between unstructured human documentation and structured operational systems. But it is a constrained translator. It receives a domain scaffold, produces triples, and must leave a trail back to the source. The output is not valuable because it sounds right. It is valuable because it can be inspected, queried, visualized, and corrected.

That has broader implications for enterprise AI.

Many organizations are still stuck between two unsatisfying options. Traditional knowledge engineering is precise but slow. Open-ended LLM systems are fast but slippery. The interesting middle path is not compromise for its own sake. It is division of labor: symbolic systems define what counts; LLMs scale extraction; deterministic anchoring preserves auditability; human experts review the risky edges.

That is not as cinematic as “autonomous AI agent manages the airport.” Good. Airports have enough drama.

The paper shows that, at least in this controlled A-CDM setting, a scaffolded LLM pipeline can recover procedural knowledge with high fidelity, benefit from long context when dependencies cross page boundaries, and produce process diagrams that retain stakeholder attribution and source provenance. The contribution is not merely better extraction metrics. It is a more credible shape for operational AI: not black-box intelligence, but traceable transformation.

For businesses, the takeaway is simple and inconvenient: if you want AI systems to support real operations, you cannot skip the structure. The ontology, the graph, the provenance layer, and the review workflow are not bureaucratic extras. They are the machinery that turns fluent output into usable knowledge.

The boarding gate is not impressed by vibes.

Cognaptus: Automate the Present, Incubate the Future.

¹ Darryl Teo, Adharsha Sam, Chuan Shen Marcus Koh, Rakesh Nagi, and Nuno Antunes Ribeiro, “Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management,” arXiv:2603.26076v1, 27 March 2026, https://arxiv.org/html/2603.26076.
