Opening — Why this matters now

Medical ontologies age faster than clinical practice. New diseases appear, old terminology mutates, and clinicians keep writing whatever reflects reality today. The result: a widening semantic gap between structured ontologies and the messy, unstructured world of clinical notes.

In the era of LLMs, that gap is no longer just inconvenient—it’s a bottleneck. Every downstream application, from diagnosis prediction to epidemiological modeling, depends on ontologies that are both up‑to‑date and hierarchically consistent. And updating these ontologies manually is about as scalable as handwriting ICD‑12 on stone tablets.

Enter CLOZE: a zero-shot, privacy-preserving framework that lets LLMs read clinical notes, extract new disease concepts, and insert them into a hierarchical ontology—without training, labels, or risking PHI exposure. It’s a quietly radical proposition: let LLM agents extend medical knowledge directly.

Background — Context and prior art

Ontology extension has always required domain experts, labeled corpora, or rule-based heuristics—none of which scale. Rule‑based and statistical systems (regexes, co-occurrence analysis) are interpretable but brittle. Learning-based pipelines offer richer semantic modeling but choke without annotated datasets, which are rare and expensive to produce in clinical domains.

Clinical notes themselves contain the richest and most immediate expression of medical reality. They capture rare symptoms, nuanced disease variants, and social determinants that structured EHR fields consistently miss. But because they also contain sensitive PHI, they’re notoriously difficult to use for automated ontology work.

CLOZE threads this needle by combining:

  • LLM‑based PHI de-identification,
  • zero‑shot disease entity extraction,
  • SapBERT biomedical semantic embeddings,
  • and LLM-driven hierarchical reasoning.

The result is an automated system that reads clinical notes and outputs vetted additions to a disease ontology.

Analysis — What the paper introduces

CLOZE proposes a two-stage pipeline:

1. Medical Entity Extraction

Two LLM agents split the work:

  • PHI Removal — Using LLaMA‑3‑70B, CLOZE removes identifiers across a standard PHI schema (names, dates, IDs, locations, etc.).
  • Disease Entity Extraction — A second LLM identifies disease-related mentions using a structured prompt. No labels, no training, no domain-specific fine-tuning.

The clever part: PHI removal emits structured JSON, which gives an audit trail for traceability and compliance, and the whole step runs on-premise, so raw notes never leave local infrastructure.
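
To make the mechanics concrete, here is a minimal sketch of what the PHI-removal agent could look like, assuming LLaMA‑3‑70B is served behind an on-premise, OpenAI-compatible endpoint (the URL, deployment name, prompt wording, and JSON schema below are illustrative, not taken from the paper):

```python
# Illustrative sketch of the PHI-removal step (not the authors' code).
# Assumes LLaMA-3-70B served locally via an OpenAI-compatible API (e.g. vLLM).
import json
import requests

LOCAL_LLM = "http://localhost:8000/v1/chat/completions"  # hypothetical on-prem endpoint

PHI_PROMPT = (
    "Identify all protected health information (names, dates, IDs, locations, "
    "contact details) in the clinical note below. Respond with JSON only, "
    'shaped like {"phi": [{"text": "...", "category": "..."}]}.'
)

def deidentify(note: str) -> str:
    """Ask the local model for a structured PHI list, then redact each span."""
    resp = requests.post(LOCAL_LLM, json={
        "model": "llama-3-70b-instruct",  # assumed local deployment name
        "messages": [{"role": "user", "content": PHI_PROMPT + "\n\nNote:\n" + note}],
        "temperature": 0,
    })
    phi = json.loads(resp.json()["choices"][0]["message"]["content"])["phi"]
    for item in phi:  # the structured JSON output doubles as an audit trail
        note = note.replace(item["text"], f"[{item['category'].upper()}]")
    return note
```

The same pattern, with a different prompt, covers the disease-extraction agent: zero-shot instructions in, structured mentions out.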

2. Hierarchical Ontology Extension

Once CLOZE has clean disease mentions, it inserts them into a hierarchical ontology (Disease Ontology) by combining:

  • SapBERT embeddings for semantic similarity,
  • LLM-based relation classification (equivalence / subset / unrelated),
  • Recursive layer-by-layer descent through the ontology.

This hybrid structure counters the Achilles' heel of pure LLM approaches: reasoning reliably at hierarchical depth. By anchoring each step with domain-specific embeddings, CLOZE avoids the semantic drift and hallucinated hierarchies typical of pure prompt-driven methods.
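
Here is a minimal sketch of that descent, assuming the public SapBERT checkpoint on Hugging Face; the `children` mapping and `classify_relation` hook are my own stand-ins, with the latter wrapping the equivalence / subset / unrelated prompt:

```python
# Sketch of CLOZE-style recursive insertion (a reconstruction, not the authors' code).
import torch
from transformers import AutoModel, AutoTokenizer

CKPT = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # public SapBERT checkpoint
tok, enc = AutoTokenizer.from_pretrained(CKPT), AutoModel.from_pretrained(CKPT)

def embed(names: list[str]) -> torch.Tensor:
    batch = tok(names, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = enc(**batch).last_hidden_state[:, 0]  # CLS-token embeddings
    return torch.nn.functional.normalize(cls, dim=-1)

def insert(concept, node, children, classify_relation, top_k=3):
    """Descend one layer at a time, narrowing candidates with SapBERT similarity
    and letting the LLM judge equivalence / subset / unrelated at each step."""
    kids = children.get(node, [])
    if not kids:
        return node                                       # deepest valid parent found
    sims = (embed(kids) @ embed([concept]).T).squeeze(1)  # cosine similarities
    ranked = [kids[i] for i in sims.argsort(descending=True).tolist()[:top_k]]
    for cand in ranked:
        rel = classify_relation(concept, cand)            # wraps the LLM relation prompt
        if rel == "equivalent":
            return cand                                   # concept already present
        if rel == "subset":
            return insert(concept, cand, children, classify_relation, top_k)
    return node                                           # attach as a new child of `node`

# Example toy ontology:
# children = {"disease": ["respiratory disease", "cardiovascular disease"],
#             "respiratory disease": ["pneumonia", "asthma"]}
# insert("atypical pneumonia", "disease", children, classify_relation)
```

Because SapBERT prunes each layer down to a handful of candidates, the LLM only ever answers local, well-scoped relation questions instead of reasoning over the whole ontology at once.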

Findings — Results with visualization

The paper evaluates CLOZE on three fronts: PHI de‑identification, disease extraction, and ontology insertion.

1. PHI De-identification Performance

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhysioNet (rule-based) | 0.39 | 0.24 | 0.28 |
| GPT‑3.5 | 0.43 | 0.60 | 0.48 |
| GPT‑4 | 0.53 | 0.69 | 0.58 |
| LLaMA‑3‑8B | 0.46 | 0.59 | 0.50 |
| LLaMA‑3‑70B (CLOZE) | 0.60 | 0.68 | 0.62 |

A reminder that size still matters—at least for PHI detection.

2. Disease Entity Extraction

Across mention-matching similarity thresholds (60–80), CLOZE achieves the highest F1 score, outperforming Stanza and BioEN. The small but consistent edge suggests that zero-shot LLM prompting already rivals fine-tuned biomedical NER models.
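
For intuition, this is roughly what threshold-based matching looks like (a sketch using Python's stdlib `difflib` as the similarity function; the paper does not specify its exact matcher):

```python
# Sketch of threshold-matched NER scoring (matcher choice is an assumption).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Scaled 0-100 to match the thresholds quoted above."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def f1_at_threshold(predicted: list[str], gold: list[str], threshold: int = 70) -> float:
    """A mention counts as matched if it clears the threshold against any counterpart."""
    if not predicted or not gold:
        return 0.0
    tp_p = sum(any(similarity(p, g) >= threshold for g in gold) for p in predicted)
    tp_g = sum(any(similarity(g, p) >= threshold for p in predicted) for g in gold)
    precision, recall = tp_p / len(predicted), tp_g / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Sweeping `threshold` from 60 to 80 reproduces the kind of comparison the paper reports.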

3. Ontology Extension Accuracy

LLM-based evaluators and human experts assessed a large set of inserted relationships.

LLM Evaluation (Precision):

| Method | Precision |
|---|---|
| LLM-Onetime | 38.9% |
| LLM-Hierarchical | 36.7% |
| SapBERT + LLM-Onetime | 43.1% |
| SapBERT + LLM-Hierarchical (CLOZE) | 79.6% |

Human Evaluation (Average Scores, 0–2):

| Model | Relevance | Accuracy | Importance | Overall |
|---|---|---|---|---|
| LLM-Onetime | 0.09 | 0.67 | 0.63 | 0.06 |
| SapBERT + LLM-Onetime | 1.04 | 1.53 | 1.00 | 1.00 |
| CLOZE Full Model | 1.05 | 2.00 | 1.87 | 1.87 |

The human evaluation confirms what the LLM assessment hinted: grounding LLM reasoning in SapBERT’s domain-specific embeddings is what keeps ontology structure coherent.

Implications — Why this matters for the AI ecosystem

CLOZE is less about medical terminology than about agentic architecture design. It illustrates:

1. Multi-agent LLM systems can perform domain‑specific knowledge maintenance.

Instead of training new models, we orchestrate existing ones: PHI scrubbers, extractors, reasoning agents.

2. Zero-shot pipelines are now competitive with specialized NLP models.

Biomedical NER once required expensive annotation projects. CLOZE bypasses that entirely.

3. Ontology maintenance becomes automatable—a foundational shift for AI systems relying on structured knowledge.

In AI governance and enterprise knowledge management, ontology drift is a constant threat. CLOZE-style architectures could auto‑curate compliance taxonomies, risk ontologies, process hierarchies, and more.

4. Privacy-preserving LLM workflows are becoming viable.

On-premise LLaMA‑3 for processing PHI, with Azure‑hosted GPT used only for safe evaluative tasks, shows a maturing pattern: sensitive data stays local, intelligence is federated.
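
A sketch of the routing logic, with hypothetical endpoint names, shows how simple the pattern is in practice:

```python
# Illustrative routing of sensitive vs. non-sensitive LLM calls (endpoints hypothetical).
LOCAL_ENDPOINT = "http://llm.internal:8000/v1"      # on-prem LLaMA-3: may see raw notes
HOSTED_ENDPOINT = "https://myorg.openai.azure.com"  # Azure GPT: de-identified text only

def pick_endpoint(contains_phi: bool) -> str:
    """Sensitive data never crosses the network boundary."""
    return LOCAL_ENDPOINT if contains_phi else HOSTED_ENDPOINT
```

Once that boundary is encoded in the pipeline itself, privacy review shifts from auditing every prompt to auditing one routing rule.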

Conclusion — The pragmatic future of ontology extension

CLOZE hints at a future where ontologies evolve continuously alongside real‑world data, powered not by manual updates or massive labeling campaigns, but by agentic LLM systems that reason, compare, and integrate.

It is not flawless—its evaluation dataset is small, and recursive LLM reasoning still carries interpretability challenges—but the conceptual leap is unmistakable.

The machines aren’t just reading clinical notes. They’re starting to keep the medical ontology tidy.

Cognaptus: Automate the Present, Incubate the Future.