From Hallucination to Verification: Why AI Needs a Pharmacist’s Mindset

Prescription checks are a good way to humble AI.

Not because the language is impossible. Drug labels, clinical notes, dosage instructions, contraindications, and interaction warnings are all text-heavy. LLMs are good at text. That part is not the problem.

The problem is that prescription verification is not a writing task. It is a safety task disguised as a reading task. A pharmacist is not merely asking, “Does this paragraph sound medically reasonable?” The real question is narrower and harsher: given this patient, this drug, this dose, this route, this timing, this interaction profile, and this missing or available clinical data, is there a specific safety issue that must be raised?

That is where ordinary LLM behavior becomes awkward. A plausible answer is not enough. A fluent answer is not enough. Even a medically sophisticated answer is not enough if it cannot show where the rule came from, whether the patient condition was actually known, and which part of the prescription triggered the warning.

The paper behind PharmGraph-Auditor, “A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification,” proposes a useful answer to this problem.¹ Its core message is not “use an LLM for prescription auditing.” That would be the lazy version, and medicine already has enough ways to create paperwork with confidence. The more interesting message is architectural: in high-stakes work, the LLM should stop acting as an oracle and start acting as a coordinator around a governed knowledge system.

The paper’s design is built around three connected contributions.

First, it introduces PharmGraph-Auditor, a hybrid prescription-auditing system based on a Hybrid Pharmaceutical Knowledge Base, or HPKB. The knowledge base deliberately routes numerical constraint checks to a relational store and semantic relationship reasoning to a graph store.

Second, it proposes a construction workflow: Iterative Schema Refinement, section-aware multi-agent extraction, and provenance tracking. The point is not just to extract more facts from documents. It is to extract facts into a structure where every stored claim can be traced back to source text.

Third, it introduces a KB-grounded Chain of Verification, where the LLM decomposes the audit task, retrieves evidence through structured queries, and synthesizes a report grounded in the knowledge base. The model is still useful, but it is no longer free to freestyle. A small mercy.

The usual AI options fail in different ways

The easiest way to understand the paper is not to start with PharmGraph-Auditor itself. Start with the alternatives.

A hospital already has human pharmacists. It may also have a traditional Clinical Decision Support System, or CDSS. A modern AI team might propose a zero-shot LLM, a full-document RAG system, or a GraphRAG-style knowledge graph. Each option has a genuine strength. Each also fails in a way that matters for prescription auditing.

Approach	What it does well	Where it breaks in prescription verification
Human experience review	Very high precision; experienced pharmacists avoid frivolous warnings	Misses risks when memory, workload, or complex evidence chains become limiting
Rule-based CDSS	Deterministic checks for known encoded rules	Rigid, noisy, manually maintained, and weak with unstructured clinical context
Zero-shot LLM	Flexible language understanding	No reliable source grounding, no traceable verification, and unsafe factual uncertainty
Full-document RAG	Gives the model external text	Can flood the model with irrelevant context and still leaves reasoning loosely controlled
GraphRAG / pure graph approach	Good for relationships and multi-hop traversal	Not naturally optimized for exact numerical thresholds, ranges, and conditional filters
Hybrid auditing on a Virtual Knowledge Graph (VKG)	Routes different reasoning tasks to different stores	Requires careful schema design, domain governance, and validation before deployment

This comparison is the main reason the paper is worth reading beyond the healthcare setting. The contribution is not merely that the authors achieved better numbers than a rule engine. The contribution is that they show why different safety tasks need different computational surfaces.

A dosage check is not the same kind of operation as an allergy hierarchy check. One asks whether a patient falls into a numerical or categorical rule set. The other asks whether entities are connected through a clinical relationship chain. Treating both as “retrieval” is convenient. It is also how systems become impressive in demos and irritating in production.

Some medical knowledge behaves like a table; some behaves like a network

The HPKB is built under a Virtual Knowledge Graph paradigm. In the paper’s formulation, it has three major components:

Component	Role in the system	Example use
Relational component	Stores atomic, numerical, and conditional facts	Dosage thresholds, renal-function cutoffs, age restrictions
Graph component	Stores semantic, hierarchical, and transitive relationships	Drug interactions, ingredients, allergies, duplicate therapies
Mapping layer	Links both stores so they behave as one knowledge system	Connects drug entities, rule records, and evidence provenance

This split matters because prescription auditing contains two different reasoning styles.

For set constraint satisfaction, relational databases are the natural tool. A dosage rule often looks like a stack of filters: age above a threshold, creatinine clearance below a threshold, hepatic impairment status, indication, route, and dose. SQL-style filtering is boring, strict, and efficient. In safety systems, boring is often a compliment.
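The "stack of filters" idea can be made concrete with a small sketch. It uses Python's built-in sqlite3 module; the table layout, the drug name drug_x, and every threshold are hypothetical placeholders chosen for illustration, not the paper's schema and not clinical guidance.

```python
import sqlite3

# Hypothetical schema and values, for illustration only (not clinical guidance).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dosage_rule (
        drug TEXT, route TEXT,
        min_age REAL, max_age REAL,      -- years; NULL means no bound
        min_crcl REAL, max_crcl REAL,    -- creatinine clearance in mL/min
        max_daily_dose_mg REAL,
        source_text TEXT                 -- provenance for the rule
    )
    """
)
conn.executemany(
    "INSERT INTO dosage_rule VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("drug_x", "oral", 18, None, 50, None, 1000,
         "Hypothetical label text: adults with CrCl >= 50 mL/min, max 1000 mg/day."),
        ("drug_x", "oral", 18, None, 30, 50, 500,
         "Hypothetical label text: CrCl 30 to <50 mL/min, max 500 mg/day."),
    ],
)


def applicable_rules(drug, route, age, crcl):
    """Return (limit, source_text) for every rule whose filters match the patient."""
    return conn.execute(
        """
        SELECT max_daily_dose_mg, source_text FROM dosage_rule
        WHERE drug = ? AND route = ?
          AND (min_age IS NULL OR ? >= min_age)
          AND (max_age IS NULL OR ? < max_age)
          AND (min_crcl IS NULL OR ? >= min_crcl)
          AND (max_crcl IS NULL OR ? < max_crcl)
        """,
        (drug, route, age, age, crcl, crcl),
    ).fetchall()


def dose_exceeds_limit(drug, route, age, crcl, daily_dose_mg):
    """Return the matching rules that the prescribed daily dose violates."""
    return [
        (limit, source)
        for limit, source in applicable_rules(drug, route, age, crcl)
        if daily_dose_mg > limit
    ]
```

The same 750 mg/day order is acceptable for a patient with a creatinine clearance of 80 and a violation for one at 40. The answer comes from a comparison, not from how similar two passages of text look, and the violated rule arrives with its source text attached.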

For topological traversal, graphs are the natural tool. Allergy checking, drug-drug interactions, active ingredient relationships, and therapeutic duplication often depend on traversing relationships across entities. A graph database is designed for this kind of path-following.
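Path-following can be sketched without a graph database at all. The toy edge list below is an assumption made for illustration: product_a is a hypothetical product, and the relation names HAS_INGREDIENT and IS_A are invented labels rather than the paper's ontology. A breadth-first search returns the chain of hops that links a prescribed product to a recorded allergy class.

```python
from collections import deque

# Toy graph: node -> list of (relation, neighbor). Names are illustrative.
EDGES = {
    "product_a": [("HAS_INGREDIENT", "amoxicillin")],
    "amoxicillin": [("IS_A", "penicillins")],
    "penicillins": [("IS_A", "beta_lactams")],
}


def find_path(start, target):
    """Breadth-first search; returns a list of (node, relation, node) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


# A patient allergic to the penicillin class is prescribed product_a.
print(find_path("product_a", "penicillins"))
```

The returned path is itself the explanation: product, ingredient, drug class. A production system would express the same walk as a graph query, but the shape of the reasoning is the same.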

The misconception the paper quietly attacks is that GraphRAG, or any full-document RAG system, can simply absorb the entire problem. It cannot. A vector search system can retrieve semantically related text, but semantic similarity is not the same as evaluating a dosage threshold. A graph can traverse relationships, but forcing every numeric range into graph nodes is not elegant. A relational store can filter numbers efficiently, but recursive semantic relationships become clumsy.

The paper’s answer is not romantic. It is a division of labor.

That is the pharmacist’s mindset in architectural form: use the right check for the right risk, and do not let a fluent model substitute for verification.

The knowledge base is built, not wished into existence

Many AI systems fail at the least glamorous step: the knowledge base.

The paper spends real attention on this. PharmGraph-Auditor does not assume a perfect schema drops from the ceiling. Instead, it proposes Iterative Schema Refinement, a semi-automated loop where LLMs detect schema gaps and human experts decide whether those gaps should become generalized structures.

The distinction matters. An unchecked extraction system may create fragmented schemas: separate fields for renal adjustment, hepatic adjustment, geriatric adjustment, infusion timing, and so on. That feels detailed, but it quickly becomes brittle. The expert’s role is to abstract these into more stable categories, such as a generalized constraint structure.

The workflow is roughly:

Step	LLM role	Human expert role	Operational value
Initial extraction	Detect facts that do not fit the current schema	Review whether the gap is clinically meaningful	Reduces omitted knowledge
Schema proposal	Suggest new fields, tables, or relations	Merge narrow proposals into general structures	Prevents schema fragmentation
Stratification	Identify whether a fact is constraint-like or topology-like	Assign it to relational or graph storage	Preserves the hybrid architecture
Stabilization	Continue scanning documents until valid new schema gaps decline	Decide when the schema is stable enough	Controls endless schema expansion

After the schema is stabilized, the system uses section-aware multi-agent extraction. This is not “multi-agent” in the theatrical sense where five chatbots have a meeting and somehow call it governance. The idea is more practical: drug documents have sections, and sections imply extraction tasks. Dosage sections should be handled by agents configured for dosage tuples. Interaction sections should be handled by agents configured for relationships.
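A rough sketch of "sections imply extraction tasks" follows. The section headings, the extractor functions, and the output fields are all hypothetical; in a real pipeline each extractor would wrap an LLM call configured for that section, whereas here they are stubs that only record where the result should be stored.

```python
# Stub extractors: each stands in for an agent configured for one section type.
def extract_dosage(text):
    return {"store": "relational", "kind": "dosage_rule", "source_text": text}


def extract_interaction(text):
    return {"store": "graph", "kind": "interaction_edge", "source_text": text}


# Hypothetical mapping from normalized section heading to extractor.
EXTRACTORS = {
    "dosage and administration": extract_dosage,
    "drug interactions": extract_interaction,
}


def route_sections(document: dict[str, str]) -> list[dict]:
    """Send each known section to its extractor; skip sections with no extractor."""
    facts = []
    for heading, text in document.items():
        extractor = EXTRACTORS.get(heading.strip().lower())
        if extractor is None:
            continue  # unknown sections are skipped rather than guessed at
        facts.append(extractor(text))
    return facts
```

The dispatch table is the point of the design: which agent sees which text is decided by document structure, not by a single prompt trying to hold the whole insert in its head.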

This improves attention and reduces extraction noise. More importantly, every extracted fact is required to carry provenance: the raw source text from which it came. In a clinical setting, this is not a nice-to-have. It is the difference between “the model says so” and “the system can show the evidence.”
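One way to make provenance non-optional is to put it in the type. The sketch below is an assumption about how such a record might look, not the paper's data model: the field names are invented, and the traceability check is a plain substring match against the stored document text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    subject: str
    relation: str
    value: str
    source_doc: str   # identifier of the document the fact came from
    source_text: str  # verbatim passage supporting the fact

    def __post_init__(self):
        # A fact without evidence is rejected at construction time.
        if not self.source_text.strip():
            raise ValueError("fact rejected: no source text attached")


def is_traceable(fact: ExtractedFact, documents: dict[str, str]) -> bool:
    """True if the quoted evidence really appears in the named document."""
    return fact.source_text in documents.get(fact.source_doc, "")
```

With this shape, "the system can show the evidence" becomes a property that can be tested: a fact whose quoted passage cannot be found in its source document fails the check before it ever reaches the knowledge base.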

Chain of Verification turns the LLM into a controlled reasoning engine

The paper’s auditing workflow is the KB-grounded Chain of Verification, or CoV.

The phrase sounds like yet another chain-of-something acronym, because apparently AI research papers now require at least one. But the mechanism is useful. CoV changes the LLM’s role from direct judge to structured coordinator.

The auditing process has four stages:

Stage	What happens	Safety purpose
Task decomposition	The LLM breaks the prescription audit into specific subtasks	Prevents vague “check this prescription” reasoning
Hybrid query generation	A deterministic engine generates SQL or Cypher depending on task type	Avoids hallucinated database fields and loose retrieval
Evidence retrieval and curation	The system retrieves and prunes evidence using patient-profile logic	Reduces irrelevant context and conflicting rules
Evidence-grounded synthesis	The LLM writes the audit report from curated evidence	Converts verified facts into usable clinical explanation
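The second stage is the easiest to picture in code. In this sketch the LLM only names a task type and supplies parameters; the query text comes from a fixed template. The templates, table names, node labels, and property names are hypothetical, and the paper's actual engine may work differently.

```python
# Hypothetical templates: table, label, and property names are illustrative,
# not the paper's actual schema.
QUERY_TEMPLATES = {
    "dosage": {
        "language": "sql",
        "text": "SELECT max_daily_dose_mg, source_text FROM dosage_rule "
                "WHERE drug = :drug AND route = :route",
        "params": {"drug", "route"},
    },
    "interaction": {
        "language": "cypher",
        "text": "MATCH (a:Drug {name: $drug})-[r:INTERACTS_WITH]-(b:Drug {name: $other}) "
                "RETURN r.severity, r.source_text",
        "params": {"drug", "other"},
    },
}


def build_query(task):
    """Map a decomposed subtask to a fixed query template plus bound parameters."""
    template = QUERY_TEMPLATES.get(task["type"])
    if template is None:
        raise ValueError(f"no template for task type {task['type']!r}")
    supplied = set(task["params"])
    if supplied != template["params"]:
        raise ValueError(
            f"expected params {sorted(template['params'])}, got {sorted(supplied)}"
        )
    return {
        "language": template["language"],
        "text": template["text"],
        "params": dict(task["params"]),
    }
```

Because the model never writes query text, it cannot invent a column or a relationship type. The worst it can do is ask for a task type that does not exist, and that fails loudly.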

The third stage is especially important. The paper introduces a Patient Profile-driven Evidence Selection Tree, or P-EST. In plain terms, the system tries to retrieve the most specific applicable rule for the patient. If an exact match is not available, it falls back up the rule hierarchy rather than dumping all possible rules into the model context.

This is a subtle but important correction to ordinary RAG thinking. Retrieval is not merely about getting more relevant documents. In high-stakes work, retrieval must sometimes select one rule that applies and suppress twenty rules that only look related.
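The "most specific rule, with fallback" behavior can be approximated in a few lines. This is a deliberately simplified stand-in for P-EST, not the paper's algorithm: rules are flat records, specificity is just the number of conditions a rule carries, and the rule IDs and dose limits are made up.

```python
# Hypothetical rules for one drug; conditions and limits are illustrative only.
RULES = [
    {"id": "general", "conditions": {}, "max_daily_dose_mg": 1000},
    {"id": "elderly", "conditions": {"age_group": "elderly"}, "max_daily_dose_mg": 750},
    {
        "id": "elderly_renal",
        "conditions": {"age_group": "elderly", "renal": "impaired"},
        "max_daily_dose_mg": 500,
    },
]


def select_rule(profile, rules=RULES):
    """Return the single most specific rule whose conditions all match the profile."""
    applicable = [
        rule
        for rule in rules
        if all(profile.get(key) == value for key, value in rule["conditions"].items())
    ]
    if not applicable:
        return None
    # More matched conditions means a more specific rule; the general rule,
    # with no conditions, is the final fallback.
    return max(applicable, key=lambda rule: len(rule["conditions"]))
```

An elderly patient with impaired renal function gets elderly_renal and nothing else; an elderly patient with normal renal function falls back to elderly; an adult falls back to general. In each case one rule reaches the model, not three.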

The final synthesis stage also includes information-gap handling. If the evidence requires renal-function data and the patient profile does not include it, the system should flag a missing-data issue instead of manufacturing a confident verdict.
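Information-gap handling amounts to giving the check a third possible outcome. The sketch below assumes a hypothetical renal dose rule; the field name crcl_ml_min, the cutoff, and the dose limit are placeholders, not values from the paper or from any label.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    ISSUE = "issue"
    MISSING_DATA = "missing_data"


def check_renal_dose(profile, daily_dose_mg, crcl_cutoff=50.0, reduced_limit_mg=500.0):
    """Evaluate a hypothetical renal dose rule; never guess when the input is absent."""
    crcl = profile.get("crcl_ml_min")
    if crcl is None:
        return (
            Verdict.MISSING_DATA,
            "creatinine clearance not recorded; renal dose rule cannot be evaluated",
        )
    if crcl < crcl_cutoff and daily_dose_mg > reduced_limit_mg:
        return (
            Verdict.ISSUE,
            f"dose {daily_dose_mg} mg/day exceeds {reduced_limit_mg} mg/day "
            f"for CrCl below {crcl_cutoff}",
        )
    return Verdict.PASS, "dose within the applicable limit"
```

The missing-data branch comes first on purpose. Without it, an absent lab value would quietly fall through to PASS, which is exactly the confident verdict the design is meant to prevent.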

That design choice deserves more attention than it usually gets. In many business workflows, the safest AI output is not an answer. It is a refusal to close the case because a required condition is missing.

The main evidence: better balance, not magic perfection

The experiments answer two questions.

First, can the system construct a high-quality HPKB from pharmaceutical documents?

Second, can it perform prescription auditing better than human experience review and traditional CDSS?

For the knowledge-base construction task, the authors use a gold-standard HPKB built from 100 pharmaceutical documents, annotated by a senior clinical pharmacist with more than 10 years of experience in medical informatics and ontology construction. The benchmark contains 2,951 relational records and 923 graph relations, for 3,874 extracted records in total.

The paper compares PharmGraph-Auditor against a zero-shot OpenIE baseline, treated as GraphRAG-style extraction, and a one-shot schema-guided baseline, adapted from AutoKG-style extraction.

Knowledge population method	Overall precision	Overall recall	Overall F1
PharmGraph-Auditor with GPT-4o	0.8260	0.8491	0.8374
PharmGraph-Auditor with Deepseek-V3	0.8235	0.8603	0.8415
PharmGraph-Auditor with Qwen3-32B	0.8750	0.8603	0.8676
Zero-shot OpenIE / GraphRAG-style	0.8365	0.4860	0.6148
One-shot schema-guided / AutoKG-style	0.8023	0.7709	0.7863

The pattern is more informative than the headline score. Zero-shot extraction keeps precision competitive at 0.8365, but recall collapses to 0.4860: it misses roughly half of the gold-standard records. The schema-guided baseline lifts recall to 0.7709 but still trails the section-aware multi-agent method, which reaches 0.8491 to 0.8603. PharmGraph-Auditor performs better because it does not ask one large prompt to understand the whole document in one pass.

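This is not merely a model-size story either. The framework is tested with GPT-4o, Deepseek-V3, and Qwen3-32B. The exact scores vary, but all three backbones beat both baselines on F1 (0.8374 to 0.8676, against 0.7863 for the stronger baseline), so the gain follows the architecture rather than any single model.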

For prescription auditing, the authors use 100 real inpatient medical records and prescriptions from a hospital. These produce 500 audit points across five categories: indications, dosage, contraindications, special populations, and interactions. The gold standard comes from a knowledge-assisted pharmacist review, which identifies 37 issues out of 500 audit points.

The comparative result is the heart of the paper:

Auditing method	Precision	Recall	F1-score	Interpretation
Human experience review	100.0%	45.9%	62.9%	Very few false alarms, but many missed risks
Rule-based CDSS	52.1%	67.6%	58.8%	Better coverage, but noisy alerts
PharmGraph-Auditor	74.3%	70.3%	72.2%	Better balance between missed risks and false alarms

This is the right way to read the result: the system is not “better than pharmacists” in some crude general sense. The human pharmacist’s precision is perfect in this experiment. The weakness is recall when relying only on experience. The system’s value is that it increases detection while keeping precision far above the rule-based CDSS.
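The percentages become easier to reason about as counts. The true-positive and false-positive numbers below are not reported in this article; they are back-computed here from the 37 gold-standard issues and the stated precision and recall, so treat them as an inference rather than as data from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


GOLD_ISSUES = 37  # issues in the gold standard, out of 500 audit points

# (true positives, false positives): inferred from the reported percentages,
# not taken from the paper.
INFERRED_COUNTS = {
    "Human experience review": (17, 0),
    "Rule-based CDSS": (25, 23),
    "PharmGraph-Auditor": (26, 9),
}

for method, (tp, fp) in INFERRED_COUNTS.items():
    p, r, f1 = precision_recall_f1(tp, fp, GOLD_ISSUES - tp)
    print(f"{method}: P={p:.1%} R={r:.1%} F1={f1:.1%} false alarms={fp}")
```

Under these inferred counts the script reproduces the table, with one small exception: the human-review F1 comes out at 63.0% rather than 62.9%, a gap consistent with the reported figure having been computed from already-rounded precision and recall. On this reading, the system finds nine more real issues than experience-based review at the price of nine false alarms, while the CDSS finds one fewer than the system and raises 23.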

That distinction matters for business interpretation. A safety assistant does not need to replace an expert to create value. It can create value by catching more issues without burying the expert in garbage alerts.

Alert fatigue is not a cosmetic problem. If a system produces too many false positives, users learn to ignore it. The paper’s comparison with CDSS is therefore more important than the comparison with a human pharmacist. A noisy system can be technically “safer” on recall while operationally dangerous because it trains people not to listen.

The risk-category analysis shows where the architecture helps—and where it still lacks hospital common sense

The paper’s fine-grained analysis by risk type adds an important clue.

The proposed method performs especially well in categories such as interactions, dosage, and special populations. The special-populations case is revealing because rule-based systems often struggle when patient constraints are embedded in unstructured notes or laboratory reports rather than clean database fields. A hybrid system with semantic reasoning and structured evidence retrieval has a better chance of identifying those constraints.

But the indication category exposes a boundary. The paper reports lower precision for indications, with false positives caused by limited clinical situational awareness. The example is 0.9% sodium chloride. In inpatient settings, saline may be used as a solvent or for line flushing, even if that procedural use is not formally represented as a therapeutic indication in package inserts. A strict evidence-based system can therefore flag a mismatch that a human clinician would understand as routine.

This is an excellent limitation because it is not generic. It tells us exactly what kind of knowledge is missing: local procedural knowledge.

Package inserts are not the whole hospital. Guidelines are not the whole workflow. A real deployment would need to encode not only formal pharmaceutical rules but also accepted institutional practices, workflow conventions, and contextual exceptions. Otherwise the system may become very good at reading documents and still somewhat naive about the ward.

The ablation is about mechanism, not a second clinical trial

The ablation study should be read carefully.

The main clinical evidence comes from the 100 real inpatient prescription sets. The ablation uses a synthetic benchmark built through a red-teaming process with over 1,000 generated prescription test cases. The purpose is not to replace the real-world evaluation. It is to isolate whether the external knowledge base and CoV framework actually matter.

The tested variants are:

Setting	Precision	Recall	F1	Cost
Proposed method	0.7924	0.9504	0.8642	$0.0225
Without CoV	0.5757	0.7645	0.6561	$0.0250
Without CoV and without knowledge	0.3927	0.5233	0.4487	$0.0055

The “without CoV” version resembles a full-text RAG setup: give the model the relevant drug document and let it reason over the long text. It performs better than pure zero-shot reasoning, but much worse than the proposed method. It is also more expensive than the full method in the reported experiment, because full-text context consumes more tokens.

This is the paper’s most transferable lesson for enterprise AI. Better grounding is not always achieved by stuffing more text into the prompt. Sometimes better grounding comes from retrieving less text, but with stricter structure.

A long context window is not a governance framework. It is just a larger room in which the model can misplace the furniture.
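The trade-off in the ablation table can be read off with a few lines of arithmetic. The numbers are copied from the table above; the helper function and its output fields are just a convenience for this sketch.

```python
# F1 and reported cost (US dollars) copied from the ablation table.
ABLATION = {
    "Proposed method": {"f1": 0.8642, "cost": 0.0225},
    "Without CoV": {"f1": 0.6561, "cost": 0.0250},
    "Without CoV and without knowledge": {"f1": 0.4487, "cost": 0.0055},
}


def compare_to_full(variant):
    """F1 gained by the full method over a variant, and the full/variant cost ratio."""
    full = ABLATION["Proposed method"]
    other = ABLATION[variant]
    return {
        "f1_gain": round(full["f1"] - other["f1"], 4),
        "cost_ratio": round(full["cost"] / other["cost"], 2),
    }


for name in ("Without CoV", "Without CoV and without knowledge"):
    print(name, compare_to_full(name))
```

Against the full-text variant, the proposed method gains about 0.21 F1 while costing roughly 10% less. Against pure zero-shot reasoning it gains about 0.42 F1 at roughly four times the cost, which is the comparison where the price of grounding actually shows up.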

The case study shows why traceability changes the user experience

The paper includes a case involving a 59-year-old female patient with HR-positive, HER2-negative metastatic breast cancer and tuberculosis. The prescription includes abemaciclib, while the patient is also treated with rifampin.

The important detail is that rifampin is a strong CYP3A4 inducer, and that interaction matters for abemaciclib. The system’s value is not simply that it identifies the interaction. The value is that it traces the reasoning path:

decompose the prescription into verification tasks;
identify a drug-interaction check;
retrieve the relevant interaction evidence;
connect the evidence to the patient’s current medications;
generate an audit finding with source-backed explanation.

That workflow changes the nature of the alert. A traditional rule engine might say: warning. A loose LLM might say: possible concern. A verification-oriented system can say: this specific patient has this specific drug combination; this evidence supports the concern; this is the recommended interpretation; and here is what information is missing, if any.
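The difference between "warning" and a traceable finding shows up in the shape of the output. The sketch below replays the abemaciclib and rifampin case against a one-entry evidence table. The lookup structure and the Finding fields are hypothetical, and the source passage is left as a placeholder because the exact label text is not quoted in this article.

```python
from dataclasses import dataclass, field

# One-entry evidence table keyed by unordered drug pair. The mechanism comes
# from the case description; the source passage is a placeholder.
INTERACTION_EVIDENCE = {
    frozenset({"abemaciclib", "rifampin"}): {
        "mechanism": "rifampin is a strong CYP3A4 inducer",
        "source_text": "<verbatim passage from the source document goes here>",
    }
}


@dataclass
class Finding:
    drugs: tuple
    mechanism: str
    source_text: str
    missing_info: list = field(default_factory=list)


def audit_interactions(prescribed, current_meds):
    """Check each prescribed drug against current medications and keep the evidence."""
    findings = []
    for new_drug in prescribed:
        for existing in current_meds:
            evidence = INTERACTION_EVIDENCE.get(frozenset({new_drug, existing}))
            if evidence:
                findings.append(
                    Finding(
                        drugs=(new_drug, existing),
                        mechanism=evidence["mechanism"],
                        source_text=evidence["source_text"],
                    )
                )
    return findings


for finding in audit_interactions(["abemaciclib"], ["rifampin"]):
    print(finding)
```

Every field in the finding answers a question a reviewer would otherwise have to ask: which drugs, why it matters, where the claim comes from, and what is still unknown.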

For pharmacists, compliance officers, auditors, or financial controllers, that difference is not aesthetic. It determines whether the user can trust the system enough to act.

The business lesson is not “medical AI is coming”; it is “verification architecture travels”

The direct domain is prescription verification. The transferable pattern is broader.

Many enterprise workflows have the same structure:

Business domain	Constraint-like knowledge	Topology-like knowledge	Why hybrid verification helps
Compliance review	Thresholds, deadlines, required fields	Entity relationships, ownership chains, policy dependencies	Separates strict rule checks from relationship reasoning
Contract review	Dates, payment terms, liability caps	Clause dependencies, referenced obligations, party roles	Grounds legal interpretation in traceable evidence
Insurance claims	Coverage limits, eligibility criteria	Event relationships, provider networks, claimant history	Reduces unsupported claim decisions
Finance controls	Exposure limits, approval thresholds, transaction rules	Counterparty networks, account relationships, escalation paths	Supports auditability and exception handling
Operations safety	Sensor thresholds, checklist rules	Equipment dependencies, process flows, incident chains	Helps detect risks without relying on free-form model judgment

The architecture implies a practical division of labor:

Layer	What it should do	What it should not do
Relational store	Enforce numeric and categorical constraints	Pretend to understand every semantic relationship
Graph store	Traverse entity relationships and dependencies	Replace exact threshold checks
Mapping layer	Keep stores aligned and auditable	Become an undocumented glue layer
Retrieval and curation logic	Select the applicable evidence	Dump long documents into the model
LLM	Decompose, explain, synthesize, and surface gaps	Invent facts or silently resolve missing data

This is where the paper becomes relevant to Cognaptus-style business automation. In high-stakes AI, the question is not only “Can the model answer?” The better question is: what must be true before the model is allowed to answer?

The answer usually includes controlled knowledge stores, deterministic checks, traceable evidence, and explicit missing-information states. Not as decoration. As the operating system.

What the paper directly shows, what we can infer, and what remains uncertain

The paper directly shows three things.

First, section-aware, schema-guided extraction can build a higher-quality pharmaceutical knowledge base than zero-shot OpenIE or one-shot schema-guided baselines on the authors’ 100-document benchmark.

Second, the proposed auditing method achieves a better precision-recall balance than both human experience review and traditional CDSS on the reported 100 real inpatient prescription sets.

Third, the ablation suggests that both external knowledge and the structured CoV workflow materially improve performance compared with zero-shot reasoning and full-text RAG-style reasoning.

What Cognaptus can reasonably infer is broader but not unlimited. The hybrid architecture is likely valuable in domains where rules are partly numerical, partly relational, and partly textual. That includes compliance, claims review, contract operations, finance controls, and regulated workflow auditing. The principle travels because the computational problem travels.

What remains uncertain is deployment robustness.

The real-world clinical test is meaningful but still limited in scale. The HPKB construction benchmark covers 100 pharmaceutical documents. The ablation uses synthetic cases, useful for mechanism testing but not equivalent to long-term hospital deployment. The system also needs richer clinical procedural knowledge to avoid false positives in cases where formal documentation does not capture accepted routine use.

There is another practical boundary: governance cost. Building a hybrid knowledge base with expert-reviewed schema refinement is not free. It requires domain experts, extraction validation, provenance management, local policy adaptation, and maintenance as guidelines change. That cost may be justified in safety-critical workflows. It may be overkill for low-risk summarization tasks.

In other words, this is not a universal recipe for every chatbot. It is a recipe for cases where being wrong is expensive, embarrassing, regulated, or all three. A surprisingly large share of enterprise AI lives there, whether vendors admit it or not.

The pharmacist’s mindset is verification before fluency

The paper’s most useful idea is not the acronym HPKB, the VKG framing, or even the Chain of Verification label. The useful idea is the role change.

The LLM is not treated as the authority. It is treated as an interface over structured authority.

That sounds modest. It is actually a major correction to how many AI systems are built. Too many products still begin with the model and then add retrieval, logging, and guardrails after the fact. PharmGraph-Auditor points in the opposite direction: begin with the structure of the domain, decide which knowledge belongs in which store, preserve provenance, retrieve only applicable evidence, and then let the model explain.

That is why the pharmacist metaphor works. A good pharmacist does not merely know drug names. A good pharmacist checks, cross-checks, notices missing information, distinguishes formal contraindications from practical routines, and refuses to guess when the evidence is insufficient.

AI systems in serious business workflows need the same temperament.

Less improvisation. More verification.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, and Guisheng Fan, “A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification,” arXiv:2603.10891, 2026. https://arxiv.org/abs/2603.10891
