Pills, Protocols, and Parameters: When LLMs Sit the Pharmacist Exam

Exam rooms are wonderfully unsentimental. They do not care whether a model has a charming interface, a dramatic launch story, or a fan base that treats benchmark tables like sports scores. They ask a question, demand an answer, and mark it right or wrong.

That makes professional licensing exams tempting AI benchmarks. A pharmacist licensure exam, in particular, looks like a clean test of whether a large language model can handle the kind of knowledge society actually cares about: drugs, laws, prescriptions, clinical judgment, and the delicate art of not confidently recommending something dangerous. Minor detail.

A recent paper, Assessing LLMs’ Performance: Insights from the Chinese Pharmacist Exam, compares DeepSeek-R1 and ChatGPT-4o on 2,306 text-only questions from China’s pharmacist licensure examination between 2017 and 2021.¹ The headline result is simple: DeepSeek-R1 achieves 90.0% overall accuracy, while ChatGPT-4o reaches 76.1%. The difference is statistically significant, with the authors reporting $\chi^2(1, N = 2306) = 159.66$, $p < 2.2 \times 10^{-16}$.
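For readers who want to see what sits behind a statistic like that, here is a minimal sketch of a Pearson chi-square test on a 2×2 table (model × correct/incorrect). The counts of 2,076 and 1,754 correct answers are assumptions: they are not quoted in this article and were chosen only because they round to the reported 90.0% and 76.1%. Whether the authors applied a continuity correction is also unknown, so treat this as an illustration rather than a reconstruction of their analysis.

```python
import math

def chi_square_2x2(correct_a, correct_b, n):
    """Pearson chi-square (no continuity correction) for two models scored on n items each."""
    observed = [
        [correct_a, n - correct_a],
        [correct_b, n - correct_b],
    ]
    total = 2 * n
    col_totals = [correct_a + correct_b, total - correct_a - correct_b]
    chi2 = 0.0
    for row in observed:
        for j, obs in enumerate(row):
            # Each row has n items, so the expected cell count is n * column share.
            expected = n * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

N_QUESTIONS = 2306
DEEPSEEK_CORRECT = 2076  # assumed count, rounds to 90.0%
CHATGPT_CORRECT = 1754   # assumed count, rounds to 76.1%

chi2, p = chi_square_2x2(DEEPSEEK_CORRECT, CHATGPT_CORRECT, N_QUESTIONS)
print(f"chi2 = {chi2:.2f}, p = {p:.1e}")
```

Under these assumed counts the statistic lands at about 159.66, in line with the reported value, and the p-value is far below the $2.2 \times 10^{-16}$ threshold.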

That headline is real. It is also the least interesting way to read the paper.

The better reading is category-based. The exam is not one task. It is four different task families wearing the same institutional uniform: pharmaceutical foundations, pharmaceutical legislation and ethics, dispensing and prescription review, and clinical integration. Each category exposes a different relationship between language, professional knowledge, regulation, and judgment.

For business readers, that distinction matters more than the model ranking. The practical question is not “Which model won?” It is: Which parts of professional education can AI support safely, which parts need structured human review, and which parts should not be converted into automated answer machines just because a score looks impressive?

The study tests four kinds of competence, not one pharmacist-shaped blob

The paper’s dataset contains 2,306 Chinese-language questions: 2,114 single-choice items and 192 multiple-choice items. The authors excluded questions involving images or tables, then fed each remaining question to both models in its original Chinese format. The models received a consistent system-level prompt asking them to answer official licensing exam questions as medical experts, using pharmacological principles, regulatory knowledge, and clinical reasoning where appropriate.

Answers were evaluated by exact match against official keys. For multiple-choice questions, partial correctness did not count. That is a strict but sensible rule: in medication-related settings, “mostly right” can become a rather expensive philosophical position.
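The exact-match rule is easy to state in code. The sketch below is an assumption about how such scoring could be implemented, not the authors' script; the answer-string format, the helper names, and the three sample items are all hypothetical.

```python
import string

def normalize(answer: str) -> frozenset:
    """Turn an answer string such as 'a, C,D' into a set of option letters."""
    return frozenset(ch for ch in answer.upper() if ch in string.ascii_uppercase)

def exact_match(predicted: str, key: str) -> bool:
    """True only if the predicted option set equals the official key; no partial credit."""
    return normalize(predicted) == normalize(key)

def accuracy(predictions, keys):
    """Share of items answered exactly right."""
    hits = sum(exact_match(p, k) for p, k in zip(predictions, keys))
    return hits / len(keys)

# Hypothetical items: one single-choice, two multiple-choice.
keys = ["B", "ACD", "AE"]
predictions = ["B", "AC", "EA"]  # second answer misses option D

print(accuracy(predictions, keys))
```

Option order does not matter ("EA" matches "AE"), but a missing or extra option turns the whole item into a miss.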

The four exam units matter because they correspond to different operational uses of AI:

Exam unit	What it mainly tests	What the paper shows	Practical interpretation
Unit 1: Pharmaceutical Foundations	Pharmacology, chemistry, pharmaceutics, factual recall	DeepSeek-R1 strongly outperforms ChatGPT-4o, but Unit 1 is also the weakest category for both models in intra-model analysis	Useful for review, quiz generation, and knowledge checks, but still vulnerable to precise factual gaps
Unit 2: Pharmaceutical Legislation and Ethics	Legal language, professional ethics, region-specific rules	DeepSeek-R1 has large and mostly significant advantages in table-level comparisons	Good candidate for guided learning support, not final legal interpretation
Unit 3: Dispensing and Prescription Review	Case-based practical judgment, contraindications, counseling, prescription errors	Performance gaps are often not statistically significant year by year	Suggests rule-pattern judgment is competitive, but business use needs audit trails
Unit 4: Clinical Integration	Comprehensive case synthesis and guideline discernment	DeepSeek-R1 performs especially strongly; ChatGPT-4o is weaker than in Unit 3	Strong potential for structured case training, but risky as a clinical authority

This is why the paper is useful. It does not merely say “DeepSeek-R1 scored higher.” It lets us separate AI use cases by cognitive category.

That separation prevents two common mistakes. The first is assuming a high overall score means the model can safely grade, tutor, or advise across all pharmacist tasks. The second is assuming “judgment” is one mysterious human-only thing. The paper suggests something subtler: some judgment-like tasks are actually structured pattern applications, while others require current guideline prioritization, regulatory context, and professional accountability.

Unit 1: factual recall is easy until the fact is the whole game

Unit 1 covers pharmaceutical foundations: pharmacology, chemistry, pharmaceutics, and other core knowledge. On the surface, this should be the easy category for LLMs. The answer is often a concrete fact or relationship. No one needs a grand theory of professional ethics to rank antitussive potency.

And yet Unit 1 is where both models reveal a useful weakness.

DeepSeek-R1 significantly outperforms ChatGPT-4o across all five years in Unit 1. The gap is large: for example, in 2017, DeepSeek-R1 answers 106 of 120 correctly, while ChatGPT-4o answers 60 of 120 correctly. In 2018, the comparison is 108 of 119 versus 68 of 119. The model ranking is not subtle.

But the paper’s intra-model analysis adds a twist. Unit 1 is DeepSeek-R1’s weakest aggregate category at 85.1%, significantly lower than its performance in Unit 2 and Unit 4. For ChatGPT-4o, Unit 1 is also the weakest category, at 59.8%, significantly lower than all other units.

That means foundational knowledge is not automatically “solved” by a general language model, and not fully solved even by the better-performing model in this study. The category looks simple because the exam item is simple. But if the question depends on a precise pharmacological relationship, there is not much room to recover by sounding reasonable.

The paper’s first qualitative case illustrates this. The question asks about the comparative potency of antitussives. The correct order is Benproperine > Codeine > Pentoxyverine. ChatGPT-4o selects the wrong option; DeepSeek-R1 selects the correct one and provides a more accurate pharmacological explanation.

This is the first business lesson: AI can be useful for factual review, but factual review is not low-risk just because the interaction is educational.

In a pharmacy education platform, a model like this could help generate practice questions, explain answer choices, and identify weak knowledge areas. That is a plausible product use. But if the system is wrong on a foundational drug fact, the error is not merely cosmetic. It can train the learner into a false memory with a fluent explanation attached. The danger is not that the model says “I do not know.” The danger is that it knows incorrectly, with excellent grammar.

Unit 2: legal and ethical questions need context, not just answer keys

Unit 2 covers pharmaceutical legislation and ethics. In the paper’s table-level results, DeepSeek-R1 consistently outperforms ChatGPT-4o here as well. The differences are statistically significant across all five years. For example, DeepSeek-R1 scores 113/120 in 2017 and 107/120 in 2021, while ChatGPT-4o scores 94/120 and 82/120 in those same years.
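To put the quoted counts for Units 1 and 2 on a common scale, the short sketch below converts them to accuracies and percentage-point gaps. It uses only the counts quoted in this article; the variable and function names are illustrative.

```python
# (unit, year) -> (DeepSeek-R1 correct, ChatGPT-4o correct, total items), as quoted in the text
QUOTED_COUNTS = {
    ("Unit 1", 2017): (106, 60, 120),
    ("Unit 1", 2018): (108, 68, 119),
    ("Unit 2", 2017): (113, 94, 120),
    ("Unit 2", 2021): (107, 82, 120),
}

def gap_in_points(deepseek, chatgpt, total):
    """Accuracy gap between the two models, in percentage points."""
    return 100 * (deepseek - chatgpt) / total

for (unit, year), (ds, gpt, n) in QUOTED_COUNTS.items():
    print(
        f"{unit} {year}: {100 * ds / n:.1f}% vs {100 * gpt / n:.1f}% "
        f"(gap {gap_in_points(ds, gpt, n):.1f} points)"
    )
```

On these four cells the Unit 1 gaps (roughly 34 to 38 points) are noticeably wider than the Unit 2 gaps (roughly 16 to 21 points).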

The temptation is to treat this as another win for a Chinese-language or locally aligned model. That is partly fair, but it is not enough.

Legal and ethical exam questions are not just semantic puzzles. They sit inside a live institutional environment. Regulations change. Local interpretation matters. Enforcement norms may vary. A model can learn legal language patterns, but that does not make it a legal authority.

For business deployment, Unit 2 suggests a different product design from Unit 1. In foundations, the model can help students memorize and test factual material. In legislation and ethics, the model should behave more like a guided explainer:

Bad deployment pattern	Better deployment pattern
“Here is the correct legal answer.”	“Here is the likely answer under the exam rule, with the relevant principle and possible ambiguity.”
“This regulation means X.”	“This rule is commonly interpreted as X in this context; confirm against current local policy.”
“Automated legal grading.”	“Instructor-reviewed formative feedback with source links and uncertainty flags.”

This difference matters because education platforms are often rewarded for speed: instant feedback, instant scoring, instant remediation. But speed can flatten legal nuance. A pharmacist does not only need to recognize the correct option on an exam. They need to understand how a rule applies when local practice, institutional policy, and patient risk collide.

So yes, the result supports AI-assisted learning for regulatory content. It does not support replacing instructors, compliance officers, or legal reviewers. The model can map the room. It should not pretend to own the building.

Unit 3: “judgment” is sometimes structured pattern recognition

Unit 3 is the most interesting category because it refuses to behave like a simple “AI cannot do judgment” story.

This unit covers dispensing and prescription review: identifying drug interactions, spotting dosing issues, recognizing contraindications, and handling patient counseling scenarios. Many readers would expect this to be where a general-purpose model struggles most.

The paper finds something more nuanced. DeepSeek-R1 still performs strongly, but the year-by-year differences from ChatGPT-4o are often not statistically significant. In four out of five years—2017, 2018, 2020, and 2021—the Unit 3 difference does not reach significance. Only 2019 shows a significant gap.

The authors’ interpretation is sensible: the “judgment” tested in this unit may be less about open-ended moral reasoning and more about applying structured professional patterns. If a prescription scenario contains a known contraindication, a dosing issue, or a counseling cue, the model’s job is to map the case to a learned rule pattern.

That is exactly the kind of work LLMs can sometimes do surprisingly well. Not because they possess professional responsibility, but because many professional decisions contain recurring linguistic and procedural signatures.

For business use, this category points toward workflow assistance rather than autonomous decision-making. A pharmacy training system could ask a model to:

generate prescription-review cases;
highlight possible interaction risks;
compare a learner’s reasoning with a reference explanation;
create “what changed?” variants of a case;
produce reflective prompts after an incorrect answer.

But the system should not silently turn the model into a dispensing oracle. The difference is operational, not decorative. A “suggestion engine” supports the pharmacist-in-training. An “answer engine” invites overreliance.

The paper itself uses similar language in its discussion: LLMs are better positioned as suggesters or reflectors, not infallible judges. That is not academic politeness. It is the deployment boundary.

Unit 4: clinical integration is where model alignment looks valuable—and dangerous

Unit 4 focuses on clinical integration: case-based synthesis, treatment planning, and guideline-sensitive reasoning. This is where DeepSeek-R1 looks especially strong. The paper reports DeepSeek-R1’s aggregate Unit 4 accuracy at 93.7%, its highest category, significantly better than its Unit 1 performance. ChatGPT-4o reaches 74.2% in Unit 4, lower than its Unit 3 performance and statistically indistinguishable from Unit 2.

The second qualitative case shows why the category matters. The question concerns hormone replacement therapy and asks for the incorrect statement. The correct answer is the claim that HRT reduces cardiovascular risk. ChatGPT-4o selects the wrong option. DeepSeek-R1 selects the correct answer and explicitly refers to the Women’s Health Initiative trial.

This is not merely a “medical fact” case. It is a guideline-prioritization case. HRT has a history of shifting evidence and conflicting claims. A model trained or aligned around outdated associations may produce a plausible but dangerous answer. The issue is not whether it can write a coherent explanation. The issue is whether it can privilege the right evidence at the right time.

For AI product teams, Unit 4 is the most seductive category. High performance here suggests obvious applications: clinical case simulation, adaptive tutoring, reasoning feedback, guideline-based question generation, and continuing education support for pharmacists outside major urban centers.

It is also where the risk is easiest to underestimate.

A model that performs well on a text-only licensing question is not necessarily ready for real clinical cases involving lab values, images, tables, missing data, patient preferences, institutional constraints, or rapidly updated guidelines. The paper excluded image and table questions. It also did not test live retrieval against current clinical sources. So the valid business inference is narrower:

DeepSeek-R1-style systems may support structured clinical education in Chinese pharmacist training, especially for case explanation and formative assessment. They do not become clinical decision systems merely by performing well on exam questions.

That boundary is not a footnote. It is the product specification.

Multiple-choice questions weaken the leaderboard story

The most important brake on the headline result comes from the multiple-choice subset.

The study includes 192 multiple-choice questions, each requiring the full correct set of answers. These are harder because the model must identify every correct option and avoid every incorrect one. Partial correctness gets no credit. In practical terms, this is closer to many real professional tasks: one missed contraindication or one extra false assumption can change the outcome.

The paper reports that none of the unit-year multiple-choice comparisons between DeepSeek-R1 and ChatGPT-4o reach statistical significance. The authors attribute this partly to small sample sizes, with no more than 10 MCQs per unit-year cell.
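A quick way to see why cells this small rarely reach significance is to run an exact test on made-up scores. The sketch below uses a two-sided Fisher exact test, which is my choice for illustration and not necessarily the test the authors used; the 10-item cell and every score in it are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a_correct, b_correct, n):
    """Two-sided Fisher exact p-value for two models scored on n items each."""
    total_correct = a_correct + b_correct

    def weight(k):
        # Unnormalized hypergeometric weight: model A holds k of the correct answers.
        return comb(n, k) * comb(n, total_correct - k)

    lo, hi = max(0, total_correct - n), min(n, total_correct)
    weights = {k: weight(k) for k in range(lo, hi + 1)}
    observed = weights[a_correct]
    # Sum every outcome that is no more likely than the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / sum(weights.values())

# Hypothetical scores in a single 10-item cell.
for a, b in [(8, 5), (9, 5), (10, 5)]:
    p = fisher_exact_two_sided(a, b, 10)
    print(f"{a}/10 vs {b}/10: p = {p:.3f}")
```

In this toy setup, 8/10 versus 5/10 and even 9/10 versus 5/10 stay above the conventional 0.05 threshold; only a perfect 10/10 against 5/10 drops below it.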

This makes the MCQ analysis a robustness warning, not a second thesis. The main evidence still supports DeepSeek-R1’s overall advantage. But the MCQ subset says: be careful when the task requires complete set reasoning, multi-step synthesis, and exact exclusion of distractors.

There is also a small reporting wrinkle worth noting. The paper’s prose reports MCQ accuracy as 58.9% for DeepSeek-R1 and 53.6% for ChatGPT-4o. Summing the printed Table 2 gives ChatGPT-4o 103/192, matching about 53.6%, but gives DeepSeek-R1 118/192, about 61.5%. That discrepancy does not change the interpretation: DeepSeek-R1 appears directionally better, but the MCQ comparisons are not statistically resolved.
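The arithmetic behind that wrinkle is short enough to check directly. The sketch uses only the figures quoted above; the implied count for 58.9% is derived here, not taken from the paper.

```python
MCQ_TOTAL = 192

def pct(correct, total=MCQ_TOTAL):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

chatgpt_from_table = 103   # sum of printed Table 2 cells, as quoted above
deepseek_from_table = 118  # sum of printed Table 2 cells, as quoted above

print(pct(chatgpt_from_table))   # 53.6, matches the prose
print(pct(deepseek_from_table))  # 61.5, does not match the 58.9% in the prose

# Number of correct answers that the prose figure of 58.9% would imply.
deepseek_implied = round(0.589 * MCQ_TOTAL)
print(deepseek_implied, pct(deepseek_implied))  # 113 58.9
```

So the prose figure corresponds to 113 correct answers, five fewer than the table sum of 118.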

For business use, this is the part to remember. Single-answer educational tasks may make a model look stable. Multi-answer tasks reveal whether it can maintain constraint discipline. Many real workflows are multi-answer workflows disguised as short questions:

Which drugs are contraindicated?
Which counseling points must be mentioned?
Which regulatory conditions apply?
Which patient factors change the recommendation?
Which evidence is outdated and should be discounted?

A model that performs well when selecting one answer may still struggle when the correct output is a complete, bounded set. That is where evaluation should move next.
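One way an evaluation could look beyond a bare right-or-wrong mark on set-valued answers is to record what was missed and what was wrongly added. This is a sketch of that idea, not something the paper does; the function name and the counseling-point labels are hypothetical.

```python
def set_errors(predicted: set, required: set) -> dict:
    """Break a set-valued answer into missed items, extra items, and an exact-match flag."""
    return {
        "missed": sorted(required - predicted),  # required items the model left out
        "extra": sorted(predicted - required),   # items the model added that do not belong
        "exact": predicted == required,
    }

# Hypothetical counseling-point question.
required_points = {"take with food", "avoid alcohol", "report dizziness"}
model_points = {"take with food", "avoid alcohol", "store in fridge"}

report = set_errors(model_points, required_points)
print(report)
```

Tracking missed and extra items separately shows whether a model tends to under-answer or over-answer, which a single exact-match score hides.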

What the paper directly shows, and what business readers may infer

The paper directly shows that, on this dataset of 2,306 text-only Chinese pharmacist licensure questions, DeepSeek-R1 substantially outperforms ChatGPT-4o in overall accuracy. It also shows that the advantage is not uniform across task types. It is most consistent in Units 1 and 2, where the gap is significant in all five years, and large in Unit 4 on aggregate (93.7% versus 74.2%). It is statistically unresolved in four of the five Unit 3 year-wise comparisons and throughout the MCQ subset.

Cognaptus would infer three business implications.

First, localized and domain-aligned models deserve serious consideration in professional education. The result challenges the lazy assumption that the globally famous general-purpose model is always the safest default. In Chinese-language pharmacy education, linguistic fit, exam style, regulatory context, and domain terminology matter.

Second, AI education products should be designed by task category. Factual recall, regulation, prescription review, and clinical integration should not share the same risk model. They require different interface designs, review requirements, and feedback styles.

Third, formative assessment is the near-term opportunity. The paper’s strongest practical pathway is not automated licensing or autonomous clinical recommendation. It is personalized quizzes, staged feedback, peer benchmarking, scenario variation, and instructor-supported learning. That may sound less glamorous than “AI pharmacist.” It is also far more deployable, which is a useful property in a product.

A simple deployment map would look like this:

Use case	Fit suggested by the paper	Required guardrail
Factual knowledge review	High	Verified answer bank and error monitoring
Quiz generation	High	Instructor sampling and source checking
Regulatory explanation	Medium	Current policy references and local review
Prescription-review training	Medium to high	Human-reviewed reasoning trails
Clinical case simulation	Medium to high	Guideline updates, expert review, and uncertainty labels
Automated licensure grading	Low	Not supported by the study
Clinical decision substitution	Very low	Not supported and ethically unsafe
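A product team could encode this map as configuration so that every feature request is checked against its risk tier. The sketch below is one hypothetical way to do that; the class, the function names, and the rule that "Low" and "Very low" fits are blocked are assumptions layered on top of the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UseCase:
    name: str
    fit: str
    guardrail: str

# Rows copied from the deployment map above.
DEPLOYMENT_MAP = [
    UseCase("Factual knowledge review", "High", "Verified answer bank and error monitoring"),
    UseCase("Quiz generation", "High", "Instructor sampling and source checking"),
    UseCase("Regulatory explanation", "Medium", "Current policy references and local review"),
    UseCase("Prescription-review training", "Medium to high", "Human-reviewed reasoning trails"),
    UseCase("Clinical case simulation", "Medium to high",
            "Guideline updates, expert review, and uncertainty labels"),
    UseCase("Automated licensure grading", "Low", "Not supported by the study"),
    UseCase("Clinical decision substitution", "Very low", "Not supported and ethically unsafe"),
]

# Assumption: only these fit levels are eligible for deployment at all.
DEPLOYABLE_FITS = {"High", "Medium to high", "Medium"}

BY_NAME = {uc.name: uc for uc in DEPLOYMENT_MAP}

def is_supported(name: str) -> bool:
    """True if the use case is listed and its fit level is deployable."""
    use_case = BY_NAME.get(name)
    return use_case is not None and use_case.fit in DEPLOYABLE_FITS

def required_guardrail(name: str) -> str:
    """Return the guardrail for a supported use case, or raise if it is blocked."""
    if not is_supported(name):
        raise ValueError(f"Use case not supported: {name}")
    return BY_NAME[name].guardrail

print(required_guardrail("Quiz generation"))
print(is_supported("Automated licensure grading"))
```

Unlisted use cases are treated as unsupported by default, which keeps new features from bypassing the map.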

The boring phrase here is “human oversight.” The useful version is more specific: human oversight should sit where the task category creates legal ambiguity, patient risk, or evidence-update risk. It should not be sprinkled everywhere as a ritual disclaimer. Nobody needs a committee to check every flashcard. A clinical-guideline case about cardiovascular risk is different.

The boundaries are not cosmetic

The authors list several limitations, and they materially affect interpretation.

The study lacks a human performance benchmark. So the paper does not show whether either model “passes” relative to real pharmacist candidates. It compares two models against official answer keys. That is still valuable, but it is not the same as measuring professional readiness.

The models were tested through publicly available web interfaces, using default settings. That makes the experiment realistic for typical users but less deterministic as a benchmark. Non-zero temperature and platform-side changes may introduce variation. The authors argue that the overall gap of about 14 percentage points (90.0% versus 76.1%) is too large to be explained by randomness alone, which is reasonable. Still, the result is a snapshot, not a frozen property of either model.

The dataset is Chinese, text-only, and exam-based. Questions with tables or images were excluded. Real pharmacy practice often includes structured records, lab results, labels, screenshots, charts, patient histories, and institutional constraints. A model that handles text-only exam questions may not handle multimodal professional work.

The study also compares model versions available in early 2025. DeepSeek-R1 had recently been released; in the authors’ framing, ChatGPT-4o was not a Chinese-adapted, chain-of-thought-enhanced variant. This is a fair real-world comparison of the tools available at the time, but not a permanent verdict on model families.

Finally, the qualitative error analysis uses two representative cases. These are useful illustrations, not a systematic taxonomy of failure modes. A business team building a product from this evidence would still need its own error audit: by topic, risk type, guideline currency, wording sensitivity, and user behavior.

The real lesson: benchmark by professional function

The easiest article to write about this paper would be: “DeepSeek beats ChatGPT on Chinese pharmacist exam.” Accurate, clickable, and not very useful. A leaderboard headline tells us who scored higher. It does not tell us what to build.

The stronger lesson is that professional exams contain multiple AI markets inside one dataset. Foundational knowledge suggests review tools. Regulation suggests guided explanation. Dispensing and prescription review suggest reasoning support. Clinical integration suggests powerful case simulation—but also the highest need for evidence control.

For companies building AI education products, the implication is straightforward: do not sell a generic “AI tutor” into a high-stakes domain and hope the brand name carries the risk. Segment the curriculum. Assign risk tiers. Design the interface around whether the model is recalling, interpreting, suggesting, or synthesizing. Then evaluate each category separately.

The pharmacist exam is a useful benchmark precisely because it makes this segmentation visible. Pills, protocols, and parameters are not the same task. Treating them as one task is how a model score becomes a product mistake.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xinran Wang, Boran Zhu, Shujuan Zhou, Ziwen Long, Dehua Zhou, and Shu Zhang, “Assessing LLMs’ Performance: Insights from the Chinese Pharmacist Exam,” arXiv:2511.20526, https://arxiv.org/pdf/2511.20526.
