Training looks simple from far away. Put people in a room, give them scenarios, let an experienced instructor correct them, repeat until competence appears.

This is charming. It is also how organizations quietly discover that “human expertise” does not scale just because someone bought a learning management system.

The new PACE paper, PACE: A Personalized Adaptive Curriculum Engine for 9-1-1 Call-taker Training, studies a very specific version of this problem: training emergency call-takers, the people who answer 9-1-1 calls before police, fire, or medical responders enter the scene.[1] The paper’s setting is unusually useful because the stakes are high, the skill structure is complex, and the training bottleneck is not vague. A call-taker must master more than a thousand interdependent procedural skills across 63 incident types. A missed question or wrong instruction can cascade across an entire protocol.

The tempting interpretation is to say: “Here is another LLM tutor for emergency dispatch.” That would be the easy article. It would also miss the point.

PACE is not mainly a role-play chatbot. It is not mainly an automated debriefing assistant. Its more interesting contribution is a curriculum-control layer: a system that estimates what a trainee probably knows, estimates what they may forget, and chooses what they should practice next.

That distinction matters. The business value is not “AI can talk to trainees.” The business value is cheaper diagnosis, better sequencing, and less instructor time wasted on reconstructing a trainee’s hidden skill state from fragments of past performance. In other words: PACE is less a tutor and more a dispatch system for training itself.

The bottleneck is not content; it is curriculum choice under uncertainty

Emergency-call training is not just a content-delivery problem. The trainee does not merely need to “know medical calls” or “practice traffic accidents.” They need to know which question becomes mandatory after a condition is established, which instruction follows from which caller answer, and which downstream branch collapses if an earlier assessment is wrong.

The paper’s motivating study makes this concrete. The authors analyze a local call-taking training manual and 923 training session logs containing transcripts, evaluation rubrics, and trainer feedback. They report an average turnaround time of 11.58 minutes for a simulation-debriefing cycle. With a typical structure of one trainer for about 12 trainees, at least 3 sessions per trainee, and at least 12 calls per session, full review coverage would require more than 83 hours of review per day.
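The 83-hour figure can be reproduced from the numbers quoted above. The sketch below is only a back-of-the-envelope check: it treats the "at least" counts as exact and assumes they describe one day's workload for a single trainer, which is how the per-day figure reads. The function and constant names are illustrative.

```python
# Back-of-the-envelope check of the review workload quoted above.
TRAINEES_PER_TRAINER = 12
SESSIONS_PER_TRAINEE = 3      # "at least 3", treated as exactly 3 here
CALLS_PER_SESSION = 12        # "at least 12", treated as exactly 12 here
MINUTES_PER_CYCLE = 11.58     # average simulation-debriefing turnaround


def review_hours(trainees, sessions, calls, minutes_per_cycle):
    """Total trainer hours needed to review every call once."""
    total_calls = trainees * sessions * calls
    return total_calls * minutes_per_cycle / 60


hours = review_hours(TRAINEES_PER_TRAINER, SESSIONS_PER_TRAINEE,
                     CALLS_PER_SESSION, MINUTES_PER_CYCLE)
print(f"{hours:.1f} hours")  # 83.4 hours
```

That is 432 calls at 11.58 minutes each, for one trainer, in one day.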

That number is useful because it removes the decorative language around “personalization.” Personalization is not hard because organizations lack empathy. It is hard because someone has to inspect many calls, map each error to a protocol dependency, decide whether the error reflects a missing concept or a one-time slip, and then choose the next simulation. Repeat that across a cohort and suddenly the “human-centered training model” starts looking like a spreadsheet with a panic disorder.

The second observation is more important: localized skill gaps cascade. The authors use 857 discrete assessment checkpoints from the procedural manual, generate 500 synthetic call instances, and simulate removing individual skills. Removing a single skill orphaned, on average, 11.17% of downstream checkpoints, or 96 out of 857, and caused 48.80% of complete call evaluations to fail.
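To make the cascade idea concrete, here is a minimal sketch on a toy prerequisite graph. Everything in it is an assumption for illustration: the skill names are invented, and the rule that a checkpoint is "orphaned" when it is reachable downstream of the removed skill is a simplification, not necessarily the paper's exact definition.

```python
from collections import deque

# Hypothetical toy prerequisite graph: skill -> checkpoints that depend on it.
TOY_GRAPH = {
    "assess_consciousness": ["check_breathing", "ask_age"],
    "check_breathing": ["start_cpr_instructions", "position_airway"],
    "start_cpr_instructions": ["count_compressions"],
    "ask_age": [],
    "position_airway": [],
    "count_compressions": [],
    "ask_location": [],
}


def orphaned_by(graph, removed):
    """Return every checkpoint reachable downstream of the removed skill."""
    seen, queue = set(), deque(graph.get(removed, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def mean_orphan_fraction(graph):
    """Average share of the other checkpoints orphaned per single removal."""
    others = len(graph) - 1
    return sum(len(orphaned_by(graph, s)) / others for s in graph) / len(graph)
```

In this toy graph, removing `assess_consciousness` orphans five of the six other checkpoints, while removing `ask_location` orphans none. The average hides that asymmetry, which is the point: a few foundational skills carry most of the risk.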

This is why coarse training labels are dangerous. A trainee may appear broadly competent in “medical calls” while still failing a foundational consciousness-assessment step that affects cardiac arrest, choking, drowning, overdose, and trauma scenarios. The category label hides the actual operational risk.

PACE begins from that observation. If the skill space is structured, then training should not be selected from a flat menu of scenarios. It should be selected from a graph of dependencies.

PACE models training as a graph, not a playlist

The paper formalizes call-taking knowledge as a directed graph with 1,053 nodes and 1,283 edges across 63 incident types. The nodes come in three types:

| Node type | What it represents | Operational meaning |
| --- | --- | --- |
| Condition nodes | Incident-state premises | Facts established during the call |
| Question nodes | Information-gathering skills | What the call-taker must ask |
| Instruction nodes | Directive-delivery skills | What the call-taker must tell the caller |

The edges encode procedural order, prerequisite relationships, implication links, and entailment links. This matters because skill mastery is not independent. If a trainee demonstrates a related skill in one scenario, the system may have partial evidence about another skill that has not yet been directly observed.

This is the first mechanism in PACE: belief tracking over a skill graph.

For each skill node, PACE keeps a Beta posterior representing uncertainty over mastery. A correct, incorrect, or partial observation updates the posterior. The update is weighted by the type of evidence. Prompted success counts less than independent success. Misconceptions are penalized more heavily than slips. This is a small but important design choice: not every error carries the same pedagogical meaning.
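A weighted Beta update of this kind might look like the sketch below. The weights are hypothetical placeholders chosen only to respect the ordering described above (prompted success below independent success, misconception above slip); the paper's actual values are not given here, and partial-credit observations are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical evidence weights; only their ordering follows the text above.
EVIDENCE_WEIGHTS = {
    "independent_success": 1.0,
    "prompted_success": 0.5,
    "slip": 0.5,
    "misconception": 1.5,
}


@dataclass
class SkillBelief:
    alpha: float = 1.0  # pseudo-count of successes (uniform prior)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, evidence: str) -> None:
        """Add weighted evidence to the success or failure pseudo-count."""
        weight = EVIDENCE_WEIGHTS[evidence]
        if evidence.endswith("success"):
            self.alpha += weight
        else:
            self.beta += weight

    @property
    def mean(self) -> float:
        """Posterior mean: the current point estimate of mastery."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Posterior variance: how unsure the system still is."""
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1))
```

Keeping the variance alongside the mean is the useful part. Two trainees can share the same estimated mastery while one estimate rests on twenty observations and the other on two.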

PACE then propagates evidence to similar nodes. Similarity is computed by combining semantic embeddings of node descriptions with positional compatibility in the protocol graph. In plain language, performance on one skill can inform beliefs about nearby or semantically related skills, but PACE does not treat every textually similar item as equally transferable. A question asked early in a call and an instruction issued near the end may be semantically related but pedagogically different.
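One way to combine the two signals is sketched below. The multiplicative form, the exponential position kernel, the `scale` value, and the idea of a position normalized to the range 0 to 1 are all assumptions made for illustration; the description above only says that semantic similarity and positional compatibility are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def positional_compatibility(pos_a, pos_b, scale=0.25):
    """1.0 for the same relative protocol position, decaying with distance."""
    return math.exp(-abs(pos_a - pos_b) / scale)


def transfer_weight(emb_a, emb_b, pos_a, pos_b):
    """Semantic similarity gated by how close the nodes sit in the protocol."""
    semantic = max(0.0, cosine(emb_a, emb_b))
    return semantic * positional_compatibility(pos_a, pos_b)


# Two nodes with identical (made-up) embeddings: one pair adjacent in the
# protocol, one pair at opposite ends of the call.
near = transfer_weight([0.2, 0.9], [0.2, 0.9], pos_a=0.10, pos_b=0.15)
far = transfer_weight([0.2, 0.9], [0.2, 0.9], pos_a=0.10, pos_b=0.90)
```

With these toy inputs `near` is much larger than `far`, even though the text of the two nodes is identical. That is the behavior the paragraph above describes: textual similarity alone does not earn full transfer.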

This is where the system becomes more interesting than a transcript summarizer. A transcript summarizer says, “The trainee missed X.” PACE asks, “If the trainee missed X, what else should we now suspect, and what should we practice next?”

That is curriculum intelligence.

The system has to remember that humans forget

The second mechanism is temporal. PACE estimates two trainee-specific parameters:

| Parameter | Meaning | Why it matters |
| --- | --- | --- |
| $\lambda$ | Learning pace | How quickly a trainee gains mastery from practice |
| $\psi$ | Forgetting rate | How quickly mastery decays over time |

The forgetting model uses a power-law decay form:

$$ \theta_v^{(\tau+\Delta\tau_v)} = \theta_v^{(\tau)} \cdot (1 + \kappa \cdot \Delta\tau_v)^{-\psi} $$

The practical point is simple. A skill that was “mastered” two weeks ago may no longer be equally safe to assume, especially for a quick forgetter. If the curriculum engine only tracks whether a skill was once demonstrated, it will schedule as if memory were a permanent database. Humans, inconveniently, are not databases.
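The decay formula translates directly into a few lines of code. In this sketch, `theta` is read as the mastery estimate for one skill and `elapsed` as the time since that estimate was made; that reading of the symbols, the time unit, and every numeric value below are assumptions for illustration, not figures from the paper.

```python
def decayed_mastery(theta, elapsed, psi, kappa=1.0):
    """Power-law decay: theta * (1 + kappa * elapsed) ** (-psi)."""
    if elapsed < 0:
        raise ValueError("elapsed time must be non-negative")
    return theta * (1.0 + kappa * elapsed) ** (-psi)


# Same starting mastery, same two-week gap, different forgetting rates.
# The psi values are hypothetical.
slow_forgetter = decayed_mastery(theta=0.9, elapsed=14, psi=0.1)
quick_forgetter = decayed_mastery(theta=0.9, elapsed=14, psi=0.4)
```

Both trainees "mastered" the skill at 0.9. After the same gap, the quick forgetter's estimate has fallen far below the slow forgetter's, which is why a single population-average forgetting rate would schedule one of them badly.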

This is the weakness of many LLM tutoring systems. They can sustain a conversation. They may even produce a nice learner profile. But long-term training requires more than a profile. It requires a model of time. A 6-to-8-week 9-1-1 training program with more than 200 simulated calls per trainee cannot be handled as one long chat history, unless the goal is to perform interpretive dance inside a context window.

PACE separates the durable learner state from the dialogue. It stores beliefs, uncertainty, learning pace, forgetting rate, weak-skill averages, and forgetting-risk indicators. The LLM component extracts structured observations, but the curriculum decision is made by a persistent model.

That separation is one of the paper’s real business lessons. In enterprise AI, the expensive mistake is to ask the language model to be the entire system. PACE instead uses language modeling as one part of a larger control loop.
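As a rough picture of what "durable learner state" means in practice, the sketch below keeps the quantities listed above in a plain record that lives outside any chat history. The field names, default values, and thresholds are hypothetical; this is not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class TraineeState:
    """Durable learner state kept outside the dialogue (names hypothetical)."""
    beliefs: dict = field(default_factory=dict)  # skill id -> (alpha, beta)
    learning_pace: float = 1.0                   # lambda
    forgetting_rate: float = 0.2                 # psi
    sessions_completed: int = 0

    def mastery(self, skill):
        """Posterior mean for one skill; unseen skills get a uniform prior."""
        alpha, beta = self.beliefs.get(skill, (1.0, 1.0))
        return alpha / (alpha + beta)

    def weak_skill_average(self, threshold=0.5):
        """Mean mastery over skills currently below the threshold."""
        weak = [self.mastery(s) for s in self.beliefs
                if self.mastery(s) < threshold]
        return sum(weak) / len(weak) if weak else None

    def forgetting_risk_count(self, low=0.5, high=0.6):
        """Skills whose estimated mastery sits just above the threshold."""
        return sum(1 for s in self.beliefs if low <= self.mastery(s) < high)
```

The language model's job, in this picture, is to turn a transcript into updates to `beliefs`. The record itself survives across sessions, which a context window does not.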

Scenario selection becomes a bandit problem

Once the system has beliefs about skills and estimates of learning/forgetting dynamics, it still has to choose what to do next.

PACE frames scenario selection as a contextual bandit problem. At each session, it constructs a context vector containing features such as belief uncertainty, current coverage, estimated learning pace, estimated forgetting rate, average mastery of weak skills, number of skills near the forgetting threshold, and training progress. It then selects a batch of five scenarios from 297 operationally valid candidates.

The bandit uses Thompson Sampling, a Bayesian method for balancing exploitation and exploration. In this setting:

  • Exploitation means targeting suspected weaknesses.
  • Exploration means selecting scenarios that improve diagnostic coverage over uncertain regions of the skill graph.

This is not exploration for novelty’s sake. It is exploration because bad belief states lead to bad curriculum choices. If PACE does not know whether a trainee can handle a downstream protocol branch, it may need to test that region before making confident recommendations.
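Thompson Sampling itself is short enough to sketch. The version below is a deliberately simplified, non-contextual Beta-Bernoulli variant: it ignores the context vector that PACE conditions on, and the binary "this scenario was useful" reward is a hypothetical stand-in for whatever learning signal the real system uses. Only the batch size of five and the pool of 297 candidates come from the description above.

```python
import random


def select_batch(posteriors, batch_size=5, rng=random):
    """Thompson Sampling: draw one plausible value per scenario, keep the top few.

    posteriors maps scenario id -> (alpha, beta) over a hypothetical
    'this scenario produced useful learning' outcome.
    """
    draws = {sid: rng.betavariate(a, b) for sid, (a, b) in posteriors.items()}
    return sorted(draws, key=draws.get, reverse=True)[:batch_size]


def record_outcome(posteriors, scenario_id, useful):
    """Update the chosen scenario's posterior with a binary reward."""
    a, b = posteriors[scenario_id]
    posteriors[scenario_id] = (a + 1, b) if useful else (a, b + 1)


# 297 candidate scenarios, all starting from a uniform prior.
rng = random.Random(0)
posteriors = {f"scenario_{i}": (1.0, 1.0) for i in range(297)}
batch = select_batch(posteriors, rng=rng)
record_outcome(posteriors, batch[0], useful=True)
```

Because each choice comes from a random draw rather than the posterior mean, scenarios with wide, uncertain posteriors still get picked sometimes. Exploration falls out of the uncertainty instead of being bolted on as a separate rule.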

The paper’s behavioral analysis is designed to test exactly this. It compares PACE’s estimated mastery over 1,053 skill nodes with the simulated trainee agent’s ground-truth mastery. The approximation gap decreases over sessions across trainee archetypes. Fast learners stabilize around session 20, moderate learners around session 30, and struggling learners around session 40. The explore-exploit pattern also shifts as beliefs consolidate: when beliefs are unreliable, the system focuses on suspected weaknesses; when belief confidence improves, it expands coverage.

This is not just a nice algorithmic flourish. It means PACE is trying to solve two problems at once: teach the trainee and learn the trainee.

What the experiments actually show

The evaluation has three main layers, and they should not be mixed together.

| Evidence layer | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Simulated trainee experiments | Main controlled evidence | PACE improves learning metrics under known learner parameters | Real human trainees would improve by the same magnitude |
| Ablations | Mechanism validation | Graph propagation and dynamics estimation both contribute | Every implementation detail is optimal |
| Expert co-pilot study | Operational alignment | PACE recommendations often match training officers and reduce turnaround time | PACE independently replaces human trainers |

The main system-level comparison uses role-played trainee agents rather than real trainees. The authors justify this because real training spans weeks, ground-truth competence is latent, and counterfactual curriculum comparisons are hard: once a trainee receives instruction, you cannot rewind them into a clean alternative condition. The simulated agents have controllable ground-truth competency states, learning rates, forgetting rates, behavioral personas, and scratchpad memory. PACE and the baselines use OpenAI GPT-5, while the trainee agents are instantiated with Claude Sonnet 4.5 to avoid same-model contamination.

That design is reasonable for controlled analysis. It is also a boundary. The headline learning gains are simulated-learning results, not direct field-trial outcomes with human trainees.

The paper compares PACE against Round-Robin, Deficit-Driven selection, GraphRAG, GenMentor, Agent4Edu, and two PACE ablations: without propagation and without dynamics. The four trainee archetypes are Fast Learner, Moderate Learner, Struggling Learner, and Quick Forgetter.

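A few results matter more than the rest. In the table, Z2H is time-to-competence measured in sessions, C@50 is skill coverage at session 50, and RE is the random exam score: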

| Trainee type | Key PACE result | Best comparison point | Interpretation |
| --- | --- | --- | --- |
| Fast Learner | Z2H = 22.19 sessions; C@50 = 95.27% | Agent4Edu Z2H = 27.58; C@50 = 88.18% | PACE reaches competence faster and ends with broader mastery |
| Moderate Learner | C@50 = 91.51%; RE = 90.11 | Agent4Edu C@50 = 84.57%; RE = 86.17 | Gains persist beyond the easiest learner profile |
| Struggling Learner | C@50 = 86.91%; RE = 88.18 | Agent4Edu C@50 = 80.48%; RE = 81.15 | PACE helps, but the efficiency challenge remains harder |
| Quick Forgetter | C@50 = 91.22%; Z2H = 25.91; RE = 91.04 | Agent4Edu C@50 = 86.74%; Z2H = 32.55; RE = 85.78 | Explicit forgetting modeling matters most when decay is high |

For fast learners, the paper reports a 19.50% reduction in time-to-competence relative to Agent4Edu: 22.19 sessions versus 27.58. Terminal coverage also rises from 88.18% to 95.27%.

The quick-forgetter case is more revealing. GenMentor declines from 71.66% coverage at session 10 to 67.51% at session 50. That is negative learning progress in the metric, which sounds absurd until one remembers that “learned once” and “retained later” are different states. Agent4Edu keeps improving but plateaus below PACE. PACE’s explicit forgetting model gives it a structural advantage when retention is the problem.

The ablations reinforce this interpretation. Removing propagation roughly doubles time-to-competence for fast learners: 46.13 sessions versus 22.19. That suggests graph-based inference is not cosmetic. Without it, the system must directly observe too many skills. Removing dynamics hurts quick forgetters especially: random exam score drops from 91.04 to 80.16 because the population-average forgetting rate underestimates their true decay.

This is the paper’s best evidence for mechanism, not just performance. The gains are tied to the two design choices the authors care about: graph propagation and trainee-specific temporal dynamics.

Fine-grained diagnosis is not a luxury feature

The within-PACE granularity analysis is one of the more useful parts of the paper for business readers. The authors compare three versions:

| Variant | Belief granularity | Example level |
| --- | --- | --- |
| PACE-coarse | 3 categories | Police, fire, medical |
| PACE-medium | 63 categories | Car crash, structure fire |
| PACE-fine | 1,053 skill nodes | Specific questions, conditions, instructions |

PACE-fine wins consistently. Its C@50 values across the four archetypes are 95.27%, 91.51%, 86.91%, and 91.22%. PACE-medium reaches only 68.49%, 56.11%, 50.93%, and 59.67%, with PACE-coarse worse overall.

This result should be boring, but it is not. Many enterprise training systems still organize capability at exactly the wrong level: course completion, module score, topic category, department label. These are administratively convenient. They are also operationally blurry.

PACE’s result says: if failure happens at the atomic procedural level, learner modeling must also happen at that level. A dashboard saying “medical protocol: 82%” may comfort management, but it does not tell a trainer whether the trainee can distinguish a routine choking response from a choking complication. The dashboard is clean because it has hidden the dangerous details. Very tidy. Very unsafe.

The business implication is broader than 9-1-1. In cybersecurity response, aviation operations, compliance investigations, claims processing, medical triage, and industrial maintenance, competency often lives below the level of job role or training module. If the organization cannot represent the skill graph, AI cannot optimize the curriculum except in a vague motivational-poster sense.

The co-pilot result is strong, but it is not the same as a human-learning trial

The field-facing part of the paper evaluates PACE as a co-pilot using real training data from the partner 9-1-1 center. The authors collect 923 training catalog entries containing simulation logs, debriefing results, and trainer comments with next-session assignments. They run PACE on the dataset and compare its recommendations with expert pedagogical decisions.

PACE achieves 95.45% alignment, or 881 out of 923 entries. The paper also reports a reduction in average turnaround time from 11.58 minutes to 34 seconds in the adaptive phase, a 95.08% reduction. In a survey of 21 domain experts, including 12 training officers and 9 active call-takers with training experience, agreement with PACE’s design choices averages 4.62 out of 5, and overall helpfulness averages 4.43 out of 5.

These results support the co-pilot framing. PACE appears to produce recommendations that experienced trainers often recognize as reasonable, while reducing the administrative time needed to generate those recommendations.

But alignment is not outcome proof. If PACE matches experts, it may inherit expert judgment. That is valuable. It does not by itself prove that a cohort trained with PACE will outperform a cohort trained by experts alone. The controlled learning results come from simulated trainees; the real-world study validates recommendation alignment and workflow efficiency.

That boundary is not a weakness to hide. It is the right reading of the paper.

What businesses should take from PACE

PACE is about 9-1-1 call-taker training, but the architecture points to a general pattern for AI-assisted workforce development.

| PACE mechanism | General business equivalent | ROI pathway |
| --- | --- | --- |
| Skill graph | Operational competency map | Makes hidden dependencies explicit |
| Beta belief tracking | Probabilistic employee skill state | Avoids overconfidence from sparse observations |
| Evidence propagation | Transfer across related tasks | Reduces direct assessment burden |
| Learning/forgetting model | Personalized reinforcement schedule | Prevents decay of critical skills |
| Contextual bandit | Next-best-practice recommendation | Improves sequencing under limited training time |
| Human co-pilot interface | Trainer decision support | Saves expert time without removing accountability |

The practical lesson is not that every firm needs Thompson Sampling in its training stack by Monday. Please do not let a vendor put that on a slide with fireworks.

The lesson is that training automation becomes serious only when it moves beyond content generation. A system that writes quizzes or role-plays customers may be helpful. A system that maintains a persistent, uncertainty-aware model of procedural competence is strategically different. It can answer harder questions:

  • Which skills have we not observed directly but can infer from related performance?
  • Which “mastered” skills are now at risk of decay?
  • Which scenario gives the highest learning value for this specific person now?
  • Where should scarce expert review time be spent?

That is where cost reduction and capability improvement begin to meet. The instructor is no longer asked to manually reconstruct the whole learner state. The system surfaces a diagnosis and a next action; the human judges, coaches, and overrides when context demands it.

Where the evidence stops

PACE is promising because its mechanisms match the structure of the problem. The skill graph addresses interdependence. The Beta beliefs address uncertainty. The forgetting model addresses time. The bandit addresses sequential choice.

Still, several boundaries matter.

First, the main learning gains are based on simulated trainee agents. The simulation is carefully designed, with controllable archetypes and error templates from anonymized logs, but it remains a simulation. Real trainees may respond differently to scenario sequencing, feedback timing, stress, fatigue, instructor style, and organizational culture.

Second, the partner setting is one 9-1-1 call-taking context. The authors worked with domain experts from training, quality assurance, and operations, which strengthens ecological validity. It also means transfer to other agencies or domains requires revalidating the skill graph, annotation schema, scenario set, and instructor acceptance.

Third, PACE depends on structured observations extracted from transcripts and debriefings. If observation extraction is noisy, biased, or poorly calibrated, the belief tracker becomes elegantly wrong. Elegant wrongness is still wrongness; it just comes with equations.

Fourth, expert alignment is not the same as independent superiority. The 95.45% alignment rate is meaningful for co-pilot adoption, but a stronger deployment claim would require prospective human-trainee trials comparing training outcomes, retention, trainer workload, and safety-relevant performance.

These limitations do not reduce the paper to “interesting but impractical.” They identify the next validation layer: field trials with real trainees and longitudinal retention outcomes.

The real shift: from AI tutor to AI curriculum operator

PACE is useful because it changes the object of automation.

The object is not the lesson. It is the decision about what lesson should happen next.

That shift is easy to underestimate. A content generator reduces preparation cost. A role-play bot increases practice availability. A curriculum engine changes the training loop itself. It decides when to reinforce, when to probe, when to advance, and when to revisit a skill that looked safe but may have decayed.

For high-stakes procedural work, that is the more valuable layer. Organizations rarely fail because one training video was not generated quickly enough. They fail because they do not know which human capability is fragile until the real incident exposes it.

PACE gives us a concrete design pattern for avoiding that failure: map the skill graph, update beliefs from observed performance, model forgetting, and allocate practice like a scarce resource.

The phrase “AI designs the curriculum” may sound grand. In this paper, it means something more precise and more useful: AI helps decide the next practice scenario under uncertainty, while human trainers remain responsible for judgment and coaching.

That is a quieter form of intelligence. Also, probably the one enterprises should have wanted before they asked chatbots to write motivational onboarding emails.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zirong Chen, Hongchao Zhang, and Meiyi Ma, “PACE: A Personalized Adaptive Curriculum Engine for 9-1-1 Call-taker Training,” arXiv:2603.05361, 2026, https://arxiv.org/abs/2603.05361