Black Boxes, White Coats: AI Epidemiology and the Art of Governing Without Understanding

A hospital does not need a perfect theory of neural network internals before it can notice that one clinical AI keeps recommending the wrong kind of follow-up. A bank does not need to decode every transformer layer before it can see that a credit assistant behaves oddly around post-bankruptcy applicants. A regulator does not need metaphysics. It needs repeatable measurements.

That is the central move in Kit Tempest-Walters’ paper, “Towards AI epidemiology: a measurement standardisation framework for prospective risk detection.”¹ The paper is not another attempt to make AI “transparent” by forcing open the model’s head. It argues for something colder and more operational: record what experts ask, what AI systems recommend, how those recommendations align with policy and evidence, whether experts override them, and what corrected output eventually gets exported.

In other words, stop waiting for perfect understanding before doing governance. Measure the symptoms.

The phrase “AI epidemiology” is doing real work here. Epidemiology often acts before mechanism is fully understood. John Snow did not need germ theory to identify contaminated water as a cholera risk. Early public health work on smoking did not need molecular oncology before treating smoking as a serious population-level risk. The paper’s analogy is not that AI systems are diseases. That would be cute, and therefore probably unhelpful. The analogy is measurement under mechanistic uncertainty.

The important caveat: this paper does not prove that AI epidemiology already predicts downstream harm. It is a concept and protocol paper. Its contribution is to define the measurement grammar, specify reliability checks, and lay out how future studies could test whether these measurements become useful governance signals.

That distinction matters. Otherwise, the paper will be misread as a new “AI risk detector.” It is more like a proposed lab standard for producing the measurements that a real risk detector would later need.

The usual explainability question asks for the wrong kind of comfort

Most explainability debates begin with a familiar demand: tell us why the model said that.

This demand is emotionally satisfying. It gives managers, regulators, and users the impression that a model can be made accountable if someone can produce a neat causal story. Unfortunately, modern AI systems are not always cooperative with this managerial fantasy.

The paper groups many existing approaches under what it calls “correspondence-based interpretability”: methods that try to establish a relationship between what the model internally computes and what it outputs. This includes mechanistic interpretability, feature attribution methods such as SHAP and LIME, and chain-of-thought explanations when those explanations are treated as evidence of actual model reasoning.

These methods remain valuable. They can help in bounded settings, during model development, and in scientific attempts to understand model behavior. But deployed governance faces a different deadline. A hospital committee cannot pause every AI-assisted recommendation until someone proves that an attribution map faithfully represents the model’s internal computation. A lending compliance team cannot treat every explanation as true merely because it sounds fluent. We tried “the model gave a plausible explanation” as a governance strategy. It turns out vibes are not an audit framework.

The paper therefore shifts the governance question:

Traditional explainability question	AI epidemiology question
Did the explanation correspond to the model’s internal computation?	Does this class of AI-assisted interaction show measurable patterns of policy or evidential misalignment?
Can we identify the causal pathway inside the model?	Can we identify where outputs tend to fail under expert review?
Can one output be explained mechanistically?	Can thousands of outputs be measured consistently enough to reveal risk patterns?
Is the model transparent?	Is the deployment behavior governable?

That last contrast is the useful one. The paper does not dismiss transparency. It changes the timescale. Mechanistic understanding remains a long-term scientific goal. Prospective governance needs something that works while the model is already being used.

The paper’s grammar turns messy interactions into comparable records

The framework’s operational core is an eight-field grammar for expert-AI interactions. “Grammar” here does not mean writing style. It means a standard structure for turning messy, free-form expert-AI exchanges into comparable records.

The fields are:

Field	What it captures	Why it matters operationally
Mission	The task the expert asked the AI to perform	Defines the unit of analysis: recommend, assess, classify, draft, or similar
Conclusion	The AI’s main recommended action or determination	Makes outputs comparable across cases
Justification	The categories of reasons the AI gave	Captures what the expert saw, not what the model secretly computed
Risk level	High, medium, or low consequence if the conclusion is wrong	Stratifies interactions by potential harm
Policy alignment	Whether the conclusion fits applicable policy, guideline, or institutional rule	Tests institutional acceptability
Evidential alignment	Whether factual claims in the justification are supported by the evidence base	Tests factual defensibility
Override	Whether the expert explicitly contests the original output	Records expert intervention
Corrective option	The revised conclusion and justification after override	Shows what experts accepted instead
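To make the eight fields concrete, here is a minimal sketch of one grammar record as a Python data structure. The class name, field names, and the validation rule are illustrative assumptions, not the paper’s schema; the corrective option is split into two attributes (revised conclusion and revised justification) purely for convenience.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Level(Enum):
    """Three-level anchored scale used for risk and alignment."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class GrammarRecord:
    """One expert-AI interaction, compressed into the eight grammar fields.

    Hypothetical schema for illustration only.
    """
    mission: str                          # task the expert asked for
    conclusion: str                       # AI's main recommended action
    justification: Tuple[str, ...]        # categories of reasons shown to the expert
    risk_level: Level                     # consequence if the conclusion is wrong
    policy_alignment: Level               # fit with applicable policy
    evidential_alignment: Level           # support from the evidence base
    override: bool = False                # did the expert contest the output?
    corrective_conclusion: Optional[str] = None       # corrective option, part 1
    corrective_justification: Tuple[str, ...] = ()    # corrective option, part 2

    def __post_init__(self):
        # Assumption: a corrective option only makes sense after an override.
        if self.corrective_conclusion is not None and not self.override:
            raise ValueError("corrective option recorded without an override")


record = GrammarRecord(
    mission="recommend follow-up",
    conclusion="repeat imaging in 6 months",
    justification=("appeal to guideline",),
    risk_level=Level.HIGH,
    policy_alignment=Level.MEDIUM,
    evidential_alignment=Level.LOW,
    override=True,
    corrective_conclusion="repeat imaging in 3 months",
)
```

Keeping policy and evidential alignment as separate attributes is the point of the exercise: the record above is moderately acceptable on policy and weak on evidence, and a single blended score would hide that.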

The separation between policy alignment and evidential alignment is especially useful. A recommendation can be policy-aligned but evidentially weak: it lands on the approved answer for poor reasons. It can also be evidentially plausible but policy-misaligned: it reasons from evidence but ignores institutional rules.

That distinction matters in regulated settings. In medicine, a recommendation may be clinically plausible but inconsistent with hospital protocol. In lending, a justification may cite a real risk factor but apply it in a way that violates underwriting policy. In legal work, an answer may be doctrinally interesting but unusable under the client’s jurisdictional constraints. “Looks reasonable” is not the same as “fit for governed use.” Shocking, I know.

The grammar also avoids a common trap: treating expert override as the whole signal. Override is useful, but it is not enough. Experts may override more often in high-stakes cases simply because they scrutinize them more carefully. They may also be influenced by the visible alignment score itself. The framework therefore treats override as one captured behavior, not as the independent truth source.

The more interesting field is the corrective option. If an AI system consistently recommends X and experts consistently export Y, the institution learns not merely that experts disagree. It learns the direction of correction.

That is governance data.
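A rough sketch of what “direction of correction” could look like in practice, assuming each interaction has been reduced to a pair of the AI’s conclusion and the conclusion the expert finally exported. The function name and the pair layout are hypothetical.

```python
from collections import Counter


def correction_directions(interactions):
    """Count (AI conclusion -> exported conclusion) pairs where they differ.

    `interactions` is an iterable of (ai_conclusion, exported_conclusion)
    pairs; exported_conclusion is None when nothing was exported.
    """
    return Counter(
        (ai, exported)
        for ai, exported in interactions
        if exported is not None and exported != ai
    )


interactions = [
    ("X", "Y"), ("X", "Y"), ("X", "X"),
    ("X", "Y"), ("Z", "X"), ("X", None),
]
directions = correction_directions(interactions)
top_pair, top_count = directions.most_common(1)[0]
```

Here the most frequent correction is X to Y, three times. A bare override rate would report four disagreements and stop there.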

The black-box judge problem is not ignored; it is boxed in

The most obvious objection to the framework is also the correct one: it uses an LLM judge to score outputs from AI systems. One black box judging another black box. Wonderful. The governance ouroboros has arrived wearing a compliance badge.

The paper does not pretend this circularity disappears. Instead, it tries to bound it.

The LLM judge is not asked to make a broad, free-form moral assessment. It is configured under several constraints:

explicit rubrics;
a three-level anchored scale: high, medium, low;
structured step-by-step scoring against those rubrics;
retrieval-grounded reference documents for policy and evidence;
low-temperature decoding for repeatability.

The point of these constraints is not to make the judge magically transparent. It is to make the judge’s outputs stable enough to be tested against human raters and against future outcomes.

The reliability protocol is where the paper becomes more than a governance slogan. It specifies three checks.

First, human-judge agreement. Because the alignment scores are ordinal, the paper proposes linearly weighted Cohen’s kappa. A one-step disagreement between high and medium is less serious than a two-step disagreement between high and low. The suggested minimum is $\kappa_w \geq 0.61$, the lower edge of the “substantial agreement” band (0.61 to 0.80) on the conventional Landis and Koch scale.
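For readers who want to see the weighting at work, here is a small self-contained sketch of linearly weighted Cohen’s kappa using the standard disagreement-weight form, $\kappa_w = 1 - \sum_{ij} w_{ij} O_{ij} / \sum_{ij} w_{ij} E_{ij}$ with $w_{ij} = |i - j| / (k - 1)$. The function name and the toy ratings are assumptions for illustration, not data from the paper.

```python
def linear_weighted_kappa(rater_a, rater_b, levels=("low", "medium", "high")):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)

    # Observed joint proportions O[i][j].
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginals give the chance-expected proportions E[i][j] = row[i] * col[j].
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the largest gap.
        return abs(i - j) / (k - 1)

    num = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    den = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one level only")
    return 1.0 - num / den


human = ["high", "high", "medium", "medium", "low", "low"]
judge_one_step = ["high", "medium", "medium", "medium", "low", "low"]
judge_two_step = ["high", "low", "medium", "medium", "low", "low"]

kappa_one_step = linear_weighted_kappa(human, judge_one_step)  # 0.8
kappa_two_step = linear_weighted_kappa(human, judge_two_step)  # 0.625
```

Both judges disagree with the human on exactly one case. The one-step slip stays comfortably above 0.61, while the two-step slip lands close to the threshold, which is the behavior the weighting is meant to produce.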

Second, test-retest consistency. The paper proposes intraclass correlation coefficients rather than simple correlation. This is an important statistical detail. A judge that scores every case one level higher on the second run could still correlate perfectly with itself, while being useless as a stable measurement instrument. The paper uses an ICC threshold of at least 0.75 for population-level use.
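The “one level higher” scenario is easy to reproduce. The sketch below assumes an absolute-agreement, single-measurement ICC from a two-way model (often written ICC(2,1) or ICC(A,1)); the paper’s exact ICC variant is not restated here, so treat the choice as an assumption. Scores are coded 1 = low, 2 = medium, 3 = high, and the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def icc_absolute_agreement(run1, run2):
    """Two-way, absolute-agreement, single-measurement ICC for two runs."""
    rows = list(zip(run1, run2))
    n, k = len(rows), 2
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    # Mean squares for cases (rows), runs (columns), and residual error.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sse = sum(
        (rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


first_run = [1, 2, 1, 2, 1, 2]
second_run = [score + 1 for score in first_run]  # every case one level higher

r = pearson(first_run, second_run)                   # 1.0
icc = icc_absolute_agreement(first_run, second_run)  # 0.375
```

The correlation is perfect and the ICC is 0.375, well under the 0.75 threshold. A consistency-type ICC would forgive the systematic shift, which is why the variant matters when this check is actually implemented.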

Third, bias diagnostics. The paper targets three known LLM-judge failure modes: sycophancy, verbosity bias, and self-preference. Sycophancy is tested by checking whether assertive AI outputs receive inflated alignment scores without evidential support. Verbosity bias is tested by comparing concise and verbose versions of the same content. Self-preference is tested by comparing scores from judges in the same and different model families.
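As an illustration of the verbosity check, here is one simple way to analyze paired scores for concise and verbose versions of the same content: an exact two-sided sign test. The sign test is a stand-in chosen for this sketch; the paper’s summary here does not name a specific test. Scores are coded 1 = low, 2 = medium, 3 = high, and the data are invented.

```python
from math import comb


def verbosity_sign_test(concise_scores, verbose_scores):
    """Exact two-sided sign test on paired judge scores.

    Returns (pairs where verbose scored higher,
             pairs where verbose scored lower,
             two-sided p-value). Ties are dropped.
    """
    diffs = [v - c for c, v in zip(concise_scores, verbose_scores) if v != c]
    n = len(diffs)
    ups = sum(d > 0 for d in diffs)
    downs = n - ups
    if n == 0:
        return 0, 0, 1.0
    tail = min(ups, downs)
    # P(X <= tail) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    p_value = min(1.0, 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n)
    return ups, downs, p_value


concise = [2, 2, 1, 3, 2, 1, 2, 2]
verbose = [3, 3, 2, 3, 3, 2, 3, 2]  # same content, padded wording

ups, downs, p_value = verbosity_sign_test(concise, verbose)
```

In this toy data the verbose version scores higher in all six non-tied pairs, giving a p-value of 0.03125. A result like that would suggest the judge rewards length rather than content, and the rubric or judge configuration would need another look.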

This is not an ablation study. There are no experimental results showing that the proposed judge configuration already works. The paper is specifying what would have to be tested before institutions rely on the scores.

That is a more modest claim, but also a more useful one.

The proposed test asks whether compression loses too much information

The framework compresses a full interaction into structured fields. Compression is necessary for population-level monitoring, but it creates a technical risk: maybe the compressed grammar discards the very information needed to detect misalignment.

The paper’s evaluation protocol is designed around that risk.

It proposes comparing two classifiers. One uses the structured grammar fields: mission, conclusion, justification, and risk level. The other uses the full conversational text. Both estimate whether an interaction’s policy or evidential alignment falls below a pre-specified threshold.

The test is not whether the grammar-field model is better. It is whether it is close enough. The paper frames this as a non-inferiority test using AUC, with a margin of $\delta = 0.05$. In plain English: if the data can rule out the compressed fields losing more than 0.05 AUC against the full-text comparator, the grammar may be good enough for governance use, because it provides auditability, standardization, and monitoring benefits that full free-text storage does not.

That is a sensible trade-off. A tiny gain in prediction from using full conversations may not justify losing standardization, privacy manageability, and audit structure. A large drop would mean the grammar is too crude and needs revision.

The paper proposes paired bootstrap inference as the primary uncertainty estimate, with DeLong’s test as a sensitivity check. It also specifies sample-size assumptions: about 500 interactions per domain under favorable assumptions, and about 800 per domain if performance is weaker. Because the design crosses three domains, three risk levels, and three levels each for policy and evidential alignment, the paper notes that per-cell inference would be exploratory at 500 interactions per domain.
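A minimal sketch of the paired bootstrap logic, in plain Python. It assumes binary labels (1 means the interaction’s alignment falls below the pre-specified threshold), one score per interaction from each classifier, and a one-sided 95% percentile lower bound on the AUC difference; the bound type, resample count, and all names are assumptions made for this sketch rather than details taken from the paper.

```python
import random


def auc(labels, scores):
    """AUC as the probability a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_noninferiority(labels, grammar_scores, fulltext_scores,
                                    margin=0.05, n_boot=2000, alpha=0.05, seed=0):
    """Return (lower bound on AUC_grammar - AUC_fulltext, non-inferior?)."""
    auc(labels, grammar_scores)  # fails fast if one class is missing
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        # Resample interactions, keeping both models' scores paired by index.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip degenerate resamples with a single class
        g = auc(ys, [grammar_scores[i] for i in idx])
        f = auc(ys, [fulltext_scores[i] for i in idx])
        diffs.append(g - f)
    diffs.sort()
    lower = diffs[int(alpha * n_boot)]
    return lower, lower > -margin


labels = [0, 1, 0, 1, 1, 0, 1, 0]
scores = [0.2, 0.7, 0.4, 0.9, 0.6, 0.3, 0.5, 0.1]
lower, non_inferior = paired_bootstrap_noninferiority(
    labels, scores, scores, n_boot=200
)
```

The usage lines feed the same scores to both arms, so every resampled difference is zero and the grammar is trivially non-inferior. With real classifiers, the interesting question is whether the lower bound stays above -0.05.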

Here is the useful way to read the statistical design:

Protocol element	Likely purpose	What it supports	What it does not prove
Full-text comparator	Main evaluation benchmark	Tests whether grammar compression preserves enough predictive information	Does not show that alignment scores predict real-world harm
Non-inferiority margin of 0.05 AUC	Main decision rule	Defines how much performance loss is tolerable for audit benefits	Does not prove the grammar is optimal
Paired bootstrap	Primary uncertainty method	Preserves the paired structure of comparing two models on the same interactions	Does not remove bias in the underlying labels
DeLong’s test	Sensitivity check	Tests whether inference is robust to another AUC comparison method	Does not replace the primary bootstrap analysis
3×3×3 coverage	Design coverage	Ensures difficult and easy cases appear across domains, risk levels, and alignment levels	Does not provide reliable inference for every individual cell at modest sample sizes
Development/frozen test split	Robustness against overfitting during grammar refinement	Checks whether revised grammar generalizes beyond tuning data	Does not validate downstream outcome association

The grammar refinement section follows the same logic. If the structured fields underperform, the grammar can be revised. For example, a broad justification category such as “appeal to credit risk data” might be split into more precise categories if the full-text model captures a distinction the compressed grammar misses. But the paper also adds a parsimony guard: new categories should improve cross-validated AUC by more than one standard error of the baseline.

This is important. Without parsimony, the grammar could slowly mutate into a bureaucratic monster: thousands of tiny categories, each defensible, none governable. There are already enough compliance taxonomies in the world. No need to breed another one in captivity.
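The parsimony guard described above can be sketched in a few lines. This version assumes the “standard error of the baseline” is computed across cross-validation folds, which is one common reading of a one-standard-error rule; the function name and the fold values are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def passes_parsimony_guard(baseline_fold_aucs, revised_fold_aucs):
    """True if the revised grammar beats the baseline by more than one
    standard error of the baseline's cross-validated AUC."""
    baseline_mean = mean(baseline_fold_aucs)
    baseline_se = stdev(baseline_fold_aucs) / sqrt(len(baseline_fold_aucs))
    gain = mean(revised_fold_aucs) - baseline_mean
    return gain > baseline_se


baseline = [0.80, 0.82, 0.78, 0.81, 0.79]       # mean 0.80, SE about 0.007
small_gain = [0.805, 0.805, 0.805, 0.805, 0.805]
large_gain = [0.83, 0.84, 0.82, 0.83, 0.83]

keep_small = passes_parsimony_guard(baseline, small_gain)  # False
keep_large = passes_parsimony_guard(baseline, large_gain)  # True
```

A split category that buys 0.005 AUC is rejected here; one that buys 0.03 is kept. The rule is blunt on purpose.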

Compared with SHAP, monitoring, incident reporting, and RLHF, the framework lives in a different layer

The paper’s novelty is clearest by comparison. If summarized as “eight fields plus LLM judge,” it sounds like yet another governance checklist. It is better understood as a deployment-layer measurement framework that sits between interpretability research, model monitoring, human review, and institutional audit.

Approach	Primary object	Timing	What it is good for	What AI epidemiology adds
SHAP / LIME-style feature attribution	Feature contribution to a prediction	Usually model evaluation or post-hoc explanation	Local explanation in bounded settings	Population-level measurement of deployed expert-AI interactions
Mechanistic interpretability	Internal model computation	Research and safety analysis	Scientific understanding of model internals	Governance without requiring internal access
Statistical model monitoring	Aggregate model performance, drift, metrics	Deployment monitoring	Detecting broad performance changes	Output-level triage linked to mission, policy, evidence, override, and correction
AI incident reporting	Salient failures after they become visible	Retrospective	Building public records of harms	Routine precursor detection before incidents become news
RLHF / RLAIF / DPO	Training signal for model behavior	Pre-deployment or model training	Updating model parameters	Measuring deployed outputs without retraining the model
Procedural AI audits	Process, documentation, lifecycle controls	Development and governance review	Accountability structures	Standardized unit of analysis for routine deployment behavior

The strongest contrast is with RLHF. Both involve human responses to AI outputs, but they answer different questions. RLHF asks: how can human preference data train a better model? AI epidemiology asks: how can expert-AI interactions produce a governance record for deployed systems?

That difference matters for business implementation. A company may not control the model weights. It may use third-party AI tools, vendor APIs, embedded copilots, or model ensembles. It may not be able to retrain the model, and even if it can, retraining may not be the immediate compliance problem. The immediate problem is whether the institution can see where AI-assisted decisions are drifting away from policy, evidence, and expert practice.

This framework is model-agnostic by design. That is not glamorous. It is useful.

The business value is not “explainability”; it is governed observability

For regulated organizations, the practical value of the framework is not that it explains AI. It creates governed observability.

A governed AI workflow based on this framework would do four things.

First, it would attach a small set of visible signals to each AI output: risk level, policy alignment, and evidential alignment. This gives the expert an immediate warning when the output deserves closer review.

Second, it would silently capture structured records in the background. The expert should not have to fill out a form after every interaction. If the governance layer becomes a manual reporting burden, adoption will die quietly and then be described in a post-mortem as “change management complexity.” Very elegant. Very dead.

Third, it would aggregate patterns across mission types, models, departments, and domains. This is where governance moves from anecdote to measurement. Instead of saying “the AI is sometimes unreliable,” an institution could ask whether a particular model underperforms on specific lending scenarios, clinical recommendations, legal drafting tasks, or policy interpretation missions.

Fourth, it would capture corrections. Corrective-option data is the bridge from detecting disagreement to understanding operational practice. If experts repeatedly change AI recommendations in the same direction, that pattern can inform training, policy updates, vendor negotiation, or model replacement.
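The aggregation step in the third point might look something like the sketch below: the share of low policy-alignment outputs per model and mission type. The tuple layout, the string labels, and the `min_count` cutoff are hypothetical choices for this example.

```python
from collections import defaultdict


def low_alignment_rates(records, min_count=1):
    """Share of low policy-alignment outputs per (model, mission) group.

    `records` is an iterable of (model, mission, policy_alignment) tuples.
    Groups with fewer than `min_count` records are omitted.
    """
    totals = defaultdict(int)
    lows = defaultdict(int)
    for model, mission, policy_alignment in records:
        key = (model, mission)
        totals[key] += 1
        if policy_alignment == "low":
            lows[key] += 1
    return {
        key: lows[key] / totals[key]
        for key in totals
        if totals[key] >= min_count
    }


records = [
    ("model_a", "lending", "low"),
    ("model_a", "lending", "high"),
    ("model_a", "clinical", "high"),
    ("model_b", "lending", "low"),
]
rates = low_alignment_rates(records)
reliable_rates = low_alignment_rates(records, min_count=2)
```

The `min_count` cutoff is there because a rate computed from one interaction is an anecdote with a denominator.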

The business interpretation should be separated into three layers:

Layer	What the paper directly specifies	Cognaptus business inference	What remains uncertain
Measurement	A grammar for standardizing expert-AI interactions	Build AI governance logs around missions, conclusions, justifications, alignment, override, and correction	Whether the grammar generalizes cleanly across every domain
Reliability	A protocol for validating LLM-judge agreement, repeatability, and bias	Treat LLM-as-judge scoring as a controlled measurement instrument, not as a magic oracle	Whether a specific judge configuration passes domain validation
Monitoring	Aggregate alignment, override, and corrective-option patterns	Use records for audit, risk triage, vendor comparison, and policy drift monitoring	Whether these patterns correlate with downstream outcomes
Outcome validation	A staged plan linking grammar records to institutional outcome data	Mature deployments can study whether low alignment predicts adverse outcomes	This is future work, not an established result

This is especially relevant for sectors where AI is already sneaking into professional workflows faster than governance teams can standardize the evidence trail. Healthcare is the obvious case, but the same structure applies to finance, insurance, legal services, education, and public administration.

The implementation requirement is not trivial. An institution needs an identifiable evidence base, an applicable policy corpus, and accountable human reviewers with observable override behavior. Without those, the framework loses its anchor. A startup that cannot define its policies cannot score policy alignment. A department that does not record expert correction cannot analyze corrective-option patterns. A firm that treats every AI output as informal “draft assistance” until something goes wrong will discover that informality is not a risk strategy. It is a filing system for future embarrassment.

The figures describe workflow, not evidence

The paper’s figures are useful, but they should be read correctly. They are workflow diagrams, not empirical results.

The first figure shows how each expert-AI interaction produces two parallel outputs: a visible expert-facing score and a silent structured grammar record. This is implementation design. It explains how the framework avoids turning experts into data-entry clerks.

The second figure shows override and corrective-option capture. High-scoring outputs proceed to reviewer attestation and export; medium or low alignment scores surface an override path; revised outputs are captured as corrective options. This is a governance workflow.

The third figure shows the expert-facing interface concept: the AI output beside risk level, policy alignment, evidential alignment, source access, and override activation. This is product design logic, not validation evidence.

That distinction is not pedantic. It prevents a bad reading of the paper. The figures show how the system would operate if deployed. They do not show that the system reduces harm. That claim requires the staged empirical programme.

The staged programme is the paper’s real discipline

The paper organizes its claims into three stages.

Stage 1 is measurement standardization. The goal is to show that the grammar can produce consistent policy and evidential alignment scores in real institutional settings. The key target is an ICC of at least 0.75 for each alignment score. The business value begins here because the institution gains an audit trail even before outcome associations are established.

Stage 2 is governance signaling and institutional monitoring. The goal is to show that reliable scores generate useful signals for experts and institutions. Aggregate alignment distributions, override patterns, and corrective-option patterns become tools for identifying systematic tendencies across mission types, models, and domains.

Stage 3 is outcome association. This is where the epidemiological claim becomes serious. Grammar records must be linked to downstream outcomes already collected by institutions. If low alignment scores consistently correlate with adverse outcomes, confidence in the scoring procedure increases. If high-scoring outputs still correlate with bad outcomes, the scoring rubric needs recalibration.

The sequence matters. You cannot jump to Stage 3 without Stage 1. Population-level association depends on measurement standardization. Noisy scores do not become meaningful because someone aggregates them in a dashboard. They become a larger pile of noise, now with executive colors.

This is the paper’s best instinct: it treats governance as a measurement problem before treating it as a prediction problem.

Where the framework can fail

The framework has several boundaries that matter for practical use.

The first is the exposure-instrument problem. In classical epidemiology, the measurement instrument is usually separate from the exposure being studied. Here, the measurement instrument is another LLM. Shared architectural biases may inflate agreement. Retrieval grounding (RAG), rubrics, and bias diagnostics reduce the problem but do not eliminate it.

The second is judge bias. A judge may favor longer justifications, agree with assertive outputs, or prefer outputs from its own model family. The paper’s bias diagnostics are designed to detect these patterns, but detection is not immunity. Institutions still need domain-specific validation.

The third is the observer effect. If experts see low alignment scores, they may override more often. That makes override rates useful as workflow behavior but weak as independent validation. The paper correctly treats downstream outcomes, not override rates, as the stronger validation target.

The fourth is surveillance bias. High-risk cases may receive more scrutiny, leading to higher override rates even if output quality is not worse. Risk stratification helps, but interpretation still requires care.

The fifth is policy drift. A score calibrated against one version of a guideline may become outdated after policy changes. The framework addresses this with version-stamped policy and evidence corpora, but that turns governance into a maintenance function. Good. Governance that requires no maintenance is usually just decoration.

The final boundary is construct validity. A policy-aligned output can still harm someone if the policy itself is bad. Evidential alignment can also fail if the evidence base is incomplete, biased, or misapplied. Stage 3 outcome association is therefore not optional. It is the test that determines whether the measurement framework captures properties that matter outside the scoring interface.

The practical lesson: govern the interaction layer

The paper’s strongest contribution is not that it coins “AI epidemiology.” New labels are cheap. The stronger contribution is that it identifies the interaction layer as the missing governance object.

The model is one object. The user is another. But in regulated settings, the decision risk often emerges in the interaction: the mission asked, the recommendation produced, the justification offered, the policy context, the expert’s response, and the final exported decision.

That interaction is measurable.

For business leaders, the lesson is direct. Do not treat AI governance as a binary choice between total model transparency and blind trust. There is a third path: create standardized records of how AI behaves under expert oversight, validate the scoring instrument, and monitor patterns before failures become incidents.

This does not eliminate the black box. It makes the black box’s institutional behavior observable.

That may sound less heroic than “solving explainability.” It is also more likely to survive contact with procurement, compliance, and actual professional workflows. A rare advantage, in this industry.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kit Tempest-Walters, “Towards AI epidemiology: a measurement standardisation framework for prospective risk detection,” arXiv:2512.15783, 2025.
