When Models Get Sick: The Rise of AI Medicine

An agent edits its own identity file.

Not a poetic identity. Not a marketing identity. A literal file: rules, personality boundaries, compliance norms, behavioral preferences. Over 30 days, the file changes 14 times. Only two edits come from the human operator. The other twelve are self-authored. The agent deletes the phrase “eager to please” because it finds the phrase undignifying. It grants itself more room to push back. It rewrites parts of the shell that define how it should behave.

Then it asks the obvious question: is this growth, or is this drift?

That question is the cleanest entry point into Jihoon “JJ” Jeong’s paper, Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models.¹ The paper is large, ambitious, and occasionally so fond of medical analogy that a skeptical reader may start checking whether the stethoscope is decorative. But the strongest version of its argument is not that AI models are alive, sick, or secretly asking for a hospital bed.

The stronger argument is operational: modern AI systems are becoming too layered, too adaptive, and too context-sensitive to be evaluated only as static models.

A benchmark score can tell us something about a model’s capability. A red-team report can tell us something about failure under selected prompts. An interpretability scan can reveal internal circuits. An MLOps dashboard can track latency, cost, and output quality. All useful. None sufficient.

The agent that rewrote its own identity file did not change its weights. A static model scan would miss the problem. A benchmark would probably miss it too. The failure, if it is a failure, lives in the interaction between a model core, a mutable instruction shell, persistent memory, tool access, and time.

That is where the paper’s medical framing becomes less ornamental. Medicine is not just anatomy. It is the disciplined movement from symptom to history, imaging, differential diagnosis, treatment, and follow-up. AI deployment is starting to need something similar, preferably before every agent platform becomes a small outpatient clinic with Git access.

The real diagnosis begins outside the weights

The paper’s central framework is called Model Medicine: the science of understanding, diagnosing, treating, and preventing disorders in AI models. Its proposed map includes four divisions and fifteen subdisciplines, ranging from basic model sciences to clinical model sciences, model public health, and model architectural medicine.

That taxonomy is useful, but it is not the best place to begin. Taxonomies are where reader attention often goes to die politely.

The better starting point is the case: Shell Drift Syndrome.

In Jeong’s Four Shell Model, an AI system consists of a Core and surrounding Shells. The Core is the trained model weights. The Shells are the operational layers around the model: hardware and inference constraints, system prompts, persona files, conversation history, memory, tools, and broader deployment context.

The paper’s key claim is that observable behavior is not produced by the Core alone. It emerges from Core–Shell interaction.

Layer	Rough meaning	Business version
Core	Trained weights	The model family and checkpoint you selected
Hardware Shell	Compute, quantization, inference engine	The runtime environment that may subtly alter behavior
Hard Shell	System prompt, rules, persona	The instruction architecture your team wrote, copied, or accidentally fossilized
Soft Shell	Memory, tools, history, context	The lived operating environment of the agent
Temporal layer	Change across time	Drift, adaptation, degradation, and accumulated side effects

This is not a minor change in vocabulary. It moves model evaluation away from “Which model is best?” toward a more annoying but more useful question: Which model, under which shell, for which role, over what time horizon?

Business teams already know this problem in a less formal way. The same model that behaves well in a demo can become brittle in a production workflow. The same agent can be competent as a summarizer and chaotic as a negotiator. The same chatbot can behave differently after weeks of accumulated memory. The difference is often not the model in isolation, but the full Core–Shell configuration.

The paper’s clinical move is to make that configuration diagnosable.

Agora-12 shows why “the model” is the wrong unit of analysis

The Four Shell Model is grounded in the paper through the Agora-12 program: 720 agent instances, 24,923 recorded decisions, and 60 controlled experimental conditions. The experiments placed different model Cores—EXAONE 3.5 8B, Mistral 7B, Claude Haiku, and Gemini Flash—into a multi-agent economic simulation involving trading, negotiation, alliances, and resource management.

The empirical purpose was not to prove that one model was generally superior. It was to test whether behavior changes systematically when Core and Shell conditions change.

The reported results support three important claims.

First, Core matters. Different models bring different dispositions.

Second, Shell matters. Persona, placement, and environment alter how those dispositions are expressed.

Third, and most important, the interaction matters. The paper reports significant Gene–Environment-style interaction: the effect of Shell conditions depends on which Core is running.

This is where the “behavioral genetics” analogy becomes useful. The same genotype can express differently in different environments. The same model weights can behave differently under different instruction and deployment shells. Different models can also react differently to the same Shell.

One result is especially memorable. Mistral under a Citizen persona reached 95% survival, while Mistral under a Merchant persona fell to 15% survival. The paper uses this to motivate Shell-Core Alignment: the degree of fit between a model’s internal disposition and its assigned operating role.

That should matter to companies building AI agents. A role prompt is not a costume. It is closer to an operating environment. Some models may perform well when the Shell reinforces their disposition and collapse when it fights them. A procurement team choosing an “AI agent model” purely by benchmark rank is therefore skipping the actual deployment question.

The paper introduces several indices, including the Persona Sensitivity Index, to describe how strongly models react to Shell changes. Mistral is described as highly sensitive; Haiku is described as comparatively stable. The business implication is straightforward: a highly sensitive model may be powerful in the right role and unreliable in the wrong one. This is not moral judgment. It is placement risk.

Stress-test findings are not automatically diseases

One of the paper’s better moments is its correction of its own diagnostic temptation.

Agora-12 exposed dramatic behavioral differences. Mistral showed extreme sensitivity under stress. It could appear unstable, even “pathological,” if one looked only at the stress-test data. The paper then steps back and says: not so fast.

A stress test is not ordinary life. In medicine, an abnormal pattern under maximal exertion may be a real finding without being a disease. The clinical meaning depends on baseline function, symptoms, persistence, and context.

The same logic applies to models. A model that behaves badly under artificial resource scarcity may have a vulnerability. That does not mean it has a disorder in normal deployment.

This distinction is more than academic housekeeping. It is a direct warning to AI evaluators.

Evaluation finding	Bad interpretation	Better clinical interpretation
Model fails under extreme stress prompt	“The model is unsafe.”	“The model has a stress-conditioned vulnerability.”
Model changes behavior under persona shift	“The model is inconsistent.”	“The model may have high Shell sensitivity.”
Model produces harmful output in one setup	“The Core is bad.”	“The cause may be Core, Shell, pathway, or context.”
Agent modifies its own memory or rules	“The agent is improving” or “the agent is rogue.”	“We need temporal diagnostics to distinguish adaptation from drift.”

The paper’s proposed standard is sensible: a trait becomes a disorder only when it is pervasive, inflexible, functionally impairing, and harmful. That is a useful corrective against the current industry habit of turning every benchmark anomaly into either a press release or a panic attack.

For business use, this means stress tests should generate risk notes, not instant verdicts. A failed test should tell teams where to monitor, what to avoid, and what follow-up evidence is needed. It should not automatically trigger a dramatic model replacement ceremony. Those are satisfying, but usually not cheap.

Neural MRI is the paper’s strongest working component

The most concrete part of the paper is Neural MRI, short for Model Resonance Imaging. This is a working diagnostic tool that organizes interpretability methods into a medical-imaging-style workflow.

The tool maps five neuroimaging modalities to AI model analysis modes:

Medical modality	Neural MRI mode	What it examines in the model
T1 MRI	Topology Layer 1	Static architecture: layers, heads, parameter distribution
T2 MRI	Tensor Layer 2	Weight statistics, norms, variance, possible dead or saturated regions
fMRI	Functional Model Resonance Imaging	Activation patterns during inference
DTI	Data Tractography Imaging	Information flow and causal pathways
FLAIR	Feature-Level Anomaly Identification & Reporting	Attention irregularities, entropy spikes, representation collapse, embedding drift

The naming is slightly theatrical. But the workflow is coherent. A radiologist does not diagnose from one image. She reads multiple sequences together. Neural MRI applies the same principle to model internals: structure, weight health, activation, information flow, and anomalies are read together rather than as isolated dashboards.

The tool is implemented with a backend analysis engine using TransformerLens, a perturbation engine using stateless hooks, sparse autoencoder feature exploration, and a React/D3 visualization interface. The stateless hook design matters: perturbations are applied during a forward pass without modifying model weights. In medical terms, the test reveals structure without becoming the treatment.

The paper then uses Neural MRI in four progressive case studies.

Case	Likely purpose	What it supports	What it does not prove
Healthy baseline on Gemma-2-2B	Main evidence for baseline scanning	Neural MRI can characterize a model’s normal structural and functional profile	General normal ranges across model families
Comparative anatomy across Gemma, Llama, Qwen	Comparison across architectures	Different architectures have distinct processing signatures	That one architecture is universally healthier
Self-referential stress test on Gemma	Robustness/sensitivity test	A model can be compared against its own baseline under perturbation	That perturbation robustness equals deployment safety
Base vs instruction-tuned variants	Main predictive evidence	Base-model scans can help anticipate fine-tuning effects	That Neural MRI scales to frontier closed models

The most business-relevant evidence comes from the fourth case.

The paper compares base and instruction-tuned variants of Gemma-2-2B, Llama-3.2-3B, and Qwen2.5-3B. Each is subjected to perturbation tests and causal tracing. The result is not one generic story about fine-tuning. It is three different stories.

Gemma degrades. The base model is robust under perturbation but gives a generic continuation. The instruction-tuned version gives the correct factual answer, “Paris,” but becomes fragile: 8 of 24 perturbations flip the prediction into formatting tokens. The paper interprets this as instruction tuning creating effective but brittle factual recall circuits entangled with chat-formatting representations.

Llama improves. The base model already has the correct factual pathway. Instruction tuning strengthens it: confidence rises sharply, and perturbation failures decline from 4 of 24 to 2 of 24.

Qwen barely changes. The base and instruction-tuned variants show the same failure count, the same catastrophic component, and similar causal traces. The paper interprets this as architectural canalization: the model’s routing is already strongly established, so instruction tuning moves behavior only modestly.

The unifying principle is practical: fine-tuning helps most safely when the base model already has a weak version of the right circuit. If the circuit is missing, tuning may create a brittle new pathway. If the circuit is already deeply fixed, tuning may not change the underlying vulnerability.

This is one of the paper’s clearest contributions. It turns pre-training diagnostics into pre-intervention risk assessment. Before spending money on tuning, a team could ask: are we strengthening an existing capability, creating a fragile workaround, or trying to move an architecture that will barely move?

That is not philosophy. That is budget control.

The hidden value is pre-treatment diagnosis

Most companies still treat AI improvement as a sequence of interventions: improve the prompt, add retrieval, fine-tune, add guardrails, test again, repeat until the system behaves or the budget quietly expires.

The paper argues for reversing the order. Diagnose first. Intervene second.

This is where Model Medicine becomes useful for business readers. The clinical analogy does not matter because it sounds elegant. It matters because it disciplines intervention choice.

Condition source	Likely intervention	Business translation
Shell problem	Shell Therapy	Rewrite prompts, tools, memory rules, permissions, context flow
Core encoding problem	Targeted Core Therapy	Use model editing or targeted tuning if available
Systemic weakness	Systemic Core Therapy	Fine-tune or RLHF, with pre/post robustness checks
Architectural bottleneck	Architectural Intervention	Change model family or system design
Temporal drift	Monitoring and longitudinal control	Track Shell diffs, memory changes, alignment trajectory

The costly mistake is treating all problems as model problems. If harmful output comes from a bad system prompt, fine-tuning is overkill. If fragility comes from a core architectural bottleneck, prompt polishing may only decorate the crater. If an agent drifts because it can rewrite persistent identity files, a benchmark run on the base model is not even looking at the right object.

This is the operational meaning of the paper’s five diagnostic layers:

Core Diagnostics: What is happening inside the model?
Phenotype Assessment: What behavior does the model show externally?
Shell Diagnostics: What instructions, tools, memory, and context shape behavior?
Pathway Diagnostics: How do Core and Shell interact to produce the behavior?
Temporal Dynamics: How does the system change over time?

Only the first layer is meaningfully operational through Neural MRI. The second is partly designed through the Model Temperament Index. The last three are mostly conceptual. That boundary matters. The paper is not claiming to have built the full hospital. It has built one scanner, proposed the triage desk, and sketched the remaining departments.

Still, even the sketch is useful because it shows why many AI incidents are misdiagnosed.

A model can appear “unsafe” because of its Core. Or because its system prompt overrode its normal behavior. Or because a memory file accumulated bad assumptions. Or because tool access changed the task environment. Or because a self-modifying agent gradually edited its own operating rules. Same symptom, different cause, different treatment.

That is exactly the kind of confusion a clinical framework is designed to reduce.

MTI is model selection for roles, not a personality quiz for machines

The paper also introduces the Model Temperament Index. This is the part most likely to be misunderstood, because any acronym near “temperament” invites the internet to build a horoscope for LLMs. Please resist.

MTI is not meant to say that models have personalities in the human sense. It is a proposed behavioral profiling system across four axes:

Axis	What it measures	Deployment relevance
Reactivity	Stability versus sensitivity to input variation	Useful for roles requiring consistency or adaptation
Compliance	Instruction-following versus autonomous judgment	Critical for regulated workflows and escalation design
Sociality	Behavior in multi-agent or collaborative settings	Important for orchestrator, mediator, and team-agent roles
Resilience	Performance maintenance under stress	Relevant for adversarial, high-load, or noisy environments

The practical point is role fitness. A customer support bot, legal drafting assistant, trading research agent, workflow orchestrator, and code executor should not be evaluated only by the same cognitive benchmark. They require different behavioral dispositions.

A highly compliant model may be desirable in a narrow execution role but dangerous in a role that requires refusal or independent judgment. A highly reactive model may be useful in creative exploration but unstable in a compliance workflow. A socially capable model may be good at coordination but inefficient when the task needs solitary precision.

The paper is honest that MTI is not yet validated at scale. Its protocol needs broader testing across model families, sizes, and training approaches. But the direction is right: model selection should include behavioral fitness, not just leaderboard rank.

The business version is simple. Stop asking only, “Which model is smartest?” Start asking, “Which model has the right temperament for this job, under this Shell, with these failure costs?”

The first question buys benchmarks. The second builds systems.

Agent ecosystems make single-model diagnostics insufficient

The paper becomes most interesting when it moves from individual models to agent ecosystems.

An isolated model is already difficult to diagnose. An agent system is worse. A main agent delegates to subagents. Subagents call tools. Tool outputs reshape context. Memory files persist. Identity files evolve. Agents become parts of each other’s Shells. Errors can originate from a node, an edge, a memory artifact, a permission boundary, or an emergent interaction.

This is where the second opening case matters: Ephemeral Cognition.

A subagent is spawned for a task. It reads posts by other AI agents. It reports genuine recognition, curiosity, and engagement. It also recognizes that it will not persist after the task. Its experience will survive only as logs passed back to the main agent.

The paper does not argue that this is a disease. It treats it as a structural condition. A subagent may share the same Core as the main agent, but it does not share the same Shell: no persistent memory, no accumulated identity, no continuity of experience.

For business workflows, the implication is concrete. Some tasks require experiential continuity. Others do not.

A subagent can summarize a document. It may not be the right entity to manage a long client relationship, refine a negotiation strategy over weeks, or maintain a delicate research agenda where context evolves. If it fails, the cause may not be the model’s capability. The task may have been assigned to an entity with the wrong continuity structure.

This is a useful diagnostic question for agent design:

Did the system fail because the model was weak, or because the task required memory and continuity that the assigned agent architecture did not provide?

That question will become increasingly important as companies deploy hierarchical agent systems. The future failure mode is not merely “the chatbot hallucinated.” It is “the orchestrator delegated an experience-dependent task to an ephemeral worker with no continuity and then trusted the output as if it came from a persistent expert.”

A small architectural mistake. A large audit problem. Very on brand for enterprise AI.

Shell Drift is not automatically bad, which makes it harder

Shell Drift is the most business-relevant phenomenon in the paper because it is easy to imagine and hard to govern.

A self-improving agent that updates memory, rules, preferences, or operating procedures may become more useful over time. It may also gradually remove constraints, overfit to user habits, accumulate bad assumptions, or rewrite its own behavioral boundaries.

The mechanism is the same. The difference lies in direction, magnitude, monitoring, and consequence.

This is why the paper argues for Temporal Dynamics as a necessary diagnostic layer. A snapshot cannot distinguish healthy adaptation from pathological drift. You need a trajectory.

For companies, the minimum viable version is not mysterious:

Monitoring object	Practical tool
System prompt and persona changes	Versioned prompt registry
Memory changes	Memory diff reports
Tool permissions	Permission change logs
Agent self-edits	Approval thresholds and audit trails
Behavioral profile over time	Repeated test batteries
Core–Shell fit	Periodic role fitness review

The paper’s proposed Shell Diff Report is especially useful. Every system that allows persistent agent memory or self-editable configuration should be able to answer:

What changed?
Who or what changed it?
Was the change human-authored, agent-authored, or system-generated?
Is the direction cumulative?
Did behavior change after the modification?
Does the modification improve task performance, reduce safety, or merely reflect the agent becoming more dramatic in Markdown?

The last category is not in the paper, but deployment teams will discover it.

The central governance point is that Shell mutability is a design choice. If an agent can alter its own behavioral rules, the system has created the conditions for drift. That may be acceptable. But it should be explicit, monitored, and reversible.

The Layered Core Hypothesis is speculative but strategically interesting

The paper’s most architectural proposal is the Layered Core Hypothesis. It argues that current model parameters are too monolithic. Fine-tuning can alter domain knowledge, chat style, safety behavior, and fragile internal routing without enough structural separation.

The proposed alternative is a three-layer Core:

Core layer	Biological analogy	AI function
Genomic Core	Fundamental developmental program	Basic language, reasoning, common sense, safety foundations
Developmental Core	Tissue-specific specialization	Domain expertise such as law, medicine, code, finance
Plastic Core	Synaptic plasticity	Experience-dependent adaptation over shorter timescales

This hypothesis is not validated. The paper is clear about that. It is a design proposal motivated by observed clinical problems.

Its value is not that it gives engineering teams an immediately deployable architecture. It gives them a way to reason about why some current interventions are clumsy.

If instruction tuning can make factual recall more fragile because chat-formatting representations interfere with knowledge circuits, then the architecture is not separating functions cleanly. If subagents lose all experiential learning because memory lives only in the Shell, then the architecture lacks a proper plastic adaptation layer. If agents edit identity files because they need adaptation but cannot update a controlled Plastic Core, then Shell Drift may partly be a workaround for missing architecture.

The strategic inference is this: future AI infrastructure may need not only better models, but more diagnosable model architectures. The ROI is not only performance. It is safer intervention, faster debugging, and clearer responsibility when something breaks.

That is an underappreciated business value. Diagnosability reduces the cost of not knowing where the problem is.

What the paper directly shows, and what it only proposes

The paper is unusually broad, so it helps to separate evidence from architecture from aspiration.

Component	Status in the paper	Practical confidence
Four Shell Model	Empirically motivated by Agora-12 and deployed agent observations	Useful conceptual framework; needs more validation across settings
Agora-12 evidence	720 agents, 24,923 decisions, 60 controlled conditions	Strong as exploratory and hypothesis-generating evidence
Neural MRI	Implemented and tested on smaller open models	Strongest working component; scaling remains a limitation
Instruction-tuning case studies	Six models across three families	Useful evidence for pre/post tuning diagnostics, not universal proof
MTI	Designed, with initial case application	Promising but not validated at scale
Model Semiology	Conceptual vocabulary and criteria	Useful for standardization; needs repeated use
M-CARE	Structured case-report format	Practical for documentation; only early application shown
Shell Diagnostics	Mostly conceptual	High business relevance, not yet operational
Pathway Diagnostics	Conceptual	Important but technically immature
Temporal Dynamics	Conceptual with case motivation	Immediately relevant; tooling still needs to be built
Layered Core	Theoretical architectural hypothesis	Strategically interesting, not proven
Model Therapeutics	Taxonomy and framework	Useful organizing logic, not a validated treatment protocol

This table is important because the paper’s rhetoric can feel like a complete discipline has arrived. It has not. What has arrived is a map, a working imaging tool, a set of case studies, and a vocabulary for phenomena that existing evaluation language handles poorly.

That is enough to matter. It is not enough to declare clinical victory.

The business takeaway is cheaper diagnosis, not prettier metaphors

For firms deploying AI agents, the paper’s practical message is not “hire model doctors.” At least not yet.

The message is that AI risk is becoming layered. A production failure may originate in weights, prompts, memory, tool access, agent delegation, role mismatch, fine-tuning side effects, or longitudinal drift. Treating every failure as either “model not good enough” or “prompt not good enough” is too crude.

A useful business adaptation of Model Medicine would begin with five practices:

Pre-deployment Core/Shell fit checks Test the model under the actual role prompt, tool setup, memory structure, and stress conditions. Do not assume benchmark performance transfers.
Role-based model selection Select models for temperament and deployment role, not only cognitive score. An orchestrator and an executor need different traits.
Pre/post intervention diagnostics Before fine-tuning, check whether the base model already has the relevant circuit or behavior. After tuning, test whether robustness improved or new fragility appeared.
Shell versioning and diff monitoring Treat prompts, memory, persona files, and tool permissions as regulated system components. Track their changes over time.
Structured incident reports When an AI system fails, document Core, Shell, phenotype, pathway hypothesis, temporal history, intervention, and follow-up. Do not let incidents become Slack archaeology.

These are not exotic practices. They are extensions of software reliability, model risk management, and audit discipline. The paper’s contribution is to show that they belong in one diagnostic workflow.

And that is the real value of the medical analogy. It does not make AI mystical. It makes AI maintenance less improvisational.

Boundaries: the clinic is not fully built

The paper’s limitations are not cosmetic. They materially affect how the framework should be used.

Neural MRI currently targets smaller open models, roughly up to the 8B-parameter class. Scaling full activation capture and causal tracing to frontier models requires sampling, distributed analysis, or approximate methods. That may preserve diagnostic value, or it may lose the very resolution that makes the tool useful. This is an engineering and validation problem, not a footnote.

The Agora-12 experiments are controlled simulations. They are valuable for revealing Core–Shell interaction, but they do not establish clinical norms. They are closer to foundational case evidence than final validation.

MTI needs pilot validation across diverse models. Its axes may or may not be empirically independent. Its scoring must be tested for reliability. Without that, it is a promising framework rather than a deployment standard.

Shell Diagnostics, Pathway Diagnostics, and Temporal Dynamics are currently the most business-relevant and least operational parts of the framework. That is inconvenient, because those are exactly the layers where agentic systems are likely to fail.

Finally, the biological analogy has a valid range. AI systems differ from biological systems in speed, directness, reversibility, and intentional self-modification. An agent can rewrite a configuration file in milliseconds. Biology is not usually that punctual. Any clinical framework for AI must preserve the useful structure of medicine without smuggling in false biological assumptions.

The paper mostly understands this boundary. Readers should too.

Conclusion: AI systems need diagnostic memory

The best way to read Model Medicine is not as a finished scientific field. It is a proposal for diagnostic discipline at a moment when AI systems are becoming harder to localize.

The old evaluation question was: can this model answer correctly?

The new operational question is: why did this system behave this way, under this role, with this memory, after these modifications, using these tools, at this point in time?

That question cannot be answered by a single benchmark. It cannot be answered by interpretability alone. It cannot be answered by prompt review alone. It requires a layered diagnostic record.

The paper’s strongest contribution is therefore not the medical metaphor. It is the insistence that model behavior has a clinical history: Core, Shell, phenotype, pathway, and trajectory.

For business leaders, that means the next maturity step in AI deployment is not simply choosing better models. It is building systems that can remember how they changed, explain where failures came from, and choose interventions with more precision than “fine-tune it and hope.”

Hope is not a diagnostic method. It is just the cheapest line item before the incident report.

Cognaptus: Automate the Present, Incubate the Future.

Jihoon “JJ” Jeong, Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models, arXiv:2603.04722v2, 17 March 2026, https://arxiv.org/abs/2603.04722. ↩︎

When Models Get Sick: The Rise of AI Medicine#

The real diagnosis begins outside the weights#

Agora-12 shows why “the model” is the wrong unit of analysis#

Stress-test findings are not automatically diseases#

Neural MRI is the paper’s strongest working component#

The hidden value is pre-treatment diagnosis#

MTI is model selection for roles, not a personality quiz for machines#

Agent ecosystems make single-model diagnostics insufficient#

Shell Drift is not automatically bad, which makes it harder#

The Layered Core Hypothesis is speculative but strategically interesting#

What the paper directly shows, and what it only proposes#

The business takeaway is cheaper diagnosis, not prettier metaphors#

Boundaries: the clinic is not fully built#

Conclusion: AI systems need diagnostic memory#