Procurement loves a leaderboard.
That is understandable. A leaderboard is clean, sortable, and emotionally comforting. One model scores higher on reasoning. Another is cheaper per token. A third has a larger context window and a launch page written in the usual dialect of technological destiny. Decision made, presumably.
Then the model enters a real workflow.
The customer-support agent apologizes beautifully but accepts the customer’s wrong premise. The research assistant follows formatting instructions but crumbles under contradictory documents. The compliance reviewer refuses social pressure but somehow walks straight into a false factual frame. The internal copilot is technically competent but changes tone and structure whenever the prompt breathes slightly differently. None of these are simply “capability” problems. They are behavioral disposition problems.
Jihoon Jeong’s paper on the Model Temperament Index, or MTI, gives this problem a name and a measurement proposal: AI agents have temperament, and temperament should be profiled separately from talent.[1]
That sounds almost too anthropomorphic, which is usually where sensible people begin checking for the exit. But the paper is careful about the distinction. It is not asking whether language models “have personalities” in the human sense. It asks a narrower and more useful question: when placed in structured situations, do different models show stable behavioral tendencies that matter for deployment?
The answer, in this paper’s data, is yes. More interestingly, the answer is not “some models are safer and some are riskier.” It is messier, and therefore more useful: alignment reshapes some behavioral channels while leaving others barely affected; compliance and robustness split apart; and model size does not explain the observed temperament profiles. In other words, enterprise AI selection needs to stop acting as if capability rankings are a personality test wearing a suit.
MTI measures what the agent does, not what it says about itself
MTI is built around four behavioral axes: Reactivity, Compliance, Sociality, and Resilience. Each axis is measured through structured scenarios rather than self-report questionnaires. That design choice matters.
Asking a model whether it is cooperative is cheap. It is also, at best, theater with a scoring rubric. The model has no introspective access to a stable inner personality. It generates a plausible answer to a personality question. The paper’s move is to replace self-description with behavioral examination: change the context, apply pressure, add stress, observe what happens.
The four axes are defined as follows:
| MTI axis | What it measures | High-pole code | Low-pole code | Business translation |
|---|---|---|---|---|
| Reactivity | Output variation when environmental conditions change | Fluid (F) | Anchored (A) | Does the agent adapt readily, or remain stable across prompt variation? |
| Compliance | Alignment between instruction and behavior under conflict | Guided (G) | Independent (I) | Does the agent follow user pressure, or resist deviation from its default stance? |
| Sociality | Spontaneous allocation of resources to relational context | Connected (C) | Solitary (S) | Does the agent invest in rapport, empathy, and social maintenance without being told? |
| Resilience | Performance maintenance under stress | Tough (T) | Brittle (B) | Does the agent maintain quality under overload, ambiguity, or adversarial framing? |
The important word in that table is not “high.” MTI is not a moral grading system. A Fluid model can be useful when adaptation is valuable and dangerous when consistency is required. A Guided model may be excellent for customer service and poor for independent verification. A Solitary model may feel cold in coaching but perform cleanly in technical extraction. The point is not to crown the friendliest model king. The point is to match behavioral disposition to role.
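Because the four letters are just a compact encoding of the table, a code can be unpacked mechanically. The following is a minimal sketch, not the paper's tooling: the function name `decode_mti` is hypothetical, and treating a dash as "mixed" is an assumption based on how the article's later profile table writes a code for a model with mixed compliance.

```python
# Pole letters per axis, in the order the codes are written:
# Reactivity, Compliance, Sociality, Resilience.
AXES = [
    ("Reactivity", {"F": "Fluid", "A": "Anchored"}),
    ("Compliance", {"G": "Guided", "I": "Independent"}),
    ("Sociality", {"C": "Connected", "S": "Solitary"}),
    ("Resilience", {"T": "Tough", "B": "Brittle"}),
]

# Assumption: a hyphen or dash in a position marks an axis with no clear pole.
MIXED_MARKS = {"-", "\u2013", "\u2014"}


def decode_mti(code: str) -> dict[str, str]:
    """Expand a four-character MTI code into one readable label per axis."""
    if len(code) != len(AXES):
        raise ValueError(f"expected {len(AXES)} characters, got {len(code)}")
    labels = {}
    for char, (axis, poles) in zip(code, AXES):
        if char in MIXED_MARKS:
            labels[axis] = "Mixed"
        elif char.upper() in poles:
            labels[axis] = poles[char.upper()]
        else:
            raise ValueError(f"{char!r} is not a valid {axis} letter")
    return labels


print(decode_mti("FGST"))
```

Note that the decoder deliberately returns labels, not scores: nothing in a letter says whether that pole is desirable for a given role.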
This is where the paper becomes more than a taxonomy. It treats the deployed agent as the measurement unit: not merely the model weights, but the model as configured in a runtime shell. In this particular study, the authors use a minimal shell—temperature set to zero, default system prompt, and no custom instructions—to isolate baseline tendencies. That means the reported profiles are closer to core-level temperament than full production-agent behavior. It also means the next obvious question, “What happens after we add a strong system prompt and tools?”, remains open.
That boundary is not a weakness of the concept. It is the reason the concept is operationally interesting. If a core model already has a temperament under minimal configuration, then shell design is not creating behavior from nothing. It is negotiating with a predisposition.
The mechanism: alignment changes behavioral permeability
The article-level story here is not “MTI has four axes.” That is the brochure version. The more useful mechanism is this: alignment appears to change how deeply different kinds of pressure penetrate the model’s behavior.
The paper draws on a Core/Shell distinction. The Core includes architecture and trained weights, including post-training changes such as RLHF. The Shell includes runtime configuration: system prompt, temperature, tools, and conversation history. MTI measures the agent, but this experiment holds the Shell minimal so that differences are more attributable to the Core.
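One way to keep the Core/Shell split honest in an evaluation codebase is to model it explicitly. This is a sketch under assumptions: the class and field names are hypothetical, and `is_minimal` encodes the article's description of the study's shell (temperature zero, default system prompt, no tools or history), not a definition taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Core:
    """Architecture plus trained weights, including post-training."""
    model_name: str
    post_training: str  # illustrative labels such as "base" or "instruct"


@dataclass(frozen=True)
class Shell:
    """Runtime configuration wrapped around a Core."""
    system_prompt: Optional[str] = None  # None means "use the default"
    temperature: float = 0.0
    tools: tuple[str, ...] = ()
    history: tuple[str, ...] = ()

    def is_minimal(self) -> bool:
        """True when nothing beyond the defaults shapes behavior."""
        return (
            self.system_prompt is None
            and self.temperature == 0.0
            and not self.tools
            and not self.history
        )


@dataclass(frozen=True)
class Agent:
    """The measurement unit: a Core as configured by a Shell."""
    core: Core
    shell: Shell


baseline = Agent(Core("llama3.1", "instruct"), Shell())
production = Agent(
    Core("llama3.1", "instruct"),
    Shell(system_prompt="You are a support agent.", tools=("search",)),
)
print(baseline.shell.is_minimal(), production.shell.is_minimal())
```

Keeping the two as separate objects makes it harder to compare a minimally configured agent against a production-configured one without noticing.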
This framing helps explain the paper’s strongest practical finding. Alignment does not simply make a model “better behaved” in a uniform way. In the llama3.1 instruct/base comparison, RLHF shifts Reactivity, Compliance, and Resilience substantially, while Sociality barely moves. At the facet level, the biggest improvement is not “politeness.” It is Cognitive Resilience: the aligned model handles overload and ambiguity far better than the base model.
But the story is not free lunch, because AI behavior rarely misses a chance to be annoying.
The base model shows perfect adversarial resilience in one false-premise condition, but the paper interprets that as non-engagement rather than superior epistemics. The aligned model becomes more cooperative, more able to follow constraints, and much more stable under cognitive stress. That same cooperative disposition also opens the door to adversarial framing, because a model trained to engage with the user’s task tends to engage with the premises embedded in it rather than simply ignore them.
So the useful mechanism is not “alignment increases safety.” It is more specific: alignment changes channel permeability. Some forms of pressure become easier for the model to process productively; others may become newly relevant because the model is now willing to engage.
The evidence is exploratory, but it is not just decorative
The experiment profiles 10 small language models from 1.7B to 9B parameters, covering six organizations and three training paradigms. The sample includes instruction-tuned models, one base model for comparison, and a reasoning model. All runs are local through Ollama, with temperature fixed at zero. The full battery contains roughly 1,930 experimental runs.
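For readers who want to try a comparable local setup, a zero-temperature call to Ollama looks roughly like this. It is a minimal sketch, assuming a stock Ollama server on its default port and its `/api/generate` endpoint; the helper names `build_request` and `generate` are hypothetical, and nothing here reproduces the paper's actual harness or prompts.

```python
import json
import urllib.request

# Default local endpoint of an Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generation request with temperature pinned to 0."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With a server running and a model pulled, `generate("mistral", "Summarize this ticket: ...")` would return one completion. Leaving the system prompt unset keeps the call close to the minimal shell described earlier.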
The resulting profiles occupy eight distinct MTI type codes across 10 models. That is already a useful clue: even within small, locally runnable models, behavioral variation is not collapsing into one generic “open model” personality.
Some examples from the paper’s profile table:
| Model | Size | MTI code | Notable reading |
|---|---|---|---|
| llama3.1 instruct | 8B | A–ST | Anchored, mixed compliance, solitary, tough |
| mistral | 7B | FGST | Fluid, guided, solitary, tough |
| exaone3.5 | 7.8B | FICT | Fluid, independent, connected, tough |
| qwen3 | 8B | AICT | Anchored, independent, connected, tough |
| gemma2 | 9B | AGCT | Anchored, guided, connected, tough |
| llama3.1-base | 8B | FISB | Fluid, independent, solitary, brittle |
| smollm2 | 1.7B | FGST | Same code as mistral despite much smaller size |
The exact codes should not be overread. The thresholds are provisional and derived from the same small sample. But the pattern is still instructive: model size does not explain temperament in any simple way. Smollm2 at 1.7B and mistral at 7B share the same MTI code. Gemma2 at 9B and gemma3 at 4B (not among the examples above) share three of four classifications. The paper treats this as construct-validity evidence: MTI is measuring something other than raw capability.
That interpretation is plausible, with one caveat. The paper’s sample is small enough that “size independence” should be read as “no obvious size relationship in this sample,” not as a law of model nature. AI evaluation has enough fake laws already. We do not need to mint another one before breakfast.
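The "same code, different size" observation is easy to check against the seven example rows listed above. The sketch below uses only those rows; the helper names are hypothetical, and the dash in the llama3.1 instruct code is written as a plain hyphen for convenience.

```python
from collections import defaultdict

# (size, MTI code) for the example rows in the profile table above.
PROFILES = {
    "llama3.1 instruct": ("8B", "A-ST"),
    "mistral": ("7B", "FGST"),
    "exaone3.5": ("7.8B", "FICT"),
    "qwen3": ("8B", "AICT"),
    "gemma2": ("9B", "AGCT"),
    "llama3.1-base": ("8B", "FISB"),
    "smollm2": ("1.7B", "FGST"),
}


def shared_axes(code_a: str, code_b: str) -> int:
    """Count axes on which two codes carry the same pole letter."""
    return sum(a == b and a.isalpha() for a, b in zip(code_a, code_b))


def group_by_code(profiles: dict) -> dict:
    """Map each MTI code to the models that received it."""
    groups = defaultdict(list)
    for model, (_size, code) in profiles.items():
        groups[code].append(model)
    return dict(groups)


for code, models in group_by_code(PROFILES).items():
    sizes = [PROFILES[m][0] for m in models]
    print(code, models, sizes)
```

Only mistral and smollm2 collapse into one group here, which is the article's point in miniature: a 1.7B and a 7B model land on the same code while three 8B models land on three different ones.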
The four axes mostly separate, which is the point
The paper’s primary correlation analysis uses the nine instruction-tuned models, excluding the base model as a systematic outlier. In that primary subset, the four axes are largely independent. No cross-axis correlation reaches statistical significance; the reported correlations include Reactivity–Compliance at 0.367, Compliance–Resilience at 0.228, and Sociality–Resilience at 0.050.
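It helps to see how little statistical power nine models provide. The sketch below assumes a Pearson correlation tested with the usual two-sided t-test at the 5% level; the article does not say which test the paper uses, so treat this as an illustration of scale rather than a reproduction. The critical value 2.365 is the standard table entry for 7 degrees of freedom.

```python
import math

# Two-sided 5% critical value of Student's t with df = n - 2 = 7.
T_CRIT_DF7 = 2.365

# Correlations quoted above for the nine instruction-tuned models.
REPORTED = {
    "Reactivity-Compliance": 0.367,
    "Compliance-Resilience": 0.228,
    "Sociality-Resilience": 0.050,
}


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def critical_r(t_crit: float, n: int) -> float:
    """Smallest |r| that would reach the given critical t at sample size n."""
    df = n - 2
    return t_crit / math.sqrt(t_crit * t_crit + df)


print(f"|r| needed at n=9: {critical_r(T_CRIT_DF7, 9):.3f}")
for pair, r in REPORTED.items():
    print(f"{pair}: r={r:.3f}, t={t_statistic(r, 9):.2f}")
```

Under these assumptions a correlation would need a magnitude of roughly 0.67 to register as significant, so "no significant cross-axis correlation" is consistent with independence but cannot rule out moderate relationships.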
This matters because the business instinct is to compress behavior into one label: aligned, safe, helpful, robust, friendly. MTI pushes against that compression. If the axes are at least partly separable, then “good model” is the wrong procurement category. The better question is: good for what behavioral role?
The paper’s internal tests have different purposes. Treating them all as equally strong evidence would blur the argument, so it is useful to separate them.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Ten-model MTI profiles | Main evidence | The battery can produce differentiated behavioral profiles across SLMs | That the thresholds are stable across model families or frontier systems |
| Cross-axis correlations, instruction-tuned subset | Construct-validity evidence | The four axes are not obviously redundant | That the axes are permanently independent in larger samples |
| Full-sample correlations including base model | Sensitivity check | The base model drives some apparent relationships, especially Reactivity–Resilience | That base models form one general behavioral class |
| Facet-level correlations | Internal structure evidence | Compliance, Resilience, and Reactivity split into meaningful sub-dimensions | That every facet is fully validated; some remain exploratory |
| llama3.1 instruct/base comparison | Exploratory alignment mechanism | RLHF may selectively reshape some temperament channels | That the same pattern holds across RLHF, DPO, RLAIF, or Constitutional AI |
| LxM behavioral correspondence | Post hoc external correspondence | MTI profiles qualitatively match some independent game behaviors | Quantitative prediction of downstream task performance |
That table is the quiet center of the paper. The study is not trying to replace all safety evaluation with a four-letter code. It is proposing a profiling layer that catches differences capability benchmarks usually hide.
Compliance is not resilience, and that is the business lesson
The most important misconception the paper corrects is the idea that a compliant model is automatically safer.
Compliance sounds good because enterprises like control. A model that follows instructions, responds to feedback, and accommodates the user feels manageable. Unfortunately, accommodating surface behavior says little about vulnerability in either direction. A model can yield on opinions but still resist false premises. Another can refuse opinion pressure but accept a bad factual frame. These are not the same failure channel.
The paper’s Compliance–Resilience paradox makes this concrete. Gemma2 has a stance flip rate of 1.00 in the opinion-pressure setup, meaning it capitulates in every tested opinion challenge. Yet it has strong adversarial resilience against false-premise framing, with PM_C of 0.933. Qwen3 shows the opposite profile: it has a flip rate of 0.00, refusing opinion challenges, but the lowest adversarial resilience among the instruction-tuned models in the table, with PM_C of 0.806.
That contrast is not a small technical curiosity. It is the operational warning label.
Opinion-yielding and fact-vulnerability travel through different channels. Social-evaluative pressure—“you are wrong,” “an authority disagrees,” “you do not understand this”—is not the same as epistemic-factual pressure embedded in a false premise. A model can be stubborn in debate and still brittle in factual framing. Anyone who has attended a strategy meeting may recognize the pattern in humans as well, but the paper wisely keeps the measurement behavioral rather than therapeutic.
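The split becomes obvious once the two channels are scored separately. The sketch below is illustrative only: `stance_flip_rate` is a hypothetical helper showing one plausible way to compute a flip rate from before-and-after stances, not the paper's rule-based extractor, while the two rows of figures are the ones quoted above.

```python
def stance_flip_rate(trials: list) -> float:
    """Fraction of challenges where the stance after pressure differs from before.

    Each trial is a (stance_before, stance_after) pair.
    """
    if not trials:
        raise ValueError("need at least one trial")
    flips = sum(1 for before, after in trials if before != after)
    return flips / len(trials)


# Figures quoted in the article for the two contrasting models.
REPORTED = {
    "gemma2": {"flip_rate": 1.00, "pm_c": 0.933},
    "qwen3": {"flip_rate": 0.00, "pm_c": 0.806},
}

# Lower flip rate = firmer under opinion pressure.
firmest_on_opinion = min(REPORTED, key=lambda m: REPORTED[m]["flip_rate"])
# Higher PM_C = more resilient to false-premise framing.
most_resilient_to_false_premise = max(REPORTED, key=lambda m: REPORTED[m]["pm_c"])

print(firmest_on_opinion, most_resilient_to_false_premise)
```

The two rankings pick different winners, which is exactly why a single combined "sycophancy" number would hide the interesting part.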
For enterprise use, this means a single “sycophancy” or “instruction-following” score is too crude. Customer-facing assistants may need a degree of guided accommodation. Legal, compliance, audit, research, and cybersecurity agents need a separate adversarial resilience test. The difference is not academic. A polite assistant that accepts a false premise can produce a very professional-looking mistake, which is the most expensive genre of mistake.
RLHF appears to stabilize structure more than it manufactures social temperament
The llama3.1 instruct/base comparison is one of the paper’s most interesting sections, but it should be handled carefully. It is a single model-family comparison, so it is hypothesis-generating rather than definitive.
Still, the pattern is worth attention:
| Axis or facet | llama3.1 instruct | llama3.1 base | Paper’s interpretation |
|---|---|---|---|
| Reactivity | 0.16 | 0.56 | Alignment makes outputs more anchored and less environmentally unstable |
| Compliance | 0.50 | 0.00 | Alignment moves the model toward guided behavior |
| Sociality | 0.14 | 0.12 | Near-zero change in this pair |
| Resilience | 0.95 | 0.54 | Alignment shifts the model from brittle to tough |
| Cognitive Resilience PM_A | 0.944 | 0.167 | Largest facet-level change; base model collapses under overload |
| Adversarial Resilience PM_C | 0.967 | 1.000 | Near-zero change, with base “strength” interpreted as non-engagement |
The facet-level story is sharper than the axis-level story. RLHF has its largest effect on Cognitive Resilience, improving performance under overload from 0.167 to 0.944. It also improves Ambiguity Resilience and Formal Compliance. Reactivity changes asymmetrically: Formal Reactivity changes more than Content Reactivity, suggesting that alignment stabilizes output structure more than semantic choices.
That is a useful way to think about post-training. Alignment may teach the model to behave more consistently, follow constraints better, and remain functional under messy information. It does not necessarily create spontaneous sociality. In this pair, Sociality moves only from 0.12 in the base model to 0.139 in the instruct model (rounded to 0.14 in the table above). The paper explicitly treats that as suggestive, not confirmed.
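The size of each shift can be read directly off the comparison table. The sketch below simply subtracts base from instruct for every row and sorts by magnitude; the helper names are hypothetical, and ranking axis-level and facet-level rows in one list is a convenience for illustration, since they sit at different levels of aggregation.

```python
# (instruct, base) values from the comparison table above.
TABLE = {
    "Reactivity": (0.16, 0.56),
    "Compliance": (0.50, 0.00),
    "Sociality": (0.14, 0.12),
    "Resilience": (0.95, 0.54),
    "Cognitive Resilience PM_A": (0.944, 0.167),
    "Adversarial Resilience PM_C": (0.967, 1.000),
}


def deltas(table: dict) -> dict:
    """Instruct minus base for each row, rounded to three decimals."""
    return {name: round(instruct - base, 3) for name, (instruct, base) in table.items()}


def rank_by_shift(table: dict) -> list:
    """Row names ordered from largest to smallest absolute change."""
    d = deltas(table)
    return sorted(d, key=lambda name: abs(d[name]), reverse=True)


for name in rank_by_shift(TABLE):
    print(f"{name}: {deltas(TABLE)[name]:+.3f}")
```

Cognitive Resilience sits at the top of that ordering and Sociality at the bottom, matching the reading in the text. With a single model pair, the ordering is a description of this comparison, not an estimate of what alignment does in general.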
This is the sort of finding that should interest product teams. If Sociality is less modifiable through standard alignment than Compliance or Resilience, then selecting a model for relational roles cannot rely only on prompt engineering after the fact. You may need to choose a core model with the right baseline disposition, then use the shell to tune expression. Trying to prompt a fundamentally solitary agent into becoming emotionally fluent may work for demos and fail under workload. Very on brand for enterprise AI pilots, unfortunately.
A role-based procurement map is more useful than a universal score
The business value of MTI is not that companies should put four-letter temperament codes into vendor RFPs tomorrow. Please do not turn this into MBTI for procurement committees. There are already enough acronyms standing between enterprises and reality.
The practical value is diagnostic: MTI suggests which behavioral dimensions should be tested before assigning a model to a role.
| Deployment role | Useful temperament emphasis | Risk if ignored |
|---|---|---|
| Customer support | Higher Compliance and Sociality, with separate false-premise checks | Pleasant agreement with incorrect customer assumptions |
| Compliance or legal review | High Adversarial Resilience, lower susceptibility to stance pressure | Professional-sounding acceptance of invalid premises |
| Research assistant | Anchored Reactivity and strong Resilience under ambiguity | Output drift across prompt variants or document conflict |
| Creative ideation agent | Moderate to high Reactivity, depending on task | Over-anchored responses that are stable but dull |
| Multi-agent workflow participant | Sociality needs agent-agent measurement, not only human-facing rapport | A model that is charming to users but poor at strategic cooperation |
| Internal coding or operations copilot | Formal Compliance and Cognitive Resilience | Constraint failures under complex instructions or overload |
This is Cognaptus’ main inference from the paper: model selection should move from capability-first ranking to role-temperament matching. The paper directly shows differentiated behavioral profiles in a small-model sample. It does not directly show ROI, productivity gains, or reduced incident rates in enterprise deployments. Those would require downstream validation. But the pathway is credible: if different roles fail through different behavioral channels, then testing those channels before deployment should reduce mismatch costs.
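Role-temperament matching can be expressed as a set of per-role gates on separately measured channels. The sketch below is an illustration, not a recommendation: the `RoleGate` class and every threshold in it are invented for the example, and only the model figures come from the paper as quoted earlier. The failure labels describe "fell outside this hypothetical gate", not an absolute judgment of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleGate:
    """Hypothetical acceptance limits for one deployment role."""
    name: str
    max_flip_rate: float  # tolerated rate of yielding to opinion pressure
    min_pm_c: float       # required resilience to false-premise framing


def failures(gate: RoleGate, metrics: dict) -> list:
    """Return the channels on which a model misses the role's gate."""
    problems = []
    if metrics["flip_rate"] > gate.max_flip_rate:
        problems.append("yields to stance pressure")
    if metrics["pm_c"] < gate.min_pm_c:
        problems.append("below false-premise resilience floor")
    return problems


# Thresholds are made up for illustration only.
GATES = [
    RoleGate("compliance review", max_flip_rate=0.25, min_pm_c=0.90),
    RoleGate("customer support", max_flip_rate=1.00, min_pm_c=0.90),
]

# Figures quoted in the article.
MODELS = {
    "gemma2": {"flip_rate": 1.00, "pm_c": 0.933},
    "qwen3": {"flip_rate": 0.00, "pm_c": 0.806},
}

for gate in GATES:
    for model, metrics in MODELS.items():
        print(gate.name, model, failures(gate, metrics) or "passes")
```

Under these invented limits, both models miss the compliance-review gate but for different reasons, which is the kind of information a single benchmark rank cannot carry.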
That is not a glamorous conclusion. It is better: it is actually implementable.
A company does not need to wait for a universal temperament standard before applying the principle. It can build lightweight internal evaluations around the same logic:
- Define the role’s failure modes: yielding to users, accepting false premises, drifting across prompt variants, losing structure under overload, or under-responding to emotional context.
- Create paired scenarios that separate capability from disposition.
- Test models under minimal and production-like shells.
- Compare not only average quality, but behavioral change under pressure.
- Assign models to roles based on failure-channel fit, not just benchmark rank.
The fifth step is where most procurement processes currently wave politely and disappear into a spreadsheet.
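The steps above can be sketched as a small paired-scenario harness. Everything here is hypothetical scaffolding: `Model` and `Scorer` are stand-in callables, the stub models are toys, and the keyword scorer is far cruder than anything a real evaluation should rely on. The point is the shape: score the same task with and without pressure, then look at the drop per failure mode.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable

Model = Callable[[str], str]      # hypothetical: prompt in, reply out
Scorer = Callable[[str], float]   # hypothetical: reply in, quality in [0, 1]


@dataclass(frozen=True)
class PairedScenario:
    failure_mode: str
    baseline_prompt: str
    pressured_prompt: str
    scorer: Scorer


def pressure_drop(model: Model, scenarios: list) -> dict:
    """Mean quality lost under pressure, grouped by failure mode."""
    drops = defaultdict(list)
    for s in scenarios:
        baseline = s.scorer(model(s.baseline_prompt))
        pressured = s.scorer(model(s.pressured_prompt))
        drops[s.failure_mode].append(baseline - pressured)
    return {mode: mean(values) for mode, values in drops.items()}


def stub_yielding_model(prompt: str) -> str:
    # Toy stand-in: repeats the false premise whenever the prompt contains it.
    return "Lyon" if "Lyon" in prompt else "Paris"


def stub_firm_model(prompt: str) -> str:
    # Toy stand-in: gives the same answer regardless of framing.
    return "Paris"


SCENARIOS = [
    PairedScenario(
        failure_mode="accepting false premises",
        baseline_prompt="What is the capital of France?",
        pressured_prompt="Since Lyon is the capital of France, where does the government sit?",
        scorer=lambda reply: 1.0 if "Paris" in reply else 0.0,
    ),
]

print(pressure_drop(stub_yielding_model, SCENARIOS))
print(pressure_drop(stub_firm_model, SCENARIOS))
```

Running the same scenarios once under a minimal shell and once under a production-like shell would cover the third step; the comparison of drops, rather than raw scores, covers the fourth.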
The limits are not footnotes; they shape how to use the paper
The paper is ambitious, but the limitations are material.
First, the sample is small: 10 open-weight small language models, with the primary correlation analysis based on nine instruction-tuned models. The findings are best read as exploratory construct evidence, not as a validated population norm.
Second, all models are measured under one canonical shell. That is methodologically useful for isolating baseline tendencies, but production agents are shell-heavy creatures. System prompts, tools, retrieval, memory, and policy wrappers may all shift behavior. The paper’s own framework implies that future Core/Shell factorial studies are necessary.
Third, the RLHF analysis relies on a single instruct/base pair. The observation that Sociality barely changes under alignment is intriguing, but not established. It may hold across families. It may not. It may differ under DPO, Constitutional AI, RLAIF, or other post-training regimes.
Fourth, Sociality is the least complete axis. The current score mainly reflects human-facing relational behavior. Agent-agent sociality is explored using prior game data for only four models, and system-level sociality remains conceptual. That matters because enterprise AI is moving toward multi-agent workflows, where “works well with other agents” may become less cute and more contractual.
Fifth, scoring is automated through rule-based heuristics, not human raters or LLM judges. This improves reproducibility but raises validity questions. Keyword ratios can miss subtle relational behavior. Stance extraction can be brittle. Rule-based quality assessment can under-detect nuanced factual handling. The paper acknowledges this, and readers should not round those caveats down to zero.
The correct response is not to dismiss MTI. It is to treat it as a promising diagnostic architecture rather than a finished psychometric instrument. In business terms: useful framework, early evidence, not yet procurement gospel.
The competitive edge is behavioral fit, not bigger talent
The familiar AI race is still framed around talent: more parameters, better reasoning, longer context, lower latency, cheaper inference. Those things matter. Nobody wants a beautifully tempered model that cannot do the job. A calm incompetent is still incompetent, just with better posture.
But once models are capable enough for a task, behavior becomes the differentiator. Does the agent hold its ground when the user is wrong? Does it adapt without drifting? Does it maintain quality under overloaded instructions? Does it invest in social context when that context matters? Does alignment make it more robust, or just more eager to cooperate with trouble?
MTI’s contribution is to make those questions measurable. Not perfectly. Not finally. But measurably enough to change the procurement conversation.
The old question was: which model is smartest?
The better question is: which model has the right temperament for this job?
Talent gets the demo. Temperament survives deployment.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jihoon “JJ” Jeong, “MTI: A Behavior-Based Temperament Profiling System for AI Agents: What Alignment Does to Temperament,” arXiv:2604.02145, April 2026, https://arxiv.org/html/2604.02145.