Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

TL;DR for operators

Most AI evaluation still asks whether a model can produce the right answer. This paper asks a quieter but more commercially awkward question: when a model uses a word, does it attach human-like emotional, concrete, familiar, gendered, or sensory associations to that word?¹

The authors propose using established psycholinguistic word norms as an automated alignment test. Instead of hiring new human raters every time, they reuse datasets where humans have already rated thousands of English words on features such as arousal, valence, concreteness, imageability, familiarity, gender association, and sensory modalities.

The result is not “LLMs fail language”. That would be too easy, and also wrong. The evaluated models align reasonably well with humans on several Glasgow norms: arousal, valence, concreteness, imageability, and familiarity. But alignment is weaker on Lancaster sensory norms: haptic, gustatory, auditory, olfactory, visual, and interoceptive associations.

For business use, this matters because customer-facing AI does not merely answer questions. It chooses words. Those words carry emotional load, sensory expectation, clarity, intimacy, and social implication. A model that can solve a reasoning benchmark may still mishandle the difference between “crisp”, “soft”, “sharp”, “warm”, “safe”, “clinical”, “fresh”, or “heavy” in ways that matter for brand voice, search relevance, product descriptions, education, accessibility, and safety review.

The practical takeaway is not to replace existing benchmarks. It is to add a cheap diagnostic layer. Psycholinguistic tests can reveal whether a model’s word-level associations are human-like in the domains where your product depends on language perception. The strongest immediate use is in tone, concreteness, familiarity, and emotional calibration. Sensory product language remains more uncertain unless the model has better grounding, multimodal training, or domain-specific evaluation.

A model can answer correctly and still not mean words like people do

A customer asks for a “light” fragrance. Another asks for a “warm” hotel room. A student asks for a “concrete” explanation. A patient-facing chatbot chooses between “mild discomfort” and “sharp pain”. None of these interactions is primarily about factual correctness. The model is not solving algebra. It is navigating the human associations attached to words.

This is where conventional LLM evaluation starts to look slightly overconfident. Leaderboards are good at measuring task success: maths, reasoning, coding, question answering, summarisation, translation. They are less good at measuring whether the model’s internal sense of a word resembles the associations humans bring to it. Unfortunately, users do not read outputs as benchmark submissions. They read them as language.

The mechanism problem is simple. Humans learn words through language, but not through language alone. We learn “chair” partly from sitting. We learn “purple” partly from seeing. We learn “lemon” partly from taste, smell, texture, colour, social context, and the mild facial betrayal of biting into one too confidently. LLMs, especially text-only models, learn from distributions of symbols. They can infer a lot from text, but some associations may be missing, distorted, or unevenly represented.

The paper’s central move is therefore useful: instead of asking only whether an LLM gives the right output to a task, ask whether its ratings of word features correlate with established human ratings.

That turns psycholinguistics into an evaluation instrument.

The benchmark uses old human experiments as a new model diagnostic

The authors use two established English word-norm datasets.

The first is the Glasgow norms, covering 5,553 English words. These include seven features: arousal, valence, dominance, concreteness, imageability, familiarity, and gender association.

The second is the Lancaster norms, covering 39,707 English words. These include perceptual modality ratings: touch, hearing, smell, taste, vision, and interoception. The Lancaster dataset also contains body-part associations, but the authors exclude those because the original human instructions used images of body parts. Since most evaluated models are pure language models, that would make the comparison unfair rather than enlightening. Sensible restraint; rare enough to deserve a small nod.

The authors then ask eight LLMs to rate words using prompts adapted from the original human studies. The evaluated models are Llama-3.2-3B, Llama-3.1-8B, Llama-3.2-11B Vision-Instruct, Gemma-2-9B, Yi-1.5-9B, Occiglot-7B, GPT-4o, and GPT-4o-mini.

The rating procedure has two variants. One is the model’s direct numeric answer. The other uses the model’s estimated probabilities over possible rating values and computes an expected score. The paper presents the probability-based estimate in the main results because it generally aligns better with human ratings.
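To make the difference between the two variants tangible, here is a minimal sketch of the probability-based estimate as an expected value over rating points. The function names, the renormalisation step, and the example distribution are assumptions for illustration; the paper's exact implementation may differ.

```python
import math
from typing import Mapping


def expected_rating(probs: Mapping[int, float]) -> float:
    """Probability-weighted mean rating.

    Assumption: probabilities are renormalised over the valid scale points,
    since a model may put some mass on tokens that are not ratings.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(value * p for value, p in probs.items()) / total


def from_logprobs(logprobs: Mapping[int, float]) -> float:
    """Same estimate when an API exposes log-probabilities instead."""
    return expected_rating({v: math.exp(lp) for v, lp in logprobs.items()})


# Invented distribution over a 1-7 scale for one word (not from the paper).
probs = {5: 0.2, 6: 0.5, 7: 0.3}
direct_answer = max(probs, key=probs.get)  # the single most likely rating
expected = expected_rating(probs)          # the smoother, probability-based score
print(direct_answer, round(expected, 2))
```

The direct answer collapses the distribution to one integer, while the expected score keeps the information that the model also leaned towards neighbouring ratings.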

The evaluation metrics are Pearson and Spearman correlations, computed both on original values and rounded values. This is not statistical decoration. It matters because some word-feature distributions are skewed. For sensory dimensions, many words are barely related to a modality at all. Pearson and Spearman can tell different stories depending on whether we care more about high-rating outliers or small rank differences in the crowded low end of the scale.
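A small sketch shows how the two metrics can disagree on a skewed dimension. The ratings below are invented to mimic a sensory feature where most words sit near zero; they are not taken from the paper. Both correlations are written in plain Python so the arithmetic is visible, and Spearman is computed as Pearson on average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: five weakly related words and two strongly related ones.
human = [0.1, 0.2, 0.3, 0.4, 0.5, 4.5, 4.8]
# The hypothetical model gets the two outliers right but reverses the low end.
model = [0.5, 0.4, 0.3, 0.2, 0.1, 4.4, 4.9]

print(round(pearson(human, model), 3), round(spearman(human, model), 3))
```

In this toy case Pearson stays high because the two strongly rated words dominate, while Spearman drops because the crowded low end is ordered backwards. Which of those matters depends on the product question being asked.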

Test component	Likely purpose	What it supports	What it does not prove
Glasgow norms	Main evidence	Whether LLMs align with human ratings on affective and semantic word features	Full behavioural alignment in conversation
Lancaster sensory norms	Main evidence	Whether LLMs align with human sensory associations	That text-only models cannot learn any sensory meaning
Probability-based ratings	Implementation detail	A smoother estimate than one direct numeric answer	That the model has stable internal human-like concepts
Pearson vs Spearman	Sensitivity and diagnostic check	Whether alignment depends on outliers or rank ordering	A single universal “best” alignment score
Rounded vs original values	Robustness/sensitivity check	Whether tiny Likert differences are driving correlations	Psychological validity of every model judgement
Multimodal model comparison	Exploratory comparison	Whether multimodal status obviously improves visual alignment	A definitive test of multimodal grounding

The important point: the paper is not proposing a new leaderboard badge for model marketing teams to weaponise by Friday. It is proposing a reusable diagnostic method.

Glasgow results: models know more about affect and concreteness than sceptics might expect

The Glasgow results are the more encouraging half of the paper. Across the evaluated models, alignment is generally better for arousal, valence, concreteness, imageability, and familiarity. Alignment is weaker for gender association and dominance.

That split makes intuitive sense. Arousal and valence are heavily represented in language. Words are constantly described as positive, negative, exciting, calm, pleasant, threatening, familiar, abstract, concrete, vivid, or vague. A model trained on enough text has many opportunities to learn these associations indirectly.

Concreteness is a useful example because it links directly to practical communication. The paper compares the words “bicycle” and “bid”. Human concreteness ratings are far apart: 6.81 for “bicycle” and 3.42 for “bid”. Llama-3.2-3B gives much more similar ratings, 4.73 and 4.50. GPT-4o, by contrast, separates them more sharply, with 7 and 2.96.
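The separation is easier to compare as a single number per rater. The sketch below uses only the six ratings quoted above; the `gap` helper is a hypothetical name, and the difference between the two ratings is a simple illustrative measure, not a metric from the paper.

```python
# Concreteness ratings quoted in the article for "bicycle" and "bid".
ratings = {
    "human": {"bicycle": 6.81, "bid": 3.42},
    "Llama-3.2-3B": {"bicycle": 4.73, "bid": 4.50},
    "GPT-4o": {"bicycle": 7.0, "bid": 2.96},
}


def gap(pair):
    """How far apart a rater places the concrete and the abstract word."""
    return round(pair["bicycle"] - pair["bid"], 2)


gaps = {rater: gap(pair) for rater, pair in ratings.items()}
for rater, value in gaps.items():
    print(f"{rater}: {value}")
```

Humans separate the two words by about 3.4 points, Llama-3.2-3B by about 0.2, and GPT-4o by about 4.0. A pair of words is an anecdote rather than a benchmark, but it shows the kind of compression a weak model can apply to a dimension.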

That example does more work than a neat average score would. It shows that model differences are not merely abstract leaderboard noise. They affect whether the model distinguishes tangible from abstract language. In education, support, legal explanation, healthcare, sales copy, and onboarding, that matters. “Make it more concrete” is not a vibe. It is a measurable language feature.

The paper also notes that GPT-4o and GPT-4o-mini generally show stronger alignment across Glasgow features, while smaller models can perform well on specific dimensions, such as Gemma-2-9B on gender association. This is a useful operational warning: model size and brand prestige are not substitutes for dimension-level testing. A weaker model overall may still be acceptable for a narrow linguistic function; a stronger model overall may still need checking in the specific association space your product depends on.

Lancaster results: sensory language is where the floor drops

The Lancaster results are less flattering. The correlations between LLM ratings and human ratings are much lower for sensory associations than for the Glasgow features.

The authors interpret this as consistent with embodied cognition: humans do not learn sensory meaning from text alone. We have bodies, senses, environments, habits, food, surfaces, sounds, smells, discomfort, balance, and memory. LLMs have text. Quite a lot of text, admittedly. Still text.

The “lemon” example is the paper’s cleanest illustration. Humans rate “lemon” highly for gustatory association, with a rating of 4.45. Gemma-2-9B produces 0.01, essentially missing the taste association. GPT-4o produces 4.49, nearly matching the human mean.

The point is not that every model fails every sensory word. GPT-4o can clearly capture obvious cases. The deeper issue is distributional reliability. A production system cannot depend only on obvious examples. Sensory language in commerce and customer experience is full of subtle distinctions: matte versus glossy, crisp versus brittle, earthy versus smoky, bright versus harsh, silky versus slippery, fresh versus chemical. When these associations drive user expectations, weak alignment becomes a product risk.

The paper also finds that multimodality does not obviously solve the visual feature. The multimodal models considered—Llama-3.2-11B Vision-Instruct, GPT-4o, and GPT-4o-mini—do not show a clear advantage over the rest on visual alignment. This should not be overread. The study is not a full audit of multimodal training. But it does puncture a lazy assumption: attaching images to a model does not automatically give it human-like sensory word meaning. Apparently, “now with vision” is not a sacrament.

Correlations are doing interpretive work, not just statistical housekeeping

The paper’s use of Pearson and Spearman correlations deserves attention because this is where careless benchmarking can produce false confidence.

Pearson correlation measures the strength of a linear relationship and gives more influence to observations far from the mean. In this setting, it can reward a model for correctly identifying words that are strongly associated with a feature, such as highly taste-related words in the gustatory dimension.

Spearman correlation focuses on rank ordering. It can be more sensitive to how the model orders words around the dense middle or low end of the distribution. For sensory norms, where many words sit near “not associated with this sense”, that can change the interpretation.

The authors find that Pearson and Spearman generally agree, but diverge for gustatory and olfactory ratings. That divergence is not a nuisance to be averaged away. It tells evaluators what kind of alignment they are seeing.

If a product team cares about catching the obvious high-signal sensory terms—say, food, perfume, cosmetics, materials, or hospitality descriptors—Pearson-like behaviour may be more relevant. If the team cares about fine-grained ranking across many weakly sensory words—say, semantic search, recommendation, or accessibility labelling—Spearman-like behaviour may matter more.

The rounded-value test adds another guardrail. Tiny differences such as 1.01 versus 1.02 on a Likert-like scale may not be psychologically meaningful. If a correlation disappears or changes substantially after rounding, the apparent alignment may have depended on differences no human would notice. That is a lovely little trapdoor under many numerical evaluations. Best to know where it is before standing on it.
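A rounding check can be sketched in a few lines. The data are invented, the helper names are hypothetical, and rounding half up to the nearest scale point is an assumption; the paper may round differently. Python's built-in `round` rounds halves to the nearest even number, so an explicit helper is used instead.

```python
from math import floor, sqrt


def pearson(x, y):
    """Pearson r, or None when either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None  # correlation is undefined for a constant series
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def round_half_up(value):
    """Round to the nearest integer scale point, with halves going up."""
    return floor(value + 0.5)


def rounding_check(human, model):
    """Return (correlation on raw values, correlation on rounded values)."""
    raw = pearson(human, model)
    rounded = pearson(
        [round_half_up(v) for v in human],
        [round_half_up(v) for v in model],
    )
    return raw, rounded


# Invented ratings: every word sits near the bottom of the scale.
human = [1.01, 1.02, 1.03, 1.04, 1.05]
model = [1.10, 1.20, 1.30, 1.40, 1.45]

raw, rounded = rounding_check(human, model)
print(raw, rounded)
```

Here the raw correlation is high, yet after rounding every value becomes 1 and there is nothing left to correlate. The apparent agreement lived entirely in hundredths of a scale point.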

What the paper directly shows

The direct contribution is methodological and empirical.

First, it shows that psycholinguistic word norms can be reused as an automated LLM-human alignment test. This is cheaper and more scalable than commissioning fresh human ratings for every model and every feature.

Second, it demonstrates the method across eight contemporary LLMs and thirteen word features drawn from two datasets.

Third, it finds a structured pattern: better alignment on several affective and semantic features, weaker alignment on sensory features.

This is already enough to matter. It means “alignment” can be decomposed into measurable word-level dimensions rather than treated as a single moral halo hovering over the model. Product teams do not need to ask whether a model is aligned in general. They can ask whether it aligns with humans on the linguistic associations their application actually uses.

What Cognaptus infers for business use

The business inference is not that psycholinguistic scores should replace task benchmarks. They should sit beside them.

A model used for coding support, legal extraction, or financial analysis still needs task-level evaluation. But a model used for customer-facing language also needs word-association evaluation. The moment an AI system writes, rewrites, ranks, filters, recommends, translates, labels, or personalises language for humans, psycholinguistic alignment becomes operationally relevant.

Business domain	Relevant word features	Practical diagnostic question	What the paper suggests
Marketing and brand voice	Valence, arousal, familiarity, imageability	Does the model choose words with the intended emotional and vividness profile?	Current models may be usable, but should be dimension-tested
Education and explanation	Concreteness, familiarity, imageability	Does the model make concepts feel tangible and accessible?	Concreteness testing can reveal model differences
Search and recommendation	Sensory modality, concreteness, familiarity	Does the model understand human associations behind descriptive queries?	Sensory alignment is weaker and needs domain evaluation
Product descriptions	Taste, smell, touch, vision, auditory association	Does the model preserve sensory expectations accurately?	High-risk area for text-only assumptions
Brand safety and UX review	Arousal, valence, dominance, gender association	Does generated language carry unintended emotional or social associations?	Useful as a screening layer, not a final safety verdict
Model selection and routing	Feature-specific correlations	Which model is good enough for this linguistic function?	Global model ranking may hide feature-level strengths and weaknesses

The immediate ROI is diagnostic, not magical. These tests can help teams find mismatch early, before vague user complaints accumulate into the traditional enterprise dashboard category known as “sentiment seems off”. Very scientific. Very expensive.

A practical workflow might look like this:

1. Identify the word-feature dimensions that matter to the product.
2. Build a domain-specific word list from real prompts, product copy, search terms, support tickets, or educational content.
3. Rate candidate models using psycholinguistic prompts and compare them with available norms or newly collected human ratings for domain-critical terms.
4. Track feature-level correlations across model versions, prompts, fine-tunes, and routing policies.
5. Use failures to guide prompt design, retrieval grounding, synthetic data, post-training, or human review.
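The tracking part of that workflow can be sketched as a per-feature comparison between two model versions. Everything here is hypothetical: the word list, the ratings, the `feature_alignment` and `regressions` helpers, and the 0.05 tolerance are assumptions chosen for illustration, and Pearson is used only because it is the simplest of the metrics the paper reports.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def feature_alignment(human, model, features):
    """Correlate model and human ratings per feature over the shared words."""
    words = sorted(set(human) & set(model))
    return {
        f: pearson([human[w][f] for w in words], [model[w][f] for w in words])
        for f in features
    }


def regressions(old, new, tolerance=0.05):
    """Features whose alignment dropped by more than the tolerance."""
    return [f for f in old if new[f] < old[f] - tolerance]


# Invented ratings for a tiny domain word list (not from either norm dataset).
features = ["valence", "gustatory"]
human = {
    "fresh": {"valence": 7.5, "gustatory": 3.0},
    "sharp": {"valence": 4.0, "gustatory": 1.5},
    "warm": {"valence": 7.0, "gustatory": 1.0},
    "heavy": {"valence": 3.5, "gustatory": 0.5},
}
model_v1 = {
    "fresh": {"valence": 7.0, "gustatory": 2.5},
    "sharp": {"valence": 4.5, "gustatory": 1.5},
    "warm": {"valence": 6.5, "gustatory": 1.0},
    "heavy": {"valence": 3.0, "gustatory": 0.5},
}
model_v2 = {
    "fresh": {"valence": 7.5, "gustatory": 0.5},
    "sharp": {"valence": 4.0, "gustatory": 2.0},
    "warm": {"valence": 7.0, "gustatory": 1.0},
    "heavy": {"valence": 3.5, "gustatory": 1.5},
}

old_scores = feature_alignment(human, model_v1, features)
new_scores = feature_alignment(human, model_v2, features)
flagged = regressions(old_scores, new_scores)
print(flagged)
```

In this toy comparison the second version improves on valence and gets worse on the gustatory feature, which is the pattern a single global score would hide. A real run would need far more than four words per feature before any correlation deserved to be believed.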

This is especially useful for model governance. A model update can improve factual accuracy while degrading tone, concreteness, or sensory association. Without a diagnostic like this, the degradation may only appear after deployment. Which is a bold testing strategy, in the same way that checking whether a bridge works by driving payroll across it is bold.

The strongest use case is not “alignment”; it is language QA

The word “alignment” is doing a lot of work here. In AI discourse, alignment often implies safety, values, obedience, harmlessness, or preference satisfaction. This paper uses alignment in a narrower sense: agreement between LLM ratings and human ratings for psycholinguistic word features.

That narrower sense is commercially valuable precisely because it is measurable.

For operators, the right mental model is language QA. Psycholinguistic norms provide test cases for whether the model’s lexical associations resemble human associations on dimensions that affect user perception.

This is not a philosophical solution to whether models “understand”. It is a practical way to ask: when the system says “fresh”, “safe”, “sharp”, “warm”, “plain”, “intimate”, “dominant”, “familiar”, “abstract”, or “vivid”, does it operate in a space similar enough to the user’s?

For many business systems, similar enough is the point. A hotel chatbot does not need a soul. It does need to avoid describing a windowless room as “airy” because the distributional ghost of travel copy possessed it.

The sensory gap is a warning for multimodal product strategy

The Lancaster result should influence how teams think about multimodal AI.

A common business assumption is that multimodal models will naturally become better at sensory language because they have access to images, audio, or other modalities. This may eventually be true in specific architectures and training regimes. The paper, however, does not show a simple visual advantage for the multimodal models tested.

There are several possible reasons. The multimodal training may not be optimised for psycholinguistic sensory associations. The visual feature may not map cleanly to what current image-text training captures. The prompt may not activate multimodal representations in a way that improves word-level ratings. Or the model may know visual facts without organising them like human perceptual experience.

The operational conclusion is straightforward: do not assume multimodality equals grounded language. Test it.

For retail, hospitality, food, cosmetics, design, real estate, entertainment, and education, sensory language is not ornamental. It is part of expectation management. A mismatch between model associations and human associations can produce misleading descriptions, weak search results, strange recommendations, or copy that sounds fluent but subtly wrong.

That last category—fluent but subtly wrong—is where LLMs have built quite the franchise.

Boundaries that matter before anyone turns this into a leaderboard

The paper is explicitly an initial study. Its limitations affect how the results should be used.

Only two datasets are evaluated. Both are English. The model set is representative but limited. The metrics are inherited from psycholinguistic practice and may not be the final form of LLM-human word alignment measurement. Correlation is useful, but it is not the same as causal understanding, safe behaviour, robust deployment performance, or user satisfaction.

There is also a domain issue. General word norms may not capture specialised associations. A medical term, financial term, luxury brand adjective, construction material, food descriptor, or regional expression can carry meanings that are not well represented by broad English norms. In those cases, the method still helps, but organisations may need custom human ratings for their domain vocabulary.

Finally, the benchmark evaluates isolated word ratings. Real user interactions involve phrases, context, intent, culture, discourse history, and stakes. A model might rate “cold” correctly as a word and still mishandle “cold tone”, “cold storage”, “cold email”, “cold symptoms”, and “cold brew” in different contexts. Words are not the whole game. They are just where many mistakes begin.

The operator’s version of the research agenda

The authors suggest that psycholinguistic norms could become part of standard LLM evaluation and could help guide improvements, including synthetic text generation, post-training, and deeper study of multimodal models. That is the research path.

The operator path is narrower and more immediate.

Use psycholinguistic evaluation when language perception matters. Use it to compare models. Use it to detect regressions. Use it to check whether fine-tuning or prompt changes improve the feature you actually care about. Use it to separate “the model sounds better in demos” from “the model’s word associations moved closer to human ratings on the dimensions our customers notice”.

The deeper lesson is that LLM evaluation needs more than answer correctness. A model can pass a reasoning test and still choose words with mismatched emotional force, concreteness, familiarity, or sensory implication. That does not make the model useless. It makes it a machine whose language should be inspected at the level where humans actually experience language.

Words are not just answer tokens. They are tiny packets of expectation. The paper’s contribution is to show that we already have decades of human data that can help test whether models are carrying those packets in roughly the right shape.

Not glamorous. Very useful. The best benchmarks often are.

Cognaptus: Automate the Present, Incubate the Future.

¹ Javier Conde, Miguel González, María Grandury, Gonzalo Martínez, Pedro Reviriego, and Marc Brysbaert, “Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans,” arXiv:2506.22439, 2025.
