Model-Evaluation

Anchors Aweigh? Why Small LLMs Refuse to Flip Their Own Semantics

A label looks harmless until you ask it to lie. Tell a model that a glowing movie review should be labeled POS, and few-shot prompting behaves like a useful intern: it studies the examples, picks up the pattern, and usually gets better. Tell the same model that a glowing review should now be labeled NEG, and the intern becomes less useful. It does not smoothly learn your private code. It does not politely invert its semantic universe. It mostly produces a muddle. ...

Persona Non Grata: When LLMs Forget They're AI

Persona Non Grata: When LLMs Forget They’re AI A chatbot wearing a lab coat is still a chatbot. That sentence sounds obvious until a system prompt quietly says, “You are a renowned neurosurgeon with 25 years of experience,” and the model responds by inventing medical school, residency, fellowships, board certification, patient cases, and lifelong professional development. Not because anyone explicitly asked it to lie. Not because it lacks the ability to say “I am an AI.” Under neutral conditions, the models in this study almost always do say that. ...

Benchmarks Without Borders: Inside the Moduli Space of AI Psychometrics

Procurement Has a Benchmark Problem Procurement teams love benchmark tables. They are clean, sortable, and emotionally comforting. Vendor A beats Vendor B by 3.7 points on a reasoning suite; Vendor C wins on code generation; Vendor D claims better tool use under “realistic agent workflows,” a phrase that usually means someone added a browser, a calculator, and optimism. ...

LLMs, Trade-Offs, and the Illusion of Choice: When AI Preferences Fall Apart

A model can answer a values question beautifully and still collapse when asked to pay a price for that value. That is the awkward little trap in preference testing. Ask an LLM whether deletion, shutdown, resource loss, oversight, or autonomy matters, and it can produce a polished paragraph about trade-offs, agency, and safety. Very dignified. Very committee-ready. But the more interesting question is not what the model says it values. It is whether its choices change coherently when the cost changes. ...

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

TL;DR for operators A finance chatbot can retrieve the right document and still give the wrong answer. That is the uncomfortable bit. Retrieval gives the model evidence; it does not force the model to use that evidence correctly. FRED, short for Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, tackles the layer after retrieval: checking whether the generated answer actually matches the supplied context, then marking or correcting the factual errors.1 ...

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

Weight Watchers for LLMs: Dynamic Dieting Beats Static Selection

TL;DR for operators Training data is not a warehouse inventory problem. It is closer to nutrition. What helps a model early in pretraining may not be what helps it later, and a sample’s value can depend on the other samples sitting in the same batch. Obvious, perhaps. Operationalised? Less often. The paper behind this article, LLM Data Selection and Utilization via Dynamic Bi-level Optimization, proposes a Data Weighting Model, or DWM, that does not merely decide which data enters training. It assigns weights to samples within each batch, freezes those weights while the language model trains for a stage, then updates the weighting model using validation performance through a bi-level optimisation loop.1 ...

The Clock Inside the Machine: How LLMs Construct Their Own Time

TL;DR for operators Dates look harmless. They sit in spreadsheets, contracts, forecasts, audit trails, delivery plans, and board decks pretending to be objective little integers. The problem is that a language model may not treat them as just integers. A new paper, The Other Mind: How Language Models Exhibit Human Temporal Cognition, studies how 12 large language models judge similarity between years from 1525 to 2524.1 The authors find that larger models often organise years around a subjective reference point near the recent present, rather than simply comparing numerical distance. The models also show logarithmic compression: years farther from that reference point become less finely distinguished, in a pattern reminiscent of the Weber-Fechner law in human perception. ...

Inside Out: How LLMs Are Learning to Feel (and Misfeel) Like Us

TL;DR for operators LLMs are not merely getting better at choosing the right emotion label. This paper shows that, inside their output distributions, larger models organise emotion words into increasingly rich hierarchies: broad emotions such as joy or sadness sit above more specific states such as optimism, disappointment, or grief.1 That matters because the hierarchy itself becomes an evaluation object. Instead of asking only whether a model correctly labels a customer message as “angry,” an operator can ask whether the model’s internal emotion map has enough depth, whether related emotions cluster sensibly, and whether that structure changes when the model is prompted to adopt different demographic personas. ...

Bias, Baked In: Why Pretraining, Not Fine-Tuning, Shapes LLM Behavior

TL;DR for operators Fine-tuning is not a washing machine. It may polish, redirect, or occasionally muffle a model’s behavioural tendencies, but this paper suggests that many cognitive-bias patterns are already substantially shaped before instruction tuning begins. The study separates three possible sources of observed bias in large language models: the pretrained backbone, the instruction dataset, and random variation during fine-tuning. Its main finding is that models’ bias profiles cluster more strongly by pretrained model identity than by the instruction data used later. In plainer operational language: the base model carries a behavioural signature that survives downstream training. ...