Benchmarks Without Borders: Inside the Moduli Space of AI Psychometrics

Procurement Has a Benchmark Problem: Procurement teams love benchmark tables. They are clean, sortable, and emotionally comforting. Vendor A beats Vendor B by 3.7 points on a reasoning suite; Vendor C wins on code generation; Vendor D claims better tool use under “realistic agent workflows,” a phrase that usually means someone added a browser, a calculator, and optimism. ...

November 25, 2025 · 16 min · Zelina

LLMs, Trade-Offs, and the Illusion of Choice: When AI Preferences Fall Apart

A model can answer a values question beautifully and still collapse when asked to pay a price for that value. That is the awkward little trap in preference testing. Ask an LLM whether deletion, shutdown, resource loss, oversight, or autonomy matters, and it can produce a polished paragraph about trade-offs, agency, and safety. Very dignified. Very committee-ready. But the more interesting question is not what the model says it values. It is whether its choices change coherently when the cost changes. ...

November 18, 2025 · 12 min · Zelina

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

TL;DR for operators: A finance chatbot can retrieve the right document and still give the wrong answer. That is the uncomfortable bit. Retrieval gives the model evidence; it does not force the model to use that evidence correctly. FRED, short for Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, tackles the layer after retrieval: checking whether the generated answer actually matches the supplied context, then marking or correcting the factual errors.[1] ...

July 29, 2025 · 17 min · Zelina

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators: Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

July 29, 2025 · 17 min · Zelina

Weight Watchers for LLMs: Dynamic Dieting Beats Static Selection

TL;DR for operators: Training data is not a warehouse inventory problem. It is closer to nutrition. What helps a model early in pretraining may not be what helps it later, and a sample’s value can depend on the other samples sitting in the same batch. Obvious, perhaps. Operationalised? Less often. The paper behind this article, LLM Data Selection and Utilization via Dynamic Bi-level Optimization, proposes a Data Weighting Model, or DWM, that does not merely decide which data enters training. It assigns weights to samples within each batch, freezes those weights while the language model trains for a stage, then updates the weighting model using validation performance through a bi-level optimisation loop.[1] ...

July 23, 2025 · 17 min · Zelina

The Clock Inside the Machine: How LLMs Construct Their Own Time

TL;DR for operators: Dates look harmless. They sit in spreadsheets, contracts, forecasts, audit trails, delivery plans, and board decks pretending to be objective little integers. The problem is that a language model may not treat them as just integers. A new paper, The Other Mind: How Language Models Exhibit Human Temporal Cognition, studies how 12 large language models judge similarity between years from 1525 to 2524.[1] The authors find that larger models often organise years around a subjective reference point near the recent present, rather than simply comparing numerical distance. The models also show logarithmic compression: years farther from that reference point become less finely distinguished, in a pattern reminiscent of the Weber-Fechner law in human perception. ...

July 22, 2025 · 16 min · Zelina

Inside Out: How LLMs Are Learning to Feel (and Misfeel) Like Us

TL;DR for operators: LLMs are not merely getting better at choosing the right emotion label. This paper shows that, inside their output distributions, larger models organise emotion words into increasingly rich hierarchies: broad emotions such as joy or sadness sit above more specific states such as optimism, disappointment, or grief.[1] That matters because the hierarchy itself becomes an evaluation object. Instead of asking only whether a model correctly labels a customer message as “angry,” an operator can ask whether the model’s internal emotion map has enough depth, whether related emotions cluster sensibly, and whether that structure changes when the model is prompted to adopt different demographic personas. ...

July 16, 2025 · 17 min · Zelina

Bias, Baked In: Why Pretraining, Not Fine-Tuning, Shapes LLM Behavior

TL;DR for operators: Fine-tuning is not a washing machine. It may polish, redirect, or occasionally muffle a model’s behavioural tendencies, but this paper suggests that many cognitive-bias patterns are already substantially shaped before instruction tuning begins. The study separates three possible sources of observed bias in large language models: the pretrained backbone, the instruction dataset, and random variation during fine-tuning. Its main finding is that models’ bias profiles cluster more strongly by pretrained model identity than by the instruction data used later. In plainer operational language: the base model carries a behavioural signature that survives downstream training. ...

July 13, 2025 · 16 min · Zelina

School of Thought: How Fine-Tuned Open LLMs Are Challenging the Giants in Education

TL;DR for operators: A useful AI education product does not always need the largest model in the room. Sometimes it needs a smaller model that has been taught one job properly and then told, firmly, not to hand students the answer on a silver platter. The paper behind this article studies exactly that: whether supervised fine-tuning can make open-source models good enough to explain C programming errors for novice students. The authors use real CS1/2 error logs from DCC Help, generate 40,000 structured explanations with GPT-4.1, fine-tune Qwen3-4B, Llama-3.1-8B, and Qwen3-32B using QLoRA, then compare them against base models, GPT-4.1, and the original deployed DCC Help responses. ...

July 9, 2025 · 18 min · Zelina

Collapse to Forget: Turning Model Collapse into a Privacy Feature for LLMs

TL;DR for operators: When an LLM leaks sensitive, copyrighted, or otherwise forbidden information, the obvious repair is to fine-tune it away from the bad answer. That sounds sensible until you notice the small operational comedy: the remediation process keeps using the very answer it is supposed to remove. The paper behind this article proposes Partial Model Collapse (PMC), a machine unlearning method that avoids directly optimising on ground-truth forget answers. Instead, PMC asks the model the sensitive question, samples multiple responses from the model itself, selects a response that is less like the model’s original answer, and fine-tunes on that self-generated response while also training on retain data to preserve general utility.[1] ...

July 8, 2025 · 16 min · Zelina