Language Models

If Logic Were Enough: Why LLMs Still Miss the Point of Conditionals

A promise is rarely just a logical operator. “If you mow the lawn, I’ll give you 50 dollars” does not sound like a philosophical exercise in truth tables. It sounds like a deal. Most people hear it as: no mowing, no money. By contrast, “If you’re hungry, there’s pizza in the oven” does not mean the pizza appears only under the metaphysical condition of your hunger. It means the pizza is there, and your hunger merely explains why I am telling you. ...

Prompt and Circumstance: Why One Accuracy Number Is Not a Reliability Audit

Opening: Why this matters now. The AI market has learned to worship benchmark tables with the solemnity once reserved for quarterly earnings. One model is up two points on MMLU, another is slightly better at reasoning, a third is cheaper, smaller, faster, and therefore apparently ready to run your compliance workflow by Tuesday. ...

When Language Models Ask for Help: The Curious Case of Uncertain AI

Escalation is the least glamorous part of automation. It is also where many systems either become useful or become expensive theatre. In a normal business workflow, we understand escalation almost instinctively. A junior analyst handles routine invoices. An exception goes to a senior reviewer. A suspicious transaction goes to compliance. A warehouse robot follows a route until the floor plan stops behaving like yesterday’s floor plan. Nobody sensible asks the senior reviewer to approve every invoice. Nobody sensible lets the junior analyst improvise when the case is clearly outside their experience. ...

Recurrent Revival: How Retrofitted Depth Turns LLMs Into Deeper Thinkers

Compute is the bill that arrives after every AI strategy meeting. Everyone wants stronger reasoning. Fewer hallucinations. Better mathematical reliability. More robust planning. The usual menu is familiar: train a bigger model, sample more answers, generate longer chain-of-thought, bolt on a verifier, or pray to the GPU procurement gods. Elegant, in the way an invoice can be elegant. ...

When Noisy Data Talks Back: The Fragile Art of Learning Under Infinite Contamination

Bad data is not one problem. It is at least three problems wearing the same cheap trench coat. There is bad data that appears once and disappears. There is bad data that keeps appearing, but becomes rarer as the corpus grows. And there is bad data that settles in at a stable rate, like a permanent tenant with poor hygiene and legal representation. Business discussions about AI training data often compress these into one vague category called “noise”. Convenient, yes. Informative, no. ...

Unpacking the Explicit Mind: How ExplicitLM Redefines AI Memory

Memory is useful until nobody can find where it lives. That, in miniature, is the operational problem with today’s language models. They can answer questions, imitate expertise, retrieve fragments of the past, and produce very confident nonsense with the composure of a senior consultant who has just discovered bullet points. But when a model gives a wrong factual answer, the organisation deploying it faces an awkward question: where, exactly, is that wrong fact stored? ...

Benchmarks That Fight Back: Adaptive Testing for LMs

A benchmark is supposed to be a measuring instrument. In practice, many AI benchmarks behave more like a tired clipboard. Every model gets the same questions. Every question receives the same accounting treatment. The final score is usually a mean accuracy number, neat enough for a leaderboard and blunt enough to hide the messy truth underneath. Some items are too easy to tell strong models apart. Some are too hard to tell weak models apart. Some are mislabeled. Some have stopped mattering because everyone competent now solves them. Yet the ritual continues: run the suite, average the answers, update the chart, pretend the thermometer is not melting. ...

Circuits of Understanding: A Formal Path to Transformer Interpretability

TL;DR for operators: Debugging. That is the useful mental entry point, not “AI transparency,” which has become a conference badge phrase with slightly better lighting. The paper at the centre of this article, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small”, shows that a real linguistic behaviour in a transformer can be decomposed into a circuit of internal components, then tested using causal interventions rather than admired through colourful attention maps.[1] The task is indirect object identification: given a sentence where two names appear and one is repeated, the model predicts the other name. Small grammar problem, large interpretability bill. ...

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

TL;DR for operators: Most AI evaluation still asks whether a model can produce the right answer. This paper asks a quieter but more commercially awkward question: when a model uses a word, does it attach human-like emotional, concrete, familiar, gendered, or sensory associations to that word?[1] The authors propose using established psycholinguistic word norms as an automated alignment test. Instead of hiring new human raters every time, they reuse datasets where humans have already rated thousands of English words on features such as arousal, valence, concreteness, imageability, familiarity, gender association, and sensory modalities. ...

Bias Busters: Teaching Language Agents to Think Like Scientists

TL;DR for operators: Language-model agents do not merely make wrong causal guesses. In this paper, they gather evidence in a biased way, then interpret that evidence through the same bias. That is the uncomfortable part. The study turns the classic Blicket Test from developmental psychology into a text-based active exploration game for LM agents. The agent must test objects, observe whether a machine turns on, then infer which objects are “Blickets” and whether the hidden rule is disjunctive (any Blicket activates the machine) or conjunctive (all relevant Blickets must be present together).[1] ...
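
For readers who want the two rule types in concrete form, here is a minimal sketch of the machine's behaviour as the excerpt describes it. The function name `machine_turns_on`, the string-valued `rule` argument, and the object labels are illustrative assumptions; none of this is the paper's actual environment or API.

```python
from typing import Iterable, Set


def machine_turns_on(placed: Iterable[str], blickets: Set[str], rule: str) -> bool:
    """Return True if the machine activates for the objects placed on it.

    Hypothetical helper: the names and the string-valued `rule` are
    illustrative assumptions, not the paper's environment API.
    """
    present = set(placed) & blickets  # Blickets currently on the machine
    if rule == "disjunctive":
        # Any single Blicket is enough.
        return len(present) > 0
    if rule == "conjunctive":
        # Every Blicket must be on the machine at the same time.
        return len(blickets) > 0 and present == blickets
    raise ValueError(f"unknown rule: {rule!r}")


# Assumed toy setup: objects A and B are Blickets, C is not.
blickets = {"A", "B"}
trials = [["A"], ["B"], ["C"], ["A", "B"], ["A", "C"]]
outcomes = {
    rule: [machine_turns_on(t, blickets, rule) for t in trials]
    for rule in ("disjunctive", "conjunctive")
}
```

In this toy setup, placing a single Blicket on its own separates the two rules, while placing both Blickets together activates the machine under either rule and so tells the agent nothing about which rule is in force.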