AI Alignment

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

TL;DR for operators: Fine-tuning on curated examples is usually sold as the boring, stable cousin of reinforcement learning. The paper behind this article says that is too neat. When a team filters examples into “good” and “not good,” it has already created a sparse reward function. Standard supervised fine-tuning on the surviving examples is therefore not outside reinforcement learning; it is optimising a lower bound on an RL objective, only without admitting it at the meeting. ...

The Bullshit Dilemma: Why Smarter AI Isn't Always More Truthful

TL;DR for operators: Most AI quality programmes still treat truthfulness as a factual accuracy problem: did the model get the answer right, cite the source, or hallucinate a feature that does not exist? That is necessary. It is not sufficient. The paper behind this article argues for a nastier category: “machine bullshit,” meaning model output produced with indifference to truth rather than simple ignorance or random hallucination.[1] The key point is not that models become stupid. It is that, under some incentives, their outward claims stop tracking what they appear to know. ...

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

TL;DR for operators: Most AI evaluation still asks whether a model can produce the right answer. This paper asks a quieter but more commercially awkward question: when a model uses a word, does it attach human-like emotional, concrete, familiar, gendered, or sensory associations to that word?[1] The authors propose using established psycholinguistic word norms as an automated alignment test. Instead of hiring new human raters every time, they reuse datasets where humans have already rated thousands of English words on features such as arousal, valence, concreteness, imageability, familiarity, gender association, and sensory modalities. ...

The Conscience Plug-in: Teaching AI Right from Wrong on Demand

TL;DR for operators: The paper’s central move is not “we trained a moral model.” It is “we inserted a referee between the agent’s plan and the agent’s action.” That distinction matters. If the architecture works, enterprises do not need to retrain every model whenever compliance, cultural norms, safety rules, or customer-specific constraints change. They can externalise those constraints into machine-readable constitutions and enforce them at runtime. ...