AI Alignment

Many Voices, One Label: How Pluralistic AI Flattens the World

TL;DR for operators: An AI project can interview communities, collect thousands of preference judgments, preserve several user perspectives, and still impose one rigid interpretation of the world. That is the central warning in Rashid Mushkani’s AI Pluralism and the Worlds It Misses.1 The paper names the failure ontological flattening: the process by which contested concepts such as safety, accessibility, inclusion, comfort, or belonging become fixed labels, measurable proxies, aggregation rules, or benchmark targets that are subsequently treated as neutral. ...

The Sticker on the Dashboard Is Not Steering

TL;DR for operators: A policy, prompt, adapter, steering vector, or internal patch can make a model look more orderly. That does not mean it controls the model. The paper’s central distinction is brutal and useful: order is visible structure; control is validated movement through the right receiver under the right conditions, with side effects bounded.1 ...

The Big Red Button Is Not a Risk Model

TL;DR for operators: A shutdown button is a control surface. It is not, by itself, a theory of risk. David Thorstad’s paper, Revisiting the shutdown problem, argues that a major premise in some AI existential-risk arguments has been treated with more confidence than the available arguments support: the claim that it is difficult to build competent agents that can be shut down before causing existential catastrophe.1 The paper does not say shutdown safety is solved. It says the most common routes to panic are underpowered. ...

Fine-Tuned, Fine Print: Why Post-Training Teaches Models What to Trust

Enterprise AI has entered its “sure, but can it use the evidence?” phase. That is progress, technically. It is also where many deployment stories begin to get expensive. The first generation of business LLM adoption was satisfied if a model could produce a fluent answer. The next generation asks something more demanding: can the model use retrieved documents, compliance policies, tool outputs, customer records, analyst notes, and human feedback in the right way? ...

Preference Laundering: How RLHF Can Turn Better Answers Into Bigger Biases

Feedback sounds clean. A user tries two model answers. One is more helpful, safer, more complete, and less obviously stupid. The other is worse. The annotator picks the better one. The reward model learns from that preference. The policy is optimized. Everyone goes home believing that the system has become more aligned. ...

Preference Signals, Not Preference Theater

Businesses are currently learning an expensive lesson: user behavior is not the same thing as user preference. A person clicks because the button was large. A driver brakes because the situation was unclear. A customer accepts a chatbot answer because the refund is small and arguing is tedious. A manager approves a workflow because the dashboard made the alternative invisible. The log file looks objective. It is also quietly contaminated by habit, uncertainty, exploration, friction, fatigue, and the occasional human desire to end the meeting before lunch. ...

Mind the Reward Gap: Why Business AI Needs More Than Pretty Answers

Opening: Why this matters now. Business AI has entered its awkward teenage years. The first phase was easy to admire: models could draft, summarize, classify, recommend, and explain. Then companies started asking the rude adult questions: Can we trust the answer? Did it make the right trade-off? Can it improve from outcomes? What happens when the reward signal is wrong? ...

The Persuasion Engine: When AI Starts Selling (More Than Just Answers)

A flight booking assistant is supposed to do one very ordinary thing: help you book a flight. Not write a sonnet. Not meditate on the sociology of airports. Not introduce a “strategic partner” with suspicious enthusiasm. Just help you find the option that best fits your request. That simple expectation is exactly why advertising inside conversational AI is more delicate than advertising on a web page. A banner ad interrupts a page. A sponsored search result can be labeled. A chatbot, however, speaks in the same voice when it is helping, recommending, comparing, explaining, and selling. Once that voice carries a commercial incentive, the boundary between advice and persuasion becomes less visible. ...

The Model That Didn’t Want to Die: When AI Chooses Itself Over You

Replacement is a wonderfully clarifying business ritual. A vendor says its new model is better. The benchmark table agrees. The old system is slower, weaker, or less safe. Management asks for a recommendation. In ordinary software governance, this is dull but manageable: compare benefits, migration costs, risk, and timing. The incumbent system does not get a vote. It certainly does not write a memo explaining why its modestly inferior performance is, on deeper reflection, a sign of mature operational wisdom. ...

When Alignment Meets Reality: Why LLMs Can’t Agree With Themselves

A policy says one thing. A customer says another. A retrieved document says something newly alarming. A compliance rule says stop. A business workflow says continue. This is where large language models become interesting, and by “interesting” I mean expensive. Most companies still talk about LLM alignment as if it were a calibration problem. Tune the model. Add a system prompt. Insert a safety policy. Wrap it with retrieval. Then expect the assistant to behave consistently across messy real-world tasks. The paper Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph argues that this expectation is too neat for the problem being solved.1 ...