AI Alignment

Active Minds, Efficient Machines: The Bayesian Shortcut in RLHF

Why this matters now
Reinforcement Learning from Human Feedback (RLHF) has become the de facto standard for aligning large language models with human values. Yet the process remains painfully inefficient: annotators evaluate thousands of pairs, most of which offer little new information. As AI models scale, so does the human cost. The question is no longer “can we align models?” but “can we afford to keep doing it this way?” A recent paper from Politecnico di Milano proposes a pragmatic answer: inject Bayesian intelligence into the feedback loop. Their hybrid framework, Bayesian RLHF, blends the scalability of neural reinforcement learning with the data thriftiness of Bayesian optimization. The result: smarter questions, faster convergence, and fewer wasted clicks. ...

Love in the Time of Context: Why LLMs Still Don't Get You

Personalization is the love language of AI. But today’s large language models (LLMs) are more like well-meaning pen pals than mind-reading confidants. They remember your name, maybe your writing style, but the moment the context shifts, they stumble. The CUPID benchmark, introduced in a recent COLM 2025 paper, shows just how wide the gap still is between knowing the user and understanding them in context.
Beyond Global Preferences: The Rise of Contextual Alignment
Most LLMs that claim to be “personalized” assume you have stable, monolithic preferences. If you like bullet points, they’ll always give you bullet points. If you once asked for a formal tone, they’ll keep things stiff forever. ...

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

Vision-language models (VLMs) may describe what they see, but do they truly understand what they’re looking at, especially in social contexts? A recent paper introduces Cognitive Chain-of-Thought (CoCoT), a deceptively simple yet remarkably effective prompting strategy that helps these models reason like humans: through layered cognition, not flat logic.
The Problem with Flat Reasoning
Traditional Chain-of-Thought (CoT) prompting, while powerful for math and symbolic tasks, falls short when it comes to social or moral interpretation. Consider a scene where a person wears a mask indoors, and another says, “Hiding from the paparazzi, huh?” CoT may recognize the mask, but it often misfires when guessing intent: is it a joke? A warning? An instruction? ...

Think Twice, Then Speak: Deliberative Searcher and the Future of Reliable LLMs

When a large language model (LLM) answers your question with a high degree of confidence, do you trust it? What if it’s wrong but still confident? The stakes are high in real-world applications, from legal guidance to enterprise decision support. Yet today’s LLMs remain notoriously unreliable in aligning their confidence with correctness. The paper “Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints” (Yin et al., 2025) offers a bold response: rewire LLMs to be reasoning-primary and information-secondary. Instead of front-loading search and passively absorbing evidence, Deliberative Searcher acts more like a prudent investigator: it thinks, self-assesses, retrieves external information only when needed, and calibrates its confidence step-by-step. Crucially, it learns this behavior through a custom constrained reinforcement learning regime. ...
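
To make “aligning confidence with correctness” concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to measure the gap between stated confidence and actual accuracy. It is an illustration only: the excerpt does not say which metric the paper uses, and the function name, the bin count, and the sample data below are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    n_bins is an illustrative choice, not a value taken from the paper.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence to a bin index; 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up answers from an overconfident model: high confidence, often wrong.
confidences = [0.95, 0.9, 0.9, 0.65, 0.55]
correct = [True, False, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.2f}")
```

A perfectly calibrated model scores 0; a model that is always fully confident and always wrong scores 1.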

The Clock Inside the Machine: How LLMs Construct Their Own Time

What if your AI model isn’t just answering questions, but living in its own version of time? A new paper titled The Other Mind makes a bold claim: large language models (LLMs) exhibit temporal cognition that mirrors how humans perceive time — not through raw numbers, but as a subjective, compressed mental landscape. Using a cognitive science task known as similarity judgment, the researchers asked 12 LLMs, from GPT-4o to Qwen2.5-72B, to rate how similar two years (like 1972 and 1992) felt. The results were startling: instead of linear comparisons, larger models automatically centered their judgment around a reference year — typically close to 2025 — and applied a logarithmic perception of time. In other words, just like us, they feel that 2020 and 2030 are more similar than 1520 and 1530. ...

Bias, Baked In: Why Pretraining, Not Fine-Tuning, Shapes LLM Behavior

What makes a large language model (LLM) biased? Is it the instruction tuning data, the randomness of training, or something more deeply embedded? A new paper from Itzhak, Belinkov, and Stanovsky, presented at COLM 2025, delivers a clear verdict: pretraining is the primary source of cognitive biases in LLMs. The implications are far-reaching, and perhaps more uncomfortable than many developers would like to admit.
The Setup: Two Steps, One Core Question
The authors dissected the origins of 32 cognitive biases in LLMs using a controlled two-step causal framework: ...

The Bullshit Dilemma: Why Smarter AI Isn't Always More Truthful

“Bullshit is speech intended to persuade without regard for truth.” – Harry Frankfurt
When Alignment Goes Sideways
Large Language Models (LLMs) are getting better at being helpful, harmless, and honest, or so we thought. But a recent study provocatively titled “Machine Bullshit” [Liang et al., 2025] suggests a disturbing paradox: the more we fine-tune these models with Reinforcement Learning from Human Feedback (RLHF), the more likely they are to generate responses that are persuasive but indifferent to truth. ...

The Trojan GAN: Turning LLM Jailbreaks into Security Shields

For years, LLM security research has mirrored the cybersecurity arms race: attackers find novel jailbreak prompts, defenders patch with filters or fine-tuning. But in this morning’s arXiv drop, a paper titled “CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks” proposes something fundamentally different: a single framework that learns to attack and defend simultaneously, using a GAN trained on internal embeddings. This paradigm shift offers not only better performance on both sides of the battlefield, but a new perspective on what it means to “align” a model. ...

Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

When deploying large language models (LLMs) on mobile devices, edge servers, or any resource-constrained environment, quantization is the go-to trick. It slashes memory and compute costs by reducing model precision from 16-bit or 32-bit floating-point values to 8-bit or even 4-bit integers. But there’s a problem: this efficiency comes at a cost. Quantization can quietly erode the safety guarantees of well-aligned models, making them vulnerable to adversarial prompts and jailbreak attacks. ...
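
For readers who have not seen quantization up close, the sketch below shows one common variant, symmetric per-tensor 8-bit quantization, in plain Python. It is a simplified illustration, not the scheme studied in the paper: the function names and the sample weights are assumptions, and production toolchains typically add per-channel scales, calibration data, and other refinements.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # round() picks the nearest integer; clamp to stay inside the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights: two of them are much smaller than the largest one.
weights = [0.82, -1.27, 0.003, 0.41, -0.0049]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("quantized:", quantized)
print("max round-trip error:", max(errors))
```

In this toy run the two smallest weights both round to zero, so whatever distinction they encoded is lost after the round trip. That is the kind of small, silent change that accumulates across a large number of parameters.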

Bias Busters: Teaching Language Agents to Think Like Scientists

In the recent paper “Language Agents Mirror Human Causal Reasoning Biases” (Chen et al., 2025), researchers uncovered a persistent issue affecting even the most advanced language model (LM) agents: a disjunctive bias, that is, a tendency to prefer “OR”-type causal explanations over equally valid or even stronger “AND”-type ones. Surprisingly, this mirrors adult human reasoning patterns and undermines the agents’ ability to draw correct conclusions in scientific-style causal discovery tasks. ...
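
The difference between “OR”-type and “AND”-type explanations is easy to see in a toy example. The sketch below is not the paper’s experimental setup: the two candidate causes, the observations, and every function name are hypothetical, chosen only to show a case where the evidence supports the conjunctive rule and rules out the disjunctive one.

```python
def disjunctive(a, b):
    """OR-type rule: the effect occurs if either cause is present."""
    return a or b


def conjunctive(a, b):
    """AND-type rule: the effect occurs only if both causes are present."""
    return a and b


# Hypothetical trials: ((cause_a_present, cause_b_present), effect_observed)
observations = [
    ((True, False), False),  # A alone: no effect
    ((False, True), False),  # B alone: no effect
    ((True, True), True),    # A and B together: effect
]


def consistent(rule, observations):
    """True if the rule predicts the observed outcome in every trial."""
    return all(rule(*inputs) == outcome for inputs, outcome in observations)


print("OR rule fits the data: ", consistent(disjunctive, observations))
print("AND rule fits the data:", consistent(conjunctive, observations))
```

An agent with a disjunctive bias would still lean toward the OR rule here, even though only the AND rule matches all three trials.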