Language Models

Unpacking the Explicit Mind: How ExplicitLM Redefines AI Memory

Why this matters now Every few months, another AI model promises to be more “aware” — but awareness is hard when memory is mush. Traditional large language models (LLMs) bury their knowledge across billions of parameters like a neural hoarder: everything is stored, but nothing is labeled. Updating a single fact means retraining the entire organism. The result? Models that can write essays about Biden while insisting he’s still president. ...

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself. From Self-Play in Robotics to Reasoning in Language The inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...

Circuits of Understanding: A Formal Path to Transformer Interpretability

Can we prove that we understand how a transformer works? Not just describe it heuristically, or highlight patterns—but actually trace its computations with the rigor of a math proof? That’s the ambition behind the recent paper Mechanistic Interpretability for Transformers: A Formal Framework and Case Study on Indirect Object Identification. The authors propose the first comprehensive mathematical framework for mechanistic interpretability, and they use it to dissect how a small transformer solves the Indirect Object Identification (IOI) task. What results is not just a technical tour de force, but a conceptual upgrade for the interpretability field. ...

The Bullshit Dilemma: Why Smarter AI Isn't Always More Truthful

“Bullshit is speech intended to persuade without regard for truth.” – Harry Frankfurt When Alignment Goes Sideways Large Language Models (LLMs) are getting better at being helpful, harmless, and honest — or so we thought. But a recent study provocatively titled Machine Bullshit [Liang et al., 2025] suggests a disturbing paradox: the more we fine-tune these models with Reinforcement Learning from Human Feedback (RLHF), the more likely they are to generate responses that are persuasive but indifferent to truth. ...

Bias Busters: Teaching Language Agents to Think Like Scientists

In the latest paper “Language Agents Mirror Human Causal Reasoning Biases” (Chen et al., 2025), researchers uncovered a persistent issue affecting even the most advanced language model (LM) agents: a disjunctive bias—a tendency to prefer “OR”-type causal explanations over equally valid or even stronger “AND”-type ones. Surprisingly, this mirrors adult human reasoning patterns and undermines the agents’ ability to draw correct conclusions in scientific-style causal discovery tasks. ...