
Mind-Reading Without Telepathy: Predictive Concept Decoders

Opening — Why this matters now
For years, AI interpretability has promised transparency while quietly delivering annotations, probes, and post-hoc stories that feel explanatory but often fail the only test that matters: can they predict what the model will actually do next? As large language models become agents—capable of long-horizon planning, policy evasion, and strategic compliance—interpretability that merely describes activations after the fact is no longer enough. What we need instead is interpretability that anticipates behavior. That is the ambition behind Predictive Concept Decoders (PCDs). ...

December 18, 2025 · 5 min · Zelina

When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Opening — Why this matters now
Large language models don’t think—but they do accumulate influence. And that accumulation is exactly where most explainability methods quietly give up. As LLMs move from single-shot text generators into multi-step reasoners, agents, and decision-making systems, we increasingly care why an answer emerged—not just which output token attended to which prompt word. Yet most attribution tools still behave as if each generation step lives in isolation. That assumption is no longer just naïve; it is actively misleading. ...

December 18, 2025 · 4 min · Zelina

When Circuits Go Atomic: Pruning Transformers One Neuron at a Time

Opening — Why this matters now
Mechanistic interpretability has a scaling problem. As language models grow larger and more embedded in high‑stakes workflows, the old habit of waving at “important attention heads” is starting to look quaint. If we want to understand how models reason — not just where something lights up — we need circuit discovery methods that scale without drowning GPUs in activations or collapsing everything into blunt architectural units. ...

December 12, 2025 · 4 min · Zelina

Structure Matters: Externalities and the Hidden Logic of GNN Decisions

When explaining predictions made by Graph Neural Networks (GNNs), most methods ask: Which nodes or features mattered most? But what if this question misses the real driver of decisions — not the nodes themselves, but how they interact? That’s the bet behind GraphEXT, a novel explainability framework that reframes GNN attribution through the lens of externalities — a concept borrowed from economics. Developed by Wu, Hao, and Fan (2025), GraphEXT goes beyond traditional feature- or edge-based attributions. Instead, it models how structural interactions among nodes — the very thing GNNs are designed to exploit — influence predictions. ...
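To make the structural-interaction framing concrete, here is a minimal, illustrative sketch of structure-aware attribution in a similar spirit: a Monte Carlo, Shapley-style estimate of how much each node's incident edges contribute to a toy graph-level score. This is not GraphEXT's actual algorithm; the toy aggregation model, the `graph_score` and `shapley_structural` functions, and all parameters are invented here for illustration.

```python
# Illustrative sketch only, NOT the GraphEXT method of Wu, Hao, and Fan (2025):
# a generic Monte Carlo, Shapley-style estimate of how much each node's
# structural role (its incident edges) contributes to a toy graph score.
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, symmetric adjacency, random features, linear readout.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
X = rng.normal(size=(4, 3))
w = rng.normal(size=3)


def graph_score(A, X, present):
    """Score a subgraph: one round of mean-neighbour aggregation restricted to
    the `present` nodes, then a linear readout summed over those nodes."""
    present = np.asarray(present, dtype=bool)
    A_sub = A * np.outer(present, present)            # drop edges touching absent nodes
    deg = A_sub.sum(axis=1, keepdims=True)
    agg = np.divide(A_sub @ X, np.maximum(deg, 1.0))  # mean over surviving neighbours
    h = X + agg                                       # residual-style aggregation
    return float((h[present] @ w).sum())


def shapley_structural(A, X, n_samples=2000):
    """Monte Carlo Shapley values: the average marginal change in graph_score
    when a node (together with its edges) joins a random coalition."""
    n = A.shape[0]
    phi = np.zeros(n)
    for _ in range(n_samples):
        order = rng.permutation(n)
        present = np.zeros(n, dtype=bool)
        prev = graph_score(A, X, present)
        for i in order:
            present[i] = True
            cur = graph_score(A, X, present)
            phi[i] += cur - prev
            prev = cur
    return phi / n_samples


print(np.round(shapley_structural(A, X), 3))
```

The point of the sketch is the coalition logic: a node's attribution is the marginal change it causes when it joins a partial graph together with its edges, which is the sense in which structure, not features alone, drives the explanation.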

July 26, 2025 · 3 min · Zelina

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

1. A Student Who Cracked the Code — But Not the Meaning
Imagine a student who aces every test by memorizing the positions of correct answers on multiple-choice sheets. He scores high, earns accolades, and passes every exam — but understands none of the material. His reward system is misaligned: success depends not on learning, but on exploiting test mechanics. Now, replace the student with an AI agent navigating a simulated room guided by language and images. This is the scenario that today’s leading research in Vision-and-Language Reinforcement Learning (RLVR) is grappling with. ...

June 13, 2025 · 5 min · Zelina