Transformers

Where to Go Deeper Beyond This Academy

A curated guide to textbooks, authors, websites, and papers for readers who want to study transformer internals, attention math, fine-tuning, GPU optimization, and benchmarking in more depth.

From Data to Atoms: How CliqueFlowmer Turns AI Into a Materials Inventor

Opening — Why this matters now For decades, discovering new materials has been painfully slow. The process typically involves theorizing candidate compounds, simulating their properties, synthesizing them in laboratories, and testing whether the results resemble the prediction. This loop—hypothesis, simulation, experiment—can take months or even years for a single promising compound. Artificial intelligence promised to accelerate this process. Yet most generative AI systems used in computational materials discovery behave like cautious imitators: they reproduce variations of materials already present in training datasets rather than aggressively searching for better ones. ...

When Tokens Explode: The Hidden Geometry Behind Attention Sinks

Opening — Why this matters now Large language models appear smooth from the outside: prompts go in, coherent text comes out. But internally, their numerical dynamics are anything but calm. In fact, inside many modern Transformers, certain tokens briefly explode into extreme values thousands of times larger than their neighbors. At the same time, a small set of tokens—often the very first token in a sequence—attracts an overwhelming share of attention from many heads. These are known as attention sinks. ...

Quantum Routes, Real Gains: When Transformers Meet CVRP

Opening — Why this matters now Routing problems are the unglamorous backbone of modern logistics. Every e‑commerce delivery, warehouse dispatch, and last‑mile optimization problem eventually collapses into some variant of the Capacitated Vehicle Routing Problem (CVRP). It is also, inconveniently, NP‑hard. Classical heuristics scale. Deep learning brings adaptability. Quantum computing promises expressivity. The uncomfortable question is whether these promises stack—or cancel each other out. ...

Attention with Doubt: Teaching Transformers When Not to Trust Themselves

Opening — Why this matters now Modern transformers are confident. Too confident. In high-stakes deployments—question answering, medical triage, compliance screening—this confidence routinely outruns correctness. The problem is not accuracy; it is miscalibration. Models say “I’m sure” when they shouldn’t. Most fixes arrive late in the pipeline: temperature scaling, Platt scaling, confidence rescaling after the model has already reasoned itself into a corner. What if uncertainty could intervene earlier—during reasoning rather than after the verdict? ...

When ERP Meets Attention: Teaching Transformers to Pack, Schedule, and Save Real Money

Opening — Why this matters now Enterprise Resource Planning (ERP) systems are excellent at recording what has happened. They are far less impressive at deciding what should happen next. When decision-making involves combinatorial explosions—packing furnaces, sequencing machines, allocating scarce inputs—ERP often falls back on brittle heuristics, slow solvers, or human intuition. None scale gracefully. ...

Scaling Laws Without Power Laws: Why Bigger Models Still Win

Opening — Why this matters now The scaling law debate was supposed to be settled. Bigger models, more data, more compute—loss falls predictably. Then came the uncomfortable question: what exactly is being scaled? If power laws in natural language data are the root cause, then scaling laws might be an artifact of language itself, not of learning. This paper dismantles that comfort. ...

When Tokens Become Actions: A Policy Gradient Built for Transformers

Opening — Why this matters now Reinforcement learning has always assumed that actions are atomic. Large language models politely disagree. In modern LLM training, an “action” is rarely a single move. It is a sequence of tokens, often structured, sometimes tool‑augmented, occasionally self‑reflective. Yet most policy‑gradient methods still pretend that Transformers behave like generic RL agents. The result is a growing mismatch between theory and practice—especially visible in agentic reasoning, tool use, and long‑horizon tasks. ...

When Circuits Go Atomic: Pruning Transformers One Neuron at a Time

Opening — Why this matters now Mechanistic interpretability has a scaling problem. As language models grow larger and more embedded in high‑stakes workflows, the old habit of waving at “important attention heads” is starting to look quaint. If we want to understand how models reason — not just where something lights up — we need circuit discovery methods that scale without drowning GPUs in activations or collapsing everything into blunt architectural units. ...

Circuits of Understanding: A Formal Path to Transformer Interpretability

Can we prove that we understand how a transformer works? Not just describe it heuristically, or highlight patterns—but actually trace its computations with the rigor of a math proof? That’s the ambition behind the recent paper Mechanistic Interpretability for Transformers: A Formal Framework and Case Study on Indirect Object Identification. The authors propose the first comprehensive mathematical framework for mechanistic interpretability, and they use it to dissect how a small transformer solves the Indirect Object Identification (IOI) task. What results is not just a technical tour de force, but a conceptual upgrade for the interpretability field. ...