Brains with Gradients: Why Energy-Based Transformers Might Be the Future of Thinking Machines

AI models are getting better at mimicking human intuition (System 1), but what about deliberate reasoning—slow, careful System 2 Thinking? Until now, most methods required supervision (e.g., reward models, verifiers, or chain-of-thought engineering). A new architecture, Energy-Based Transformers (EBTs), changes that. It offers a radically unsupervised, architecture-level path toward models that “think,” not just react. The implications for robust generalization, dynamic reasoning, and agent-based autonomy are profound.

🧠 The Core Insight: Thinking as Energy Minimization

Instead of predicting outputs in a single pass, EBTs learn to verify the compatibility between context and prediction using an energy scalar. Predictions are refined iteratively via gradient descent, reducing this energy—much like how humans reason through trial and error.

EBT = Transformer + Verifier + Gradient-Driven Self-Correction
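As a toy sketch of that loop (the quadratic `energy` below is a hypothetical stand-in for the learned Transformer verifier, and the step size and step count are arbitrary assumptions), inference looks like ordinary gradient descent, except the gradient is taken with respect to the *prediction* rather than the weights:

```python
# Toy sketch of EBT inference: instead of emitting an answer in one forward
# pass, start from a guess and descend the energy surface. The quadratic
# `energy` is a hypothetical stand-in for the learned verifier; a real EBT
# computes it with a Transformer over (context, prediction) and uses autograd.

def energy(context, prediction):
    # Low energy = prediction is compatible with the context.
    target = sum(context) / len(context)  # stand-in for the "right" answer
    return (prediction - target) ** 2

def energy_grad(context, prediction, eps=1e-5):
    # Finite-difference gradient w.r.t. the *prediction*, not the weights.
    return (energy(context, prediction + eps)
            - energy(context, prediction - eps)) / (2 * eps)

def think(context, y0=0.0, lr=0.1, steps=50):
    y = y0
    for _ in range(steps):
        y -= lr * energy_grad(context, y)  # each step is one unit of "thinking"
    return y

print(think([1.0, 2.0, 3.0]))  # converges toward 2.0
```

The important structural point is that the same model both scores and refines: there is no separate reward model in the loop.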

This mechanism naturally unlocks three cognitive behaviors:

| Facet | Human Analogy | Enabled in EBT? |
| --- | --- | --- |
| Dynamic compute allocation | Thinking longer on harder problems | ✅ |
| Modeling uncertainty | Knowing when you’re unsure | ✅ |
| Prediction verification | Double-checking your answer | ✅ |

Unlike diffusion models, which iterate to denoise but need external verifiers, and unlike RNNs, which lack explicit uncertainty modeling, EBTs are self-contained thinkers: verification, uncertainty, and iterative refinement all live in one model.
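The first behavior listed above, dynamic compute allocation, falls out of this mechanism naturally: keep taking gradient steps while the energy is still dropping, and stop once it plateaus. A minimal sketch, where the toy quadratic energies, step size, and tolerance are assumptions standing in for the learned verifier:

```python
# Early-stopping sketch of dynamic compute allocation: keep refining while the
# energy keeps dropping, stop when it plateaus. The quadratic energies, step
# size, and tolerance are toy assumptions, not the paper's settings.

def think_adaptive(energy_fn, grad_fn, y0=0.0, lr=0.1, tol=1e-6, max_steps=200):
    y, steps = y0, 0
    e_prev = energy_fn(y)
    for steps in range(1, max_steps + 1):
        y -= lr * grad_fn(y)
        e = energy_fn(y)
        if e_prev - e < tol:  # energy plateaued: stop "thinking"
            break
        e_prev = e
    return y, steps

def num_grad(f, eps=1e-5):
    return lambda y: (f(y + eps) - f(y - eps)) / (2 * eps)

easy = lambda y: (y - 1.0) ** 2         # steep landscape: quick to settle
hard = lambda y: 0.05 * (y - 1.0) ** 2  # flat landscape: needs more steps

_, easy_steps = think_adaptive(easy, num_grad(easy))
_, hard_steps = think_adaptive(hard, num_grad(hard))
print(easy_steps, hard_steps)  # the flatter "harder" problem gets more steps
```

Compute is spent where the energy landscape is hard to descend, which is the gradient-level analogue of thinking longer on harder problems.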


🚀 Outscaling Transformer++ Across the Board

The EBT paper compares its architecture to Transformer++ across six key axes: data, batch size, depth, parameters, FLOPs, and width. The result?

📈 EBTs are the first architecture to out-scale Transformer++ on all these fronts.

Even with worse pretraining perplexity, EBTs beat Transformer++ on downstream reasoning tasks like SQuAD, BigBench Math QA, and Dyck Languages. Why?

Because verifiers generalize better than predictors. It’s easier to check if a solution is right than to guess it from scratch.


🧪 Inference = Thinking. EBTs Improve by Doing More.

Traditional Transformers can’t do much at inference beyond tweaking the sampling temperature. EBTs, on the other hand, improve performance by “thinking” longer:

  • Thinking longer (more gradient steps) → 29% gain on OOD tasks
  • Self-verification (best-of-N sampling) → up to 14% gain with scale
  • Uncertainty-aware prediction → converges faster for easy tokens, hesitates on ambiguous ones

These aren’t just empirical tricks. They’re signs that EBTs encode reasoning cost, epistemic awareness, and selective effort—hallmarks of cognitive control.
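Self-verification in particular needs no second model: the same energy head that refines predictions can also rank them. A best-of-N sketch, where the quadratic `energy` and the noisy Gaussian sampler are hypothetical stand-ins for the learned verifier and the model's own fast System 1 guesses:

```python
import random

# Best-of-N self-verification sketch: sample N candidate predictions, score
# each with the energy, keep the lowest. The quadratic `energy` and the
# Gaussian sampler are hypothetical stand-ins for the learned verifier and
# the model's own "System 1" guesses.

def energy(prediction, target=2.0):
    return (prediction - target) ** 2

def best_of_n(sampler, n=32):
    candidates = [sampler() for _ in range(n)]
    return min(candidates, key=energy)  # the model verifies its own samples

random.seed(0)
noisy_guess = lambda: 2.0 + random.gauss(0.0, 1.0)
pick = best_of_n(noisy_guess)
print(pick)  # the lowest-energy candidate lands near the target
```

Raising N trades extra forward passes for accuracy, which is the knob behind the scaling gains quoted above.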


📷 Visual Domains: Doing More with Less

In video and image domains, EBTs outperform Diffusion Transformers while using 99% fewer forward passes. Their performance improves linearly with compute effort, enabling:

  • Higher PSNR on image denoising tasks
  • Clearer visual uncertainty over time (e.g., first blurry frame has high energy)
  • 10× better accuracy in ImageNet probing tasks than DiTs

This is crucial for autonomous agents operating in uncertain environments, where understanding, not just generating, is the goal.


🧩 More Than a Model, a New Cognitive Framework

EBTs challenge the conventional wisdom that “more layers, more data” is the only path to smarter models. Instead, they:

  • Show that verification-first training scales better
  • Provide dynamic thinking behaviors at inference without supervision
  • Generalize more robustly, especially on out-of-distribution (OOD) tasks

This aligns well with agentic LLM systems, where multiple subagents evaluate, verify, and update beliefs over time. EBTs offer a mechanism for implementing this with gradient-grounded logic, not just token-level heuristics.


🛠️ Limitations and the Road Ahead

Yes, EBTs are compute-hungry—training requires second-order gradients and inference needs multiple optimization steps. They’re not plug-and-play replacements for current LLMs yet.

But the vision is compelling: a self-verifying, uncertainty-aware, scalable brain for machines. With training tricks like replay buffers, Langevin dynamics, and randomized optimization, the authors stabilize learning and unlock System 2 Thinking that emerges from pretraining alone.
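One of those stabilizers, Langevin dynamics, simply adds Gaussian noise scaled to the step size at each refinement step, so the optimizer explores the energy surface instead of collapsing into a single basin. A minimal sketch (the toy energy, step size, and step count are assumptions, not the paper's settings):

```python
import math
import random

# Langevin-dynamics sketch, one of the stabilizers the authors mention: each
# refinement step adds Gaussian noise scaled to the step size, so optimization
# explores the energy surface instead of collapsing into one basin. The toy
# energy (y - 2)^2, step size, and step count are assumptions.

def langevin_step(y, grad, lr, rng):
    noise = rng.gauss(0.0, math.sqrt(2.0 * lr))
    return y - lr * grad(y) + noise

def sample(energy_grad, y0=0.0, lr=0.01, steps=500, seed=0):
    rng = random.Random(seed)
    y = y0
    for _ in range(steps):
        y = langevin_step(y, energy_grad, lr, rng)
    return y

toy_grad = lambda y: 2.0 * (y - 2.0)  # gradient of the toy energy (y - 2)^2
print(sample(toy_grad))  # a noisy sample hovering near the minimum at y = 2
```

The noise term is what turns a brittle point optimizer into an approximate sampler over low-energy predictions.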

It’s not just an AI model. It’s an optimization-based mind.


Cognaptus: Automate the Present, Incubate the Future.