Opening — Why this matters now
Large language models are now a routine part of software development. They autocomplete functions, explain repositories, and quietly sit inside CI pipelines. The productivity gains are real. The energy bill is less visible.
As inference increasingly dominates the lifecycle cost of LLMs, the environmental question is no longer about how models are trained, but how often—and how inefficiently—they are used. This paper asks an unfashionable but necessary question: where exactly does inference energy go? The answer turns out to be uncomfortable.
Background — Inference is not one thing
Most discussions treat inference as a single, uniform operation. In practice, it is split into two structurally different phases:
- Prefill: the model ingests the prompt, builds attention state, and populates the key–value (KV) cache.
- Decoding: tokens are generated autoregressively, one by one, using the cached state.
These phases stress hardware differently. Prefill is compute-heavy and parallel. Decoding is sequential and memory-bound. Ignoring this distinction hides both inefficiencies and optimization opportunities—especially for software engineering workloads where input and output lengths vary dramatically.
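The split is easy to see in code. Below is a minimal sketch of a two-phase generation loop, assuming a Hugging Face causal LM on a CUDA GPU; the model ID, prompt, and greedy decoding are illustrative choices, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only causal LM behaves the same way.
model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

prompt_ids = tok("def quicksort(xs):", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # Prefill: one parallel pass over the whole prompt, populating the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: strictly sequential; each step feeds one token plus the growing cache.
    generated = [next_id]
    for _ in range(64):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        if next_id.item() == tok.eos_token_id:
            break

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

The prefill pass touches every prompt token at once and saturates compute; the decode loop emits one token per step but rereads the entire cache each time, which is exactly why the two phases deserve separate measurement.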
Analysis — Measuring energy where it actually leaks
The authors evaluate ten decoder-only transformer models across two parameter bands (3–4B and 6–7B) using realistic developer tasks:
- Code generation (HumanEval): short prompts, potentially long outputs
- Code understanding (LongBench): long context, short answers
Energy is measured at the GPU level and aligned precisely with token generation timestamps, allowing the authors to attribute joules to individual tokens and inference phases.
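The paper's instrumentation is not reproduced here, but the general recipe can be approximated with NVML power polling aligned to per-token timestamps. In the sketch below, the sampling period, the trapezoidal integration, and the rule of attributing each interval's energy to the token that closes it are assumptions, not the authors' exact method.

```python
import threading, time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []                # (timestamp, watts) pairs from the poller thread
stop = threading.Event()

def poll_power(period_s=0.01):
    """Sample instantaneous GPU power draw until asked to stop."""
    while not stop.is_set():
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        samples.append((time.perf_counter(), watts))
        time.sleep(period_s)

def energy_between(t0, t1):
    """Trapezoidal integration of sampled power over [t0, t1], in joules."""
    pts = [(t, w) for t, w in samples if t0 <= t <= t1]
    return sum((b[0] - a[0]) * (a[1] + b[1]) / 2 for a, b in zip(pts, pts[1:]))

threading.Thread(target=poll_power, daemon=True).start()
token_times = [time.perf_counter()]   # start time; append one entry per generated token
# ... run the generation loop here, appending time.perf_counter() after each token ...
stop.set()

per_token_joules = [
    energy_between(t0, t1) for t0, t1 in zip(token_times, token_times[1:])
]
```

The same timestamps that mark token boundaries also mark the prefill/decode boundary, so phase-level totals fall out of the per-token attribution directly.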
Key observation #1: decoding usually dominates
Except when inputs vastly exceed outputs, decoding consumes the majority of inference energy. Even modest outputs accumulate cost because decoding is sequential and grows more expensive over time.
Key observation #2: prefill quietly poisons decoding
Larger inputs increase prefill cost—but more importantly, they raise the baseline energy of every subsequent decoding token. Depending on the model, this amplification ranges from 1.3% to 51.8%. The prefill phase doesn’t just cost energy; it sets up future inefficiency.
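A quick worked example shows why that matters over a full generation. The per-token joule value below is hypothetical and chosen only to keep the arithmetic readable; the 51.8% factor is the paper's reported worst case.

```python
# Hypothetical per-token decode energy, used only to show how the reported
# worst-case prefill amplification (+51.8%) compounds across a generation.
base_joules_per_token = 0.5          # decode cost with a short prompt (made up)
amplification = 0.518                # worst-case factor reported in the paper
output_tokens = 200

short_prompt_decode = base_joules_per_token * output_tokens
long_prompt_decode = base_joules_per_token * (1 + amplification) * output_tokens
print(f"{short_prompt_decode:.1f} J vs {long_prompt_decode:.1f} J of decoding alone")
# 100.0 J vs 151.8 J: the longer prompt taxes all 200 output tokens, not just itself
```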
Key observation #3: tokens get more expensive as you go
Across models, later tokens cost more energy than earlier ones—up to 20% more by the end of long generations. Growing attention matrices, cache updates, and memory pressure accumulate silently with each token.
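The mechanism is easy to see in a toy calculation; the lengths below are made up, and only the shape of the growth is the point.

```python
# Toy illustration: each decode step attends over an ever-longer KV cache,
# so per-token memory traffic grows with position. Lengths are made up.
prompt_len = 512
for step in (1, 256, 1024):
    cache_positions_read = prompt_len + step   # KV entries touched per layer
    print(f"token {step}: attends over {cache_positions_read} cached positions")
```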
Findings — When verbosity becomes waste
The most striking result is behavioral rather than architectural.
Three models consistently generate output right up to the maximum token limit, even after producing a correct solution. The authors call this babbling: whitespace, redundant explanations, alternative implementations—text that is thrown away in post-processing but still paid for in energy.
To counter this, they implement babbling suppression: periodically checking whether the generated code already compiles and passes its tests, and stopping generation as soon as that check succeeds. A minimal sketch of the idea follows.
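The authors' exact implementation is not reproduced here; the sketch below assumes a hypothetical `stream_tokens` generator and `run_tests` harness standing in for whatever generation stack and test suite are in use, and the check interval is an arbitrary choice.

```python
# Minimal sketch of babbling suppression: stop decoding as soon as the code
# produced so far compiles and passes its tests. `stream_tokens` and
# `run_tests` are hypothetical stand-ins; CHECK_EVERY is an arbitrary choice.
CHECK_EVERY = 16  # decoded tokens between compile/test checks

def compiles(source: str) -> bool:
    """Cheap syntax check for Python candidates (stand-in for a real compiler)."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def generate_with_suppression(prompt, stream_tokens, run_tests, max_new_tokens=512):
    pieces = []
    for i, token_text in enumerate(stream_tokens(prompt, max_new_tokens), start=1):
        pieces.append(token_text)
        if i % CHECK_EVERY == 0:
            candidate = "".join(pieces)
            if compiles(candidate) and run_tests(candidate):
                break   # a working solution exists; every further token is waste
    return "".join(pieces)
```

The checks run outside the model and are typically cheap relative to the decode steps they avoid.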
Results of babbling suppression
| Model | Token Reduction | Energy Reduction | Accuracy Impact |
|---|---|---|---|
| CodeLlama-7B | 44–80% | up to 62% | negligible |
| DeepSeek-6.7B | 81–93% | up to 89% | negligible |
| Qwen3-4B | 59–79% | up to 76% | negligible |
In short: most of the generated text was never needed.
Implications — Green AI without new hardware
This paper quietly dismantles a common assumption: that greener AI requires new models, new accelerators, or exotic architectures.
Instead, it shows that large gains are available today through:
- Treating prefill and decoding as separate optimization targets
- Controlling prompt length to avoid inflating downstream costs
- Aggressively limiting unnecessary output
- Adding cheap external stopping criteria instead of letting models ramble (a minimal hook is sketched after this list)
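For the last point, one low-effort way to wire such a check into an existing stack is a custom stopping criterion. The sketch below assumes the Hugging Face transformers generation API; `solution_is_complete` is a hypothetical predicate (for example, 'the code parses and passes the visible tests') supplied by the caller.

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopWhenSolved(StoppingCriteria):
    """Halt generation once an external predicate says the output is good enough."""

    def __init__(self, tokenizer, solution_is_complete, check_every=16):
        self.tokenizer = tokenizer
        self.solution_is_complete = solution_is_complete   # hypothetical callable
        self.check_every = check_every
        self.steps = 0

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        self.steps += 1
        if self.steps % self.check_every:
            return False                       # only check every few tokens
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return self.solution_is_complete(text)

# Usage, assuming model, tokenizer, and inputs are already prepared:
# outputs = model.generate(
#     **inputs,
#     max_new_tokens=512,
#     stopping_criteria=StoppingCriteriaList(
#         [StopWhenSolved(tokenizer, solution_is_complete)]
#     ),
# )
```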
For practitioners, this reframes prompt engineering as an energy decision. For system designers, it elevates KV-cache efficiency from a performance concern to a sustainability one.
Conclusion — Efficiency is behavioral, not just architectural
Inference energy is not just about how big a model is. It is about how long it talks after it already knows the answer.
By measuring inference at token-level resolution, this study exposes a simple truth: much of today’s LLM energy consumption is self-inflicted. Cutting it does not require smaller models—just quieter ones.
Cognaptus: Automate the Present, Incubate the Future.