Opening — Why this matters now
Large language models are now a routine part of software development. They autocomplete functions, explain repositories, and quietly sit inside CI pipelines. The productivity gains are real. The energy bill is less visible.
As inference increasingly dominates the lifecycle cost of LLMs, the environmental question is no longer about how models are trained, but how often—and how inefficiently—they are used. This paper asks an unfashionable but necessary question: where exactly does inference energy go? The answer turns out to be uncomfortable.
Background — Inference is not one thing
Most discussions treat inference as a single, uniform operation. In practice, it is split into two structurally different phases:
- Prefill: the model ingests the prompt, builds attention state, and populates the key–value (KV) cache.
- Decoding: tokens are generated autoregressively, one by one, using the cached state.
These phases stress hardware differently. Prefill is compute-heavy and parallel. Decoding is sequential and memory-bound. Ignoring this distinction hides both inefficiencies and optimization opportunities—especially for software engineering workloads where input and output lengths vary dramatically.
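The split is easy to see in code. Below is a minimal sketch of a two-phase generation loop, assuming a Hugging Face causal LM on a CUDA GPU; the model ID, prompt, and greedy decoding are illustrative choices, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only causal LM behaves the same way.
model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

prompt_ids = tok("def quicksort(xs):", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # Prefill: one parallel pass over the whole prompt, populating the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: strictly sequential; each step feeds one token plus the growing cache.
    generated = [next_id]
    for _ in range(64):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        if next_id.item() == tok.eos_token_id:
            break

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

The prefill pass touches every prompt token at once and saturates compute; the decode loop emits one token per step but rereads the entire cache each time, which is exactly why the two phases deserve separate measurement.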
Analysis — Measuring energy where it actually leaks
The authors evaluate ten decoder-only transformer models across two parameter bands (3–4B and 6–7B) using realistic developer tasks:
- Code generation (HumanEval): short prompts, potentially long outputs
- Code understanding (LongBench): long context, short answers
Energy is measured at the GPU level and aligned precisely with token generation timestamps, allowing the authors to attribute joules to individual tokens and inference phases.
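The paper's instrumentation is not reproduced here, but the general recipe can be approximated with NVML power polling aligned to per-token timestamps. In the sketch below, the sampling period, the trapezoidal integration, and the rule of attributing each interval's energy to the token that closes it are assumptions, not the authors' exact method.

```python
import threading, time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []                # (timestamp, watts) pairs from the poller thread
stop = threading.Event()

def poll_power(period_s=0.01):
    """Sample instantaneous GPU power draw until asked to stop."""
    while not stop.is_set():
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        samples.append((time.perf_counter(), watts))
        time.sleep(period_s)

def energy_between(t0, t1):
    """Trapezoidal integration of sampled power over [t0, t1], in joules."""
    pts = [(t, w) for t, w in samples if t0 <= t <= t1]
    return sum((b[0] - a[0]) * (a[1] + b[1]) / 2 for a, b in zip(pts, pts[1:]))

threading.Thread(target=poll_power, daemon=True).start()
token_times = [time.perf_counter()]   # start time; append one entry per generated token
# ... run the generation loop here, appending time.perf_counter() after each token ...
stop.set()

per_token_joules = [
    energy_between(t0, t1) for t0, t1 in zip(token_times, token_times[1:])
]
```

The same timestamps that mark token boundaries also mark the prefill/decode boundary, so phase-level totals fall out of the per-token attribution directly.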
Key observation #1: decoding usually dominates
Except when inputs vastly exceed outputs, decoding consumes the majority of inference energy. Even modest outputs accumulate cost because decoding is sequential and grows more expensive over time.
Key observation #2: prefill quietly poisons decoding
Larger inputs increase prefill cost—but more importantly, they raise the baseline energy of every subsequent decoding token. Depending on the model, this amplification ranges from 1.3% to 51.8%. The prefill phase doesn’t just cost energy; it sets up future inefficiency.
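A quick worked example shows why that matters over a full generation. The per-token joule value below is hypothetical and chosen only to keep the arithmetic readable; the 51.8% factor is the paper's reported worst case.

```python
# Hypothetical per-token decode energy, used only to show how the reported
# worst-case prefill amplification (+51.8%) compounds across a generation.
base_joules_per_token = 0.5          # decode cost with a short prompt (made up)
amplification = 0.518                # worst-case factor reported in the paper
output_tokens = 200

short_prompt_decode = base_joules_per_token * output_tokens
long_prompt_decode = base_joules_per_token * (1 + amplification) * output_tokens
print(f"{short_prompt_decode:.1f} J vs {long_prompt_decode:.1f} J of decoding alone")
# 100.0 J vs 151.8 J: the longer prompt taxes all 200 output tokens, not just itself
```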
Key observation #3: tokens get more expensive as you go
Across models, later tokens cost more energy than earlier ones—up to 20% more by the end of long generations. Growing attention matrices, cache updates, and memory pressure accumulate silently with each token.
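The mechanism is easy to see in a toy calculation; the lengths below are made up, and only the shape of the growth is the point.

```python
# Toy illustration: each decode step attends over an ever-longer KV cache,
# so per-token memory traffic grows with position. Lengths are made up.
prompt_len = 512
for step in (1, 256, 1024):
    cache_positions_read = prompt_len + step   # KV entries touched per layer
    print(f"token {step}: attends over {cache_positions_read} cached positions")
```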
Findings — When verbosity becomes waste
The most striking result is behavioral rather than architectural.
Three models consistently generate output right up to the maximum token limit, even after producing a correct solution. The authors call this babbling: whitespace, redundant explanations, alternative implementations—text that is thrown away in post-processing but still paid for in energy.
To counter this, they implement babbling suppression: periodically checking whether the generated code already compiles and passes its tests, and stopping generation as soon as that check succeeds. A minimal sketch of the idea follows.
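The authors' exact implementation is not reproduced here; the sketch below assumes a hypothetical `stream_tokens` generator and `run_tests` harness standing in for whatever generation stack and test suite are in use, and the check interval is an arbitrary choice.

```python
# Minimal sketch of babbling suppression: stop decoding as soon as the code
# produced so far compiles and passes its tests. `stream_tokens` and
# `run_tests` are hypothetical stand-ins; CHECK_EVERY is an arbitrary choice.
CHECK_EVERY = 16  # decoded tokens between compile/test checks

def compiles(source: str) -> bool:
    """Cheap syntax check for Python candidates (stand-in for a real compiler)."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def generate_with_suppression(prompt, stream_tokens, run_tests, max_new_tokens=512):
    pieces = []
    for i, token_text in enumerate(stream_tokens(prompt, max_new_tokens), start=1):
        pieces.append(token_text)
        if i % CHECK_EVERY == 0:
            candidate = "".join(pieces)
            if compiles(candidate) and run_tests(candidate):
                break   # a working solution exists; every further token is waste
    return "".join(pieces)
```

The checks run outside the model and are typically cheap relative to the decode steps they avoid.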
Results of babbling suppression
| Model | Token Reduction | Energy Reduction | Accuracy Impact |
|---|---|---|---|
| CodeLlama-7B | 44–80% | up to 62% | negligible |
| DeepSeek-6.7B | 81–93% | up to 89% | negligible |
| Qwen3-4B | 59–79% | up to 76% | negligible |
In short: most of the generated text was never needed.
Implications — Green AI without new hardware
This paper quietly dismantles a common assumption: that greener AI requires new models, new accelerators, or exotic architectures.
Instead, it shows that large gains are available today through:
- Treating prefill and decoding as separate optimization targets
- Controlling prompt length to avoid inflating downstream costs
- Aggressively limiting unnecessary output
- Adding cheap external stopping criteria instead of letting models ramble (a minimal hook is sketched after this list)
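For the last point, one low-effort way to wire such a check into an existing stack is a custom stopping criterion. The sketch below assumes the Hugging Face transformers generation API; `solution_is_complete` is a hypothetical predicate (for example, 'the code parses and passes the visible tests') supplied by the caller.

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopWhenSolved(StoppingCriteria):
    """Halt generation once an external predicate says the output is good enough."""

    def __init__(self, tokenizer, solution_is_complete, check_every=16):
        self.tokenizer = tokenizer
        self.solution_is_complete = solution_is_complete   # hypothetical callable
        self.check_every = check_every
        self.steps = 0

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        self.steps += 1
        if self.steps % self.check_every:
            return False                       # only check every few tokens
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return self.solution_is_complete(text)

# Usage, assuming model, tokenizer, and inputs are already prepared:
# outputs = model.generate(
#     **inputs,
#     max_new_tokens=512,
#     stopping_criteria=StoppingCriteriaList(
#         [StopWhenSolved(tokenizer, solution_is_complete)]
#     ),
# )
```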
For practitioners, this reframes prompt engineering as an energy decision. For system designers, it elevates KV-cache efficiency from a performance concern to a sustainability one.
Conclusion — Efficiency is behavioral, not just architectural
Inference energy is not just about how big a model is. It is about how long it talks after it already knows the answer.
By measuring inference at token-level resolution, this study exposes a simple truth: much of today’s LLM energy consumption is self-inflicted. Cutting it does not require smaller models—just quieter ones.
Cognaptus: Automate the Present, Incubate the Future.