Tokens are small. That is why they are dangerous.

A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline.

The GPU, tragically, is not paid in usefulness. It is paid in computation.

That is the useful discomfort in Lola Solovyeva and Fernando Castor’s paper, Towards Green AI: Decoding the Energy of LLM Inference in Software Development.[1] The paper does not ask the familiar broad question: “Are LLMs energy-intensive?” Everyone already knows the answer is yes, followed by a suspiciously large cloud bill. Instead, it asks a better engineering question: where exactly does inference energy go during software-development tasks?

The answer is not simply “large models use more energy.” Nor is it merely “longer outputs cost more.” Both are true, but incomplete. The paper shows that LLM inference energy is shaped by the interaction between input length, prefill, decoding, key-value cache behavior, model implementation, and unnecessary output generation. In other words, the energy bill is not just a function of how big the model is. It is also a function of how the request enters the model, how the model keeps state, and how long it continues talking after the useful answer has already arrived.

That last part is not a metaphor. Some models literally keep going.

Inference is two operations wearing one invoice

Most business discussions treat “LLM inference” as one event. A prompt goes in. An answer comes out. A cost is recorded. Nice, clean, wrong enough to be expensive.

The paper separates inference into two phases:

| Phase | What happens | Why it matters for energy |
| --- | --- | --- |
| Prefill | The model processes the input prompt and builds internal attention state, including key-value caches. | Longer prompts increase the work needed before generation begins. |
| Decoding | The model generates output tokens one by one, using the stored state. | Sequential token generation accumulates cost, and later tokens can become more expensive. |

This split matters because the two phases stress hardware differently. Prefill is relatively parallel and compute-heavy. Decoding is autoregressive, sequential, and more memory-bound. Treating them as one operation hides the part of the system that should be optimized.

For software work, this distinction is especially important because developer tasks do not have one standard input-output shape. Function-level code generation often uses a short prompt and produces a longer output. Repository understanding does the opposite: the model receives thousands of tokens of context and may only need to answer with one multiple-choice letter. A single “energy per request” number blurs these cases together, which is convenient for dashboards and bad for diagnosis.

The authors therefore evaluate ten decoder-only transformer models across two size groups, 3B–4B and 6B–7B, using five workloads: zero-shot code generation, two-shot code generation, zero-shot chain-of-thought code generation, code understanding, and code understanding with explanation. HumanEval supplies the code-generation tasks; LongBench supplies the long-context code-understanding tasks. GPU energy is sampled and aligned with token generation timestamps, allowing energy to be attributed to individual tokens and phases.

That design is the main contribution. It turns inference from a black-box bill into a sequence of measurable events.
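To make the attribution idea concrete, here is a minimal sketch of aligning sampled GPU power with token timestamps. It is not the authors' tooling: the function names, the linear interpolation between samples, and the made-up trace are all assumptions for illustration. The only thing taken from the paper is the convention of counting the first token's energy as prefill and everything after it as decoding.

```python
from bisect import bisect_left


def power_at(samples, t):
    """Linearly interpolate power (watts) at time t, clamping outside the samples."""
    times = [ts for ts, _ in samples]
    i = bisect_left(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (t0, p0), (t1, p1) = samples[i - 1], samples[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)


def interval_energy(samples, t_start, t_end):
    """Integrate power over [t_start, t_end] with the trapezoid rule (joules)."""
    points = [(t_start, power_at(samples, t_start))]
    points += [(t, p) for t, p in samples if t_start < t < t_end]
    points.append((t_end, power_at(samples, t_end)))
    return sum((t1 - t0) * (p0 + p1) / 2
               for (t0, p0), (t1, p1) in zip(points, points[1:]))


def per_token_energy(samples, request_start, token_times):
    """Attribute to each token the energy drawn since the previous token."""
    edges = [request_start, *token_times]
    return [interval_energy(samples, a, b) for a, b in zip(edges, edges[1:])]


def split_phases(token_energies):
    """First token counts as prefill; all later tokens count as decoding."""
    return token_energies[0], sum(token_energies[1:])


# Made-up trace: (timestamp in seconds, watts), sorted, strictly increasing.
samples = [(0.0, 100.0), (0.5, 100.0), (1.0, 100.0), (1.5, 100.0), (2.0, 100.0)]
energies = per_token_energy(samples, 0.0, [1.0, 1.25, 1.5])
prefill_j, decode_j = split_phases(energies)
print(energies, prefill_j, decode_j)
```

In a real setup the samples would come from whatever power counter the hardware exposes, and the sampling interval would need to be short relative to the time between tokens for the per-token numbers to mean much.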

The first token is not just the first token

The paper follows a common measurement convention: the energy associated with generating the first token is counted as prefill. That is reasonable because producing the first token requires processing the entire prompt and initializing the model’s internal state. After that, decoding begins.

A simple mental model helps:

  1. The prompt enters.
  2. The model builds the cache.
  3. The first token appears.
  4. Every later token is generated using the accumulated state.
  5. The state grows as output grows.

The first visible token is therefore not “cheap text.” It is the moment when the model has already paid for reading and structuring the prompt. The study observes a sharp energy spike at this stage, followed by a drop and then a gradual increase during decoding. The pattern appears across models and workloads, though the magnitude differs.

This is where the paper becomes more interesting than ordinary “shorter prompts save money” advice. Longer inputs do not only raise the cost of prefill. They can also raise the energy cost of decoding tokens that come afterward. Once the model has processed a larger prompt, the key-value cache is larger, and retrieving from it during decoding can become more expensive.
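A rough sense of why a longer prompt leaves a heavier cache behind comes from the standard size formula for a transformer's key-value cache: two vectors (a key and a value) per token, per layer, per KV head. The sketch below applies that formula to a hypothetical 7B-class configuration; the layer count, head count, head dimension, and 16-bit precision are assumptions chosen for illustration, not the settings of any model in the study.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key-value cache held for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 7B-class configuration with 16-bit cache values.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

for prompt_tokens in (150, 4000, 8000):
    mib = kv_cache_bytes(prompt_tokens, **config) / 2**20
    print(f"{prompt_tokens:>5} prompt tokens -> {mib:,.0f} MiB of cache")
```

In this configuration each token adds half a MiB, so every decoding step after an 8,000-token prompt has to read through gigabytes of cached state. Models that share KV heads across query heads carry a smaller cache for the same prompt, which is one plausible reason similar-sized models can react differently to context growth.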

The paper reports that input growth can amplify the initial decoding token cost by anywhere from 1.3% to 51.8%, depending on the model. That range is large enough to be operationally annoying. It means two models with similar parameter counts can respond very differently to the same context-expansion strategy.

So yes, retrieval-augmented generation may improve answer quality. Yes, longer context may reduce hallucination. But context is not free just because it was copied into the prompt by software rather than typed by a human. The model still reads it. Then, in some cases, decoding keeps paying for it.

Decoding usually dominates, except when the prompt dwarfs the answer

The clearest phase-level finding is that decoding dominates energy consumption in four of the five workload scenarios. The exception is the code-understanding workload where the input is very long and the expected output is extremely short: 4,000 to 8,000 input tokens, with a maximum output of 10 tokens. In that case, prefill contributes 67.3% to 84.4% of total energy.

That exception is important because it prevents a lazy conclusion. The lesson is not “always optimize decoding.” The lesson is: match the optimization target to the workload shape.

For short-input, long-output tasks, decoding is the obvious battlefield. For long-context, short-answer tasks, prefill becomes much more visible. For long-context tasks that also ask the model to explain itself, both phases matter: the model must first digest the repository-like context, then spend sequential decoding energy producing the explanation.
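The dependence on workload shape can be sketched with a deliberately crude linear model: every input token costs one unit in prefill, and every output token costs some multiple of that in decoding. The ratio, the token counts, and the function names below are placeholders, not figures from the paper, and the model ignores both fixed first-token overhead and the cache-growth effect described earlier. It only shows how the dominant phase flips with shape.

```python
def prefill_share(n_input, n_output, decode_to_prefill_ratio):
    """Share of total energy spent in prefill under a linear per-token model.

    decode_to_prefill_ratio: how many times more energy one decoded token
    costs than one prefilled token (hypothetical; measure it per model).
    """
    prefill = float(n_input)
    decode = n_output * decode_to_prefill_ratio
    return prefill / (prefill + decode)


def prefill_dominates(n_input, n_output, decode_to_prefill_ratio):
    """Prefill is the larger phase when n_input exceeds n_output * ratio."""
    return n_input > n_output * decode_to_prefill_ratio


RATIO = 200.0  # placeholder only, not a figure from the paper
for name, n_in, n_out in [
    ("short prompt, long output", 150, 300),
    ("long context, short answer", 6000, 10),
    ("long context, explanation", 6000, 300),
]:
    share = prefill_share(n_in, n_out, RATIO)
    print(f"{name}: prefill share = {share:.1%}, "
          f"dominates = {prefill_dominates(n_in, n_out, RATIO)}")
```

The useful part is the break-even condition rather than the printed percentages: once the ratio is measured for a given model and serving stack, it tells you which phase to optimize for a given request shape.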

The authors compare zero-shot and two-shot code generation to isolate a modest input-length increase while keeping the output limit fixed. Moving from zero-shot to two-shot increases prefill’s contribution by 0.4 to 2.5 percentage points. That is not dramatic, but it is consistent. A larger jump appears when comparing zero-shot generation with code-understanding-plus-explanation, where input length rises into the thousands of tokens while the output limit remains 300 tokens. There, the increase in prefill contribution ranges from 4.7 to 21.6 percentage points.

The practical reading is not “few-shot prompting is bad” or “long context is bad.” That would be the sort of conclusion one writes after reading only the abstract and having a strong coffee. The better reading is that prompt length should have a job. If examples, retrieved chunks, logs, or files are not improving task success enough to justify their cost, they are not context. They are ballast.

Later tokens can be more expensive than earlier tokens

Output length matters in the obvious way: more generated tokens require more energy. The paper also shows a less obvious pattern: each successive token can become more expensive than the previous one.

For 300-token code-understanding explanations, several models show rising per-token energy during generation. For the 2,000-token zero-shot chain-of-thought setting, the trend becomes more pronounced. CodeLlama-7B and Deepseek-Coder-6.7B show roughly a 20% increase in the energy cost of the last token compared with the first token in the decoding phase. Other models also show increases, though smaller: NextCoder-7B around 11%, Qwen3-4B around 7%, and Phi4-4B around 5%.

This makes long rambling outputs worse than they look. The 1,900th token is not merely another unit of text. It may be a more expensive unit of text than the 100th token. Transformer attention, key-value cache updates, memory management, cache misses, and runtime behavior all become part of the accumulating cost.
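To put a number on "more expensive", here is a small worked calculation. It assumes the per-token cost rises linearly from the first decoded token to the last, which is a simplification: the paper reports the endpoints, not the shape of the curve. Under that assumption the total is n × e₀ × (1 + g/2), where g is the relative increase of the last token over the first.

```python
def decode_energy(n_tokens, first_token_joules, growth):
    """Total decoding energy when per-token cost rises linearly.

    growth: relative increase of the last token over the first (0.20 = 20%).
    """
    if n_tokens == 1:
        return first_token_joules
    return sum(first_token_joules * (1 + growth * i / (n_tokens - 1))
               for i in range(n_tokens))


# Unit cost for the first token; only the ratio between the two totals matters.
flat = decode_energy(2000, 1.0, 0.0)
rising = decode_energy(2000, 1.0, 0.20)
print(f"extra energy vs flat per-token pricing: {rising / flat - 1:.1%}")
```

With a 20% rise over 2,000 tokens, the whole generation costs about 10% more than a flat per-token estimate would predict, and the excess is concentrated in the tail, which is exactly where unneeded output tends to live.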

The paper does not claim one universal cause for every model. That restraint is useful. The pattern varies across model families and implementations, suggesting that low-level design choices—attention mechanisms, memory layout, kernel behavior, cache handling—shape energy behavior even among models with comparable parameter counts.

For business users, the point is simpler: output length is not just a UX issue. It is an infrastructure issue. When a system asks a model to “explain step by step” by default, produce long justifications for simple actions, or emit verbose intermediate reasoning that no human reads, it is not being transparent. It is being expensive in a very wordy way.

The evidence is not one number; it is a map of failure modes

The paper’s results are best read as a diagnostic map rather than a league table. Model rankings are less important than the types of inefficiency revealed.

| Paper finding | Evidence role | Business interpretation | Boundary |
| --- | --- | --- | --- |
| Prefill and decoding have different energy patterns. | Main measurement contribution. | Track inference cost by phase, not just by request. | Measured on local GPU inference, not proprietary hosted APIs. |
| Longer input raises prefill cost and can raise decoding token cost. | Main mechanism evidence. | Retrieval and prompt expansion need ROI discipline. | The magnitude is model-dependent. |
| Later decoding tokens may cost more than earlier ones. | Main token-level evidence. | Long outputs can be disproportionately costly. | Exact cause differs by implementation. |
| Three models generate unnecessary output after completing the task. | Behavioral inefficiency finding. | Post-processing does not erase the energy already spent. | Observed in specific code-generation settings. |
| Test-based babbling suppression reduces wasted generation. | Intervention / exploratory extension. | Use external stopping criteria when correctness can be cheaply verified. | Applies most naturally where tests or validators exist. |

This table is also a useful guardrail against over-reading the paper. The study does not prove that one model family is universally greener than another. It does not price ChatGPT, Copilot, or any proprietary hosted service. It does not measure the total lifecycle emissions of a full AI-assisted development platform. It measures GPU inference energy at phase and token level for selected open models on defined coding workloads.

That is already enough. Not every useful paper needs to solve civilization. Sometimes it just needs to show where the waste bucket is.

Babbling is not style; it is discarded computation

The most memorable result is the identification of “babbling” models. In the study, CodeLlama-7B, Deepseek-Coder-6.7B, and Qwen3-4B often generated outputs close to the maximum token limit. After producing the target function, they continued with extra material: whitespace, test cases, examples, alternative implementations, or other content that was later discarded during post-processing.

This is the operational absurdity: the system pays the energy cost to generate text that the pipeline already plans to delete.

In normal product language, that might be described as “verbose output.” The paper’s framing is more precise. Babbling is not harmless verbosity. It is unnecessary sequential decoding, often occurring in the part of generation where tokens are becoming more expensive.

The authors then implement a babbling-suppression method for code generation. The logic is straightforward:

  1. Generate code token by token.
  2. When an end-of-line token appears, check whether the partial code compiles.
  3. If it compiles, run the benchmark tests.
  4. If the tests pass, stop generation immediately.

This is not an internal model modification. It is an external stopping mechanism. The model is not made more elegant; it is simply interrupted once the job is done. A little rude, perhaps, but GPUs are not sentimental.
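The four steps can be sketched in a few lines. This is an illustration of the idea, not the authors' implementation: the token stream is a hard-coded list standing in for a model, the test format (an expression plus its expected value) is invented for the example, and the function names are hypothetical. It also runs generated code with `exec`, which a real system should only do inside a sandbox.

```python
def passes_tests(source, tests):
    """Return True if `source` compiles and every (expression, expected) test holds."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:  # also covers IndentationError on half-written code
        return False
    namespace = {}
    try:
        exec(code, namespace)  # unsafe outside a sandbox; illustration only
        return all(eval(expr, namespace) == expected for expr, expected in tests)
    except Exception:
        return False


def generate_with_suppression(token_stream, tests):
    """Consume tokens until the partial code passes the tests, then stop.

    Returns the accepted text and the number of tokens consumed.
    """
    text, n_tokens = "", 0
    for token in token_stream:
        text += token
        n_tokens += 1
        # Check only at end-of-line tokens, as in the steps above.
        if token.endswith("\n") and passes_tests(text, tests):
            break
    return text, n_tokens


# Stand-in for a babbling model: a correct function followed by extras.
fake_stream = [
    "def add(a, b):\n",
    "    return a + b\n",
    "\n",
    "# Example usage\n",
    "print(add(1, 2))\n",
    "def add_alt(a, b):\n",
    "    return sum([a, b])\n",
]
tests = [("add(1, 2)", 3), ("add(-1, 1)", 0)]

code, used = generate_with_suppression(iter(fake_stream), tests)
print(f"stopped after {used} of {len(fake_stream)} tokens")
```

With a real model the loop would wrap the decoding step itself, so that breaking out of it actually stops the GPU work rather than just ignoring tokens that were already produced.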

The results are strong but should be read carefully. With a 300-token limit, babbling suppression reduces output length by at least 44%. Deepseek-Coder-6.7B sees an 81% output-token reduction and a 69% total-energy reduction. With a 1,000-token limit, reductions become larger: Deepseek-Coder-6.7B reaches a 93% output-token reduction and an 89% total-energy reduction. Qwen3-4B also sees large energy reductions, up to 76% in the 1,000-token case.

There is one important exception. For CodeLlama-7B under the 300-token limit, the checking overhead causes total inference energy to increase by 6%, even though output length falls. That exception is not a flaw to hide. It is exactly the kind of detail that prevents a useful technique from becoming a slogan.

The conclusion is not “always run tests after every line.” The authors themselves note that the overhead can be reduced by checking less frequently than their current per-line approach. The better conclusion is that stopping criteria are part of inference design. In domains where correctness can be cheaply validated—unit tests, schema validation, SQL parsing, type checks, JSON validation, static analysis—the system should not wait politely while the model continues composing its after-dinner speech.
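The overhead trade-off is simple arithmetic, and a toy model makes it visible. Everything below is hypothetical: the per-token and per-check energy figures, the tokens-per-line constant, and the assumption that a check costs the same every time. The point is only that the sign of the result depends on how expensive a check is relative to the tokens it saves, and that checking every few lines trades a slightly later stop for fewer checks.

```python
import math


def net_joules_saved(lines_until_pass, total_lines, tokens_per_line,
                     joules_per_token, joules_per_check, check_every=1):
    """Estimate energy saved by early stopping, net of checking overhead.

    Checks run every `check_every` lines, so generation stops at the first
    checked line at or after the line where the tests would start passing.
    A negative result means the checks cost more than they saved.
    """
    stop_line = math.ceil(lines_until_pass / check_every) * check_every
    stop_line = min(stop_line, total_lines)
    checks_run = math.ceil(stop_line / check_every)
    tokens_saved = (total_lines - stop_line) * tokens_per_line
    return tokens_saved * joules_per_token - checks_run * joules_per_check


# Hypothetical workload: the function is done at line 4 of a 30-line output.
shape = dict(lines_until_pass=4, total_lines=30, tokens_per_line=10,
             joules_per_token=1.0)

print(net_joules_saved(**shape, joules_per_check=50.0, check_every=1))
print(net_joules_saved(**shape, joules_per_check=50.0, check_every=2))
print(net_joules_saved(**shape, joules_per_check=100.0, check_every=1))
```

The third call comes out negative: with expensive checks and a modest amount of babbling to cut, per-line checking loses, which is the shape of the exception reported for the 300-token limit.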

What this means for AI coding products and internal software agents

The business relevance is not that companies should suddenly become environmental saints. That would be nice, but procurement departments generally prefer numbers with invoices attached.

The practical lesson is that energy efficiency, latency, and cost control often point in the same direction. A model that generates fewer useless tokens saves energy, reduces latency, lowers serving cost, and creates less post-processing clutter. There is no tragic trade-off there. It is just cleaner system design.

For AI coding tools, CI assistants, and internal software agents, the paper suggests four controls.

First, measure model efficiency by workload shape, not only by average accuracy. A model that is efficient on short prompts may behave poorly with repository-length context. Another may tolerate long context better because its decoding cost is less sensitive to prefill expansion. “7B model” is not a cost model. It is a parameter count wearing a name tag.

Second, treat context as an investment. Retrieval pipelines should not dump every possibly relevant chunk into the prompt. They should justify context with expected accuracy gain, failure-risk reduction, or human-time savings. The paper’s mechanism gives a concrete reason: extra context can raise both prefill cost and downstream decoding cost.
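One way to give context a budget is to make retrieval chunks compete for a fixed number of prompt tokens. The sketch below is an illustration, not something the paper proposes: the relevance scores are assumed to come from whatever retriever is already in place, the greedy score-per-token ranking is one heuristic among many, and the chunk names and numbers are invented.

```python
def select_context(chunks, token_budget, min_score=0.0):
    """Greedily pick chunks by relevance per token until the budget is spent.

    chunks: list of (chunk_id, n_tokens, relevance_score), with n_tokens > 0.
    Returns the chosen chunk ids and the total tokens they use.
    """
    eligible = [c for c in chunks if c[2] >= min_score]
    ranked = sorted(eligible, key=lambda c: c[2] / c[1], reverse=True)
    chosen, used = [], 0
    for chunk_id, n_tokens, _score in ranked:
        if used + n_tokens <= token_budget:
            chosen.append(chunk_id)
            used += n_tokens
    return chosen, used


# Invented candidates for a failing-test question.
candidates = [
    ("failing_test.py", 400, 0.9),
    ("utils.py", 1200, 0.6),
    ("README.md", 3000, 0.3),
    ("changelog.md", 2500, 0.05),
]
chosen, used = select_context(candidates, token_budget=2000, min_score=0.1)
print(chosen, used)
```

The budget itself should be set from measurement: if adding the next thousand tokens of context does not move task success, the prefill and decoding energy it costs is pure overhead.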

Third, control output contracts. Ask for the artifact needed, not the ceremony around it. If the system needs a function, request the function. If it needs JSON, enforce JSON. If it needs a one-letter answer, do not ask for a philosophical essay and then slice out the first character. This is not merely a style preference. It is energy governance for machines that talk too much when unsupervised. Relatable, unfortunately.

Fourth, add external stopping logic when the task allows it. Code is unusually convenient because tests can verify correctness. But the broader principle travels: stop generation when a validator confirms that the artifact is complete enough for the next stage. In business automation, validators may include JSON schema checks, database constraints, regular-expression gates, unit tests, deterministic calculators, or human-approved templates.
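The same pattern works outside code generation. As a minimal sketch, assume the next stage only needs a JSON object with certain keys: the loop below stops consuming output the moment the accumulated text parses and contains them. The chunk list stands in for a streaming model response, and the function name and required keys are hypothetical.

```python
import json


def stream_until_valid_json(chunks, required_keys=()):
    """Stop consuming chunks once the text is a JSON object with the required keys.

    Returns the parsed object (or None if it never validated) and the
    number of chunks consumed.
    """
    text, n_chunks = "", 0
    for chunk in chunks:
        text += chunk
        n_chunks += 1
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue  # still incomplete, keep reading
        if isinstance(value, dict) and all(k in value for k in required_keys):
            return value, n_chunks
    return None, n_chunks


# Stand-in for a model that answers, then keeps talking.
fake_chunks = [
    '{"status": ',
    '"ok", ',
    '"count": 3}',
    "\n\nHere is an explanation of",
    " the JSON above...",
]
result, used = stream_until_valid_json(iter(fake_chunks), ("status", "count"))
print(result, f"after {used} of {len(fake_chunks)} chunks")
```

Restricting acceptance to objects avoids stopping early on a fragment that happens to parse as a bare number or string. As with the code example, the saving is real only if returning from the loop also cancels the underlying generation.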

The common thread is simple: do not let the language model decide alone when the business process is done. Language models are trained to continue plausible text. Business processes need completion conditions.

What the paper directly shows, what we infer, and what remains uncertain

The paper directly shows phase-level GPU energy patterns for ten open, decoder-only transformer models in the 3B–7B range, running on an NVIDIA A10 GPU, across code-generation and code-understanding tasks. It directly shows that decoding usually dominates energy, except when very long inputs are paired with very short outputs. It directly shows that longer inputs can raise both prefill cost and decoding-token cost. It directly shows babbling behavior in three models and demonstrates that test-based early stopping can sharply reduce wasted generation in most tested cases.

Cognaptus’ business inference is that production AI systems should treat prompt length, context retrieval, output length, and stopping criteria as operational levers. These are not cosmetic prompt-engineering details. They affect energy, latency, throughput, and serving cost.

What remains uncertain is the exact transferability to hosted proprietary models, larger frontier-scale systems, mixture-of-experts models, multi-GPU inference, production batching, or serving stacks such as highly optimized inference engines. The study uses local inference on a specific GPU setup. It also evaluates each model-workload trial once, though each workload contains many benchmark entries. The authors address validity threats with timestamp alignment, GPU-use checks, outlier removal, and standard deviations, but the usual measurement caution still applies.

Those boundaries do not weaken the article’s core message. They define where the message should be used: as engineering guidance for inference diagnosis, not as a universal carbon calculator.

The new efficiency frontier is not only smaller models

A convenient myth says that efficient AI mainly means smaller models, better chips, or cleaner datacenters. All three matter. None of them excuses sloppy inference design.

This paper shows a less glamorous frontier: phase-aware measurement, disciplined context, shorter useful outputs, and stopping generation when the job is already complete. The result is not a grand theory of sustainable AI. It is better: a practical map of where software-development inference wastes energy.

The uncomfortable part is that some waste is not hidden inside obscure hardware. It is visible in the generated text. The extra examples. The redundant tests. The explanation nobody asked for. The whitespace that survives long enough to cost electricity and then dies in post-processing.

The GPU does not know what the business wanted. It only knows the next token.

That means the responsibility sits with system designers. If an AI coding assistant burns energy producing text that a parser immediately discards, that is not an inevitable cost of intelligence. It is a product decision, politely disguised as model behavior.

Quieter models would help. Smarter stopping logic would help sooner.

Cognaptus: Automate the Present, Incubate the Future.


  1. Lola Solovyeva and Fernando Castor, “Towards Green AI: Decoding the Energy of LLM Inference in Software Development,” arXiv:2602.05712, 2026. https://arxiv.org/abs/2602.05712