Latency has a habit of hiding inside words that sound efficient.

“Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired.

The paper “KVBuffer: IO-aware Serving for Linear Attention” makes a narrower and more useful point: linear attention may avoid growing KV caches, but existing serving systems still touch a large recurrent state at every decoding step, and that state can dominate memory traffic.[1] In other words, linear attention does not abolish the cost problem. It changes the shape of the bill.

That is why this paper is best read as a serving-systems paper, not a model-architecture announcement. KVBuffer does not claim a new foundation model. It does not claim that linear attention has solved long-context reasoning. It asks a more operational question: if hybrid linear-attention models are going to be served at scale, why are we still treating the recurrent state as something that must be read and written on every token?

The answer is not glamorous. It is buffering.

And, inconveniently for people who prefer architecture diagrams over memory diagrams, buffering is exactly where the business value lives.

The misleading comfort of constant-time decoding

The reader’s likely misconception is reasonable: if linear attention has constant decoding cost with respect to context length, recurrent decoding should be the natural serving method. It keeps a fixed-size state, avoids storing all historical keys and values, and seems to deliver the long-context efficiency story in its purest form.

That belief is half right and therefore dangerous.

In recurrent linear attention, the model maintains a state matrix $S_t$. A simplified version looks like this:

$$ S_t = S_{t-1} + k_t^T v_t $$

$$ o_t = q_t S_t $$

This is attractive because the serving system does not need to scan the whole context. The old information is compressed into a fixed-size state. The context length can grow; the state size does not.
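To make the two equations concrete, here is a minimal pure-Python sketch of recurrent decoding. It is an illustration under simplifying assumptions (one head, no gating or normalization, tiny dimensions), and the function names `new_state` and `recurrent_step` are invented for this example rather than taken from any serving system.

```python
# Minimal sketch of recurrent linear-attention decoding (single head, no gating).
# Vectors are lists of floats; the state S is a d x d list of lists.

def new_state(d):
    """Return a d x d zero state."""
    return [[0.0] * d for _ in range(d)]

def recurrent_step(S, q, k, v):
    """Apply S_t = S_{t-1} + k_t^T v_t in place, then return o_t = q_t S_t."""
    d = len(k)
    for i in range(d):
        for j in range(d):
            S[i][j] += k[i] * v[j]      # the rank-1 update touches every state entry
    return [sum(q[i] * S[i][j] for i in range(d)) for j in range(d)]

S = new_state(2)
o1 = recurrent_step(S, q=[1.0, 0.0], k=[1.0, 2.0], v=[3.0, 4.0])
o2 = recurrent_step(S, q=[0.0, 1.0], k=[0.5, 0.5], v=[2.0, 0.0])
print(o1, o2)   # [3.0, 4.0] [7.0, 8.0]
```

Note that each step reads and writes all d x d entries of `S`, regardless of how many tokens came before. That is the constant cost, and also the cost.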

But fixed-size does not mean small. In Qwen3-Next’s Gated DeltaNet layer, the paper reports a linear-attention state of about 2 MB. The authors emphasize that this state is often two orders of magnitude larger than a single token’s key and value. Existing vLLM-style and SGLang-style serving paths for linear attention typically read and write that full state at every decoding step.
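A back-of-the-envelope calculation shows how a figure of this size can arise. Every shape and dtype below is an assumption chosen for illustration (head counts, fp32 state, 16-bit keys and values); only the head dimension of 128 is taken from the article's later discussion of Qwen3-Next, and none of these constants should be read as the paper's own accounting.

```python
# Rough per-layer size of a linear-attention state versus one token's key and value.
# ALL constants are illustrative assumptions, not figures taken from the paper.
HEAD_DIM = 128          # d, the head dimension the article quotes for Qwen3-Next
NUM_VALUE_HEADS = 32    # assumed
NUM_KEY_HEADS = 16      # assumed (keys shared across groups of value heads)
STATE_BYTES = 4         # assumed fp32 state entries
KV_BYTES = 2            # assumed 16-bit key and value entries

# One d x d state per value head.
state_bytes = NUM_VALUE_HEADS * HEAD_DIM * HEAD_DIM * STATE_BYTES
# One d-vector per key head and per value head for a single token.
kv_bytes_per_token = (NUM_KEY_HEADS + NUM_VALUE_HEADS) * HEAD_DIM * KV_BYTES
ratio = state_bytes / kv_bytes_per_token

print(f"state: {state_bytes / 2**20:.1f} MiB")
print(f"per-token KV: {kv_bytes_per_token / 2**10:.1f} KiB")
print(f"ratio: about {ratio:.0f}x")
```

Under these assumed shapes the state comes to 2 MiB and the ratio lands in the low hundreds, which is at least consistent with the "two orders of magnitude" framing.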

So the old story was:

Softmax attention is expensive because context grows.

The paper’s correction is:

Linear attention can still be expensive because the state is large and touched too often.

This is a serving-level shift. The bottleneck moves from “how much history do we store?” to “how often do we move the state through memory?” It is less cinematic than quadratic attention, but GPUs are rarely moved by cinema.

KVBuffer changes the schedule, not the model

KVBuffer’s core idea is simple: keep a buffer of recent keys and values, use that buffer to compute outputs for recent tokens, and delay updates to the large linear-attention state until the buffer is full.

Instead of doing this every token:

read large state -> update large state -> write large state -> produce output

KVBuffer does this across a small cycle:

read large state + read small buffered KVs -> produce output
append new KV to buffer
...
when buffer is full: update large state once in batch -> write large state

The mechanism matters because the paper’s contribution is not “store more cache.” That would be the least inspiring possible interpretation. The contribution is that a small KV buffer lets the serving system switch among three computation forms of linear attention:

| Computation form | What is stored or used | When it helps | Operational meaning |
| --- | --- | --- | --- |
| Recurrent decoding | A fixed linear-attention state | Long-context decoding when state updates are tolerable | Simple constant-state serving, but heavy state IO every token |
| Chunkwise decoding | State plus a buffer of recent KVs | Normal decoding where batched state updates amortize memory writes | Lower average memory access by delaying state updates |
| KV-only decoding | KVs only, no state yet | Short contexts where all KVs are smaller than the state | Avoid paying for a large state before it is useful |

The mechanism deserves more attention than the headline speedup. The interesting part is why the speedup exists, when it disappears, and what kind of serving infrastructure can exploit it.

Chunkwise decoding turns state updates into a batch expense

The main design path is chunkwise decoding.

In recurrent decoding, every new token triggers a state read, a state update, and a state write. In chunkwise decoding, KVBuffer allows the system to compute attention outputs using both the older state and the recent buffered tokens. Only after $m$ tokens does the system flush the buffer and update the state in one batched GPU operation.
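The schedule described above can be sketched in a few lines of Python. This is a toy, single-head version of plain (ungated) linear attention, not the paper's Triton kernel; the class name `ChunkwiseDecoder` and its fields are hypothetical. The point is only that deferring the state update leaves the outputs unchanged while cutting the number of state writes.

```python
# Sketch: chunkwise decoding with a small KV buffer, checked against recurrent decoding.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_state(S, q):
    """o = q S for a d x d state S."""
    d = len(q)
    return [sum(q[i] * S[i][j] for i in range(d)) for j in range(d)]

def add_outer(S, k, v):
    """S += k^T v, in place."""
    for i in range(len(k)):
        for j in range(len(v)):
            S[i][j] += k[i] * v[j]

class ChunkwiseDecoder:
    def __init__(self, d, m):
        self.S = [[0.0] * d for _ in range(d)]  # large state, written once per m tokens
        self.buf = []                           # small buffer of recent (k, v) pairs
        self.m = m
        self.state_writes = 0

    def step(self, q, k, v):
        self.buf.append((k, v))
        # Output = stale state contribution + contribution of the buffered tokens.
        o = read_state(self.S, q)
        for kb, vb in self.buf:
            w = dot(q, kb)
            o = [oj + w * vj for oj, vj in zip(o, vb)]
        if len(self.buf) == self.m:             # buffer full: one batched state update
            for kb, vb in self.buf:
                add_outer(self.S, kb, vb)
            self.buf.clear()
            self.state_writes += 1
        return o

def recurrent_outputs(tokens, d):
    """Reference: update the state on every token."""
    S = [[0.0] * d for _ in range(d)]
    outs = []
    for q, k, v in tokens:
        add_outer(S, k, v)
        outs.append(read_state(S, q))
    return outs

# Integer-valued floats keep the comparison exact.
tokens = [([float(t % 3), 1.0], [1.0, float(t)], [float(t + 1), -1.0]) for t in range(6)]
dec = ChunkwiseDecoder(d=2, m=4)
chunk_outs = [dec.step(q, k, v) for q, k, v in tokens]
print(chunk_outs == recurrent_outputs(tokens, 2), dec.state_writes)
```

With six tokens and a buffer of four, the recurrent reference writes the state six times while the buffered version writes it once, with two tokens still waiting in the buffer.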

The paper’s memory-access analysis gives an approximate speedup for chunkwise decoding:

$$ \text{Speedup}_{\text{chunkwise}}(m) \approx \frac{4(d+1)}{2d + 4d/m + m + 7} $$

Here, $d$ is the head dimension and $m$ is the chunk size (equivalently, the buffer size). Under the simplified linear-attention analysis, the speedup is maximized at:

$$ m = 2\sqrt{d} $$
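The optimum follows from one derivative, because only the denominator of the speedup depends on $m$. A short derivation, treating $m$ as continuous (an assumption made here for convenience, since real buffer sizes are integers):

$$ f(m) = 2d + \frac{4d}{m} + m + 7, \qquad f'(m) = -\frac{4d}{m^2} + 1 = 0 \;\Rightarrow\; m^2 = 4d \;\Rightarrow\; m = 2\sqrt{d} $$

$$ f''(m) = \frac{8d}{m^3} > 0 \quad \text{for } m > 0 $$

The second derivative is positive, so this stationary point minimizes the denominator $f(m)$ and therefore maximizes the speedup.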

This is a nice formula, but the paper does not stop at algebra. In the actual Qwen3-Next-80B-A3B-Instruct evaluation, the head dimension is $d=128$, so $2\sqrt{d} \approx 22.63$. The best observed buffer size is 32, partly because Qwen3-Next uses grouped-query attention in its linear-attention layers, reducing memory access in the chunkwise form.

That difference matters. The formula is a guide to the IO trade-off, not a sacred number written on a GPU-shaped tablet.

The empirical result is clear: KVBuffer reduces linear-attention decoding latency by up to 45.17% with a buffer size of 32. This is the paper’s main evidence for ordinary decoding. Figure 3 is not a decorative benchmark; it tests whether the memory-access argument actually shows up in kernel latency. It does. Latency first falls as the buffer grows, because state updates are amortized. Then it rises when the buffer becomes too large, because reading buffered KVs becomes its own cost.

That U-shape is the whole paper in miniature.

Small buffer: too many state updates.

Large buffer: too much KV reading.

Middle buffer: the serving system stops being silly.
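The same U-shape is visible in the simplified analytic model, even before any kernel is measured. The sketch below just evaluates the approximate speedup formula quoted earlier for $d=128$; it models memory access only, so it should not be expected to reproduce the measured latencies. The function name `chunkwise_speedup` is made up for this example.

```python
import math

def chunkwise_speedup(d, m):
    """Approximate speedup from the article: 4(d+1) / (2d + 4d/m + m + 7)."""
    return 4 * (d + 1) / (2 * d + 4 * d / m + m + 7)

d = 128
for m in (1, 4, 16, 32, 128, 512):
    print(f"m={m:>3}  modeled speedup ~ {chunkwise_speedup(d, m):.2f}")

m_star = 2 * math.sqrt(d)            # continuous optimum, about 22.63 for d = 128
best = chunkwise_speedup(d, m_star)
print(f"m*={m_star:.2f}  modeled speedup ~ {best:.2f}")
```

In this model, $m=16$ and $m=32$ give exactly the same denominator (311), so the curve is fairly flat around its peak. That suggests the neighbourhood of the optimum is forgiving; it does not by itself explain why 32 was the best measured value.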

There is also an important operational wrinkle. At batch size 1, chunkwise decoding can be slower than recurrent decoding because it launches an additional state-update kernel. The authors note that CUDA Graphs could mitigate this overhead. This is not a minor footnote for production systems. It tells us that KVBuffer’s value is not just an algorithmic property; it depends on runtime overhead, batching, and kernel engineering. The paper is honest enough to make the inconvenience visible.

Speculative decoding exposes the real cost of temporary states

The second contribution is more dramatic because speculative decoding multiplies the problem.

Speculative decoding verifies multiple draft tokens in parallel. In a softmax-attention serving world, the cost discussion usually revolves around candidate tokens and KV cache handling. In recurrent linear-attention serving, however, a naive implementation needs temporary linear-attention states for draft tokens until the accepted tokens are known.

That is awkward because the state is large. The paper gives a concrete example: verifying 4 draft tokens in Qwen3-Next occupies an additional 384 MB of memory per request. This is not “some overhead.” This is the kind of overhead that quietly turns a serving cluster into an expensive waiting room.

KVBuffer changes the branch-management logic. Instead of creating and storing a temporary state for each draft token, it buffers the KVs for draft tokens, computes their outputs in chunkwise form, and only updates the main state with the accepted tokens.

The result is both lower verification latency and higher request capacity.
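A toy comparison makes the difference in branch management visible. The sketch assumes plain ungated linear attention, a single head, and a linear chain of draft tokens; `naive_verify`, `buffered_verify`, and `commit` are hypothetical names, and the real system works on batched GPU kernels rather than Python lists.

```python
import copy

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_state(S, q):
    """o = q S for a d x d state S."""
    d = len(q)
    return [sum(q[i] * S[i][j] for i in range(d)) for j in range(d)]

def add_outer(S, k, v):
    """S += k^T v, in place."""
    for i in range(len(k)):
        for j in range(len(v)):
            S[i][j] += k[i] * v[j]

def naive_verify(S, drafts):
    """Baseline: materialize one temporary state per draft token."""
    temp_states, outs = [], []
    cur = copy.deepcopy(S)
    for q, k, v in drafts:
        add_outer(cur, k, v)
        temp_states.append(copy.deepcopy(cur))   # a full d x d copy per draft token
        outs.append(read_state(cur, q))
    return outs, temp_states

def buffered_verify(S, drafts):
    """Verify from the committed state plus buffered draft KVs; S is never modified."""
    outs = []
    for i, (q, _, _) in enumerate(drafts):
        o = read_state(S, q)
        for _, k, v in drafts[: i + 1]:          # draft i sees drafts 0..i
            w = dot(q, k)
            o = [oj + w * vj for oj, vj in zip(o, v)]
        outs.append(o)
    return outs

def commit(S, drafts, num_accepted):
    """Fold only the accepted draft tokens into the main state."""
    for _, k, v in drafts[:num_accepted]:
        add_outer(S, k, v)

S = [[1.0, 0.0], [0.0, 1.0]]
drafts = [([1.0, float(i)], [float(i), 1.0], [2.0, float(-i)]) for i in range(4)]
naive_outs, temp_states = naive_verify(S, drafts)
buffered_outs = buffered_verify(S, drafts)
commit(S, drafts, num_accepted=2)
print(buffered_outs == naive_outs, len(temp_states))
```

Both paths produce the same outputs, but the baseline holds four full state copies while the buffered path holds four small key-value pairs and touches the main state only when the accepted prefix is known.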

| Experiment component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 3: chunkwise decoding latency | Main evidence for ordinary decoding | KVBuffer can reduce kernel latency when buffer size is well chosen | It does not prove one universal buffer size across models or runtimes |
| Figure 4: speculative verification latency | Main evidence for draft-token verification | KVBuffer verification latency grows modestly while recurrent verification grows roughly linearly | It does not evaluate all speculative decoding algorithms or draft-model choices |
| Figure 5: end-to-end speculative throughput | System-level serving evidence | Avoiding temporary states increases maximum concurrent requests and throughput | It does not isolate every scheduler-level effect in other serving stacks |
| Figure 6: short-context decoding comparison | Boundary/sensitivity test for computation form choice | KV-only decoding is better when context length is below the head dimension | It does not solve dynamic routing between forms in production batches |
| Appendix A.1 and Table 2 | Implementation detail and extension to Gated Delta Networks | The memory-access logic remains close for GDN because extra decay factors are scalar | It is not a separate benchmark of all linear-attention variants |

For verification latency, the paper reports that when verifying 8 draft tokens, KVBuffer achieves a 2.78× speedup, close to the approximately 3× speedup predicted by its analysis. The mechanism is again memory traffic. Recurrent verification must repeatedly compute and store temporary states. KVBuffer buffers much smaller token-level information and delays the state update until the accepted path is known.

For end-to-end throughput, the paper evaluates speculative decoding with 4 draft tokens on ShareGPT using Multi-Token-Prediction as the draft model. Because recurrent verification needs temporary states, it supports fewer concurrent requests. KVBuffer increases the maximum number of serving requests by 5× and achieves up to a 1.46× throughput improvement.

Notice the difference between the two numbers. A 5× increase in maximum concurrent requests does not become a 5× throughput gain. That is not a contradiction; it is reality. End-to-end throughput includes more than state-slot capacity. It includes request rates, scheduling, token generation behavior, model execution, and whatever other gremlins live inside a serving stack. The paper reports both numbers, and they should not be casually blended.

KV-only decoding says short contexts should not pay the state tax

The third mechanism looks counterintuitive only if one has fallen too deeply in love with recurrent state.

For short contexts, maintaining the large state can be more expensive than simply keeping and reading all keys and values. If the context length $L$ is smaller than the head dimension $d$, KV-only decoding can be more memory-efficient than both recurrent and chunkwise forms.

The paper’s rule is simple:

$$ L < d $$

For Qwen3-Next-80B-A3B-Instruct, $d=128$. In the short-context experiment with batch size 128, KV-only decoding is faster than both recurrent and chunkwise decoding when the context length is below 128. As the context length rises, KV-only decoding becomes slower because it must read more buffered KVs. When the context length reaches 128, its latency becomes close to chunkwise decoding.

This result is easy to underestimate. Many production workloads are not heroic long-context legal documents or 200-page agent traces. They are short prompts, tool calls, routing requests, chat turns, and small instruction-following tasks. If a serving system always initializes and updates a large linear-attention state for these requests, it may be paying the long-context machinery tax before long context even appears.
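The $L < d$ rule can be illustrated with a toy KV-only decoder and a deliberately crude traffic count. The count is an assumption made for this sketch, not the paper's accounting: it charges the recurrent form one state read plus one state write (about $2d^2$ elements) and the KV-only form one read of every stored key and value (about $2Ld$ elements), ignoring smaller terms. All function names are hypothetical.

```python
def kv_only_output(q, kvs):
    """o = sum_i (q . k_i) v_i over all stored (k_i, v_i); no d x d state is built."""
    o = [0.0] * len(q)
    for k, v in kvs:
        w = sum(a * b for a, b in zip(q, k))
        o = [oj + w * vj for oj, vj in zip(o, v)]
    return o

def recurrent_output(q, kvs):
    """Reference: accumulate S = sum_i k_i^T v_i, then return q S."""
    d = len(q)
    S = [[0.0] * d for _ in range(d)]
    for k, v in kvs:
        for i in range(d):
            for j in range(d):
                S[i][j] += k[i] * v[j]
    return [sum(q[i] * S[i][j] for i in range(d)) for j in range(d)]

def kv_only_traffic(L, d):
    """Assumed count: read L keys and L values of d elements each."""
    return 2 * L * d

def recurrent_traffic(d):
    """Assumed count: read and write the d x d state once per token."""
    return 2 * d * d

kvs = [([1.0, float(i)], [float(i + 1), 2.0]) for i in range(3)]
q = [2.0, -1.0]
same = kv_only_output(q, kvs) == recurrent_output(q, kvs)

d = 128
for L in (16, 64, 128, 256):
    cheaper = kv_only_traffic(L, d) < recurrent_traffic(d)
    print(f"L={L:>3}  KV-only cheaper under this count: {cheaper}")
```

Under this count the two forms break even at $L = d$, which matches the shape of the rule: below the head dimension, reading a handful of keys and values is cheaper than hauling the whole state through memory.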

KVBuffer therefore changes the serving question from:

Which attention mechanism does the model use?

To:

Which computation form should this request use right now?

That second question is harder. Naturally, it is also the useful one.

What the paper directly shows

The direct empirical scope is fairly specific. The authors implement KVBuffer in SGLang v0.5.10 for Qwen3-Next-80B-A3B-Instruct, a hybrid model using Gated Delta Networks as its linear-attention module. The evaluation runs on four NVIDIA L40S GPUs with tensor parallelism across all four GPUs. KVBuffer kernels are implemented in Triton.

Within that setup, the paper shows five concrete results:

| Paper result | Interpretation | Practical boundary |
| --- | --- | --- |
| Up to 45.17% lower linear-attention decoding latency with buffer size 32 | Chunkwise decoding reduces average memory access by amortizing state updates | Benefit depends on buffer size, batch size, state size, model architecture, and kernel overhead |
| 2.78× speedup when verifying 8 draft tokens | Buffered verification avoids repeated temporary state computation and storage | Result is tied to the evaluated speculative setup and draft-token count |
| 5× higher maximum number of serving requests when verifying 4 draft tokens | KVBuffer reduces per-request memory footprint during speculative verification | More request capacity does not translate linearly into throughput |
| Up to 1.46× end-to-end throughput improvement | Lower verification overhead and higher concurrency improve serving throughput | Full-stack throughput depends on scheduler behavior, workload mix, and request arrival rate |
| KV-only decoding faster for $L<d$ in short contexts | Short prompts should not always initialize or update a large state | Requires routing or batching design that can separate computation forms efficiently |

The table is intentionally modest. The paper is not saying “linear attention is now solved.” It is saying that serving linear-attention models through recurrent state alone leaves performance on the table. That is a narrower claim, and therefore more believable.

What Cognaptus infers for business use

For an enterprise AI team, the business relevance is not that KVBuffer should be copied tomorrow into every inference stack. The immediate lesson is more general: when model architectures change, serving assumptions must be re-audited at the memory-traffic level.

Hybrid linear-attention models are attractive because they promise better long-context economics. But if the serving system treats the linear-attention state as a monolithic object to update every token, the organization may buy an efficient architecture and then run it inefficiently. A fine tradition in enterprise technology, but still best avoided.

The operational implications are clearer when split by workload type.

| Workload pattern | KVBuffer-relevant mechanism | Business interpretation |
| --- | --- | --- |
| Long-context generation with steady decoding | Chunkwise decoding | Lower per-token latency may reduce serving cost or improve responsiveness when batching is sufficient |
| Speculative decoding with multiple draft tokens | Parallel verification without temporary states | More concurrent requests can fit into the same GPU memory budget, especially when draft branches would otherwise multiply state slots |
| Short prompts and small tool-call contexts | KV-only decoding | Do not pay the recurrent-state cost before the context is large enough to justify it |
| Beam search or branch-heavy sampling | Buffered candidate tokens before accepted-branch update | Potential memory savings for multi-branch decoding, though this is discussed rather than fully benchmarked |
| Prefix reuse in hybrid models | Reconstruct state from buffered KVs | Possible finer-grained prefix caching, but still a design direction rather than a finished serving policy |

For Cognaptus clients evaluating AI infrastructure, the practical takeaway is not “choose KVBuffer” as a product slogan. The takeaway is to ask sharper procurement and architecture questions:

  1. Does the serving stack distinguish recurrent, chunkwise, and KV-only computation forms for hybrid models?
  2. Does speculative decoding create temporary linear-attention states per draft token?
  3. Are short-context requests routed differently from long-context requests?
  4. Are latency claims measured at kernel level only, or also in end-to-end serving throughput?
  5. Is the chosen buffer size tuned for the actual model, batch size, and runtime overhead?

Those questions are dull in the best possible way. They are the kind that find money.

The appendix is an implementation bridge, not a second thesis

The appendix explains how the same idea applies to Gated Delta Networks, the linear-attention variant used in Qwen3-Next. GDN adds a data-dependent decay factor and delta-rule update logic. For KVBuffer, this means the buffer holds three items per token rather than a plain key-value pair: the decay factor $\alpha_t$, the key $k_t$, and the delta value $u_t$.

The important point is that $\alpha_t$ is scalar, so it does not materially change the dominant memory story. Table 2 in the appendix shows that GDN’s storage and memory-access estimates remain close to the simplified linear-attention case. The appendix therefore functions as an implementation detail and robustness-of-mechanism argument: the simplified analysis is not floating in a toy universe completely detached from the evaluated model.

It is not, however, a broad proof that every linear-attention variant will behave identically. Different architectures may carry different state sizes, grouping patterns, kernel behavior, and batching constraints. The appendix supports the paper’s chosen evaluation target; it does not eliminate engineering work for future models. Sadly, there is still engineering work. This has been known to happen.
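For readers who want to see why buffering $(\alpha_t, k_t, u_t)$ is enough, here is a hedged sketch. It assumes the commonly cited gated delta rule, written in the row-vector convention used earlier, $S_t = \alpha_t S_{t-1} + k_t^T u_t$ with $u_t = \beta_t (v_t - \alpha_t k_t S_{t-1})$. The write strength $\beta_t$ and this exact definition of $u_t$ are assumptions of the sketch, not details confirmed by the article, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_mat(x, S):
    """Row vector times matrix: x S."""
    return [sum(x[i] * S[i][j] for i in range(len(x))) for j in range(len(S[0]))]

def gdn_recurrent_step(S, alpha, beta, q, k, v):
    """Reference path: rewrite the full state on every token."""
    kS = vec_mat(k, S)
    u = [beta * (vj - alpha * kj) for vj, kj in zip(v, kS)]
    S = [[alpha * S[i][j] + k[i] * u[j] for j in range(len(u))] for i in range(len(k))]
    return S, vec_mat(q, S)

def apply_buffered(x, S0, buf):
    """Return x S_n, where S_n is implied by S0 and the buffered (alpha, k, u) triples."""
    out = vec_mat(x, S0)
    for alpha, k, u in buf:
        w = dot(x, k)
        out = [alpha * oj + w * uj for oj, uj in zip(out, u)]
    return out

def gdn_buffered_step(S0, buf, alpha, beta, q, k, v):
    """Buffered path: S0 is only read; the new token adds one scalar and two vectors."""
    kS = apply_buffered(k, S0, buf)          # k_t S_{t-1} without materializing S_{t-1}
    u = [beta * (vj - alpha * kj) for vj, kj in zip(v, kS)]
    buf.append((alpha, k, u))
    return apply_buffered(q, S0, buf)

# (alpha, beta, q, k, v) per token; values are arbitrary illustration data.
tokens = [
    (0.90, 0.50, [1.0, 0.5], [0.6, 0.8], [1.0, -1.0]),
    (0.80, 1.00, [0.0, 1.0], [1.0, 0.0], [0.5, 2.0]),
    (0.95, 0.25, [1.0, 1.0], [0.0, 1.0], [3.0, 1.0]),
]
S_ref = [[0.2, 0.0], [0.0, 0.1]]
S0 = [row[:] for row in S_ref]
buf, max_err = [], 0.0
for alpha, beta, q, k, v in tokens:
    S_ref, o_ref = gdn_recurrent_step(S_ref, alpha, beta, q, k, v)
    o_buf = gdn_buffered_step(S0, buf, alpha, beta, q, k, v)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(o_ref, o_buf)))
print(max_err < 1e-9)
```

Per token, the buffer grows by one scalar and two $d$-vectors, which is the sense in which the scalar decay factor leaves the dominant memory story unchanged.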

The unsolved part is dynamic routing

The paper’s limitation section is unusually important because it names the next systems problem: KVBuffer enables multiple computation forms, but it does not dynamically route requests across those forms during serving.

That matters because the best form depends on the request.

A short request may prefer KV-only decoding. A longer request may prefer chunkwise decoding. A speculative request may benefit from buffered verification. But these forms involve different kernels and batching requirements. Switching based on prompt length or request state can introduce scheduling overhead. In production, the problem is not merely choosing the fastest kernel in isolation; it is deciding how to batch heterogeneous requests without turning the scheduler into a small tragedy.

The authors suggest possible directions: separating short-context and long-context requests onto different servers, or interleaving different computation forms within the same batch, similar in spirit to chunked prefill. These are design directions, not completed results.
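To make the routing problem tangible, here is a purely hypothetical sketch of the simplest possible policy: label each request with a computation form, then group the batch by label. Nothing here comes from the paper, which leaves this problem open. The `Request` type, the form names, and the priority given to speculative requests are all assumptions, and the sketch ignores the scheduling overhead that makes the real problem hard.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    context_len: int
    speculative: bool = False

def pick_form(req, head_dim):
    """Hypothetical policy: speculative first, then the L < d rule for short contexts."""
    if req.speculative:
        return "buffered_verify"
    if req.context_len < head_dim:
        return "kv_only"
    return "chunkwise"

def group_by_form(requests, head_dim=128):
    """Partition a batch so each group can be served by one kind of kernel."""
    groups = defaultdict(list)
    for req in requests:
        groups[pick_form(req, head_dim)].append(req.req_id)
    return dict(groups)

batch = [
    Request("a", context_len=40),
    Request("b", context_len=9000),
    Request("c", context_len=500, speculative=True),
    Request("d", context_len=127),
]
groups = group_by_form(batch)
print(groups)
```

Grouping is the easy part. Each group may now be too small to batch efficiently, and a request changes group as its context grows, which is exactly the friction the limitation describes.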

This boundary affects business interpretation directly. KVBuffer’s reported numbers are credible within the evaluated implementation, but deployment value depends on whether a serving platform can decide when to use each form without losing the gains to scheduling friction.

That is the part to watch.

A useful result because it is not too grand

The most valuable papers in AI infrastructure often do not announce a new intelligence frontier. They find a place where the system is doing unnecessary work and stop it from doing that work quite so often.

KVBuffer is in that category.

It reframes linear-attention serving around IO rather than only asymptotic decoding cost. It shows that recurrent state is not automatically optimal just because it is constant with respect to context length. It demonstrates that buffering recent KVs can support chunkwise decoding, speculative verification, and short-context KV-only decoding in one memory-management mechanism. And it reports meaningful improvements in a real serving implementation: up to 45.17% lower linear-attention decoding latency, 2.78× faster verification for 8 draft tokens, 5× more supported serving requests under 4-token speculative verification, and up to 1.46× end-to-end throughput improvement.

The numbers are not universal laws. They are evidence that the serving layer has become architecture-specific again.

For businesses adopting long-context and agentic LLM systems, that is the deeper lesson. Model choice is only half the infrastructure decision. The other half is whether the serving stack understands what the model is actually doing with memory.

Linear attention may reduce the context-length problem. KVBuffer reminds us that someone still has to move the bytes.

Cognaptus: Automate the Present, Incubate the Future.


  1. Longwei Zou and Lin Zhong, “KVBuffer: IO-aware Serving for Linear Attention,” arXiv:2605.19049v1, 18 May 2026, https://arxiv.org/abs/2605.19049