Mixed Feelings: When LLM Batching Stops Being Obviously Better

Queues are where infrastructure theories go to become invoices.

In LLM serving, the popular theory has been simple enough: mix the work. During inference, a model first reads the prompt in the prefill phase, then generates tokens one by one in the decode phase. Prefill wants compute. Decode wants memory bandwidth. So the obvious move is to combine them in the same batch, so that one forward pass feeds the compute units with prefill tokens while decode tokens draw on memory bandwidth. This is mixed batching, and it has become the default posture in modern inference engines.

The obvious move, unfortunately, is only obvious until the GPU starts choking on the memory traffic.

The paper Threshold-Based Exclusive Batching for LLM Inference makes a useful and slightly inconvenient point: mixed batching does not universally dominate exclusive batching [1]. Its advantage depends on hardware bandwidth, model size, workload composition, and traffic level. On bandwidth-constrained GPUs, combining prefill and decode can create enough interference in attention kernels that a supposedly less elegant scheduler (exclusive batching, which separates the phases) wins on throughput and often on per-token latency.

This is not a nostalgic defence of old schedulers. It is a reminder that “overlap” is not a free lunch. It is just a lunch with the cost hidden in a different kernel.

The default belief: mixed batching overlaps the pain away

LLM inference has two distinct phases.

In prefill, the system processes the input prompt and builds the key-value cache. This phase is relatively compute-heavy because prompt tokens can be processed in parallel. In decode, the model generates one token at a time for each active request. This phase repeatedly reads from the KV cache, making it heavily memory-bandwidth-bound.

That difference created the appeal of mixed batching (MB). If prefill is compute-bound and decode is bandwidth-bound, then interleaving them looks efficient. Mixed batching keeps decode requests moving while admitting new prompts through chunked prefill. This is the scheduler logic behind vLLM v1-style serving and related inference systems.

Exclusive batching (EB) does the opposite. It does not combine prefill and decode in one forward pass. It alternates: decode active requests until enough slots become idle, then run a prefill phase to admit new requests, then return to decode. This sounds less modern, mostly because “not doing two things at once” rarely wins a product roadmap meeting.

The paper’s argument is that this comparison has been framed too simplistically. Mixed batching reduces some idle time, but it can also make each iteration more expensive. The relevant question is not whether MB overlaps work. It clearly does. The question is whether the overlap costs more than it saves.

That question turns out to be hardware-dependent.

The real comparison is not EB versus MB; it is marginal cost versus launch amortisation

The paper’s central comparison is clean. Mixed batching has an advantage because it reduces separate phase launches and can amortise fixed overhead. Exclusive batching has an advantage because it avoids prefill-decode interference inside a single mixed batch.

So the scheduler choice becomes a trade-off:

| Scheduler force | Mixed batching (MB) | Exclusive batching (EB) |
| --- | --- | --- |
| What it saves | Fewer separated phase launches; lower time-to-first-token (TTFT) in many cases | Less prefill-decode interference; lower time-per-output-token (TPOT) in bandwidth-constrained cases |
| What it risks | Attention-side contention when decode share is high | Waiting to refill idle slots; worse TTFT under low traffic |
| When it tends to win | High bandwidth, low/moderate contention, larger models where fixed costs matter | Bandwidth-constrained GPUs, saturated queues, medium decode ratios, token-throughput-sensitive workloads |
| Business translation | Better responsiveness when overlap is cheap | Better serving capacity when overlap is not cheap |

The paper formalises this as a crossover condition: choose exclusive batching when the marginal-cost gap caused by mixed prefill-decode execution exceeds mixed batching’s amortised fixed-cost advantage. In prose:

$$ \text{MB interference penalty} > \text{MB fixed-cost advantage} $$

That is the paper’s most useful sentence, even though the authors quite reasonably use more symbols. The crossover is not a philosophical preference. It is an operating condition.
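
A toy version of that inequality may make the trade-off easier to see. The sketch below assumes a simple per-token cost model that is not taken from the paper: the interference penalty is the extra marginal cost a token pays inside a mixed batch, and the fixed-cost advantage is the launch overhead MB avoids, spread over the tokens processed per cycle. The function name, parameters, and numbers are all hypothetical.

```python
def prefer_exclusive(mixed_cost_per_token_ms: float,
                     exclusive_cost_per_token_ms: float,
                     fixed_overhead_saved_ms: float,
                     tokens_per_cycle: int) -> bool:
    """Return True when the toy crossover condition favours exclusive batching.

    All four inputs are hypothetical measurements, not values from the paper.
    """
    if tokens_per_cycle <= 0:
        raise ValueError("tokens_per_cycle must be positive")
    # Left side: extra marginal cost each token pays for sharing a mixed batch.
    interference_penalty = mixed_cost_per_token_ms - exclusive_cost_per_token_ms
    # Right side: launch overhead MB avoids, spread over the tokens in a cycle.
    fixed_cost_advantage = fixed_overhead_saved_ms / tokens_per_cycle
    return interference_penalty > fixed_cost_advantage


# Same per-token penalty, different occupancy: the amortised advantage shrinks
# as more tokens share each cycle, so the decision flips.
print(prefer_exclusive(0.030, 0.020, 4.0, 100))    # False: ~0.010 < 0.040
print(prefer_exclusive(0.030, 0.020, 4.0, 1000))   # True:  ~0.010 > 0.004
```

The point of the toy is only that neither side of the inequality is a constant: one moves with hardware and batch composition, the other with load.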

A scheduler that ignores this condition is effectively assuming the same answer for a high-bandwidth H200, a bandwidth-constrained RTX PRO 6000, a small Gemma model, a larger Qwen model, a long-context workload, and a decode-heavy chat workload. That is not a scheduling strategy. It is a shrug with YAML.

The evidence starts with a controlled insult to the default

The first important evidence is not the end-to-end benchmark. It is the controlled marginal-cost experiment.

Using Qwen3-4B, the authors compare mixed batches at different decode ratios on two GPUs with sharply different memory bandwidth: RTX PRO 6000 at 1.792 TB/s and H200 at 4.8 TB/s. The observed crossover is stark. On H200, the mixed-batch marginal cost exceeds pure decode only when decode tokens exceed roughly 80% of the batch. On RTX PRO 6000, the threshold falls to about 20%.

That is the paper in miniature.

On the high-bandwidth GPU, mixed batching can tolerate a large decode share before interference becomes expensive. On the bandwidth-constrained GPU, the same strategy starts paying the interference tax much earlier. The scheduler has not changed. The economics of the hardware have.

The kernel breakdown matters because it identifies where the cost comes from. GEMM time is largely insensitive to batch composition. Attention time grows as the decode ratio increases. That is exactly where decode is expected to hurt: every decode token has to stream KV-cache context from memory. When prefill and decode are placed in the same batch, attention becomes the contested resource.
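
For teams that want to reproduce this kind of measurement, a minimal sketch of the post-processing step follows. It assumes you have already profiled mixed-batch marginal cost at a handful of decode ratios and simply interpolates to find where it first exceeds the pure-decode cost. The function name and the sample numbers are illustrative placeholders, not data from the paper.

```python
from typing import Optional, Sequence, Tuple


def crossover_decode_ratio(profile: Sequence[Tuple[float, float]],
                           pure_decode_cost: float) -> Optional[float]:
    """Return the first decode ratio where mixed cost exceeds pure_decode_cost.

    profile: (decode_ratio, mixed_marginal_cost) pairs sorted by decode_ratio.
    Returns None if the mixed cost never exceeds the pure-decode cost.
    """
    prev_r, prev_c = profile[0]
    if prev_c > pure_decode_cost:
        return prev_r
    for r, c in profile[1:]:
        if c > pure_decode_cost:
            # Linear interpolation between the two bracketing points.
            frac = (pure_decode_cost - prev_c) / (c - prev_c)
            return prev_r + frac * (r - prev_r)
        prev_r, prev_c = r, c
    return None


# Made-up profiles in arbitrary cost units, shaped to mimic the two regimes.
constrained_gpu = [(0.0, 0.8), (0.1, 0.9), (0.3, 1.1), (0.5, 1.3), (0.9, 1.6)]
high_bw_gpu = [(0.0, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.1)]

print(round(crossover_decode_ratio(constrained_gpu, 1.0), 2))
print(round(crossover_decode_ratio(high_bw_gpu, 1.0), 2))
```

Linear interpolation is a convenience here; a real profile may warrant a denser sweep near the crossing.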

This is the first useful correction for infrastructure teams: MB is not “more efficient” in the abstract. It is more efficient when the co-location cost is low enough. When memory bandwidth is the bottleneck, co-location can turn into self-interference with better branding.

Exclusive batching becomes interesting only when it is thresholded

A naive exclusive batching policy is easy to dismiss. If the system switches to prefill whenever any request finishes, it may admit too few new requests at a time and waste fixed prefill overhead. If it waits too long, idle slots sit unused and new requests wait. The scheduler has to decide when enough slots have opened to justify a refill phase.

The paper studies this as threshold-based exclusive batching, denoted $EB(\theta)$. The threshold $\theta$ represents the fraction of slots that must become idle before the system switches from decode to prefill.

The key mechanism is a throughput-latency trade-off:

  • A low threshold refills early, reducing waiting but amortising prefill overhead poorly.
  • A high threshold refills later, improving amortisation but leaving completed slots idle longer.
  • The optimal threshold depends on output-length behaviour, especially how likely active requests are to finish soon.
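
The trade-off in the list above can be seen in a toy simulation. This is a sketch under loud assumptions, not the paper's scheduler: the queue is saturated (every refill fills all idle slots), each active request finishes a decode step with a fixed probability, and prefill cost is not modelled beyond counting how many prefill phases are launched. All names and defaults are hypothetical.

```python
import random


def simulate_eb(theta: float, slots: int = 32, finish_prob: float = 0.05,
                steps: int = 2000, seed: int = 0) -> dict:
    """Toy EB(theta) loop under a saturated queue.

    Each active request finishes a decode step with probability finish_prob
    (a memoryless assumption made for this sketch only).
    """
    rng = random.Random(seed)
    active = slots           # start with every slot occupied
    prefill_phases = 0
    completed = 0
    idle_slot_steps = 0      # slot-steps left empty after the refill decision
    for _ in range(steps):
        # Decode phase: every active request emits one token and may finish.
        finished = sum(1 for _slot in range(active)
                       if rng.random() < finish_prob)
        active -= finished
        completed += finished
        # Switch to prefill only once the idle fraction reaches the threshold.
        if active < slots and (slots - active) / slots >= theta:
            prefill_phases += 1
            active = slots   # saturated queue: refill every idle slot
        idle_slot_steps += slots - active
    return {"prefill_phases": prefill_phases, "completed": completed,
            "idle_slot_steps": idle_slot_steps}


for theta in (0.05, 0.25, 0.5):
    print(theta, simulate_eb(theta))
```

A low threshold launches many small prefill phases and leaves few slots idle; a high threshold does the reverse. Which end wins depends on how expensive each prefill launch is, which this toy deliberately leaves out.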

The authors derive a closed-form baseline threshold under a constant-failure-rate (CFR) assumption, meaning output lengths behave like a memoryless geometric process: a request is equally likely to finish at every step, however long it has already run. They then extend the result to increasing-failure-rate (IFR) workloads, which better reflect real LLM serving: once a request has generated more tokens, it often becomes more likely to finish in the near future.

That increasing-failure-rate correction matters. It says the scheduler can often wait longer before refilling, because more completions are likely to arrive soon. In operational language: do not interrupt decode too eagerly if the active batch is about to free more slots anyway.
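
Whether a workload looks memoryless or increasing-failure-rate can be checked from logged output lengths. The sketch below computes the standard empirical discrete hazard, the probability that a request stops at length t given that it reached length t. It is a generic estimator, not the paper's online procedure, and the sample lengths are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def empirical_hazard(output_lengths: Iterable[int]) -> Dict[int, float]:
    """Estimate h(t) = P(length == t | length >= t) from observed lengths.

    Only lengths that actually occur appear in the result; the hazard at any
    unobserved length is implicitly zero.
    """
    counts = Counter(output_lengths)
    at_risk = sum(counts.values())   # requests that reached at least length t
    hazard = {}
    for t in sorted(counts):
        hazard[t] = counts[t] / at_risk
        at_risk -= counts[t]
    return hazard


# Invented sample: most requests stop late, so the hazard climbs with length.
lengths = [1, 2, 2, 3, 3, 3]
h = empirical_hazard(lengths)
values = [h[t] for t in sorted(h)]
print(h)
print("non-decreasing hazard:", all(a <= b for a, b in zip(values, values[1:])))
```

On real logs the tail of this estimate is noisy because few requests survive that long, so it is usually worth bucketing lengths or truncating the tail before reading a trend into it.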

The paper also adds memory-safe batch sizing. This is not decorative. EB preserves KV caches for continuing requests while admitting new ones during prefill. Memory therefore follows a sawtooth pattern: it rises during decode, drops when requests finish, and rises again when new requests enter. The authors derive a conservative batch-size bound and add a runtime KV-aware gate to avoid refilling when instantaneous KV occupancy leaves too little headroom.
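
A runtime gate of that kind can be sketched in a few lines. This is an illustrative stand-in rather than the paper's bound or any vLLM API: it assumes KV memory is counted in fixed-size blocks, that a per-request block estimate is available, and that a fixed fraction of the pool is held back as headroom. Every name and default is hypothetical.

```python
def max_admissible(kv_blocks_used: int, kv_blocks_total: int,
                   est_blocks_per_request: int,
                   headroom_fraction: float = 0.1) -> int:
    """Largest number of new requests that fits under the headroom cap."""
    if est_blocks_per_request <= 0:
        raise ValueError("est_blocks_per_request must be positive")
    reserve = int(kv_blocks_total * headroom_fraction)
    budget = kv_blocks_total - reserve - kv_blocks_used
    return max(0, budget // est_blocks_per_request)


def can_refill(kv_blocks_used: int, kv_blocks_total: int,
               new_requests: int, est_blocks_per_request: int,
               headroom_fraction: float = 0.1) -> bool:
    """Return True if admitting new_requests keeps KV usage under the cap."""
    limit = max_admissible(kv_blocks_used, kv_blocks_total,
                           est_blocks_per_request, headroom_fraction)
    return new_requests <= limit


print(can_refill(600, 1000, new_requests=10, est_blocks_per_request=20))  # True
print(can_refill(800, 1000, new_requests=10, est_blocks_per_request=20))  # False
print(max_admissible(800, 1000, est_blocks_per_request=20))               # 5
```

When the gate says no, a scheduler can either admit fewer requests (up to `max_admissible`) or keep decoding until more blocks free up.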

This is where the paper moves from “nice model” to “actually deployable enough to test.” The scheduler is not just selecting a threshold. It is selecting a threshold and a batch size under memory pressure, while updating online from recent workload statistics.

EB+ is the paper’s real product idea, even if the paper is not selling one

The paper’s strongest engineering move is EB+, a hybrid scheduler that switches between exclusive batching and mixed batching online.

EB+ uses the crossover condition at runtime. It estimates the current workload, tracks active occupancy, uses a one-shot hardware profile of mixed-batch marginal cost, and decides whether to run EB or MB. The priority margin can be tuned: favour MB for lower time-to-first-token, or favour EB for token throughput.

That last point matters. The paper is not claiming that EB should replace MB everywhere. It is claiming that a scheduler should know when each one is economically wrong.

At low traffic, EB+ tends to select MB because the fixed-cost advantage and TTFT benefits of mixed batching matter more. At higher occupancy, the fixed-cost advantage shrinks and contention dominates, so EB+ shifts toward EB. Under workload drift, it tracks the decode ratio and changes mode.
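
The behaviour described above can be approximated with a small stateful selector. This is a sketch of the idea, not the EB+ implementation: it assumes the hardware profile is a sorted table of per-token penalties by decode ratio, tracks the decode ratio with exponential smoothing, and models the fixed-cost advantage as a per-iteration overhead divided by the number of active requests. The class, its fields, and the numbers are hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


class HybridModeSelector:
    """Toy EB/MB selector; every field here is a hypothetical stand-in."""

    def __init__(self, profile: Sequence[Tuple[float, float]],
                 fixed_overhead_ms: float, margin_ms: float = 0.0,
                 smoothing: float = 0.2) -> None:
        # profile: (decode_ratio, per-token mixed-batch penalty in ms), sorted.
        self.ratios = [r for r, _ in profile]
        self.penalties = [p for _, p in profile]
        self.fixed_overhead_ms = fixed_overhead_ms
        self.margin_ms = margin_ms       # > 0 favours MB, < 0 favours EB
        self.smoothing = smoothing
        self.decode_ratio: Optional[float] = None   # smoothed estimate

    def observe(self, decode_tokens: int, prefill_tokens: int) -> None:
        """Update the smoothed decode ratio from one iteration's token mix."""
        total = decode_tokens + prefill_tokens
        if total == 0:
            return
        ratio = decode_tokens / total
        if self.decode_ratio is None:
            self.decode_ratio = ratio
        else:
            self.decode_ratio += self.smoothing * (ratio - self.decode_ratio)

    def _penalty(self, ratio: float) -> float:
        # Step lookup: use the nearest profiled point at or below the ratio.
        i = max(0, bisect_right(self.ratios, ratio) - 1)
        return self.penalties[i]

    def choose(self, active_requests: int) -> str:
        """Return "EB" or "MB" for the current occupancy."""
        if self.decode_ratio is None or active_requests <= 0:
            return "MB"   # no signal yet: fall back to the mixed default
        advantage = self.fixed_overhead_ms / active_requests
        penalty = self._penalty(self.decode_ratio)
        return "EB" if penalty > advantage + self.margin_ms else "MB"


profile = [(0.0, 0.00), (0.2, 0.02), (0.5, 0.05), (0.8, 0.08)]
selector = HybridModeSelector(profile, fixed_overhead_ms=2.0)
selector.observe(decode_tokens=60, prefill_tokens=40)
print(selector.choose(active_requests=8))     # MB: advantage 0.25 > penalty 0.05
print(selector.choose(active_requests=128))   # EB: advantage ~0.016 < penalty 0.05
```

A production controller would likely add hysteresis so the mode does not flap near the boundary; that is omitted here to keep the comparison visible.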

This is a practical design philosophy: do not ask operators to pick a permanent batching religion. Give the scheduler a meter.

The evaluation is a hierarchy, not a pile of charts

The paper includes enough experiments that a careless reading can flatten them into “many benchmarks were run.” That is not the right way to read it. The tests serve different purposes.

| Evidence component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Marginal-cost and kernel breakdown experiments | Main mechanism evidence | MB interference is attention-driven and bandwidth-sensitive | Universal behaviour across all kernels and hardware |
| Threshold validation against fixed sweeps | Model validation / implementation check | Closed-form EB threshold is near-optimal without manual search | Exact finite-batch optimality in every deployment |
| Real-world workload benchmarks | Main end-to-end evidence | EB can outperform MB on bandwidth-constrained GPUs, especially for Qwen3-8B | EB always beats MB |
| EB+ traffic and shift experiments | Main evidence for adaptivity | Hybrid switching tracks low load, high load, and non-stationary regimes | Fully SLO-optimal production scheduling |
| Model, context, and hazard-rate appendices | Robustness and sensitivity tests | Mechanism holds across Qwen scales and contexts; threshold estimates tolerate error | No calibration or profiling needed |
| Disaggregation comparison | Comparison with prior architectural option | EB+ can be competitive without separate P/D GPU pools | Disaggregation is obsolete |

That hierarchy matters because the paper’s claim is not “we found a faster scheduler in a benchmark.” It is “we found the condition under which a scheduler class should win, and built a hybrid that follows that condition.”

The distinction is not academic. Benchmark wins age quickly. Crossover logic ages more slowly, unless the hardware stack changes the underlying cost structure.

Real workloads show the crossover rather than a clean victory lap

On real-world workloads, the results are deliberately messy in the useful way.

The authors evaluate Qwen3-8B and Qwen3-30B-A3B on ShareGPT, LongBench, WildChat, and NuminaMath across RTX PRO 6000 and H200. The high-level pattern fits the mechanism.

On RTX PRO 6000, EB outperforms vLLM v1 mixed batching on every Qwen3-8B workload, with an average request-throughput improvement of 7.9%. The strongest gains appear on ShareGPT (+15.3%) and WildChat (+11.3%), while LongBench (+3.7%) and NuminaMath (+1.4%) show smaller gains.

That spread is important. The paper’s convexity analysis predicts EB’s advantage should peak at intermediate decode ratios and shrink at extremes. LongBench is prefill-heavy. NuminaMath is decode-heavy. ShareGPT and WildChat sit in the more favourable middle. The benchmark is not just a scoreboard; it is a test of the mechanism.
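
To place your own workload on that spectrum, a rough proxy is the share of processed tokens that are output tokens. The sketch below assumes that proxy, which may differ from the paper's exact per-batch definition of decode ratio, and uses invented workload shapes rather than statistics from the named datasets.

```python
def decode_ratio(mean_input_tokens: float, mean_output_tokens: float) -> float:
    """Share of processed tokens that are decode (output) tokens."""
    total = mean_input_tokens + mean_output_tokens
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return mean_output_tokens / total


# Invented (input, output) token means for three hypothetical workloads.
workloads = {
    "prefill-heavy": (8000, 200),
    "balanced chat": (400, 400),
    "decode-heavy": (150, 3000),
}
for name, (n_in, n_out) in workloads.items():
    print(f"{name}: {decode_ratio(n_in, n_out):.2f}")
```

Workloads near either end of this scale are where the article's summary expects EB's edge to be thinnest.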

On Qwen3-30B-A3B, the picture becomes more nuanced. EB still wins on some RTX PRO 6000 workloads, such as WildChat (+11.6%) and LongBench (+4.8%), but loses on ShareGPT and NuminaMath, leaving only a modest average gain of 1.4%. On H200, MB is often competitive or better for the larger model, with the average result moving against EB for Qwen3-30B-A3B.

This is not a weakness in the argument. It is the argument.

Larger models increase fixed per-iteration costs, which can make MB’s launch-amortisation advantage more valuable. Higher-bandwidth hardware reduces the attention-side penalty of co-location. Put those together, and the crossover moves back toward MB.

A less careful paper would have hidden this under a victory average. This one effectively says: here is where our scheduler wins, here is where it does not, and here is why. Systems papers are much more useful when they resist the urge to become sales decks.

Latency splits the story into TTFT and TPOT

The latency results clarify the business trade-off.

Mixed batching often improves TTFT because prefill can be interleaved with ongoing decode. Users see the first token sooner. For interactive products, this is not cosmetic. TTFT shapes perceived responsiveness.

Exclusive batching often improves TPOT on bandwidth-constrained hardware because it avoids attention contention during decode. Users get subsequent tokens faster. For long responses, coding assistants, research agents, and reasoning-heavy workloads, TPOT can dominate the actual service cost and experience.

The paper reports large TPOT reductions for EB on RTX PRO 6000 with Qwen3-8B: 65% on ShareGPT, 35% on WildChat, and 20% on NuminaMath, with LongBench as the exception because of its prefill-heavy structure. On H200, the differences are smaller, again consistent with bandwidth reducing the interference penalty.
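
Since the two metrics pull in different directions, it helps to compute them separately per request. The sketch below uses the common definitions (TTFT as arrival-to-first-token delay, TPOT as the mean gap between later tokens); the function name and timestamps are illustrative, and the paper's measurement harness may differ in detail.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(arrival_s: float,
                  token_times_s: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one request.

    TTFT is the delay from arrival to the first token. TPOT is the mean gap
    between consecutive output tokens after the first; it is reported as 0.0
    for single-token responses.
    """
    if not token_times_s:
        raise ValueError("request produced no tokens")
    ttft = token_times_s[0] - arrival_s
    if len(token_times_s) == 1:
        return ttft, 0.0
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    return ttft, tpot


# Invented timestamps: arrival at t=10.0 s, four tokens emitted afterwards.
print(ttft_and_tpot(10.0, [10.5, 10.75, 11.0, 11.25]))   # (0.5, 0.25)
```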

This is the operational translation:

| Product priority | Scheduler implication |
| --- | --- |
| Fast first-token response under light load | MB may remain preferable |
| High token throughput under sustained load | EB or EB+ becomes attractive on constrained hardware |
| Mixed workloads with shifting traffic | EB+ is more sensible than a permanent default |
| Strict TTFT and TPOT SLOs together | The scheduler needs explicit SLO-aware tuning, not just peak throughput |

The scheduler decision is therefore not merely technical. It affects the shape of customer experience. A chatbot that produces short answers and values first-token responsiveness has different economics from a coding agent generating long patches under heavy concurrency.

Naturally, many teams will still use one default for both. This is how dashboards become therapy.

EB+ behaves like a runtime arbitrageur

The EB+ experiments are the most business-relevant part of the paper because production traffic is not stationary. Workloads shift. Concurrency shifts. Prompt and output lengths shift. Users, inconsiderately, refuse to behave like synthetic distributions.

The paper tests EB+ under stationary traffic at different concurrency levels and under non-stationary settings with distribution and concurrency shifts. The result is not that EB+ always dominates every metric. The result is more useful: EB+ usually gets close to the better of MB and EB without manual retuning.

On bandwidth-constrained RTX PRO 6000, EB+ selects MB at low load and recovers MB-like TTFT. At moderate and high load, it shifts toward EB and gains throughput with lower TPOT. Under non-stationary distribution shift, EB+ improves throughput over MB by up to 36.4% on RTX PRO 6000. Under the tested shifts, it stays within 1% of the better static scheduler across the reported hardware-scenario cells.

On H200, where MB is already strong, EB+ often reproduces MB behaviour at low load and tracks EB where EB is better. That matters because a hybrid scheduler should not damage strong baselines simply to justify its existence. Infrastructure has enough vanity projects already.

The goodput tests add a stricter lens. Under a relaxed TPOT target on RTX PRO 6000, EB+ reaches 80.3% joint SLO attainment at the loosest TTFT target tested, compared with 77.3% for EB and 5.8% for MB. Under strict TPOT, all schedulers fail on that bandwidth-constrained GPU. This is a useful boundary: scheduling can improve the feasible region, but it cannot repeal hardware limits by being clever.
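
Joint SLO attainment is simple to compute once per-request latencies are logged. The sketch below assumes the usual reading of the metric, the fraction of requests that meet both the TTFT and the TPOT target; the targets and sample values are invented, and the paper's exact goodput definition may include details not shown here.

```python
from typing import Iterable, Tuple


def joint_slo_attainment(requests: Iterable[Tuple[float, float]],
                         ttft_target_s: float,
                         tpot_target_s: float) -> float:
    """Fraction of requests whose TTFT and TPOT both meet their targets.

    requests: (ttft_seconds, tpot_seconds) pairs, one per finished request.
    """
    requests = list(requests)
    if not requests:
        return 0.0
    ok = sum(1 for ttft, tpot in requests
             if ttft <= ttft_target_s and tpot <= tpot_target_s)
    return ok / len(requests)


# Invented latencies: one request misses TTFT, one misses TPOT, two pass.
sample = [(0.4, 0.03), (2.5, 0.02), (0.6, 0.08), (0.9, 0.04)]
print(joint_slo_attainment(sample, ttft_target_s=1.0, tpot_target_s=0.05))  # 0.5
```

Because a request must pass both tests, a scheduler that is excellent on one metric and poor on the other can still score near zero, which is the pattern the MB figure above illustrates.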

The appendix tests robustness, not a second thesis

The appendices are worth reading because they show where the mechanism generalises and where it narrows.

The additional marginal-cost experiments extend the analysis across Gemma-3-1B-IT, Qwen3-8B, and Qwen3-30B-A3B. The Qwen models preserve the bandwidth-sensitive interference pattern. The larger Qwen3-30B-A3B shows a wider gap on RTX PRO 6000, consistent with greater memory pressure. Gemma-3-1B-IT is the exception: its lightweight architecture does not show the same crossover, because the bandwidth pressure is lower.

That exception is not embarrassing. It is clarifying. The mechanism is not “MB bad.” It is “MB can become bad when the model and hardware jointly create memory contention.” For small models, the contention may not materialise.

The long-context checks also matter. Increasing decode context length raises attention time, as expected, but does not change the qualitative crossover structure. The paper also validates the linear prefill iteration-time model across seven models in practical token-budget ranges, with high fit quality, and runs 128K-context end-to-end tests on RTX PRO 6000 showing EB with modest but consistent throughput gains across models.

The hazard-rate appendices test the threshold logic. Real workloads show increasing-failure-rate (IFR) behaviour in the reliable region, and synthetic gamma tests show the optimal switching threshold shifting upward under IFR, from 0.16 in a constant-failure-rate case to 0.36 in an increasing-failure-rate case. Sensitivity tests show the adaptive controller reaches 98% of peak throughput on both a synthetic gamma workload and ShareGPT, suggesting moderate hazard-estimation errors do not destroy performance.

These are robustness tests, not a licence to deploy blindly. They support the mechanism. They do not eliminate the need to profile your own workload.

Disaggregation is an alternative, not a magic escape hatch

The paper also compares EB+ with prefill-decode disaggregation, where separate GPU pools handle prefill and decode. Conceptually, disaggregation attacks the same problem by physical separation rather than temporal separation. It avoids co-location interference, but it introduces its own costs: KV-cache transfer, pool-ratio tuning, minimum GPU footprint, and possible imbalance.

The paper’s comparison is pragmatic. In two-GPU experiments, EB+ matches or exceeds MB throughput across tested concurrencies and can outperform vLLM’s 1P+1D disaggregation setup (one prefill GPU plus one decode GPU). At high concurrency, the disaggregation scheduler runs out of memory (OOM) in the reported tests because KV blocks remain pinned during transfer and admission backpressure is insufficient.

The four-GPU comparison is more nuanced. Some disaggregation ratios win in some prefill-heavy settings. But the optimal prefill-to-decode ratio changes with workload mix, and the wrong ratio is costly. EB+ is best or near-best in 7 of 9 workload cells on RTX PRO 6000 and avoids the manual P:D ratio choice.

The lesson is not that disaggregation is obsolete. It is that disaggregation is an architecture decision, while EB+ is a scheduling decision. If a team can afford dedicated pools, fast interconnects, careful P:D tuning, and workload-specific routing, disaggregation may still be attractive. If the workload shifts and hardware is bandwidth-constrained, a bandwidth-aware scheduler is a cheaper first lever.

In business terms: before buying more GPUs to separate phases spatially, check whether separating them temporally gets you enough of the gain. Radical, I know—measure before procurement.

What the paper directly shows

The direct claims are fairly specific.

First, controlled experiments show that mixed-batch marginal cost can exceed pure decode cost much earlier on bandwidth-constrained hardware than on high-bandwidth hardware. This is localised mainly to attention, not GEMM.

Second, a thresholded EB scheduler can be derived from workload output-length behaviour, updated online, and paired with memory-safe batch sizing. The empirical validation suggests this avoids manual threshold search while retaining near-optimal throughput in the tested settings.

Third, end-to-end benchmarks show EB can improve throughput and TPOT on bandwidth-constrained GPUs, especially for Qwen3-8B and intermediate decode-ratio workloads. The strongest real-workload improvements are not universal; they align with the crossover logic.

Fourth, EB+ can switch between MB and EB online and performs best or near-best under tested stationary and non-stationary traffic, particularly when bandwidth pressure makes static MB costly.

These are strong claims, but they are bounded claims. The paper is not proving that every inference stack should revert to exclusive batching. It is proving that the default scheduler should be conditional.

What Cognaptus infers for operators

The operational implication is straightforward: batching policy should be part of serving optimisation, not an inherited default.

For teams running LLM inference at scale, the paper suggests a practical diagnostic sequence:

| Diagnostic question | Why it matters |
| --- | --- |
| What is the GPU memory bandwidth, not just FLOPS? | Decode attention is bandwidth-bound; FLOPS alone misprices the bottleneck. |
| What is the workload’s input/output mix? | EB gains peak around intermediate decode ratios and shrink at extremes. |
| What are the TTFT and TPOT targets separately? | MB may help first token; EB may help subsequent tokens. |
| Is the fleet bandwidth-constrained or heterogeneous? | EB/EB+ may be more valuable on constrained or mixed hardware pools. |
| Does the model create enough memory pressure? | Small models may not trigger the same interference pattern. |
| Is traffic stationary? | Static EB or MB choices degrade when concurrency and workload mix shift. |
| How much KV-cache headroom exists? | EB needs memory-aware refill decisions, especially under long outputs or large models. |

The commercial angle is not glamorous, which is why it is useful. Better batching can increase effective serving capacity without changing the model. It can lower time per output token, reduce queue pressure, and make constrained GPUs more economically viable. For enterprises deploying private inference on non-top-tier accelerators, this may matter more than another round of prompt-template folklore.

There is also a procurement lesson. GPU comparisons often overweight FLOPS and memory size. This paper pushes memory bandwidth back into the centre of inference economics. A cheaper GPU with weaker bandwidth may not simply be “slower”; it may move the entire scheduler crossover, changing which serving strategy is optimal. The cost model has to include scheduler behaviour, not just model size and token volume.

Where the result should not be overextended

There are four boundaries worth keeping crisp.

First, the implementation is vLLM-based and evaluated on NVIDIA GPUs. The qualitative mechanism is plausible beyond that stack, but the exact crossover points are not portable constants. Different kernels, attention implementations, quantisation regimes, interconnects, and memory managers can move the boundary.

Second, much of the core comparison is performed under saturated serving conditions. EB+ includes traffic-level adaptation and low-load behaviour, but operators with mostly bursty or lightly loaded traffic should care more about TTFT and queueing dynamics than peak saturated throughput.

Third, EB+ needs calibration. Its switching rule depends on an empirical profile of mixed-batch marginal cost as a function of decode ratio. This is not a major burden, but it is also not zero. “Adaptive” still means “measured somewhere.”

Fourth, the memory-safety model is partly analytical and partly runtime-defensive. The paper derives conservative bounds and adds a KV-aware feasibility gate, but also acknowledges modelling approximations. Production systems with long context, high output variance, MoE memory quirks, or aggressive multi-tenancy should treat memory headroom as a live control variable.

The cleanest practical reading is therefore: use this paper to decide what to profile, not what to blindly switch on.

The business value is conditional automation

The best systems papers do not merely say “our method is faster.” They explain why a default became too broad.

Mixed batching became attractive because it matched the intuition of modern accelerators: overlap different resource demands and improve utilisation. That intuition still holds in many cases. But when decode attention saturates memory bandwidth, overlap stops being elegant and starts being interference.

Exclusive batching looks simpler, but with the right threshold it becomes a structured control policy: wait for enough completions, refill safely, and avoid mixing phases when the hardware would punish it. EB+ then wraps the comparison in an online controller, choosing MB when overlap is cheap and EB when separation is cheaper.

That is the article’s business lesson. The future of inference optimisation is not one scheduler winning permanently. It is schedulers becoming hardware-aware, workload-aware, and SLO-aware. The model may be universal; the serving stack is stubbornly local.

For operators, the next step is not to declare mixed batching dead. That would be satisfyingly dramatic and technically lazy. The next step is to profile the crossover: measure marginal cost by decode ratio, segment by GPU bandwidth, track TTFT and TPOT separately, and let the scheduler respond to the regime it is actually in.

Infrastructure rarely rewards ideology. It rewards the side that counts the bottleneck correctly.

Cognaptus: Automate the Present, Incubate the Future.


  1. Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma, and Shining Wu, “Threshold-Based Exclusive Batching for LLM Inference,” arXiv:2606.00516v1, 30 May 2026, https://arxiv.org/abs/2606.00516