Queue.

That is still the least glamorous word in AI infrastructure, and probably the most honest one. A user asks a model to write code, summarize a filing, inspect an image, or reason through a customer ticket. The model knows what to do, more or less. The bottleneck is not ambition. It is waiting: one token after another, one expensive forward pass after another, while the GPU performs a very sophisticated version of typing slowly.

Speculative decoding was supposed to relieve that pain. The idea is elegant enough to survive contact with engineering: let a smaller draft model propose several tokens ahead, then let the large target model verify those candidates in parallel. If the candidates pass, the system moves forward several tokens at once. If they fail, the target model corrects them. Done properly, the output distribution remains the same; only the route becomes faster [1].
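
For readers who prefer the mechanism in code, here is a minimal sketch of the accept-or-correct rule described in the original speculative decoding paper (Leviathan et al.): accept a drafted token with probability min(1, p/q), and on rejection resample from the renormalized residual max(p - q, 0). The function name `verify_draft` and the dict-based toy distributions are illustrative assumptions, not any library's API. A real system would obtain the target distributions from one batched forward pass.

```python
import random

def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Toy verification of one drafted span (hypothetical helper).

    draft_dists[i] / target_dists[i] map token -> probability at position i.
    target_dists needs one extra entry for the bonus token after a full accept.
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        q = draft_dists[i][tok]               # draft probability of its own proposal
        p = target_dists[i].get(tok, 0.0)     # target probability of the same token
        if rng.random() < min(1.0, p / q):    # accept with probability min(1, p/q)
            output.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0); weights need not sum to 1.
        residual = {t: max(pt - draft_dists[i].get(t, 0.0), 0.0)
                    for t, pt in target_dists[i].items()}
        tokens = list(residual)
        output.append(rng.choices(tokens, weights=[residual[t] for t in tokens])[0])
        return output                         # everything after a rejection is discarded
    # Every drafted token survived: the target contributes one extra token.
    bonus = target_dists[len(draft_tokens)]
    tokens = list(bonus)
    output.append(rng.choices(tokens, weights=[bonus[t] for t in tokens])[0])
    return output

rng = random.Random(0)
draft_dists = [{"a": 0.9, "b": 0.1}, {"c": 0.5, "d": 0.5}]
agreeing_target = [{"a": 0.9, "b": 0.1}, {"c": 0.5, "d": 0.5}, {"e": 1.0}]
disagreeing_target = [{"a": 0.0, "b": 1.0}, {"c": 0.5, "d": 0.5}, {"e": 1.0}]

full_accept = verify_draft(["a", "c"], draft_dists, agreeing_target, rng)
early_reject = verify_draft(["a", "c"], draft_dists, disagreeing_target, rng)
print(full_accept, early_reject)
```

When the two models agree, the whole span plus a bonus token comes out of one verification pass. When the target assigns zero probability to the first drafted token, the span collapses to a single corrected token, which is the expensive case the rest of this article is about.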

The catch is that speculation has standards. A draft model is not rewarded for sounding plausible in private. It is rewarded for producing drafts the target model actually accepts. This is where “Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance” becomes interesting [2]. The paper is not just another “make inference faster” entry in the already crowded acceleration drawer. Its sharper contribution is diagnostic: speculative decoding has been training drafters for one game and deploying them in another.

The expensive mistake is training for the next token when inference rewards the accepted path

The common reader misconception is simple: a better draft model must be the one that predicts the target model’s next token more accurately.

That sounds reasonable. It is also incomplete in exactly the way expensive systems often are. In speculative decoding, the useful unit is not the isolated next token. The useful unit is the accepted span: how many proposed tokens survive verification before the target model has to intervene. Each additional accepted token can save one more serial target-model step. One plausible but rejected branch is just a nicely dressed invoice.
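
A back-of-the-envelope way to see why span length matters: under the simplifying assumption, used in the original speculative decoding analysis, that each drafted token is accepted independently with probability `alpha`, a chain of `gamma` drafted tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per target pass on average. Real acceptance is neither independent nor uniform across positions, and tree drafting changes the arithmetic, so read the sketch below as intuition rather than a forecast.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass for a single draft chain.

    Assumes i.i.d. per-token acceptance probability `alpha` (a simplification)
    and `gamma` drafted tokens, plus the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Small improvements in acceptance rate compound over the drafted span.
for alpha in (0.6, 0.7, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=6), 2))
```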

Traditional draft training usually optimizes token-level likelihood: cross-entropy, KL variants, or related objectives that encourage the draft model to imitate target-model outputs token by token. That training regime quietly assumes a mostly single-path world. It asks: given this prefix, what is the most likely continuation?

Modern speculative decoding does not behave like that. It samples or constructs multiple candidate paths, often in trees, ranks them, prunes them, and verifies selected candidates at the path level. The actual question at inference is closer to: which candidate branches are likely to survive ranking and target-model verification for several steps?

That difference is not cosmetic. The VSD paper reports that, in its diagnostic setup with an EAGLE-3 draft model and LLaMA-3.1-8B, around 30% of training-time greedy paths are pruned during draft-tree construction. The finally accepted path matches the greedy path in only 36% of cases. Even when the greedy path is accepted, its average accepted length is only 3–4 tokens, while alternative high-confidence candidates reach 5–6 tokens [2].
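
How can a greedy path get pruned at all? A toy example makes it concrete. The sketch below expands the top-k tokens at each depth and keeps only the highest cumulative-probability paths. This beam-style pruning is an illustrative assumption, not EAGLE-3's actual tree construction, and the `TOY_DRAFT` probabilities are invented for the example.

```python
import heapq
import math

# Hypothetical draft-model distributions, keyed by prefix.
TOY_DRAFT = {
    (): {"a": 0.40, "b": 0.35, "c": 0.25},
    ("a",): {"x": 0.26, "y": 0.25, "z": 0.25, "w": 0.24},  # flat: no strong continuation
    ("b",): {"x": 0.60, "y": 0.40},                        # peaked: strong continuations
    ("c",): {"x": 0.50, "y": 0.50},
}

def greedy_path(dists, depth):
    """Follow the single most likely token at every step."""
    prefix = ()
    for _ in range(depth):
        dist = dists[prefix]
        prefix += (max(dist, key=dist.get),)
    return prefix

def pruned_tree(dists, depth, top_k, keep):
    """Expand top_k tokens per node, then keep the `keep` best paths by log-probability."""
    frontier = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for prefix, logp in frontier:
            best = heapq.nlargest(top_k, dists[prefix].items(), key=lambda kv: kv[1])
            candidates.extend((prefix + (tok,), logp + math.log(p)) for tok, p in best)
        frontier = heapq.nlargest(keep, candidates, key=lambda c: c[1])
    return [path for path, _ in frontier]

greedy = greedy_path(TOY_DRAFT, depth=2)
kept = pruned_tree(TOY_DRAFT, depth=2, top_k=2, keep=2)
print("greedy path:", greedy)
print("kept paths:", kept)
print("greedy survived pruning:", greedy in kept)
```

The greedy path starts with the most likely first token but runs into a flat continuation, so its cumulative probability falls below both branches that start with the second-best token. A drafter trained only on the greedy path never gets feedback on the branches that actually reach the verifier.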

This is the moment where the paper earns its keep. The problem is not that draft models are stupid. The problem is that they are being graded on a worksheet while deployed in a tournament.

VSD changes the training target from “likely token” to “valid proposal”

Variational Speculative Decoding reframes the draft path as a latent proposal. Instead of treating draft training as next-token imitation, it asks a more operational question: what draft-path distribution maximizes the probability that the target model accepts long useful spans?

The paper formalizes this through a path-level validity probability. Given a context $x$ and a draft path $z$, let $\kappa(x,z)$ represent the probability that the path survives target-model verification. Then the draft model is no longer merely trying to assign high probability to tokens that look target-like. It is trying to place probability mass on proposals that are both plausible under the target model and useful under the verifier.

A simplified version of the objective can be read as:

$$ \mathcal{L}_{\text{VSD}} = \mathbb{E}_{q_\theta(z \mid x)}\left[\log \kappa(x,z)\right] - D_{\mathrm{KL}}\left(q_\theta(z \mid x)\,\|\,p_T(z \mid x)\right) $$

The first term rewards paths that are likely to be accepted. The second term keeps the draft distribution from drifting away from the target distribution. Without the first term, the drafter may learn to imitate target probabilities while still proposing branches that decoding does not use. Without the second term, the drafter could chase acceptance in a distorted way and lose the lossless character that makes speculative decoding attractive in the first place.
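
To make the trade-off between the two terms tangible, here is a toy evaluation of the simplified objective over a path space small enough to enumerate. The two paths, their target probabilities, and their validity scores are invented for illustration; real path spaces cannot be enumerated, which is why the paper needs sampling.

```python
import math

def vsd_objective(q, p_target, kappa):
    """E_q[log kappa] - KL(q || p_target) over an enumerable set of paths (toy setting)."""
    expected_log_validity = sum(q[z] * math.log(kappa[z]) for z in q if q[z] > 0)
    kl = sum(q[z] * math.log(q[z] / p_target[z]) for z in q if q[z] > 0)
    return expected_log_validity - kl

# Hypothetical numbers: the target likes both paths equally,
# but the verifier accepts the alternative far more often.
p_target = {"greedy": 0.5, "alternative": 0.5}
kappa = {"greedy": 0.3, "alternative": 0.9}

imitate = vsd_objective(p_target, p_target, kappa)   # pure imitation: KL term is zero
shifted = vsd_objective({"greedy": 0.25, "alternative": 0.75}, p_target, kappa)
collapsed = vsd_objective({"greedy": 0.01, "alternative": 0.99}, p_target, kappa)
print(round(imitate, 4), round(shifted, 4), round(collapsed, 4))
```

In this toy case the best draft distribution is proportional to p_T(z)·κ(z), which works out to 0.25 and 0.75. Pure imitation leaves acceptance on the table, while collapsing onto the single most acceptable path pays too much in the KL term.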

This is why the paper’s “variational” framing matters. It is not mathematical decoration, though AI papers do enjoy wearing formalwear to breakfast. The ELBO gives the training process a principled way to optimize over latent draft paths that cannot be enumerated directly. It turns acceptance from an after-the-fact metric into part of the learning objective.

EM and MCMC are doing operational cleanup, not theoretical theater

The objective is attractive. It is also intractable if treated literally, because the space of possible draft paths is huge. VSD therefore uses an Expectation–Maximization (EM) style procedure, with MCMC sampling standing in for the paths that cannot be enumerated.

In the E-step, the system samples latent draft proposals and filters them through an oracle-like verification process. The point is to approximate the posterior distribution over draft paths that are likely to be valid under the target model. Rejected paths are not merely thrown into the bin. They help define where the draft model is failing and how correction should occur.

In the M-step, the draft model is updated toward the sampled valid-path distribution. The paper adds two stabilizers: Adaptive Rejection Weighting, which helps manage high-variance updates from rejected proposals, and Confidence-Aware Regularization, which discourages the especially annoying failure mode of being confidently wrong. In speculative decoding, confidently wrong is not charming. It builds draft trees that look strong until the verifier starts saying no.
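
The shape of that loop can be sketched on a toy problem. Loud caveats: the version below swaps the paper's MCMC procedure for plain self-normalized importance sampling, uses a two-path categorical "draft model", and does not model Adaptive Rejection Weighting or Confidence-Aware Regularization at all. Every name and number is a hypothetical stand-in.

```python
import random

def em_iteration(q, p_target, kappa, rng, n_samples=2000, step=0.5):
    """One toy EM round over an enumerable path set (illustrative only)."""
    paths = list(q)
    # E-step: sample proposals from the current draft distribution, then weight each
    # by p_T(z) * kappa(z) / q(z) to approximate the posterior over valid paths.
    proposals = rng.choices(paths, weights=[q[z] for z in paths], k=n_samples)
    weights = {z: 0.0 for z in paths}
    for z in proposals:
        weights[z] += p_target[z] * kappa[z] / q[z]
    total = sum(weights.values())
    posterior = {z: w / total for z, w in weights.items()}
    # M-step: move the draft distribution part of the way toward that posterior.
    return {z: (1.0 - step) * q[z] + step * posterior[z] for z in paths}

p_target = {"greedy": 0.5, "alternative": 0.5}
kappa = {"greedy": 0.3, "alternative": 0.9}   # hypothetical acceptance probabilities
q = dict(p_target)                            # start from pure imitation
rng = random.Random(0)
for _ in range(20):
    q = em_iteration(q, p_target, kappa, rng)
print({z: round(prob, 3) for z, prob in q.items()})
```

The draft distribution drifts from 50/50 toward roughly 75% on the path the verifier prefers, without abandoning the other path entirely. The real method has to do the same thing over neural networks and astronomically large path spaces, which is where the stabilizers earn their names.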

The mechanism can be summarized this way:

| VSD component | What it directly changes | Operational meaning | Boundary |
| --- | --- | --- | --- |
| Path-level validity | Rewards draft paths likely to survive verification | Optimizes accepted span, not just token imitation | Requires access to target-model verification signals |
| Variational objective | Balances accepted proposals with target-distribution alignment | Keeps speedup training tied to lossless decoding | Depends on how well the posterior is approximated |
| EM with MCMC samples | Searches over latent proposal paths | Learns from the distribution decoding actually sees | Adds training complexity |
| Adaptive Rejection Weighting | Uses rejected proposals more efficiently | Reduces variance and improves sample efficiency | Not a replacement for good draft architecture |
| Confidence-Aware Regularization | Penalizes overconfident invalid paths | Prevents brittle draft trees | Helps stability, not universal robustness |

The key practical point is that VSD does not ask operators to replace the entire speculative-decoding stack. It can be layered onto strong draft systems such as EAGLE-3, MSD, or ViSpec. That makes the result more relevant than a clean-room method that only works after rebuilding everything from scratch. AI infrastructure teams have enough rebuild invitations. Most of them should be politely ignored.

The evidence says VSD is an incremental gain over strong baselines, which is exactly the point

The paper evaluates VSD across text LLMs and multimodal LLMs. For language models, it uses benchmarks including MT-Bench, HumanEval, and GSM8K, with models such as LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-13B, and DeepSeek-R1-Distill-LLaMA-8B. For multimodal settings, it tests LLaVA-1.5 variants across visual question answering, chart/document reasoning, and hallucination-oriented benchmarks.

The headline is not “VSD makes models ten times faster.” Good. That headline would deserve suspicion and probably a nap.

The more useful result is that VSD improves already strong speculative-decoding baselines. On LLM benchmarks, the paper reports that VSD improves EAGLE-3’s wall-clock speedup by an average of 9.6% under greedy decoding and 7.3% under stochastic decoding. It also reports that VSD accepts roughly 6–7 tokens per draft-verification cycle, compared with 5–6 for EAGLE-3 [2]. EAGLE-3 itself is already a serious baseline, using direct token prediction and multi-layer feature fusion to scale speculative inference acceleration [3].

For multimodal models, the gains are similarly incremental but consistent. On LLaVA-1.5-13B at temperature 0, adding VSD to MSD improves average speedup by 10.1% and acceptance length by 11.9%; adding it to ViSpec improves speedup by 8.5% and acceptance length by 8.8%. Under stochastic sampling, the gains are smaller but still positive in the reported settings [2].

The ablations are also important. Increasing the number of latent proposals improves speedup ratio and acceptance length. Adding Adaptive Rejection Weighting helps. Adding Confidence-Aware Regularization helps further. These are not separate theses. They support the same mechanism: if the model is trained using samples closer to the distribution that decoding actually uses, and if rejected proposals are handled without destabilizing learning, the accepted span improves.

| Result type | What the paper shows | How to interpret it | What it does not prove |
| --- | --- | --- | --- |
| Greedy-path diagnostic | Greedy training paths often do not become accepted inference paths | Token-level imitation is misaligned with tree-based verification | Greedy training is useless in all settings |
| LLM benchmark gains | VSD improves EAGLE-3 speedup by 9.6% greedy and 7.3% stochastic on average | Objective alignment can still extract gains from strong baselines | Every deployment will see the same percentage gain |
| MLLM benchmark gains | VSD improves MSD and ViSpec on LLaVA-1.5 benchmarks | The path-level objective transfers beyond text-only decoding | Speech, video, and tool-use agents are covered |
| Ablation results | More latent proposals, ARW, and CAR each support performance | The training procedure matters, not only the final formula | More sampling is always economically optimal |

The magnitude matters. A 7–10% speedup over a strong baseline can be trivial for a hobby deployment and material for a platform serving millions of requests. Infrastructure economics has this irritating habit: small percentages become large invoices when multiplied by enough traffic.

The business value is acceptance analytics, not just a faster decoder

For business readers, the wrong lesson is “use VSD.” The better lesson is: measure the acceptance behavior of your inference system, not just the raw quality of the draft model.

Speculative decoding lives or dies by accepted spans, rejection patterns, and wall-clock behavior under real workloads. A draft model with slightly worse token-level likelihood may be operationally better if it produces candidate paths that survive verification longer. Conversely, a draft model that looks strong under offline imitation metrics may waste compute if its best-looking branches are pruned before the target model benefits.

That changes what an AI infrastructure team should monitor:

| Metric | Why it matters |
| --- | --- |
| Average acceptance length | Directly links draft quality to fewer target-model decoding steps |
| Branch survival rate | Shows whether tree construction is preserving useful candidates |
| Greedy-path usage rate | Diagnoses whether token-level training matches inference behavior |
| Rejection location | Reveals whether failures happen early, making drafts mostly wasted |
| Speedup by workload type | Separates real serving gains from benchmark decoration |
| Training overhead vs serving savings | Determines whether VSD-style training is economically justified |
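
Several of these metrics fall out of a simple log of draft-verification cycles. The sketch below assumes a hypothetical record format with one dict per cycle (`workload`, `drafted`, `accepted`, `greedy_accepted`); no serving framework is known to emit exactly this, so the field names would need mapping to whatever your stack actually records.

```python
from collections import Counter
from statistics import mean

def acceptance_report(cycles):
    """Summarize acceptance behavior from per-cycle records (hypothetical schema)."""
    by_workload = {}
    for c in cycles:
        by_workload.setdefault(c["workload"], []).append(c["accepted"])
    return {
        # Tokens accepted per draft-verification cycle.
        "avg_acceptance_length": mean([c["accepted"] for c in cycles]),
        # Share of cycles where the accepted path was the draft's greedy path.
        "greedy_path_usage_rate": mean([1.0 if c["greedy_accepted"] else 0.0 for c in cycles]),
        # Position of the first rejected token; low positions mean mostly wasted drafts.
        "rejection_position_counts": Counter(
            c["accepted"] for c in cycles if c["accepted"] < c["drafted"]
        ),
        "avg_acceptance_by_workload": {w: mean(v) for w, v in by_workload.items()},
    }

# Invented sample records for illustration.
cycles = [
    {"workload": "chat", "drafted": 6, "accepted": 6, "greedy_accepted": True},
    {"workload": "chat", "drafted": 6, "accepted": 2, "greedy_accepted": False},
    {"workload": "code", "drafted": 6, "accepted": 4, "greedy_accepted": False},
    {"workload": "code", "drafted": 6, "accepted": 0, "greedy_accepted": False},
]
report = acceptance_report(cycles)
for name, value in report.items():
    print(name, value)
```

Splitting by workload is the part worth keeping even if everything else changes: an average that looks healthy overall can hide a traffic category where most drafts die at position zero.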

This is also where VSD connects to a broader trend in draft-model training. Recent work such as Draft-OPD frames the problem as an offline-to-inference mismatch: supervised fine-tuning on fixed target-generated trajectories can plateau because the draft model is evaluated on states induced by its own proposals [4]. VSD attacks a related but distinct mismatch: not merely whether the draft model sees its own states, but whether training optimizes the path distribution that decoding and verification actually reward.

The operational implication is blunt: if your serving stack uses ranking, pruning, verification, or multi-path proposal selection, then offline token likelihood is an incomplete KPI. It may still be useful. It is just not the scoreboard.

VSD is most relevant when speculative decoding is already worth engineering carefully

Speculative decoding is not the only way to attack inference latency. Medusa, for example, adds multiple decoding heads to predict future tokens in parallel and uses tree-based attention to verify candidates, reducing reliance on a separate external draft model [5]. Lookahead decoding explores exact parallel decoding without auxiliary draft models, trading computation within steps for fewer serial steps [6]. These alternatives matter because the best system design depends on constraints: model ownership, serving framework, hardware, latency target, batch size, sampling behavior, and tolerance for additional training.

VSD makes the most sense when four conditions hold.

First, the organization already has enough inference volume for marginal speedups to matter. If the monthly serving bill is modest, a 7–10% improvement over a strong baseline may not justify new training complexity. The CFO will survive. The research team may need tea.

Second, the team has access to the target model and verification process. VSD is built around target acceptance behavior. If the target model is a closed API with limited observability, the method becomes harder to apply directly.

Third, the current bottleneck is serial decoding rather than unrelated serving overhead. If queueing, networking, retrieval latency, or application orchestration dominate response time, improving accepted draft length will not magically fix the system. The GPU cannot optimize your product manager’s middleware.
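
This third condition is Amdahl's law in a serving costume. A minimal sketch, assuming response time splits cleanly into a decoding share and everything else (it rarely does, once batching and overlap are involved), shows how quickly a decode-side gain gets diluted. The 40% decode share is an invented figure; the 1.096 factor treats the reported 9.6% average improvement as a decode speedup over your current stack, which is itself an assumption.

```python
def end_to_end_speedup(decode_fraction: float, decode_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the decoding share gets faster."""
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0.0:
        raise ValueError("decode_speedup must be positive")
    return 1.0 / ((1.0 - decode_fraction) + decode_fraction / decode_speedup)

# Hypothetical: decoding is 40% of response time, the rest is queueing,
# retrieval, and orchestration that a better drafter cannot touch.
overall = end_to_end_speedup(decode_fraction=0.40, decode_speedup=1.096)
print(f"end-to-end speedup: {overall:.3f}x")
```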

Fourth, the team can evaluate on real workload distributions. Benchmarks such as MT-Bench, HumanEval, GSM8K, VQAv2, and ChartQA are useful, but production traffic has its own grammar: short chats, long coding sessions, retrieval-heavy answers, structured outputs, multilingual prompts, or multimodal documents. Acceptance behavior can shift across these categories.

The boundary is not quality loss; it is deployment realism

One attractive property of speculative decoding is that, under the correct verification scheme, acceleration can be lossless with respect to the target output distribution. That is why the field has remained commercially relevant instead of becoming another “faster but worse” trick. The original speculative decoding work emphasized parallel token generation without altering outputs, and later systems have preserved that ambition while changing the drafting architecture [1].

VSD does not remove the need for careful verification. It improves the draft training objective so proposals are more likely to be accepted. The preservation of output distribution still depends on the speculative decoding mechanism doing its job.

The paper’s limitations are practical rather than fatal. The authors note computational constraints in scaling the number of latent proposals for MCMC estimation. Experiments focus on text and visual tasks, leaving other modalities such as speech for future work. The reported gains are measured under specific models, hardware, benchmarks, and baseline implementations. That does not weaken the paper’s core argument, but it does define where the evidence stops.

For deployment, the remaining uncertainty is economic: how much additional training and system complexity is justified by incremental speedup? The answer is not universal. It depends on serving volume, target-model cost, latency sensitivity, engineering capacity, and whether the organization already has a mature speculative-decoding pipeline.
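
That question reduces to a break-even calculation once you are willing to write down numbers. The sketch below assumes serving cost scales inversely with decode throughput and that the speedup measured offline holds in production; both are simplifications, and every figure in the example is hypothetical rather than taken from the paper.

```python
def breakeven_months(training_cost: float, monthly_decode_cost: float,
                     relative_speedup: float) -> float:
    """Months until one-off training cost is repaid by cheaper decoding.

    Assumes cost is proportional to decode time, so a relative speedup s
    cuts the monthly decode bill by a factor of 1 / (1 + s).
    """
    monthly_saving = monthly_decode_cost * (1.0 - 1.0 / (1.0 + relative_speedup))
    if monthly_saving <= 0.0:
        return float("inf")  # no saving, no payback
    return training_cost / monthly_saving

# Hypothetical deployments: same training bill, very different serving volume.
for monthly_bill in (5_000, 50_000, 500_000):
    months = breakeven_months(training_cost=20_000,
                              monthly_decode_cost=monthly_bill,
                              relative_speedup=0.08)
    print(f"monthly decode bill {monthly_bill:>7}: break-even in {months:.1f} months")
```

The same training bill that never pays back for a small deployment can be recovered within weeks at scale, which is the whole argument for measuring before adopting.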

The lesson is to train for the verifier, not for the mirror

The most useful idea in VSD is not the acronym. Acronyms are cheap; accepted tokens are not.

The paper’s contribution is to make speculative decoding less speculative in the bad sense. Instead of training a draft model to imitate the target model in a token-by-token mirror exercise, VSD trains it toward the distribution of paths that the verifier actually accepts. The difference sounds subtle until you remember that inference systems do not pay for plausible guesses. They pay for useful progress.

For AI businesses, this is a familiar pattern. As models become components inside larger systems, local accuracy metrics become less reliable. A planner is not useful because each step looks plausible. A retrieval system is not useful because each document is semantically close. A draft model is not useful because each token has high likelihood. The system rewards what survives downstream selection.

Speculation is allowed. But in production, it needs standards.

Cognaptus: Automate the Present, Incubate the Future.


1. Yaniv Leviathan, Matan Kalman, and Yossi Matias, “Fast Inference from Transformers via Speculative Decoding,” arXiv:2211.17192, 2022, https://arxiv.org/abs/2211.17192

2. Xiandong Zou, Jianshu Li, Jing Huang, and Pan Zhou, “Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance,” arXiv:2602.05774, 2026, https://arxiv.org/html/2602.05774v1

3. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang, “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test,” arXiv:2503.01840, 2025, https://arxiv.org/abs/2503.01840

4. Haodi Lei et al., “Draft-OPD: On-Policy Distillation for Speculative Draft Models,” arXiv:2605.29343, 2026, https://arxiv.org/abs/2605.29343

5. Tianle Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” arXiv:2401.10774, 2024, https://arxiv.org/abs/2401.10774

6. Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang, “Break the Sequential Dependency of LLM Inference Using Lookahead Decoding,” arXiv:2402.02057, 2024, https://arxiv.org/abs/2402.02057