A demo can survive a slow answer. A production service cannot survive the slow answer that arrives just often enough to make users stop trusting the product.
That is the quiet problem behind p99 latency. The average response time tells you how the service feels on a normal day. p99 tells you what happens to the unlucky one percent: the support agent waiting in front of a customer, the analyst refreshing a dashboard, the employee whose workflow now includes watching a spinner and reconsidering their life choices.
The paper behind this article, “Improving LLM performance through black-box online tuning: a case for adding system specs to Factsheets for Trusted AI” [1], proposes SLO-Tuner, a black-box online controller for LLM serving systems. Its main technical move is simple enough to sound almost rude: stop optimizing generic throughput or average latency, and tune the serving stack toward goodput, the rate of requests that finish within a p99 service-level objective.
That distinction matters because modern LLM serving is full of knobs that look innocent in isolation. More concurrency can keep the GPU busier. Larger batches can amortize work. Speculative decoding can reduce average decoding cost when a smaller draft model guesses well. Each of these statements can be true, and still the production service can become worse.
The tail wags the model because the user does not experience your mean latency. The user experiences their own request.
The serving problem is not “make the model faster”; it is “stay on the right side of the knee”
The paper frames LLM serving as a control problem over a small number of operational knobs: client concurrency, batching limits, and speculative decoding parameters. These knobs interact through queueing.
At low pressure, increasing concurrency or batch size can improve utilization. The GPU stops waiting around, more requests complete per second, and the dashboard looks healthier. Past a certain point, however, waiting time grows sharply. The system reaches the knee of the curve: a small increase in load or batch formation creates a disproportionate increase in tail latency.
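The knee shows up even in the simplest queueing model. The sketch below uses the textbook M/M/1 result for mean time in system, W = 1 / (mu - lambda). A batched LLM server is not an M/M/1 queue, and the service rate `mu = 10.0` is a hypothetical value, so this illustrates the shape of the curve rather than anything measured in the paper.

```python
def mm1_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable when arrival_rate >= service_rate")
    return 1.0 / (service_rate - arrival_rate)


mu = 10.0  # hypothetical service rate in requests per second
for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    delay = mm1_time_in_system(utilization * mu, mu)
    print(f"utilization {utilization:.2f} -> mean time in system {delay:.2f} s")
```

In this toy model, moving from 50% to 90% utilization multiplies the mean delay by five, and the last stretch from 90% to 99% multiplies it by ten again. Tail percentiles behave worse than the mean, which is why the knee is felt at p99 first.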
The authors therefore use a p99 service-level objective, such as p99 below 1.2 seconds, as the constraint that matters. They define goodput as the rate of requests that satisfy that objective. A request that finishes too late may count in raw throughput, but it contributes nothing to goodput.
That is the first useful discipline in the paper. It does not ask, “How many requests can the system emit?” It asks, “How many requests can the system emit while still behaving like the service users were promised?”
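A minimal sketch of that accounting, assuming goodput is counted per request (a request is on time if its own latency is within the SLO threshold) and using the nearest-rank percentile definition. The function names and the sample latencies are hypothetical; the paper may compute these quantities differently.

```python
def percentile(sorted_values: list[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (q * len(sorted_values) + 99) // 100  # ceil(q * n / 100) in integers
    return sorted_values[max(rank, 1) - 1]


def segment_metrics(latencies_s: list[float], duration_s: float, slo_s: float = 1.2) -> dict:
    """Summarize one measurement segment of end-to-end latencies."""
    ordered = sorted(latencies_s)
    on_time = sum(1 for latency in ordered if latency <= slo_s)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "throughput_rps": len(ordered) / duration_s,  # every completion counts
        "goodput_rps": on_time / duration_s,          # only on-time completions count
    }


# Hypothetical segment: ten requests completed in two seconds.
sample = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 2.0]
print(segment_metrics(sample, duration_s=2.0))
```

In this made-up segment, raw throughput is 5 requests per second while goodput is 4, because two completions arrived after the 1.2-second threshold.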
| Operational metric | What it rewards | What it can hide |
|---|---|---|
| Mean latency | Typical request speed | A small but painful set of extreme delays |
| Raw throughput | Total completions per second | Requests that complete too late to be useful |
| GPU utilization | Hardware busyness | Queueing pressure and user-facing delay |
| Goodput under p99 SLO | Requests completed within the latency promise | Less useful if the SLO itself is poorly chosen |
The slightly unglamorous lesson: a highly utilized GPU can still be running a bad service. Infrastructure dashboards are very good at congratulating expensive hardware for being busy. Users are less sentimental.
SLO-Tuner treats the LLM server as a black box, which is exactly the point
SLO-Tuner does not require internal scheduler instrumentation. It uses end-to-end latency measurements over short segments, computes p50, p95, p99, and goodput, then performs a deterministic hill-climb over nearby configurations.
The live vLLM setup tunes three deployable controls: concurrency, max_num_seqs, and speculative token width. The simulator exposes a more factorized view of speculation, including draft width and verifier cadence, because simulation can separate dynamics that a real serving stack may not expose cleanly. The controller logic remains the same; only the adapter to the serving stack changes.
That design choice is more important than it first appears. Many companies do not operate a research-grade inference stack with deep custom instrumentation. They run vLLM, TGI, Triton, MLX, or managed serving systems, then discover that performance tuning lives somewhere between YAML archaeology and prayer. A black-box tuner is not theoretically pure, but it is operationally plausible.
The scoring logic is also intentionally biased against violating the SLO. In the vLLM experiments, p99 violations receive a penalty large enough to dominate small gains from hardware intensity. This is not pretending that utilization does not matter. It is saying that utilization is subordinate to the service promise.
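One way to express that bias is sketched below. The article does not reproduce the paper's exact scoring function, so the linear penalty form and the weight `penalty_per_s` are assumptions chosen only to show how a violation can dominate the score.

```python
def slo_score(goodput_rps: float, p99_s: float, slo_s: float = 1.2,
              penalty_per_s: float = 100.0) -> float:
    """Goodput minus a heavy penalty for every second of p99 above the SLO.

    penalty_per_s is a hypothetical weight, set large so that any violation
    outweighs small goodput or utilization gains.
    """
    violation_s = max(0.0, p99_s - slo_s)
    return goodput_rps - penalty_per_s * violation_s


# Reported vLLM numbers from the article: baseline vs. tuned configuration.
print(slo_score(goodput_rps=8.1, p99_s=1.36))   # penalized: p99 is above 1.2 s
print(slo_score(goodput_rps=15.0, p99_s=0.70))  # no penalty: score equals goodput
```

With this weighting, a configuration inside the SLO is scored purely on goodput, while a configuration 0.16 seconds over the SLO loses far more than it could plausibly gain elsewhere.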
In plain terms, the controller says:
- Run the server at the current configuration for a short measurement segment.
- Measure end-to-end latency and goodput.
- Try neighboring configurations.
- Move only if the score improves, especially if p99 moves back toward the SLO.
- Keep the best configuration and re-measure it at the end.
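The steps above can be sketched as a short deterministic hill climb. This is an illustration, not the paper's implementation: the unit step size, the knob bounds, and `toy_score` (a synthetic stand-in for running a real measurement segment) are all assumptions.

```python
from typing import Callable, Iterator

Config = tuple[int, ...]          # e.g. (concurrency, max_num_seqs, spec_width)
Bounds = list[tuple[int, int]]    # inclusive (low, high) per knob


def neighbors(config: Config, bounds: Bounds) -> Iterator[Config]:
    """Yield configs that change exactly one knob by one step, within bounds."""
    for i, (low, high) in enumerate(bounds):
        for delta in (-1, 1):
            value = config[i] + delta
            if low <= value <= high:
                yield config[:i] + (value,) + config[i + 1:]


def hill_climb(start: Config, bounds: Bounds,
               measure: Callable[[Config], float],
               max_rounds: int = 50) -> tuple[Config, float]:
    """Move to the best-scoring neighbor until no neighbor improves."""
    best, best_score = start, measure(start)
    for _ in range(max_rounds):
        candidates = [(measure(c), c) for c in neighbors(best, bounds)]
        if not candidates:
            break
        top_score, top = max(candidates)
        if top_score <= best_score:
            break  # no neighbor improves: stop at this (possibly local) optimum
        best, best_score = top, top_score
    return best, measure(best)  # re-measure the winner at the end


def toy_score(config: Config) -> float:
    """Hypothetical smooth surface with its peak at (6, 12, 0); not real data."""
    concurrency, max_num_seqs, spec_width = config
    return float(-(concurrency - 6) ** 2 - (max_num_seqs - 12) ** 2 - spec_width)


bounds = [(1, 16), (1, 16), (0, 16)]
best, score = hill_climb((8, 16, 8), bounds, toy_score)
print(best, score)
```

In a live setting `measure` would run one segment against the server and return a noisy score, so the final re-measurement matters more than it does on this noise-free toy surface.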
This is not a magic optimizer. It is a bounded, deployment-friendly search procedure. That is also why the paper’s limitations matter: first-order hill climbing can get trapped in local optima, and the authors do not compare it against random search or Bayesian optimization under the same segment budget. Still, for a production engineer choosing between “try a few nearby settings with a measurable objective” and “leave speculative decoding at whatever looked reasonable last Tuesday,” the former has charm.
Speculative decoding is not a free lunch; it is another queueing participant
The most useful correction in the paper concerns speculative decoding.
The common simplified story is that speculative decoding accelerates generation. A smaller draft model proposes tokens, the larger target model verifies them, and accepted tokens reduce the amount of expensive decoding work. That story is not wrong. It is merely incomplete in the way many comforting infrastructure stories are incomplete.
Speculation also introduces verification work, variance, and rejection risk. If the draft model proposes poorly, or if the workload and SLO make extra verification costly at the tail, wider speculation can damage p99 latency. The paper’s vLLM results on TinyLlama show exactly that pattern.
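A standard simplified model from the speculative decoding literature makes the trade-off concrete. If each drafted token is accepted independently with probability alpha, a draft of width gamma yields (1 - alpha^(gamma + 1)) / (1 - alpha) tokens per verification step in expectation. The sketch below evaluates that formula; the acceptance rates are hypothetical, and the independence assumption is an idealization that the paper does not claim to measure.

```python
def expected_tokens_per_step(acceptance_rate: float, width: int) -> float:
    """Expected tokens produced per verification step under i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    if width < 0:
        raise ValueError("width must be non-negative")
    return (1.0 - acceptance_rate ** (width + 1)) / (1.0 - acceptance_rate)


for alpha in (0.5, 0.8):  # hypothetical acceptance rates
    for width in (0, 4, 8, 16):
        tokens = expected_tokens_per_step(alpha, width)
        print(f"alpha={alpha} width={width:2d} -> {tokens:.2f} tokens per step")
```

At an acceptance rate of 0.5 the expected yield can never exceed 2 tokens per step, however wide the draft, while every additional drafted token still costs draft and verification work. That asymmetry is one plausible reason wider speculation can hurt the tail.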
The headline live result is striking: the default-like baseline, using speculative width 8 with speculation enabled, reports p99 of 1.36 seconds and goodput of 8.1 requests per second under a 1.2-second p99 SLO. The tuned setting reduces speculative width to 0, effectively disabling speculation in this setup, and reaches p99 of 0.70 seconds with 15.0 requests per second goodput.
That is not just a latency improvement. It is a change in the operating story. The supposedly accelerative feature became the setting the controller had to shrink away from.
The ablation over speculative width reinforces the mechanism. With speculation off, the paper reports 13.87 rps at p99 0.77 seconds. At width 8, goodput falls to 9.77 rps and p99 reaches 1.21 seconds, essentially at the SLO boundary. At width 16, goodput falls further to 4.53 rps while p99 rises to 2.00 seconds.
| Setting tested in vLLM | Reported p99 | Reported goodput | Interpretation |
|---|---|---|---|
| Baseline: spec width 8, spec on | 1.36 s | 8.1 rps | Violates the 1.2 s p99 SLO despite looking like a reasonable default-like setup |
| Tuned: spec width 0 | 0.70 s | 15.0 rps | Escapes the tail-latency penalty by effectively disabling speculation |
| Speculation off in ablation | 0.77 s | 13.87 rps | Best region for this workload and SLO |
| Spec width 16 in ablation | 2.00 s | 4.53 rps | Wider speculation becomes actively harmful |
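The selection rule implied by these numbers is simple: discard anything that violates the SLO, then maximize goodput. The sketch below applies it to the three speculative-width ablation points reported above; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AblationPoint:
    spec_width: int
    p99_s: float
    goodput_rps: float


# Speculative-width ablation values as reported in the article.
REPORTED = [
    AblationPoint(spec_width=0, p99_s=0.77, goodput_rps=13.87),
    AblationPoint(spec_width=8, p99_s=1.21, goodput_rps=9.77),
    AblationPoint(spec_width=16, p99_s=2.00, goodput_rps=4.53),
]


def best_within_slo(points: list[AblationPoint], slo_s: float = 1.2) -> Optional[AblationPoint]:
    """Return the highest-goodput point whose p99 meets the SLO, or None."""
    feasible = [p for p in points if p.p99_s <= slo_s]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.goodput_rps)


print(best_within_slo(REPORTED))
```

Width 8 misses the 1.2-second boundary by a hundredth of a second, so under a strict reading only the speculation-off point is feasible at all.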
This is the misconception worth killing gently and burying without ceremony: speculative decoding is not “on equals faster.” It is a runtime control surface. Its value depends on model pair, acceptance behavior, workload, batching, concurrency, hardware, and the SLO being enforced.
The paper does not show that speculative decoding is generally bad. It shows something more operationally useful: speculative decoding can be bad under a specific latency objective, and therefore must be tuned jointly with the rest of the serving stack.
The ablations show three different ways to lose the tail
The paper’s vLLM ablations are not a second thesis. They are diagnostic tests. Their purpose is to isolate how each knob shapes the p99-goodput surface around the target SLO.
Concurrency shows the classic elbow. Increasing concurrency from 2 to 8 raises goodput from 2.6 rps to 9.2 rps, while p99 moves from 1.23 seconds to 1.31 seconds. That is already above the 1.2-second target, but the real collapse comes later: beyond 10 threads, p99 exceeds 1.6 seconds, and at concurrency 16 goodput collapses to 0.27 rps. Raw throughput may continue to rise modestly, but most requests miss the SLO, so they do not count as useful service.
Batch size behaves differently. Raising max_num_seqs from 4 to 16 increases goodput from 0.40 rps to 10.77 rps and lowers p99 from 2.00 seconds to 1.20 seconds. Here, larger batches help up to the edge of the SLO because very small batches underutilize the server. But the paper identifies the queueing knee around 11 to 13 sequences, where p99 crosses the 1.2-second boundary and goodput begins to plateau or degrade.
Speculative width, in this workload, is the bluntest warning. Wider drafts monotonically harm the p99 objective.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| vLLM online tuning trajectory | Main live evidence | A black-box hill-climb can improve p99 and goodput on a real vLLM setup | General performance across larger models, traffic mixes, and many seeds |
| Concurrency ablation | Knob-level diagnostic | More in-flight requests can push the system past a p99 knee | That one concurrency value is universal |
| Batch-size ablation | Knob-level diagnostic | Moderate batching improves goodput, but the SLO boundary matters | That larger batches always help up to 16 in other workloads |
| Speculative-width ablation | Knob-level diagnostic | Speculation can inflate p99 and reduce SLO-satisfying goodput | That speculative decoding is generally harmful |
| Simulator stress tests | Robustness and exploration | The same qualitative trade-offs appear under steady and bursty regimes | Accurate absolute latency prediction for real serving stacks |
| MLX portability check | Exploratory extension | Directional trends can transfer beyond the NVIDIA/vLLM setup | Full portability or production readiness across stacks |
This table is also a useful guardrail for business readers. The evidence is not “SLO-Tuner solves LLM serving.” The evidence is “SLO-aware black-box tuning can detect and avoid bad operating regions that average-performance thinking would miss.”
That is enough to matter.
The simulator is useful because it is wrong in the right way
The paper includes a discrete-event simulator for queueing, batching, and speculative decoding dynamics. It is calibrated from timing parameters, adds light variability, and can test steady or bursty arrivals more cheaply than repeated live GPU runs.
The simulator is not meant to predict kernel-level timing exactly, and its outputs should not be read as if it did. Its job is not to replace live measurement; its job is to map the rough shape of the terrain before the online controller steps on the real hardware.
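To give a sense of what such a tool looks like, here is a deliberately small discrete-event sketch of a single server that forms FIFO batches. It is not the paper's simulator: it omits speculation and bursty arrivals, and the timing parameters `base_s` and `per_seq_s` are hypothetical rather than calibrated.

```python
import random


def simulate_batched_server(arrival_rate_rps: float, max_batch: int,
                            n_requests: int = 2000, base_s: float = 0.05,
                            per_seq_s: float = 0.01, seed: int = 0) -> list[float]:
    """Return per-request latencies for one server that serves FIFO batches."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate_rps)  # Poisson arrival process
        arrivals.append(t)

    latencies, clock, i = [], 0.0, 0
    while i < n_requests:
        clock = max(clock, arrivals[i])  # idle until the next request arrives
        j = i
        # Batch everything already waiting, up to the batch cap.
        while j < n_requests and j - i < max_batch and arrivals[j] <= clock:
            j += 1
        clock += base_s + per_seq_s * (j - i)  # batch cost grows with its size
        latencies.extend(clock - arrivals[k] for k in range(i, j))
        i = j
    return latencies


def p99(latencies: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(latencies)
    return ordered[(99 * len(ordered) + 99) // 100 - 1]


for rate, cap in [(5.0, 8), (30.0, 1), (30.0, 8), (200.0, 8)]:
    lat = simulate_batched_server(rate, cap)
    print(f"rate={rate:5.1f} rps, max_batch={cap}: p99={p99(lat):.2f} s")
```

Even this toy reproduces two qualitative effects from the article: a batch cap that is too small cannot keep up with load that a larger cap handles comfortably, and pushing arrivals past capacity sends p99 up sharply.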
In the simulator experiments, batching is essential but dangerous at extremes. Under steady arrivals, some moderate-batch configurations approach the 1.2-second p99 boundary while preserving goodput, whereas too-small or too-aggressive settings sit away from the useful frontier. Under bursty arrivals, the safe region shrinks because bursts add queueing pressure. The hill-climb trajectories move toward smaller speculative settings and moderate batches, keeping EMA-smoothed p99 near the SLO.
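EMA smoothing is a small detail that matters when each segment yields a noisy p99. A minimal sketch follows; the smoothing factor `alpha` and the sample p99 series are hypothetical, since the article does not state the values the paper uses.

```python
def ema(values: list[float], alpha: float = 0.3) -> list[float]:
    """Exponential moving average; the first value seeds the average."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: list[float] = []
    current = None
    for value in values:
        current = value if current is None else alpha * value + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical per-segment p99 readings with one noisy spike.
p99_per_segment = [1.10, 1.15, 1.90, 1.12, 1.18]
print([round(x, 3) for x in ema(p99_per_segment)])
```

A smoothed signal keeps one unlucky segment from dragging the controller away from a configuration that is otherwise stable, at the cost of reacting more slowly to a real shift.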
The stress tests and ablations serve a different role from the live vLLM run. They ask: if prompts get longer or arrivals become more bursty, does the basic warning still hold? The answer is directionally yes. Aggressive fixed settings that look attractive under benign load can push p99 toward or beyond the SLO while providing only marginal goodput gains.
The MLX check on Apple Silicon is even more modest. It compares p99 trends for concurrency, speculative lookahead, and a draft-width proxy using Qwen models. The paper reports a clear calibration offset in absolute latency, but similar directional responses: more concurrency degrades p99, and more aggressive speculation can inflate tail latency.
That makes the simulator a screening tool, not an oracle. A good simulator tells you where not to waste expensive experiments. It does not earn the right to make final deployment decisions while the real server is sitting there with measurable latency.
The business implication is a serving policy, not a slogan about Responsible AI
The paper’s final move is to connect system performance to Factsheets for Trusted AI. This is reasonable, but only if we keep the chain of reasoning concrete.
The paper directly shows that LLM serving performance can change materially under different runtime configurations, and that p99-aware tuning can improve SLO-satisfying goodput in the tested setup. It also argues that system performance and sustainability metrics should become part of AI factsheets, because responsible adoption is not only about model behavior in the abstract. A model that is accurate in a benchmark but unreliable at the tail may still fail users in deployment.
Cognaptus would translate that into a practical procurement and operations checklist:
| Governance question | Operational metric to request | Why it belongs in an AI factsheet |
|---|---|---|
| Can the service meet user-facing latency promises? | p50, p95, p99 under stated workload and SLO | Average latency hides tail pain |
| Does throughput remain useful under load? | Goodput, not only raw throughput | Late completions should not be counted as success |
| Are acceleration features tuned or merely enabled? | Speculative decoding settings and SLO impact | “Enabled” can mean “faster” or “quietly worse” |
| Can the system adapt to workload drift? | Tuning cadence, canary policy, rollback criteria | Static settings decay as traffic changes |
| Is hardware efficiency measured responsibly? | Goodput per GPU-hour or energy proxy under SLO | Sustainability claims need service-quality context |
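The last row reduces to simple arithmetic once goodput is measured. The sketch below converts goodput into on-time requests per GPU-hour using the reported single-GPU vLLM numbers; the function name is hypothetical, and an energy-based variant would need power measurements the article does not provide.

```python
def on_time_requests_per_gpu_hour(goodput_rps: float, num_gpus: int = 1) -> float:
    """SLO-satisfying requests delivered per GPU-hour."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return goodput_rps * 3600.0 / num_gpus


# Reported single-GPU results: baseline vs. tuned configuration.
baseline = on_time_requests_per_gpu_hour(8.1)
tuned = on_time_requests_per_gpu_hour(15.0)
print(f"baseline: {baseline:,.0f} on-time requests per GPU-hour")
print(f"tuned:    {tuned:,.0f} on-time requests per GPU-hour")
```

On the same hardware, that is roughly 29,160 versus 54,000 on-time requests per GPU-hour, which is the kind of figure a factsheet could state alongside the workload and SLO it was measured under.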
The inferred business practice is straightforward: before launching or scaling an LLM feature, define the p99 SLO, measure goodput under realistic traffic, tune concurrency, batching, and speculation against that SLO, then publish the tested operating envelope internally or externally.
This does not require every company to implement SLO-Tuner exactly as written. The deeper requirement is to stop treating model serving as a fixed backend detail after model selection. In many LLM products, the model is only half the product. The serving policy is the other half, and occasionally the half that users actually notice.
Where this evidence stops
The paper is preliminary, and the boundaries are not decorative. They shape how the result should be used.
The live vLLM evidence is based mainly on TinyLlama-1.1B served on a single NVIDIA L40S GPU, with one synthetic prompt, fixed output cap, steady closed-loop concurrency, and short measurement windows. The portability check uses smaller Qwen models on MLX. These are useful controlled experiments, but they are not production traffic.
Real deployments include mixed prompt lengths, variable output lengths, tool calls, retrieval latency, safety filters, user bursts, retries, caching, rate limits, and multi-tenant interference. Multi-GPU scheduling and cluster-level admission control can also move the tail in ways a single-node experiment cannot capture.
The search method is bounded hill climbing. That gives predictable tuning cost, but it can miss better configurations on a non-convex surface. The authors also report a single vLLM tuning trajectory rather than a distribution across seeds or starting points, and they do not benchmark against random search, Bayesian optimization, or bandit methods under the same measurement budget.
There is also a practical implementation issue: the vLLM setup restarts the server for each segment because the relevant flags are not safely hot-swappable. The paper estimates the full live tuning run at roughly 40 to 45 minutes of single-GPU time, dominated by server restarts. That may be acceptable for deployment-time tuning, off-peak calibration, or canary replicas. It is less attractive for highly dynamic workloads where the SLO surface shifts faster than the controller can react.
So the correct conclusion is not “install this and forget latency.” The correct conclusion is narrower and more useful: use SLO-first measurement to find the operating region where the serving stack keeps its promise, and treat acceleration knobs as hypotheses to test, not virtues to enable.
The tail is the product experience
The most business-relevant idea in the paper is not the specific hill-climb algorithm. It is the objective function.
Goodput under a p99 SLO forces the serving team to count only the work that arrives in time to be useful. That sounds severe, but production systems are severe. A chatbot embedded in customer support, a document-analysis workflow, or an internal copilot tool does not receive partial credit from users because the GPU was almost efficient.
The paper’s TinyLlama result is a compact warning: a default-like speculative setup produced p99 of 1.36 seconds and 8.1 rps goodput under a 1.2-second SLO; tuning the speculative width down to 0 reached 0.70 seconds p99 and 15.0 rps goodput. The important word is not “TinyLlama.” The important word is “tuning.”
As LLM products move from experiments to daily infrastructure, the old habit of reporting model quality separately from system behavior will look increasingly quaint. Responsible AI factsheets that ignore latency, goodput, and efficiency under load are describing a product that does not quite exist. They are describing the model in a jar.
Users meet the system in the tail.
[1] Yonas Atinafu, Henry Lin, and Robin Cohen, “Improving LLM performance through black-box online tuning: a case for adding system specs to Factsheets for Trusted AI,” arXiv:2603.11340v1, 2026. https://arxiv.org/abs/2603.11340