Opening — Why this matters now
LLMs are no longer laboratory curiosities. They are infrastructure.
From customer‑support copilots to enterprise knowledge systems, organizations increasingly run large language models as interactive services. When that happens, a quiet but brutal reality emerges: users do not care about average latency. They care about the worst moment when the system stalls.
This is what engineers call tail latency—typically measured at the 99th percentile (p99). If the p99 latency is bad, a small fraction of users experience painfully slow responses even when the system looks “fast on average.” In production environments, that minority can be the difference between a product that feels magical and one that feels broken.
A recent research paper proposes a simple but powerful idea: instead of optimizing raw throughput, optimize goodput under a strict tail‑latency constraint. The authors introduce a system called SLO‑Tuner, a black‑box controller that automatically adjusts serving parameters to maximize the number of requests that meet a latency target.
The result is striking: p99 latency is cut roughly in half while goodput nearly doubles.
The lesson is subtle but profound: performance tuning for LLMs should start from the tail, not the mean.
Background — Context and prior art
Operating an LLM service involves balancing several competing forces:
| Factor | Goal | Risk |
|---|---|---|
| GPU utilization | Maximize hardware efficiency | Queue buildup |
| Concurrency | Serve many users simultaneously | Latency spikes |
| Batching | Increase throughput per GPU step | Longer waiting time |
| Speculative decoding | Speed up token generation | Extra verification overhead |
Traditional tuning strategies usually aim to maximize average throughput or minimize average latency. However, this approach hides an uncomfortable truth: averages can look healthy even while a minority of users experience severe delays.
In large‑scale distributed systems, this phenomenon is known as “the tail at scale.” A single slow component propagates delays across the entire service pipeline.
The authors argue that LLM serving should instead be framed around Service Level Objectives (SLOs)—for example:
p99 latency ≤ 1.2 seconds
Under this framing, the system should maximize goodput, defined in contrast to raw throughput:
| Metric | Meaning |
|---|---|
| Throughput | Total requests processed per second |
| Goodput | Requests per second that meet the SLO |
This distinction matters because requests that violate the latency target contribute zero value to user experience.
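The distinction is easy to make concrete. The sketch below (a minimal illustration, not code from the paper) computes both metrics from a batch of measured latencies against the 1.2 s SLO used earlier:

```python
import random

def throughput_and_goodput(latencies_s, window_s, slo_s=1.2):
    """Throughput counts every completed request; goodput counts
    only those whose latency met the SLO target."""
    throughput = len(latencies_s) / window_s
    goodput = sum(1 for l in latencies_s if l <= slo_s) / window_s
    return throughput, goodput

# Toy workload: 100 requests completed over a 10 s window,
# 95 fast responses plus 5 slow outliers at 2.5 s.
random.seed(0)
latencies = [random.uniform(0.3, 1.0) for _ in range(95)] + [2.5] * 5
tput, gput = throughput_and_goodput(latencies, window_s=10.0)
print(f"throughput={tput:.1f} rps, goodput={gput:.1f} rps")
# → throughput=10.0 rps, goodput=9.5 rps
```

Throughput stays at 10 rps no matter how slow the outliers are; goodput drops as soon as requests miss the target, which is exactly the gap the paper optimizes.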
Analysis — What the paper actually builds
The proposed system, SLO‑Tuner, treats the entire LLM server as a black box.
Instead of inspecting internal metrics or modifying the inference engine, it simply observes end‑to‑end measurements and adjusts configuration parameters through a hill‑climbing search process.
Tunable knobs
The controller focuses on three operational parameters:
| Knob | Meaning | Trade‑off |
|---|---|---|
| Concurrency | Number of simultaneous requests | Higher GPU use vs queueing delays |
| Batch size | Requests processed together | Throughput vs waiting time |
| Speculative decoding width | Tokens drafted before verification | Speed vs latency variance |
These knobs interact in complex ways depending on hardware, workload, and model architecture.
Optimization objective
The controller evaluates configurations using a score function:
Score = Goodput − SLO penalty − Hardware cost
Where:
- Goodput measures successful requests per second
- SLO penalty punishes configurations where p99 exceeds the target
- Hardware cost discourages excessively aggressive resource usage
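A minimal version of this score might look as follows. The goodput term follows the definition above; the specific penalty and cost terms (and their weights) are illustrative assumptions, since the paper's exact formulation is not reproduced here:

```python
def score(latencies_s, window_s, gpu_util, slo_s=1.2,
          penalty_weight=10.0, cost_weight=2.0):
    """Score = goodput - SLO penalty - hardware cost.
    penalty_weight and cost_weight are illustrative assumptions."""
    lat = sorted(latencies_s)
    # p99: the latency below which 99% of requests complete.
    p99 = lat[min(len(lat) - 1, int(0.99 * len(lat)))]
    goodput = sum(1 for l in lat if l <= slo_s) / window_s
    # Penalize only when p99 overshoots the target, scaled by the overshoot.
    slo_penalty = penalty_weight * max(0.0, p99 - slo_s)
    # Discourage configurations that push the hardware too hard.
    hw_cost = cost_weight * gpu_util
    return goodput - slo_penalty - hw_cost
```

A configuration that serves everyone within the SLO scores close to its raw goodput; one whose p99 blows past the target is pushed sharply negative, regardless of how many requests it completed.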
The search algorithm repeatedly:
- Runs the server with a configuration
- Measures p50, p95, p99 latency
- Computes goodput
- Tests nearby configurations
- Moves to the best option
This process continues for a limited number of iterations, making it practical for real deployment environments.
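The loop above can be sketched as a simple hill climb over a discrete knob grid. Everything here is illustrative: the grid values and the `measure` callback, which stands in for running the server under a configuration and scoring the result, are assumptions rather than the paper's exact implementation:

```python
# Discrete grids for the three knobs (illustrative values).
GRID = {
    "concurrency": [2, 4, 8, 16],
    "batch_size": [1, 4, 8, 16],
    "spec_width": [0, 2, 4],
}

def neighbors(cfg):
    """Configurations that differ from cfg by one step in one knob."""
    for knob, values in GRID.items():
        i = values.index(cfg[knob])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**cfg, knob: values[j]}

def hill_climb(measure, cfg, iters=20):
    """measure(cfg) runs the server with cfg and returns its score
    (e.g. goodput minus penalties); here it is caller-supplied."""
    best, best_score = cfg, measure(cfg)
    for _ in range(iters):
        candidates = [(measure(n), n) for n in neighbors(best)]
        top_score, top = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # local optimum: no neighbor improves the score
        best, best_score = top, top_score
    return best, best_score
```

The iteration cap keeps the search cheap enough to run against a live deployment, at the cost of possibly stopping at a local optimum.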
Findings — Results with visualization
The researchers evaluated the approach using the TinyLlama 1.1B model served with vLLM.
Headline performance improvements
| Configuration | p99 Latency | Goodput |
|---|---|---|
| Default configuration | ~1.36 s | ~8 requests/sec |
| Tuned configuration | ~0.70 s | ~15 requests/sec |
Two observations stand out.
1️⃣ Speculative decoding was actually harmful for the target SLO.
Wider speculative drafts increased verification overhead and variance, inflating p99 latency.
2️⃣ Concurrency has a sharp knee point.
| Concurrency | Goodput | p99 latency |
|---|---|---|
| 2 threads | 2.6 rps | ~1.23 s |
| 8 threads | 9.2 rps | ~1.31 s |
| 16 threads | ~0.27 rps | >1.6 s |
Beyond a certain threshold, queueing delays dominate the system and performance collapses.
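A sweep like the one in the table can be reproduced with a small closed-loop load generator. The sketch below is a hypothetical harness: `send_request` stands in for one end-to-end call to the serving endpoint and is assumed to return its measured latency in seconds:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep_concurrency(send_request, levels, n_requests=200, slo_s=1.2):
    """For each concurrency level, push n_requests through a thread
    pool and report (goodput, p99). send_request() is a stand-in for
    one end-to-end call that returns its latency in seconds."""
    results = {}
    for c in levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            latencies = sorted(pool.map(lambda _: send_request(),
                                        range(n_requests)))
        window = time.perf_counter() - start
        p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
        goodput = sum(1 for l in latencies if l <= slo_s) / window
        results[c] = (goodput, p99)
    return results
```

Plotting goodput against concurrency from such a sweep makes the knee visible: goodput climbs until queueing delay pushes p99 past the SLO, then falls off a cliff.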
Batch‑size tradeoff
| Batch size | Impact |
|---|---|
| Small batches | Underutilized GPU |
| Moderate batches | Best goodput / SLO compliance |
| Very large batches | Long queue delays → p99 spikes |
The optimal region tends to be moderate batching with conservative speculation.
Implications — Why this matters for AI infrastructure
The implications extend far beyond one tuning algorithm.
1. LLM deployment is becoming systems engineering
Much of the AI conversation focuses on models, benchmarks, and training datasets.
Yet in production, the decisive factor is often inference systems engineering. A poorly tuned serving stack can squander expensive GPUs while degrading user experience.
2. Black‑box optimization lowers operational complexity
Most companies cannot afford deep modifications to inference engines.
A black‑box controller that interacts only through standard APIs offers a portable solution across platforms like:
- vLLM
- Triton
- Hugging Face TGI
- MLX
3. Tail latency is also a fairness problem
Interestingly, the authors frame tail latency as a fairness constraint.
If a small minority of users consistently receive extremely slow responses, the system effectively treats them as second‑class participants.
Optimizing p99 therefore improves not only performance but also equitable user experience.
4. Performance transparency should be part of AI governance
The paper proposes extending AI factsheets—documents used to describe responsible AI systems—to include operational metrics such as:
| Category | Example metrics |
|---|---|
| Reliability | p95 / p99 latency |
| Performance | SLO compliance |
| Sustainability | energy per request |
This shifts Responsible AI discussions beyond bias and transparency toward practical system behavior in real deployment.
Conclusion — Infrastructure decides adoption
A curious irony of modern AI is that the most expensive part of the system—the model—is rarely the primary operational bottleneck.
Instead, performance often hinges on mundane configuration choices: how many requests run simultaneously, how large batches grow, or how aggressively tokens are speculated.
The work discussed here demonstrates that simple, black‑box tuning can dramatically improve both speed and reliability. More importantly, it reframes LLM optimization around a principle borrowed from distributed systems engineering:
The user experience is determined not by the average case, but by the worst one percent.
As LLMs evolve into critical infrastructure, the ability to engineer the tail may become as important as the ability to train the model itself.
Cognaptus: Automate the Present, Incubate the Future.