Opening — Why this matters now

LLMs are no longer laboratory curiosities. They are infrastructure.

From customer‑support copilots to enterprise knowledge systems, organizations increasingly run large language models as interactive services. When that happens, a quiet but brutal reality emerges: users do not care about average latency. They care about the worst moment when the system stalls.

This is what engineers call tail latency—typically measured at the 99th percentile (p99). If the p99 latency is bad, a small fraction of users experience painfully slow responses even when the system looks “fast on average.” In production environments, that minority can be the difference between a product that feels magical and one that feels broken.
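To make that concrete, here is a minimal nearest-rank p99 calculation — an illustration of the metric itself, not code from the paper:

```python
def p99(latencies_s: list[float]) -> float:
    """Nearest-rank 99th percentile: 99% of requests are at least this fast."""
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # integer-exact ceil(0.99 * n)
    return ordered[rank - 1]

# 98 fast requests and 2 slow ones: the mean looks healthy, the tail does not.
lats = [0.5] * 98 + [3.0] * 2
print(sum(lats) / len(lats))  # 0.55 seconds on average
print(p99(lats))              # 3.0 seconds at p99
```

A service with this latency profile looks "fast on average" on a dashboard while one in fifty users waits six times longer than the rest.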

A recent research paper proposes a simple but powerful idea: instead of optimizing raw throughput, optimize goodput under a strict tail‑latency constraint. The authors introduce a system called SLO‑Tuner, a black‑box controller that automatically adjusts serving parameters to maximize the number of requests that meet a latency target.

The result is striking: p99 latency is cut roughly in half while goodput nearly doubles.

The lesson is subtle but profound: performance tuning for LLMs should start from the tail, not the mean.


Background — Context and prior art

Operating an LLM service involves balancing several competing forces:

| Factor | Goal | Risk |
| --- | --- | --- |
| GPU utilization | Maximize hardware efficiency | Queue buildup |
| Concurrency | Serve many users simultaneously | Latency spikes |
| Batching | Increase throughput per GPU step | Longer waiting time |
| Speculative decoding | Speed up token generation | Extra verification overhead |

Traditional tuning strategies usually aim to maximize average throughput or minimize average latency. However, this approach hides an uncomfortable truth: averages can look healthy even while a minority of users experience severe delays.

In large‑scale distributed systems, this phenomenon is known as “the tail at scale.” A single slow component propagates delays across the entire service pipeline.

The authors argue that LLM serving should instead be framed around Service Level Objectives (SLOs)—for example:


p99 latency ≤ 1.2 seconds

Under this framing, the system should maximize goodput, defined as:

| Metric | Meaning |
| --- | --- |
| Throughput | Total requests processed per second |
| Goodput | Requests per second that meet the SLO |

This distinction matters because requests that violate the latency target contribute zero value to user experience.
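In code, the distinction is a one-line filter. This sketch assumes per-request end-to-end latencies collected over a measurement window; the 1.2 s SLO matches the example above:

```python
def throughput(latencies_s: list[float], window_s: float) -> float:
    """Total completed requests per second over the measurement window."""
    return len(latencies_s) / window_s

def goodput(latencies_s: list[float], window_s: float, slo_s: float = 1.2) -> float:
    """Requests per second whose end-to-end latency met the SLO."""
    return sum(1 for lat in latencies_s if lat <= slo_s) / window_s

# Ten requests completed in a one-second window, two of them over the SLO.
lats = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.5, 2.0]
print(throughput(lats, 1.0))  # 10.0
print(goodput(lats, 1.0))     # 8.0 -- the two violators contribute nothing
```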


Analysis — What the paper actually builds

The proposed system, SLO‑Tuner, treats the entire LLM server as a black box.

Instead of inspecting internal metrics or modifying the inference engine, it simply observes end‑to‑end measurements and adjusts configuration parameters through a hill‑climbing search process.

Tunable knobs

The controller focuses on three operational parameters:

| Knob | Meaning | Trade‑off |
| --- | --- | --- |
| Concurrency | Number of simultaneous requests | Higher GPU use vs. queueing delays |
| Batch size | Requests processed together | Throughput vs. waiting time |
| Speculative decoding width | Tokens drafted before verification | Speed vs. latency variance |

These knobs interact in complex ways depending on hardware, workload, and model architecture.

Optimization objective

The controller evaluates configurations using a score function:


Score = Goodput − SLO penalty − Hardware cost

Where:

  • Goodput measures successful requests per second
  • SLO penalty punishes configurations where p99 exceeds the target
  • Hardware cost discourages excessively aggressive resource usage
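A sketch of that score in code — the penalty weight and the linear penalty shape are illustrative assumptions, not values from the paper:

```python
def score(goodput_rps: float, p99_s: float, hw_cost: float,
          slo_s: float = 1.2, penalty_weight: float = 10.0) -> float:
    """Score = goodput - SLO penalty - hardware cost.

    The penalty is zero while p99 meets the SLO and grows linearly beyond it;
    the weight of 10.0 is an assumed value, not the paper's.
    """
    slo_penalty = penalty_weight * max(0.0, p99_s - slo_s)
    return goodput_rps - slo_penalty - hw_cost

print(score(goodput_rps=15.0, p99_s=0.70, hw_cost=1.0))  # meets SLO: no penalty
print(score(goodput_rps=8.0, p99_s=1.36, hw_cost=1.0))   # violates SLO: penalized
```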

The search algorithm repeatedly:

  1. Runs the server with a configuration
  2. Measures p50, p95, p99 latency
  3. Computes goodput
  4. Tests nearby configurations
  5. Moves to the best option

This process continues for a limited number of iterations, making it practical for real deployment environments.
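The five-step loop above amounts to a local search. In this toy sketch, `benchmark` is a synthetic stand-in for actually running the server under load — its formulas are invented for illustration only:

```python
import itertools

def benchmark(concurrency: int, batch: int, spec: int) -> tuple[float, float]:
    """Return (goodput_rps, p99_s) for a configuration -- synthetic toy model."""
    goodput = max(min(concurrency, 10) * batch / (batch + 2) - 0.5 * spec, 0.0)
    p99 = 0.4 + 0.05 * concurrency + 0.04 * batch + 0.1 * spec
    return goodput, p99

def score(goodput: float, p99: float, slo: float = 1.2) -> float:
    """Goodput minus a linear penalty for exceeding the SLO."""
    return goodput - 10.0 * max(0.0, p99 - slo)

def hill_climb(start=(2, 2, 1), iters=20):
    best, best_score = start, score(*benchmark(*start))
    for _ in range(iters):
        # Step 4: test nearby configurations (each knob nudged by -1/0/+1).
        neighbors = {
            tuple(max(1, v + d) for v, d in zip(best, delta))
            for delta in itertools.product((-1, 0, 1), repeat=3)
            if delta != (0, 0, 0)
        }
        # Step 5: move to the best-scoring neighbor, or stop at a local optimum.
        cand = max(neighbors, key=lambda c: score(*benchmark(*c)))
        cand_score = score(*benchmark(*cand))
        if cand_score <= best_score:
            break
        best, best_score = cand, cand_score
    return best, best_score

cfg, final = hill_climb()
print(cfg, round(final, 2))
```

Swapping the stub for a real load-test harness against a serving endpoint is what turns this sketch into a practical controller.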


Findings — Results with visualization

The researchers evaluated the approach using the TinyLlama 1.1B model served with vLLM.

Headline performance improvements

| Configuration | p99 Latency | Goodput |
| --- | --- | --- |
| Default configuration | ~1.36 s | ~8 requests/sec |
| Tuned configuration | ~0.70 s | ~15 requests/sec |

Two observations stand out.

1️⃣ Speculative decoding was actually harmful for the target SLO.

Wider speculative drafts increased verification overhead and variance, inflating p99 latency.

2️⃣ Concurrency has a sharp knee point.

| Concurrency | Goodput | p99 latency |
| --- | --- | --- |
| 2 threads | 2.6 rps | ~1.23 s |
| 8 threads | 9.2 rps | ~1.31 s |
| 16 threads | ~0.27 rps | >1.6 s |

Beyond a certain threshold, queueing delays dominate the system and performance collapses.

Batch‑size tradeoff

| Batch size | Impact |
| --- | --- |
| Small batches | Underutilized GPU |
| Moderate batches | Best goodput / SLO compliance |
| Very large batches | Long queue delays → p99 spikes |

The optimal region tends to be moderate batching with conservative speculation.


Implications — Why this matters for AI infrastructure

The implications extend far beyond one tuning algorithm.

1. LLM deployment is becoming systems engineering

Much of the AI conversation focuses on models, benchmarks, and training datasets.

Yet in production, the decisive factor is often inference systems engineering. A poorly tuned serving stack can squander expensive GPUs while degrading user experience.

2. Black‑box optimization lowers operational complexity

Most companies cannot afford deep modifications to inference engines.

A black‑box controller that interacts only through standard APIs offers a portable solution across platforms like:

  • vLLM
  • Triton
  • Hugging Face TGI
  • MLX

3. Tail latency is also a fairness problem

Interestingly, the authors frame tail latency as a fairness constraint.

If a small minority of users consistently receive extremely slow responses, the system effectively treats them as second‑class participants.

Optimizing p99 therefore improves not only performance but also equitable user experience.

4. Performance transparency should be part of AI governance

The paper proposes extending AI factsheets—documents used to describe responsible AI systems—to include operational metrics such as:

| Category | Example metrics |
| --- | --- |
| Reliability | p95 / p99 latency |
| Performance | SLO compliance |
| Sustainability | Energy per request |
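As a sketch of what such a factsheet extension could carry, the record below uses an illustrative, non-standard schema; the p95, compliance, and energy values are invented placeholders, with only the p99 figure taken from the tuned configuration above:

```python
import json

# Hypothetical operational-metrics section of an AI factsheet.
# Field names and the p95/compliance/energy values are illustrative,
# not part of any published standard.
factsheet_ops = {
    "reliability": {"p95_latency_s": 0.52, "p99_latency_s": 0.70},
    "performance": {"slo": "p99 <= 1.2 s", "slo_compliance_pct": 99.4},
    "sustainability": {"energy_per_request_joules": 18.0},
}
print(json.dumps(factsheet_ops, indent=2))
```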

This shifts Responsible AI discussions beyond bias and transparency toward practical system behavior in real deployment.


Conclusion — Infrastructure decides adoption

A curious irony of modern AI is that the most expensive part of the system—the model—is rarely the primary operational bottleneck.

Instead, performance often hinges on mundane configuration choices: how many requests run simultaneously, how large batches grow, or how aggressively tokens are speculated.

The work discussed here demonstrates that simple, black‑box tuning can dramatically improve both speed and reliability. More importantly, it reframes LLM optimization around a principle borrowed from distributed systems engineering:

The user experience is determined not by the average case, but by the worst one percent.

As LLMs evolve into critical infrastructure, the ability to engineer the tail may become as important as the ability to train the model itself.

Cognaptus: Automate the Present, Incubate the Future.