Opening — Why this matters now

For the last few years, AI strategy has been narrated as a model-quality story: bigger models, better benchmarks, longer context windows, more agents, more demos, more adjectives. That story was useful. It was also incomplete.

The less glamorous reality is now arriving with the invoice attached. LLM systems are not merely models. They are production services that consume GPU memory, scheduling capacity, engineering attention, and operational patience. Once a business moves from a prototype to repeated daily use, the question changes from “Can the model answer?” to “Can the system answer reliably, cheaply, and repeatedly when real users arrive at inconvenient times?”

Two recent arXiv papers are useful because they attack this problem from different layers of the same stack. Budgeted LoRA studies model distillation under explicit compute constraints, proposing a way to reallocate dense and low-rank computation so that a student model becomes cheaper not only to train but also to serve [1]. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints studies the serving layer, deriving stability conditions for LLM inference systems where GPU memory and KV cache growth determine whether queues remain bounded or explode [2].

Read together, the papers suggest a blunt but useful thesis:

LLM efficiency is shifting from “make the model smaller” to “control where computation and memory are spent across the whole service stack.”

That sounds less exciting than “autonomous AI transformation.” Good. It is probably closer to where the money goes.

The Research Cluster — What these papers are collectively asking

The two papers do not ask the same technical question. One is about compression during distillation. The other is about stability in serving systems. But they share a deeper concern: modern LLM deployment is constrained by resources that are not visible in simple model-quality metrics.

The compression paper asks: if standard LoRA reduces trainable parameters but leaves dense inference cost mostly intact, can distillation be reformulated as a structured compute-allocation problem? Its answer is Budgeted LoRA: a method that uses a global dense-compute budget, module-level dense-retention coefficients, adaptive rank gates, and post-training compression to produce a family of deployment-efficient students.

The queueing paper asks: if LLM inference is limited by both compute and KV cache memory, can operators derive a stability condition that predicts whether incoming demand can be served without unbounded queue growth? Its answer is a queueing framework that models prompt and decode phases, accounts for memory growth over a request’s lifetime, and validates predicted stability thresholds against real GPU experiments.

The connective tissue is not LoRA, KV cache, or any one algorithm. The connective tissue is allocation. The first paper reallocates computation inside the model. The second reallocates and provisions capacity around the model.

That makes this a stack/layer article, not a pair of paper summaries. The real object of analysis is the inference economics stack.

The Shared Problem — What the papers are reacting to

LLM deployment has a persistent habit of hiding costs behind abstractions.

A product manager sees an API call. A user sees a chat window. A CFO sees a GPU bill. A platform engineer sees queue length, batch size, context length, memory fragmentation, and the faint outline of sleep deprivation.

Both papers react to the same operational problem: a model can be impressive and still be economically awkward.

At the model layer, parameter-efficient fine-tuning is not automatically inference-efficient. Standard LoRA trains a small number of low-rank parameters while preserving the dense backbone. This lowers adaptation cost, but the dense matrix multiplications still dominate serving. The first paper’s important move is to say: do not merely ask how many parameters are trainable; ask which computation survives deployment.
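
A back-of-envelope comparison makes the gap concrete (a minimal sketch with illustrative numbers, not figures from either paper):

```python
# Rough per-token FLOP count for one d x d projection with a rank-r LoRA adapter.
# d and r are assumed values chosen only for illustration.
d = 4096   # hidden width of the dense projection
r = 16     # LoRA rank

dense_flops = 2 * d * d        # y = Wx: one d x d matmul per token
lora_flops = 2 * d * r * 2     # y += B(Ax): two thin matmuls (r x d and d x r)

print(f"dense path: {dense_flops / 1e6:.1f} MFLOPs/token")
print(f"LoRA path : {lora_flops / 1e6:.2f} MFLOPs/token "
      f"({100 * lora_flops / dense_flops:.1f}% of dense)")
# Trainable parameters drop from d*d to 2*d*r, but at serving time the dense
# d*d multiplication still runs for every token unless it is removed or merged.
```

At these assumed sizes the adapter adds well under one percent of the layer's compute, which is exactly why LoRA is cheap to adapt and, by itself, does nothing for dense serving cost.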

At the serving layer, request throughput is not determined by average request length alone. LLM requests have prompt phases and decode phases. KV cache memory grows as tokens accumulate. Long prompts and long generations can change effective concurrency. The second paper’s important move is to say: do not merely ask how fast one request runs; ask whether the request arrival process is stable under memory-constrained batching.

Here is the shared business problem in plain terms:

| Hidden assumption | Why it fails in production | What the papers replace it with |
|---|---|---|
| “Smaller adaptation means cheaper inference.” | LoRA can reduce training cost while leaving dense serving cost mostly unchanged. | Distillation as structured compute allocation across dense and low-rank pathways. |
| “Average latency tells us enough.” | Queue stability depends on demand, batching, prompt/decode distributions, and KV cache memory growth. | Stability analysis using arrival rate, service capacity, workload distribution, and memory constraints. |
| “GPU planning is just buy-more-hardware math.” | Over-provisioning wastes capital; under-provisioning creates latency failures. | Capacity planning based on stable service rate and workload-specific memory dynamics. |
| “Perplexity captures model retention.” | Similar perplexity can hide different behavior retention patterns. | Evaluation that also probes in-context behavior retention. |

The common message is simple: efficiency is no longer a single metric. It is a control problem.

What Each Paper Adds

| Paper | Layer of the stack | Direct contribution | What it changes in the deployment conversation |
|---|---|---|---|
| Budgeted LoRA | Model architecture / distillation | Reframes distillation as compute allocation under a global dense-compute budget. Uses dense-retention coefficients, adaptive low-rank allocation, and post-training compression. | Shows that model adaptation efficiency and inference efficiency are different. A cheap-to-train model is not necessarily cheap to serve. |
| Queueing-Theoretic Framework | Serving / operations | Builds a queueing model for LLM inference that explicitly includes KV cache memory constraints and derives stability / instability conditions. | Turns GPU provisioning from rule-of-thumb sizing into workload-aware capacity planning. |

Paper 1: Budgeted LoRA makes model efficiency controllable

The Budgeted LoRA paper starts from a practical weakness in standard LoRA-based distillation: LoRA changes the trainable adaptation path but generally leaves the dense backbone in place. That means the method can be parameter-efficient without creating meaningful inference savings.

The authors’ response is to make dense computation itself budgeted. Instead of assuming a fixed student architecture, they introduce a global compute budget controlling the fraction of dense computation retained. Individual modules receive dense-retention coefficients. Low-rank paths use adaptive rank gates. After training, a compression step converts the gated training structure into deployment forms: dense paths may be removed, approximated with low-rank SVD, or kept and merged when needed.
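
To make the shape of that mechanism concrete, here is a minimal sketch of a budgeted dense-plus-low-rank layer. It is illustrative only: the class, names, and initialization are assumptions rather than the authors' implementation, and the global budget (which in the paper couples the per-module retention coefficients) is not shown.

```python
import torch
import torch.nn as nn

class BudgetedAdapterLinear(nn.Module):
    """Illustrative sketch only: a linear layer whose dense path is scaled by a
    retention coefficient and whose low-rank path is gated per rank. It mirrors
    the idea of budgeted dense/low-rank allocation, not the paper's code."""

    def __init__(self, dense: nn.Linear, max_rank: int = 32):
        super().__init__()
        d_out, d_in = dense.weight.shape
        self.dense = dense
        for p in self.dense.parameters():            # keep the dense path frozen
            p.requires_grad_(False)
        self.alpha = nn.Parameter(torch.tensor(1.0))             # dense-retention coefficient
        self.A = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, max_rank))
        self.rank_gate = nn.Parameter(torch.ones(max_rank))      # per-rank gates, prunable after training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dense_out = self.alpha * self.dense(x)                   # budgeted dense computation
        low_rank_out = ((x @ self.A.T) * self.rank_gate) @ self.B.T
        return dense_out + low_rank_out

layer = BudgetedAdapterLinear(nn.Linear(4096, 4096), max_rank=32)
y = layer(torch.randn(8, 4096))    # -> shape (8, 4096)
```

The point of the sketch is simply where the budget lives: in `alpha` and `rank_gate`, the quantities that post-training compression then acts on, by dropping dense paths whose retention has collapsed, approximating others with low-rank SVD, and pruning gated-off ranks.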

The result is not a single compressed model but a family of operating points. The paper reports that at one moderate budget, Budgeted LoRA achieves 0.59 normalized training compute and a 1.74× compressed-module inference speedup while closely matching standard KD-LoRA in perplexity; at a more aggressive budget, it reports 0.38 normalized training compute and a 4.05× compressed-module inference speedup, with moderate perplexity degradation.

The more interesting result is not only the speedup. The paper also evaluates function-style in-context learning probes. It finds that Budgeted LoRA can retain more of this behavior than standard KD-LoRA, especially when distilling from larger teachers. That matters because perplexity alone can miss operationally relevant behavior. A model can look acceptable on one aggregate metric while losing capabilities that business workflows actually depend on.

Business interpretation: this paper supports a more granular view of model compression. Instead of treating compression as a one-time shrink operation, organizations should treat it as a budget dial with measurable trade-offs across quality, latency, and behavior retention. The paper does not prove that Budgeted LoRA is ready for every production architecture. It does show that “parameter-efficient” is too weak as an ROI category.

Paper 2: Queueing theory makes serving reliability measurable

The queueing paper moves one layer down, from the model to the serving system. Its starting point is that LLM inference is not a classical stateless service. Each request has a prompt phase and a decode phase. The KV cache speeds up decoding by storing key-value representations, but that memory grows as the active context grows. The result is a serving system where compute and GPU memory jointly determine capacity.

The paper models a single GPU worker with a KV cache memory constraint. Requests arrive stochastically, each with a prompt length and response length drawn from a joint distribution. The model tracks the lifetime cumulative memory usage of each request and derives conditions under which the queue is stable or overloaded.

The paper’s practical claim is that operators can estimate a request arrival rate and a stable processing rate, then use the relationship between the two to provision GPUs. In experiments using Meta-Llama-3-8B on NVIDIA A100 GPUs with vLLM and chunked prefill, the theoretical predictions align closely with empirical rates. The paper reports deviations typically within 10%, including an 8.03% error on LongBench v2 and a 3.38% gap in an eight-GPU replica setting.
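
The paper derives its stability and capacity results formally from the queueing model; the sketch below is only the back-of-envelope version of the reasoning it makes principled. Every number is an assumption for illustration: the per-token KV cache size is computed from Llama-3-8B's architecture (32 layers, 8 KV heads, head dimension 128, fp16), while the throughput, memory budget, and headroom figures are invented.

```python
import math

# --- Workload assumptions (illustrative, not from the paper) ---
arrival_rate = 4.0          # requests per second across the fleet
avg_prompt_tokens = 2000
avg_output_tokens = 500

# --- Model / hardware assumptions ---
# KV cache per token for Llama-3-8B in fp16:
# 2 (K and V) * 32 layers * 8 KV heads * 128 head_dim * 2 bytes = 128 KiB.
kv_bytes_per_token = 2 * 32 * 8 * 128 * 2
kv_budget_bytes = 40e9      # GPU memory left for KV cache after weights (assumed)
tokens_per_second = 3000    # sustained per-GPU token throughput (assumed)

tokens_per_request = avg_prompt_tokens + avg_output_tokens

# Memory bound: roughly how many requests can hold their full KV cache at once.
max_resident_requests = int(kv_budget_bytes // (tokens_per_request * kv_bytes_per_token))

# Throughput bound: how many requests per second one GPU can complete.
per_gpu_service_rate = tokens_per_second / tokens_per_request

print(f"max resident requests per GPU: {max_resident_requests}")
print(f"per-GPU stable service rate  : {per_gpu_service_rate:.2f} req/s")

# Stability requires total service capacity to exceed the arrival rate,
# with headroom for burstiness (the 70% target load is an assumption).
gpus_needed = math.ceil(arrival_rate / (0.7 * per_gpu_service_rate))
print(f"GPU replicas needed          : {gpus_needed}")
```

In this toy setting throughput, not memory, is the binding constraint; as contexts grow, the memory bound tightens, which is why the joint prompt/output distribution, rather than an average, belongs in the calculation.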

Business interpretation: this paper moves LLM serving from “monitor and panic” toward capacity planning. It does not eliminate engineering judgment. It does not solve every architecture, especially pipeline-parallel and prefill-decode-disaggregated systems, which the authors identify as future extensions. But it gives operators a principled way to ask: under this workload distribution, arrival rate, context length pattern, and memory budget, are we stable?

That is a very different question from “What is our average latency?” Average latency is a dashboard metric. Stability is a survival condition.

The Bigger Pattern — What emerges when we read them together

The larger pattern is that LLM inference is becoming an allocation economy.

At the model layer, computation must be allocated between dense pathways and low-rank pathways. At the serving layer, GPU memory must be allocated across active requests whose memory footprints evolve over time. In both cases, naive averages are dangerous.

A simple model of the stack looks like this:

| Stack layer | Scarce resource | Control lever | Failure mode if unmanaged |
|---|---|---|---|
| Distillation / compression | Dense computation and retained behavior | Dense-budget schedule, rank allocation, post-training compression | Model is cheap to adapt but expensive to serve; or efficient but behaviorally damaged. |
| Runtime execution | KV cache memory and batch capacity | Chunked prefill, scheduling, batching, memory-aware admission | Good single-request speed but poor concurrency. |
| Service operations | GPU fleet capacity | Arrival-rate estimation, stable service-rate estimation, autoscaling | Queue growth, SLA violations, wasteful over-provisioning. |
| Business adoption | ROI and reliability | Workflow design, workload segmentation, governance thresholds | AI pilots scale into unpredictable cost centers. |

This is why the two papers belong together. Budgeted LoRA says: decide how much computation the model is allowed to keep. Queueing stability says: decide how much demand the serving system can safely absorb. One compresses the request cost. The other sizes the system that processes the requests.

For business readers, the key lesson is not “use Budgeted LoRA” or “use this exact queueing formula.” The lesson is that LLM deployment decisions should be made with explicit resource budgets and stability thresholds. Without them, AI adoption becomes a vibes-based capital expenditure program. A very modern form of financial comedy.

The flywheel view

The two papers also imply a practical flywheel:

  1. Measure workload distribution — prompt lengths, output lengths, arrival rates, peak periods, workflow type.
  2. Choose model operating points — dense budget, compression level, behavior-retention tests, acceptable quality loss.
  3. Estimate stable service capacity — per-GPU service rate under actual workload distributions and memory constraints.
  4. Provision and route intelligently — assign different workflows to different model and serving tiers.
  5. Re-test after product changes — new prompts, longer context windows, agent loops, and retrieval payloads change the economics.

This is the operational bridge between model research and business ROI. It is not enough to ask whether a model is “smaller.” Smaller for what? Adaptation? Parameter count? Dense compute? KV cache pressure? Latency? Stability under 9:30 a.m. Monday demand?

The papers push us toward a more disciplined vocabulary.

Business Interpretation — What changes in practice

The papers directly show technical methods and theoretical tools. The business implications below are extrapolations, but reasonable ones.

1. Treat AI deployment as a portfolio of operating points

The Budgeted LoRA paper shows that one global budget can expose a continuum of quality-efficiency trade-offs. In business terms, that suggests a single model strategy should not serve every workflow.

A customer-service triage bot, an internal summarization assistant, a compliance drafting tool, and an executive decision-support copilot do not need the same latency, cost, or capability-retention profile. Some workflows can tolerate more compression. Others need stronger behavior preservation. The useful question becomes:

Which workflow deserves which compute budget?

That question is more productive than arguing abstractly about whether to use “the best model.” Best is not a strategy. It is usually a procurement mood.

2. Separate adaptation cost from serving cost

Many AI adoption conversations still conflate fine-tuning cost with operating cost. The first paper is a useful correction. A method can reduce trainable parameters and still fail to lower dense inference cost.

For managers, this means vendor or internal engineering claims should be tested across at least four buckets:

| Claim type | What to ask |
|---|---|
| Training efficiency | Does it reduce training time, GPU hours, or adaptation cost? |
| Inference efficiency | Does it reduce dense compute, end-to-end latency, or memory pressure at serving time? |
| Behavioral retention | Which task behaviors survive compression, beyond aggregate perplexity or benchmark score? |
| Deployment complexity | Does the compressed model compile into practical serving modules, or does it require fragile runtime machinery? |

The difference between training savings and serving savings is not academic. Training may happen occasionally. Serving happens every time a user clicks the button.

3. Capacity planning should use workload distributions, not averages

The queueing paper emphasizes the joint distribution of prompt length and decode length. That is important because enterprise AI workloads are rarely uniform.

A legal document review assistant may have long prompts and long outputs. A support classifier may have short prompts and short outputs. An agentic research workflow may create bursts of tool calls, long retrieval contexts, and unpredictable output lengths. Averages flatten these differences and make provisioning look simpler than it is.

Practical implication: before scaling an AI workflow, collect at least:

| Workload feature | Why it matters |
|---|---|
| Prompt length distribution | Affects prefill load and KV cache growth. |
| Output length distribution | Affects decode duration and memory retention. |
| Joint prompt-output relationship | Long prompts may correlate with long responses; assuming independence can mislead. |
| Arrival-rate pattern | Determines whether queues stay bounded under peak demand. |
| Workflow criticality | Determines acceptable latency, fallback behavior, and redundancy. |

The paper directly validates its model on controlled workloads and LongBench v2; the business extrapolation is that companies should build their own workload traces. Imported benchmarks are helpful. Your users remain annoyingly specific.
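
What building such a trace might look like, as a minimal sketch: the log schema, field names, and numbers below are illustrative assumptions, not a standard.

```python
import numpy as np
import pandas as pd

# Illustrative trace: one row per completed request (schema is an assumption).
trace = pd.DataFrame({
    "timestamp": pd.to_datetime(["2026-05-04 09:30:01", "2026-05-04 09:30:02",
                                 "2026-05-04 09:30:02", "2026-05-04 09:30:05"]),
    "prompt_tokens": [1800, 250, 4200, 900],
    "output_tokens": [400, 60, 1200, 220],
})

# Distributions, not averages: tails drive KV cache pressure and decode time.
for col in ["prompt_tokens", "output_tokens"]:
    p50, p95, p99 = np.percentile(trace[col], [50, 95, 99])
    print(f"{col}: p50={p50:.0f}  p95={p95:.0f}  p99={p99:.0f}")

# Joint behavior: do long prompts come with long outputs?
print("prompt/output correlation:",
      round(trace["prompt_tokens"].corr(trace["output_tokens"]), 2))

# Arrival pattern: peak requests per second matters more than the daily mean.
per_second = trace.set_index("timestamp").resample("1s").size()
print("peak arrivals per second:", int(per_second.max()))
```

These tail and peak statistics, rather than daily averages, are the inputs the stability question actually needs.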

4. Compression and capacity planning should be coupled

A compressed model changes the serving equation. A different serving policy changes the effective value of compression. These decisions should not live in separate teams that meet once per quarter and exchange diagrams with rectangles.

A practical AI operations review should ask:

| Decision | Model team question | Platform team question | Business question |
|---|---|---|---|
| Compress model | What behavior is lost at each budget? | How does latency and memory pressure change? | Which workflows can accept the trade-off? |
| Increase context length | Does model behavior improve enough? | Does KV cache pressure destabilize service? | Is the added quality worth the capacity cost? |
| Add agent loops | Does reasoning improve? | Do request bursts overload queues? | Does automation ROI survive repeated calls? |
| Autoscale GPUs | Which model variants are deployed? | What stability threshold triggers scaling? | What SLA justifies the GPU reserve? |

The combined lesson is that ROI is not merely model quality divided by API cost. It is quality, reliability, latency, and governance divided by total operating burden. Less catchy, sadly. More useful, definitely.

5. Build behavior-retention tests for each workflow

Budgeted LoRA’s in-context learning probe results matter because they warn against trusting one global quality metric. A business workflow may depend on following examples, preserving formatting rules, applying a classification rubric, or executing a structured transformation. These behaviors may not move in lockstep with perplexity.

For Cognaptus-style automation projects, this implies a simple adoption checklist:

| Stage | Test |
|---|---|
| Before compression | Define the workflow behaviors that must survive: extraction accuracy, format discipline, tool-choice logic, escalation decisions. |
| During model selection | Compare model variants on those behaviors, not just generic benchmarks. |
| Before deployment | Run workload traces through serving tests to estimate stable capacity. |
| After deployment | Monitor behavior drift, queue growth, tail latency, and cost per completed business task. |

The unit of evaluation should not be only “tokens generated.” It should be “business task completed under acceptable cost and reliability.”

Limits and Open Questions

Both papers are useful, but neither should be oversold.

The Budgeted LoRA paper is evaluated in a controlled setting with relatively small models and specific architecture choices. Its inference speed measurements are reported over replaced or compressed modules rather than full end-to-end generation latency. The authors also note that transfer laws for hyperparameters across student sizes, teacher scales, budgets, and architectures remain open.

The queueing paper focuses on single-GPU replicas and data-parallel multi-GPU deployment. It discusses tensor parallelism as an extension by treating a tensor-parallel group as a logical worker, but pipeline parallelism and prefill-decode disaggregation change the queue topology and remain future work. That matters because many production systems are not clean single-replica textbook objects. They are more like a logistics network designed during a product launch.

The combined open questions are where the business value gets interesting:

| Open question | Why it matters |
|---|---|
| How do compression choices change end-to-end serving stability? | A model-level speedup may or may not translate into system-level throughput under realistic batching and KV cache constraints. |
| How should model variants be routed by workflow type? | Not every request deserves the same model budget or GPU priority. |
| Can behavior-retention tests be integrated into autoscaling and model-selection policies? | Operational decisions should account for quality degradation, not just throughput. |
| How do agentic workflows change queue stability? | Agents create cascades of model calls, tool calls, retrieval contexts, and retries. Arrival rates become workflow-generated, not merely user-generated. |
| What governance layer decides acceptable degradation? | Compression and routing are business-risk decisions, not only engineering choices. |

The papers give us tools, not a finished operating manual. That is fine. A useful tool is better than another ceremonial AI maturity model with five stages and zero instrumentation.

Conclusion

The two papers point toward a more realistic phase of AI deployment. The next frontier is not only larger models or smaller models. It is controlled models inside stable services.

Budgeted LoRA shows how model compression can be reframed as structured compute allocation, where dense computation is not sacred and low-rank adaptation is not automatically deployment-efficient. The queueing-theoretic paper shows how serving reliability depends on workload arrival, prompt/decode distributions, GPU memory, and KV cache dynamics.

Together, they sketch the economics of practical AI systems: choose the right compute budget, preserve the right behaviors, measure the real workload, and provision for stability rather than hope.

For business leaders, this is the useful shift. AI ROI will not come from adopting “AI” in general. It will come from turning model behavior, infrastructure capacity, and workflow value into measurable operating choices.

The free-token era was always imaginary. The invoice is just becoming better itemized.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mohammed Sabry and Anya Belz, “Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference,” arXiv:2605.04341, 2026. https://arxiv.org/abs/2605.04341

  2. Chengyi Nie, Nian Si, and Zijie Zhou, “A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints,” arXiv:2605.04595, 2026. https://arxiv.org/abs/2605.04595