When Bandits Get Priority: Learning Under Scarce, Tiered Capacity

Capacity looks simple until someone pays to jump the queue.

That is the quiet problem behind a large amount of modern AI infrastructure. A platform may have many model instances, edge servers, or compute nodes. Tasks arrive with different business value. Enterprise traffic is more important than free-tier traffic. Some jobs have tighter latency targets. Some users, by contract or politics, are simply not equal. Lovely democratic fiction ends at the load balancer.

The obvious engineering answer is priority routing: assign high-priority tasks first, let them consume capacity, then give whatever remains to lower-priority tasks. This sounds straightforward. It is not. Once capacity is stochastic, rewards are uncertain, and assignment has cost, the problem is no longer “send every task to the best server.” The best server may be crowded. The high-reward arm may be wasteful for a low-priority task. A lower-reward arm with more reliable capacity may dominate after costs. Queue-jumping is easy to describe and annoyingly hard to learn.

That is the core of Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing, which introduces MSB-PRS: Multiple-play Stochastic Bandits with Prioritized Resource Sharing.¹ The paper is written as bandit theory, but the useful business reading is infrastructure control. It asks: how should a system learn to allocate many prioritized tasks across many uncertain-capacity resources, when priority itself changes who actually receives service?

The answer is not “use UCB and move on.” Naturally. That would be too kind.

The paper’s most important move is not the regret theorem. It is the modeling decision.

Classical multi-play multi-armed bandits assume a learner can pull multiple arms per round and receive uncertain rewards. That abstraction is useful when each pull is more or less independent. But in shared compute systems, pulls interfere with one another. Several tasks may be assigned to the same model instance or edge server. That arm may have only a random number of usable capacity units in the current round. When demand exceeds available capacity, not everyone gets served.

MSB-PRS models this directly.

There are $M$ arms and $K$ plays. In an AI infrastructure analogy, arms are model instances, edge servers, or compute resources; plays are reasoning tasks, workloads, or service requests. Each arm has two unknown properties: the reward distribution from one unit of capacity, and the stochastic number of capacity units available in each round. Each play has two known properties: a priority weight $\alpha_k$ and a movement cost $c_{k,m}$ for assigning play $k$ to arm $m$.

The priority rule is simple: on each arm, assigned plays are sorted by priority. If the arm has enough capacity, all assigned plays are served. If not, only the top-ranked plays get capacity. A served play receives reward scaled by its priority weight. An unserved play receives nothing. The system’s utility is expected reward minus movement cost.

This creates the central twist:

$$ \text{utility} = \text{priority-weighted expected served reward} - \text{assignment cost} $$

The phrase “expected served reward” is doing heavy lifting. A play’s value on an arm depends not only on that arm’s reward mean, but also on the play’s priority rank among other plays assigned to the same arm and the probability that the arm has enough capacity to reach that rank.

So the allocation value of a play changes when another play joins the same arm. That is where the familiar “choose the top arms” instinct quietly dies.

Why top arms are not automatically optimal

A naive scheduler may reason like this: estimate each arm’s reward, send more tasks to the best arms, and reserve premium slots for premium users. That sounds operationally reasonable. It is also exactly the kind of reasoning this paper warns against.

The paper explicitly notes that arms with larger per-unit rewards are not necessarily favorable. The reason is not philosophical; it is mechanical.

A high-reward arm may also have scarce capacity. A lower-priority play assigned to that arm may often be pushed below the capacity cutoff by higher-priority plays. Its theoretical reward mean becomes irrelevant if it rarely receives a capacity unit. Meanwhile, assigning it to a less glamorous arm may yield higher expected utility because it is more likely to be served and may incur lower movement cost.

This is the mechanism-first intuition:

Naive belief	MSB-PRS correction	Operational consequence
The best arm is the arm with the highest reward mean.	A play’s effective value depends on arm reward, capacity probability, priority rank, and cost.	Routing must evaluate assignments jointly, not arm quality separately.
Priority simply decides who gets served first.	Priority changes the expected utility of every play sharing the same arm.	Service tiers alter the optimization landscape, not just queue order.
Adding more capacity choices always helps.	More arms expand the action space and slow learning.	Larger infrastructure can be harder to learn unless routing is structured.
UCB can be applied directly over allocations.	Exact UCB over all $M^K$ action profiles is computationally explosive.	Learning requires an optimization oracle, not just confidence intervals.

This matters for AI systems because many production controls are currently implemented as policy patches: premium queues, reserved capacity, fallback models, task routing rules, and overload behavior. MSB-PRS says those patches are coupled. Change one tier’s priority and the optimal assignment for other tiers may change, even if the servers themselves have not changed.

The model therefore treats priority not as a decoration on top of resource allocation, but as part of the resource allocation problem itself. A small mercy: at least the math is honest.

The offline problem becomes a matching problem with rank-aware nodes

Once the model is set, the first challenge is not learning. It is computing the optimal allocation when all parameters are already known.

That may sound like a warm-up. It is not. There are $M^K$ possible action profiles because each of the $K$ plays can be assigned to one of $M$ arms. Exhaustive search becomes useless quickly. More importantly, the utility is nonlinear because rank on an arm affects whether a play is likely to receive capacity.

The paper’s key computational trick is to convert the allocation problem into a weighted bipartite matching problem.

On the left side, the graph has play nodes. On the right side, it does not merely have arm nodes. It has arm–rank nodes: for each arm $m$, the graph includes positions $(m,1), (m,2), \ldots, (m,K)$. This is the elegant part. The right side represents not just “which arm,” but “which rank on which arm.”

The edge from play $k$ to arm-rank position $(m,j)$ receives a marginal utility weight:

$$ \Lambda_m(k,j) = \alpha_k \mu_m P_{m,j} - c_{k,m} $$

Here, $\mu_m$ is the mean reward of arm $m$, $P_{m,j}$ is the probability that arm $m$ has at least $j$ units of capacity, $\alpha_k$ is the priority weight, and $c_{k,m}$ is the movement cost. In plain English: the edge asks what play $k$ is worth if it occupies rank $j$ on arm $m$.

But not every matching is valid. A feasible matching must satisfy three structural conditions:

Matching condition	Meaning in the allocation problem
$U$-saturated	Every play is assigned somewhere.
$V$-monotone	If an arm uses rank $j$, it must also use earlier ranks; no magical empty first slot before an occupied third slot.
Priority compatible	Higher-priority plays must not sit behind lower-priority plays on the same arm.

The paper proves that valid action profiles can be mapped to matchings with the same utility, and valid matchings can be mapped back to action profiles. That equivalence is the bridge. It turns the messy priority-sharing allocation into something computable.

The resulting offline algorithm, MSB-PRS-OffOpt, constructs the weighted bipartite graph, runs maximum weighted matching, adjusts the matching if needed to satisfy the structural conditions, and returns the corresponding allocation. Its computational complexity is $O(M^3K^3)$.

That complexity is not tiny. But it is polynomial. Compared with $M^K$, polynomial looks almost civilized.

The learning problem is harder because uncertainty has two faces

The online version adds uncertainty. The learner does not know each arm’s reward mean $\mu_m$ or capacity distribution. It must learn both from feedback.

The feedback structure is also uneven. When a play is served, the system observes reward. Capacity for an arm is revealed only if at least one play is assigned to that arm. This means the learner must discover not only which arms are rewarding, but also which arms are reliably available at different capacity ranks.

That creates two estimation tasks:

estimate arm reward means;
estimate the complementary capacity probabilities $P_{m,d}$, meaning the probability that arm $m$ has at least $d$ units of capacity.

The exact optimistic approach would define a UCB index for every possible action profile by maximizing utility over all plausible reward and capacity parameters. Unfortunately, different action profiles may be optimistic under different parameter choices. Finding the best exact-UCB action may require searching over all $M^K$ allocations.

The paper avoids that by using an approximate UCB index. Instead of optimizing parameters separately for each action profile, it constructs one shared optimistic parameter set:

$$ \widehat{\mu}^{(t)} + \epsilon^{(t)}, \quad \widehat{P}^{(t)} + \lambda^{(t)} $$

Then it asks the offline oracle to solve the allocation problem under those optimistic estimates.

This is not just a computational shortcut. It works because the utility function is monotone in the reward means and capacity probabilities. If higher estimated reward and capacity probability cannot make an allocation look worse, then a shared optimistic parameter set remains a valid upper-confidence style guide for exploration.

The online algorithm, MSB-PRS-ApUCB, is therefore a loop:

Step	What happens	Why it matters
Estimate	Update reward and capacity estimates from observed feedback.	Learns both service quality and capacity availability.
Optimism	Add confidence bonuses to reward and capacity estimates.	Forces exploration of uncertain arms.
Optimize	Call MSB-PRS-OffOpt on optimistic parameters.	Converts learning into structured allocation.
Observe	Assign plays, observe rewards and revealed capacities.	Improves future estimates.

The important business reading is that learning is not wrapped around a greedy scheduler. Learning is wrapped around a correct allocation oracle. That distinction matters. If the allocation layer is structurally wrong, better estimates only make the wrong decision more confidently. A familiar enterprise software pattern, unfortunately.

The regret results say the difficulty is real, not cosmetic

The paper proves both lower and upper regret bounds. For business readers, regret is the accumulated performance loss from not knowing the optimal allocation at the beginning. In infrastructure terms: the cost of learning while serving live workloads.

The instance-independent lower bound is:

$$ \Omega(\alpha_1\sigma\sqrt{KMT}) $$

The instance-dependent lower bound for utility gap $\Delta$ is:

$$ \Omega\left(\frac{\alpha_1\sigma^2 M}{\Delta}\ln T\right) $$

The symbols matter less than the shape. Regret grows with the number of plays, the number of arms, the time horizon, the reward uncertainty, and the top priority weight $\alpha_1$. The highest priority tier amplifies mistakes. A bad routing decision for a premium task is more expensive than a bad routing decision for a low-priority task. Shocking, yes. But now it is inside the theorem, where product managers can no longer pretend it is only a customer-success issue.

The upper bounds for MSB-PRS-ApUCB are sublinear and nearly match the lower bounds up to stated factors. The instance-independent upper bound matches up to a $\sqrt{K\ln(KT)}$ factor. The instance-dependent upper bound matches up to an $\alpha_1K^2$ factor. The authors also state that closing these gaps remains open because MSB-PRS is neither a standard multi-play bandit nor a standard combinatorial bandit.

A compact reading:

Result type	What the paper shows	Practical interpretation
Lower bounds	Any learner must pay regret that scales with priority, uncertainty, plays, arms, and time.	Scarce tiered capacity is inherently costly to learn; this is not a bad implementation artifact.
Offline oracle	Known-parameter allocation can be solved via rank-aware matching in $O(M^3K^3)$.	Correct routing requires joint assignment over tasks and rank positions.
Approximate UCB	Online learning can reuse the oracle with optimistic estimates and retain sublinear regret.	A learnable control loop is possible without exhaustive search.
Remaining gaps	The bounds do not fully close.	Theory is strong enough to guide architecture, not complete enough to declare final optimality.

This is the paper’s strongest contribution: it separates unavoidable learning difficulty from avoidable computational stupidity. The unavoidable part is regret from uncertainty. The avoidable part is searching over $M^K$ allocations when a matching structure can be exploited.

The experiments are sanity checks, not production validation

The synthetic experiments are best read as implementation evidence and sensitivity testing, not as a second thesis.

The authors set a default synthetic environment with $M=5$ arms, $K=10$ plays, $T=10^4$, movement-cost scale $\eta=1$, reward standard deviation $\sigma=0.2$, U-shaped reward means, and two priority types with weights $\alpha=[3,1]$. They then vary four conditions: number of arms, number of plays, movement cost, and reward noise.

The reported behavior is consistent across figures:

Experiment	Likely purpose	What it supports	What it does not prove
Varying number of arms $M$	Sensitivity test	MSB-PRS-ApUCB remains sublinear; convergence slows as arms increase.	It does not establish performance under real LLM traffic or nonstationary workloads.
Varying number of plays $K$	Sensitivity test	More plays make learning harder, but the proposed method stays below baselines.	It does not resolve large-scale deployment cost when $K$ is very large.
Varying movement cost $\eta$	Robustness-style sensitivity test	The algorithm remains better than baselines under different cost scales.	It does not tell operators how to calibrate real switching, latency, or migration costs.
Varying reward standard deviation $\sigma$	Noise sensitivity test	Higher uncertainty does not break the sublinear pattern in the tested setting.	It does not cover heavy-tailed failures, bursty traffic, or adversarial demand shifts.

The baseline algorithms, OnlinActPrf and OnlinActPrf-v, show roughly linear regret in the plots, while MSB-PRS-ApUCB remains much lower and plateaus in the synthetic settings. That is useful evidence that the algorithm behaves like the theory predicts.

But the experiments are not a production benchmark. They do not simulate real request queues, latency deadlines, batching behavior, cold starts, model heterogeneity, GPU fragmentation, autoscaling delays, or service-level penalties. The paper is not claiming that it has solved LLM serving. It is claiming a theoretical and algorithmic foundation for a specific class of prioritized stochastic allocation problems.

That distinction is not academic nitpicking. It is the difference between “this can guide a routing layer” and “please deploy this in front of your largest enterprise customers by Friday.” The second sentence is how outages are born.

What Cognaptus infers for AI infrastructure

The paper directly shows a model, lower bounds, an offline oracle, an approximate UCB algorithm, regret guarantees, and synthetic validation. The business inference is broader but should be kept disciplined.

For AI platforms, the main lesson is that service tiers should be modeled inside the allocation algorithm, not bolted on afterward. If enterprise users, free users, batch jobs, internal agents, and latency-sensitive tasks share the same compute fabric, then priority affects the expected value of every assignment. A scheduler that first chooses “best server” and then applies priority rules may be optimizing the wrong object.

A practical architecture inspired by the paper would have three layers:

Layer	Role	MSB-PRS-inspired design question
Measurement layer	Estimate reward and capacity behavior by resource.	What is the probability that this instance can serve the first, second, or third task assigned to it under current conditions?
Allocation layer	Solve priority-aware assignment.	Which tasks should share which resource, given rank, priority, and movement cost?
Learning layer	Explore uncertain resources carefully.	Which assignments are worth trying because uncertainty is still materially affecting routing quality?

For LLM serving, “reward” could be mapped to business value, latency-adjusted success, throughput contribution, quality-adjusted completion, or SLA-weighted utility. Capacity could mean available inference slots, effective batching capacity, context-window pressure, GPU memory availability, or model-instance readiness. Movement cost could represent latency penalty, routing overhead, model-switching cost, data locality, privacy constraint, or cache miss.

Those mappings are Cognaptus interpretations, not claims directly tested by the paper. The paper gives the mathematical skeleton. Production systems must decide what utility means.

Where the model applies, and where it does not yet apply

MSB-PRS is most relevant when four conditions hold.

First, multiple tasks must be allocated simultaneously or repeatedly across shared resources. Second, those resources must have uncertain capacity. Third, tasks must have known priority weights or service-tier values. Fourth, assignment must involve cost or feasibility constraints.

That covers many plausible AI and edge scenarios. It does not cover everything.

The paper assumes stationary reward and capacity distributions. Real infrastructure can drift by hour, region, model version, traffic mix, failure incident, and pricing experiment. The paper assumes known priority weights and movement costs. In real products, those are often policy choices disguised as numbers. The paper models capacity as stochastic but not as a full queueing system with latency distributions, deadlines, backpressure, retries, and cancellations. It also uses synthetic experiments rather than production traces.

The computational cost is also worth respecting. $O(M^3K^3)$ is far better than $M^K$, but it is not free. If a platform treats every request as a separate play and every model replica as a separate arm at high frequency, implementation would need batching, decomposition, approximation, or hierarchical control. Theory gives the control logic; engineering still has to pay the invoice.

The authors themselves flag an open theoretical issue: the regret upper and lower bounds do not fully match. Closing those gaps remains unresolved. That does not make the result weak. It makes it honest.

The useful lesson is not “use this algorithm”; it is “model the queue correctly”

The best business articles about bandit papers usually avoid telling executives to implement the algorithm tomorrow. This one is no exception.

The durable lesson is that priority changes the shape of learning. Once high-priority and low-priority tasks share stochastic capacity, routing is no longer a collection of independent arm choices. It becomes a rank-aware allocation problem under uncertainty. The matching formulation explains the structure. The approximate UCB algorithm shows how learning can exploit that structure. The regret bounds show that the difficulty is fundamental, not merely a poor baseline having a bad afternoon.

For AI infrastructure teams, that suggests a useful diagnostic question:

Does our routing system optimize assignment under priority-aware capacity sharing, or does it optimize server choice first and clean up priority conflicts afterward?

If the answer is the second one, the system may still work. Many systems work by accumulating exceptions until the exceptions become the architecture. But MSB-PRS explains why that design can become expensive as service tiers, workloads, and compute options multiply.

The paper does not hand us a production-ready LLM serving stack. It gives something more foundational: a way to think about scarce, tiered capacity without pretending that all tasks stand politely in the same line.

That is a good start. Especially because the line was never polite.

Cognaptus: Automate the Present, Incubate the Future.

Hong Xie, Haoran Gu, Yanying Huang, Tao Tan, and Defu Lian, “Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing,” arXiv:2512.21626, 2025, https://arxiv.org/abs/2512.21626. ↩︎

The real mechanism is not bandits first; it is priority sharing first#

Why top arms are not automatically optimal#

The offline problem becomes a matching problem with rank-aware nodes#

The learning problem is harder because uncertainty has two faces#

The regret results say the difficulty is real, not cosmetic#

The experiments are sanity checks, not production validation#

What Cognaptus infers for AI infrastructure#

Where the model applies, and where it does not yet apply#

The useful lesson is not “use this algorithm”; it is “model the queue correctly”#

The real mechanism is not bandits first; it is priority sharing first