Bandits, Budgets, and the Art of Waiting: How Delay-Aware Algorithms Rewire Resource Allocation

Budgets arrive before outcomes. That is the small administrative tragedy behind many allocation systems.

A university decides which students receive financial aid before it knows who will persist. A workforce programme assigns training slots before employment outcomes appear. A healthcare provider prioritises interventions before the full treatment effect is visible. The decision is immediate; the evidence drips in later, usually after the next decision has already been made. Naturally, many algorithms pretend this is not happening. Very elegant. Also very wrong.

The paper behind this article, Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback, proposes MetaCUB: a bi-level contextual bandit framework designed for this awkward operating reality.¹ Its central move is not simply “use a better bandit”. The important mechanism is a split: allocate scarce resources across demographic subgroups first, then target individuals within those groups. Around that split, the authors add the machinery that real institutions keep ruining clean models with: delayed feedback, cohort turnover, budget constraints, and cooldown periods.

That matters because most allocation systems fail in one of two predictable ways. They optimise individual-level predictions so aggressively that whole subgroups get squeezed out, or they enforce group-level equity so bluntly that they ignore who within a group is most likely to benefit. MetaCUB tries to avoid both failures. It is fairness-aware without becoming purely quota-driven, and personalised without pretending group-level allocation has no policy consequences.

The result is a useful paper, though not a magical one. It shows strong simulation results on education and job-training datasets. It does not prove that an AI allocator is “fair” in the broad moral, legal, or institutional sense. That distinction is not pedantry. It is the difference between a deployable decision-support component and a slide deck that says “responsible AI” in 42-point font.

The real mechanism is two decisions, not one

A flat contextual bandit treats each feasible individual-resource pairing as an arm. If a student, trainee, patient, or customer has features $x_i$, and a resource $r$ is available, the algorithm estimates the reward from assigning $r$ to $i$, adds an uncertainty bonus, and chooses promising actions over time. This works tolerably well when the environment is simple: stable population, immediate reward, no group-level mandate, no awkward institutional constraint.

MetaCUB starts from the observation that high-stakes resource allocation is not one decision. It is at least two.

First, the institution decides how much of each resource should flow to each subgroup. This is the meta level. It operates over subgroup-resource budget shares: how much resource type A should go to group 1, how much resource type B should go to group 2, and so on. The policy lives on a probability simplex, meaning the shares must add up to a feasible allocation rather than becoming a fantasy spreadsheet where every department gets 140% of the budget.

Second, after those subgroup budgets are set, the base level chooses individuals within each subgroup. It uses contextual information and an uncertainty-aware scoring rule to identify likely high-benefit recipients. In simplified form, each individual-resource score has the familiar UCB flavour:

$$ G_{i,r} = \hat{y}_{i,r} + \beta\, u_{i,r} $$

Here, $\hat{y}_{i,r}$ is the predicted benefit of giving resource $r$ to individual $i$, $u_{i,r}$ is an uncertainty estimate, and $\beta$ controls exploration. The base level therefore still personalises. It just does so inside subgroup-level budget boundaries.
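To make the scoring rule concrete, here is a minimal sketch of base-level selection inside one subgroup. It assumes the predictions and uncertainty estimates already exist; the names `ucb_score`, `select_within_group`, and the example numbers are hypothetical illustrations, not the authors' code.

```python
from typing import List, Tuple

# (individual id, predicted benefit, uncertainty estimate); layout is an assumption
Candidate = Tuple[str, float, float]


def ucb_score(predicted: float, uncertainty: float, beta: float) -> float:
    """Optimistic score: predicted benefit plus beta times uncertainty."""
    return predicted + beta * uncertainty


def select_within_group(candidates: List[Candidate], budget: int, beta: float) -> List[str]:
    """Return the ids of the `budget` highest-scoring candidates in one subgroup."""
    ranked = sorted(candidates, key=lambda c: ucb_score(c[1], c[2], beta), reverse=True)
    return [cid for cid, _, _ in ranked[:max(budget, 0)]]


# Hypothetical subgroup: three people competing for two slots of one resource.
group = [("a", 0.60, 0.00), ("b", 0.50, 0.30), ("c", 0.40, 0.10)]
greedy = select_within_group(group, budget=2, beta=0.0)     # ranks on prediction only
exploring = select_within_group(group, budget=2, beta=1.0)  # uncertainty lifts "b"
```

With $\beta = 0$ the rule is purely greedy. Raising $\beta$ moves the uncertain candidate "b" to the top of the ranking, which is the exploration behaviour the formula is meant to buy.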

This is the paper’s main architectural contribution. Its relevance to fairness does not come from sprinkling a constraint over a single optimiser. It comes from decomposing the allocation problem into two layers with different responsibilities.
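The meta-level step can be sketched in the same spirit. The snippet below assumes a single simplex over all subgroup-resource pairs and uses largest-remainder rounding to turn shares into whole slots; both choices, and the function names, are illustrative assumptions rather than the paper's procedure.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (subgroup, resource); hypothetical key layout


def normalise_shares(weights: Dict[Pair, float]) -> Dict[Pair, float]:
    """Rescale non-negative weights so they sum to 1 (a point on the simplex)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {pair: w / total for pair, w in weights.items()}


def shares_to_slots(shares: Dict[Pair, float], total_slots: int) -> Dict[Pair, int]:
    """Largest-remainder rounding: integer slots that sum exactly to total_slots."""
    raw = {pair: s * total_slots for pair, s in shares.items()}
    slots = {pair: int(v) for pair, v in raw.items()}
    leftover = total_slots - sum(slots.values())
    # Hand the remaining slots to the pairs with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda p: raw[p] - slots[p], reverse=True)
    for pair in by_remainder[:leftover]:
        slots[pair] += 1
    return slots


# Hypothetical weights for two groups and two resource types.
shares = normalise_shares({("g1", "A"): 2.0, ("g2", "A"): 1.0, ("g1", "B"): 1.0})
slots = shares_to_slots(shares, total_slots=10)
```

The point of the rounding step is mundane but real: shares on a simplex are fractions, and institutions hand out whole scholarships.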

| Layer | Decision | Operational meaning | Failure it tries to avoid |
|---|---|---|---|
| Meta level | Allocate resource shares across subgroups | Keeps the global allocation aligned with equity and budget constraints | Majority or high-predicted-reward groups absorbing too much capacity |
| Base level | Select individuals within each subgroup | Uses individual context to target likely beneficiaries | Crude group quotas that ignore within-group heterogeneity |
| Delay/cooldown layer | Pace and time the feedback loop | Models when effects appear and when repeat allocation is allowed | Treating slow outcomes as failure, or over-serving the same person too soon |

That table is the paper in miniature. The rest of the model exists because real allocation is temporal, not just statistical.

Delay is not noise; it is part of the product

The paper’s delay model is more interesting than the usual “reward arrives after $d$ rounds” treatment. MetaCUB uses resource-specific delay kernels. A delay kernel describes how the reward from an allocation is distributed across future rounds. In the paper, these kernels are built from discretised Beta distributions, which can represent early, late, peaked, or more diffuse feedback profiles.
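A discretised Beta kernel is easy to sketch. The version below evaluates the unnormalised Beta density at bin midpoints and renormalises, which is one simple discretisation among several; the shape parameters and horizon are hypothetical and are not the paper's Type-I or Type-II settings.

```python
from typing import List


def beta_delay_kernel(a: float, b: float, horizon: int) -> List[float]:
    """Weights over delays 0..horizon-1 shaped like a Beta(a, b) density.

    Uses a midpoint approximation: the density is evaluated at the centre of
    each of `horizon` equal bins on (0, 1), then normalised to sum to 1.
    """
    if a <= 0 or b <= 0 or horizon < 1:
        raise ValueError("a and b must be positive and horizon at least 1")
    midpoints = [(k + 0.5) / horizon for k in range(horizon)]
    density = [x ** (a - 1) * (1 - x) ** (b - 1) for x in midpoints]
    total = sum(density)
    return [d / total for d in density]


# Hypothetical shapes: an early-peaked profile and a symmetric one.
early_peaked = beta_delay_kernel(2.0, 5.0, horizon=8)
symmetric = beta_delay_kernel(2.0, 2.0, horizon=8)
```

Each kernel sums to one, so it redistributes an effect over time without inflating or shrinking it. Changing `a` and `b` moves the bulk of the weight earlier or later.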

The practical point is straightforward. A tutoring intervention, a scholarship package, a job-training course, and a counselling session do not mature on the same schedule. Some effects appear quickly. Some arrive slowly. Some are spread out enough that impatient measurement will call them ineffective before they have had time to work.

MetaCUB models observed reward at time $t$ as the accumulated delayed effect of previous allocations. In words: today’s observed outcome does not reflect today’s decision alone; it is the residue of older decisions whose effects are now becoming visible. In the paper’s formulation, the reward at time $t$ sums prior allocations weighted by the relevant resource delay kernel.
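In code, that accumulation is a causal convolution. The sketch below handles one resource type and assumes `effects[s]` is the total eventual effect of what was allocated in round `s`; with several resources, the per-resource results would be summed. The names and numbers are hypothetical.

```python
from typing import List, Sequence


def observed_rewards(effects: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Spread each round's eventual effect over later rounds using a delay kernel.

    kernel[d] is the fraction of an effect that becomes visible d rounds after
    the allocation. Effects that would land beyond the last round are dropped.
    """
    rounds = len(effects)
    observed = [0.0] * rounds
    for s, effect in enumerate(effects):
        for d, weight in enumerate(kernel):
            if s + d < rounds:
                observed[s + d] += weight * effect
    return observed


# Hypothetical kernel: nothing visible immediately, half after one round, half after two.
kernel = [0.0, 0.5, 0.5]
single = observed_rewards([1.0, 0.0, 0.0, 0.0], kernel)
overlapping = observed_rewards([1.0, 1.0, 0.0, 0.0], kernel)
```

In the second example the reward seen in round 2 mixes two earlier allocations. A delay-agnostic learner would credit it to whatever was done in round 2, which is exactly the misattribution the kernel is there to prevent.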

This is not merely mathematical neatness. In a business or public-sector setting, delay-aware accounting changes the interpretation of performance. A programme may look weak because it is bad. It may also look weak because its effect is slow and the dashboard is impatient. MetaCUB tries to separate those two possibilities.

There is a boundary, though. The paper fixes delay kernels from plausible timing profiles during the experiments. It notes that other normalised kernels could be used and that adaptive or meta-learned kernels are possible, but the empirical evidence here is not a field test of learned real-world delay distributions. So the business takeaway is not “we can now infer all intervention timing automatically”. It is narrower and more useful: if you already have or can estimate a credible impact timeline, the allocation policy should use it rather than pretending reward is immediate.

Cohorts and cooldowns make the model less toy-like

The paper also models population change. Individuals arrive and exit in fixed-length cohorts: eight semesters for the education setting and twelve months for the workforce setting. Only the active cohort can receive resources during its active window. Once the cohort leaves, the algorithm must act on the next pool.
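A minimal sketch of that cohort logic, assuming cohorts follow one another back to back with no overlap (an assumption for illustration; the names are hypothetical):

```python
from typing import List, Sequence


def cohort_index(round_: int, cohort_length: int) -> int:
    """Which cohort is active in a given round, counting rounds from 0."""
    if round_ < 0 or cohort_length < 1:
        raise ValueError("round_ must be >= 0 and cohort_length >= 1")
    return round_ // cohort_length


def active_members(cohorts: Sequence[List[str]], round_: int, cohort_length: int) -> List[str]:
    """Individuals who can receive resources this round; empty once cohorts run out."""
    idx = cohort_index(round_, cohort_length)
    return list(cohorts[idx]) if idx < len(cohorts) else []


# Hypothetical education setting: each cohort is active for eight semesters.
cohorts = [["s1", "s2"], ["s3"]]
first_window = active_members(cohorts, round_=7, cohort_length=8)
second_window = active_members(cohorts, round_=8, cohort_length=8)
```

The consequence for learning is that anything the algorithm learned about a specific person expires with the cohort; only what it learned about contexts carries over.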

This is a simple modelling choice with serious operational implications. Many institutions do not allocate into a timeless population. They allocate into intake cycles, semester groups, hiring cohorts, patient panels, eligibility windows, subscription cohorts, and annual budget periods. A model that assumes the same arms remain available forever is solving a cleaner problem than the institution has.

Cooldowns add another dose of realism. After receiving a resource, an individual may become ineligible to receive the same resource again for a number of rounds. In the experiments, cooldown periods are sampled uniformly from one, two, or three rounds. In practice, cooldowns can represent treatment spacing, intervention fatigue, compliance rules, capacity pacing, or the obvious fact that giving the same person the same support again tomorrow may not be clever just because the model has developed a crush on their feature vector.
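Cooldown bookkeeping can be sketched as a small eligibility tracker. It assumes a cooldown of `c` rounds blocks the `c` rounds after the allocation round, and it draws `c` uniformly from a configurable set, defaulting to the one, two, or three rounds used in the experiments. The class and method names are hypothetical.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple


class CooldownTracker:
    """Tracks when each (individual, resource) pair may be served again."""

    def __init__(self, choices: Sequence[int] = (1, 2, 3), seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._choices = tuple(choices)
        # (individual, resource) -> first round in which the pair is eligible again
        self._eligible_from: Dict[Tuple[Hashable, Hashable], int] = {}

    def is_eligible(self, individual: Hashable, resource: Hashable, round_: int) -> bool:
        return round_ >= self._eligible_from.get((individual, resource), 0)

    def record_allocation(self, individual: Hashable, resource: Hashable, round_: int) -> int:
        """Sample a cooldown length and block the pair for that many following rounds."""
        cooldown = self._rng.choice(self._choices)
        self._eligible_from[(individual, resource)] = round_ + 1 + cooldown
        return cooldown


# Hypothetical usage with a fixed two-round cooldown for readability.
tracker = CooldownTracker(choices=(2,))
length = tracker.record_allocation("s1", "tutoring", round_=5)
```

Cooldowns here are per resource: being blocked from more tutoring does not block the same person from a different resource type, which matches the "same resource again" wording above.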

The combination of delay kernels and cooldowns matters. Delay models when effects become visible. Cooldown models when repeat action becomes permissible. One is measurement timing; the other is operational pacing. Conflating them is a nice way to produce policy theatre.

What the experiments are actually testing

The paper evaluates MetaCUB on two real-world datasets used in simulation.

The first is the Educational Longitudinal Study (ELS) dataset, treated as a GPA regression task. It has 1,179 instances, 40 features, and four resource types representing financial aid packaging. The second is the JOBS dataset, built from the National Supported Work (NSW) sample and Panel Study of Income Dynamics (PSID) controls, treated as a binary employment classification task. The appendix table lists 2,935 JOBS instances, eight features, and one resource type.

The authors compare MetaCUB against several baselines: UCB, LinUCB, CUCB, EXP3, mEXP3, DUCB, and SWUCB. The comparison is deliberately broad. Some baselines represent classical reward optimism. Some bring individual context. Some handle combinatorial choices. Some adapt to non-stationarity or adversarial/delayed feedback. What they generally lack is MetaCUB’s combination of subgroup-aware budget allocation, individual targeting, explicit delay kernels, cohort churn, and cooldowns.

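The experiments vary four dimensions: dataset, feedback timing, outcome mapping, and delay-kernel type. The fairness tables and the lemma sit alongside them as separate kinds of evidence: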

| Experimental component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| ELS and JOBS datasets | Main evidence across education and workforce settings | MetaCUB is not tuned to one domain shape | Generalisation to live institutional deployment |
| Immediate vs delayed feedback | Main evidence for delay sensitivity | Delay-aware modelling helps when rewards arrive late | That the chosen delay kernels match real organisations |
| Linear vs nonlinear outcome mappings | Representation comparison | Flexible predictors reduce regret relative to linear assumptions | That neural networks are always necessary or sufficient |
| Type-I vs Type-II delay kernels | Robustness/sensitivity test | MetaCUB remains resilient under different feedback timing profiles | Automatic discovery of true intervention timing |
| Fairness ratio tables | Empirical fairness evidence | Subgroup allocation coverage stays closer to parity | Individual fairness, causal fairness, or legal compliance |
| Lemma and appendix proof | Theoretical support under assumptions | Bi-level structure can reduce disparity relative to flat allocation | A universal guarantee under arbitrary budgets, predictors, or populations |

This distinction matters because the paper’s results are easiest to overstate. The cumulative regret plots are the main performance evidence. The delay-kernel variants are robustness-style tests around feedback timing. The fairness tables support subgroup allocation balance. The appendix proof formalises the intended disparity reduction mechanism under stated assumptions, including coverage and predictor fidelity. These pieces reinforce one another, but they are not interchangeable.

A regret curve is not a compliance audit. A fairness ratio is not a causal validation study. A proof under assumptions is not a guarantee that your procurement committee will suddenly become reasonable.

The evidence: MetaCUB wins most clearly when feedback is late

Across the reported cumulative regret plots, MetaCUB has the lowest regret trajectory in all tested scenarios: ELS and JOBS, Type-I and Type-II delay kernels, immediate and delayed feedback, linear and nonlinear outcome mappings. The paper does not provide a compact table of final regret values, so the safest interpretation is qualitative from the plotted trajectories and the authors’ discussion rather than a manufactured percentage improvement. We do not need fake precision. The pattern is already clear.
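For readers who have not met the metric, cumulative regret is the running total of the gap between what a reference policy would have earned and what the algorithm actually earned. A minimal sketch, with a hypothetical oracle series standing in for the reference policy (which reference the paper uses is not restated here):

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_regret(reference: Sequence[float], achieved: Sequence[float]) -> List[float]:
    """Running sum of per-round gaps between reference and achieved reward."""
    if len(reference) != len(achieved):
        raise ValueError("both series must cover the same rounds")
    return list(accumulate(r - a for r, a in zip(reference, achieved)))


# Hypothetical three-round run.
curve = cumulative_regret([1.0, 1.0, 1.0], [0.5, 1.0, 0.25])
```

A flatter curve means the algorithm is losing less per round as it learns, which is why the plots are read by trajectory rather than by a single end value.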

The performance gap is most important in delayed-feedback environments. That is where conventional bandits struggle because they update policy from signals that arrive late, partially, or in temporally diluted form. DUCB and SWUCB, which are designed to adapt under non-stationarity by decaying or windowing historical reward information, improve robustness relative to simpler baselines. But they still lack the subgroup-level budget mechanism and explicit kernelised delay modelling that MetaCUB uses.
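The decaying and windowing ideas can be sketched in a few lines. This shows only the reward-averaging step that discounted and sliding-window UCB variants build on, not the full algorithms and not the paper's baseline implementations; the function names are hypothetical.

```python
from typing import Sequence


def discounted_mean(rewards: Sequence[float], gamma: float) -> float:
    """Average of rewards (oldest first) where older rewards are geometrically down-weighted."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    weighted_sum = 0.0
    weight_total = 0.0
    for reward in rewards:
        weighted_sum = gamma * weighted_sum + reward
        weight_total = gamma * weight_total + 1.0
    return weighted_sum / weight_total if weight_total else 0.0


def sliding_window_mean(rewards: Sequence[float], window: int) -> float:
    """Plain average over the most recent `window` rewards only."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = list(rewards)[-window:]
    return sum(recent) / len(recent) if recent else 0.0


# Hypothetical arm whose reward switched from 0 to 1 halfway through.
history = [0.0, 0.0, 1.0, 1.0]
decayed = discounted_mean(history, gamma=0.5)
windowed = sliding_window_mean(history, window=2)
```

Both estimates track the recent shift faster than a plain average would. Neither knows why a reward arrived late, which is the gap the delay kernels fill.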

The delay-kernel comparison is particularly useful. Type-I kernels are more peaked, with reward signals concentrated earlier or around an early-to-mid time window. These are easier learning environments because feedback becomes informative sooner. Type-II kernels are flatter and more dispersed, which spreads reward information across time and increases uncertainty. In those broader-delay settings, delay-agnostic methods accumulate regret more quickly, while MetaCUB remains more resilient.

There is also an important representation result. The paper reports that algorithms under nonlinear reward functions incur substantially lower regret than their linear counterparts. This is not shocking; institutional outcomes are rarely polite enough to be linear. But it is operationally relevant. If the reward model is too rigid, the bandit layer is optimising over a distorted map. The allocation policy can only be as good as the outcome surface it is allowed to see.

For executives, this is the uncomfortable translation: the bandit is not the whole product. The reward model, delay model, constraints, and cohort logic are part of the product. Buying “an optimiser” while ignoring those pieces is like buying a steering wheel and calling it a car.

The fairness result is strong, but narrower than the word suggests

The paper’s fairness analysis focuses on subgroup allocation balance. The metric is the ratio of subgroup allocation rates to the overall mean, with values closer to 1.0 indicating more equitable coverage. This is a sensible institutional metric when the policy mandate is group-level coverage, but it is not the whole philosophy department packed into a decimal.
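A sketch of that ratio follows. It reads "overall mean" as the unweighted mean of the subgroup allocation rates, which is an interpretive assumption; a population-weighted mean is the other natural reading. The function name and the counts are hypothetical.

```python
from typing import Dict


def coverage_ratios(allocated: Dict[str, int], group_size: Dict[str, int]) -> Dict[str, float]:
    """Each subgroup's allocation rate divided by the mean rate across subgroups."""
    rates = {g: allocated.get(g, 0) / size for g, size in group_size.items()}
    mean_rate = sum(rates.values()) / len(rates)
    if mean_rate == 0:
        raise ValueError("no allocations recorded, ratios are undefined")
    return {g: rate / mean_rate for g, rate in rates.items()}


# Hypothetical counts: group A is served at three times the rate of group B.
ratios = coverage_ratios({"A": 30, "B": 10}, {"A": 100, "B": 100})
balanced = coverage_ratios({"A": 20, "B": 10}, {"A": 100, "B": 50})
```

In the first case the ratios come out at roughly 1.5 and 0.5, the kind of spread the flat baselines show. In the second, both groups are served at the same rate despite different sizes, so both ratios sit at 1.0.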

The results are nevertheless notable. In the ELS table, MetaCUB’s delayed-feedback ratios are close to parity across reported groups: Asian 1.02, White 0.96, Black 1.00, Hispanic 0.97, and multiracial 0.97. Under immediate feedback, most MetaCUB values are also close to parity, though Asian coverage is lower at 0.84. That detail matters because “consistently near parity” should not be lazily converted into “perfectly fair everywhere”. Precision is free; we should use it.

For JOBS, MetaCUB reports Black ratios of 1.03 under immediate feedback and 0.98 under delayed feedback, while Hispanic ratios are 0.99 and 0.98. Baselines vary more widely. For example, UCB reports JOBS Black coverage of 0.76 immediate and 0.54 delayed, while SWUCB reports 1.30 delayed for Black and 0.75 delayed for Hispanic. In ELS, UCB over-allocates to White students in the table, with ratios of 1.29 immediate and 1.42 delayed, while several minority subgroup ratios sit far below 1.0.

That is the paper’s fairness story: the bi-level structure helps keep subgroup coverage closer to parity while preserving individual targeting inside each subgroup. It does not show that every similarly situated individual receives similar treatment. It does not establish that the protected attributes are being handled in a legally sufficient way. It does not prove that historical data used to train the reward model are free of structural bias. It models a specific fairness mandate and shows the architecture aligns well with that mandate under simulation.

This is still valuable. In many real allocation settings, subgroup monitoring is exactly what compliance and policy teams track. The problem is not that subgroup parity is irrelevant. The problem is that people keep mistaking one fairness metric for moral completion. MetaCUB avoids one common allocation failure; it does not retire ethics.

The business value is better timing, not just better targeting

For business and public-sector operators, the practical relevance lies in settings with five properties:

Resources are scarce.
Recipients differ in likely responsiveness.
Outcomes are delayed.
Eligibility changes over time.
Allocation must be monitored across subgroups.

That covers more terrain than education and job training. It includes employee reskilling budgets, customer retention offers, health interventions, grant programmes, financial hardship support, sales enablement resources, public-benefit services, and some forms of credit assistance. The exact legal and ethical constraints differ, but the structure is familiar: decide now, observe later, explain always.

MetaCUB’s design suggests three business lessons.

First, delayed feedback should be modelled before the KPI meeting, not apologised for during it. If a company knows that an intervention has a 90-day maturation curve, then evaluating it after 14 days and reallocating budget away from it is not data-driven discipline. It is impatience with charts. Delay kernels offer a formal way to encode expected timing and prevent slow-burn interventions from being punished prematurely.

Second, fairness-aware allocation works better when global and local decisions are separated. A central policy layer can enforce subgroup coverage or budget proportions, while a local decision layer can still select individuals based on expected benefit. This is often more operationally realistic than asking one flat model to optimise reward, fairness, capacity, pacing, and uncertainty all at once. One model can do many things. It can also do many things badly.

Third, institutional constraints are not noise. Cohorts, cooldowns, budgets, and eligibility windows are not annoying exceptions to the “real” optimisation problem. They are the real optimisation problem. A system that ignores them will look impressive in a notebook and then quietly fail in deployment, where the finance department, legal team, and programme office continue existing with their usual inconsiderate enthusiasm.
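The first lesson is easy to quantify. Given any delay kernel, the share of an intervention's effect that is visible by a review date is just a partial sum. The two 90-day profiles below are hypothetical shapes chosen for arithmetic clarity, not estimates of any real programme.

```python
from typing import Sequence


def fraction_observed(kernel: Sequence[float], elapsed: int) -> float:
    """Share of the total effect that has arrived within the first `elapsed` periods."""
    total = sum(kernel)
    if total <= 0:
        raise ValueError("kernel must have positive total weight")
    return sum(kernel[:max(elapsed, 0)]) / total


# Hypothetical 90-day maturation curves, one weight per day.
uniform_90 = [1.0] * 90                               # effect arrives evenly
late_ramp_90 = [float(day) for day in range(1, 91)]   # effect builds over time

seen_uniform = fraction_observed(uniform_90, 14)   # 14/90, roughly 0.156
seen_ramp = fraction_observed(late_ramp_90, 14)    # 105/4095, roughly 0.026
```

Under the even profile, a 14-day review sees about 16% of the eventual effect. Under the back-loaded one it sees under 3%. Cutting the budget on that evidence is a decision about the calendar, not about the programme.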

Where the result stops

The paper’s evidence is simulation-based. That does not make it weak, but it defines the boundary. The experiments use real datasets as substrate, then simulate allocation processes, feedback timing, cohorts, and cooldowns. A live deployment would still need causal validation: does the intervention actually cause the predicted improvement for the individuals selected, or is the model learning historical correlations dressed up as treatment response?

The reward model is another dependency. MetaCUB relies on a learned function mapping individual context to predicted outcomes. If that predictor is poorly calibrated, biased, unstable across cohorts, or blind to unmeasured constraints, the bandit can optimise the wrong thing very efficiently. This is the old algorithmic governance joke: optimisation is only impressive after someone has checked the objective is not ridiculous.

The fairness guarantee also depends on assumptions. The lemma’s disparity reduction argument relies on coverage, base-level convergence, and sufficient predictor fidelity. In constrained environments, those assumptions may be strained. If budgets are tiny, eligibility rules are tight, or subgroup definitions interact badly with resource availability, the feasible set may not support both strong subgroup coverage and strong individual targeting. The paper acknowledges this in its appendix discussion: individual-level fairness can be incorporated at the base level when properly defined, but feasibility may require softened penalties.

Finally, delay kernels are fixed in the reported experiments. Real organisations may not know the correct timing profile of each resource. Estimating it is itself a serious measurement problem. Getting it wrong can shift reward attribution across time and distort the learning loop. MetaCUB gives a useful architecture for delay-aware allocation, but the timing model still has to earn its keep.

The useful lesson is architectural realism

MetaCUB is not interesting because it adds one more acronym to the bandit zoo. It is interesting because it rearranges the allocation problem into a shape that resembles how institutions actually operate.

Global budgets come first. Local targeting comes next. Effects arrive late. People enter and leave in cohorts. Repeat service has limits. Equity is monitored at subgroup level. The model is built around those facts instead of treating them as unfortunate contamination from the outside world.

That is the quiet contribution of the paper. It moves adaptive allocation away from the fantasy of instantaneous, frictionless optimisation and toward a more administrative kind of intelligence: allocate carefully, wait intelligently, update without forgetting who has been left out.

For organisations, the immediate lesson is not to deploy MetaCUB tomorrow morning and call the governance work finished. Please do not give procurement that kind of encouragement. The better lesson is to audit allocation systems for timing blindness. Where outcomes are delayed, populations churn, and fairness is measured across groups, a flat optimiser is probably solving the wrong problem.

Waiting, it turns out, is not the absence of intelligence. Sometimes it is the thing the algorithm has to learn.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mohammadsina Almasi and Hadis Anahideh, “Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback,” arXiv:2511.10572, 2025, https://arxiv.org/abs/2511.10572.
