TL;DR for operators

Many tabular foundation models behave like very competent consultants with a mildly expensive habit: they want the entire labelled training set placed in front of them at inference time. That works neatly on small datasets. It becomes rather less charming when the table grows to tens or hundreds of thousands of rows and the model’s attention cost starts behaving like it has discovered compound interest.

The paper introduces CRUMB, short for Clustered Retrieval Using Minimised-MMD Batching, a training-free wrapper for prior-fitted networks (PFNs). It does not retrain the model, rewrite the architecture, or fine-tune adapters. Instead, it changes what the model sees at inference time. CRUMB clusters the test queries, selects a small training subset for each test cluster by greedily minimising maximum mean discrepancy, and then runs one PFN forward pass per cluster [1].

The operational point is simple: instead of feeding every query the whole table, feed each batch a smaller context that is matched to where that batch lives in feature space. The paper’s main result is that this is not just random subsampling with a lab coat. On 51 TabArena datasets using TabICL-v2, CRUMB significantly outperforms uniform subsampling and MICP at the same context budget, while remaining statistically close to per-query kNN. It does this with 20 PFN forward passes instead of 200 in the main comparison, producing a reported average PFN inference time of 35 seconds versus 361 seconds for per-query kNN.

The evidence also matters under drift. In a controlled covariate-drift experiment, CRUMB’s accuracy advantage over MICP grows from +4.9 percentage points with no drift to +17.1 percentage points under the strongest simulated drift. The interpretation is not that CRUMB magically solves distribution shift. Please, let us remain adults. The point is narrower and more useful: because CRUMB clusters the current test batch and selects contexts against that batch, it can adapt its retrieval target when the query population moves.

For business use, CRUMB reframes PFN deployment as a context-routing problem. The question is no longer only “Which model should we use?” It becomes “Which evidence should each inference batch receive, and how often should that routing be refreshed?” The boundary is equally clear: CRUMB is best suited to batch scoring or micro-batch settings where test queries are available upfront. In pure streaming mode it starts to resemble cached routing, which reduces its distinctive distribution-matching advantage unless refreshed periodically.

The expensive part is not training; it is deciding what the model gets to see

The usual enterprise instinct is to treat model cost as a training problem. Training is visible. Training has procurement line items, GPU dashboards, irritated finance teams, and a clear villain. PFNs complicate that instinct because their promise is partly that they avoid dataset-specific training. A prior-fitted network is pretrained over synthetic tasks and then performs in-context learning on a real tabular dataset: give it labelled training examples and test queries, and it predicts in a forward pass.

That is elegant until the training context becomes large. In the PFN setting, the labelled training table is not just historical data sitting somewhere in storage. It is part of the model input. When sample-level attention scales quadratically with the number of training samples, “just include the whole table” stops being a design principle and starts being a denial strategy.
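As a back-of-envelope sketch of why the quadratic term bites, assume the sample-level attention cost is purely quadratic in context size and ignore every other cost. The symbol $\rho$ is introduced here for the fraction of the table kept as context; it is not the paper's notation.

```latex
\[
\frac{\operatorname{cost}(\rho N)}{\operatorname{cost}(N)}
  = \frac{(\rho N)^2}{N^2}
  = \rho^2,
\qquad
\rho = 0.1 \;\Rightarrow\; \rho^2 = 0.01 .
\]
```

Under that simplification, a context one tenth the size costs roughly one hundredth as much per forward pass, which is why context selection is worth the trouble at all.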

The obvious workaround is context selection. Select fewer rows. Run faster. Pretend nothing important was discarded. This is the ancient corporate art of hoping the spreadsheet you did not inspect contained nothing embarrassing.

Uniform subsampling is the crude version: choose a smaller random subset of training rows and hope it represents the task well enough. Per-query nearest-neighbour retrieval is the more targeted version: for every test point, retrieve nearby training examples. That often gives better context, but it creates a new operational nuisance. If every query receives its own context, every query requires its own PFN call. The model becomes more accurate by becoming less batchable. This is not a victory; it is a cost transfer with better manners.

CRUMB is designed around that tension. It tries to keep the quality of local retrieval while preserving the efficiency of batching. Its answer is not “retrieve for each point.” It is “retrieve for each test cluster.”

That distinction is the paper’s real mechanism.

CRUMB starts by clustering the question, not the archive

The first design choice is easy to miss because it sounds mundane: CRUMB clusters the test queries, not the training data.

This matters. MICP, the closest baseline discussed in the paper, clusters the training data and routes test points to training-cluster centroids. That gives a reusable routing structure. It also anchors the system to the historical training distribution. If the incoming query population moves, the routing map may still be beautifully organised around yesterday’s geography. Very tidy. Very wrong.

CRUMB reverses the emphasis. It first partitions the current test set into $K$ clusters using k-means on standardised input features. The default setting in the paper is $K=20$. At one extreme, $K=1$ collapses the method into a single shared context for all test points. At the other extreme, $K=T$ becomes per-query retrieval, which recovers locality but destroys batching. CRUMB lives in the middle: enough clusters to create localised query groups, few enough clusters to preserve batched inference.
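A minimal sketch of the test-side step, assuming plain Lloyd's k-means written in NumPy. The function names `standardise` and `kmeans` are illustrative, not the paper's code; in practice a library routine such as scikit-learn's `KMeans` would be the usual choice.

```python
import numpy as np

def standardise(X):
    """Zero-mean, unit-variance scaling per feature (constant columns left at 0)."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the centroids actually returned.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centroids

# Cluster the TEST queries, not the training table.
test_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [100, 100], [100, 101], [101, 100], [101, 101]], dtype=float)
labels, centroids = kmeans(standardise(test_X), k=2)
```

The one design point worth noticing is the argument: the clustering is fitted on the current queries, so the partition follows wherever this batch happens to sit in feature space.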

This is not just an implementation trick. It changes the target of context selection. The target is no longer “represent the entire training set.” The target is “represent the region of the current test batch that needs prediction.”

That is a more operationally relevant question. In customer scoring, fraud monitoring, risk triage, claim routing, or demand forecasting, the next batch of cases is rarely a perfect miniature of the historical archive. The model does not need a ceremonial sample of all history. It needs the right evidence for the cases it is about to judge.

MMD herding is there to avoid the nearest-neighbour trap

After CRUMB forms test clusters, it needs to select a training subset for each cluster. A tempting answer would be: take the nearest training rows to the cluster centroid. That sounds sensible. It is also the kind of sensible that fails quietly.

A centroid is a summary, not a distribution. If a cluster has tails, subregions, or uneven density, nearest-to-centroid retrieval can over-concentrate the context around the middle and under-represent the rest. The model sees a clean centre and misses the awkward edges, which is generally where real tabular problems keep their knives.

CRUMB instead uses maximum mean discrepancy (MMD), a kernel-based distance between distributions. In this paper, MMD is used to measure how well a selected training subset matches the empirical distribution of a test cluster. The method greedily selects training points through kernel herding. The objective has two useful pressures:

  • it rewards training points that are close to the test cluster;
  • it penalises redundancy among already selected points.
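The two pressures can be written down explicitly. What follows is the standard empirical MMD and the standard kernel-herding score, stated here as a sketch; whether the paper uses exactly this weighting is an assumption. $S=\{s_1,\dots,s_n\}$ is the selected training subset, $C=\{c_1,\dots,c_m\}$ is the test cluster, and $k$ is the kernel.

```latex
\[
\widehat{\mathrm{MMD}}^2(S, C)
  = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(s_i, s_j)
  - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(s_i, c_j)
  + \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(c_i, c_j)
\]

\[
\operatorname{score}_{t+1}(x)
  = \underbrace{\frac{1}{m}\sum_{j=1}^{m} k(x, c_j)}_{\text{closeness to the test cluster}}
  \;-\;
  \underbrace{\frac{1}{t+1}\sum_{i=1}^{t} k(x, s_i)}_{\text{redundancy with points already selected}}
\]
```

For a kernel with constant $k(x,x)$, such as the Gaussian RBF, picking the candidate with the highest score at each step is the same as greedily minimising the squared MMD of the equally weighted selected set.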

So the selected context is not merely local. It is locally distributed. That is the key correction to the likely misconception about the paper. CRUMB is not “k-means plus nearest neighbours.” Nor is it “MMD alone.” The ablations show that the useful behaviour comes from the interaction: clustering creates local targets, and MMD herding gives those targets well-covered contexts.
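A minimal sketch of greedy kernel herding against one test cluster, assuming an exact RBF kernel and a bandwidth supplied by the caller. The names `rbf_kernel` and `herd_context` are illustrative, and the full kernel matrix is only sensible for small inputs.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Gaussian RBF kernel matrix between rows of A and rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def herd_context(train_X, cluster_X, budget, bandwidth):
    """Greedily pick `budget` training rows whose distribution matches the cluster."""
    # Closeness term: mean kernel similarity of each training row to the cluster.
    target = rbf_kernel(train_X, cluster_X, bandwidth).mean(axis=1)
    K_tt = rbf_kernel(train_X, train_X, bandwidth)
    redundancy = np.zeros(len(train_X))       # sum of kernels to rows already chosen
    available = np.ones(len(train_X), dtype=bool)
    selected = []
    for t in range(budget):
        score = target - redundancy / (t + 1)  # reward closeness, penalise redundancy
        score[~available] = -np.inf            # never pick the same row twice
        i = int(score.argmax())
        selected.append(i)
        available[i] = False
        redundancy += K_tt[:, i]
    return selected

train_X = np.array([[0, 0], [1, 1], [0.5, 0.5], [50, 50], [60, 60]], dtype=float)
cluster_X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
context = herd_context(train_X, cluster_X, budget=2, bandwidth=1.0)
```

In this toy case the first pick is the central row, and the second is one of the corner rows rather than a near-duplicate of the first: the redundancy term is what turns "nearest" into "covering".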

The paper uses a Gaussian RBF kernel with bandwidth set by the median heuristic. It also accelerates the greedy MMD procedure with Random Fourier Features and batched greedy selection. For large datasets, it adds early stopping: monitor relative MMD improvement every 50 selected points and stop if improvement falls below $10^{-4}$ for five consecutive checks. This lets easier clusters use smaller contexts instead of blindly consuming the full budget. Even context has a point of diminishing returns. Stunningly, the computer science community has rediscovered budgeting.
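The stopping rule can be sketched as a small function over a trace of MMD values. The thresholds come from the description above; the exact definition of "relative improvement" used here, $(\text{previous} - \text{current}) / \text{previous}$ between consecutive checks, is an assumption, as is the name `stopping_size`.

```python
def stopping_size(mmd_trace, check_every=50, tol=1e-4, patience=5):
    """Return the context size at which selection would stop.

    mmd_trace[i] is the MMD after i + 1 points have been selected.
    """
    below = 0      # consecutive checks with negligible improvement
    prev = None
    for n in range(check_every, len(mmd_trace) + 1, check_every):
        cur = mmd_trace[n - 1]
        if prev is not None:
            rel = (prev - cur) / prev if prev > 0 else 0.0
            below = below + 1 if rel < tol else 0
            if below >= patience:
                return n           # stop early: the match has plateaued
        prev = cur
    return len(mmd_trace)          # budget exhausted without a plateau

# Illustrative trace (not paper data): improves, then flattens at 0.01.
plateau = [max(1.0 / (i + 1), 0.01) for i in range(1000)]
still_improving = [1.0 / (i + 1) for i in range(400)]
```

On the plateau trace the rule stops at 350 points out of a 1,000-point budget; on the still-improving trace it uses all 400.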

The three-stage mechanism is the article, not the decoration

Mechanism-first is the right way to read this paper because the experiments are mostly asking whether each part of the mechanism earns its keep.

The full procedure can be summarised without mystery:

| Stage | What CRUMB does | Operational consequence |
| --- | --- | --- |
| 1. Test-side clustering | Partition current test queries into $K$ clusters using k-means on standardised features. | Queries that look similar share an inference batch and can reuse one context. |
| 2. MMD-based context retrieval | For each test cluster, greedily select a training subset whose distribution matches the cluster. | The context is local and distributionally covered, not just random or centroid-nearest. |
| 3. Batched PFN inference | Run the PFN once per cluster using that cluster’s selected context. | Number of PFN forward passes becomes $K$, not one per test point. |
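The three stages can be sketched as a thin orchestration loop. Everything passed in is a stand-in: `cluster_fn`, `select_fn`, and `pfn_predict` are hypothetical callables, not a real PFN API, and the sketch only shows how the pieces connect.

```python
import numpy as np

def crumb_style_predict(train_X, train_y, test_X, cluster_fn, select_fn, pfn_predict):
    """Run one model call per test cluster; returns (predictions, number of calls).

    cluster_fn(test_X)            -> integer cluster label per test row
    select_fn(train_X, queries)   -> indices of training rows to use as context
    pfn_predict(ctx_X, ctx_y, q)  -> predictions for the query rows q
    """
    labels = cluster_fn(test_X)                      # stage 1: cluster the queries
    preds = np.empty(len(test_X), dtype=train_y.dtype)
    n_calls = 0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        ctx = select_fn(train_X, test_X[members])    # stage 2: context per cluster
        preds[members] = pfn_predict(train_X[ctx], train_y[ctx], test_X[members])
        n_calls += 1                                 # stage 3: one forward pass per cluster
    return preds, n_calls
```

The call count equals the number of non-empty clusters, which is the whole point: the model sees at most $K$ batches regardless of how many queries arrive.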

The paper’s central trade-off is therefore not accuracy versus speed in the lazy sense. It is context specificity versus batchability. Per-query kNN is highly specific but poorly batchable. Uniform subsampling is batchable but blunt. MICP is batchable and local, but its routing is tied to the training distribution. CRUMB tries to be batchable, local, and test-adaptive.

That makes the method unusually easy to translate into deployment language: CRUMB is an inference-time context router for PFNs.

It also explains why the method is architecture-agnostic. CRUMB does not need access to the internals of TabPFN-v2, TabICL-v1, or TabICL-v2. It only controls the context supplied to the model. This is strategically useful because it means the wrapper can be evaluated around existing PFN checkpoints, rather than waiting for the next architecture paper to descend from the mountain.

The main benchmark says CRUMB buys batching without falling far behind kNN

The main experiment evaluates CRUMB on all 51 TabArena datasets using TabICL-v2. The baselines are full-context inference, per-query kNN retrieval, MICP, and uniform subsampling. Every method except full context uses a fixed context budget of $n = 0.1N$. Methods are ranked per dataset and seed, with classification ranked by accuracy and regression by RMSE.
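The ranking protocol can be sketched in a few lines. The tie handling (tied methods share the mean rank) is an assumption, the function names are illustrative, and the scores in the example are placeholders rather than paper results.

```python
def rank_methods(scores, higher_is_better):
    """Rank methods on one (dataset, seed) run; 1 is best, ties share the mean rank."""
    ordered = sorted(scores, key=lambda m: scores[m], reverse=higher_is_better)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for m in ordered[i:j + 1]:
            ranks[m] = (i + 1 + j + 1) / 2
        i = j + 1
    return ranks

def average_ranks(runs):
    """runs: list of (task_type, {method: metric}) pairs, one per dataset and seed."""
    collected = {}
    for task, scores in runs:
        # Accuracy is better when higher; RMSE is better when lower.
        per_run = rank_methods(scores, higher_is_better=(task == "classification"))
        for method, r in per_run.items():
            collected.setdefault(method, []).append(r)
    return {m: sum(v) / len(v) for m, v in collected.items()}

# Placeholder scores for three hypothetical methods.
runs = [
    ("classification", {"method_a": 0.9, "method_b": 0.8, "method_c": 0.8}),
    ("regression", {"method_a": 1.0, "method_b": 0.5, "method_c": 2.0}),
]
avg = average_ranks(runs)
```

Ranking per run before averaging is what lets accuracy and RMSE sit in the same table: the metric disappears and only the ordering survives.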

The result is best read as a ranked trade-off table, not as a single leaderboard.

| Method | Average rank | Statistical comparison vs. CRUMB | Computational reading |
| --- | --- | --- | --- |
| Full context | $2.202 \pm 0.098$ | Significant | Best upper-bound style baseline, but uses all $N$ and scales poorly. |
| Per-query kNN | $2.782 \pm 0.107$ | Not significant, $p_{\text{adj}}=0.106$ | Strong local retrieval, but requires $T$ PFN forward passes. |
| CRUMB | $2.975 \pm 0.096$ | n/a (reference method) | Slightly behind kNN in rank, but needs only $K$ forward passes. |
| MICP | $3.304 \pm 0.105$ | Significant | Batchable, but less effective than CRUMB at same budget. |
| Uniform | $3.737 \pm 0.105$ | Significant | Cheap and blunt; the spreadsheet equivalent of shrugging. |

The important comparison is CRUMB versus per-query kNN. CRUMB is statistically close to kNN, but in the main setup it uses $K=20$ forward passes instead of $T=200$. The paper reports this as a 10× reduction in forward passes, reflected in average PFN inference time of 35 seconds for CRUMB versus 361 seconds for per-query kNN.
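The arithmetic behind those two figures, using only the numbers reported above:

```latex
\[
\frac{T}{K} = \frac{200}{20} = 10,
\qquad
\frac{361\ \text{s}}{35\ \text{s}} \approx 10.3 .
\]
```

The measured time ratio sits close to the forward-pass ratio, which is what one would expect if PFN calls dominate inference time in this setup.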

That does not mean CRUMB is always 10× faster in every deployment. The paper is careful that wall-clock timing on smaller TabArena datasets can be distorted by fixed overheads, which is why timing is reported selectively. But the structural advantage is real: fewer PFN calls, shared contexts within clusters, better GPU utilisation.

The business interpretation is direct. If a PFN is blocked by inference cost, the decision is not necessarily “abandon PFNs” or “rewrite the architecture.” A middle layer that routes context intelligently may get you into the viable zone before anyone starts approving a research program with seventeen milestones and a steering committee. Condolences in advance.

The consistency test asks whether the win survives model and budget changes

The second experiment broadens the evidence. It restricts to the 38 classification datasets because TabICL-v1 does not support regression, then varies three axes: PFN backbone, sampling proportion, and maximum training size. The models are TabICL-v1, TabICL-v2, and TabPFN-v2. The context proportions are $p \in \{0.05, 0.1, 0.2, 0.5, 0.8\}$, with maximum training sizes of 500, 1,000, and 2,000.

This is a robustness and generality test, not a second main thesis. Its job is to check whether CRUMB’s advantage is an accident of one model and one context budget.

Across these settings, CRUMB reaches an average rank of $1.809 \pm 0.027$, compared with $2.022 \pm 0.023$ for MICP and $2.169 \pm 0.026$ for uniform subsampling. Both differences are reported as significant after Bonferroni correction.

The heatmap in Figure 3 gives the more granular version: CRUMB is best on 24 of 37 datasets for TabICL-v1, 24 of 37 for TabICL-v2, and 27 of 37 for TabPFN-v2. The paper says 38 classification datasets, while the visible heatmap counts show 37 entries in those per-model summaries. For interpretation, the table-level aggregate is the cleaner reference; the qualitative message is unchanged.

The practical reading: CRUMB is not a narrow trick tuned to one backbone. It behaves like a wrapper whose usefulness transfers across PFN families, at least among the three models tested.

That matters because enterprise ML teams do not buy abstractions. They operate stacks. A method that only works for one checkpoint under one budget is a demo. A method that survives backbone and budget variation starts to look like infrastructure.

The large-dataset experiment is about viability, not benchmark theatre

The large-dataset section moves closer to the deployment problem that motivated the paper. The authors evaluate on five large TabArena datasets: Diabetes130US at 57k samples, APSFailure at 61k, SDSS17 at 62k, customer_satisfaction_in_airline at 104k, and GiveMeSomeCredit at 120k, with the maximum size capped at 100k samples. The maximum context is $n=0.1N$, and the experiment uses five random seeds.

Here the paper enables early stopping in the greedy MMD process. That makes this section partly a large-scale viability test and partly an implementation test for adaptive context size.

Figure 4 reports two useful observations. First, CRUMB with early stopping wins more often than MICP across dataset-seed comparisons, significantly so for TabPFN-v2 and TabICL-v1 at $p < 0.01$. Second, early stopping terminates context selection before the full budget is reached, producing smaller contexts that perform comparably to MICP when MICP is given larger contexts.

That second point is the one operators should keep. Early stopping is not just a runtime hack. It converts context size from a fixed entitlement into an adaptive allocation. Some local test clusters may need more evidence. Some may not. CRUMB’s early stopping gives the system a way to stop paying for context once the distributional match stops materially improving.

The paper does not claim that preprocessing cost disappears. In fact, it later notes that CRUMB’s preprocessing scales linearly in $N$ but can still carry a large constant for million-scale datasets. The sober interpretation is that CRUMB shifts the cost profile: away from full attention over all training samples, toward CPU-side clustering and context retrieval plus a small number of GPU forward passes.

For many production teams, that trade is attractive. CPU preprocessing is not free, but it is usually easier to schedule, cache, parallelise, and explain than repeatedly asking a transformer to attend over a training table the size of a municipal tax register.

The drift test explains why test-side clustering is not cosmetic

The covariate-drift experiment is the cleanest evidence for why CRUMB clusters the test side.

The paper simulates controlled drift on real datasets. It standardises features, projects samples onto the first principal component, assigns each sample a quantile rank $q \in [0,1]$, then fixes the training pool as the bottom half, $q < 0.5$. The test pool is selected with a sliding window controlled by drift intensity $\tau$. At $\tau=0$, train and test filters match. At $\tau=1$, the test set lies in the opposite half of the PC1 distribution, making train and test disjoint along that axis.
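A minimal sketch of that construction. The sliding-window formula is an assumption chosen to match the two endpoints described: the test pool is taken as $q \in [\tau/2, \tau/2 + 0.5)$. The function names are illustrative, ranks are scaled into $[0, 1)$ for clean half-open windows, and the sketch returns candidate pools rather than the disjoint samples an actual experiment would draw from them.

```python
import numpy as np

def pc1_quantile_rank(X):
    """Quantile rank in [0, 1) of each row along the first principal component."""
    std = X.std(axis=0)
    Z = (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)   # standardise
    _, _, vt = np.linalg.svd(Z, full_matrices=False)          # vt[0] is the PC1 direction
    proj = Z @ vt[0]             # note: the sign of PC1 is arbitrary
    ranks = proj.argsort().argsort()
    return ranks / len(X)

def drift_pools(X, tau):
    """Return (train_pool, test_pool) row indices for drift intensity tau in [0, 1]."""
    q = pc1_quantile_rank(X)
    lo = 0.5 * tau               # ASSUMED window: [tau/2, tau/2 + 0.5)
    train_pool = np.flatnonzero(q < 0.5)
    test_pool = np.flatnonzero((q >= lo) & (q < lo + 0.5))
    return train_pool, test_pool

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # synthetic placeholder data
train0, test0 = drift_pools(X, 0.0)
train1, test1 = drift_pools(X, 1.0)
```

At $\tau=0$ the two pools coincide; at $\tau=1$ they share no rows, so every test query lies outside the PC1 range the training pool covers.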

Each method receives a context budget of $n=\min(0.1N, 10{,}000)$ and uses $K=20$ clusters. Results are averaged over 10 seeds across four datasets with at least 50,000 samples: Anneal, Amazon Employee Access, Airfoil Self Noise, and Higgs. The aggregate table reports classification accuracy across three classification datasets, three PFN models, and 10 seeds.

The result:

| Drift intensity | CRUMB accuracy | MICP accuracy | CRUMB advantage |
| --- | --- | --- | --- |
| $\tau=0$ | $0.794 \pm 0.029$ | $0.746 \pm 0.033$ | $+0.049 \pm 0.006$ |
| $\tau=0.5$ | $0.788 \pm 0.019$ | $0.663 \pm 0.031$ | $+0.125 \pm 0.015$ |
| $\tau=1.0$ | $0.699 \pm 0.014$ | $0.528 \pm 0.031$ | $+0.171 \pm 0.021$ |

This is a robustness test, and it supports a specific mechanism. MICP clusters the training distribution. When the test distribution shifts, those training-derived partitions can become misaligned with where the current queries actually are. CRUMB clusters the current test queries, then selects training context to match those test clusters. So when the query population moves, the retrieval target moves with it.

The paper phrases the result as CRUMB degrading gracefully while MICP collapses: CRUMB falls from 0.794 to 0.699, while MICP falls from 0.746 to 0.528. That is a meaningful difference.
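Worked out from the reported means, the drop from $\tau=0$ to $\tau=1$ is:

```latex
\[
\text{CRUMB: } 0.794 - 0.699 = 0.095
\quad\Bigl(\tfrac{0.095}{0.794} \approx 12.0\%\ \text{relative}\Bigr),
\qquad
\text{MICP: } 0.746 - 0.528 = 0.218
\quad\Bigl(\tfrac{0.218}{0.746} \approx 29.2\%\ \text{relative}\Bigr).
\]
```

MICP loses a little more than twice as much accuracy as CRUMB over the same drift range, in absolute and in relative terms.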

But do not overread it. This is covariate drift under a controlled construction where the label mechanism is assumed unchanged. It does not prove robustness to concept drift, policy changes, adversarial manipulation, or the more tragic forms of enterprise data quality where a feature column silently changes meaning because someone renamed a product category in 2019. CRUMB aligns context to shifted inputs. It does not guarantee that the world still labels those inputs the same way.

That boundary is not a flaw. It is the difference between a useful method and a motivational poster.

The ablations identify the engine: cluster plus distributional retrieval

The appendix is where the paper does the necessary dismantling. This is not ornamental. It tells us whether CRUMB is winning because of one clever component, or because two components interact.

The ablation evidence has three roles:

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Varying number of clusters $K$ | Sensitivity test | $K=1$ is too global; performance improves as $K$ increases, with diminishing returns after around $K=10$. | It does not establish a universal optimal $K$ for every domain. |
| Retrieval strategy comparison | Ablation | MMD herding beats centroid-nearest and Voronoi-uniform within the same test-clustering framework. | It does not prove MMD is always the best distribution measure. |
| 2×2 clustering/retrieval factorial design | Mechanism isolation | The full pipeline, cluster plus MMD, has the best average rank; clustering alone and MMD alone are weaker. | It does not remove dependence on k-means geometry or RBF kernel choice. |
| RFF and batched-greedy variants | Implementation detail and efficiency test | Approximate and batched MMD selection preserves most of the benefit while improving runtime. | It does not make preprocessing cost irrelevant at very large scale. |
| Online cached variant | Boundary exploration | In streaming mode, CRUMB can cache centroids and contexts, but then behaves like MICP at inference time. | It does not show the batch version’s advantage automatically survives streaming deployment. |
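For the RFF row, here is a minimal sketch of how Random Fourier Features can stand in for the exact RBF kernel during herding. It uses the standard construction, with frequencies drawn from $\mathcal{N}(0, \sigma^{-2} I)$ and uniform phases; the names `make_rff_map` and `rff_herd` are illustrative, and the paper's batched-greedy details are not reproduced.

```python
import numpy as np

def make_rff_map(dim, n_features, bandwidth, seed=0):
    """Return z(.) such that z(x) @ z(y) approximates exp(-||x - y||^2 / (2 bandwidth^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return lambda X: np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rff_herd(Z_train, Z_cluster, budget):
    """Herding in feature space: each step is one matrix-vector product."""
    target = Z_cluster.mean(axis=0)          # approximate mean embedding of the cluster
    selected_sum = np.zeros_like(target)
    available = np.ones(len(Z_train), dtype=bool)
    chosen = []
    for t in range(budget):
        # Same closeness-minus-redundancy score, computed without a kernel matrix.
        score = Z_train @ (target - selected_sum / (t + 1))
        score[~available] = -np.inf
        i = int(score.argmax())
        chosen.append(i)
        available[i] = False
        selected_sum += Z_train[i]
    return chosen

train_X = np.array([[0.5, 0.5], [0, 0], [1, 1], [50, 50], [60, 60]], dtype=float)
cluster_X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
z = make_rff_map(dim=2, n_features=4000, bandwidth=1.0)
chosen = rff_herd(z(train_X), z(cluster_X), budget=2)
```

The saving is that no training-by-training kernel matrix is ever formed: each greedy step costs on the order of $N$ times the number of random features.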

The most important ablation is the 2×2 design:

| | Uniform context | MMD context |
| --- | --- | --- |
| No clustering | Case A: standard uniform subsampling | Case C: MMD over the full test set |
| Clustering | Case B: clustered test batches with uniform contexts | Case D: full CRUMB |

Case D wins. Cases A and B are similar. Case C helps only marginally. That is exactly the mechanism-first story: clustering alone does not create useful context if the selected context is still globally uniform; MMD alone has little to exploit when the full test set approximates the training distribution. The gain comes when clustering creates localised target distributions and MMD herding selects contexts matched to those local targets.

This is the paper’s quiet warning to implementers: do not cargo-cult one component. If you cluster the test set and then sample random contexts, you mostly built bureaucracy. If you run MMD against the whole test batch, you may have built a more sophisticated version of global sampling. The useful unit is the pair: local query group plus distributionally matched evidence.

What the paper directly shows

The paper directly shows five things.

First, CRUMB can be used as a training-free wrapper around existing PFN architectures. The authors test TabPFN-v2, TabICL-v1, and TabICL-v2 from public checkpoints without modifying their weights.

Second, on the main 51-dataset TabArena benchmark using TabICL-v2, CRUMB significantly outperforms MICP and uniform subsampling at the same context budget. It is statistically close to per-query kNN while requiring far fewer PFN forward passes.

Third, the advantage generalises across multiple PFN backbones and context sampling proportions on classification tasks, with CRUMB achieving the best average rank in the broader sweep.

Fourth, on larger datasets, early-stopped CRUMB wins more often than MICP and can use smaller mean contexts while remaining competitive.

Fifth, under controlled covariate drift, CRUMB’s advantage over MICP grows as drift intensity increases. This supports the paper’s interpretation that test-side clustering and test-aligned context retrieval create a form of inference-time drift adaptation.

These are meaningful claims. They are also bounded claims.

The paper does not show that CRUMB beats all tabular ML systems in production. It does not show superiority over gradient boosting on every large enterprise dataset. It does not show that PFNs should replace existing pipelines. It shows that, conditional on using PFNs, inference-time context selection can be materially improved by clustering current queries and matching training subsets to those query clusters.

That conditional matters.

What Cognaptus infers for business use

The business inference is that PFN deployment should be designed around a context layer, not only around a model endpoint.

In conventional tabular ML, the training data shapes the fitted model and then mostly disappears from the inference path. In PFN-style inference, labelled examples remain part of the input. That makes context selection a first-class operational design problem. Which rows should the model see? How should they be selected? How often should selection update? What happens when the query population shifts?

CRUMB suggests one viable answer: route test batches into local groups and retrieve evidence that matches each group distributionally. For batch-scored business workflows, this maps cleanly onto existing operating patterns:

| Workflow | Why CRUMB’s shape fits | What to monitor |
| --- | --- | --- |
| Nightly risk scoring | Queries arrive in batches; contexts can be selected per segment before inference. | Distribution shift across score batches; context size after early stopping. |
| Claims triage | Incoming claims may cluster by product, region, or event type. | Whether clusters align with operationally meaningful subpopulations. |
| Fraud detection review queues | Current suspicious cases may not resemble the historical average. | Drift between recent queries and cached contexts; false-negative behaviour in tail clusters. |
| Customer propensity scoring | Campaign lists are often pre-batched and non-random. | Whether campaign-selected customers create distributional skew. |
| Credit or underwriting pre-screening | Batch decisions need consistent throughput and auditable context policy. | Stability of selected contexts, feature standardisation, and model suitability. |

The ROI path is not “CRUMB gives better AI.” That sentence should be taxed.

The more precise ROI path is: CRUMB may let teams use PFNs on larger tabular inference jobs by reducing the number and cost of PFN forward passes while preserving much of the performance of more local retrieval methods. It may also reduce performance decay when the current query population differs from the historical training distribution, provided the label mechanism is stable and the feature representation is meaningful.

That is not a marketing slogan. It is an operating hypothesis worth testing.

Where CRUMB should not be over-sold

CRUMB has sharp boundaries.

The first boundary is batching. The method needs the full set of test queries upfront to cluster them. The paper discusses an online variant where CRUMB runs on an initial batch, caches centroids and contexts, and routes later queries to the nearest cached centroid. But the paper is explicit: in that cached regime, CRUMB-Online reduces to an MICP-style inference procedure. The operational difference is how the cached centroids and contexts were initialised.
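A minimal sketch of what cached routing looks like at inference time, assuming centroids and per-cluster context row identifiers were saved from an earlier batch. The function name and the returned distance are illustrative; the distance is included only because it is a cheap signal for a refresh policy.

```python
import numpy as np

def route_to_cached_context(query, centroids, cached_contexts):
    """Send one streaming query to the nearest cached centroid.

    Returns (cluster id, cached context row ids, distance to that centroid).
    """
    d2 = ((centroids - query) ** 2).sum(axis=1)
    k = int(d2.argmin())
    return k, cached_contexts[k], float(np.sqrt(d2[k]))

# Placeholder cache from a hypothetical earlier batch.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
cached_contexts = [[0, 1, 2], [7, 8]]
cluster_id, context_rows, dist = route_to_cached_context(
    np.array([9.0, 9.0]), centroids, cached_contexts
)
```

Nothing in this path looks at the distribution of current queries, which is why the cached variant inherits MICP's behaviour rather than the batch method's.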

That means a streaming deployment needs refresh policy. If recent queries drift, stale cached centroids can become yesterday’s map. A sliding-window refresh could help, but then the system must decide how often to recompute clusters and contexts. This is no longer merely a modelling question; it is an operations question.

The second boundary is geometry. CRUMB uses Euclidean k-means on standardised features. That assumes a feature space where Euclidean clusters mean something useful. Many enterprise tables contain mixed categorical, ordinal, sparse, and engineered features. Standardisation does not magically make their geometry honest. The MMD stage uses an isotropic Gaussian RBF kernel with median-heuristic bandwidth, which similarly imposes a generic notion of distance.

The third boundary is preprocessing cost. CRUMB avoids full-context quadratic attention, but it does not make context selection free. The paper notes that CRUMB’s preprocessing scales linearly in $N$, with a potentially large constant in expressions like $O(Nd(T+Kn))$. For million-scale datasets, that constant matters. The method may still be attractive, but it belongs in a benchmarked pipeline, not a slide titled “Scalable” with a check mark.

The fourth boundary is the PFN itself. CRUMB controls context. It does not fix a model whose pretraining prior is poorly matched to the dataset. If the underlying PFN is wrong for the task, better context selection may produce a more efficiently wrong answer. Efficiency has never been a substitute for validity, though it does make invalidity arrive faster.

The governance question: what context did the model see?

There is also a less obvious governance implication.

PFNs make the inference context part of the decision process. CRUMB makes that context dynamic: different test clusters receive different selected training subsets. That is operationally powerful, but it also creates a traceability requirement. If a decision is challenged, audited, or debugged, the system should know which context subset was selected, which test cluster triggered it, which retrieval parameters were used, and whether the context was produced by fresh clustering or cached routing.

In other words, context selection becomes part of model governance.

For regulated or high-stakes domains, “the model predicted it” is already an unsatisfying sentence. “The model predicted it after seeing a dynamically selected subset of training examples chosen by a kernel herding procedure” is not going to calm anyone down unless the pipeline logs its work.

A production CRUMB-like system should therefore store at least:

| Governance artefact | Why it matters |
| --- | --- |
| Cluster assignment for each query | Explains which batch-level context governed the prediction. |
| Selected training row identifiers | Enables audit, debugging, and reproducibility. |
| MMD or selection diagnostics | Shows whether the selected context matched the test cluster well. |
| Context refresh timestamp | Distinguishes fresh distribution matching from stale cached routing. |
| Feature standardisation and kernel settings | Captures distance assumptions embedded in the selection process. |
| PFN checkpoint and context budget | Separates model behaviour from context policy. |
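One way to make those artefacts concrete is a per-prediction audit record. This is a sketch only: the class, its field names, and the example values are hypothetical, not a schema from the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ContextAuditRecord:
    """Hypothetical log entry tying one prediction to the context that produced it."""
    query_id: str
    cluster_id: int                 # cluster assignment for the query
    context_row_ids: List[int]      # selected training row identifiers
    final_mmd: float                # selection diagnostic
    context_refreshed_at: str       # ISO 8601 timestamp of the last re-clustering
    served_from_cache: bool         # fresh matching vs cached routing
    standardisation: str            # feature scaling applied before clustering
    kernel: str
    kernel_bandwidth: float
    pfn_checkpoint: str
    context_budget: int

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

# Placeholder values for illustration only.
record = ContextAuditRecord(
    query_id="q-0001", cluster_id=3, context_row_ids=[10, 42, 7],
    final_mmd=0.012, context_refreshed_at="2026-01-01T00:00:00Z",
    served_from_cache=False, standardisation="z-score",
    kernel="rbf", kernel_bandwidth=1.5,
    pfn_checkpoint="example-checkpoint", context_budget=1000,
)
line = record.to_json()
```

Writing one JSON line per prediction, or per cluster with a list of query identifiers, keeps the trail cheap to store and easy to replay.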

This is not bureaucratic garnish. When context becomes dynamic, context becomes evidence. Evidence should leave a trail. A shocking proposal, apparently.

The useful mental model: CRUMB is a context scheduler

The best way to think about CRUMB is not as a new model. It is a context scheduler.

A model scheduler decides where compute should run. A context scheduler decides what evidence each inference batch should receive. CRUMB schedules labelled examples into cluster-specific contexts based on distributional match to current queries. That makes it especially relevant to tabular foundation models, where inference depends heavily on the supplied context.

This mental model also clarifies the relationship to other approaches.

Architectural methods try to make the PFN itself handle longer contexts more efficiently. Fine-tuning methods adapt weights or adapters to a dataset. Context-tuning methods optimise examples. CRUMB sits outside the model and changes the input context at inference time. These strategies are not mutually exclusive. In fact, a future system might combine them: a more scalable PFN architecture, a CRUMB-like context router, and fine-tuning where justified.

But the virtue of CRUMB is that it asks for less institutional drama. No retraining. No new architecture. No waiting for a foundation model vendor to make the long-context problem disappear in a roadmap update. It changes the inference wrapper.

In enterprise AI, that matters. The wrapper is often where deployability lives.

The operator’s checklist

For a team considering CRUMB-style inference, the implementation question should not begin with “Can we reproduce the paper?” It should begin with the operating conditions.

| Question | Good sign | Warning sign |
| --- | --- | --- |
| Do queries arrive in batches or micro-batches? | Batch scoring, campaign lists, nightly jobs, queue processing. | One-at-a-time streaming with strict latency and no refresh window. |
| Does Euclidean feature geometry make sense? | Mostly continuous or well-embedded features, stable preprocessing. | High-cardinality categoricals, sparse codes, mixed semantics, unstable feature definitions. |
| Is the PFN already competitive on the task? | PFN performs well on smaller samples or validation subsets. | PFN underperforms strong baselines even with full or generous context. |
| Is query distribution shift common? | New campaigns, seasonal patterns, event-driven case mix. | Stable iid-like scoring where uniform context already works well. |
| Can selected contexts be logged? | Row identifiers, cluster IDs, and retrieval diagnostics can be persisted. | Governance requires explanations but pipeline cannot trace selected evidence. |
| Is preprocessing latency acceptable? | CPU-side selection can be cached, parallelised, or run before SLA-critical inference. | Selection cost itself violates latency budget. |

This checklist is where the paper becomes useful. CRUMB is not a universal deployment recipe. It is a design pattern for a specific operating problem: making PFN inference cheaper and more adaptive when the full training context is too large and query batches have structure worth exploiting.

Conclusion: the model does not need the whole archive; it needs the right witnesses

CRUMB’s contribution is not that it finds yet another way to sample fewer rows. The field has enough row sampling to fill several warehouses of disappointment.

The useful idea is more precise: for PFN inference, the context should be selected against the current query distribution, not merely drawn from the historical training distribution. Test-side clustering creates local inference batches. MMD herding selects training evidence that covers those batches. Batched PFN calls then preserve much of the locality benefit without paying the per-query forward-pass cost.

The paper’s results support that mechanism. CRUMB beats MICP and uniform subsampling across TabArena at matched context budgets, remains statistically close to per-query kNN while using far fewer forward passes, and shows a widening advantage under controlled covariate drift. The ablations make the important point even clearer: clustering alone is not enough, and MMD alone is not enough. The win comes from their interaction.

For operators, the lesson is simple and inconvenient in the useful way. In tabular foundation models, inference context is not an afterthought. It is part of the system design. It should be routed, budgeted, refreshed, and audited.

The model does not need to read the entire archive every time it answers a question. It needs the right witnesses in the room.

Cognaptus: Automate the Present, Incubate the Future.


1. Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, and Niraj Kumar, “CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching,” arXiv:2606.11473, 2026. https://arxiv.org/abs/2606.11473