Mind the BOLD Gap: Why fMRI Models Need More Than a Local Look

TL;DR for operators

This paper is not about magically reading the mind from fMRI. Fortunately. We already have enough products pretending to do that.

The useful point is narrower and more operational: fMRI signals are distributed across brain regions and stretched across time, so a model that treats them as local snapshots may be structurally under-equipped before training even begins. Kramer, Acharya, Giola, and Zappala adapt an Attentional Neural Integral Equation (ANIE) style architecture to fMRI encoding and decoding, learning a nonlocal operator in latent space rather than relying only on local filters, short recurrent memory, or fixed graph assumptions.¹

The paper’s strongest business lesson is about context as a design variable. Longer temporal windows usually help the neural integral operator. Whole-brain input improves Haxby decoding over visual-cortex-only input. Learned latent embeddings make stimulus structure more separable, especially when temporal information is limited. That matters for neurotech, medical imaging analytics, and any business handling high-dimensional sensor streams where the signal is scattered, delayed, and annoyingly unwilling to sit in one neat feature column.

The boundary is equally clear. The results do not establish a mechanistic theory of brain function. The data are fMRI BOLD signals, not direct neural recordings. The experiments use two open datasets, small-subject settings, and moderate encoding performance. Longer windows may help partly because they stabilize statistics, not only because the model has discovered deep temporal structure. Useful, yes. Mind-reading oracle, no. Please put the pitch deck down.

The model starts from the part most pipelines quietly flatten

Most analytics pipelines begin by deciding what part of the signal is allowed to matter. That decision often hides in preprocessing. Crop the region of interest. Pick a time window. Flatten a volume. Aggregate a sequence. Choose a model that can only see neighbours. Then, several weeks later, someone wonders why performance is brittle.

The paper’s central move is to refuse that quiet flattening. fMRI data are treated as functions over space and time: a BOLD signal $u(x,t)$, where $x$ represents brain location and $t$ represents time. The modelling question becomes less “which classifier should we bolt on?” and more “what kind of operator can map one spatiotemporal function into another without pretending the brain is a spreadsheet?”

That is why the mechanism matters more than the benchmark table. The authors frame fMRI dynamics as nonlocal: activity at one location may depend on distant regions, and activity at one time may reflect earlier dynamics because of neural integration, hemodynamic delay, and memory effects. A strictly local model assumes that nearby inputs are the main inputs. That assumption is convenient. It is not necessarily innocent.

The integral-operator formulation expresses this directly:

$$ T(u)(x,t)=\int_0^1 \int_\Omega K_\theta(u(z,s),x,t,z,s)\,dz\,ds $$

The important part is not the notation; it is the permission structure. The output at location $x$ and time $t$ can depend on signal values at other locations $z$ and earlier or different times $s$. The model learns the kernel $K_\theta$ rather than requiring a hand-built connectivity graph or fixed delay model.
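To make the permission structure concrete, here is a minimal numerical sketch of the integral above, approximated with a midpoint rule. It is not the authors' implementation: it assumes a one-dimensional domain $\Omega=[0,1]$, a regular grid, and a hand-written `toy_kernel` standing in for the learned $K_\theta$. All function names are hypothetical.

```python
import math

def apply_integral_operator(u, kernel, x, t):
    """Midpoint-rule approximation of T(u)(x, t).

    u[i][j] holds the signal value u(z_j, s_i) on a regular grid over
    z in [0, 1] (assumed 1-D spatial domain) and s in [0, 1].
    """
    n_s, n_z = len(u), len(u[0])
    ds, dz = 1.0 / n_s, 1.0 / n_z
    total = 0.0
    for i in range(n_s):
        s = (i + 0.5) * ds          # midpoint of the i-th time cell
        for j in range(n_z):
            z = (j + 0.5) * dz      # midpoint of the j-th space cell
            total += kernel(u[i][j], x, t, z, s) * dz * ds
    return total

def toy_kernel(u_val, x, t, z, s):
    # Hypothetical stand-in for the learned kernel: influence decays with
    # distance in space and time but never drops to zero.
    return u_val * math.exp(-abs(x - z)) * math.exp(-abs(t - s))
```

The point of the sketch is the double loop: every grid cell $(z,s)$ contributes to the output at $(x,t)$, so a signal far from the query location still has a nonzero say. A local filter would cut that sum off after a few neighbours.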

For business readers, this is the part worth carrying into other domains. When a signal is naturally distributed, a local architecture can become a cost-saving measure disguised as an inductive bias. It may be cheaper to run, easier to explain, and perfectly wrong in the places where the signal actually lives.

Encoding and decoding are opposite directions, not equal difficulties

The paper tests both decoding and encoding, and the distinction matters.

In decoding, the model observes fMRI activity and predicts the stimulus associated with it. That is a backward mapping: from recorded brain response to what was shown. The paper studies this on the Haxby object-classification dataset and the Miyawaki visual-stimulus dataset.

In encoding, the model moves in the other direction. It receives a stimulus representation and predicts the corresponding fMRI response. That is harder because the output is high-dimensional brain activity. In the Miyawaki encoding task, the authors describe the problem as using a $10 \times 10$ stimulus input to predict roughly 5,000 voxels. This is not a polite regression problem. It is a many-output forward-prediction headache wearing a lab coat.

The architecture handles both directions through latent fixed-point dynamics. In decoding, the BOLD signal is encoded into a latent representation, passed through a learned operator, and decoded into the stimulus. In encoding, the stimulus is encoded into latent space, iterated through the operator, and decoded into the predicted BOLD signal.
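The encode, iterate, decode pattern described above can be sketched in a few lines. This is a toy illustration only: the encoder, the update rule, and the readout are hypothetical placeholders chosen so the iteration is a contraction, not the paper's learned components.

```python
def fixed_point(step, z0, tol=1e-10, max_iter=500):
    """Iterate z <- step(z) until successive iterates differ by less than tol."""
    z = z0
    for _ in range(max_iter):
        z_next = step(z)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")

def encode(signal):
    # Hypothetical encoder: two summary features (mean and range).
    return [sum(signal) / len(signal), max(signal) - min(signal)]

def make_step(h):
    # Hypothetical operator: each latent coordinate is updated from the
    # other one plus the encoded input h. The 0.5 factor makes it a contraction.
    def step(z):
        return [0.5 * z[1] + h[0], 0.5 * z[0] + h[1]]
    return step

def decode(z):
    # Hypothetical readout from the converged latent state.
    return z[0] + z[1]

h = encode([1.0, 2.0, 3.0])
z_star = fixed_point(make_step(h), [0.0, 0.0])
prediction = decode(z_star)
```

The same skeleton serves both directions: only what goes into `encode` and what comes out of `decode` changes. In the paper, the operator inside the loop is the learned nonlocal integral operator rather than a fixed linear map.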

This dual use is one of the paper’s substantive contributions. The authors are not merely trying another classifier on fMRI. They are asking whether a latent nonlocal operator can serve as a common modelling framework for both directions of the brain-stimulus relationship.

That is ambitious. It is also where the paper is careful: predictive gains are treated as evidence that broader spatiotemporal context is useful, not proof that the model has recovered the brain’s causal machinery.

The main evidence: longer windows help the operator more than they help most baselines

The first major experimental pattern appears in Haxby decoding. Using visual-cortex recordings, the neural integral operator improves as the temporal window grows from 1 to 10 to 20 time points. Accuracy rises from 0.7367 to 0.7675 to 0.7917, and F1 rises from 0.6842 to 0.7108 to 0.7283.

That would be moderately interesting on its own. The sharper point is the contrast with other models. Several baselines perform well at the shortest window and then degrade severely when the window expands. The feed-forward network starts with 0.7967 accuracy at one time point but falls to 0.3858 at ten. ResNet starts at 0.8192 but drops to 0.5416 and then 0.3965. CNN, LSTM, and ViT also show weaker or unstable behaviour across longer windows.

So the evidence is not simply “more context is always better”. It is more precise: more context helps when the architecture can use it. Otherwise, more context may just become more room in which the model can get confused. Very democratic, but not very useful.
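For readers who want to run this kind of comparison on their own data, a temporal window is simply a run of consecutive frames. The sketch below is one plausible way to build them, assuming a recording stored as a list of frames. The paper's exact windowing procedure is not described here, so treat `sliding_windows` and its `stride` parameter as hypothetical.

```python
def sliding_windows(frames, window, stride=1):
    """Return every run of `window` consecutive frames, stepping by `stride`."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [frames[i:i + window] for i in range(0, last_start + 1, stride)]

# A hypothetical 25-frame recording, with integers standing in for fMRI volumes.
recording = list(range(25))
counts = {w: len(sliding_windows(recording, w)) for w in (1, 10, 20)}
```

One practical consequence is visible in `counts`: with a fixed recording length, longer windows leave fewer training samples. Any window-length comparison should keep that trade-off in view.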

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Haxby decoding with visual-cortex input | Main evidence and temporal-window sensitivity test | ANIE benefits from longer temporal windows while several baselines degrade | That ANIE is universally best across every metric or dataset |
| Miyawaki decoding | Main evidence on a harder visual-stimulus decoding task | ANIE remains competitive and improves at the longest tested window for accuracy | That the operator dominates all baselines on every metric; ViT has stronger F1 in this table |
| Haxby whole-brain decoding | Spatial-context sensitivity test | Broader spatial input improves ANIE, especially at 20 time points | That whole-brain context has been exhaustively compared for all architectures |
| Miyawaki encoding | Main evidence for forward stimulus-to-fMRI prediction | Longer output windows improve ANIE more clearly than ViT | Full recovery of target brain dynamics |
| Raw versus latent KNN separability | Exploratory representation diagnostic | Learned embeddings improve separability when time windows are short | A mechanistic interpretation of the learned latent space |

The Miyawaki decoding task is more nuanced. ANIE reaches 0.8760 accuracy at a 10-frame window, higher than the listed baselines at that window. But ViT posts stronger F1 values across the Miyawaki table. This is exactly where an article should resist the urge to make a leaderboard crown out of papier-mâché. The more defensible conclusion is that the operator behaves well as temporal context grows, but metric choice still matters.

That nuance is valuable. In business systems, “best model” is usually a malformed sentence until someone specifies the task, metric, data regime, latency budget, and failure cost. A model that wins on accuracy but not F1 may be useful, but only after the operating objective has been made explicit. Otherwise, the evaluation has all the authority of a horoscope with decimals.
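A small worked example shows how accuracy and F1 can rank two models in opposite orders. The labels below are invented for illustration, and the sketch assumes macro-averaged F1. The article does not state which averaging the paper uses.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented, imbalanced toy labels: eight of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2
pred_a = [0] * 10                    # model A always predicts the majority class
pred_b = [0] * 5 + [1] * 3 + [1, 1]  # model B finds both minority cases, with 3 false alarms
```

Model A wins on accuracy (0.8 versus 0.7) while model B wins on macro-F1 (roughly 0.67 versus 0.44). Which one is "best" depends entirely on whether missing the minority class is an acceptable failure.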

Spatial context changes the result, which should make ROI cropping feel less harmless

The paper then changes the spatial input for Haxby decoding. Instead of using only visual-cortex recordings, it tests whole-brain fMRI signals with the neural integral operator.

The result strengthens the mechanism-first reading. With whole-brain input, ANIE achieves 0.8125 accuracy at 10 time points and 0.8517 at 20 time points. F1 rises from 0.7608 to 0.8000. The authors report that whole-brain recordings significantly improve performance relative to the visual-cortex setting at 10 and 20 time frames, and that the whole-brain 20-time-point result is the highest across the Haxby experiments.

This does not mean every visual task requires the whole brain. It means that cropping to the obvious region may discard useful distributed information. The visual cortex is obviously relevant to visual stimuli. That does not imply it is sufficient for the whole decoding task.

This is a broader warning for applied AI. Many businesses reduce sensor data to an “expert-selected” region because it makes the pipeline cleaner and cheaper. Sometimes that is correct. Sometimes it is an elegant way to amputate the signal. The test is empirical: run the model under different spatial scopes and measure whether the architecture can exploit the additional context.

The paper’s whole-brain experiment is therefore best understood as a spatial-context sensitivity test, not a universal recommendation to always feed every voxel into every model until the GPU begs for mercy.

Encoding is where the result becomes harder to dismiss as input correlation

The encoding task is the most useful part of the evidence because it removes a simple explanation.

In decoding, longer input windows may help because the model is simply given more temporal information. That is still useful, but not surprising. More evidence often improves prediction. Groundbreaking; also Tuesday.

Encoding is different. The input is the visual stimulus, while the target is the fMRI response. The authors argue that the stimulus sequences do not contain the same temporal dependence as the target fMRI dynamics. Therefore, improvements with longer output windows are less easily explained as “the input just had more correlated frames”.

The numbers support a real, though moderate, effect. For Miyawaki encoding, ANIE’s $R^2$ improves from 0.1491 at 5 time points to 0.1744 at 10 and 0.2546 at 25. Pearson correlation rises from 0.3385 to 0.3800 to 0.4580. The ViT baseline also improves in Pearson correlation, but its $R^2$ largely saturates between 10 and 25 time points: 0.1181 at 5 time points, 0.1511 at 10, and 0.1524 at 25.

The authors explicitly state that these scores do not indicate full recovery of the target dynamics. That caveat is not cosmetic; it changes the interpretation. The encoding result is not “we can predict the brain”. It is “a nonlocal latent operator appears to gain useful predictive structure from longer temporal output dynamics in a hard forward-modelling task”.

That is a less glamorous sentence. It is also a more useful one.
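Because the encoding results are reported in both $R^2$ and Pearson correlation, it helps to see that the two metrics answer different questions. The sketch below uses the standard single-output definitions on invented numbers. How the paper aggregates these metrics across voxels is not specified here, so this is an illustration of the metrics, not a reproduction of the evaluation.

```python
import math

def pearson_r(y_true, y_pred):
    """Linear correlation: insensitive to the scale and offset of y_pred."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov / math.sqrt(var_t * var_p)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_t) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Invented example: the prediction has the right shape but 30% of the amplitude.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
damped = [3.0 + 0.3 * (v - 3.0) for v in target]
```

Here `pearson_r(target, damped)` is 1 up to rounding, while `r_squared(target, damped)` is 0.51. A model can track the shape of a response well and still leave much of its variance unexplained, which is one reason to read the two columns together.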

For business use, the implication is that forward models in high-dimensional sensor domains should be evaluated not only by input richness but by output-structure modelling. When the target is itself a structured field or sequence, the model must generate coherent outputs across space and time. Treating each output dimension as a mostly independent prediction can leave performance on the table, then blame the data for being difficult. The data may indeed be difficult. It may also be offended by the architecture.

The latent-space test says the model helps most when the signal is thin

The paper’s representation analysis is a useful secondary diagnostic. The authors compare raw fMRI representations with latent embeddings learned by the integral-operator model. They reduce representations using PCA and UMAP, then use a KNN classifier to distinguish geometric versus random stimuli in the Miyawaki dataset.

This is not the main predictive task. It is better read as an exploratory representation diagnostic: does the model organize stimulus-dependent information more clearly than the raw signal?

The answer depends on the temporal window. At 3 time points, raw data achieves 88.1670% KNN accuracy, while the model embedding reaches 93.6440%. At 5 time points, raw data reaches 92.2130%, while the model embedding reaches 95.1530%. Both differences are statistically significant in the paper’s tests, with $p=0.0021$ and $p=0.0162$. At 10 time points, the raw data reaches 98.2840% and the model embedding reaches 98.4070%, but the difference is no longer significant, with $p=0.8519$.

That pattern is more interesting than a simple “latent space is better” claim. The model’s advantage is strongest when temporal information is limited. When the window becomes large enough, the raw representation is already highly separable, and the learned embedding’s relative advantage diminishes. This looks like saturation, not failure.

| Temporal window (time points) | Raw fMRI KNN accuracy | Model embedding KNN accuracy | Interpretation |
| --- | --- | --- | --- |
| 3 | 88.1670% ± 0.5987 | 93.6440% ± 0.5403 | Latent operator adds clear structure under limited context |
| 5 | 92.2130% ± 1.1076 | 95.1530% ± 2.6173 | Advantage remains significant but smaller |
| 10 | 98.2840% ± 0.6038 | 98.4070% ± 1.5113 | Both are nearly saturated; advantage disappears statistically |
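The raw-versus-latent comparison is easy to reproduce in spirit. The sketch below is a minimal KNN separability check with leave-one-out scoring on invented points. It skips the PCA and UMAP reduction step the authors use, and the leave-one-out protocol, the choice of `k`, and the toy coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour majority vote."""
    correct = 0
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        neighbours = sorted(
            (math.dist(p, q), labels[j]) for j, q in enumerate(points) if j != i
        )[:k]
        vote = Counter(label for _, label in neighbours).most_common(1)[0][0]
        correct += vote == labels[i]
    return correct / len(points)

labels = ["geometric", "random"] * 4

# Invented "raw" representation: the two classes are interleaved along one axis.
raw = [(float(i), 0.0) for i in range(8)]

# Invented "latent" representation: the same samples, pulled apart by class.
latent = [(0.1 * i, 0.0) if lab == "geometric" else (5.0 + 0.1 * i, 0.0)
          for i, lab in enumerate(labels)]

raw_acc = knn_loo_accuracy(raw, labels)
latent_acc = knn_loo_accuracy(latent, labels)
```

Running the same classifier on both representations isolates the question being asked: did the embedding rearrange the samples so that neighbours share a label? The classifier itself is deliberately too simple to take credit.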

This matters operationally because many business datasets live in the low-signal regime. Short sessions, noisy sensors, incomplete records, budget-constrained acquisition, missing modalities: the world is full of datasets that would be better if only reality had remembered to collect them properly.

In those regimes, learned representations may be most valuable not when they produce dazzling final metrics, but when they make weak structure usable. That is a quieter form of ROI, but often the more durable one.

The misconception: this is not a brain mechanism wearing a neural-network costume

The obvious misreading is that the paper has discovered a mechanistic model of brain function. It has not.

The authors are careful on this point. The model operates on fMRI BOLD signals, which are indirect, delayed, and smoothed measurements of neural activity. Improved prediction from longer windows is consistent with temporally extended brain dynamics, but it does not isolate the biological mechanism. The authors also note that longer windows may provide benefits from statistical stabilization or data-augmentation-like effects.

That distinction is not academic hair-splitting. It determines how the work should be used.

A predictive model can be valuable without being mechanistic. A latent representation can be useful without being neuroscientifically interpretable. A nonlocal kernel can capture dependencies without mapping cleanly onto functional brain organization. Confusing these categories is how research becomes product mythology, and product mythology is just technical debt with better typography.

The paper’s defensible claim is that nonlocal operator learning is a promising framework for modelling fMRI dynamics and that broader temporal and spatial context improves performance in the tested settings. The unsupported claim would be that the learned operator reveals the brain’s actual causal wiring. The first belongs in a serious roadmap. The second belongs in a funding application that makes reviewers sigh.

The business value is architecture-context fit, not “AI for brains”

The practical lesson travels beyond fMRI. Many business systems work with high-dimensional, distributed, delayed signals: medical imaging, industrial monitoring, logistics telemetry, financial market microstructure, environmental sensing, satellite imagery, and multi-camera operations. In these settings, the target event may not appear in one location, one frame, or one metric.

The paper suggests a useful operating principle:

If the phenomenon is nonlocal, test whether the model is allowed to be nonlocal before blaming the data.

That principle translates into several concrete design habits.

First, treat temporal window length as a controlled design variable. Do not inherit it from a preprocessing notebook written by someone who has since left the company and now keeps bees in Portugal. Run window sensitivity tests. Check whether performance improves, saturates, or collapses. Collapse is especially informative; it may reveal that the architecture cannot use added context.

Second, test spatial scope rather than assuming the obvious region is sufficient. In the paper, whole-brain input improves Haxby decoding for the integral operator. In business domains, the equivalent might be using full machine telemetry rather than a single sensor cluster, full transaction sequences rather than isolated events, or broader patient context rather than a single scan region.

Third, compare raw and learned representations under a simple downstream diagnostic. The KNN latent-space test is not fancy, which is precisely why it is useful. If a representation claims to organize meaningful structure, a simple classifier or retrieval test should see some of it. Not everything requires a 900-slide interpretability framework. Some things merely require not being allergic to baselines.

Fourth, separate prediction from explanation. A model can improve outcomes while remaining opaque. That is acceptable for some workflows and unacceptable for others. In clinical decision support, regulatory reporting, and human-subject research, the difference matters.
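The first habit, reading a window sweep as improving, saturating, or collapsing, can be turned into a small check. The scores below are the figures quoted earlier in this article. The function name and both thresholds are hypothetical choices that should be tuned to the noise level of the task at hand.

```python
def classify_trend(scores, tol=0.01, collapse_drop=0.10):
    """Label scores ordered by increasing window length.

    tol: smallest change treated as a real gain (assumed value).
    collapse_drop: drop below the first score treated as a collapse (assumed value).
    """
    if len(scores) < 2:
        raise ValueError("need at least two window lengths")
    first, last = scores[0], scores[-1]
    if first - min(scores) >= collapse_drop:
        return "collapses"
    if last - first <= tol:
        return "flat"
    # Overall gain is real; check whether the final step still adds anything.
    if scores[-1] - scores[-2] <= tol:
        return "saturates"
    return "improves"

sweeps = {
    "ANIE, Haxby accuracy (1/10/20 time points)": [0.7367, 0.7675, 0.7917],
    "ResNet, Haxby accuracy (1/10/20 time points)": [0.8192, 0.5416, 0.3965],
    "ViT, Miyawaki encoding R^2 (5/10/25 time points)": [0.1181, 0.1511, 0.1524],
}
verdicts = {name: classify_trend(scores) for name, scores in sweeps.items()}
```

With these thresholds the three sweeps come out as "improves", "collapses", and "saturates" respectively. The labels are crude, but they force the question of whether added context was actually used.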

Where the evidence stops

The paper’s boundaries are not defects; they are the operating envelope.

The experiments use two open fMRI datasets: Haxby and Miyawaki. These are valuable, familiar datasets, but they are not a population-scale clinical deployment. Haxby includes six subjects in the experiments described here; Miyawaki evaluations use repeated random seed initializations and test recordings, but the broader setting remains small compared with commercial-grade validation.

The measurement modality is also a boundary. fMRI BOLD signals are not direct neural activity. They are hemodynamic proxies with delay and smoothing. Any claim about “brain dynamics” should pass through that filter before becoming a business conclusion.

The model comparisons are informative but not exhaustive. The authors keep model scales comparable, which is useful for controlled comparison, but it also means the study is not a final contest against every modern large-scale architecture tuned to exhaustion. The whole-brain comparison is reported for the integral operator, with a qualitative note that, in initial experimentation, other models may also improve but by smaller factors. That is suggestive, not definitive.

Finally, temporal-window gains have more than one plausible explanation. They may reflect true exploitation of longer-range temporal dependencies. They may also reflect more stable statistics, implicit data augmentation, or easier aggregation across noisy measurements. The paper acknowledges this. Operators should too.

| Directly shown | Cognaptus inference | Remaining uncertainty |
| --- | --- | --- |
| Longer windows generally improve ANIE across several tasks | Context-window design should be part of model selection, not a fixed preprocessing choice | How much gain comes from learned temporal dynamics versus statistical stabilization |
| Whole-brain Haxby input improves ANIE over visual-cortex input | ROI cropping can remove useful distributed signal | Whether the same spatial-context benefit holds equally across larger datasets and architectures |
| ANIE improves Miyawaki encoding over ViT in reported $R^2$ and Pearson correlation | Nonlocal output-structure modelling may help high-dimensional forward prediction | Absolute encoding performance remains moderate |
| Latent embeddings outperform raw data at shorter temporal windows | Representation learning may be most valuable when signal is limited | Latent structure is not automatically mechanistic or clinically interpretable |

The operator lesson

The neat version of AI says better models extract better answers from the same data. The messier version, which is usually the real one, says models can only extract the structure they are built to notice.

This paper is useful because it makes that architectural premise visible. fMRI is not a single measurement. It is a spatially distributed, temporally delayed, high-dimensional signal. Nonlocal neural integral operators are attractive here because they give the model a way to relate distant locations and extended time intervals without forcing the dependencies into a hand-built graph or a short-memory architecture.

The evidence is not a coronation. ANIE does not dominate every baseline on every metric. Encoding remains moderate. Latent structure is encouraging but not explanatory proof. The paper’s value lies in the disciplined middle: enough empirical support to take nonlocal operator learning seriously for fMRI, enough uncertainty to keep the claims from wandering into theatre.

For operators, the message is simple. When the system you are modelling has long-range dependencies, context is not decoration. It is part of the model. Remove it casually, and the model may still produce numbers. It may even produce confident numbers. But confidence is cheap. The expensive part is making sure the architecture was allowed to see the signal in the first place.

Cognaptus: Automate the Present, Incubate the Future.

¹ Andreas Kramer, Saugat Acharya, Alice Giola, and Emanuele Zappala, “Nonlocal operator learning for fMRI encoding and decoding tasks,” arXiv:2605.20389v1, 19 May 2026, https://arxiv.org/abs/2605.20389.
