When One Heatmap Isn’t Enough: Layered XAI for Brain Tumour Detection

Diagnosis has a simple business problem hiding inside a clinical one: nobody wants a black box that is confident for the wrong reason.

That is especially true in medical imaging. A brain MRI classifier that says “tumour” or “non-tumour” is not automatically useful because it crosses a respectable accuracy threshold. The difficult question comes next: did the model look at the clinically relevant region, or did it discover some convenient artefact in the image pipeline? A single heatmap may answer that question. It may also merely look persuasive, which is not quite the same thing. Medicine, regrettably, is one of those domains where aesthetic confidence is still not a validation method.

The paper behind this article, “Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models”, tackles that problem by combining three explanation methods, namely GRAD-CAM, Layer-wise Relevance Propagation (LRP), and SHAP, around a custom convolutional neural network trained for tumour/non-tumour classification on FLAIR MRI slices from BraTS 2021.¹ The CNN matters, but the more interesting contribution is the explanatory stack. The paper is not just asking whether a model can classify brain MRI slices. It asks whether different explanation methods can be made to cross-check one another.

That is a more useful question than “which XAI method is best?” Best for what? Localising a suspicious region? Showing pixel-level relevance? Quantifying whether the visible evidence supports or opposes a tumour prediction? These are different jobs. Treating them as one job is how explainability becomes theatre.

The useful unit is not a heatmap; it is an explanation stack

The paper’s mechanism is easy to state and harder to operationalise. It uses three complementary XAI methods because each one sees a different layer of the model’s reasoning.

GRAD-CAM provides the coarse visual story. It uses gradients of the target class score with respect to the final convolutional layer’s feature maps to highlight regions that influenced the model’s decision. In plain English: it answers, “Where did the model look?” That is valuable, but broad. A GRAD-CAM map can show a region of interest without proving that every highlighted pixel is clinically meaningful.
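To make the GRAD-CAM description concrete, here is a minimal NumPy sketch of the core computation, assuming the activations and gradients of one convolutional layer have already been extracted in channels-last layout. The function name, array shapes, and toy numbers are illustrative assumptions, not the paper's code.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse class-activation map from one convolutional layer.

    activations, gradients: arrays of shape (H, W, K) for K feature maps,
    where gradients holds d(class score) / d(activations).
    """
    # One weight per feature map: the spatial average of its gradient.
    weights = gradients.mean(axis=(0, 1))             # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(activations @ weights, 0.0)      # shape (H, W)
    # Scale to [0, 1] for display; an all-zero map stays all zero.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy input (hypothetical numbers): two 4x4 feature maps.
acts = np.zeros((4, 4, 2))
acts[1:3, 1:3, 0] = 1.0      # map 0 fires on a central patch
acts[0, 0, 1] = 1.0          # map 1 fires in a corner
grads = np.zeros((4, 4, 2))
grads[..., 0] = 0.5          # class score rises with map 0
grads[..., 1] = -0.5         # and falls with map 1

heatmap = grad_cam(acts, grads)
```

In this toy case the central patch lights up and the corner is suppressed by the ReLU. The map has the resolution of the feature maps, not of the input image, which is one reason GRAD-CAM reads as coarse.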

LRP provides the finer relevance story. It propagates the model’s output backward through the network and assigns relevance scores to pixels. This answers a slightly different question: “Which input pixels carried the decision?” It is more granular than GRAD-CAM, but also more demanding to interpret. Pixel-level relevance is useful only when someone knows what kind of pixel-level pattern is plausible. Otherwise, one gets a very sophisticated red cloud. The industry already has enough clouds.
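The backward redistribution can be sketched for a single dense layer. This example uses the epsilon rule, which is one common LRP variant; whether the paper uses this rule is not stated here, so treat the rule choice, the function name, and the toy weights as assumptions.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    a: input activations (J,), W: weights (J, K), b: biases (K,),
    R_out: relevance assigned to each output unit (K,).
    """
    z = a @ W + b                                  # pre-activations, shape (K,)
    z = z + eps * np.where(z >= 0, 1.0, -1.0)      # stabiliser avoids division by zero
    s = R_out / z                                  # relevance per unit of pre-activation
    return a * (W @ s)                             # each input's share, shape (J,)

# Toy layer (hypothetical numbers): three inputs, two outputs.
a = np.array([1.0, 2.0, 0.0])
W = np.array([[1.0, -1.0],
              [0.5,  1.0],
              [2.0,  2.0]])
b = np.zeros(2)
R_out = np.array([1.0, 0.0])   # all relevance starts on output unit 0

R_in = lrp_epsilon_dense(a, W, b, R_out)
```

With zero biases the total relevance is approximately conserved from output to input, and an input with zero activation receives none. Applying the same step layer by layer down to the image is what yields a pixel-level relevance map.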

SHAP provides the contribution story. It estimates how features contribute positively or negatively to the model’s prediction. In this paper’s implementation, SHAP visualisations quantify whether areas of the image support tumour or non-tumour classification. That turns the explanation from “the model looked somewhere around here” into “these regions pushed the prediction in this direction.”
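A support/oppose percentage of this kind could be derived from a SHAP value map as sketched below. The article does not say how the paper computes its percentages, so the magnitude-weighted split here is an assumption, as are the function name and the toy values.

```python
import numpy as np

def contribution_balance(shap_values: np.ndarray) -> tuple[float, float]:
    """Return (percent supporting, percent opposing) the target class.

    Assumption: the split is weighted by magnitude. Positive values support
    the class, negative values oppose it.
    """
    support = shap_values[shap_values > 0].sum()
    oppose = -shap_values[shap_values < 0].sum()
    total = support + oppose
    if total == 0:
        return 0.0, 0.0
    return float(100 * support / total), float(100 * oppose / total)

# Toy 2x2 "image" of SHAP values (hypothetical numbers).
toy = np.array([[0.3, -0.1],
                [0.4, -0.2]])
pro, con = contribution_balance(toy)
print(f"supports: {pro:.2f}%, opposes: {con:.2f}%")
```

An alternative reading counts how many values are positive rather than summing their magnitudes. The two can differ noticeably on real maps, so any interface that displays such a figure should state which one it uses.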

The paper’s combined framework therefore has a simple internal division of labour:

Explanation method	What it contributes	What it cannot do alone	Operational role
GRAD-CAM	Broad region-of-interest localisation	May be too coarse or spatially overinclusive	First-pass attention check
LRP	Pixel-level relevance	Can be visually noisy and harder to interpret	Fine-grained plausibility check
SHAP	Positive/negative contribution balance	Can be abstract without spatial context	Quantified support/contradiction check

This is why the mechanism-first reading matters. A conventional article would summarise the dataset, the CNN, the metrics, and the XAI methods in sequence. That would be tidy and mostly useless. The real value of the paper is the layered diagnostic logic: location first, relevance second, contribution third.

In a hospital or regulated AI vendor, that distinction matters. An explanation panel is not valuable because it has three colourful outputs instead of one. It is valuable if each output answers a different operational question and makes the next human action clearer.

The model improves, but the real result is the remaining uncertainty

The authors build a custom CNN rather than relying on a pre-trained architecture. Their reasoning is sensible. The BraTS scans used in the paper are brain-focused MRI rather than natural images or full head images, and the project aims to control the architecture closely enough to support interpretability. The pipeline uses FLAIR sequences from BraTS 2021, converts 3D MRI volumes into 2D slices, labels slices as tumour or non-tumour using segmentation masks, filters for informative slices, normalises and resizes images, and splits the data at the subject level to reduce contamination between training, validation, and test sets.
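Two of those pipeline steps, mask-based slice labelling and subject-level splitting, can be sketched as follows. The labelling rule (any nonzero mask voxel means tumour), the split fractions, and the subject ID format are assumptions for illustration; the paper's exact choices are not reproduced here.

```python
import random
import numpy as np

def label_slices(mask_volume: np.ndarray) -> list[int]:
    """Label each slice along the last axis: 1 if its mask has any nonzero voxel."""
    return [int(mask_volume[:, :, z].any()) for z in range(mask_volume.shape[2])]

def split_subjects(subject_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Partition subject IDs so that no subject appears in more than one set."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)          # deterministic shuffle for repeatability
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }

# Toy segmentation volume (hypothetical): tumour voxels only in slice 3.
mask = np.zeros((8, 8, 5), dtype=np.uint8)
mask[2:4, 2:4, 3] = 1
labels = label_slices(mask)

# Hypothetical subject IDs; slices inherit the split of their subject.
splits = split_subjects([f"subject_{i:03d}" for i in range(20)])
```

Splitting by subject rather than by slice matters because neighbouring slices from one patient look alike. A slice-level split would let near-duplicates of test images into training and inflate the reported metrics.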

The improved model performs better than the baseline version inspired by Hafeez et al. The test accuracy rises from 84.76% to 91.24%, and test loss falls from 0.3482 to 0.2355. Precision improves from 0.9191 to 0.9608. Recall rises from 0.7622 to 0.8622. F1-score increases from 0.8333 to 0.9088. AUC improves from 0.92 to 0.96.

Metric	Original model	Improved model	Interpretation
Test accuracy	84.76%	91.24%	More slices classified correctly
Test loss	0.3482	0.2355	Predictions closer to labels, though the remaining loss is not trivial
Precision	0.9191	0.9608	Fewer false positive tumour calls
Recall	0.7622	0.8622	Fewer missed tumour cases
F1-score	0.8333	0.9088	Better balance between false positives and false negatives
AUC	0.92	0.96	Better class separation
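As a quick sanity check on the table, the F1-score is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two columns. The function name is illustrative; the inputs are the figures quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

original_f1 = f1_score(0.9191, 0.7622)
improved_f1 = f1_score(0.9608, 0.8622)
print(f"original: {original_f1:.4f}, improved: {improved_f1:.4f}")
```

Both values round to the reported 0.8333 and 0.9088, so the precision, recall, and F1 columns are at least internally consistent.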

For a business reader, the tempting conclusion is obvious: the model got better, therefore the system became more useful. That is only partly right.

The more important point is that even the improved model still fails in ways that are operationally meaningful. The improved confusion matrix reports fewer false positives and fewer false negatives than the original model. False negatives fall from 649 to 376. False positives fall from 183 to 96 in the confusion matrix discussion, while the later false-positive breakdown refers to 97 cases. That small internal inconsistency is not the headline, but it is a reminder that careful readers should not treat every reported count as a polished regulatory evidence package. This is a research prototype, not a clinical submission dossier.

The authors examine the remaining errors. Among false negatives, poor image quality and partial tumour visibility dominate the story. They report 123 of 376 false negatives as poor-quality images, with issues such as blurriness, low contrast, and motion artefacts. The remaining 253 are attributed to partial tumours or limited tumour visibility. Among false positives, some are poor-quality images, while others are non-tumorous anomalies with image characteristics that resemble tumour tissue.

This is where explainability becomes more than an academic accessory. The model’s mistakes are not random bookkeeping errors. They fall into categories that matter for workflow design:

Error pattern	Likely cause	Business meaning	What XAI can support
False negative with poor image quality	Low contrast, blur, motion artefacts	The AI may need image-quality gating before interpretation	Flag the case as unreliable rather than silently negative
False negative with partial tumour visibility	Small or incomplete tumour evidence in a 2D slice	The slice-level classifier may miss context visible across adjacent slices	Trigger review of contiguous slices or 3D context
False positive with poor image quality	Noise mistaken for pathology	Weak input quality can trigger unnecessary alarms	Separate image-quality risk from tumour evidence
False positive with non-tumorous anomaly	Benign or unrelated bright structures mimic tumour features	The model may need multimodal MRI context or richer labels	Show whether attention aligns with actual segmentation

The lesson is not “the model is accurate.” The lesson is “the model’s residual errors suggest where an assurance layer should intervene.” That is a much more practical statement.

Positive cases are useful when explanations converge

In clear tumour cases, the three explanation methods tend to align. GRAD-CAM highlights the tumour region. LRP assigns strong relevance to the same area. SHAP shows that a majority of contributions support the tumour classification. In one positive combined example, the paper reports 64.73% of SHAP values supporting tumour classification and 35.27% contributing against it.

That convergence is important because it gives a human reviewer several forms of agreement. The model does not merely say “tumour.” Its broad attention map, pixel relevance pattern, and contribution balance all point in a compatible direction.

This is the cleanest use case for layered XAI. The explanation stack acts like a consistency check:

Spatial check: Does the broad highlighted region overlap the visible abnormality?
Granular check: Do pixel-level relevance scores concentrate where the suspicious tissue appears?
Contribution check: Do positive feature contributions dominate in a way consistent with the prediction?
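The three checks above could be turned into a programmatic consistency test along these lines. This is a sketch under stated assumptions: the reference region is taken to be a boolean mask of the visible abnormality, and every threshold is a placeholder that a real system would need to calibrate. None of these names come from the paper.

```python
import numpy as np

def region_overlap(heatmap, region, threshold=0.5):
    """Spatial check: IoU between the thresholded heatmap and the region."""
    hot = heatmap >= threshold
    union = np.logical_or(hot, region).sum()
    return float(np.logical_and(hot, region).sum() / union) if union else 0.0

def relevance_inside(relevance, region):
    """Granular check: share of positive relevance that falls inside the region."""
    positive = np.clip(relevance, 0, None)
    total = positive.sum()
    return float(positive[region].sum() / total) if total else 0.0

def explanations_converge(heatmap, relevance, shap_support_pct, region,
                          min_iou=0.3, min_inside=0.5, min_support=50.0):
    """True only if all three checks pass. Thresholds are illustrative."""
    return (region_overlap(heatmap, region) >= min_iou
            and relevance_inside(relevance, region) >= min_inside
            and shap_support_pct > min_support)

# Toy inputs (hypothetical): a 2x2 lesion in a 6x6 slice.
region = np.zeros((6, 6), dtype=bool)
region[2:4, 2:4] = True
heatmap = np.where(region, 1.0, 0.2)       # GRAD-CAM-like map, hot on the lesion
relevance = np.where(region, 1.0, 0.05)    # LRP-like map, mostly inside the lesion

converged = explanations_converge(heatmap, relevance, 64.73, region)
```

Here all three checks pass, using the 64.73% SHAP figure from the paper's positive example. Swapping in a SHAP figure below 50% makes the function return False even though the two maps are unchanged, which is the point of requiring agreement across methods.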

If the answer is yes across all three, the system provides a stronger explanation than any single method could provide. Not proof. Stronger explanation. The distinction matters because XAI does not certify clinical truth; it clarifies model reasoning. A map of the model’s attention is not the same as pathology. But a map, a relevance pattern, and a contribution balance aligned around a visible tumour are more useful than a lone heatmap asking to be admired.

For AI product teams, this suggests a design principle: explanations should not be dumped into the interface as visual ornaments. They should be arranged as an evidence sequence. First the user sees where the model looked. Then what pixels mattered. Then whether the evidence supported or opposed the classification. If an interface cannot tell the user how to read its explanation stack, it has not built explainability. It has built a radiology-themed dashboard.

Negative cases are not empty; they reveal diffuse reasoning

Negative cases are often treated as less interesting because there is no tumour to point at. The paper’s XAI results suggest the opposite. In non-tumour samples, GRAD-CAM shows less concentrated activation. LRP shows scattered relevance rather than a dominant tumour-like focus. SHAP shows contributions weighted strongly against tumour classification. In one negative SHAP example, 86.27% of SHAP values support non-tumour classification, with only 13.73% supporting tumour classification. In a combined negative example, 79.98% of values contribute against tumour classification and 20.02% support it.

This matters because “nothing detected” is not the same as “nothing happened inside the model.” A negative decision still has a reasoning pattern. If that pattern is diffuse and the quantitative contribution balance points away from tumour, the model’s negative call becomes more interpretable.

That is useful for decision support. In production, negative cases can be dangerous because they may invite complacency. A layered explanation can help separate a confident negative from a fragile one. For example:

Negative-case signal	Possible interpretation	Operational response
Diffuse GRAD-CAM, scattered LRP, SHAP mostly negative	Model sees no concentrated tumour-like evidence	Lower review priority, assuming image quality is adequate
Focused GRAD-CAM but SHAP mostly negative	Model attends to a region but contribution balance rejects tumour	Human review may check whether the region reflects artefact or benign structure
Scattered LRP with weak SHAP margin	Model reasoning may be unstable	Request additional slice/context review
Poor image quality plus negative classification	Input may be unreliable	Do not treat negative output as reassuring without quality control

The business implication is subtle. XAI should not merely explain positive alerts. It should help triage negative decisions. A missed tumour is clinically and operationally worse than an extra review in many settings. If XAI can identify negative cases where the model’s reasoning is weak, then it becomes part of a risk-routing system rather than a retrospective explanation toy.

Partial tumours are where the stack earns its keep

The most interesting part of the paper is its treatment of partial tumour cases. These are exactly the cases where a 2D slice-level classifier can struggle. The tumour may be barely visible. The segmentation mask confirms tumour tissue, but the original slice may not make that obvious to a human reader at a glance. A single explanation method can become ambiguous.

In the first partial tumour example, GRAD-CAM focuses on the relevant region but also highlights other areas. That broader focus suggests uncertainty. LRP gives a more concentrated relevance pattern around the tumour region. SHAP reports 67.48% of values supporting tumour classification and 32.52% contributing against it. The combined interpretation is more useful than any component alone: GRAD-CAM says the model is looking broadly but includes the relevant region; LRP says the local pixel relevance still concentrates near tumour tissue; SHAP says the contribution balance still favours tumour.

In the second partial tumour example, the story is different and more valuable. The model predicts tumour with probability 0.507, essentially a coin toss dressed in neural-network clothing. The XAI methods reflect this uncertainty. GRAD-CAM activates multiple regions. LRP is scattered rather than concentrated. SHAP reports 85.97% of values contributing to non-tumour diagnosis and only 14.03% supporting tumour diagnosis. The segmentation mask confirms tumour tissue, but the explanation stack shows that the model’s internal evidence is weak.
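A borderline case like this one can be flagged mechanically. The sketch below assumes a decision threshold of 0.5 and an arbitrary margin of 0.05 around it; both values and the function name are illustrative, not taken from the paper.

```python
def borderline_flags(p_tumour, shap_tumour_pct, threshold=0.5, margin=0.05):
    """Flag predictions that sit near the threshold or conflict with SHAP."""
    predicted_tumour = p_tumour >= threshold
    shap_favours_tumour = shap_tumour_pct > 50.0
    return {
        # The probability is too close to the cut-off to be reassuring.
        "near_threshold": abs(p_tumour - threshold) < margin,
        # The predicted class and the SHAP majority point in opposite directions.
        "shap_disagrees": predicted_tumour != shap_favours_tumour,
    }

# The second partial tumour example: p = 0.507, 14.03% of SHAP values support tumour.
flags = borderline_flags(p_tumour=0.507, shap_tumour_pct=14.03)
print(flags)
```

Both flags are raised for this case: the score is within the margin, and the model predicts tumour while the SHAP balance favours non-tumour.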

This is the paper’s best practical insight. In borderline cases, disagreement is not a failure of the explanation system. It is the explanation.

A weak article would say: “The combined framework improves trust.” A better reading says: “The combined framework makes uncertainty harder to hide.” That is the point.

Case type	GRAD-CAM	LRP	SHAP	Practical reading
Clear tumour	Focused on tumour region	Relevance concentrated near tumour	Majority supports tumour	Explanations converge; model reasoning appears plausible
Non-tumour	Broad or diffuse attention	Scattered relevance	Majority supports non-tumour	Negative decision has interpretable support
Partial tumour, detectable	Broad but includes tumour region	More focused around tumour	Majority supports tumour	Layering clarifies weak but usable evidence
Partial tumour, uncertain	Multiple activations	Scattered relevance	Majority supports non-tumour despite tumour label	Escalate for human review and adjacent-slice context

This is where the paper’s business relevance becomes concrete. The framework should not be sold as “AI that doctors can trust.” That phrase has done enough damage already. It should be understood as a quality-assurance layer for model reasoning. It helps answer whether a prediction deserves routine acceptance, closer review, or outright suspicion.

What the paper directly shows, and what Cognaptus infers

The paper directly shows three things.

First, a custom CNN trained on FLAIR-derived 2D slices from BraTS 2021 can improve over the authors’ baseline configuration after architectural and hyperparameter changes. The reported improvement is meaningful within the study’s own setup: accuracy rises to 91.24%, recall improves, and false negatives fall.

Second, applying GRAD-CAM, LRP, and SHAP side-by-side provides complementary views of the model’s predictions. The methods often converge in clear positive and negative cases, and they provide different signals in partial or ambiguous cases.

Third, the combined XAI panel can expose uncertainty in borderline slices. The second partial tumour case is especially important because the model’s probability is only 0.507 and the explanation methods do not produce a clean, reassuring story. That is exactly what a serious decision-support tool should reveal.

Cognaptus infers a business pathway from these findings, but it should be stated carefully.

The pathway is not immediate clinical deployment. The pathway is operational assurance for imaging AI. Hospitals, AI imaging vendors, and governance teams can use layered explanations to design review queues, identify weak predictions, inspect failure modes, and define escalation rules. The system is more useful as a second-reader support layer than as an autonomous diagnostic actor.

A practical workflow could look like this:

Model + XAI output	Suggested workflow action
High model confidence + convergent XAI around visible lesion	Standard clinician review with model-supported annotation
High model confidence + XAI focus outside plausible anatomy	Flag for model-quality review
Low model confidence + divergent XAI	Escalate to senior review or require adjacent-slice inspection
Negative prediction + poor image quality	Treat as unreliable negative; request quality-aware review
Positive prediction + SHAP weakly supportive	Check for artefact, anomaly, or non-tumorous mimic

This is not glamorous, but it is where the money is. The ROI of explainability in healthcare is not “clinicians feel emotionally comforted by heatmaps.” The ROI is fewer hidden failure modes, better audit trails, clearer review prioritisation, and less blind dependence on a probability score that may be precise without being meaningful.

The evidence is promising, but not deployment evidence

The paper is candid about limitations, and the practical interpretation depends on taking them seriously.

The first limitation is the use of FLAIR only. FLAIR is useful for highlighting lesions and tumour-related abnormalities, but radiologists normally interpret multiple MRI sequences, including T1, T2, and contrast-enhanced views. A single-modality classifier cannot replicate that clinical information environment.

The second limitation is 2D slice analysis. The paper converts 3D MRI volumes into 2D slices and classifies those slices. That makes the modelling task tractable, but it loses contiguous spatial context. The authors themselves note that examining adjacent MRI slices could help in subtle cases. This is not a minor point. A tumour that is partial in one slice may be clearer in neighbouring slices. A clinical workflow is volumetric; a 2D classifier is a simplified proxy.
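One modest way to act on the adjacent-slice suggestion is to pull the neighbouring slice indices whenever a slice is flagged for review. The helper below is a hypothetical sketch: the radius is arbitrary, and the volume depth of 155 is used only as an example value.

```python
def adjacent_slices(index: int, n_slices: int, radius: int = 2) -> list[int]:
    """Indices of neighbouring slices within `radius`, clipped to the volume."""
    lo = max(0, index - radius)
    hi = min(n_slices - 1, index + radius)
    return [i for i in range(lo, hi + 1) if i != index]

# A flagged slice near the start of a 155-slice volume (example values).
neighbours = adjacent_slices(index=1, n_slices=155)
print(neighbours)
```

This only retrieves context for a human reviewer. It does not turn the 2D classifier into a volumetric one.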

The third limitation is cognitive load. Combining GRAD-CAM, LRP, and SHAP produces more information, but more information is not automatically better. A radiologist or clinical reviewer would need a clean interface, interpretation rules, and escalation criteria. Otherwise, layered XAI becomes another demand on expert attention. Healthcare workflows are not short of things blinking for attention.

The fourth limitation is validation. The paper demonstrates the framework on BraTS 2021 and qualitative case examples. That is useful research evidence. It is not proof of clinical generalisation, safety, or workflow benefit. A deployment-oriented study would need external validation, prospective testing, reader studies, calibration analysis, quality stratification, and evidence that the explanation layer actually improves decisions rather than merely making users feel more confident.

The fifth limitation is that XAI methods explain model behaviour, not ground truth. If a model has learned a spurious pattern, XAI may reveal that pattern. It does not automatically correct it. Seeing the wrong reason clearly is better than not seeing it, but it is still the wrong reason. This is where some AI presentations quietly confuse transparency with trust. Transparency is the beginning of inspection. It is not the end of validation.

The real product is not the classifier; it is the review protocol

For business leaders, the natural instinct is to ask whether the model is good enough. That is the wrong first question. A better question is: what workflow can absorb this model safely?

The paper suggests that the useful product is not a standalone brain tumour detector. It is a review protocol where model output and explanation agreement jointly determine the next step.

A hospital or imaging vendor could define several review bands:

Review band	Model state	XAI state	Action
Routine support	Confident prediction	GRAD-CAM, LRP, SHAP broadly aligned	Present annotation as decision support
Cautious support	Moderate confidence	Partial agreement among XAI methods	Require explicit human confirmation
Escalation	Low confidence or near-threshold score	Divergent or scattered explanations	Route to more experienced reviewer
Quality rejection	Any confidence level	Image quality appears poor or attention unreliable	Flag input-quality issue before diagnosis
Model audit	Repeated failures in similar cases	Systematic mismatch between explanation and anatomy	Feed into monitoring and retraining review
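The per-case bands in the table can be expressed as a small routing function. Everything numeric here is a placeholder: the confidence cut-offs, the idea of counting how many of the three methods agree, and the names are assumptions for illustration. The model-audit band is left out because it depends on patterns across many cases rather than on a single prediction.

```python
from enum import Enum

class Band(Enum):
    ROUTINE = "routine support"
    CAUTIOUS = "cautious support"
    ESCALATION = "escalation"
    QUALITY_REJECTION = "quality rejection"

def review_band(confidence: float, methods_agreeing: int, quality_ok: bool,
                high: float = 0.9, low: float = 0.6) -> Band:
    """Map model state and XAI agreement to a review band.

    confidence: max(p, 1 - p) for the predicted class.
    methods_agreeing: how many of the three XAI methods (0 to 3) support the prediction.
    """
    if not quality_ok:
        return Band.QUALITY_REJECTION      # input quality is checked before anything else
    if confidence < low or methods_agreeing <= 1:
        return Band.ESCALATION             # weak score or divergent explanations
    if confidence >= high and methods_agreeing == 3:
        return Band.ROUTINE                # confident and fully aligned
    return Band.CAUTIOUS                   # everything in between

print(review_band(confidence=0.507, methods_agreeing=0, quality_ok=True))
```

Checking image quality first reflects the table's "any confidence level" entry: a poor-quality input is rejected regardless of how confident or well-explained the prediction looks.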

This is a more mature interpretation of explainable medical AI. The goal is not to make every output look explainable. The goal is to decide when explanation agreement is strong enough to support routine review and when explanation disagreement should trigger caution.

That principle extends beyond brain tumour detection. The same mechanism can apply to insurance claims triage, credit risk models, fraud detection, industrial defect inspection, and compliance review. In every case, the useful question is not only “what did the model predict?” It is also “what kind of evidence did the model use, and do independent explanation lenses tell a coherent story?”

Explainability is infrastructure, not decoration

The paper’s contribution is not that GRAD-CAM, LRP, and SHAP exist. We knew that. Nor is the main contribution simply that a CNN can achieve 91.24% accuracy on a controlled brain MRI classification setup. Useful, but not shocking.

The contribution is architectural: explanation methods can be layered into an assurance mechanism. GRAD-CAM gives the rough geography. LRP tests the local relevance. SHAP gives a contribution balance. When they agree, confidence becomes more interpretable. When they disagree, uncertainty becomes visible.

That is the standard medical AI should be moving toward. Not “trust the model because it has a heatmap.” Not “trust the heatmap because it is colourful.” Trust, where justified, should come from structured inspection: model performance, input quality, explanation consistency, human review, and workflow monitoring.

The dry conclusion is also the useful one. One heatmap is often not enough. A probability score is not enough either. In high-stakes AI, the business value of explainability comes from turning prediction into reviewable reasoning. This paper gives a compact prototype of how that can work.

And yes, it still needs stronger validation before anyone treats it as clinical infrastructure. That is not a weakness of the article’s argument. That is the argument.

Cognaptus: Automate the Present, Incubate the Future.

¹ Patrick McGonagle, William Farrelly, and Kevin Curran, “Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models,” arXiv:2602.05240, 2026, https://arxiv.org/pdf/2602.05240.
