Diagnosis has a simple business problem hiding inside a clinical one: nobody wants a black box that is confident for the wrong reason.
That is especially true in medical imaging. A brain MRI classifier that says “tumour” or “non-tumour” is not automatically useful because it crosses a respectable accuracy threshold. The difficult question comes next: did the model look at the clinically relevant region, or did it discover some convenient artefact in the image pipeline? A single heatmap may answer that question. It may also merely look persuasive, which is not quite the same thing. Medicine, regrettably, is one of those domains where aesthetic confidence is still not a validation method.
The paper behind this article, Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models, tackles that problem by combining three explanation methods (GRAD-CAM, Layer-wise Relevance Propagation (LRP), and SHAP) around a custom convolutional neural network trained for tumour/non-tumour classification on FLAIR MRI slices from BraTS 2021.[1] The CNN matters, but the more interesting contribution is the explanatory stack. The paper is not just asking whether a model can classify brain MRI slices. It asks whether different explanation methods can be made to cross-check one another.
That is a more useful question than “which XAI method is best?” Best for what? Localising a suspicious region? Showing pixel-level relevance? Quantifying whether the visible evidence supports or opposes a tumour prediction? These are different jobs. Treating them as one job is how explainability becomes theatre.
The useful unit is not a heatmap; it is an explanation stack
The paper’s mechanism is easy to state and harder to operationalise. It uses three complementary XAI methods because each one exposes a different aspect of the model’s reasoning.
GRAD-CAM provides the coarse visual story. It uses the gradients of the target class score with respect to the feature maps of the final convolutional layer to highlight regions that influenced the model’s decision. In plain English: it answers, “Where did the model look?” That is valuable, but broad. A GRAD-CAM map can show a region of interest without proving that every highlighted pixel is clinically meaningful.
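To make the "where did the model look?" step concrete, here is a minimal sketch of the standard GRAD-CAM calculation, assuming the final-layer feature maps and their gradients have already been extracted from the network. The function name and the array layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse GRAD-CAM map from final-conv-layer activations and gradients.

    activations: (H, W, K) feature maps of the final convolutional layer.
    gradients:   (H, W, K) gradient of the target class score w.r.t. those maps.
    Both arrays are assumed to have been extracted from the model beforehand.
    """
    # One weight per channel: the spatial average of that channel's gradient.
    weights = gradients.mean(axis=(0, 1))           # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(activations @ weights, 0.0)    # shape (H, W)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map has the spatial resolution of the last convolutional layer, which is usually much coarser than the input slice. That is one reason the highlighted region tends to be broad once it is upsampled onto the image.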
LRP provides the finer relevance story. It propagates the model’s output backward through the network and assigns relevance scores to pixels. This answers a slightly different question: “Which input pixels carried the decision?” It is more granular than GRAD-CAM, but also more demanding to interpret. Pixel-level relevance is useful only when someone knows what kind of pixel-level pattern is plausible. Otherwise, one gets a very sophisticated red cloud. The industry already has enough clouds.
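The backward propagation can be illustrated for a single dense layer. The sketch below uses the epsilon rule, one common LRP variant; this article does not say which rule the paper applies, so treat the choice, the function name, and the default `eps` as assumptions. A full CNN would need a comparable rule for each layer type.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    a:     (J,)   input activations of the layer
    W:     (J, K) weights
    b:     (K,)   biases
    R_out: (K,)   relevance already assigned to the layer's outputs
    Uses the epsilon rule (an assumed choice, not confirmed by the paper).
    """
    z = a @ W + b                                   # pre-activations, shape (K,)
    z = z + eps * np.where(z >= 0, 1.0, -1.0)       # stabiliser avoids division by zero
    s = R_out / z                                   # relevance per unit of pre-activation
    return a * (W @ s)                              # relevance per input, shape (J,)
```

With zero biases and a small `eps`, the relevance leaving the layer roughly equals the relevance entering it. That conservation property is what lets pixel-level scores be read as shares of the output decision.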
SHAP provides the contribution story. It estimates how features contribute positively or negatively to the model’s prediction. In this paper’s implementation, SHAP visualisations quantify whether areas of the image support tumour or non-tumour classification. That turns the explanation from “the model looked somewhere around here” into “these regions pushed the prediction in this direction.”
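A support/oppose percentage of this kind can be derived from a SHAP value map in more than one way. The sketch below assumes the split is computed by absolute magnitude, with positive values read as support for the tumour class; the article does not state the paper's exact aggregation, so both the function name and the convention are hypothetical.

```python
import numpy as np

def shap_support_split(shap_values):
    """Return (percent supporting, percent opposing) the target class.

    Assumption: shares are taken by summed magnitude, not by pixel count.
    Positive SHAP values are read as support, negative values as opposition.
    """
    values = np.asarray(shap_values, dtype=float)
    support = values[values > 0].sum()
    oppose = -values[values < 0].sum()
    total = support + oppose
    if total == 0:
        return 0.0, 0.0          # no attribution at all: nothing to split
    return 100.0 * support / total, 100.0 * oppose / total
```

Counting pixels instead of summing magnitudes would give a different split for the same map, so any interface showing such a percentage should say which convention it uses.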
The paper’s combined framework therefore has a simple internal division of labour:
| Explanation method | What it contributes | What it cannot do alone | Operational role |
|---|---|---|---|
| GRAD-CAM | Broad region-of-interest localisation | May be too coarse or spatially overinclusive | First-pass attention check |
| LRP | Pixel-level relevance | Can be visually noisy and harder to interpret | Fine-grained plausibility check |
| SHAP | Positive/negative contribution balance | Can be abstract without spatial context | Quantified support/contradiction check |
This is why the mechanism-first reading matters. A conventional article would summarise the dataset, the CNN, the metrics, and the XAI methods in sequence. That would be tidy and mostly useless. The real value of the paper is the layered diagnostic logic: location first, relevance second, contribution third.
In a hospital or regulated AI vendor, that distinction matters. An explanation panel is not valuable because it has three colourful outputs instead of one. It is valuable if each output answers a different operational question and makes the next human action clearer.
The model improves, but the real result is the remaining uncertainty
The authors build a custom CNN rather than relying on a pre-trained architecture. Their reasoning is sensible. The BraTS data used in the paper consists of brain-focused MRI data rather than natural images or full head images, and the project aims to control the architecture closely enough to support interpretability. The model uses FLAIR sequences from BraTS 2021, converts 3D MRI volumes into 2D slices, labels slices as tumour or non-tumour using segmentation masks, filters for informative slices, normalises and resizes images, and splits the data at the subject level to reduce contamination between training, validation, and test sets.
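Two of those preprocessing steps are easy to get subtly wrong, so a minimal sketch may help: deriving slice labels from a segmentation mask, and splitting by subject rather than by slice. The function names, the `min_tumour_pixels` cut-off, and the split fractions are illustrative assumptions; the article does not report the paper's values.

```python
import random
import numpy as np

def label_slices(mask_volume, min_tumour_pixels=1):
    """Label each 2D slice of a (H, W, D) segmentation mask: 1 = tumour, 0 = non-tumour.

    Any nonzero mask voxel counts as tumour. The pixel cut-off is hypothetical.
    """
    tumour_pixels = (np.asarray(mask_volume) > 0).sum(axis=(0, 1))   # shape (D,)
    return (tumour_pixels >= min_tumour_pixels).astype(int)

def split_subjects(subject_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole subjects to train/val/test so no patient spans two sets."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)                # reproducible shuffle
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    return {
        "test": set(ids[:n_test]),
        "val": set(ids[n_test:n_test + n_val]),
        "train": set(ids[n_test + n_val:]),
    }
```

Splitting at the slice level instead would let neighbouring slices from the same patient land in both training and test sets, which inflates test metrics without improving the model.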
The improved model performs better than the baseline version inspired by Hafeez et al. The test accuracy rises from 84.76% to 91.24%, and test loss falls from 0.3482 to 0.2355. Precision improves from 0.9191 to 0.9608. Recall rises from 0.7622 to 0.8622. F1-score increases from 0.8333 to 0.9088. AUC improves from 0.92 to 0.96.
| Metric | Original model | Improved model | Interpretation |
|---|---|---|---|
| Test accuracy | 84.76% | 91.24% | More slices classified correctly |
| Test loss | 0.3482 | 0.2355 | Predicted probabilities sit closer to the labels, though the loss is still well above zero |
| Precision | 0.9191 | 0.9608 | Fewer false positive tumour calls |
| Recall | 0.7622 | 0.8622 | Fewer missed tumour cases |
| F1-score | 0.8333 | 0.9088 | Better balance between false positives and false negatives |
| AUC | 0.92 | 0.96 | Better class separation |
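The reported F1-scores can be checked against the reported precision and recall, since F1 is simply their harmonic mean. This is a quick arithmetic check on the table's figures, not a reproduction of the paper's evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

original = f1_score(0.9191, 0.7622)
improved = f1_score(0.9608, 0.8622)
print(f"original F1 = {original:.4f}, improved F1 = {improved:.4f}")
# prints: original F1 = 0.8333, improved F1 = 0.9088
```

Both values match the table, so the headline metrics are at least internally consistent with one another.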
For a business reader, the tempting conclusion is obvious: the model got better, therefore the system became more useful. That is only partly right.
The more important point is that even the improved model still fails in ways that are operationally meaningful. The improved confusion matrix reports fewer false positives and fewer false negatives than the original model. False negatives fall from 649 to 376. False positives fall from 183 to 96 in the confusion matrix discussion, while the later false-positive breakdown refers to 97 cases. That small internal inconsistency is not the headline, but it is a reminder that careful readers should not treat every reported count as a polished regulatory evidence package. This is a research prototype, not a clinical submission dossier.
The authors examine the remaining errors. Among false negatives, poor image quality and partial tumour visibility dominate the story. They report 123 of 376 false negatives as poor-quality images, with issues such as blurriness, low contrast, and motion artefacts. The remaining 253 are attributed to partial tumours or limited tumour visibility. Among false positives, some are poor-quality images, while others are non-tumorous anomalies with image characteristics that resemble tumour tissue.
This is where explainability becomes more than an academic accessory. The model’s mistakes are not random bookkeeping errors. They fall into categories that matter for workflow design:
| Error pattern | Likely cause | Business meaning | What XAI can support |
|---|---|---|---|
| False negative with poor image quality | Low contrast, blur, motion artefacts | The AI may need image-quality gating before interpretation | Flag the case as unreliable rather than silently negative |
| False negative with partial tumour visibility | Small or incomplete tumour evidence in a 2D slice | The slice-level classifier may miss context visible across adjacent slices | Trigger review of contiguous slices or 3D context |
| False positive with poor image quality | Noise mistaken for pathology | Avoid unnecessary alarm when input quality is weak | Separate image-quality risk from tumour evidence |
| False positive with non-tumorous anomaly | Benign or unrelated bright structures mimic tumour features | The model may need multimodal MRI context or richer labels | Show whether attention aligns with actual segmentation |
The lesson is not “the model is accurate.” The lesson is “the model’s residual errors suggest where an assurance layer should intervene.” That is a much more practical statement.
Positive cases are useful when explanations converge
In clear tumour cases, the three explanation methods tend to align. GRAD-CAM highlights the tumour region. LRP assigns strong relevance to the same area. SHAP shows that a majority of contributions support the tumour classification. In one positive combined example, the paper reports 64.73% of SHAP values supporting tumour classification and 35.27% contributing against it.
That convergence is important because it gives a human reviewer several forms of agreement. The model does not merely say “tumour.” Its broad attention map, pixel relevance pattern, and contribution balance all point in a compatible direction.
This is the cleanest use case for layered XAI. The explanation stack acts like a consistency check:
- Spatial check: Does the broad highlighted region overlap the visible abnormality?
- Granular check: Do pixel-level relevance scores concentrate where the suspicious tissue appears?
- Contribution check: Do positive feature contributions dominate in a way consistent with the prediction?
If the answer is yes across all three, the system provides a stronger explanation than any single method could provide. Not proof. Stronger explanation. The distinction matters because XAI does not certify clinical truth; it clarifies model reasoning. A map of the model’s attention is not the same as pathology. But a map, a relevance pattern, and a contribution balance aligned around a visible tumour are more useful than a lone heatmap asking to be admired.
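The three checks can be written down as explicit tests. The sketch below assumes a reference lesion mask is available, as it is in BraTS but usually is not at the point of clinical use, and every threshold is a hypothetical placeholder rather than a value from the paper.

```python
import numpy as np

def convergence_checks(cam, lrp, shap_support_pct, lesion_mask,
                       cam_threshold=0.5, min_overlap=0.5,
                       min_relevance_share=0.5, min_support_pct=50.0):
    """Run the spatial, granular, and contribution checks for one slice.

    cam:              (H, W) GRAD-CAM map scaled to [0, 1]
    lrp:              (H, W) pixel relevance scores
    shap_support_pct: percentage of SHAP contribution supporting the prediction
    lesion_mask:      (H, W) reference mask, nonzero = abnormality (assumed available)
    All thresholds are illustrative assumptions.
    """
    lesion = np.asarray(lesion_mask).astype(bool)
    hot = np.asarray(cam) >= cam_threshold
    # Spatial check: how much of the highlighted area lies on the lesion?
    spatial = bool(hot.any()) and (hot & lesion).sum() / hot.sum() >= min_overlap
    # Granular check: how much of the positive relevance lies on the lesion?
    positive = np.clip(np.asarray(lrp, dtype=float), 0.0, None)
    granular = positive.sum() > 0 and positive[lesion].sum() / positive.sum() >= min_relevance_share
    # Contribution check: do supporting contributions dominate?
    contribution = shap_support_pct > min_support_pct
    return {"spatial": bool(spatial), "granular": bool(granular),
            "contribution": bool(contribution)}
```

Three passes mean the explanations are compatible with one another and with the reference mask. They do not mean the prediction is clinically correct.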
For AI product teams, this suggests a design principle: explanations should not be dumped into the interface as visual ornaments. They should be arranged as an evidence sequence. First the user sees where the model looked. Then what pixels mattered. Then whether the evidence supported or opposed the classification. If an interface cannot tell the user how to read its explanation stack, it has not built explainability. It has built a radiology-themed dashboard.
Negative cases are not empty; they reveal diffuse reasoning
Negative cases are often treated as less interesting because there is no tumour to point at. The paper’s XAI results suggest the opposite. In non-tumour samples, GRAD-CAM shows less concentrated activation. LRP shows scattered relevance rather than a dominant tumour-like focus. SHAP shows contributions weighted strongly against tumour classification. In one negative SHAP example, 86.27% of SHAP values support non-tumour classification, with only 13.73% supporting tumour classification. In a combined negative example, 79.98% of SHAP values weigh against tumour classification and 20.02% support it.
This matters because “nothing detected” is not the same as “nothing happened inside the model.” A negative decision still has a reasoning pattern. If that pattern is diffuse and the quantitative contribution balance points away from tumour, the model’s negative call becomes more interpretable.
That is useful for decision support. In production, negative cases can be dangerous because they may invite complacency. A layered explanation can help separate a confident negative from a fragile one. For example:
| Negative-case signal | Possible interpretation | Operational response |
|---|---|---|
| Diffuse GRAD-CAM, scattered LRP, SHAP mostly negative | Model sees no concentrated tumour-like evidence | Lower review priority, assuming image quality is adequate |
| Focused GRAD-CAM but SHAP mostly negative | Model attends to a region but contribution balance rejects tumour | Human review may check whether the region reflects artefact or benign structure |
| Scattered LRP with weak SHAP margin | Model reasoning may be unstable | Request additional slice/context review |
| Poor image quality plus negative classification | Input may be unreliable | Do not treat negative output as reassuring without quality control |
The business implication is subtle. XAI should not merely explain positive alerts. It should help triage negative decisions. A missed tumour is clinically and operationally worse than an extra review in many settings. If XAI can identify negative cases where the model’s reasoning is weak, then it becomes part of a risk-routing system rather than a retrospective explanation toy.
Partial tumours are where the stack earns its keep
The most interesting part of the paper is its treatment of partial tumour cases. These are exactly the cases where a 2D slice-level classifier can struggle. The tumour may be barely visible. The segmentation mask confirms tumour tissue, but the original slice may not make that obvious to a human reader at a glance. A single explanation method can become ambiguous.
In the first partial tumour example, GRAD-CAM focuses on the relevant region but also highlights other areas. That broader focus suggests uncertainty. LRP gives a more concentrated relevance pattern around the tumour region. SHAP reports 67.48% of values supporting tumour classification and 32.52% contributing against it. The combined interpretation is more useful than any component alone: GRAD-CAM says the model is looking broadly but includes the relevant region; LRP says the local pixel relevance still concentrates near tumour tissue; SHAP says the contribution balance still favours tumour.
In the second partial tumour example, the story is different and more valuable. The model predicts tumour with probability 0.507, essentially a coin toss dressed in neural-network clothing. The XAI methods reflect this uncertainty. GRAD-CAM activates multiple regions. LRP is scattered rather than concentrated. SHAP reports 85.97% of values contributing to non-tumour diagnosis and only 14.03% supporting tumour diagnosis. The segmentation mask confirms tumour tissue, but the explanation stack shows that the model’s internal evidence is weak.
This is the paper’s best practical insight. In borderline cases, disagreement is not a failure of the explanation system. It is the explanation.
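That kind of disagreement can be surfaced mechanically. The sketch below flags a slice when the probability sits near the decision threshold or when the SHAP majority points away from the predicted class. The 0.5 threshold and the 0.05 margin are assumptions, as is the function name.

```python
def borderline_flags(prob_tumour, shap_tumour_pct, threshold=0.5, margin=0.05):
    """Flag near-threshold predictions and prediction/explanation disagreement.

    prob_tumour:     model probability for the tumour class
    shap_tumour_pct: percentage of SHAP contribution supporting tumour
    threshold, margin: hypothetical decision threshold and "too close" band
    """
    predicted_tumour = prob_tumour >= threshold
    shap_favours_tumour = shap_tumour_pct > 50.0
    return {
        "predicted_tumour": predicted_tumour,
        "near_threshold": abs(prob_tumour - threshold) < margin,
        "shap_disagrees": predicted_tumour != shap_favours_tumour,
    }

# The second partial tumour example: probability 0.507, 14.03% SHAP support for tumour.
print(borderline_flags(0.507, 14.03))
# prints: {'predicted_tumour': True, 'near_threshold': True, 'shap_disagrees': True}
```

Both warning flags fire on that case, which is the behaviour a review queue would want: the label says tumour, but nothing about the evidence says it confidently.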
A weak article would say: “The combined framework improves trust.” A better reading says: “The combined framework makes uncertainty harder to hide.” That is the point.
| Case type | GRAD-CAM | LRP | SHAP | Practical reading |
|---|---|---|---|---|
| Clear tumour | Focused on tumour region | Relevance concentrated near tumour | Majority supports tumour | Explanations converge; model reasoning appears plausible |
| Non-tumour | Broad or diffuse attention | Scattered relevance | Majority supports non-tumour | Negative decision has interpretable support |
| Partial tumour, detectable | Broad but includes tumour region | More focused around tumour | Majority supports tumour | Layering clarifies weak but usable evidence |
| Partial tumour, uncertain | Multiple activations | Scattered relevance | Majority supports non-tumour despite tumour label | Escalate for human review and adjacent-slice context |
This is where the paper’s business relevance becomes concrete. The framework should not be sold as “AI that doctors can trust.” That phrase has done enough damage already. It should be understood as a quality-assurance layer for model reasoning. It helps answer whether a prediction deserves routine acceptance, closer review, or outright suspicion.
What the paper directly shows, and what Cognaptus infers
The paper directly shows three things.
First, a custom CNN trained on FLAIR-derived 2D slices from BraTS 2021 can improve over the authors’ baseline configuration after architectural and hyperparameter changes. The reported improvement is meaningful within the study’s own setup: accuracy rises to 91.24%, recall improves, and false negatives fall.
Second, applying GRAD-CAM, LRP, and SHAP side-by-side provides complementary views of the model’s predictions. The methods often converge in clear positive and negative cases, and they provide different signals in partial or ambiguous cases.
Third, the combined XAI panel can expose uncertainty in borderline slices. The second partial tumour case is especially important because the model’s probability is only 0.507 and the explanation methods do not produce a clean, reassuring story. That is exactly what a serious decision-support tool should reveal.
Cognaptus infers a business pathway from these findings, but it should be stated carefully.
The pathway is not immediate clinical deployment. The pathway is operational assurance for imaging AI. Hospitals, AI imaging vendors, and governance teams can use layered explanations to design review queues, identify weak predictions, inspect failure modes, and define escalation rules. The system is more useful as a second-reader support layer than as an autonomous diagnostic actor.
A practical workflow could look like this:
| Model + XAI output | Suggested workflow action |
|---|---|
| High model confidence + convergent XAI around visible lesion | Standard clinician review with model-supported annotation |
| High model confidence + XAI focus outside plausible anatomy | Flag for model-quality review |
| Low model confidence + divergent XAI | Escalate to senior review or require adjacent-slice inspection |
| Negative prediction + poor image quality | Treat as unreliable negative; request quality-aware review |
| Positive prediction + SHAP weakly supportive | Check for artefact, anomaly, or non-tumorous mimic |
This is not glamorous, but it is where the money is. The ROI of explainability in healthcare is not “clinicians feel emotionally comforted by heatmaps.” The ROI is fewer hidden failure modes, better audit trails, clearer review prioritisation, and less blind dependence on a probability score that may be precise without being meaningful.
The evidence is promising, but not deployment evidence
The paper is candid about limitations, and the practical interpretation depends on taking them seriously.
The first limitation is the use of FLAIR only. FLAIR is useful for highlighting lesions and tumour-related abnormalities, but radiologists normally interpret multiple MRI sequences, including T1, T2, and contrast-enhanced views. A single-modality classifier cannot replicate that clinical information environment.
The second limitation is 2D slice analysis. The paper converts 3D MRI volumes into 2D slices and classifies those slices. That makes the modelling task tractable, but it loses contiguous spatial context. The authors themselves note that examining adjacent MRI slices could help in subtle cases. This is not a minor point. A tumour that is partial in one slice may be clearer in neighbouring slices. A clinical workflow is volumetric; a 2D classifier is a simplified proxy.
The third limitation is cognitive load. Combining GRAD-CAM, LRP, and SHAP produces more information, but more information is not automatically better. A radiologist or clinical reviewer would need a clean interface, interpretation rules, and escalation criteria. Otherwise, layered XAI becomes another demand on expert attention. Healthcare workflows are not short of things blinking for attention.
The fourth limitation is validation. The paper demonstrates the framework on BraTS 2021 and qualitative case examples. That is useful research evidence. It is not proof of clinical generalisation, safety, or workflow benefit. A deployment-oriented study would need external validation, prospective testing, reader studies, calibration analysis, quality stratification, and evidence that the explanation layer actually improves decisions rather than merely making users feel more confident.
The fifth limitation is that XAI methods explain model behaviour, not ground truth. If a model has learned a spurious pattern, XAI may reveal that pattern. It does not automatically correct it. Seeing the wrong reason clearly is better than not seeing it, but it is still the wrong reason. This is where some AI presentations quietly confuse transparency with trust. Transparency is the beginning of inspection. It is not the end of validation.
The real product is not the classifier; it is the review protocol
For business leaders, the natural instinct is to ask whether the model is good enough. That is the wrong first question. A better question is: what workflow can absorb this model safely?
The paper suggests that the useful product is not a standalone brain tumour detector. It is a review protocol where model output and explanation agreement jointly determine the next step.
A hospital or imaging vendor could define several review bands:
| Review band | Model state | XAI state | Action |
|---|---|---|---|
| Routine support | Confident prediction | GRAD-CAM, LRP, SHAP broadly aligned | Present annotation as decision support |
| Cautious support | Moderate confidence | Partial agreement among XAI methods | Require explicit human confirmation |
| Escalation | Low confidence or near-threshold score | Divergent or scattered explanations | Route to more experienced reviewer |
| Quality rejection | Any confidence level | Image quality appears poor or attention unreliable | Flag input-quality issue before diagnosis |
| Model audit | Repeated failures in similar cases | Systematic mismatch between explanation and anatomy | Feed into monitoring and retraining review |
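The per-case bands in this table can be sketched as a small routing function. The confidence cut-offs and the idea of counting how many of the three explanation checks pass are hypothetical design choices, not something the paper specifies. The model-audit band is left out because it depends on patterns across many cases rather than on any single one.

```python
from enum import Enum

class Band(Enum):
    ROUTINE = "routine support"
    CAUTIOUS = "cautious support"
    ESCALATION = "escalation"
    QUALITY_REJECTION = "quality rejection"

def review_band(confidence, checks_passed, image_quality_ok,
                high_conf=0.9, low_conf=0.6):
    """Map one case to a review band.

    confidence:       max(p, 1 - p) for the predicted class
    checks_passed:    how many of the three XAI checks agree (0 to 3)
    image_quality_ok: result of an upstream image-quality gate (assumed to exist)
    high_conf, low_conf: illustrative cut-offs only
    """
    if not image_quality_ok:
        return Band.QUALITY_REJECTION       # input problem comes before diagnosis
    if confidence < low_conf or checks_passed <= 1:
        return Band.ESCALATION              # weak score or divergent explanations
    if confidence >= high_conf and checks_passed == 3:
        return Band.ROUTINE                 # confident and fully aligned
    return Band.CAUTIOUS                    # everything in between
```

Checking image quality first is deliberate: a poor input makes both the probability and the explanations unreliable, so no amount of agreement downstream should rescue it.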
This is a more mature interpretation of explainable medical AI. The goal is not to make every output look explainable. The goal is to decide when explanation agreement is strong enough to support routine review and when explanation disagreement should trigger caution.
That principle extends beyond brain tumour detection. The same mechanism can apply to insurance claims triage, credit risk models, fraud detection, industrial defect inspection, and compliance review. In every case, the useful question is not only “what did the model predict?” It is also “what kind of evidence did the model use, and do independent explanation lenses tell a coherent story?”
Explainability is infrastructure, not decoration
The paper’s contribution is not that GRAD-CAM, LRP, and SHAP exist. We knew that. Nor is the main contribution simply that a CNN can achieve 91.24% accuracy on a controlled brain MRI classification setup. Useful, but not shocking.
The contribution is architectural: explanation methods can be layered into an assurance mechanism. GRAD-CAM gives the rough geography. LRP tests the local relevance. SHAP gives a contribution balance. When they agree, confidence becomes more interpretable. When they disagree, uncertainty becomes visible.
That is the standard medical AI should be moving toward. Not “trust the model because it has a heatmap.” Not “trust the heatmap because it is colourful.” Trust, where justified, should come from structured inspection: model performance, input quality, explanation consistency, human review, and workflow monitoring.
The dry conclusion is also the useful one. One heatmap is often not enough. A probability score is not enough either. In high-stakes AI, the business value of explainability comes from turning prediction into reviewable reasoning. This paper gives a compact prototype of how that can work.
And yes, it still needs stronger validation before anyone treats it as clinical infrastructure. That is not a weakness of the article’s argument. That is the argument.
Cognaptus: Automate the Present, Incubate the Future.
[1] Patrick McGonagle, William Farrelly, and Kevin Curran, “Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models,” arXiv:2602.05240, 2026, https://arxiv.org/pdf/2602.05240.