The explanation worked in the notebook; then production happened

A familiar enterprise AI story begins with a reassuring demo.

A model produces a questionable prediction. Someone opens a notebook, runs SHAP, LIME, a saliency map, a concept attribution method, or whatever interpretability tool is currently fashionable enough to appear in slide decks. The plot looks plausible. The team nods. Compliance is told that explainability has been “implemented.”

Then the real system arrives.

A domain expert wants to compare the explanation across two model versions. An auditor asks which training data snapshot produced a particular explanation. A product manager wants different explanation views for internal reviewers and external users. A user expects the interface to respond instantly, not after a heroic backend computation quietly melts a GPU. Governance wants access control. Engineers want stable APIs. Nobody wants another fragile dashboard glued to a notebook by hope and an intern.

This is the gap addressed by “X-SYS: A Reference Architecture for Interactive Explanation Systems” [1]. The paper’s central move is simple but useful: explainability is not just a method-selection problem. It is a systems architecture problem.

That is not a decorative distinction. A method answers, “How can we generate an explanation?” A system answers, “How can explanations remain usable, reproducible, responsive, and governed when users interact with them repeatedly across changing models, data, roles, and workflows?”

The second question is less glamorous. Naturally, it is the one that matters in production.

XAI fails operationally when interaction demand outruns backend supply

The paper starts from a deployment gap that many AI teams already recognize but often misdiagnose. XAI research has produced a large technical toolbox: local attributions, counterfactuals, concept-based explanations, surrogate models, example-based rationales, activation steering, and more. These methods matter. But the paper argues that isolated methods do not automatically become usable explanation systems.

The difference appears once explanations become interactive.

A static explanation artifact can be produced once, stored in a report, and shown to a reviewer. An interactive explanation system must let users ask follow-up questions, change scope, compare cases, revisit prior sessions, test counterfactuals, switch explanation types, or inspect how model behavior changes under intervention. Each user-facing action quietly creates backend obligations.

The authors describe this relationship as a movement from interaction demand to interaction supply. The interface demands an action; the system must supply the capability that makes the action reliable.

| User-facing interaction | Backend capability required | What breaks without it |
| --- | --- | --- |
| Compare explanations across model versions | Versioned model access, stable identifiers, linked metadata | The comparison becomes anecdotal rather than reproducible |
| Explore “what-if” changes | Fast recomputation or approximation, caching, constraint checking | The workflow stalls or produces inconsistent results |
| Switch explanation methods | Pluggable explanation services and shared contracts | The UI becomes hard-coded to one method |
| Return to recent interactions | Session persistence, interaction logs, access-controlled replay | Users cannot reconstruct their analysis path |
| Serve different stakeholders | Role-aware XUIs, authorization, curated explanation access | Developers, auditors, and end users receive either too much or too little |

This is the paper’s strongest mechanism. The authors are not merely saying, “Architecture is important,” which is the kind of sentence that can be safely ignored in almost any meeting. They are showing how each interaction primitive imposes a system requirement.

A user who clicks “compare” is not asking for a prettier chart. They are asking the backend to reconstruct model version, data version, explanation parameters, user role, and possibly prior interaction state. A user who asks a counterfactual question is not asking for philosophical transparency. They are asking for low-latency recomputation or a defensible approximation path. A user who returns to a prior explanation is asking the system to preserve state, not vibes.

That is why the common “just add SHAP” version of XAI is too thin. It can produce explanation artifacts. It cannot, by itself, provide persistent state, governance, version reconstruction, latency management, or multi-stakeholder workflows.
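The interaction table in this section can be read as data rather than prose. The minimal Python sketch below encodes each interaction's backend requirements and reports what a given system is missing. The capability labels and the `missing_capabilities` helper are illustrative names chosen here, not identifiers from the paper.

```python
from __future__ import annotations

# Hypothetical capability labels paraphrasing the table in this section;
# the paper does not define these identifiers.
REQUIRED_CAPABILITIES: dict[str, set[str]] = {
    "compare_versions": {"versioned_model_access", "stable_identifiers", "linked_metadata"},
    "what_if": {"fast_recomputation", "caching", "constraint_checking"},
    "switch_method": {"pluggable_explanation_services", "shared_contracts"},
    "revisit_session": {"session_persistence", "interaction_logs", "access_controlled_replay"},
    "serve_stakeholders": {"role_aware_xui", "authorization", "curated_explanation_access"},
}


def missing_capabilities(interaction: str, supplied: set[str]) -> set[str]:
    """Return the backend capabilities an interaction needs but the system lacks."""
    return REQUIRED_CAPABILITIES[interaction] - supplied


# A notebook-plus-dashboard prototype that offers nothing beyond a result cache.
prototype = {"caching"}
gaps = {name: missing_capabilities(name, prototype) for name in REQUIRED_CAPABILITIES}
```

Run against the “just add SHAP” prototype, every interaction in `gaps` comes back with a non-empty set, which is the deployment gap stated in code form.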

STAR turns explainability from a slogan into system constraints

X-SYS organizes its reference architecture around four quality attributes: scalability, traceability, adaptability, and responsiveness, abbreviated as STAR.

The naming is tidy. The more important point is that STAR translates vague explainability ambition into engineering constraints.

| STAR attribute | What it protects | Architectural consequence | Business interpretation |
| --- | --- | --- | --- |
| Scalability | The ability to serve different workloads, users, and stakeholder groups | Components must be separable and independently scalable where needed | XAI can move from one expert debugging session to repeated audit and review workflows |
| Traceability | The ability to reconstruct which model, data, configuration, user action, and explanation artifact produced an outcome | Data, model, and explanation artifacts need persistent identifiers, logs, and version metadata | Explainability claims can survive compliance review rather than collapse into “we think this was the old model” |
| Adaptability | The ability to add or change explanation methods, models, interfaces, and workflows | Services need stable contracts and replaceable modules | XAI infrastructure does not need to be rebuilt whenever the method or stakeholder changes |
| Responsiveness | The ability to support interactive use without breaking cognitive flow | Expensive computation must be separated from lightweight online interaction through caching, precomputation, and asynchronous processing | Users can actually explore explanations instead of waiting long enough to forget the question |

The paper does not present STAR as a benchmarked measurement framework. It is not saying, “Here are latency thresholds, throughput targets, and reproducibility scores for production XAI.” That would be a different paper, and probably a longer one that fewer executives would pretend to read.

Instead, STAR functions as an architectural filter. If an explanation feature damages one of these attributes, it is not operationally mature. If a system cannot trace explanation artifacts to model and data versions, it is not audit-ready. If a system cannot support repeated interaction without excessive latency, it is not truly interactive. If new methods require rewriting the entire interface and backend, adaptability is theater.

This is where the paper’s contribution becomes useful for AI governance. Many governance conversations treat explainability as an output: a report, a plot, a textual justification, a model card. X-SYS treats explainability as a running system shaped by lifecycle constraints.

That is a better mental model for organizations that need explainability not once, but repeatedly.

Five components separate what users do from what systems must supply

X-SYS decomposes interactive explanation systems into five components:

  1. XUI Services
  2. Explanation Services
  3. Model Services
  4. Data Services
  5. Orchestration and Governance

The names are unsurprising. The value is in the boundary discipline.

XUI Services manage interaction, not explanation logic

XUI Services are the human-facing layer: the screens, interaction flows, visualizations, session context, and stakeholder-specific views. They translate user actions into service requests.

The important constraint is negative: XUI Services should not become a messy container for backend explanation logic. If the interface itself computes explanations, stores model metadata, manages role permissions, and tracks version history, the system has already started accumulating explainability debt.

A mature XUI should ask for capabilities through contracts. It should not personally know every implementation detail behind the explanation method. Otherwise, every change in model architecture or explanation method becomes a UI refactoring project. That is not agility. That is a spreadsheet with React components.
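As a rough illustration of “asking for capabilities through contracts,” the sketch below uses Python's `typing.Protocol` to define what the interface layer is allowed to know about an explanation method. Everything here is an assumption for illustration: the `ExplanationService` contract, the two toy services, the registry, and the placeholder outputs are not taken from the paper or from SemanticLens.

```python
from __future__ import annotations

from typing import Any, Protocol


class ExplanationService(Protocol):
    """Contract the interface layer depends on (hypothetical, not from the paper)."""

    name: str

    def explain(self, sample_id: str, model_version: str) -> dict[str, Any]:
        ...


class FeatureImportanceService:
    name = "feature_importance"

    def explain(self, sample_id: str, model_version: str) -> dict[str, Any]:
        # Placeholder scores; a real service would call an attribution method.
        return {"method": self.name, "sample_id": sample_id,
                "model_version": model_version, "scores": {"income": 0.7, "age": 0.3}}


class InfluentialExamplesService:
    name = "influential_examples"

    def explain(self, sample_id: str, model_version: str) -> dict[str, Any]:
        # Placeholder identifiers; a real service would query a training index.
        return {"method": self.name, "sample_id": sample_id,
                "model_version": model_version, "examples": ["train-0042", "train-0913"]}


class ExplanationRegistry:
    """The only object the XUI talks to; methods can be added without UI changes."""

    def __init__(self) -> None:
        self._services: dict[str, ExplanationService] = {}

    def register(self, service: ExplanationService) -> None:
        self._services[service.name] = service

    def available(self) -> list[str]:
        return sorted(self._services)

    def request(self, method: str, sample_id: str, model_version: str) -> dict[str, Any]:
        return self._services[method].explain(sample_id, model_version)
```

The design choice worth noticing is that the interface only ever sees a method name and a response dictionary. Adding a third explanation method means registering one more service, not touching the rendering code.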

Explanation Services compute and compose explanation artifacts

Explanation Services encapsulate the actual XAI methods: feature importance, concept probing, influential example retrieval, surrogate models, activation steering, prompt-based interventions, and other explanation or intervention mechanisms.

The key architectural point is that explanation computation may be synchronous or asynchronous. Lightweight interactive queries can run online. Expensive analyses should be precomputed, cached, or provisioned asynchronously.

This distinction matters because XAI methods can differ dramatically in computational demand. If the system treats every explanation request as an online computation, responsiveness becomes hostage to the slowest method. The result is a beautiful tool nobody uses twice.
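The synchronous versus asynchronous split can be sketched as a small dispatcher, assuming the system knows in advance which methods are cheap enough to run per request. The `Dispatcher` class, its field names, and the in-memory queue are hypothetical stand-ins; a production system would use a real job queue and worker pool.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Dispatcher:
    """Route explanation requests: cache hit, inline compute, or deferred job."""

    online_methods: set[str]  # methods assumed cheap enough to run per request
    cache: dict[tuple, dict] = field(default_factory=dict)
    pending: dict[str, tuple] = field(default_factory=dict)

    def submit(self, method: str, key: str, compute: Callable[[], dict]) -> dict:
        cache_key = (method, key)
        if cache_key in self.cache:
            return {"status": "ready", "source": "cache", "result": self.cache[cache_key]}
        if method in self.online_methods:
            result = compute()
            self.cache[cache_key] = result
            return {"status": "ready", "source": "online", "result": result}
        # Expensive method: hand back a ticket instead of blocking the user.
        job_id = uuid.uuid4().hex
        self.pending[job_id] = (cache_key, compute)
        return {"status": "pending", "job_id": job_id}

    def run_pending(self) -> int:
        """Stand-in for a background worker: compute and cache all queued jobs."""
        done = 0
        for job_id in list(self.pending):
            cache_key, compute = self.pending.pop(job_id)
            self.cache[cache_key] = compute()
            done += 1
        return done
```

With this shape, the slowest method no longer sets the response time for every request: the interface gets either a result or a ticket immediately, and the expensive artifact shows up in the cache once the worker has run.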

Model Services provide versioned model access

Model Services supply predictions, internal representations, activations, embeddings, attention weights, and lifecycle metadata. They also expose model versions, training provenance, performance indicators, and potentially drift information.

This is where XAI connects to MLOps. Explanation Services often need access not only to predictions but to internal model states. That access must be stable and versioned. A saliency map without model version context is a screenshot pretending to be evidence.

For business use, this component is essential. When a regulated AI decision is challenged, the question is rarely “Can you show a heatmap?” The harder question is “Can you reconstruct the exact system state that produced this explanation at that time?” Model Services make that question answerable.

Data Services preserve the memory of explanation work

Data Services manage input data, reference datasets, cached explanation artifacts, interaction histories, metadata, version registries, and provenance records. They are not passive storage. They are the institutional memory of the explanation system.

This is where traceability becomes real. The system must know which reference dataset supported a concept explanation, which cached artifact was shown, which user viewed or modified it, which version of the model was used, and how the explanation was parameterized.
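One possible shape for such a provenance record is sketched below. The field names and the hashing scheme are assumptions made for illustration; the paper describes what must be recoverable, not a schema.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExplanationRecord:
    """Hypothetical provenance record for one explanation shown to one user."""

    model_version: str
    reference_dataset: str
    method: str
    parameters: dict
    artifact_uri: str  # where the cached artifact lives
    viewed_by: str     # who saw it

    def record_id(self) -> str:
        """Stable identifier derived from what is needed to reproduce the artifact.

        Viewer and storage location are deliberately excluded: they describe
        how the artifact was used, not what it is.
        """
        payload = {
            "model_version": self.model_version,
            "reference_dataset": self.reference_dataset,
            "method": self.method,
            "parameters": self.parameters,
        }
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Two reviewers looking at the same explanation then share a `record_id`, while a model upgrade or a changed parameter produces a different one, which is the property an audit trail needs.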

Without Data Services, an explanation system becomes an amnesiac consultant: confident, articulate, and unable to prove what happened yesterday.

Orchestration and Governance coordinate the whole system

Orchestration and Governance route requests, select synchronous or asynchronous execution paths, enforce authentication and authorization, manage caching, apply rate limits, maintain logs, and coordinate service interactions.

This component is cross-cutting because governance is not a final layer sprinkled on top after launch. It determines which users can access which model internals, which explanation artifacts can be stored, which requests should be logged, and how interactions are replayed or audited.
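A minimal sketch of governance as a routing concern rather than an afterthought might look like the following. The roles, permissions, and `governed_call` wrapper are hypothetical; a real deployment would pull policy from an identity provider or policy engine and write to durable storage rather than a Python list.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Callable

# Hypothetical role-to-permission mapping, hard-coded only for illustration.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "developer": {"view_explanation", "view_model_internals", "steer_activations"},
    "auditor": {"view_explanation", "replay_session"},
    "end_user": {"view_explanation"},
}

AUDIT_LOG: list[dict[str, Any]] = []


def governed_call(user: str, role: str, action: str,
                  handler: Callable[..., Any], **request: Any) -> Any:
    """Authorize, log, then forward a request to the service that handles it."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    # Denied attempts are logged too; an audit trail that only records
    # successes cannot answer "who tried to see what".
    AUDIT_LOG.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "request": request,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not perform {action!r}")
    return handler(**request)
```

Because every request passes through one function, access rules and logging apply uniformly to each explanation service instead of being reimplemented, slightly differently, in every endpoint.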

In many enterprise AI projects, governance appears late, after the prototype has already made architectural choices that governance cannot easily repair. X-SYS pushes governance into the reference architecture. A rare case where “shift left” is not just a phrase printed on a transformation slide.

The paper’s evidence is architectural demonstration, not statistical validation

The paper’s main evidence is not an experimental benchmark. It does not run a large user study showing that X-SYS improves trust calibration. It does not compare alternative architectures under load. It does not report latency distributions, throughput curves, or audit reconstruction success rates.

Instead, its evidence rests on four elements:

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Prior-work comparison of XAI systems, XUI work, process models, and ontologies | Comparison with prior work | Existing work covers parts of the problem but does not offer a general reference architecture for interactive explanation systems | That X-SYS is superior to all alternatives in production |
| STAR and five-component reference architecture | Main conceptual contribution | Interactive XAI can be decomposed into quality attributes, service responsibilities, and interface contracts | That the component split is complete or optimal for every domain |
| SemanticLens implementation | Implementation detail and proof of concept | X-SYS can be instantiated in a working system for semantic search and activation steering in vision-language models | That X-SYS generalizes across all modalities, sectors, and governance regimes |
| Semantic search DTO sequence | Implementation detail illustrating the mechanism | Stable contracts can bind user requests to model context and response metadata | That DTOs alone solve governance, security, or reliability |

This distinction matters for interpretation. X-SYS is best read as a reference architecture and design argument, not as a performance paper. The authors’ contribution is to make the system-building problem explicit and to show one concrete implementation path.

For business readers, that means the paper is useful as a blueprint, checklist, and vocabulary. It is not yet a procurement-ready standard, a compliance certification framework, or a quantified ROI model.

That boundary should not weaken the paper’s value. Architecture papers often matter precisely because they help organizations stop solving the wrong problem with more enthusiasm.

SemanticLens shows the mechanism in a concrete vision-language system

The paper’s implementation case is SemanticLens, an interactive explanation system for concept-based interpretability in vision and vision-language models.

SemanticLens provides two main perspectives. The Concept Map supports global exploration of model components. It lets users search learned representations with natural language queries, inspect clusters of semantically related components, and identify concepts or spurious correlations. The Model Interaction perspective supports local analysis of specific predictions, using component attribution and activation steering to test causal hypotheses.

The case is useful because it links the abstract architecture to actual interaction patterns.

In the Concept Map, a user may search for a concept such as “pasta” and retrieve components whose embeddings align with that query. The paper gives an example in ResNet50 where the query returns a component associated with “carbonara,” along with adjacent food-related components. This is not mainly a quantitative result. It is an illustration of global semantic exploration: the user asks a natural-language question about what the model has learned, and the system supplies ranked component alignments and metadata.

In Model Interaction, the user can inspect a specific sample, review components ranked by relevance, and modify activations. The paper gives a medical-image example involving a melanoma case where textual markings influenced the model. Suppressing a spurious component associated with text artifacts changes the prediction back toward the correct diagnosis. Again, the point is not that activation steering is now a universal medical safety solution. Please do not put that in a hospital procurement memo. The point is that interactive explanation can support a workflow: detect a suspicious component, test its influence, and observe prediction changes.

The more important architectural lesson comes from how SemanticLens is built.

The authors separate expensive explanation provisioning from lightweight interaction. At startup or offline, the system prebuilds XUI perspectives and generates interpretable components, visualizations, JSON assets, and images. Online, the user interacts through services such as semantic search and model inspection. FastAPI services, static files, explanation provisioning, and DTO contracts map to the X-SYS components.

The mechanism looks like this:

User interaction
→ XUI request captured as structured DTO
→ Governed service routing
→ Model / explanation / data services respond
→ Response DTO returns ranked artifacts and metadata
→ XUI renders consistent interactive explanation

This is the paper’s architecture in miniature. The user sees a search box or interactive model view. The system sees a contract-bound request tied to model context, foundation model configuration, component identifiers, alignment scores, and normalization metadata.

That is the difference between “interactive explanation” as a UI feature and “interactive explanation” as a system capability.

DTOs are boring, which is exactly why they matter

One of the most practical details in the paper is the semantic search protocol. The request object includes the query, network identifier, and foundation model specification. The response returns ranked component alignments plus metadata such as minimum and maximum alignment values for consistent visualization.

This may sound mundane. Good. Mundane is where production systems either survive or become expensive folklore.

A stable Data Transfer Object does several things at once. It tells the XUI what it can ask for. It tells backend services what they must supply. It binds the request to model context. It carries enough metadata for consistent rendering. It allows backend services to evolve without breaking the frontend contract. It gives governance and logging systems a structured object to inspect.
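To make the semantic search contract tangible, here is a hedged sketch using standard-library dataclasses. The article only states which information the request and response carry; the class names, field names, the `build_response` helper, and the decision to echo the request inside the response are all assumptions, and the actual SemanticLens DTOs may look different.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticSearchRequest:
    query: str             # natural-language concept, e.g. "pasta"
    network_id: str        # which inspected model the search runs against
    foundation_model: str  # which foundation model embeds the query


@dataclass(frozen=True)
class ComponentAlignment:
    component_id: str
    alignment: float


@dataclass(frozen=True)
class SemanticSearchResponse:
    request: SemanticSearchRequest              # echoed so the result stays bound to its context
    alignments: tuple[ComponentAlignment, ...]  # ranked, best first
    min_alignment: float                        # lets the XUI use one consistent color scale
    max_alignment: float


def build_response(request: SemanticSearchRequest, scores: dict[str, float],
                   top_k: int = 5) -> SemanticSearchResponse:
    """Rank component scores and attach the normalization metadata."""
    if not scores:
        raise ValueError("scores must contain at least one component")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return SemanticSearchResponse(
        request=request,
        alignments=tuple(ComponentAlignment(cid, s) for cid, s in ranked[:top_k]),
        min_alignment=min(scores.values()),
        max_alignment=max(scores.values()),
    )
```

Note that the minimum and maximum are computed over all scores, not just the returned top results, so two searches rendered side by side can share a scale. That is one reasonable reading of “consistent visualization,” not a confirmed detail of the paper.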

This is why the paper’s “systems” framing is more than architectural aesthetics. In production, explainability failures often arise from missing contracts:

  • the UI expects an explanation that the backend cannot compute quickly;
  • the backend returns an artifact without enough metadata to reproduce it;
  • the model changes but the explanation cache is not invalidated;
  • the audit log records that an explanation was shown but not which data and model state produced it;
  • a stakeholder sees internal model details they should not access, or receives a simplified explanation that cannot support their task.

DTOs do not magically solve all of this. They are not pixie dust, despite what some API design meetings imply. But explicit contracts are the substrate on which traceability, adaptability, responsiveness, and scalability can be engineered.
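The stale-cache failure in the list above has a fairly cheap structural remedy: make the model and data versions part of the cache key, so a model change can never return an old artifact by accident. The sketch below assumes flat parameter dictionaries with hashable values, and the `ExplanationCache` class is a hypothetical in-memory stand-in for whatever store a real system uses.

```python
from __future__ import annotations


class ExplanationCache:
    """In-memory cache whose keys carry the full explanation context."""

    def __init__(self) -> None:
        self._store: dict[tuple, dict] = {}

    @staticmethod
    def _key(model_version: str, data_version: str, method: str, parameters: dict) -> tuple:
        # Sorting makes the key independent of parameter order.
        # Assumes flat parameters with hashable values.
        return (model_version, data_version, method, tuple(sorted(parameters.items())))

    def put(self, model_version: str, data_version: str, method: str,
            parameters: dict, artifact: dict) -> None:
        self._store[self._key(model_version, data_version, method, parameters)] = artifact

    def get(self, model_version: str, data_version: str, method: str,
            parameters: dict) -> dict | None:
        return self._store.get(self._key(model_version, data_version, method, parameters))

    def purge_model(self, model_version: str) -> int:
        """Drop every artifact computed against a retired model version."""
        stale = [key for key in self._store if key[0] == model_version]
        for key in stale:
            del self._store[key]
        return len(stale)
```

A lookup against a new model version simply misses, which forces recomputation instead of silently serving yesterday's explanation for today's model.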

What AI product teams should take from X-SYS

The business relevance of X-SYS is not that every organization should copy SemanticLens. Most will not be explaining vision-language model components through semantic search and activation steering. The transferable lesson is architectural.

For product teams building AI systems in regulated, high-stakes, or multi-stakeholder environments, X-SYS suggests a practical diagnostic question:

For every explanation feature in the UI, what backend capability makes it reproducible, responsive, governed, and evolvable?

That question can be turned into an implementation checklist.

| Product decision | X-SYS translation | Business consequence |
| --- | --- | --- |
| “Let users compare explanations across time.” | Maintain model, data, and explanation artifact versions with stable identifiers. | Reduces audit reconstruction cost and prevents unverifiable comparisons. |
| “Let auditors replay prior explanation sessions.” | Store interaction logs, session state, role metadata, and response artifacts. | Makes compliance review operational rather than performative. |
| “Let different roles see different explanation depths.” | Implement role-aware XUI services and governance-controlled access. | Avoids both oversharing sensitive internals and underserving expert users. |
| “Let users explore counterfactual or what-if scenarios.” | Separate online interaction from heavy computation; cache or approximate where justified. | Preserves workflow continuity and reduces compute waste. |
| “Let the team add new explanation methods later.” | Encapsulate methods in pluggable Explanation Services behind stable contracts. | Lowers refactoring cost and avoids locking the product into one interpretability technique. |

This is where Cognaptus would extend the paper into business interpretation.

The direct paper contribution is a reference architecture and one implementation case. The business inference is that operational XAI should be budgeted and designed as infrastructure, not as a one-off analytics feature. The uncertainty is that the paper does not quantify the cost savings, user benefits, latency improvements, or compliance outcomes of adopting X-SYS. Those would need separate evaluation.

Still, the direction is clear. If explainability must support repeated decisions, changing models, role-based access, and audit trails, then the system must be designed for those properties from the beginning.

Retrofitting them later is possible in the same way renovating a plane during flight is possible: technically imaginable, operationally unpleasant.

The paper’s boundaries are important, not fatal

X-SYS is strongest as a conceptual and architectural blueprint. Its boundaries are also clear.

First, the architecture is demonstrated through one implementation, SemanticLens. That implementation is relevant and concrete, but it does not prove generality across finance, healthcare operations, industrial monitoring, recommender systems, LLM agents, or credit decisioning workflows. The authors explicitly identify broader validation across domains and stakeholder contexts as future work.

Second, the reference architecture is high-level by design. It identifies components, responsibilities, quality attributes, and interface logic. It does not provide detailed engineering guidance for security, privacy, reliability engineering, schema evolution, data retention, incident response, or production observability. Those concerns are acknowledged, but not fully specified.

Third, STAR is not yet operationalized as a benchmark suite. The paper calls for future benchmarks combining XAI evaluation with system benchmarking: workload models, latency targets, throughput targets, reproducibility checks, and traceability audits. Until then, STAR is a useful design lens rather than a measured maturity model.
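No such benchmark exists in the paper, so the following is only a sketch of what the smallest possible reproducibility-and-latency check could look like: replay a logged request several times, compare each response with what was originally logged, and record timings. The `replay_check` function and its report fields are hypothetical, and no latency threshold is suggested because the paper does not supply one.

```python
from __future__ import annotations

import statistics
import time
from typing import Any, Callable


def replay_check(handler: Callable[..., Any], logged_request: dict,
                 logged_response: Any, runs: int = 5) -> dict:
    """Replay a logged request and report reproducibility plus observed latency."""
    latencies = []
    reproducible = True
    for _ in range(runs):
        start = time.perf_counter()
        response = handler(**logged_request)
        latencies.append(time.perf_counter() - start)
        # Every replay must match the logged response, not just the first one.
        reproducible = reproducible and (response == logged_response)
    return {
        "reproducible": reproducible,
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
    }
```

A real benchmark suite would need workload models and agreed targets on top of this. The sketch only shows that traceability and responsiveness can be checked by the same replay loop once requests and responses are logged as structured objects.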

Fourth, the paper does not empirically validate whether interactive explanation systems built this way improve user understanding, mental models, decision quality, or trust calibration. That is a major practical question. A system can be architecturally elegant and still confuse users magnificently. Enterprise software has demonstrated this with admirable consistency.

These limitations do not undermine the paper’s core argument. They define how to use it. X-SYS should not be treated as proof that one architecture solves XAI. It should be treated as a structured starting point for designing explanation systems whose operational requirements are visible rather than accidental.

The real shift is from explanation artifacts to explanation infrastructure

The paper’s most useful contribution is not a single quotable sentence but a structural idea: interactive explanation is constrained by what the underlying system can compute, retrieve, version, and expose reliably.

That idea deserves to replace a weaker assumption still common in AI deployment: that explainability is mostly about choosing the right method.

Methods remain necessary. Bad explanations do not become good because they are served through microservices. But the reverse is also true: good explanation methods do not become operational simply because someone rendered them in a dashboard.

X-SYS reframes the production problem:

Explanation method
→ Explanation service
→ Governed system capability
→ Stakeholder interaction
→ Traceable explanation workflow

This chain is the article’s mechanism-first takeaway. User interaction creates system demand. System demand creates architectural requirements. Architectural requirements force service boundaries, versioning, offline/online separation, governance, and stable contracts. Only then can explanation methods become part of a durable product.

That is the upgrade from saliency to systems.

For AI teams, the practical question is no longer “Do we have an explanation method?” It is:

Can the system preserve context, reproduce outputs, respond quickly, respect roles, evolve methods, and support repeated interaction without collapsing into custom glue?

If the answer is no, the organization does not yet have operational explainability. It has an explanation demo.

A demo may win the meeting. A system survives the audit.

Cognaptus: Automate the Present, Incubate the Future.


  1. Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, and Sebastian Lapuschkin, “X-SYS: A Reference Architecture for Interactive Explanation Systems,” arXiv:2602.12748v3, 13 April 2026, https://arxiv.org/abs/2602.12748