A taxi is a useful little trap.
It looks harmless: pick up passengers, drive them to destinations, do not run out of fuel. A small grid-world taxi environment is not exactly the sort of thing that makes executives whisper “agentic transformation” over terrible conference coffee.
But that is precisely why it works. Strip away the enterprise theatre, and sequential decision-making becomes easier to see. An agent observes a state, chooses an action, receives the next state, and repeats. If two agents always make the same moves and achieve the same objective, most organizations would treat them as equivalent. Same behavior, same operational meaning. Audit passed. Ship it.
The paper “Translating the Rashomon Effect to Sequential Decision-Making Tasks” by Dennis Gross, Jørn Eirik Betten, and Helge Spieker argues that this instinct is too comfortable.[1] The authors show that several learned policies can behave identically in a formal sequential task while relying on different internal feature attributions. More importantly, those hidden differences can matter later, when the environment changes.
The paper’s point is not merely that “many models can work.” That is old news. The sharper point is this: in sequential decision-making, sameness is not a single-layer concept. A policy can be behaviorally identical for a target property and still carry a different internal explanation, a different robustness profile, and a different value for verification.
That is where the business lesson begins. Not with the word “Rashomon,” elegant as it sounds. With the uncomfortable fact that identical observed performance can conceal operationally relevant differences.
In classification, sameness is easy; in decision-making, it has to be constructed
The original Rashomon idea is familiar in supervised learning: many models may make the same predictions while relying on different internal logic or feature importance. Two classifiers may both label an image correctly, yet one may focus on the object and another on background artifacts. Same answer, different mind. Charming, until the background changes.
For classification, behavioral comparison is relatively direct. Give two models the same inputs and compare their predictions. If they match across the relevant dataset, they are behaviorally aligned for that dataset.
Sequential decision-making is less convenient. A policy does not merely output one label. It interacts with an environment over time. One action changes the next state; the next state constrains future actions; stochastic transitions can make the same policy succeed in one run and fail in another. A single trajectory is therefore not enough evidence.
The paper’s key move is to replace “same prediction” with “same induced Markov chain.”
A policy applied to a Markov decision process resolves the available action choices and produces an induced discrete-time Markov chain, or DTMC. In simpler language, the policy plus the environment defines the full probabilistic behavior of the agent for the property being studied: which states it can reach, which transitions it can take, and with what probabilities.
So the sequential version of Rashomon rests on a shared training source plus two tests, one on observable behavior and one on internal structure:
| Layer | What must be identical or different | Why it matters |
|---|---|---|
| Training source | Policies are trained on the same expert dataset | Otherwise we are comparing different histories, not a Rashomon-style effect |
| Observable behavior | Policies induce identical DTMCs for the specified property | This replaces “same prediction” in classification |
| Internal structure | Policies differ under a user-specified explanation metric, such as feature attribution | This creates the Rashomon ambiguity: same behavior, different explanation |
This is the mechanism-first part of the paper. The authors are not simply reporting that multiple policies performed well. They define what “same behavior” should mean when behavior is a whole probabilistic process.
That matters because business systems increasingly look more like sequential decision systems than static classifiers. A claims-processing agent decides what document to request next. A warehouse robot chooses a route, then updates its plan. A trading assistant opens, monitors, scales, and exits. A customer-support workflow agent routes, asks, escalates, and summarizes. In each case, performance is not one prediction; it is a chain of decisions.
And chains have memory, even when the policy itself is memoryless. The environment remembers through state.
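To make the induced-chain comparison from this section concrete, here is a minimal Python sketch. The toy MDP, the state and action names, and the `induced_dtmc` helper are all hypothetical illustrations, not the paper's taxi model or the COOL-MC implementation. It shows two policies that differ as functions yet induce the same chain, because they only disagree in a state that is never reached.

```python
from typing import Dict, Tuple

State = str
Action = str
MDP = Dict[Tuple[State, Action], Dict[State, float]]
Policy = Dict[State, Action]
Chain = Dict[State, Dict[State, float]]


def induced_dtmc(mdp: MDP, policy: Policy, start: State) -> Chain:
    """Resolve each action choice with `policy`, keeping only reachable states."""
    chain: Chain = {}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        if state in chain:
            continue
        if state in policy:
            successors = mdp[(state, policy[state])]
        else:
            # Assumption: states without a policy entry are absorbing (goal or failure).
            successors = {state: 1.0}
        chain[state] = dict(successors)
        frontier.extend(successors)
    return chain


# Hypothetical toy MDP, not the paper's taxi model.
toy_mdp: MDP = {
    ("start", "left"): {"goal": 0.9, "fail": 0.1},
    ("start", "right"): {"detour": 1.0},
    ("detour", "left"): {"goal": 1.0},
    ("detour", "right"): {"fail": 1.0},
    ("garage", "left"): {"start": 1.0},
    ("garage", "right"): {"fail": 1.0},
}

policy_a: Policy = {"start": "right", "detour": "left", "garage": "left"}
policy_b: Policy = {"start": "right", "detour": "left", "garage": "right"}
policy_c: Policy = {"start": "left", "detour": "left", "garage": "left"}

dtmc_a = induced_dtmc(toy_mdp, policy_a, "start")
dtmc_b = induced_dtmc(toy_mdp, policy_b, "start")
dtmc_c = induced_dtmc(toy_mdp, policy_c, "start")

print(dtmc_a == dtmc_b)  # policies differ only in the unreachable "garage" state
print(dtmc_a == dtmc_c)  # policies differ at "start", so the chains differ
```

Comparing whole chains rather than single runs is the design choice that matters here: equality is checked over every reachable state and transition probability, not over whichever trajectory happened to be sampled.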
The paper’s definition is strict for a reason
The authors define the Rashomon effect in sequential decision-making as the case where multiple policies trained on the same data select the same actions in the same relevant environment, thereby producing identical induced DTMCs for a specified property, while differing in internal structure.
The phrase “for a specified property” deserves attention. It prevents a common misunderstanding.
Two policies may be identical with respect to “complete five taxi jobs without running out of fuel,” but not identical with respect to every possible question one might ask about the environment. Verification is property-centered. This is not philosophical hair-splitting; it is how formal safety work actually behaves. A policy may be equivalent for reaching a goal, but not equivalent for minimizing fuel variance, avoiding certain cells, or preserving optionality for future tasks.
The authors use probabilistic model checking to make this comparison exact. Instead of estimating performance by repeated simulation, COOL-MC constructs the induced DTMC for each trained policy and verifies the property. Storm is used to synthesize the expert policy and support model checking. The workflow is formal, not anecdotal.
This is also why the paper is more interesting than a normal robustness experiment. If two policies merely have similar success rates, they may still behave differently in the original environment. That would be useful, but not Rashomon in the strict sense. Here, the authors first isolate policies that are observationally indistinguishable under the target property, then ask whether their internal differences matter.
That order is the whole argument.
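The difference between checking a property and sampling it can be sketched on a toy chain. The Python below is an illustration under stated assumptions: the chain is made up, and `reach_probability` is a hand-rolled value iteration, not the Storm or COOL-MC API. It computes the probability of eventually reaching a goal state directly from the transition structure, with no simulated episodes.

```python
from typing import Dict

Chain = Dict[str, Dict[str, float]]


def reach_probability(chain: Chain, start: str, goal: str, sweeps: int = 1000) -> float:
    """Probability of eventually reaching `goal` from `start`, by value iteration."""
    value = {s: (1.0 if s == goal else 0.0) for s in chain}
    for _ in range(sweeps):
        for s, successors in chain.items():
            if s != goal:
                value[s] = sum(p * value[t] for t, p in successors.items())
    return value[start]


# Hypothetical induced chain with a retry loop.
toy_chain: Chain = {
    "start": {"retry": 0.5, "goal": 0.5},
    "retry": {"start": 0.8, "fail": 0.2},
    "goal": {"goal": 1.0},
    "fail": {"fail": 1.0},
}

p_goal = reach_probability(toy_chain, "start", "goal")
print(p_goal)  # solves x = 0.5 + 0.5 * 0.8 * x, i.e. x = 5/6
```

A fixed number of sweeps is a simplification; a real model checker would use convergence criteria or exact arithmetic. The point is only that the answer is derived from the model, not from a sample of runs.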
The taxi experiment is small, but the comparison is clean
The experimental environment is a taxi domain. The agent must pick up passengers and deliver them to destinations without running out of fuel. The episode ends when the taxi completes a predefined number of jobs or runs out of fuel. The first passenger location and destination are fixed; later ones are assigned randomly from four predefined locations.
The state includes the taxi’s position, passenger location, passenger destination, fuel, whether the passenger is on board, jobs completed, and whether the task is done. The action set is simple: move north, east, south, west, pick up, or drop.
The authors first use Storm to extract an optimal policy for completing five jobs. From that policy, they collect state-action pairs for the full MDP and train 100 neural policies by behavioral cloning, each with a different random seed. The point of the different seeds is not exotic. It is the standard way of producing different learned models from the same data.
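The seed mechanism can be illustrated with something far smaller than a neural network. The sketch below is a stand-in, not the authors' training setup: it clones a made-up expert dataset with a two-weight perceptron, and only the random initialization depends on the seed. Every name and number in it is hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical expert dataset: (features, expert action encoded as +1 or -1).
expert_data: List[Tuple[Tuple[float, float], int]] = [
    ((1.0, 1.0), 1),
    ((2.0, 1.0), 1),
    ((-1.0, -1.0), -1),
    ((-1.0, -2.0), -1),
]


def predict(weights: List[float], features: Tuple[float, float]) -> int:
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else -1


def clone_policy(seed: int, max_epochs: int = 1000) -> List[float]:
    """Perceptron-style cloning; only the random initialization depends on `seed`."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(2)]
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in expert_data:
            if predict(weights, features) != label:
                # Classic perceptron update on a misclassified example.
                weights = [w + label * x for w, x in zip(weights, features)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights


policies = {seed: clone_policy(seed) for seed in range(5)}
for seed, weights in policies.items():
    agrees = all(predict(weights, x) == y for x, y in expert_data)
    print(seed, [round(w, 3) for w in weights], agrees)
```

Because the toy data is linearly separable, every seed ends up reproducing the expert's actions, while the learned weights differ from seed to seed. That is the same-data, different-internals setup in miniature.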
The main evidence then arrives in two stages.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| 100 behavioral-cloning policies checked by COOL-MC | Main evidence for existence | Sequential Rashomon can occur: many policies trained on the same data can induce identical behavior while differing internally | It does not prove the effect is common in all RL systems |
| Feature attribution rankings among behaviorally equivalent policies | Main evidence for internal divergence | Identical induced DTMCs do not imply identical feature importance | It depends on saliency ranking as the chosen explanation metric |
| Increasing required jobs from 5 to 6–10 | Robustness / distribution-shift test | Hidden internal differences can become visible when the environment changes | It tests one kind of shift, not all shifts |
| Majority-vote Rashomon ensemble and permissive policy | Exploratory operational extension | Rashomon sets can support robustness and cheaper verification | It does not directly solve open-ended agent deployment |
The clean part is the sequence. First, identify exact behavioral equivalence. Then compare explanations. Then disturb the environment and see whether the hidden differences matter.
No mystery music required.
The first result: identical behavior, different explanations
For the five-job property, the authors train 100 policies. COOL-MC identifies 10 behavioral equivalence classes. The largest class contains 82 policies that satisfy the property. The remaining nine classes contain policies that fail the property: their reachability probabilities are either zero or strictly between zero and one.
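Grouping policies into behavioral equivalence classes amounts to bucketing them by their induced chain. A minimal sketch, assuming each induced DTMC is already available as a nested dictionary; the `signature` and `equivalence_classes` helpers and the seed names are hypothetical, not COOL-MC functions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Chain = Dict[str, Dict[str, float]]


def signature(chain: Chain) -> Hashable:
    """Order-independent, hashable fingerprint of a DTMC."""
    return frozenset(
        (state, frozenset(successors.items())) for state, successors in chain.items()
    )


def equivalence_classes(chains: Dict[str, Chain]) -> List[List[str]]:
    """Group policy names whose induced chains are identical, largest class first."""
    groups = defaultdict(list)
    for name, chain in chains.items():
        groups[signature(chain)].append(name)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical induced chains for four trained policies.
succeeds = {"start": {"goal": 1.0}, "goal": {"goal": 1.0}}
fails = {"start": {"fail": 1.0}, "fail": {"fail": 1.0}}
chains = {
    "seed_0": succeeds,
    "seed_1": dict(succeeds),
    "seed_2": fails,
    "seed_3": dict(succeeds),
}

classes = equivalence_classes(chains)
print(classes)
```

The fingerprint uses exact equality on transition probabilities, which is an assumption that only holds if probabilities are computed exactly or normalized first; with floating-point noise, a tolerance-based comparison would be needed.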
The authors then focus on the 82 policies in the largest behavioral equivalence class. These policies behave identically for the specified five-job property. Within that class, they compute saliency-based feature importance rankings.
This is where the Rashomon effect appears. The policies are behaviorally identical but do not assign the same importance ranking to all state features. Some policies share identical attribution patterns with each other, and those cannot form a Rashomon set together under this metric. But across different attribution patterns, the policies meet the sequential Rashomon condition: same training data, same induced behavior, different internal structure.
The table in the paper shows a subset of policies that all have model-checking result 1.0 for the five-job task but differ in feature rankings. For example, the feature “done” is consistently ranked most important in the displayed rows, but rankings for features such as fuel, jobs done, passenger destination, passenger location, and taxi coordinates vary.
The important interpretation is not “saliency maps are truth.” They are not. The paper uses saliency-based attribution as a user-specified internal metric. A different metric could produce a different Rashomon set. That is not a flaw in the definition; it is part of the definition. Rashomon ambiguity is always relative to how we decide to inspect the model’s internal structure.
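Here is a small illustration of how two policies can choose the same actions while ranking features differently. It is a sketch under loud assumptions: the policies are linear scorers rather than neural networks, the "saliency" is a finite-difference sensitivity chosen for simplicity (not necessarily the attribution method used in the paper), and the feature values are made-up, centered numbers.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]
ScoreFn = Callable[[Features], float]


def linear_policy(weights: Features) -> ScoreFn:
    return lambda x: sum(weights[name] * x[name] for name in weights)


def choose(score: float) -> str:
    return "pickup" if score > 0 else "move"


def sensitivity_ranking(
    score_fn: ScoreFn, states: Sequence[Features], eps: float = 1e-3
) -> List[str]:
    """Rank features by average absolute change in score under a small bump."""
    totals = {name: 0.0 for name in states[0]}
    for x in states:
        base = score_fn(x)
        for name in x:
            bumped = {**x, name: x[name] + eps}
            totals[name] += abs(score_fn(bumped) - base) / eps
    return sorted(totals, key=lambda name: totals[name], reverse=True)


# Hypothetical, centered feature values.
states = [
    {"fuel": 1.0, "done": 1.0},
    {"fuel": 2.0, "done": 1.0},
    {"fuel": -1.0, "done": -1.0},
    {"fuel": -1.0, "done": -2.0},
]
policy_a = linear_policy({"fuel": 2.0, "done": 1.0})
policy_b = linear_policy({"fuel": 1.0, "done": 2.0})

same_actions = all(choose(policy_a(x)) == choose(policy_b(x)) for x in states)
ranking_a = sensitivity_ranking(policy_a, states)
ranking_b = sensitivity_ranking(policy_b, states)
print(same_actions, ranking_a, ranking_b)
```

Swapping in a different attribution function would change the rankings and possibly the resulting Rashomon set, which is exactly the metric-relativity described above.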
For business users, the correction is simple:
| Reader belief | Correction | Why it matters |
|---|---|---|
| “If two policies behave the same, their explanations are interchangeable.” | Behavior can be identical while internal feature reliance differs. | Explanation cannot be treated as a single authoritative story just because performance matches. |
| “A passed simulation test identifies the best policy.” | A passed formal property identifies behavioral adequacy for that property. | Robustness and explanation may still need separate selection criteria. |
| “Model checking is only a safety gate.” | Model checking can also define behavioral equivalence classes. | Verification can become a tool for model selection and diagnosis, not only compliance. |
This is the paper’s first contribution: it translates Rashomon from static prediction into sequential behavior by making equivalence formal.
The second result: hidden differences become visible under shift
The more practically interesting result comes after the authors disturb the original task.
They take a Rashomon set of five policies that behave identically in the original five-job environment. Then they increase the number of jobs the taxi must complete: six, seven, eight, nine, and ten jobs.
Now the policies diverge.
All five policies complete five jobs with probability 1.0. But under harder job counts, their success probabilities separate sharply. One policy succeeds at six jobs but drops to 0.0 at seven. Another reaches 1.0 at seven jobs and 0.23 at eight. A third falls to 0.5 at six and then 0.0. One policy reaches 0.94 at seven but fails at eight.
The majority-vote Rashomon ensemble, denoted $\pi_R$ in the paper, performs better than randomly selecting an individual policy in the tested shifts, reaching 1.0 through seven jobs before failing at eight and beyond. The expected random choice over the five policies performs worse: 0.875 at six jobs, 0.423 at seven, and 0.046 at eight.
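The expected-random-choice figure is simply the mean of the individual success probabilities. As a sanity check on the eight-job number, the sketch below assumes that exactly one of the five policies succeeds with probability 0.23 and the other four with 0.0. That split is an inference from the figures quoted here (a maximum of 0.23 and a mean of 0.046, up to rounding), not a row copied from the paper.

```python
from statistics import mean

# Assumption: at eight jobs, one policy succeeds with 0.23 and the other four with 0.0.
eight_job_success = [0.23, 0.0, 0.0, 0.0, 0.0]

# Picking one of the five policies uniformly at random gives the plain average.
expected_random_pick = mean(eight_job_success)
print(round(expected_random_pick, 3))
```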
Here is the compact version of the distribution-shift evidence:
| Policy or construction | 5 jobs | 6 jobs | 7 jobs | 8 jobs | 9 jobs | 10 jobs |
|---|---|---|---|---|---|---|
| Individual policies in the Rashomon set | 1.0 each | 0.5–1.0 | 0.0–1.0 | 0.0–0.23 | 0.0 | 0.0 |
| Expected random individual, $E[\pi_{1:5}]$ | 1.0 | 0.875 | 0.423 | 0.046 | 0.0 | 0.0 |
| Majority-vote ensemble, $\pi_R$ | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| Permissive policy, $\tau_R$ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Original optimal policy, $\pi^\ast$ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
This is not a claim that majority voting magically solves robustness. It does not. The majority-vote ensemble still fails from eight jobs onward. The more precise claim is that once policies are behaviorally indistinguishable in the training environment, using the Rashomon set can be better than pretending that any one of them is equally safe to choose.
The shift test changes the article’s business interpretation. Before the shift, explanation diversity looks like an interpretability concern. After the shift, it becomes a risk-management signal.
Same behavior under known conditions is not enough. The internal differences may be dormant only because the original property is too narrow to reveal them.
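A deliberately tiny example shows how a difference can stay dormant. In the hypothetical sketch below, the state is just the number of completed jobs, and the two policies disagree only in the state with five completed jobs. Under a five-job requirement that state is terminal, so the disagreement is never exercised. None of this is the paper's taxi model.

```python
from typing import Dict


def completes(policy: Dict[int, str], required_jobs: int, max_steps: int = 50) -> bool:
    """Deterministic toy: each 'serve' finishes one job, anything else makes no progress."""
    jobs = 0
    for _ in range(max_steps):
        if jobs >= required_jobs:
            return True
        if policy.get(jobs, "idle") == "serve":
            jobs += 1
    return jobs >= required_jobs


# Hypothetical policies: identical for 0..4 completed jobs, different at 5.
policy_x = {jobs: "serve" for jobs in range(10)}
policy_y = {**{jobs: "serve" for jobs in range(5)}, 5: "idle"}

for required in (5, 6):
    print(required, completes(policy_x, required), completes(policy_y, required))
```

Raising the requirement to six jobs makes the state with five completed jobs non-terminal, and the choice that never mattered now decides the outcome.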
The permissive policy is the operationally interesting twist
The paper’s most business-relevant result is not the majority-vote ensemble. It is the permissive policy.
A normal deterministic policy selects one action in a state. A permissive policy allows multiple possible actions. In this paper, the permissive policy $\tau_R$ is derived from the Rashomon set by allowing any action selected by at least one member policy. Instead of inducing a DTMC, it induces an MDP, because action choice is no longer fully resolved.
That sounds less decisive, but it is useful. The permissive policy defines a compact action envelope: a smaller decision space that still contains an optimal policy. In the taxi experiment, model checking this induced MDP shows that it preserves 100% success across all tested job-count shifts, matching the original optimal policy.
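The two constructions built from a Rashomon set can be sketched in a few lines. This is an illustration, assuming every member policy is defined on the same states; the state names, actions, and tie-breaking rule (first most common action wins) are hypothetical choices, and the paper's exact tie handling is not described here.

```python
from collections import Counter
from typing import Dict, List, Set

Policy = Dict[str, str]


def majority_vote(policies: List[Policy]) -> Policy:
    """One action per state: the most frequently chosen one among the members."""
    return {
        state: Counter(p[state] for p in policies).most_common(1)[0][0]
        for state in policies[0]
    }


def permissive(policies: List[Policy]) -> Dict[str, Set[str]]:
    """A set of actions per state: anything at least one member would choose."""
    return {state: {p[state] for p in policies} for state in policies[0]}


# Hypothetical Rashomon set over two states.
rashomon_set: List[Policy] = [
    {"s0": "north", "s1": "pickup"},
    {"s0": "north", "s1": "east"},
    {"s0": "east", "s1": "east"},
]

pi_R = majority_vote(rashomon_set)
tau_R = permissive(rashomon_set)
print(pi_R)
print(tau_R)
```

The majority vote still resolves every choice, so it induces a DTMC. The permissive version leaves some states with more than one allowed action, which is why it induces an MDP restricted to those actions rather than a single chain.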
The computational reduction is large. The induced MDP from the permissive Rashomon policy contains 6,209 states and 12,149 transitions. The original MDP contains 72,064 states and 335,544 transitions.
That is roughly a 91% reduction in states and a 96% reduction in transitions.
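The percentages follow directly from the state and transition counts quoted above. A quick check, with `reduction` as a hypothetical helper name:

```python
original_states, original_transitions = 72_064, 335_544
permissive_states, permissive_transitions = 6_209, 12_149


def reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100 * (1 - after / before)


state_reduction = reduction(original_states, permissive_states)
transition_reduction = reduction(original_transitions, permissive_transitions)
print(f"states: {state_reduction:.1f}%  transitions: {transition_reduction:.1f}%")
```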
For formal verification, those numbers matter. State-space explosion is not a poetic inconvenience. It is the thing that makes exact checking expensive or impossible. A compact induced model that preserves an optimal policy is therefore not just an elegant artifact. It is a possible verification strategy.
But precision matters. The permissive policy is not the same thing as saying “deploy every action any model suggests.” It is a formally checked envelope. Its value is that it can preserve the possibility of optimal behavior while shrinking the verification burden. The operational task would still be to synthesize or select safe actions inside that envelope.
In business terms, the permissive Rashomon policy points toward a different way of using model diversity:
| Technical object | Operational translation | ROI relevance |
|---|---|---|
| Behavioral equivalence class | A set of policies that pass the same formal property | Avoids wasting effort comparing policies that are equivalent for the current requirement |
| Attribution diversity | Different internal reliance patterns among equivalent policies | Reveals hidden model-selection criteria before deployment |
| Majority-vote ensemble | A practical hedge among behaviorally equivalent policies | May improve robustness under some shifts, but not guaranteed |
| Permissive policy | A compact safe action envelope derived from the Rashomon set | Can reduce verification cost while preserving optimal options |
This is the article’s central callback: “same moves” does not mean “same mind,” and “different minds” can be used rather than merely worried about.
What this means for agentic business systems
The paper is not about enterprise LLM agents, robotic fleets, compliance workflow orchestration, or autonomous trading systems. It is a formal study in a taxi environment. Good. We should be grateful when a paper does not pretend to have solved the entire economy by Tuesday.
Still, the pathway to business interpretation is clear.
First, observed success should not be the only selection criterion
Many organizations evaluate agents by task completion: Did the workflow finish? Did the robot reach the target? Did the AI assistant close the ticket? Did the strategy produce acceptable backtest returns?
That is necessary, but insufficient. The paper shows a stricter version of the problem: even when behavior is not merely similar but formally identical for a specified property, internal explanations can differ.
For applied systems, this suggests that model selection should include at least three layers:
- Does the policy satisfy the operational property?
- Is it behaviorally equivalent to other candidate policies under relevant specifications?
- Among equivalent policies, which internal structure appears more robust, interpretable, or cheaper to verify?
The third layer is where many organizations currently improvise. They choose the model with the nicest dashboard, the familiar architecture, the lower latency, or the vendor’s most soothing slide deck. A Rashomon-aware workflow would treat equivalence classes and internal diversity as explicit objects.
Second, explanation should become plural, not ceremonial
When a model explanation is presented after deployment, it is often treated as if it were the one and only explanation. This paper says: be careful. When multiple policies behave identically but have different feature attributions, explanation is underdetermined by behavior.
That does not make explanations useless. It makes single explanations less sacred.
For governance, the practical move is to ask: “Which explanation family is stable across the Rashomon set?” If several behaviorally equivalent policies all rank some feature as important, that is different from one model’s private obsession. Conversely, if explanations vary widely among equivalent policies, the organization should not pretend that one saliency chart has revealed the system’s soul. The system may not have one. Awkward, but manageable.
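One way to operationalize the stability question is to record, for each feature, every rank position it takes across the Rashomon set. The sketch below uses made-up rankings and hypothetical feature names; it is not the paper's table, only an illustration of the bookkeeping.

```python
from typing import Dict, List, Set


def rank_positions(rankings: Dict[str, List[str]]) -> Dict[str, Set[int]]:
    """Map each feature to the set of rank positions it takes across policies."""
    positions: Dict[str, Set[int]] = {}
    for ranking in rankings.values():
        for position, feature in enumerate(ranking, start=1):
            positions.setdefault(feature, set()).add(position)
    return positions


# Hypothetical attribution rankings for three behaviorally equivalent policies.
rankings = {
    "policy_a": ["done", "fuel", "jobs_done", "x"],
    "policy_b": ["done", "jobs_done", "fuel", "x"],
    "policy_c": ["done", "fuel", "x", "jobs_done"],
}

positions = rank_positions(rankings)
stable = sorted(f for f, seen in positions.items() if len(seen) == 1)
unstable = sorted(f for f, seen in positions.items() if len(seen) > 1)
print("stable:", stable)
print("unstable:", unstable)
```

Features that land in the stable list are candidates for an explanation the organization can stand behind; the rest are closer to one model's private obsession.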
Third, robustness testing should begin with equivalent policies, not only failed ones
The usual workflow is to test a model, find failures, and patch them. The Rashomon view suggests another diagnostic: examine models that do not fail under the original specification.
That sounds inefficient until the distribution shift arrives. In the taxi experiment, the five policies are indistinguishable at five jobs but diverge when the job count increases. The difference was latent, not absent.
For business systems, the analogous question is not only “Which model fails our current benchmark?” It is also “Among the models that pass, which ones collapse first when the operating regime changes?”
That question is especially relevant for agentic automation, where distribution shifts are boringly normal. Workflows receive unusual documents. Customers ask for multi-step exceptions. Regulations change. Market volatility jumps. Internal tools return unexpected outputs. A policy that is perfect in the reference workflow may still be brittle in the next quarter’s workflow.
Fourth, formal specifications are not bureaucracy when the system is sequential
A common business reaction to formal methods is polite avoidance. Formal verification sounds expensive, specialized, and slow. Sometimes it is. But the paper demonstrates why sequential systems need more than simulation when the question is equivalence.
If stochastic transitions can make the same policy succeed or fail on a single run, then empirical trajectories cannot certify exact behavioral identity. They can estimate. They cannot prove. Model checking, by constructing the induced probabilistic system, gives a sharper object to compare.
This does not mean every AI workflow needs full formal verification. Many do not. But for safety-critical agents, regulated automation, robotic control, infrastructure scheduling, or high-cost operational systems, the paper strengthens the case for specifying properties clearly enough that policies can be compared by induced behavior, not just scoreboard metrics.
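The gap between estimating and proving can be put in numbers. Assuming independent simulated runs, the sketch below computes how often a policy with a given true success probability would produce a spotless test record. The helper name and the chosen probabilities are illustrative.

```python
def prob_all_runs_succeed(true_success: float, runs: int) -> float:
    """Chance that every one of `runs` independent episodes succeeds."""
    return true_success ** runs


for true_success in (1.0, 0.999, 0.99):
    clean_record = prob_all_runs_succeed(true_success, runs=100)
    print(true_success, round(clean_record, 3))
```

A policy that succeeds 99.9% of the time passes 100 runs without a single failure about nine times out of ten, so a clean record cannot separate it from a policy that succeeds with probability 1. Checking the induced model can.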
The result is strongest where the environment can be formalized
The paper’s business boundary is not hard to find. Its power comes from formal structure, and that same structure limits immediate generalization.
The experiment uses a finite taxi environment. The policies are trained by behavioral cloning from an expert dataset extracted from an optimal policy. The policies are memoryless and deterministic in the relevant setup. The explanation metric is saliency-based feature attribution. The distribution shift is created by increasing the number of taxi jobs. The property is clear: complete the required jobs with probability 1 without exhausting fuel.
Those conditions are not a weakness. They are why the result is clean. But they mean the paper should not be overextended into a claim that open-ended LLM agents can already be Rashomon-verified in messy enterprise environments.
A fair interpretation is narrower and more useful:
| Paper directly shows | Cognaptus infers for business use | What remains uncertain |
|---|---|---|
| Sequential Rashomon effects can be defined using identical induced DTMCs plus different internal metrics | Agent evaluation should distinguish behavioral equivalence from explanatory equivalence | How to scale exact equivalence checking to large, partially observed, open-ended systems |
| 82 of 100 trained policies fall into the largest successful behavioral equivalence class in the taxi setup | Passing policies may be abundant, making selection among “good enough” models a real governance problem | Whether similar equivalence-class structure appears in richer industrial tasks |
| A five-policy Rashomon set diverges under increased job-count shift | Internal explanation diversity can be a robustness signal worth testing | Which attribution differences predict which real-world shifts |
| A permissive Rashomon policy preserves 100% tested success while shrinking model-checking state space from 72,064 to 6,209 states | Rashomon sets can help build compact verification envelopes | Whether the state-space savings generalize beyond this environment and property |
The correct business response is therefore not “use Rashomon ensembles everywhere.” That would be consultancy-flavored confetti.
The better response is: when several policies pass the same sequential requirement, do not collapse them into a single winner too early. First, identify whether they are behaviorally equivalent. Then inspect whether their internal structures differ. Then test whether those differences matter under plausible shifts. Finally, where formal verification is feasible, consider whether a permissive envelope can reduce the verification burden.
That is slower than picking the prettiest model. It is also less embarrassing when conditions change.
A better deployment question: which sameness do we mean?
The paper’s lasting value is conceptual discipline.
In static prediction, the Rashomon effect asks whether many models can make the same predictions while relying on different internals. In sequential decision-making, the same question becomes more demanding. We must ask whether policies induce the same probabilistic behavior for a specified property. Only then can internal differences be isolated cleanly.
This gives organizations a better vocabulary for agent evaluation.
Not all sameness is equal:
- Same training data does not imply same behavior.
- Same success probability does not imply same induced process.
- Same induced process does not imply same explanation.
- Same explanation metric does not imply same robustness under every shift.
- Same deployed behavior today does not imply same failure boundary tomorrow.
The taxi is small, but the lesson scales conceptually. As AI systems move from answering questions to operating workflows, the old habit of comparing outputs will not be enough. Agents do not merely predict; they act. Their actions reshape the next decision. Their risks live in sequences.
The Rashomon effect in sequential decision-making therefore gives us a useful warning and a useful tool. The warning: identical behavior can hide different minds. The tool: those different minds can be organized into equivalence classes, inspected for explanation diversity, and potentially used to build more robust or more efficiently verified systems.
Same moves are not the end of the audit. They are the beginning of the interesting part.
Cognaptus: Automate the Present, Incubate the Future.
[1] Dennis Gross, Jørn Eirik Betten, and Helge Spieker, “Translating the Rashomon Effect to Sequential Decision-Making Tasks,” arXiv:2512.17470, 2025.