Same Causal Effect, Different Bill: Derivation Graphs and the Estimand Trap

The formula is not a clerical detail

A business asks a causal question: What happens if we change X?

The analytics team returns a formula. Everyone relaxes. The effect is identifiable, the notation looks official, and a graph somewhere has probably been blessed by someone with a PhD. Excellent. Time to move to dashboards.

This is usually where the expensive part begins.

The problem is not that causal inference lacks machinery. It has plenty. Pearl’s do-calculus gives formal rules for turning interventional expressions such as $P(y \mid do(x), w)$ into equivalent expressions when the causal graph permits it. Identification algorithms can produce estimands from observational data. Adjustment theory tells us which covariates may be used under certain graph conditions. The machinery is not missing.

The missing piece is a map of the machinery itself.

Yvernes, Devijver, Clausel, and Gaussier’s paper, Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs, makes that map explicit.¹ Its central move is deceptively simple: treat every do-calculus expression as a node, draw an edge when one valid do-calculus rule transforms one expression into another, and study the resulting graph.

That graph is not decorative. It changes the way we should think about causal analysis. A causal query is not a lonely formula waiting to be identified. It may sit inside a whole connected component of equivalent expressions. Some are observational. Some are interventional. Some lead to simple estimators. Some lead to fragile, high-variance estimators. They may be equal in the causal universe and very unequal in the spreadsheet universe, where budgets, sample sizes, and measurement noise insist on existing.

This is the paper’s useful irritation: identification is not the finish line. It is the entrance to estimand selection.

Derivation graphs turn do-calculus from a recipe into a state space

The paper works with expressions of the form:

$$ P(y \mid do(x), w) $$

Here, $Y$ is the outcome set, $X$ is the set of intervened variables, and $W$ is the set of observed conditioning variables. Do-calculus supplies three graphical rewriting rules:

Rule	Plain-language role	Operational interpretation
$R1$	Insert or delete observations	Decide whether conditioning on a variable changes the expression
$R2$	Exchange action and observation	Replace observing a variable with intervening on it, or vice versa
$R3$	Insert or delete actions	Decide whether an intervention is irrelevant to the target expression
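
For reference, the three rules can be written out in Pearl’s standard formulation. The notation here is brought in from the general literature rather than quoted from the paper: $Z$ is the set of variables being inserted, deleted, or exchanged, $G_{\overline{X}}$ is $G$ with all edges into $X$ removed, $G_{\underline{Z}}$ is $G$ with all edges out of $Z$ removed, and $Z(W)$ is the set of nodes in $Z$ that are not ancestors of any node in $W$ in $G_{\overline{X}}$.

$$
\begin{aligned}
R1:\quad & P(y \mid do(x), z, w) = P(y \mid do(x), w) && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
R2:\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
R3:\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), w) && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{aligned}
$$

Each rule is an equality guarded by one d-separation condition in a modified copy of the graph, which is why a single rule application can be treated as one edge between two expressions.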

Ordinarily, these rules are used sequentially. You begin with a target query, apply a valid rule, then another, then another, until you reach a desirable form. That workflow is familiar: it resembles a proof search, or, less generously, a maze with Greek letters.

The derivation graph changes the perspective.

For a given causal graph $G$, the paper defines a derivation graph $D[G]$ whose vertices are all expressions $P(y \mid do(x), w)$ in which $Y$, $X$, and $W$ are pairwise disjoint sets of variables. An edge connects two vertices when one can be obtained from the other by a single valid do-calculus rule. A connected component of this graph is therefore exactly a set of expressions that do-calculus can prove equal to one another.

That one definition does a lot of work.

It says that a causal effect is not merely a formula. It belongs to an equivalence class. If $P(y \mid do(x))$ can be transformed into $P(y \mid z)$, $P(y \mid do(z), x)$, or another expression through valid do-calculus steps, those expressions live in the same component. The component is the object analysts should inspect.

The paper also shows why this object can become large. For an empty causal graph over variables $V$, the derivation graph has $2^{|V|}-1$ connected components, one for each non-empty outcome set. For any other graph on the same variables, the derivation graph is a subgraph of the empty-graph case: adding causal edges can only remove valid transformations, never create them. For a fixed outcome set $Y$, the size of the equivalent-expression component is bounded above by:

$$ 3^{|V \setminus Y|} $$

The bound is tight: the empty graph attains it, since there every expression with outcome set $Y$ lies in a single component.
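
The counting behind this bound can be sketched in a few lines. Each variable outside $Y$ plays one of three roles in an expression: intervened on, observed, or absent. The variable names below are placeholders and the helper `expressions_for_outcome` is a hypothetical name, not something taken from the paper.

```python
from itertools import product

def expressions_for_outcome(variables, outcome):
    """Yield (X, W) pairs: every way to mark each non-outcome variable
    as intervened on (X), observed (W), or absent from the expression."""
    rest = sorted(set(variables) - set(outcome))
    for roles in product(("do", "obs", "absent"), repeat=len(rest)):
        x = frozenset(v for v, r in zip(rest, roles) if r == "do")
        w = frozenset(v for v, r in zip(rest, roles) if r == "obs")
        yield x, w

variables = ["W", "Z", "X", "Y"]   # placeholder variable names
outcome = ["Y"]

candidates = list(expressions_for_outcome(variables, outcome))
n_rest = len(variables) - len(outcome)
print(len(candidates), 3 ** n_rest)        # 27 27

# Number of non-empty outcome sets, i.e. components of the empty-graph case.
n_outcome_sets = 2 ** len(variables) - 1
print(n_outcome_sets)                      # 15
```

With only three non-outcome variables there are already 27 candidate expressions per outcome set. The count triples with each additional variable.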

That exponential ceiling matters. It means the space of equivalent causal expressions can blow up quickly. The comforting fantasy that there is “the” causal formula is, mathematically, a little too tidy. There may be many routes through the do-calculus state space. Some are redundant. Some are useful. Some are estimator disasters wearing a valid proof as a hat.

The paper’s mechanism: three rules, but fewer degrees of freedom than expected

The strongest theoretical result is not merely that derivation graphs exist. It is that their structure is surprisingly constrained.

The paper studies how do-calculus rules commute with one another. This sounds like a private hobby for symbolic algebra enthusiasts, but the consequence is practical: if rule applications can be reordered, then a messy derivation can be compressed into a canonical sequence.

First, $R1$ is shown to be redundant in a precise sense. The effect of inserting or deleting observations can be represented through combinations of $R2$ and $R3$. In the derivation graph, an $R1$ edge becomes a shortcut across a triangle: what looked like one kind of move can be decomposed into action-observation exchange plus action insertion or deletion.

Second, $R2$ applications commute internally. Converting multiple observations into interventions, or interventions into observations, does not depend on the order in which those exchanges are performed. In the derivation graph, this produces quadrilateral patterns: two different paths reach the same endpoint.

Third, $R3$ is more awkward. Some action insertions commute, but action insertion and deletion do not commute in general. The paper’s examples using the “Napkin graph” show exactly where the missing quadrilateral appears. This is not pedantry. It identifies the asymmetry that makes do-calculus reasoning harder than a simple shuffling game.

Finally, $R2$ and $R3$ commute only weakly. When they move in the same direction, the structure is clean. When one inserts and the other deletes, the exchange may be valid in one direction only.

From this, the authors prove a normal form: if two expressions are equivalent, one can transform one into the other using a canonical sequence with at most two applications of $R2$ and two applications of $R3$. Informally:

observations -> interventions
then introduce/delete interventions
then interventions -> observations

This gives a compact route through what could otherwise be a very large derivation graph.

The paper then derives a sound and complete graphical criterion for equivalence. Instead of searching through arbitrary rule sequences, one can perform four d-separation tests on mutilated graphs, that is, copies of $G$ with the edges into or out of selected variables removed. Since each d-separation test runs in time linear in the size of the graph, the equivalence check is computationally efficient once the relevant variable sets are specified.
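
To make "d-separation on a mutilated graph" concrete, here is a minimal sketch. It does not reproduce the paper’s four-test criterion; it only checks the precondition of a single $R3$ or $R2$ step. The graph is a hypothetical DAG with a latent common cause `U`, chosen for illustration, and the d-separation routine uses the textbook moralisation test rather than a linear-time algorithm.

```python
def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `edges`.

    Moralisation test: keep only ancestors of the involved nodes, connect
    co-parents, drop edge directions, delete zs, then check connectivity.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)

    # 1. Ancestral closure of the involved nodes.
    keep = xs | ys | zs
    stack = list(keep)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)

    # 2. Undirected moral graph on the kept nodes.
    nbrs = {n: set() for n in keep}
    for a, b in edges:
        if a in keep and b in keep:
            nbrs[a].add(b)
            nbrs[b].add(a)
    for child in keep:
        ps = sorted(parents.get(child, ()))
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # 3. Remove the conditioning set and search from xs.
    frontier = [x for x in xs if x not in zs]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for m in nbrs[node] - zs - seen:
            seen.add(m)
            frontier.append(m)
    return True

def remove_incoming(edges, nodes):
    return [(a, b) for a, b in edges if b not in nodes]

def remove_outgoing(edges, nodes):
    return [(a, b) for a, b in edges if a not in nodes]

# Hypothetical DAG (an assumption, not the paper's graph); U is latent.
G = [("U", "W"), ("U", "Y"), ("W", "Z"), ("Z", "X"), ("X", "Y")]

# R3 precondition for dropping do(w) from P(y | do(w, z)):
# Y and W must be d-separated given Z after cutting edges into Z and W.
r3_ok = d_separated(remove_incoming(G, {"Z", "W"}), {"W"}, {"Y"}, {"Z"})

# R2 precondition for P(y | do(z), w) = P(y | z, w):
# Y and Z must be d-separated given W after cutting edges out of Z.
r2_ok = d_separated(remove_outgoing(G, {"Z"}), {"Z"}, {"Y"}, {"W"})

# Without conditioning on W, the backdoor path Z <- W <- U -> Y stays open.
r2_without_w = d_separated(remove_outgoing(G, {"Z"}), {"Z"}, {"Y"}, set())

print(r3_ok, r2_ok, r2_without_w)   # True True False
```

Each check is one graph surgery followed by one separation query. That is the kind of primitive the equivalence criterion relies on, instead of a search over rule sequences.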

That is the mechanism-level contribution: derivation graphs expose the full equivalence space, but the rule algebra prevents that space from becoming an unstructured swamp. Helpful, because swamps are already well represented in enterprise analytics.

Equivalent expressions are causally identical, not statistically identical

The paper’s most business-relevant point arrives when equivalence meets estimation.

A do-calculus derivation proves that two expressions are causally equal under the graph assumptions. It does not prove that their finite-sample estimators behave the same. This is the misconception the paper quietly dismantles.

Suppose two formulae identify the same causal effect. One may rely on a simple backdoor adjustment. Another may use a front-door-like decomposition involving multiple regression models and integrals. Both may be unbiased under the assumed model. One may still have far larger variance.

For a decision-maker, “same estimand in theory” is not the same as “same decision quality in practice.” The business version is blunt:

Technical equality	Operational non-equality
Same causal effect	Different variance
Same identifiable query	Different data requirements
Same do-calculus component	Different model complexity
Same theoretical target	Different implementation risk
Same proof status	Different cost of experiment or measurement

This is where derivation graphs become more than a representation theorem. They provide a way to enumerate alternative causal expressions before committing to an estimator. The practical workflow becomes:

target causal query
        ↓
derive its connected component in the derivation graph
        ↓
identify equivalent expressions
        ↓
apply ID or adjustment methods to each feasible expression
        ↓
compare estimators by variance, cost, robustness, and available data

The graph does not select the estimator by magic. It creates the candidate set. That is already a meaningful improvement over accepting the first valid formula a tool emits.

The synthetic example is the main evidence for the estimator-selection problem

The paper’s first estimation example uses a small graph with variables $W, Z, X, Y$. In the introduction, the authors show that the expression $P(y \mid do(w,z))$ can be identified in at least two ways. A front-door-style formula gives one valid representation. A simpler route using do-calculus and backdoor adjustment gives:

$$ P(y \mid do(w,z)) = P(y \mid do(z)) = \sum_w P(y \mid z,w)P(w) $$

The key observation is that the intervention on $W$ is irrelevant for the target once the equivalence is exposed. Without the derivation-graph view, an analyst might carry $W$ into a more complex estimator merely because the original query included it. This is the sort of thing that makes models look sophisticated while quietly taxing the variance account.
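
The second equality in the formula above can be unpacked into standard steps. This is the generic backdoor derivation, sketched under the assumption (implied by the formula but not restated here from the paper) that $W$ is not a descendant of $Z$ and blocks every backdoor path from $Z$ to $Y$:

$$
\begin{aligned}
P(y \mid do(z)) &= \sum_w P(y \mid do(z), w)\, P(w \mid do(z)) && \text{(law of total probability)} \\
&= \sum_w P(y \mid do(z), w)\, P(w) && \text{($R3$: the action on $Z$ does not affect $W$)} \\
&= \sum_w P(y \mid z, w)\, P(w) && \text{($R2$: exchange $do(z)$ for observing $z$ given $w$)}
\end{aligned}
$$

Every line is a different expression for the same quantity. Only the last one can be estimated from purely observational data.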

In Section 5.1, the authors generate data from a linear Gaussian model based on the graph and compare estimators over 1,000 runs. This experiment is not an ablation. It is the main evidence for the practical estimation claim.

The setup compares equivalent identification formulae. Both target the same causal effect of $Z$ on $Y$, whose true value is $2.4$. The backdoor estimator and the front-door-based estimator are both unbiased for that effect, but the front-door-based estimator has substantially larger variance. The same front-door-type derivation also estimates a direct effect of $W$ on $Y$ via $X$ whose true value is zero, again with large variance.

The purpose of this test is narrow and important. It does not show that backdoor estimators always dominate front-door estimators. It shows that equivalence under do-calculus does not settle estimator quality. Once you see the equivalent-expression space, estimator choice becomes a statistical decision rather than an algebraic afterthought.
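
A rough way to see this effect for yourself is a small Monte Carlo run. The sketch below is not the paper’s experiment: the graph (`U -> W -> Z -> X -> Y` plus `U -> Y`, with `U` latent), the coefficients, the sample size, and the number of runs are all invented for illustration. The coefficients are chosen only so that the true effect of $Z$ on $Y$ is $1.2 \times 2.0 = 2.4$, the value quoted above. It uses plain Python so it runs without extra packages.

```python
import random
import statistics

def cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def slope(y, x):
    """OLS coefficient of x when regressing y on x (with intercept)."""
    return cov(x, y) / cov(x, x)

def partial_slope(y, x, c):
    """OLS coefficient of x when regressing y on x and c (with intercept)."""
    sxx, scc, sxc = cov(x, x), cov(c, c), cov(x, c)
    return (cov(x, y) * scc - cov(c, y) * sxc) / (sxx * scc - sxc ** 2)

def one_run(n, rng):
    # Hypothetical linear Gaussian model; U is never used by the estimators.
    u = [rng.gauss(0, 1) for _ in range(n)]
    w = [ui + rng.gauss(0, 1) for ui in u]
    z = [wi + rng.gauss(0, 1) for wi in w]
    x = [1.2 * zi + rng.gauss(0, 0.3) for zi in z]
    y = [2.0 * xi + 1.5 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    # Backdoor: coefficient of Z in a regression of Y on Z, adjusting for W.
    backdoor = partial_slope(y, z, w)
    # Front-door: (effect of Z on X) times (effect of X on Y, adjusting for Z).
    frontdoor = slope(x, z) * partial_slope(y, x, z)
    return backdoor, frontdoor

rng = random.Random(0)   # fixed seed so the run is repeatable
runs = [one_run(500, rng) for _ in range(200)]
backdoor_estimates = [b for b, _ in runs]
frontdoor_estimates = [f for _, f in runs]

for name, est in (("backdoor", backdoor_estimates),
                  ("front-door", frontdoor_estimates)):
    print(name, round(statistics.fmean(est), 3),
          round(statistics.variance(est), 5))
```

Both estimators should centre near 2.4. In this invented setup the front-door product should come out much noisier, because $X$ is almost a deterministic function of $Z$ and the second regression is close to collinear. Different coefficients could reverse the ranking, which is the point: the ranking is a property of the data-generating process, not of the algebra.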

The Sachs benchmark shows the same issue outside a toy graph

The paper then applies the method to the Sachs protein-signalling dataset, a standard benchmark in causal inference. The authors use the biologically validated DAG as ground truth and focus on the effect:

$$ P(P38 \mid do(Mek)) $$

According to the reference DAG, this effect is null. Starting from the causal graph, their derivation graph produces 32 equivalent interventional densities. Because the graph is a DAG, each density leads to an identifiable adjustment formula via parental adjustment. Some densities induce the same adjustment set, so the authors report only the distinct estimators.

This is best read as a benchmark application, not a sweeping empirical validation campaign. Its role is to show that the multiplicity observed in the synthetic example also appears in a familiar real dataset.

The reported estimators all produce values close to zero, matching the reference graph’s null effect. But their bootstrap variances differ sharply:

Density	Prediction	Bootstrap variance
$P(P38 \mid do(Mek))$	0.022723	0.000572
$P(P38)$	-0.018090	0.001029
$P(P38 \mid do(Mek,Raf))$	-0.006157	0.000265
$P(P38 \mid do(Akt,Mek))$	0.021494	0.000574
$P(P38 \mid do(Erk))$	-0.004451	0.000065
$P(P38 \mid do(Akt,Erk))$	-0.004954	0.000064
$P(P38 \mid do(Akt,Erk,Mek,Raf))$	-0.007231	0.000262
$P(P38 \mid do(Akt,Erk,Jnk))$	-0.001808	0.000016
$P(P38 \mid do(Akt,Jnk))$	-0.007302	0.000269
$P(P38 \mid do(Akt))$	-0.019925	0.001032
$P(P38 \mid do(Erk,Jnk))$	-0.001539	0.000017

The difference between the largest and smallest reported bootstrap variances is roughly a factor of 64. That is not a rounding error. That is the sort of difference that determines whether a result looks stable, whether a manager trusts it, and whether an experiment gets funded twice.
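
The factor can be checked directly from the table. The snippet below only re-enters the reported bootstrap variances and divides the largest by the smallest; nothing is re-estimated.

```python
# Bootstrap variances copied from the table above.
bootstrap_variance = {
    "P(P38 | do(Mek))": 0.000572,
    "P(P38)": 0.001029,
    "P(P38 | do(Mek,Raf))": 0.000265,
    "P(P38 | do(Akt,Mek))": 0.000574,
    "P(P38 | do(Erk))": 0.000065,
    "P(P38 | do(Akt,Erk))": 0.000064,
    "P(P38 | do(Akt,Erk,Mek,Raf))": 0.000262,
    "P(P38 | do(Akt,Erk,Jnk))": 0.000016,
    "P(P38 | do(Akt,Jnk))": 0.000269,
    "P(P38 | do(Akt))": 0.001032,
    "P(P38 | do(Erk,Jnk))": 0.000017,
}

largest = max(bootstrap_variance, key=bootstrap_variance.get)
smallest = min(bootstrap_variance, key=bootstrap_variance.get)
ratio = bootstrap_variance[largest] / bootstrap_variance[smallest]

print(largest, "vs", smallest, "ratio", round(ratio, 1))
# P(P38 | do(Akt)) vs P(P38 | do(Akt,Erk,Jnk)) ratio 64.5
```

Because standard errors scale with the square root of the variance, a 64-fold gap in variance corresponds to roughly an 8-fold gap in standard error for the same data.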

The appendix adds adjustment sets for these estimators. The smallest variances are associated with adjustment sets involving $Mek$, $PKA$, and $PKC$, with or without $PIP3$. The unadjusted estimator and some other alternatives have much higher variance. Again, the message is not “always use this adjustment set.” The message is that the equivalent-expression space contains materially different estimator designs.

The appendix examples are extensions, not a second thesis

The supplementary material does three things worth separating.

First, it provides proofs for the structural results: the component bounds, rule redundancy, commutativity properties, normal form, graphical equivalence criterion, and diameter bound. These are the paper’s theoretical foundation.

Second, it gives additional details for the synthetic and Sachs experiments. These are implementation and reproducibility details: data generation, estimator construction, bootstrap variance computation, and adjustment-set reporting.

Third, Appendix D.3 presents a more complex causal graph and applies the ID algorithm via the causaleffect R package to equivalent interventional queries within the same connected component. This yields eight distinct formulae, each with different statistical properties.

That final appendix example is an exploratory extension. It is useful because it shows how derivation graphs can feed the ID algorithm in cases beyond simple adjustment. It does not yet solve the larger problem of searching over all algebraically equivalent ID-derived formulae. The authors explicitly note why: once one includes arbitrary probabilistic identities, the transformation graph can become infinite. You can always multiply by an identity such as:

$$ \sum_z P(z \mid do(x), w) = 1 $$

without changing the expression.

That is a polite mathematical way of saying: the rabbit hole has no floor unless someone defines a disciplined subclass.

What this means for business causal systems

The obvious audience for this paper is causal-inference researchers. The less obvious audience is anyone building decision systems where causal claims drive real action: drug discovery, marketing attribution, platform experimentation, pricing, credit policy, supply-chain intervention, and automated scientific design.

The immediate business relevance is not “better causal inference” in the vague brochure sense. It is a specific layer in the causal workflow: estimand enumeration and audit.

A practical system inspired by this paper would sit between causal-query specification and estimator deployment.

Workflow stage	Conventional behaviour	Derivation-graph-informed behaviour
Define query	Ask for $P(y \mid do(x))$	Ask where the query sits in an equivalence component
Identify effect	Accept one formula from an ID or adjustment method	Enumerate equivalent expressions first
Select estimator	Use the first valid estimand or familiar adjustment set	Compare variance, data availability, intervention feasibility, and cost
Audit assumptions	Trace one derivation path	Inspect the component and the d-separation tests supporting equivalence
Plan experiments	Treat interventions as fixed by the original query	Search for equivalent, cheaper, or more feasible intervention designs

This matters in experimentation because not all interventions cost the same. In biology or medicine, intervening on one variable may be feasible while intervening on another is expensive, unethical, or impossible. In product analytics, some variables are controlled by the platform while others are merely observed. In policy, some levers are available to the decision-maker and others are decorative PowerPoint mythology.

Derivation graphs allow a more disciplined question:

Which equivalent causal expression gives us the best estimator under our data, cost, and intervention constraints?

That is a much better question than:

Did the ID algorithm return a formula?

The second question is necessary. The first one is how adults spend money.
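
As a sketch of what asking the first question might look like in code, the snippet below filters a candidate set by feasibility and budget and then ranks by variance. Everything about it is an assumption for illustration: the `Candidate` fields, the `rank` helper, the cost figures, and the feasibility flags are invented. Only the three variance values are taken from the Sachs table earlier in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    expression: str
    variance: float   # e.g. bootstrap variance of the associated estimator
    cost: float       # hypothetical cost of the data or intervention needed
    feasible: bool    # hypothetical flag: can we actually run or observe this?

def rank(candidates, budget):
    """Keep feasible candidates within budget; lowest variance first,
    with cost as the tie-breaker."""
    usable = [c for c in candidates if c.feasible and c.cost <= budget]
    return sorted(usable, key=lambda c: (c.variance, c.cost))

candidates = [
    Candidate("P(P38 | do(Mek))", 0.000572, cost=1.0, feasible=True),
    Candidate("P(P38 | do(Erk))", 0.000065, cost=2.0, feasible=True),
    # Lowest variance, but marked infeasible in this invented scenario.
    Candidate("P(P38 | do(Akt,Erk,Jnk))", 0.000016, cost=5.0, feasible=False),
]

ranking = rank(candidates, budget=3.0)
best = ranking[0]
print(best.expression)   # P(P38 | do(Erk))
```

The ordering rule is deliberately naive. A real system would need a defensible way to trade variance against cost, but even this version makes the choice explicit instead of inheriting it from whichever formula came out first.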

The result applies after the graph, not before it

The main boundary is straightforward: derivation graphs assume a causal graph. They do not learn the graph. They do not rescue a bad graph. If the DAG, or the acyclic directed mixed graph (ADMG) used when latent confounders are present, is wrong, the equivalence structure is wrong in the usual and unpleasant way.

There are other boundaries.

The full connected component can be exponentially large in the number of non-outcome variables. The paper’s four-test equivalence criterion helps with pairwise equivalence, and the normal form constrains derivations, but enumeration can still be expensive in large systems.

The estimation examples are illustrative. The synthetic case uses a linear Gaussian model. The Sachs case is a standard benchmark with a biologically validated graph, but it is still one domain-specific demonstration. The paper does not provide a universal estimator-selection algorithm that ranks every equivalent expression by finite-sample performance.

The framework covers only equivalence under do-calculus rules over interventional and observational expressions. It does not yet provide a finite graph for every algebraic manipulation produced by ID-style reductions, marginalisations, products, and divisions. The authors point to this as future work, and correctly so. Once arbitrary probability identities are allowed, finite enumeration becomes much harder.

These boundaries do not weaken the paper’s core contribution. They define where it should be used: as a structural and diagnostic layer for causal reasoning, not as a one-click causal oracle. The industry already has enough one-click causal oracles. Most of them are linear regression wearing sunglasses.

The strategic lesson: causal equality is not operational equality

The paper’s quiet achievement is to move causal reasoning from proof-by-route to proof-by-space.

A single derivation tells you one way to reach an estimand. A derivation graph tells you what else was reachable, which expressions were equivalent, where the rules commute, and which alternatives might produce better estimators. That difference is not cosmetic. It is the difference between accepting a causal formula and managing an estimand portfolio.

For business users, the lesson is simple: once a causal effect is identifiable, the next question is not merely whether the formula is correct. It is whether this formula is the one you should estimate.

Two expressions may be equal in the causal model. One may still require more data, introduce more noise, depend on harder measurements, or obscure a cheaper experimental path. The derivation graph is a way to see those alternatives before the organisation commits to the wrong estimand with great confidence and a tasteful dashboard.

That is the paper’s useful provocation. Do-calculus is not only a logic of intervention. It is also a routing system. And in applied causal inference, routing determines the bill.

References

Cognaptus: Automate the Present, Incubate the Future.

¹ Clément Yvernes, Emilie Devijver, Marianne Clausel, and Eric Gaussier, “Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs,” arXiv:2606.03719, 2026, https://arxiv.org/pdf/2606.03719.
