From Causal Parrots to Causal Counsel: When LLMs Argue with Data

Causal claims are cheap now.

A model can look at variable names such as advertising spend, web traffic, sales conversion, and customer churn, then produce a causal story in seconds. The story may even sound sensible. That is precisely the problem. In business analytics, “sensible” is often the polite costume worn by “untested.”

The paper “Leveraging Large Language Models for Causal Discovery: a Constraint-based, Argumentation-driven Approach” does not ask whether LLMs can talk about causality. They obviously can. Some parrots are very articulate. The more useful question is whether their causal suggestions can be made disciplined enough to support causal discovery without turning the whole exercise into an expensive horoscope with arrows.¹

The answer proposed by the paper is not to trust the LLM. It is to put the LLM on the witness stand.

The authors introduce ABAPC-LLM, a hybrid causal discovery pipeline that treats LLM-generated causal directions as defeasible semantic constraints. The LLM proposes required and forbidden causal directions. A consensus filter keeps only the suggestions that appear consistently across multiple runs. Data-derived conditional-independence evidence then defines and narrows the search space. Finally, a Causal Assumption-Based Argumentation framework decides which assumptions survive.

That mechanism matters more than the headline result. The paper is not saying, “LLMs discover causality.” It is saying something more useful and less glamorous: LLMs can generate structured causal priors, but those priors should be filtered, challenged, and overruled when the evidence does not support them.

A small disappointment for anyone hoping to replace the statistics team with a prompt. A rather important result for anyone building auditable AI-assisted analytics.

The useful role for the LLM is not oracle; it is prior generator

Classical causal discovery tries to recover causal structure from data, often represented as a directed acyclic graph, or DAG. In such a graph, variables become nodes and causal relations become directed edges. The difficulty is that observational data rarely tells the full story by itself. Conditional-independence tests can reveal which variables behave independently under certain conditioning sets, but they often identify only a Markov equivalence class rather than a unique causal graph.
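A standard three-variable illustration (a textbook example, not taken from the paper) shows why. Writing $\perp$ for statistical independence and assuming the usual faithfulness condition, the two chains and the fork below all imply the same single conditional independence, so no conditional-independence test can tell them apart. Only the collider leaves a different footprint.

$$ X \rightarrow Y \rightarrow Z, \qquad X \leftarrow Y \leftarrow Z, \qquad X \leftarrow Y \rightarrow Z \quad \Longrightarrow \quad X \perp Z \mid Y $$

$$ X \rightarrow Y \leftarrow Z \quad \Longrightarrow \quad X \perp Z, \quad X \not\perp Z \mid Y $$

The first three graphs form one Markov equivalence class. Choosing among them requires something other than observational independence patterns, which is exactly the gap that outside knowledge is meant to fill.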

This is why causal discovery has always wanted expert knowledge. Human experts may know that disease onset cannot be caused by a later diagnostic result, that birth year cannot be caused by income, or that a downstream KPI cannot temporally cause the event that produced it. Such knowledge can orient edges that data alone leaves ambiguous.

The traditional problem is how to insert this knowledge. Hard constraints are brittle: if the expert is wrong, the algorithm may faithfully enforce a mistake. Bayesian priors are more flexible but can be opaque and hard to audit. Causal ABA offers a third route: encode assumptions, expose contradictions, and search for a stable set of claims.

The paper’s move is to use the LLM as a scalable but unreliable expert. It reads variable names and optional descriptions, then proposes two types of structural priors:

| LLM output | Meaning | Practical interpretation |
| --- | --- | --- |
| Required direction | $X \rightarrow Y$ should hold | The LLM believes one variable should causally precede another |
| Forbidden direction | $X \rightarrow Y$ should not hold | The LLM believes that direction violates logic, temporal order, or domain knowledge |

The key word is believes. ABAPC-LLM does not let belief become law immediately. That is the whole point.
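To make "belief, not law" concrete, here is a minimal sketch of how such claims could be represented as data rather than as graph edits. The `Kind` and `Constraint` names and the `contradictions` helper are hypothetical illustrations, not structures from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    REQUIRED = "required"    # the LLM believes X -> Y should hold
    FORBIDDEN = "forbidden"  # the LLM believes X -> Y should not hold


@dataclass(frozen=True)
class Constraint:
    """One LLM-proposed directional claim (hypothetical structure)."""
    cause: str
    effect: str
    kind: Kind

    def __post_init__(self):
        if self.cause == self.effect:
            raise ValueError("a variable cannot cause itself in a DAG")


def contradictions(constraints):
    """Return directed pairs that are claimed as both required and forbidden."""
    required = {(c.cause, c.effect) for c in constraints if c.kind is Kind.REQUIRED}
    forbidden = {(c.cause, c.effect) for c in constraints if c.kind is Kind.FORBIDDEN}
    return required & forbidden


claims = [
    Constraint("ad_spend", "web_traffic", Kind.REQUIRED),
    Constraint("ad_spend", "web_traffic", Kind.FORBIDDEN),
    Constraint("churn", "ad_spend", Kind.FORBIDDEN),
]
conflicting = contradictions(claims)
```

Keeping each claim as an immutable record means it can later be accepted, defeated, or logged without ever having silently changed the graph.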

The pipeline has three gates before an LLM claim affects the graph

The paper’s mechanism is best understood as a sequence of gates. Each gate removes a different failure mode.

First, the LLM is instructed to produce only high-precision causal judgments. The prompt asks for required and forbidden directions based on logic, temporal precedence, or established mechanisms, and explicitly tells the model not to include uncertain relations. This is already a design choice: the authors prefer fewer but cleaner constraints.

Second, the paper separates reasoning from extraction. The main LLM generates the causal-prior text; a lighter schema-enforcing model extracts structured constraints from that text. This is a small but important engineering pattern. Asking one model to reason freely while another enforces format is often more robust than pretending a single prompt can make language models both subtle and obedient. Anyone who has asked an LLM for strict JSON and received a motivational essay knows the issue.
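The extraction side of that pattern can be sketched as a strict validator that refuses anything outside a fixed shape. The JSON layout below (`required` and `forbidden` lists of `[cause, effect]` pairs) is an assumed schema for illustration; the paper's actual schema is not reproduced here.

```python
import json


def parse_constraints(raw_text, variables):
    """Validate extractor output against a fixed, hypothetical schema.

    Expected shape: {"required": [[cause, effect], ...], "forbidden": [...]}.
    Raises ValueError for anything else, including free-form prose.
    """
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"extractor output is not JSON: {exc}") from exc
    if not isinstance(payload, dict) or set(payload) != {"required", "forbidden"}:
        raise ValueError("expected exactly the keys 'required' and 'forbidden'")

    known = set(variables)
    parsed = {}
    for key in ("required", "forbidden"):
        entries = payload[key]
        if not isinstance(entries, list):
            raise ValueError(f"'{key}' must be a list")
        pairs = set()
        for entry in entries:
            is_pair = (
                isinstance(entry, list)
                and len(entry) == 2
                and all(isinstance(v, str) for v in entry)
            )
            if not is_pair:
                raise ValueError(f"'{key}' entries must be [cause, effect] string pairs")
            cause, effect = entry
            if cause == effect or cause not in known or effect not in known:
                raise ValueError(f"'{key}' has an unknown or self-referential pair: {entry}")
            pairs.add((cause, effect))
        parsed[key] = pairs
    return parsed


variables = ["ad_spend", "web_traffic", "churn"]
good = parse_constraints(
    '{"required": [["ad_spend", "web_traffic"]], "forbidden": [["churn", "ad_spend"]]}',
    variables,
)
```

The validator also rejects variable names that are not in the dataset, which catches a quieter failure than malformed JSON: a well-formed constraint about a variable that does not exist.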

Third, the authors query the primary LLM five times and take the intersection of the resulting required and forbidden sets. In simple terms:

$$ R_{\text{consensus}} = \bigcap_{i=1}^{5} R_i $$

The same idea applies to forbidden constraints. A causal direction survives only if it appears in every independent run. This is not a magic truth detector. It is a conservative variance-reduction device. It sacrifices recall for precision, which is sensible in causal discovery because a false structural constraint can remove the true graph from consideration.

Only after this consensus step are the surviving constraints integrated into Causal ABA.
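The consensus step itself is a plain set intersection. The sketch below applies it to five made-up runs; the variable names and run contents are invented for illustration and the same function would serve for forbidden sets.

```python
def consensus(runs):
    """Keep only the directed pairs that appear in every run."""
    runs = [set(run) for run in runs]
    if not runs:
        return set()
    return set.intersection(*runs)


# Five hypothetical runs of required directions for the same variable list.
required_runs = [
    {("ad_spend", "web_traffic"), ("web_traffic", "sales")},
    {("ad_spend", "web_traffic"), ("web_traffic", "sales"), ("sales", "churn")},
    {("ad_spend", "web_traffic")},
    {("ad_spend", "web_traffic"), ("web_traffic", "sales")},
    {("ad_spend", "web_traffic"), ("web_traffic", "sales")},
]
required_consensus = consensus(required_runs)
```

Here `("web_traffic", "sales")` appears in four of five runs and is still dropped. That is the recall cost of demanding unanimity, stated in one line of output.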

Data gets the first veto

The most important design detail is easy to miss: the LLM constraints do not enter the system as unconstrained commands.

The authors first perform a data-driven skeleton reduction. High-confidence conditional-independence evidence removes edges from the graph. If an edge has already been removed by this statistical reduction, an LLM-required arrow on that edge is discarded. In other words, if the data gives strong evidence that two variables are not adjacent, the LLM does not get to resurrect the relationship simply because the variable names sound persuasive.

After this reduction, LLM constraints guide orientation among the remaining possibilities. Required arrows are enforced only when the corresponding edge survives. Forbidden arrows restrict certain directions among the remaining edges.

This ordering is the paper’s governance logic in miniature:

1. Use data to remove implausible adjacencies.
2. Use high-precision LLM priors to guide what remains.
3. Use argumentation to handle conflict rather than hiding it.

That is a very different architecture from “ask the LLM for a graph.” It is closer to a courtroom than a chatbot.
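The veto described in this section can be sketched as a filter over undirected skeleton edges. The function name and the treatment of forbidden arrows on removed edges (dropped as redundant) are assumptions made for this illustration, not details confirmed by the paper.

```python
def apply_data_veto(skeleton_edges, required, forbidden):
    """Keep only LLM constraints whose edge survived skeleton reduction.

    skeleton_edges: iterable of (a, b) pairs, treated as undirected.
    required, forbidden: sets of directed (cause, effect) pairs.
    """
    surviving = {frozenset(edge) for edge in skeleton_edges}
    kept_required = {(x, y) for x, y in required if frozenset((x, y)) in surviving}
    dropped_required = set(required) - kept_required
    # Assumption: a forbidden arrow on an already-removed edge has nothing left to restrict.
    active_forbidden = {(x, y) for x, y in forbidden if frozenset((x, y)) in surviving}
    return kept_required, dropped_required, active_forbidden


skeleton = {("ad_spend", "web_traffic"), ("web_traffic", "sales")}
required = {("ad_spend", "web_traffic"), ("ad_spend", "churn")}
forbidden = {("sales", "web_traffic"), ("churn", "sales")}
kept, dropped, active = apply_data_veto(skeleton, required, forbidden)
```

Returning the dropped set separately, rather than discarding it silently, keeps a record of which LLM claims the data overruled.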

Causal ABA makes conflict visible instead of sweeping it under the model

Causal Assumption-Based Argumentation matters because causal discovery is not just a graph search problem; it is also a conflict-management problem.

Conditional-independence tests can be noisy. Expert claims can be wrong. LLM claims can hallucinate. A causal discovery pipeline that cannot represent contradiction will either ignore the conflict or quietly encode it as a modelling artifact. Neither is ideal when the output may inform business decisions.

Causal ABA represents candidate assumptions and rules in an argumentation framework. Assumptions can attack one another when they imply contraries. The solver searches for a stable extension: a defensible set of assumptions that can coexist and support a DAG. In ABAPC, the non-LLM version, conditional-independence facts from MPC are ranked by credibility, and weaker facts can be progressively relaxed until a stable extension is found.

ABAPC-LLM adds semantic constraints to this framework. The benefit is not just better accuracy. It is provenance. Accepted and defeated claims can be traced back to the assumptions and evidence that supported or undermined them.

For business use, that distinction is not decorative. If a model suggests that marketing spend causes churn reduction, a manager should not only ask whether the edge appears in the final graph. The manager should ask why it survived. Was it supported by conditional-independence evidence? Was it merely suggested by variable semantics? Did an LLM-proposed direction get overruled? These are governance questions, not academic footnotes.

The CauseNet benchmark attacks the memorization problem directly

The paper’s second major contribution is methodological: it introduces a synthetic evaluation protocol grounded in CauseNet.

This matters because standard causal discovery benchmarks such as Asia, Sachs, and Cancer are widely published. If an LLM performs well on them, the result may reflect memorization rather than causal reasoning. That does not make the result useless, but it does make it less comforting. A student who has seen the answer key is not necessarily a genius.

Anonymizing variables would avoid memorization, but it would also destroy the semantic signal that the LLM needs. The authors therefore take a more interesting route: they generate random DAG structures, then ground those structures in real concepts from CauseNet, a knowledge graph of cause-effect relations.

The pipeline has three parts:

| Stage | What it does | Why it matters |
| --- | --- | --- |
| Structural scaffolding | Generates random DAGs using Erdős-Rényi, Scale-Free, or Lower-Triangle methods | Creates structurally diverse target graphs |
| Semantic grounding | Finds isomorphic subgraphs in CauseNet | Gives variables meaningful causal semantics without using standard benchmarks |
| Heuristic selection | Scores candidate subgraphs by semantic compactness, node specificity, and structural-semantic correlation | Avoids absurd concept sets and overly generic hub nodes |

The result is a set of semantically meaningful but structurally novel causal graphs. The paper uses 54 synthetic Bayesian networks, covering 5, 10, and 15 nodes, different edge densities, three graph types, and three semantic-selection heuristics. Each dataset uses 5,000 observational samples and is repeated across 50 random seeds.

This is not a perfect guarantee against contamination. Individual CauseNet facts may still appear in model pretraining. But it makes verbatim retrieval of a known benchmark DAG much less plausible. For evaluating LLM-assisted causal discovery, that is a serious upgrade.
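The structural scaffolding stage can be approximated with a few lines of standard-library Python. The sketch below samples a random DAG by fixing a node order and keeping each forward pair with a given probability, which is one common way to get an Erdős-Rényi-style DAG; whether this matches the paper's generators exactly is an assumption. The semantic grounding stage (matching the structure to CauseNet subgraphs) is not attempted here.

```python
import random


def random_ordered_dag(n_nodes, edge_prob, seed=None):
    """Sample edges i -> j (i < j) independently with probability edge_prob."""
    rng = random.Random(seed)
    return {
        (i, j)
        for i in range(n_nodes)
        for j in range(i + 1, n_nodes)
        if rng.random() < edge_prob
    }


def is_acyclic(n_nodes, edges):
    """Kahn's algorithm: a graph is a DAG if every node can be peeled off in order."""
    indegree = [0] * n_nodes
    children = [[] for _ in range(n_nodes)]
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = [v for v in range(n_nodes) if indegree[v] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed == n_nodes


dag = random_ordered_dag(n_nodes=10, edge_prob=0.3, seed=0)
```

Because every edge points from a lower index to a higher one, acyclicity holds by construction; the `is_acyclic` check is there as a guard for graphs that come from elsewhere.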

The main evidence: ABAPC-LLM improves structural recovery across graph sizes

The headline empirical result is straightforward: on the CauseNet synthetic datasets, ABAPC-LLM achieves the lowest normalized Structural Hamming Distance and the highest F1-score across graph sizes. The paper reports especially large margins on 5-node and 15-node problems relative to MPC, FGS, NOTEARS-MLP, GRaSP, BOSS, the LLM-only baseline, and random graphs. The gains remain positive on 10-node graphs, although the gap narrows.
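For readers who want the two headline metrics pinned down, here is a minimal sketch on a made-up four-node example. The SHD here counts each node pair whose edge status differs (missing, extra, or reversed) once, and the normalization by the number of node pairs is an assumption; the paper's exact normalization may differ.

```python
def edge_scores(true_edges, pred_edges):
    """Precision, recall, and F1 over directed edges."""
    tp = len(true_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def shd(true_edges, pred_edges):
    """Count node pairs whose edge status (none, forward, backward) differs."""
    def status(edges, pair):
        a, b = sorted(pair)
        return ((a, b) in edges, (b, a) in edges)

    pairs = {frozenset(edge) for edge in true_edges | pred_edges}
    return sum(status(true_edges, p) != status(pred_edges, p) for p in pairs)


def normalized_shd(true_edges, pred_edges, n_nodes):
    """SHD divided by the number of node pairs (an assumed normalization)."""
    return shd(true_edges, pred_edges) / (n_nodes * (n_nodes - 1) / 2)


truth = {("A", "B"), ("B", "C"), ("A", "D")}
guess = {("A", "B"), ("C", "B"), ("C", "D")}  # one correct, one reversed, one extra, one missing
precision, recall, f1 = edge_scores(truth, guess)
distance = shd(truth, guess)
scaled = normalized_shd(truth, guess, n_nodes=4)
```

In this example the guess differs from the truth on three node pairs (B-C reversed, A-D missing, C-D extra), giving an SHD of 3 and a normalized value of 0.5 under the assumed denominator. Lower SHD and higher F1 are both better, which is why the paper reports them together.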

The authors also report that two-sample unequal-variance tests, with Benjamini-Hochberg (BH) correction for multiple comparisons, show statistically significant improvements over baselines for both normalized SHD and F1 across graph sizes. Additional appendix results show the same broad ordering for precision, recall, and Structural Intervention Distance (SID).
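The Benjamini-Hochberg correction mentioned here is a standard step-up procedure, and it is short enough to show in full. The p-values below are invented for illustration and are not the paper's.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per hypothesis using the BH step-up procedure.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Made-up p-values for three comparisons against baselines.
flags = benjamini_hochberg([0.02, 0.03, 0.9], alpha=0.05)
```

The smallest p-value, 0.02, misses its own threshold of 0.05 × 1/3, but the second, 0.03, clears 0.05 × 2/3, so both are rejected. That "step-up" behaviour is what separates BH from a plain Bonferroni cut, which would reject neither.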

Here is the evidence map, because not every figure in a paper is doing the same job:

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main CauseNet SHD/F1 results | Main evidence | ABAPC-LLM improves graph reconstruction on semantically grounded synthetic DAGs | That the method scales to large enterprise graphs |
| Precision, recall, and SID appendix results | Robustness / additional metric check | The result is not limited to one structural metric | That inferred effect sizes are accurate |
| BH-corrected statistical tests | Significance check | The observed differences are unlikely to be random variation in these experiments | That assumptions hold in all domains |
| Runtime appendix | Implementation practicality | The method remains practical up to 15-node graphs, excluding external LLM API latency | That real-time large-scale deployment is solved |
| bnlearn benchmark results | Comparison with prior work and leakage warning | The method remains competitive or superior on standard benchmarks | That LLM performance there is free from memorization |
| LLM constraint-quality tables | Ablation / component analysis | Descriptions and consensus materially affect constraint quality | That consensus is always better than higher-recall alternatives |

This distinction matters because the paper’s most interesting claim is not merely “our bar is higher than their bar.” The deeper claim is that semantic priors help most when they are good enough and when the data signal is reliable enough to interact with them.

The interaction result is the part businesses should actually remember

The paper examines how the quality of LLM-derived constraints interacts with the quality of data-derived conditional-independence (CI) evidence. The heatmap analysis shows a clear pattern: high-quality LLM constraints produce consistent F1 gains when the underlying CI information is also reliable. When statistical evidence is weak, noisy LLM outputs are either ignored or mildly harmful.

That sounds modest. It is also the most operationally useful result in the paper.

The system does not succeed because the LLM is always right. It succeeds because the pipeline is designed to make LLM knowledge useful when it aligns with reliable evidence and less dangerous when it does not. This is the difference between AI assistance and AI theatre.

The appendix also sharpens the picture. Providing semantic descriptions alongside variable names generally improves LLM constraint quality. For forbidden constraints on bnlearn, the paper reports F1 increasing from 0.53 to 0.61 under the average single-run method, and from 0.34 to 0.39 under consensus. For required constraints on bnlearn, the average method’s F1 increases from 0.40 to 0.56 with descriptions.

Consensus behaves as expected: it produces fewer constraints, improves precision when zero-constraint cases are excluded, and lowers recall. For example, forbidden constraints on bnlearn without descriptions reach consensus precision of 1 when zero-constraint cases are removed, but the paper is clear that this comes with lower recall.

That is not a bug. It is the chosen risk posture.

In causal discovery, a high-recall LLM that sprays plausible arrows everywhere is not helpful. It is a graph vandal with good grammar.
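The phrase "when zero-constraint cases are excluded" hides a small scoring decision that changes the headline number. The sketch below shows the arithmetic on two made-up datasets. Treating a required arrow as correct when it appears in the true edge set, and scoring an empty proposal as 0 in the alternative convention, are both assumptions for illustration; the paper's scoring rules may differ.

```python
import math


def mean_precision(cases, exclude_empty=True):
    """Average per-dataset precision of proposed required arrows.

    cases: list of (proposed_arrows, true_edges) pairs of sets.
    A dataset with no proposals has undefined precision, so it is either
    skipped (exclude_empty=True) or scored as 0.0 (exclude_empty=False).
    """
    scores = []
    for proposed, truth in cases:
        if not proposed:
            if not exclude_empty:
                scores.append(0.0)
            continue
        scores.append(len(proposed & truth) / len(proposed))
    return sum(scores) / len(scores) if scores else math.nan


cases = [
    ({("a", "b")}, {("a", "b"), ("b", "c")}),  # one proposal, and it is right
    (set(), {("x", "y")}),                      # consensus left nothing to score
]
excluding = mean_precision(cases, exclude_empty=True)
including = mean_precision(cases, exclude_empty=False)
```

The same two datasets yield a mean precision of 1.0 or 0.5 depending on the convention. Neither is wrong, but a reader comparing consensus against single-run methods should know which one is on the table.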

Standard benchmarks look good, but perhaps too good

The bnlearn benchmark results are useful, but the paper treats them with appropriate suspicion. On common datasets such as Sachs, Asia, and Cancer, both ABAPC-LLM and the LLM-only baseline perform strongly. In several cases, the LLM-only baseline achieves 100% recall; the paper also notes suspiciously strong SID behavior on those same datasets.

The authors interpret this carefully: standard benchmarks may be contaminated because they are public and likely present in LLM training corpora. This is why the CauseNet protocol is not just an appendix curiosity. It is central to the credibility of the evaluation.

For enterprise AI evaluation, this lesson travels well. If a vendor demo works beautifully on famous benchmark tasks, the correct response is not applause. The correct response is: “What has the model not already seen?”

The CauseNet approach is one way to answer that question for causal discovery. Generate structurally novel but semantically meaningful test cases. Preserve the kind of context LLMs need while reducing the chance that they are simply replaying known answers.

What the paper directly shows

The paper directly shows three things.

First, consensus-filtered LLM constraints can improve a Causal ABA pipeline when combined with conditional-independence evidence. The improvement appears across the synthetic CauseNet experiments and remains visible in additional metrics beyond SHD and F1.

Second, LLM-derived constraints are not uniformly reliable. Their value depends on elicitation quality, semantic descriptions, consensus filtering, and interaction with data-derived evidence. The LLM is useful because it contributes a complementary signal, not because it has become a causal oracle after reading a variable list.

Third, evaluation design matters. Standard benchmarks alone are not enough when LLMs may have seen benchmark structures during training. The CauseNet-grounded synthetic protocol is a practical attempt to test generalization without removing the semantic information that makes LLMs useful in the first place.

These are strong but bounded claims. They justify interest. They do not justify handing the causal graph to a chatbot and calling it science.

What Cognaptus infers for business use

For business analytics, the immediate value is not automated causal truth. It is cheaper structured diagnosis.

Many firms already have messy causal questions:

| Business setting | Typical causal question | How this paper’s mechanism could help |
| --- | --- | --- |
| Marketing attribution | Which campaign activities plausibly affect conversion rather than merely correlate with it? | LLMs translate campaign metadata into candidate priors; data and argumentation test whether those priors survive |
| Credit and risk modelling | Which borrower attributes are causes, proxies, downstream outcomes, or forbidden decision variables? | Semantic constraints help flag impossible or governance-sensitive directions |
| Operations analytics | Which process delays are upstream causes and which are symptoms? | Variable descriptions can generate temporal and mechanistic constraints for process graphs |
| Customer retention | Which interventions plausibly reduce churn rather than just identify churn-prone customers? | Causal graph construction becomes more auditable before intervention design |
| Compliance analytics | Which model explanations are causally defensible rather than merely predictive? | Defeated assumptions and accepted constraints provide traceable reasoning artifacts |

The business relevance is therefore not “LLMs replace analysts.” It is “LLMs can help analysts formulate candidate causal assumptions faster, provided those assumptions enter a system that can reject them.”

That has ROI implications, but not the cheap kind. The savings would come from reducing the expert bottleneck in early causal modelling, making assumptions auditable, and accelerating hypothesis generation. The risk control comes from refusing to treat fluent language as evidence.

A reasonable enterprise version of this architecture would look like this:

1. Data analysts prepare variables, descriptions, and known temporal constraints.
2. An LLM proposes required and forbidden causal directions.
3. Consensus or confidence filtering removes unstable suggestions.
4. Statistical tests construct the data-derived skeleton.
5. An argumentation or constraint solver integrates evidence and priors.
6. Human reviewers inspect accepted and defeated assumptions before using the graph for intervention analysis.

This is not a fully autonomous causal scientist. It is a disciplined assistant with a leash. In many firms, that would already be an upgrade.
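What a reviewer would actually inspect in such an architecture is an audit trail. The sketch below covers only the filtering steps before the solver and labels each LLM-proposed required arrow with what happened to it. The function name, status strings, and example data are hypothetical; the solver and human-review steps are out of scope here.

```python
def triage_llm_claims(runs, skeleton_edges):
    """Label every proposed required arrow with its fate before the solver.

    runs: list of sets of directed (cause, effect) pairs, one set per LLM run.
    skeleton_edges: undirected edges that survived the statistical tests.
    """
    surviving = {frozenset(edge) for edge in skeleton_edges}
    proposed = set().union(*runs) if runs else set()
    stable = set.intersection(*map(set, runs)) if runs else set()
    audit = {}
    for claim in sorted(proposed):
        if claim not in stable:
            audit[claim] = "dropped: not in every run"
        elif frozenset(claim) not in surviving:
            audit[claim] = "dropped: edge removed by data"
        else:
            audit[claim] = "passed to solver"
    return audit


runs = [
    {("ad_spend", "web_traffic"), ("web_traffic", "sales"), ("ad_spend", "churn")},
    {("ad_spend", "web_traffic"), ("ad_spend", "churn")},
    {("ad_spend", "web_traffic"), ("ad_spend", "churn"), ("sales", "churn")},
]
skeleton = {("ad_spend", "web_traffic"), ("web_traffic", "sales"), ("sales", "churn")}
audit = triage_llm_claims(runs, skeleton)
```

Every claim the LLM ever made ends up in the log with a reason attached, including the ones that never reached the graph. That is the artifact a governance review needs.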

The boundary conditions are not small print

The limitations are not ceremonial. They define where the paper can and cannot be used.

The experiments are based on observational data. Conditional-independence evidence alone generally identifies causal structure only up to a Markov equivalence class, and the paper relies on semantic constraints to help orient edges within that ambiguity. This is useful, but it does not remove the usual assumptions behind causal discovery.

The paper also assumes causal sufficiency: no unobserved confounders. In business settings, this is a heroic assumption. Hidden factors are everywhere: macro conditions, competitor actions, unmeasured customer intent, operational policy changes, and the ancient corporate variable known as “someone changed the spreadsheet.”

The synthetic Bayesian networks use binary variables and randomly initialized conditional probability tables. The authors are explicit that these CPTs are for generating data consistent with the sampled structure, not for realistic effect sizes. So the evaluation targets structural recovery, not calibrated causal effects.
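A minimal sketch of that kind of data generation, using only the standard library, looks like this. Drawing each conditional probability uniformly at random is an assumption; the paper says the tables are randomly initialized but the exact distribution is not reproduced here, and the three-variable graph is invented for illustration.

```python
import itertools
import random


def random_cpts(parents, rng):
    """For each node, draw P(node = 1 | parent values) for every parent configuration."""
    return {
        node: {
            values: rng.random()
            for values in itertools.product((0, 1), repeat=len(node_parents))
        }
        for node, node_parents in parents.items()
    }


def sample(parents, cpts, order, n_samples, rng):
    """Ancestral sampling: visit nodes in topological order, parents first."""
    rows = []
    for _ in range(n_samples):
        row = {}
        for node in order:
            key = tuple(row[p] for p in parents[node])
            row[node] = int(rng.random() < cpts[node][key])
        rows.append(row)
    return rows


# Hypothetical three-node chain; each node maps to a tuple of its parents.
parents = {"ad_spend": (), "web_traffic": ("ad_spend",), "sales": ("web_traffic",)}
order = ["ad_spend", "web_traffic", "sales"]
rng = random.Random(0)
cpts = random_cpts(parents, rng)
rows = sample(parents, cpts, order, n_samples=5000, rng=rng)
```

The tables only guarantee that the samples respect the graph's dependency structure. Nothing about them resembles a real conversion rate or churn probability, which is why results on such data speak to structure recovery and not to effect sizes.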

Scale is another boundary. The synthetic experiments cover graphs up to 15 nodes. The runtime analysis indicates practicality at that size, excluding external LLM API latency. Many enterprise causal questions can be decomposed into small subgraphs, so this is not fatal. But it does mean the paper should be read as a strong mechanism demonstration, not as a proof of enterprise-scale causal automation.

Finally, the model and prompt choices matter. The paper uses Gemini models, schema-guided extraction, prompt refinement, and a specific five-run consensus strategy. Different models, domains, descriptions, and prompts may shift the precision-recall trade-off.

None of these limitations weaken the paper’s core idea. They prevent the wrong interpretation of it. Which, given the topic, is almost poetic.

The strategic lesson: causal AI needs institutions, not just models

The deeper business lesson is that causal AI will not be made reliable by model size alone.

Causal reasoning requires institutions inside the system: procedures for proposing claims, checking evidence, resolving conflict, recording provenance, and rejecting unsupported assumptions. ABAPC-LLM is interesting because it treats the LLM as one participant in such an institution. The model speaks, but it does not rule.

That design philosophy generalizes beyond causal discovery. Many AI systems fail not because the model is useless, but because the surrounding process is intellectually lazy. They accept outputs without adversarial checks. They confuse confidence with validity. They evaluate on benchmarks that may already be inside the model’s memory. Then everyone acts surprised when the system behaves like a gifted intern with no supervision.

The paper offers a better pattern:

| Bad pattern | Better pattern |
| --- | --- |
| Ask the LLM for the final causal graph | Ask the LLM for auditable candidate constraints |
| Treat fluent explanations as evidence | Treat explanations as assumptions requiring support |
| Use public benchmarks as comfort blankets | Build semantically meaningful, less-memorized evaluation tasks |
| Hide conflicts inside model outputs | Make conflicts explicit through argumentation |
| Optimize only for answer generation | Optimize for traceable decision support |

This is where the title’s move from “causal parrots” to “causal counsel” becomes precise. Counsel does not decide the case. Counsel presents arguments. Evidence is examined. Contradictions are tested. A judgment is reached through procedure.

That is much closer to how causal AI should work.

Conclusion: the LLM should argue, not adjudicate

The paper’s contribution is not that LLMs suddenly understand causality in the deep philosophical sense. Thankfully, it does not try to sell that story. Its contribution is more practical: LLMs can generate useful semantic causal priors when those priors are filtered for stability, constrained by data, and adjudicated by a symbolic argumentation framework.

For businesses, the lesson is direct. Do not use LLMs as causal decision-makers. Use them as structured prior generators inside a pipeline that can expose, challenge, and reject their claims.

The future of AI-assisted causal analytics will not be a single model drawing arrows with supreme confidence. It will be a system in which language models, statistical tests, symbolic solvers, and human reviewers each do a narrower job.

Less magical. More useful.

A familiar trade-off. Also the correct one.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zihao Li and Fabrizio Russo, “Leveraging Large Language Models for Causal Discovery: a Constraint-based, Argumentation-driven Approach,” arXiv:2602.16481, https://arxiv.org/abs/2602.16481.
