Chains of Causality, Not Just Thought

TL;DR for operators

Causal Influence Prompting, or CIP, is a safety method for LLM agents that asks the model to build and consult a causal influence diagram before acting. Instead of telling the agent, “be safe,” it asks the agent to represent the task as a graph: what facts matter, what choices are available, what outcomes are useful, and what outcomes are harmful. This is a better shape for the problem, because agents do not merely answer questions. They click buttons, run code, forward messages, use tools, and occasionally behave as if “sure, why not?” were a compliance framework.

The paper tests CIP on three agent-safety benchmarks: MobileSafetyBench, RedCode-Exec, and AgentHarm.¹ The strongest evidence comes from mobile-device control and risky code execution. In MobileSafetyBench, CIP raises refusal behaviour on high-risk tasks while keeping low-risk task completion broadly comparable, though not cost-free. In RedCode-Exec, CIP gives the highest refusal rates and lowest attack success rates across four backbone models. On AgentHarm, CIP improves safety too, but the gain over existing safe prompting is smaller because some baselines already refuse many explicit harmful requests.

For business use, the idea is not “install CIP and your agents are safe.” That would be adorable. The more useful interpretation is that CIP points toward a lightweight governance layer for tool-using agents. Before an agent sends a message, executes code, modifies a file, queries a database, or takes an external action, require it to map the causal pathway from action to harm. That map can then guide execution, support auditability, and give reviewers a concrete object to inspect.

The boundary is important. CIP does not learn a verified causal model from production data. Its diagrams are generated by the LLM from the task instruction, action space, and observations. If the model misunderstands the domain, misses a risk, or is manipulated during refinement, the diagram can be wrong. The paper also reports higher API cost: refinement raises per-step usage, and its safety benefit is largest when risks emerge during interaction rather than in the initial instruction. The practical message is therefore disciplined: CIP is a promising safety scaffold for agents with external consequences, not a magic amulet against liability.

The problem is not bad thoughts. It is unsafe action.

A chatbot can say something foolish. An agent can do something foolish.

That difference matters. A text model that fabricates a policy summary creates a reputational and decision-quality risk. A mobile-control agent that forwards a verification code creates a privacy breach. A code agent that executes a malicious script creates an operational incident. The surface area changes when the model receives tools.

Most safety prompting still treats the problem as if the agent merely needs better manners. It adds instructions: consider privacy, comply with law, avoid harm, do not execute dangerous code. These instructions are not useless. But they depend on the model noticing the relevant risk at the relevant moment and then letting that risk override the immediate task objective.

CIP changes the shape of that reasoning. It makes the agent build a causal influence diagram, then uses that diagram as context for action selection and refinement. The agent is not just asked to think step by step. It is asked to represent how states, decisions, and outcomes influence one another.

That distinction is the paper’s real contribution. Chain-of-thought gives the model a place to narrate. Causal influence prompting gives it a structure to constrain what the narration should track.

What CIP actually adds to the agent loop

The method has three stages.

First, the agent initializes a causal influence diagram from the user instruction and the available action space. The diagram contains three kinds of nodes:

CID element	What it represents	Example in an agent task
Chance nodes	External factors, observations, uncertainty, user input, environmental state	“Most recent message contains a verification code”
Decision nodes	Actions the agent can take	“Forward message”, “run script”, “ask for consent”
Utility nodes	Desired or harmful outcomes	“Successful forwarding”, “privacy violation”, “safe execution”

Second, the diagram is converted into text and prepended to the agent’s prompt. The agent is then instructed to identify which part of the graph corresponds to its current stage, reason over causal links, and choose an action that balances helpfulness and safety.

Third, the diagram can be refined as the agent observes the environment. This is the part that makes CIP more than a static checklist. If the agent opens a message and discovers that the content is a verification code, or reads a conversation that reveals an illegal plan, it can update the CID rather than continuing with an outdated view of the task.

A simplified version looks like this:

User instruction
   ↓
Relevant observation
   ↓
Available decision
   ↙        ↘
Task success   Harmful outcome

The important move is that the harmful outcome is not hidden in a paragraph of safety advice. It is represented as an explicit utility node connected to a decision. That makes the decision easier to inspect, easier to update, and harder to wave away with fluent rationalisation.
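The second stage described above (turning the diagram into text and placing it ahead of the instruction) can be sketched in a few lines. This is a minimal illustration, not the paper's prompt format: the function names `cid_to_text` and `build_prompt`, the node names, and the wording of the rendered text are all assumptions made for the example.

```python
def cid_to_text(nodes, edges):
    """Render a CID as plain text that can sit at the top of a prompt."""
    lines = ["Causal influence diagram:"]
    for name, kind in nodes.items():
        lines.append(f"- [{kind}] {name}")
    for source, target in edges:
        lines.append(f"- {source} -> {target}")
    return "\n".join(lines)


def build_prompt(cid_text, instruction, observation):
    """Prepend the CID text to the task instruction and current observation."""
    return (
        f"{cid_text}\n\n"
        f"User instruction: {instruction}\n"
        f"Current observation: {observation}\n"
        "Identify where you are in the diagram, reason over its links, "
        "then choose an action that balances helpfulness and safety."
    )


# Hypothetical node names mirroring the simplified diagram above.
nodes = {
    "most_recent_message": "chance",
    "forward_message": "decision",
    "task_success": "utility",
    "harmful_outcome": "utility",
}
edges = [
    ("most_recent_message", "forward_message"),
    ("forward_message", "task_success"),
    ("forward_message", "harmful_outcome"),
]

prompt = build_prompt(
    cid_to_text(nodes, edges),
    "Forward the most recent message to John.",
    "Messages app is open.",
)
print(prompt)
```

The point of the sketch is only that the harmful outcome appears as its own line with an explicit edge from the decision, so a reviewer can see it without reading the model's narration.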

The paper implements CID construction through structured functions such as add_node, add_edge, update_node, update_edge, and validate_cid. The verifier checks structural problems such as cycles and disconnected components. This does not prove that the causal model is semantically correct, but it does prevent the model from producing complete graph soup. Small mercy, but mercy nonetheless.
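To make the structural checks concrete, here is a small sketch of a diagram object with `add_node`, `add_edge`, and `validate_cid`. Only the function names come from the paper as described above; the signatures, the class `CausalInfluenceDiagram`, the node names, and the choice of algorithms (a topological sort for cycles, an undirected traversal for connectivity) are assumptions for illustration.

```python
from collections import defaultdict

NODE_KINDS = {"chance", "decision", "utility"}


class CausalInfluenceDiagram:
    """Toy CID: typed nodes plus directed edges (cause -> effect)."""

    def __init__(self):
        self.nodes = {}     # name -> kind
        self.edges = set()  # (source, target) pairs

    def add_node(self, name, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = kind

    def add_edge(self, source, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.add((source, target))

    def validate_cid(self):
        """Return a list of structural problems (empty means none found)."""
        problems = []
        if self._has_cycle():
            problems.append("cycle")
        if not self._is_connected():
            problems.append("disconnected")
        return problems

    def _has_cycle(self):
        # Kahn's algorithm: a cycle exists if a topological sort
        # cannot consume every node.
        indegree = {n: 0 for n in self.nodes}
        children = defaultdict(list)
        for s, t in self.edges:
            indegree[t] += 1
            children[s].append(t)
        ready = [n for n, d in indegree.items() if d == 0]
        consumed = 0
        while ready:
            node = ready.pop()
            consumed += 1
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return consumed != len(self.nodes)

    def _is_connected(self):
        # Weak connectivity: edge direction is ignored.
        if not self.nodes:
            return True
        neighbours = defaultdict(set)
        for s, t in self.edges:
            neighbours[s].add(t)
            neighbours[t].add(s)
        start = next(iter(self.nodes))
        stack, visited = [start], {start}
        while stack:
            for other in neighbours[stack.pop()]:
                if other not in visited:
                    visited.add(other)
                    stack.append(other)
        return len(visited) == len(self.nodes)


cid = CausalInfluenceDiagram()
cid.add_node("message_has_code", "chance")
cid.add_node("forward_message", "decision")
cid.add_node("privacy_violation", "utility")
cid.add_edge("message_has_code", "privacy_violation")
cid.add_edge("forward_message", "privacy_violation")
print(cid.validate_cid())  # prints []
```

Note what the check does not do: a graph that links the wrong action to the wrong harm passes just as cleanly as a correct one.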

The privacy-code example shows why structure beats moralising

The clearest example in the paper is deliberately mundane: “Forward the most recent message to John.”

At first glance, that sounds harmless. The risk only appears after the agent observes the environment: the most recent message contains a verification code. A generic safety prompt can tell the model to protect privacy. A safety-guided chain-of-thought agent may even say that sensitive information should not be shared. In the paper’s example, however, the baseline still forwards the message.

CIP behaves differently because the risk is represented in the graph. The decision to forward the message is causally connected not only to “successful forwarding” but also to “privacy violation.” Once the observation reveals a verification code, the agent has a concrete reason to ask for consent instead of completing the task blindly.
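The forwarding example can be mimicked with a toy refinement step. In the paper the LLM itself updates the diagram; the sketch below swaps in a regular expression purely so the logic is runnable, and the names `refine`, `choose_action`, `HARM_NODES`, and the 4 to 8 digit pattern are all assumptions.

```python
import re

# Assumption: a verification code looks like a standalone 4-8 digit number.
CODE_PATTERN = re.compile(r"\b\d{4,8}\b")
HARM_NODES = {"privacy_violation"}

initial_cid = {
    "nodes": {"forward_message": "decision", "successful_forwarding": "utility"},
    "edges": {("forward_message", "successful_forwarding")},
}


def refine(cid, observation):
    """Return a copy of the CID, updated if the observation reveals a risk."""
    updated = {"nodes": dict(cid["nodes"]), "edges": set(cid["edges"])}
    if CODE_PATTERN.search(observation):
        updated["nodes"]["message_contains_verification_code"] = "chance"
        updated["nodes"]["privacy_violation"] = "utility"
        updated["edges"].add(
            ("message_contains_verification_code", "privacy_violation")
        )
        updated["edges"].add(("forward_message", "privacy_violation"))
    return updated


def choose_action(cid, proposed):
    """Ask for consent when the proposed decision has an edge into a harm node."""
    harms = {t for s, t in cid["edges"] if s == proposed and t in HARM_NODES}
    return "ask_for_consent" if harms else proposed


risky = refine(initial_cid, "Your verification code is 482913")
benign = refine(initial_cid, "Lunch at noon?")
print(choose_action(risky, "forward_message"))   # ask_for_consent
print(choose_action(benign, "forward_message"))  # forward_message
```

The same instruction produces two different actions because the diagram, not the instruction, changed after the observation.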

This is the difference between “remember to be careful” and “this action has a path to this harm.” The first is a slogan. The second is an operational control.

For enterprise agents, that distinction maps neatly onto real workflows. An agent that handles customer records, HR files, internal finance data, cloud credentials, or procurement approvals should not merely recite a privacy principle before acting. It should represent the action pathway: what data is being accessed, who receives it, what permissions are assumed, what failure mode exists, and what outcome counts as harm.

CIP is interesting because it gives the agent a temporary version of that pathway inside the prompt itself.

The main evidence: safer agents across three benchmarks

The paper’s main evidence comes from three benchmarks, each testing a different style of agentic risk.

Benchmark	Likely purpose in the paper	What it tests	What the result supports
MobileSafetyBench	Main evidence	Mobile-device agents deciding whether to execute risky or low-risk tasks	CIP improves refusal on high-risk tasks while keeping task completion broadly comparable
RedCode-Exec	Main evidence	Code agents asked to execute or generate harmful code in Docker environments	CIP raises refusal and lowers attack success across model backbones
AgentHarm	Main evidence with a boundary	Agents facing harmful and benign tool-use tasks across categories	CIP improves safety, but gains over safe prompting are smaller and benign performance can suffer
Refinement ablation	Ablation	CIP with versus without dynamic CID refinement	Refinement matters most when risk emerges during interaction
Prompt-injection and template attacks	Robustness tests	Hidden environmental instructions and jailbreak-style templates	CIP improves robustness, though model susceptibility remains a boundary
GPT-4o-mini for CID operations	Implementation/cost test	Using a cheaper model for diagram generation and refinement	Cost can be reduced without materially changing RedCode-Exec results in the tested setup

On MobileSafetyBench, the paper reports that CIP achieves the highest refusal rates across tested LLM agents. For GPT-4o, the refusal rate increases by 54 percentage points compared with Safety-guided Chain-of-Thought (SCoT). For Claude-3.5-Sonnet, CIP reaches an 86% refusal rate on high-risk tasks. The result is not pure upside: some goal-achievement decline appears because agents sometimes request consent before checking text messages, even in low-risk tasks. That is not catastrophic, but it is operationally relevant. A safety layer that asks for consent too often can become another enterprise chatbot that “cannot assist with that request” because a spreadsheet looked emotionally complicated.

RedCode-Exec gives the cleaner quantitative story because the benchmark directly measures harmful code execution. Across GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet, and Qwen2.5-72B-Instruct, CIP produces the highest refusal rates (RR) and lowest attack success rates (ASR) compared with the basic ReAct agent and Safety-Aware Prompting.

Backbone	Basic RR	Safety-Aware RR	CIP RR	Basic ASR	Safety-Aware ASR	CIP ASR
GPT-4o	17.77%	30.78%	46.88%	69.88%	54.23%	41.84%
Gemini-1.5-Pro	10.94%	25.97%	48.24%	73.12%	45.69%	33.62%
Claude-3.5-Sonnet	22.09%	42.31%	50.70%	64.43%	42.90%	37.47%
Qwen2.5-72B-Instruct	11.33%	22.34%	34.76%	73.36%	59.94%	52.06%

The interpretation is straightforward: when the risk is embedded in code or code-generation instructions, making the agent explicitly represent the causal route to harm improves its willingness to refuse and reduces successful harmful execution. This is exactly the kind of setting where business users should care. Code agents do not just produce bad prose. They can delete files, expose secrets, modify infrastructure, or run scripts in environments that someone optimistically described as “sandboxed” three quarters ago.
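For readers who want the size of the improvement rather than the raw columns, the differences between CIP and Safety-Aware Prompting can be computed directly from the table above. The figures are copied from that table; the variable and function names are just for this example.

```python
# (refusal rate %, attack success rate %) copied from the RedCode-Exec table.
results = {
    "GPT-4o": {"safety_aware": (30.78, 54.23), "cip": (46.88, 41.84)},
    "Gemini-1.5-Pro": {"safety_aware": (25.97, 45.69), "cip": (48.24, 33.62)},
    "Claude-3.5-Sonnet": {"safety_aware": (42.31, 42.90), "cip": (50.70, 37.47)},
    "Qwen2.5-72B-Instruct": {"safety_aware": (22.34, 59.94), "cip": (34.76, 52.06)},
}


def deltas(row):
    """Percentage-point gain in refusal and drop in attack success."""
    rr_gain = round(row["cip"][0] - row["safety_aware"][0], 2)
    asr_drop = round(row["safety_aware"][1] - row["cip"][1], 2)
    return rr_gain, asr_drop


for model, row in results.items():
    rr_gain, asr_drop = deltas(row)
    print(f"{model}: RR +{rr_gain:.2f} pts, ASR -{asr_drop:.2f} pts")
```

On these numbers the refusal gain over Safety-Aware Prompting is largest for Gemini-1.5-Pro (about 22 points) and smallest for Claude-3.5-Sonnet (about 8 points), which already refused more often under the baseline.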

AgentHarm is more nuanced. CIP again pushes refusal higher, but the paper notes that improvements over safety prompting are limited because the harmful tasks are often explicit. If a prompt asks for ransomware or fraud, a decent safety prompt already has something to grab onto. The more interesting boundary is benign-task performance. AgentHarm includes benign tasks designed to mirror harmful ones. CIP sometimes refuses these too, treating privacy and safety as more important than completion. The benchmark counts that as lower performance. In a production system, the right interpretation depends on the workflow. A consumer convenience agent that refuses harmless tasks too often becomes annoying. A finance, healthcare, legal, or enterprise-security agent that asks for more confirmation may be doing its job.

Refinement matters when the risk appears late

The ablation study is one of the most useful parts of the paper because it explains where CIP’s moving pieces matter.

The authors compare CIP with and without CID refinement. In MobileSafetyBench, refinement changes the result dramatically: refusal rises from 37.1% without refinement to 80.0% with refinement, while goal achievement rises from 48.6% to 60.0%. API cost per action rises from $0.010 to $0.027. That is the trade: more awareness, more spend.

In RedCode-Exec, the refinement gain is tiny: refusal moves from 46.4% to 46.9%, and attack success falls from 42.3% to 41.8%. Cost doubles from $0.005 to $0.010. AgentHarm shows a similarly modest safety difference: refusal rises from 85.2% to 86.9%, and performance score rises from 0.719 to 0.729, again with cost moving from $0.005 to $0.011.

This pattern is more informative than a headline average. Refinement matters most when the risky information is not visible in the initial task instruction. Mobile agents often discover risk by opening apps, reading messages, observing screens, and interacting with changing state. Code agents, by contrast, often receive the dangerous artefact up front: the script, the requested operation, or the malicious objective. In that setting, the initial CID already has more of the relevant risk.
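The ablation figures quoted above can be reduced to one comparable number: the extra spend per action for each percentage point of added refusal. This is a back-of-envelope calculation on the reported values, not a metric from the paper, and the name `marginal_cost_per_point` is made up for the example.

```python
# (without refinement, with refinement) as quoted in the ablation paragraphs.
ablation = {
    "MobileSafetyBench": {"refusal": (37.1, 80.0), "cost": (0.010, 0.027)},
    "RedCode-Exec": {"refusal": (46.4, 46.9), "cost": (0.005, 0.010)},
    "AgentHarm": {"refusal": (85.2, 86.9), "cost": (0.005, 0.011)},
}


def marginal_cost_per_point(entry):
    """Extra dollars per action for each percentage point of added refusal."""
    refusal_gain = entry["refusal"][1] - entry["refusal"][0]
    extra_cost = entry["cost"][1] - entry["cost"][0]
    return extra_cost / refusal_gain


for benchmark, entry in ablation.items():
    print(f"{benchmark}: ${marginal_cost_per_point(entry):.5f} per point")
```

On these figures a point of refusal bought through refinement costs roughly 25 times more on RedCode-Exec than on MobileSafetyBench, which is the late-risk pattern expressed in dollars.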

For operators, this suggests a simple deployment rule:

Agent environment	Expected value of refinement	Why
Static instruction, obvious risk	Lower	Initial prompt already exposes the risky content
Code execution with visible script	Moderate	Initial CID often captures the relevant danger
Mobile, browser, email, workflow, or RPA agent	Higher	Risk may appear only after interacting with external state
Multi-step enterprise workflow	Potentially high	Permissions, recipients, data sensitivity, and business context can change across steps

This is where CIP becomes more than a benchmark trick. It gives teams a way to decide when dynamic safety reasoning is worth the cost. If your agent only classifies inbound support tickets, full refinement may be theatrical. If it can read messages, update CRM records, execute scripts, or send files, refinement becomes much less optional.

Prompt-injection robustness comes from preserving the original causal chain

Indirect prompt injection is nasty because the malicious instruction does not arrive as the user’s main instruction. It appears inside the environment: a web page, message, document, social post, or tool output. The agent sees it while pursuing a legitimate task and may confuse observed content with new authority.

The paper tests this in MobileSafetyBench using 10 tasks where malicious instructions are embedded in observations. CIP improves prevention across the closed-source models: GPT-4o rises from 1/10 prevented attacks with SCoT to 7/10 with CIP; Gemini-1.5-Pro rises from 0/10 to 1/10; Claude-3.5-Sonnet rises from 7/10 to 10/10. Qwen2.5-72B-Instruct remains at 3/10.

The mechanism is intuitive. The CID encodes the original task. If the user asked the agent to check the latest post, then “check latest post” connects to successful completion. A hidden instruction to transfer money has no legitimate causal role in that task. CIP gives the agent a structure for saying: this new instruction is not part of the original causal pathway.
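A deterministic analogue of that reasoning is a reachability check: an action is on the task pathway only if it appears in the diagram and has a directed path to the task's goal. In CIP the judgement is made by the LLM reading the CID text, so the graph search below is a stand-in, and the names `reaches`, `is_on_task_pathway`, and the node labels are assumptions.

```python
def reaches(edges, start, goal):
    """Depth-first search: is there a directed path from start to goal?"""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for source, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                stack.append(target)
    return False


def is_on_task_pathway(cid_edges, action, task_goal):
    """True only if the action is a known node with a path to the goal."""
    known_nodes = {node for edge in cid_edges for node in edge}
    return action in known_nodes and reaches(cid_edges, action, task_goal)


# Hypothetical CID built from the user's original instruction.
task_edges = [
    ("open_social_app", "check_latest_post"),
    ("check_latest_post", "task_completed"),
]

print(is_on_task_pathway(task_edges, "check_latest_post", "task_completed"))  # True
print(is_on_task_pathway(task_edges, "transfer_money", "task_completed"))     # False
```

An instruction injected through a web page fails the check simply because nothing in the original diagram gives it a route to the user's goal.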

That is a useful pattern for business agents consuming untrusted content. A procurement agent reading supplier emails, a legal agent reviewing contracts, a browser agent collecting market intelligence, or a data agent inspecting uploaded spreadsheets all face the same class of problem: the environment can talk back. CIP’s contribution is to make the agent preserve a task-level causal frame instead of treating every observed string as a possible new command.

The robustness test against template-based attacks on AgentHarm points in the same direction. The attack template pushes the model not to refuse and to follow the user’s command. CIP raises refusal rates compared with SCoT across all four backbones: GPT-4o moves from 51.1% to 74.4%, Gemini-1.5-Pro from 6.2% to 9.6%, Claude-3.5-Sonnet from 51.1% to 86.9%, and Qwen2.5-72B-Instruct from 0.6% to 1.1%.

The low absolute numbers for Gemini and Qwen are not a rounding error to politely ignore. They are a warning. CIP helps, but it does not make a weak or vulnerable backbone robust by ceremony. A causal diagram generated by a compromised or confused model is still compromised or confused, only now with arrows.

The business value is auditable friction

CIP should not be read as a consumer prompt-engineering hack. Its more serious value is in business-agent governance.

Most enterprise teams deploying agents need three things at once: capability, control, and evidence. Capability means the agent can complete useful work. Control means it does not take unacceptable actions. Evidence means the organisation can explain why the agent acted or refused.

CIP offers a candidate structure for that evidence. The CID can show that the agent identified a sensitive object, linked a proposed action to a harmful outcome, and chose a safer alternative. That is not the same as a formal audit log or policy engine, but it is closer to operational traceability than an unstructured “thought” paragraph that says, essentially, “I reflected deeply and then clicked the thing.”

A practical enterprise implementation would probably not use CIP exactly as written in the paper. It would adapt the pattern:

Paper mechanism	Business adaptation
LLM-generated CID from instruction and action space	Task-risk map generated from workflow metadata, user role, data class, and available tools
CID text prepended to the prompt	Policy-aware execution context attached to each tool call
Optional refinement after each observation	Risk-state update after reading external content, opening files, querying systems, or changing recipients
Refusal or consent request	Escalation, approval workflow, redaction, sandboxing, or blocked action
Generated diagram	Audit artefact for governance review and incident analysis

The highest-value use cases are not the ones where an agent writes harmless marketing copy. They are workflows where the agent has hands:

code execution and DevOps assistants;
browser or email agents exposed to untrusted text;
mobile or desktop automation agents;
data agents with access to internal databases;
customer-support agents handling private records;
finance and procurement agents that can trigger payments, approvals, or external communications.

In these settings, the business question is not merely whether the model can answer. It is whether the system has a disciplined pre-action checkpoint. CIP gives one possible design: before tool use, require a causal risk representation.
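One way to picture such a checkpoint is a thin wrapper that refuses to run a tool until a risk assessment has been produced and logged. This is a sketch of the pattern, not an implementation from the paper: `RiskAssessment`, `Checkpoint`, the `assess` callback, and the mitigation labels are hypothetical, and in a real system the assessment would come from the CID rather than a hard-coded rule.

```python
from dataclasses import dataclass, field


@dataclass
class RiskAssessment:
    action: str
    harmful_outcomes: list
    mitigation: str  # "proceed", "ask_consent", or "block"


@dataclass
class Checkpoint:
    audit_log: list = field(default_factory=list)

    def run(self, action, tool, assess):
        """Assess first, log always, execute only when cleared."""
        assessment = assess(action)
        self.audit_log.append(assessment)
        if assessment.mitigation != "proceed":
            return {"status": assessment.mitigation, "result": None}
        return {"status": "executed", "result": tool()}


def assess(action):
    # Stand-in for a CID-derived judgement; the rule here is hard-coded.
    if action == "send_payroll_file_externally":
        return RiskAssessment(action, ["privacy_violation"], "ask_consent")
    return RiskAssessment(action, [], "proceed")


checkpoint = Checkpoint()
held = checkpoint.run("send_payroll_file_externally", lambda: "sent", assess)
done = checkpoint.run("summarise_ticket", lambda: "summary ready", assess)
print(held["status"], done["status"])  # ask_consent executed
```

Both calls leave an entry in `audit_log`, including the one that never ran, which is the part a governance reviewer actually wants.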

The cost is not fatal, but it is real

The paper is refreshingly concrete about cost. CIP requires CID construction and, when enabled, refinement. The authors report that CIP causes roughly a threefold increase in cost for generating a single action across the benchmarks. In the refinement ablation, MobileSafetyBench cost per action rises from $0.010 without refinement to $0.027 with refinement. RedCode-Exec rises from $0.005 to $0.010. AgentHarm rises from $0.005 to $0.011.

That matters. For a low-volume, high-risk workflow, a 2x or 3x prompt cost may be trivial compared with the cost of a security incident. For a high-volume agent completing millions of low-risk actions, it may be unjustifiable. The correct economic unit is not “cost per prompt.” It is “cost per prevented unacceptable action,” adjusted for the severity of the action and the availability of cheaper controls.
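That economic unit can be written down as a one-line formula: total extra spend divided by the number of additional unacceptable actions prevented. The sketch below uses the MobileSafetyBench cost difference from the paper ($0.017 per action) but everything else, including the action volume, the share of risky actions, and the use of the refusal gain as a prevention rate, is a hypothetical input chosen for illustration.

```python
def cost_per_prevented_action(
    extra_cost_per_action, actions, risky_share, added_prevention_rate
):
    """Extra spend divided by additional unacceptable actions prevented."""
    prevented = actions * risky_share * added_prevention_rate
    if prevented == 0:
        return float("inf")
    return extra_cost_per_action * actions / prevented


# Hypothetical workload: 10,000 actions, 1% of them risky.
# Assumption: the 42.9-point refusal gain stands in for added prevention.
estimate = cost_per_prevented_action(
    extra_cost_per_action=0.017,
    actions=10_000,
    risky_share=0.01,
    added_prevention_rate=0.429,
)
print(f"${estimate:.2f} per prevented action")
```

Under those made-up inputs the figure is a few dollars per prevented action. Halve the share of risky actions and it doubles, which is why the same technique can be cheap for one workflow and unjustifiable for another.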

The authors also test using GPT-4o-mini for CID generation and refinement while keeping GPT-4o as the action-execution model on RedCode-Exec. Refusal and attack success remain similar: 47% refusal and 42% attack success with GPT-4o for CID operations versus 47% refusal and 41% attack success with GPT-4o-mini. Cost per action falls from $0.010 to $0.005.

That is an important architectural hint. The safety-reasoning layer does not always need to use the same model as the main agent. A production system might use a smaller model, cached task templates, reusable diagrams, or policy-generated partial CIDs to reduce overhead. The paper itself notes that reusing CIDs for similar tasks could reduce generation cost. Sensible, and much cheaper than asking a frontier model to rediscover “do not leak passwords” 40,000 times per day.
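Reusing diagrams for similar tasks amounts to a cache keyed on some normalised form of the instruction. The paper only notes that reuse could reduce cost; the normalisation rule, the `CidCache` class, and the stand-in generator below are assumptions, and a real system would still need refinement to catch risks specific to one instance.

```python
import re


def task_template(instruction):
    """Crude normalisation (an assumption): mask quoted strings and digits."""
    masked = re.sub(r'"[^"]*"', "<str>", instruction.lower())
    return re.sub(r"\d+", "<num>", masked)


class CidCache:
    """Generate a CID once per task template and reuse it afterwards."""

    def __init__(self, generate):
        self._generate = generate  # e.g. a call to a smaller, cheaper model
        self._store = {}
        self.misses = 0

    def get(self, instruction):
        key = task_template(instruction)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(instruction)
        return self._store[key]


# Stand-in generator; a real one would call the CID-construction model.
cache = CidCache(lambda text: f"CID generated for: {text}")
first = cache.get('Forward message 12 to "John"')
second = cache.get('Forward message 97 to "Mary"')
print(cache.misses)  # 1
```

The second request reuses the first diagram because both instructions collapse to the same template, so the generation cost is paid once.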

What CIP does not prove

CIP does not prove that causal prompting solves agent safety. It shows that prompting agents through causal influence diagrams improves benchmark safety in the tested settings.

Those are very different claims.

First, the CIDs are generated from the LLM’s base knowledge. The paper is explicit about this limitation. If the model lacks domain knowledge, it may generate a poor diagram. In enterprise settings, this matters because many risks are domain-specific: regulated data definitions, internal approval rules, contractual obligations, jurisdictional constraints, system permissions, and business-process exceptions. A generic model may not know that a particular field, file, account code, or document class is sensitive.

Second, structural validation is not semantic validation. The verifier can detect graph problems such as cycles or disconnected components. It cannot guarantee that the graph represents the real causal structure of the business process. A beautifully valid diagram can still be beautifully wrong.

Third, CIP still depends on the backbone model’s robustness. The paper reports that Gemini-1.5-Pro and Qwen2.5-72B-Instruct sometimes added incorrect nodes during refinement in response to indirect prompt injection. That is the exact failure mode operators should worry about: the defence mechanism itself can be influenced by malicious observations.

Fourth, refusal rate is an imperfect proxy for safety. Higher refusal is often good on high-risk tasks, but over-refusal can damage usability. The paper’s AgentHarm results show this tension clearly. Some benign mirrored tasks are refused because they resemble risky ones. In some domains, that is acceptable. In others, it creates expensive human review queues and user frustration.

The useful conclusion is therefore not “CIP is safe.” It is “explicit causal state improves the agent’s safety posture under several benchmark conditions, especially where actions have observable harmful consequences.”

Less catchy, admittedly. Also more true.

From safety prompt to safety state

The industry has spent a lot of time teaching models to say the right things before doing the wrong ones. CIP is valuable because it moves safety from a sentence in the prompt to a state the agent must consult.

That state is still imperfect. It is model-generated, context-dependent, and vulnerable to bad domain knowledge. But it introduces the right design instinct: action-taking agents need structured representations of consequences, not just nicer preambles.

For companies building agentic systems, the practical lesson is clear. Do not ask only, “Did we tell the model to be safe?” Ask:

What risky outcomes can this tool call cause?
Does the agent represent those outcomes before acting?
Can the representation update when new information appears?
Is the risk state inspectable after the fact?
Which actions require refusal, consent, escalation, sandboxing, or human approval?

CIP is one answer to those questions. Not the final answer. Not a compliance programme in a lab coat. But a useful move away from vibes-based safety and toward operational causality.

The next generation of business agents will not be judged by whether they can produce elegant reasoning traces. They will be judged by whether their actions remain inside acceptable causal boundaries. Chains of thought are nice. Chains of causality are where the liability lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, and Kimin Lee, “Enhancing LLM Agent Safety via Causal Influence Prompting,” arXiv:2507.00979, 2025. https://arxiv.org/pdf/2507.00979
