The Butterfly Defect: Diagnosing LLM Failures in Tool-Agent Chains

TL;DR for operators

Most LLM agent failures are still discussed as if the model had a grand philosophical lapse: bad reasoning, weak planning, insufficient context, not enough “agenticness” sprinkled on top. This paper points to a less glamorous culprit: parameter filling. A tool-agent chain can fail because the model supplies the wrong field name, omits a required value, invents a value not present in the user request, misreads a tool return, or follows a type description that was wrong in the first place.¹

The practical lesson is not “use a smarter model” and call it a day. Charming, but insufficient. The paper’s evidence suggests that enterprise agent reliability depends on parameter governance: complete tool documentation, type validation, input templates, standardized JSON returns, useful error feedback, and consistency checks between tools in a chain.

The strongest experimental signal is that when user-query parameter information is removed, models tend to drift into task deviation: they still call tools, but the call no longer corresponds to the user’s intent. Wrong parameter type metadata in tool documents is especially damaging for specification mismatch and task deviation. Tool-return perturbations produce lower overall failure rates than query and document perturbations, but malformed or inconsistent returns still disturb downstream invocation. Hallucinated parameter names, meanwhile, appear less sensitive to the tested input perturbations and more tied to the model’s own behaviour.

Cognaptus’s business inference is direct: production agents need parameter observability before they need more theatrical autonomy. Log the arguments. Validate the schema. Preserve user intent through the chain. Refuse unsafe inference when required fields are missing. The dull plumbing is where the expensive failures hide, naturally.

A small parameter error is enough to spoil the chain

A user asks an agent to search for suggestions related to “Bonds” in Australia and then retrieve details for the first result. The agent calls an autocomplete tool. It identifies the right country but passes it as the plain-language “Australia” where the tool expects an abbreviation. Or it uses query where the tool expects q. Or it receives a list of IDs and selects the wrong one for the next call.

None of these errors sounds spectacular. No existential misalignment. No dramatic hallucinated legal memo. No sci-fi incident report.

Just a parameter.

That is the useful discomfort in Butterfly Effects in Toolchains. The paper studies LLM tool agents at the point where natural language is converted into executable arguments. This is the handoff layer between the model’s interpretation of a task and the external system’s unforgiving interface. It is also where many business deployments quietly become unreliable.

The misconception worth killing early is that tool-agent failure is mostly a retrieval or planning problem. Retrieval and planning matter, of course. But a tool can be correctly selected, the chain can be broadly sensible, and the final result can still collapse because one parameter is missing, over-specified, malformed, semantically wrong, or simply named incorrectly.

That is the butterfly defect: a small local flaw in parameter filling changes the behaviour of the downstream chain.

The paper studies the handoff layer, not “agent intelligence” in general

The authors focus on API-type tool invocation in mainstream tool-agent workflows. A typical chain looks roughly like this:

1. Receive a user query.
2. Retrieve relevant tools.
3. Plan a sequence of tool calls.
4. Fill parameters for the next tool.
5. Execute the tool.
6. Parse and aggregate returns.
7. Generate the final answer.

Most agent evaluations compress this into an end-to-end success measure. Did the model answer the task? Did it choose the right tool? Did it finish the chain? Useful, but blunt. A failed final answer does not tell an engineer whether the root problem was retrieval, planning, schema interpretation, return parsing, or parameter propagation.

This paper zooms in on parameter filling. It first constructs a taxonomy of parameter failures using grounded-theory-style coding over tool-agent behaviour traces. Then it perturbs three sources of parameter information: tool documents, user queries, and tool returns. Finally, it tests how those perturbations change failure patterns across GPT-3.5-Turbo, GPT-4o-mini, ToolLLaMA-v2, and Qwen2.5-Plus.

That structure matters. The study is not merely asking, “Which model is best at tool use?” It asks a more operational question: when the parameter environment is degraded in specific ways, which failure modes appear?

For builders, that is a better question. Model rankings age quickly. Failure mechanisms age more slowly.

Five failure modes explain how the defect mutates

The paper’s first contribution is a five-part taxonomy of parameter failure. It is useful because it separates failures that may look similar in a final trace but require different controls.

Failure mode	What happens	Operational consequence
Missing Information	Required parameters are not filled.	The tool may fail directly, return incomplete results, or force the model to guess.
Redundant Information	Extra recognizable parameters are supplied despite not being required by the user.	The tool may still run, but the result is narrowed or distorted. Silent failure’s favourite outfit.
Hallucination Name	The model invents or uses a parameter name the tool does not recognize.	Invocation fails because the interface contract is violated.
Task Deviation	Parameter values are valid-looking but misaligned with the user’s intent.	The chain completes the wrong task, which is worse than crashing because it may look successful.
Specification Mismatch	Parameter values violate required type, range, or format.	The tool may reject the call or return misleading failure feedback.

The distinction between hallucinated names and task deviation is especially important. A hallucinated parameter name is an interface error: the model says page_size when the tool expects something else. Task deviation is subtler: the model supplies a value that the tool can process, but the value no longer expresses the user’s actual request.

Businesses should fear the second category more. A hard failure creates a ticket. A plausible wrong result creates a decision.
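One way to make the five modes concrete is to treat them as checks of a filled call against an interface contract and an oracle call. The sketch below is illustrative only: the paper derives its taxonomy through manual coding of traces, not through this code, and the schema layout, the AUTOCOMPLETE_SCHEMA contents, and the classify_call function are all hypothetical names introduced here.

```python
# Hypothetical contract for an autocomplete tool: name -> (type, required).
AUTOCOMPLETE_SCHEMA = {
    "q": (str, True),
    "region": (str, True),
    "limit": (int, False),
}


def classify_call(schema: dict, oracle: dict, args: dict) -> set:
    """Return the failure modes a filled call shows, relative to an oracle call."""
    failures = set()
    for name, value in args.items():
        if name not in schema:
            failures.add("hallucination_name")      # key unknown to the interface
        elif name not in oracle:
            failures.add("redundant_information")   # known key the task did not need
        elif not isinstance(value, schema[name][0]):
            failures.add("specification_mismatch")  # wrong type for the contract
        elif value != oracle[name]:
            failures.add("task_deviation")          # well-formed value, wrong intent
    for name, (_, required) in schema.items():
        if required and name not in args:
            failures.add("missing_information")     # required key never filled
    return failures
```

With an oracle of {"q": "Bonds", "region": "AU"}, a call that passes query instead of q is flagged as both a hallucinated name and missing information, while a call with region set to "US" is flagged only as task deviation. The second case is the one no interface check will catch without an oracle or a record of user intent.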

The experiments are defect-injection tests, not generic benchmark decoration

The paper uses correct initial behaviour trajectories as test oracles, then perturbs input sources to see whether parameter behaviour changes. This is closer to software defect injection than to ordinary leaderboard benchmarking.

The study uses ToolBench-derived data, filters out originally unsolvable queries and no-parameter cases, supplements data using GPT-4, and obtains 600 initial behaviour trajectories for each investigated LLM. It then applies 15 perturbation methods to the input sources, producing 9,000 enhanced behavioural trajectories for sensitivity analysis.

The perturbations have different roles. They should not be read as one undifferentiated stress test.

Test area	Perturbations	Likely purpose	What it supports	What it does not prove
Tool documents	Removed description, removed example, wrong description, swapped description, changed order, wrong type	Main evidence on schema/documentation sensitivity	Whether documentation quality changes parameter failure patterns	Whether every production API doc behaves similarly
User queries	Remove first parameter, remove last parameter, complicate parameter, add noise	Main evidence on user-intent preservation	Whether missing or noisy user inputs make the agent infer unsafe parameters	Whether multi-turn clarification would solve the same cases
Tool returns	Fuzz key, apply prefix, camelCase key, underscore key, corrupt JSON	Main evidence on downstream propagation	Whether returned data format affects later parameter filling	Whether robust parsers or typed middleware would eliminate the issue
Rouge-L comparison for task deviation and specification mismatch	Similarity thresholding against oracle behaviour	Severity/sensitivity lens	Whether failures are semantically close to or far from the oracle	A complete measure of business impact
Appendix transfer analysis	Correlation among failure modes	Exploratory extension	Whether failures co-occur and propagate asymmetrically	Clean causal proof of one failure causing another

That last distinction matters. The appendix is useful, but it should not be inflated into a second thesis. It supports the propagation story; it does not turn correlation into a full causal graph. We can leave that kind of overreach to pitch decks with too many arrows.
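For readers unfamiliar with the similarity lens in the table, ROUGE-L scores two token sequences by their longest common subsequence. The sketch below is a minimal F1 variant, assuming whitespace tokenization and a hypothetical threshold of 0.5; the paper's exact tokenization, weighting, and threshold are not reproduced here, and the function names are introduced only for illustration.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as an F1 score over whitespace tokens (an assumed tokenization)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_near_oracle(oracle: str, candidate: str, threshold: float = 0.5) -> bool:
    """Hypothetical severity check: is the filled value close to the oracle value?"""
    return rouge_l_f1(oracle, candidate) >= threshold
```

A score near 1 means the filled value barely moved from the oracle; a score near 0 means the agent produced something unrelated. As the table notes, that is a sensitivity lens and not a measure of business impact: a one-token change to a region code scores as "close" and can still route the chain to the wrong country.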

User-query omissions turn agents into guessers

The strongest result concerns user-query parameter removal. When the experiments remove the first or last parameter from the user query, task deviation rises sharply. In the paper’s Table 1, task-deviation failure rates under these two removal perturbations range from roughly 42% to 59% across the tested models.

This is the mechanism: the model does not simply stop when information is missing. It often uses its generative ability to construct a plausible parameter and continue. That looks productive. It is actually dangerous when the missing value defines the user’s intent.

A human operator might notice that a field is absent and ask a follow-up question. An agent under pressure to complete a toolchain may infer. Once it infers, the downstream tool call can become technically coherent but semantically wrong.

This is why “frictionless” agent UX can backfire. Hiding tool requirements from users may make the interface feel elegant, but it also deprives the user of the chance to provide structured intent. The paper explicitly argues that completely invisible tool operation can be harmful to agent functionality. For enterprise systems, that means user interfaces should not merely accept natural language and pray. They should expose the minimum structured requirements needed for reliable execution.

Cognaptus inference: the right product pattern is not a giant form that destroys the point of an agent. It is adaptive structure. Ask for missing high-risk parameters. Use templates for recurrent tasks. Distinguish optional inference from required intent. Record when the model filled a value from the user versus when it inferred the value itself.

The audit log should not say only “tool called successfully.” It should say where each parameter came from.
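A minimal sketch of that idea, assuming the agent runtime can tag each filled value with its origin: a gate that lets optional values be inferred but refuses to execute when a required field was guessed. FilledParam, ClarificationNeeded, gate_call, and the source labels are hypothetical names, not part of the paper or of any particular framework.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FilledParam:
    name: str
    value: Any
    source: str  # assumed labels: "user", "tool_return", or "inferred"


class ClarificationNeeded(Exception):
    """Raised when a required field has no trustworthy source."""

    def __init__(self, fields: list):
        super().__init__(f"ask the user for: {', '.join(fields)}")
        self.fields = fields


def gate_call(required: set, params: list) -> dict:
    """Allow the call only if every required field came from the user or a tool return."""
    trusted = {p.name for p in params if p.source in ("user", "tool_return")}
    unsafe = sorted(required - trusted)
    if unsafe:
        raise ClarificationNeeded(unsafe)
    return {p.name: p.value for p in params}
```

If region is required and the model inferred it, the gate raises ClarificationNeeded with fields == ["region"], which the interface can turn into a follow-up question. An inferred optional value such as a result limit passes through, and its "inferred" label stays available for the audit log.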

Wrong type metadata is schema sabotage

Tool-document perturbations show another important mechanism. Wrong parameter type descriptions are especially damaging. In the reported results, wrong-type perturbation produces elevated task-deviation rates for several models and specification-mismatch rates around 20% to 24% across the four tested models.

This is not surprising, but it is often under-managed. Tool documentation is treated as developer-facing prose. For LLM agents, tool documentation is executable context. The model reads it as a map for action. If the map says a field is one type when the actual API expects another, the agent may faithfully follow the wrong map.

That is worse than a missing description. Missing information creates uncertainty. Wrong information creates confidence.

The paper’s document perturbations separate several failure sources: missing required descriptions, missing examples, wrong descriptions, swapped descriptions, changed order, and wrong types. Removed descriptions are associated with missing-information failures. Wrong types are more tied to specification mismatch and task deviation. This gives operators a practical hierarchy.

First, validate types. Second, ensure required parameter descriptions are present. Third, add examples where ambiguity is likely. Fourth, treat parameter order and description correspondence as part of the contract, especially for models that are sensitive to layout.

The boring engineering move is to generate tool documentation from typed schemas rather than maintaining schema and natural-language documentation separately. If the schema says region must be an ISO-like abbreviation but the prose examples show full country names, congratulations: the agent now has two sources of truth. It will choose one at runtime, because chaos enjoys A/B testing.
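As a rough illustration of a single source of truth, the sketch below renders the model-facing description directly from a typed parameter spec, so the prose cannot drift from the declared type. ParamSpec, render_tool_doc, and the output layout are assumptions made for this example; a real deployment would more likely emit JSON Schema or a vendor-specific function-calling format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    name: str
    type_name: str
    required: bool
    description: str
    example: str


def render_tool_doc(tool_name: str, params: list) -> str:
    """Build the text shown to the model from the same spec used for validation."""
    lines = [f"Tool: {tool_name}", "Parameters:"]
    for p in params:
        need = "required" if p.required else "optional"
        lines.append(
            f"- {p.name} ({p.type_name}, {need}): {p.description} Example: {p.example}"
        )
    return "\n".join(lines)


# Hypothetical spec for the autocomplete tool used as an example in this article.
AUTOCOMPLETE_PARAMS = [
    ParamSpec("q", "string", True, "Search text.", "Bonds"),
    ParamSpec("region", "string", True, "Two-letter country abbreviation.", "AU"),
]
```

Because the example value lives in the same record as the type and description, a reviewer who changes region to accept full country names has to change the example in the same place.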

Tool returns fail more quietly, but they still move the downstream chain

Tool-return perturbations produce lower overall failure rates than tool-document and user-query perturbations. That does not make them harmless.

Tool returns matter because tool-agent chains are not single calls. The output of one tool often becomes the input to the next. If the first tool returns malformed JSON, inconsistent key names, unexpected prefixes, or truncated content, the model may misread the value it needs for the next invocation.

The paper tests several return perturbations: fuzzing key names, adding prefixes to ID-type values, converting keys to camel case, converting them to underscore notation, and corrupting JSON format. The resulting failure rates are not always dramatic, but the mechanism is operationally familiar. A downstream call depends on a field. The field arrives under a different name, in a different format, with a prefix, or inside a corrupted structure. The model then either drops the parameter, selects the wrong value, or tries to repair the situation with guesswork.
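Teams that want to run a similar stress test on their own tools can approximate three of these return perturbations in a few lines. The functions below are a sketch under stated assumptions, not the paper's implementation: the rule for spotting ID-type keys, the "id_" prefix, and the way JSON is corrupted (dropping the final character) are all choices made here for illustration.

```python
import json


def _to_camel(key: str) -> str:
    head, *tail = key.split("_")
    return head + "".join(part.capitalize() for part in tail)


def camel_case_keys(obj):
    """Rename snake_case keys to camelCase, recursively."""
    if isinstance(obj, dict):
        return {_to_camel(k): camel_case_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [camel_case_keys(v) for v in obj]
    return obj


def prefix_ids(obj, prefix: str = "id_"):
    """Prefix scalar values under keys named 'id' or ending in '_id' (assumed rule)."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            is_id = key == "id" or key.endswith("_id")
            if is_id and isinstance(value, (str, int)):
                out[key] = f"{prefix}{value}"
            else:
                out[key] = prefix_ids(value, prefix)
        return out
    if isinstance(obj, list):
        return [prefix_ids(v, prefix) for v in obj]
    return obj


def corrupt_json(obj) -> str:
    """Serialize, then drop the final closing character so parsing fails."""
    return json.dumps(obj)[:-1]
```

Feeding each perturbed return to the agent and comparing the next call against a known-good trajectory shows which format changes the downstream parameter filling can and cannot absorb.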

The paper notes that corrupted JSON has the highest failure rate for missing information among the tool-return perturbations. It also argues that failure feedback matters: when tools return poor error messages, models cannot adjust effectively after failure.

That is a design lesson, not just a model lesson. Tool outputs should be boringly consistent. IDs should remain IDs. Error messages should identify which parameter failed and why. Return lengths should avoid truncating structured content mid-object. Cross-tool parameter passing should be designed explicitly, not discovered by the LLM as a kind of interpretive dance.
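On the consuming side, a thin normalization layer can make returns boringly consistent before the model ever sees them. The sketch below assumes snake_case as the house convention and a hypothetical result envelope with ok, data, and error fields; it does not handle the case where two keys collapse to the same normalized name, which a production version would need to detect.

```python
import json
import re


def to_snake(key: str) -> str:
    """Convert camelCase to snake_case; keys already in snake_case pass through."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def snake_keys(obj):
    """Apply to_snake to every dictionary key, recursively."""
    if isinstance(obj, dict):
        return {to_snake(k): snake_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [snake_keys(v) for v in obj]
    return obj


def normalize_return(raw: str) -> dict:
    """Parse a raw tool return into a uniform envelope with a specific error on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {
            "ok": False,
            "error": {"stage": "parse", "detail": exc.msg, "position": exc.pos},
        }
    return {"ok": True, "data": snake_keys(data)}
```

The error branch matters as much as the happy path: it tells the model, and the engineer reading the log, that the return failed to parse and where, instead of handing a truncated string downstream and hoping.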

Cognaptus inference: if your agent chains tools together, you should maintain a parameter lineage graph. Each downstream parameter should be traceable to a user input, tool document field, previous tool return, or model inference. When a result is wrong, the team should be able to locate the mutation point. Otherwise, debugging becomes archaeology with logs.
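A lineage graph does not need special infrastructure to be useful. The sketch below models it as a chain of linked records and walks back from a downstream argument to the earliest inferred value. LineageNode, trace, first_inferred, and the source labels are hypothetical; the step numbers and values in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LineageNode:
    step: int                 # index of the tool call in the chain
    param: str                # argument or returned field name
    value: Any
    source: str               # assumed labels: "user", "tool_doc", "tool_return", "inferred"
    parent: Optional["LineageNode"] = None


def trace(node: Optional[LineageNode]) -> list:
    """Return the lineage from the earliest ancestor to the given node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


def first_inferred(node: LineageNode) -> Optional[LineageNode]:
    """Locate the earliest point where the model guessed a value, if any."""
    for ancestor in trace(node):
        if ancestor.source == "inferred":
            return ancestor
    return None
```

Suppose the detail call at step 2 used an id taken from the first search result, and that search ran with an inferred region. Walking back from the id argument lands on the region guess at step 1, which is the mutation point, even though the failing call itself looks well-formed.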

Hallucinated parameter names are the stubborn residue

One finding deserves restraint. Hallucinated parameter names do not appear to be strongly driven by the tested input-source perturbations. Their rates remain comparatively low and relatively stable across perturbation types, with Qwen2.5-Plus showing higher values than the others but still at low single-digit percentages in the paper’s reported results.

The authors interpret this as evidence that hallucinated parameter-name failures primarily stem from inherent LLM limitations rather than from the external perturbations tested.

This does not mean documentation does not matter. It means that parameter-name hallucination may require a different control layer from the one used for task deviation or specification mismatch. Better prose may reduce ambiguity, but invented field names should be blocked mechanically.

The practical fix is schema-constrained invocation. Let the model select from permitted arguments. Reject unknown keys before tool execution. Use function signatures, JSON Schema, typed wrappers, or tool-call validators. If a model emits page_size and the tool does not accept it, the system should not discover this only after a runtime failure. The wrapper should catch it immediately and either map it safely, ask for repair, or stop.
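The wrapper described above can be very small. The following sketch checks argument names before execution, maps a reviewed alias if one exists, and otherwise stops. ALLOWED, SAFE_ALIASES, UnknownArgumentError, and validate_keys are hypothetical names; in practice the allowed set would come from the tool's schema, and a JSON Schema validator could replace the hand-written check.

```python
ALLOWED = {"q", "region", "limit"}   # hypothetical: taken from the tool's schema
SAFE_ALIASES = {"query": "q"}        # hypothetical: aliases a human has reviewed


class UnknownArgumentError(Exception):
    """Raised before execution when the model emits keys the tool does not accept."""

    def __init__(self, keys: list):
        super().__init__(f"rejected argument names: {', '.join(keys)}")
        self.keys = keys


def validate_keys(args: dict, allowed: set, aliases: dict = None) -> dict:
    """Return args with safe aliases mapped; raise if any key is unknown or collides."""
    aliases = aliases or {}
    clean, rejected = {}, []
    for key, value in args.items():
        target = key if key in allowed else aliases.get(key)
        if target not in allowed or target in clean:
            rejected.append(key)
        else:
            clean[target] = value
    if rejected:
        raise UnknownArgumentError(rejected)
    return clean
```

A call carrying page_size raises UnknownArgumentError with keys == ["page_size"] before anything reaches the tool. The exception message can be fed back to the model as a repair prompt, which is more useful than whatever the remote API would have said.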

This is where “agent autonomy” needs a seatbelt. An LLM should be creative in solving the user’s problem, not creative in inventing API parameters. There are places for imagination. Function signatures are not one of them.

The appendix shows propagation, not isolated accidents

The appendix adds an important piece: failures often co-occur. The authors report that more than half of the failed cases across perturbations exhibit multiple failure patterns. They also find an asymmetric transfer tendency: except for hallucinated parameter names, other failure modes tend to push the chain toward task deviation. Redundant information is particularly implicated in altering tool results and later reasoning.

This supports the “butterfly” framing. A parameter failure is not necessarily a local event. It can change the data returned by a tool, which changes the next parameter, which changes the next call, which changes the final answer. By the time the user sees the output, the original defect may be several hops upstream.

That has consequences for monitoring. A dashboard that counts only final task success will miss the mechanism. A logger that records only tool names will miss the argument mutation. A retry loop that simply asks the model to “try again” may repeat the same faulty assumption.

A useful production trace should include at least:

the user-stated parameter information;
the tool-document source used to fill each field;
the model-filled argument values;
validation results before execution;
tool return schema and parse status;
downstream reuse of returned values;
error messages and repair attempts.

This is not glamorous. Neither is fire insurance.
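The trace fields listed above map naturally onto one structured record per tool call. The sketch below is one possible shape, assuming JSON lines as the log format; ToolCallTrace, its field names, and to_log_line are hypothetical and would need adapting to whatever logging pipeline is already in place.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCallTrace:
    tool: str
    user_stated: dict                    # parameter information taken from the user's words
    doc_sources: dict                    # argument name -> tool-document field used to fill it
    filled_args: dict                    # what the model actually sent
    validation_errors: list = field(default_factory=list)   # pre-execution check results
    return_parse_ok: Optional[bool] = None                   # None until the tool has returned
    return_keys: list = field(default_factory=list)          # top-level keys of the parsed return
    downstream_reuse: list = field(default_factory=list)     # returned fields passed to later calls
    errors_and_repairs: list = field(default_factory=list)   # error messages and retry attempts


def to_log_line(trace: ToolCallTrace) -> str:
    """Serialize one trace as a single JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)
```

With records like this, the question "did the user say AU or did the model decide AU?" becomes a log query comparing user_stated with filled_args, not a reconstruction exercise.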

What this changes for enterprise agent design

The paper directly shows that parameter failures can be categorized, induced through perturbations, and linked to specific sources of input information. Cognaptus infers that enterprises should treat parameter handling as a managed reliability layer, not as incidental prompt behaviour.

Paper result	Operational interpretation	Practical control
User-query parameter removal strongly drives task deviation.	Missing intent often becomes model inference.	Use adaptive clarification, query templates, and required-field gates.
Wrong type descriptions in tool documents drive specification mismatch and task deviation.	Tool docs are executable context, not background prose.	Generate docs from schemas; validate types before execution.
Tool-return perturbations have lower but non-trivial impact.	Downstream calls depend on return format stability.	Standardize JSON, preserve IDs, avoid truncation, validate parsers.
Hallucinated parameter names are comparatively model-inherent.	Better context alone may not stop invented fields.	Enforce allowed-argument schemas and reject unknown keys.
Failures often co-occur and transfer.	The final wrong answer may be several hops away from the first defect.	Maintain parameter lineage and chain-level observability.

The ROI argument is not only higher accuracy. It is cheaper diagnosis.

When an enterprise agent fails, the expensive part is rarely the first visible error. It is the engineering time spent reconstructing what happened. Did the user omit a region? Did the model infer it? Did the tool document imply the wrong type? Did a prior return truncate the ID? Did the model use the label instead of the identifier? Did an error message fail to specify the bad argument?

Without parameter observability, every incident becomes a detective story. Some teams enjoy detective stories. Finance, healthcare, logistics, and compliance teams usually prefer systems that do not require literary interpretation.

The boundary is controlled API tool use

The study’s limitations are not cosmetic; they define where the evidence travels.

First, the experiments mainly use English data. Parameter extraction can behave differently across languages because morphology, syntax, abbreviations, and semantic cues differ. A query template that works in English may not preserve intent in Chinese, Tagalog, Arabic, or Japanese without redesign.

Second, the experiments focus on relatively controlled single-turn scenarios. Multi-turn agents introduce additional failure channels: stale memory, contradictory updates, partial corrections, tool results from earlier turns, and user revisions that override previous parameters.

Third, the scope is API-type tool invocation. Command-line tools, programming libraries, robotic actions, browser automation, and enterprise SaaS workflows may have different parameter semantics. Some accept positional arguments. Some mutate state. Some produce side effects before failure is detected. Lovely, in the same way a trapdoor is architectural.

Fourth, the study evaluates perturbations in a benchmark-like setting. It does not prove that the exact failure-rate mix will appear in every production environment. It does, however, give a practical diagnostic vocabulary and a set of stress tests that teams can adapt.

The right use of the paper is therefore not to copy the percentages into a universal risk model. The right use is to ask: where do our agents obtain parameters, how do those parameters mutate, and what happens when each source is degraded?

The business lesson is less demo theatre, more parameter discipline

The paper is valuable because it moves the reliability conversation down one level of abstraction. Tool agents do not fail only because the model cannot “reason.” They fail because reasoning has to pass through brittle interfaces.

The future of enterprise agents will not be won by models that sound increasingly confident while passing malformed arguments into production systems. It will be won by systems that know when a parameter is user-stated, inferred, missing, invalid, contradicted, or unsafe to guess.

That sounds mundane. Good. Mundane is where production reliability lives.

A mature tool-agent stack should include model capability, yes. It should also include schema governance, type validation, return normalization, argument lineage, repairable error messages, and UI patterns that expose the right amount of structure at the right time.

The agent does not need to become less intelligent. It needs fewer opportunities to be confidently wrong at the interface boundary.

Cognaptus: Automate the Present, Incubate the Future.

¹ Qian Xiong, Yuekai Huang, Ziyou Jiang, Zhiyuan Chang, Yujia Zheng, Tianhao Li, and Mingyang Li, “Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems,” arXiv:2507.15296, 2025. https://arxiv.org/pdf/2507.15296
