Bias Busters: Teaching Language Agents to Think Like Scientists

TL;DR for operators

Language-model agents do not merely make wrong causal guesses. In this paper, they gather evidence in a biased way, then interpret that evidence through the same bias. That is the uncomfortable part.

The study turns the classic Blicket Test from developmental psychology into a text-based active exploration game for LM agents. The agent must test objects, observe whether a machine turns on, then infer which objects are “Blickets” and whether the hidden rule is disjunctive (any Blicket activates the machine) or conjunctive (all relevant Blickets must be present together).¹

The major result is not “LLMs fail a toy benchmark”, which would be a familiar and not especially nourishing meal. The result is more specific: agents systematically favour disjunctive, OR-style causal explanations over conjunctive, AND-style explanations. The bias appears across models, prompt variants, and task sizes. It also survives in a separate inference-only test where models are given standardised exploration data, including data from an InfoGain oracle that can fully resolve the hypothesis space.

For business use, this matters wherever an agent is expected to diagnose, investigate, audit, experiment, or troubleshoot. The failure mode is not just a bad final answer. It is premature causal closure: the agent behaves as if the world probably works in the obvious way, explores less where it feels confident, and therefore misses causal structures that require joint conditions.

The practical lesson is plain. Do not rely on more verbose reasoning, larger models, or charming “think step by step” theatre as your primary safety rail. Build agent workflows that maintain explicit hypothesis inventories, choose actions by expected information gain or hypothesis elimination, and test rare conjunctive failure modes before deployment. Apparently, even agents need to be reminded that science is not the art of believing the first plausible story.

The real failure is biased exploration, not just biased answers

A familiar enterprise version of this problem looks like this: a service incident occurs, an AI assistant reviews logs, finds one suspicious signal, and declares a likely cause. Maybe a cache expired. Maybe a vendor API slowed down. Maybe one configuration flag flipped. The answer sounds tidy.

But many operational failures are conjunctive. The outage happens only when a region-specific configuration, a version mismatch, and an unusual traffic pattern occur together. No single object is the culprit. The causal rule is in the combination.

That is the distinction the paper studies using the Blicket Test. In the original cognitive-science paradigm, subjects interact with objects and a detector. Some objects are Blickets. Depending on the hidden rule, either any Blicket can activate the machine, or a set of Blickets must be present together. The authors translate this into a sequential text game. A language-model agent can put objects on or take them off the machine, observe whether the light turns on, exit when it thinks it knows enough, and then answer which objects are Blickets.
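To make the two rule types concrete, here is a minimal sketch of such an environment. It is not the authors' implementation: the class name `BlicketMachine`, the `step` method, and the integer object indices are illustrative assumptions, and the exit and question-answering phases are left out.

```python
from dataclasses import dataclass, field


@dataclass
class BlicketMachine:
    """Toy environment (hypothetical names): objects go on or off a machine with a light."""
    n_objects: int
    blickets: frozenset      # indices of the true Blickets, hidden from the agent
    rule: str                # "disjunctive" (OR) or "conjunctive" (AND)
    on_machine: set = field(default_factory=set)

    def light_on(self) -> bool:
        present = self.blickets & self.on_machine
        if self.rule == "disjunctive":
            return len(present) >= 1           # any single Blicket is enough
        if self.rule == "conjunctive":
            return present == self.blickets    # every Blicket must be present together
        raise ValueError(f"unknown rule: {self.rule}")

    def step(self, action: str, obj: int) -> bool:
        """Apply 'put' or 'remove' to one object and return the light state."""
        if not 0 <= obj < self.n_objects:
            raise ValueError(f"no such object: {obj}")
        if action == "put":
            self.on_machine.add(obj)
        elif action == "remove":
            self.on_machine.discard(obj)
        else:
            raise ValueError(f"unknown action: {action}")
        return self.light_on()


# Same objects, same Blickets, different hidden rule.
or_machine = BlicketMachine(4, frozenset({0, 2}), "disjunctive")
and_machine = BlicketMachine(4, frozenset({0, 2}), "conjunctive")
print(or_machine.step("put", 0))    # True: one Blicket suffices under OR
print(and_machine.step("put", 0))   # False: AND needs both Blickets
print(and_machine.step("put", 2))   # True: both Blickets are now present
```

The first observation on its own is where the rules come apart. Under the OR rule a single Blicket lights the machine. Under the AND rule the same test returns nothing, and an agent that stops there has learned very little.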

The neat design feature is that the task separates three things that often get blurred in AI evaluation:

Layer of behaviour	What the paper tests	Why it matters operationally
Exploration	Does the agent choose informative actions?	Bad investigations produce bad evidence, even before reasoning begins.
Inference	Can the agent infer the causal rule from observations?	Even good data can be misread through a biased prior.
Bias correction	Can the agent reduce bias without retraining?	Many deployed systems need runtime controls, not six-month retraining fantasies.

The paper’s central mechanism is the interaction between prior belief and evidence gathering. If an agent expects OR-style causes, it will tend to collect evidence that quickly confirms OR-style explanations. It will then interpret ambiguous evidence through the same lens. The final answer is only the visible end of the pipeline. The bias has already been doing paperwork in the basement.

Why the Blicket setup is sharper than another benchmark leaderboard

The experimental setup is deliberately small. The authors test easier environments with four objects and harder ones with eight objects, with two true Blickets in each case. Agents are allowed up to 32 exploration steps before answering. The paper compares language-model agents with two non-LM baselines: a random agent and an InfoGain Oracle.

The oracle is important. It explicitly computes the expected information gain of candidate actions and selects actions that reduce uncertainty over the discrete hypothesis space. Under a uniform prior over consistent hypotheses, this amounts to choosing the action that is expected to rule out the most remaining causal hypotheses.

That means the benchmark is not merely asking, “Can a model solve a puzzle?” It asks whether the agent behaves like a competent experimenter in a world where the optimal exploration strategy is tractably computable. The oracle gives the paper a clean reference point. If models struggle in the conjunctive condition while the oracle resolves both rule types, the problem is not simply that conjunctive rules are inherently impossible. The problem is that the agents do not explore and infer in the right way.

The paper’s information-gain framing can be summarised as:

$$ \text{InfoGain}(x,y) = H[p(F)] - H[p(F \mid x,y)] $$

where $F$ is the causal hypothesis, $x$ is the tested state, and $y$ is the observed machine outcome. The operational translation is less mathematical but more useful: each action should shrink the set of still-plausible explanations.
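A small worked sketch of that quantity, under loud assumptions: the hypothesis space below is every pair of Blickets among four objects combined with either rule (12 hypotheses), the prior is uniform, and the agent may jump to any machine state rather than moving one object at a time. The function names are illustrative and the paper's oracle may differ in detail. Because the outcome is unknown before acting, the code averages the formula above over both possible outcomes.

```python
from itertools import combinations, product
from math import log2

N_OBJECTS = 4

# Assumption: exactly two Blickets, and the rule is either OR or AND.
HYPOTHESES = [
    (frozenset(pair), rule)
    for pair in combinations(range(N_OBJECTS), 2)
    for rule in ("disjunctive", "conjunctive")
]

# Every subset of objects that could sit on the machine.
ALL_STATES = [
    frozenset(i for i, bit in enumerate(bits) if bit)
    for bits in product((0, 1), repeat=N_OBJECTS)
]


def predict(hypothesis, state):
    """Would the light turn on in this state if the hypothesis were true?"""
    blickets, rule = hypothesis
    present = blickets & state
    return bool(present) if rule == "disjunctive" else present == blickets


def expected_info_gain(state, active):
    """Average of H[p(F)] - H[p(F | x, y)] over outcomes y, uniform prior on `active`."""
    prior_entropy = log2(len(active))
    on = [h for h in active if predict(h, state)]
    off = [h for h in active if not predict(h, state)]
    gain = 0.0
    for consistent in (on, off):
        if consistent:
            p_outcome = len(consistent) / len(active)
            gain += p_outcome * (prior_entropy - log2(len(consistent)))
    return gain


def best_state(active):
    """State whose outcome is expected to shrink the hypothesis set the most."""
    return max(ALL_STATES, key=lambda s: expected_info_gain(s, active))


first_test = best_state(HYPOTHESES)
print(sorted(first_test), round(expected_info_gain(first_test, HYPOTHESES), 3))
```

In this toy space the best opening test places two objects on the machine, which splits the 12 hypotheses evenly and yields one bit. Testing a single object is worth about 0.81 bits, and testing nothing or everything is worth zero.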

Most business agents are not asked to solve Blicket puzzles. They are asked to inspect transactions, test marketing hypotheses, classify support incidents, review compliance exceptions, or diagnose process failures. But the same principle applies: an agent that does not actively reduce the plausible hypothesis set is not investigating. It is narrating.

The models prefer OR-worlds even when AND-worlds are valid

The first main evidence comes from the agents’ Q&A accuracy after exploration. Across model families and prompting methods, performance drops as the environment grows from four to eight objects. More importantly, models systematically perform worse under conjunctive rules than disjunctive rules.

The InfoGain Oracle reaches perfect Q&A accuracy in the tested setting. That matters because it rules out the lazy interpretation that conjunctive conditions are simply too hard in principle. They are hard for these agents because of how the agents explore and infer.

The paper also reports that this pattern persists across prompt variants, including common prompting strategies such as default prompting, chain-of-thought, ReAct, and Reflexion-style prompting. This is where the common misconception starts to wobble. More capable or more verbally elaborate models do not automatically become more scientific. Sometimes they just produce longer autopsies for the same mistake.

A useful way to read the results is not “Model X beats Model Y”. The more durable result is structural:

Finding	Evidence role	Interpretation
Accuracy falls from four to eight objects	Main evidence	Scaling the object space makes exploration harder for LM agents.
Conjunctive accuracy is systematically lower	Main evidence	Agents carry a disjunctive bias, not merely random weakness.
The InfoGain Oracle solves the hypothesis space	Control / upper-bound comparison	The conjunctive condition is not inherently unsolvable in this setup.
Prompting variants do not remove the pattern	Robustness test	The bias is not just a bad instruction template.

The scatter plot across model and prompt combinations strengthens the same point. In the eight-object setting, models skew toward lower conjunctive accuracy. The story survives the usual “maybe a different prompt fixes it” reflex. A prompt tweak is convenient, but it is not sufficient.

Information gain predicts success better than exploration theatre

The paper then asks what correlates with final performance. The authors examine information gain, unique state visitation, number of steps taken, and response length.

Information gain shows the strongest positive relationship with final accuracy. Unique state visitation also helps, but less strongly. This distinction is useful. Trying lots of states is not the same as trying informative states. An agent can wander creatively and still fail to ask the one question that collapses the hypothesis space.

Two results are especially worth separating.

First, the number of steps taken is negatively correlated with performance. That does not mean laziness is intelligence. It suggests that agents that have actually gathered enough information tend to know when to stop. Poor agents may keep thrashing because the remaining uncertainty has not been cleanly eliminated.

Second, longer Q&A responses correlate with lower performance. This should make anyone building “reasoning-heavy” agent pipelines slightly uncomfortable. Response length is not understanding. It may be a symptom of unresolved uncertainty dressed in formal wear.

For operators, the message is direct: measure whether the agent’s actions reduce uncertainty. Do not treat token count, apparent deliberation, or broad exploration as substitutes for hypothesis elimination.
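One way to act on that is to audit a logged trajectory for how many hypotheses it actually ruled out, alongside the cruder counts. The sketch below is illustrative only: it reuses the assumed toy hypothesis space of two Blickets among four objects, and `audit_trajectory` and its metric names are hypothetical, not the paper's evaluation code.

```python
from itertools import combinations
from math import log2


def make_hypotheses(n_objects):
    # Assumption: exactly two Blickets, rule is either OR or AND.
    return [
        (frozenset(pair), rule)
        for pair in combinations(range(n_objects), 2)
        for rule in ("disjunctive", "conjunctive")
    ]


def predict(hypothesis, state):
    blickets, rule = hypothesis
    present = blickets & state
    return bool(present) if rule == "disjunctive" else present == blickets


def audit_trajectory(trajectory, n_objects):
    """Summarise a list of (objects_on_machine, light_on) observations."""
    active = make_hypotheses(n_objects)
    start = len(active)
    for state, light_on in trajectory:
        active = [h for h in active if predict(h, frozenset(state)) == light_on]
    return {
        "steps": len(trajectory),
        "unique_states": len({frozenset(s) for s, _ in trajectory}),
        "hypotheses_remaining": len(active),
        # nan signals observations that contradict every hypothesis in the space.
        "bits_gained": log2(start) - log2(len(active)) if active else float("nan"),
    }


# Hidden truth in both runs: objects 0 and 2 are Blickets, rule is conjunctive.
wandering = [({0}, False), ({1}, False), ({2}, False), ({3}, False)]
focused = [({0, 1}, False), ({0, 2}, True), ({2}, False)]

print(audit_trajectory(wandering, 4))
print(audit_trajectory(focused, 4))
```

The wandering run visits four unique states and still leaves six hypotheses standing. The focused run visits three and leaves one. Counting states would rank them the wrong way round.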

The agent fails twice: first as an explorer, then as an interpreter

The paper’s strongest diagnostic move is to separate exploration from inference.

If agents perform poorly after their own exploration, there are two possible explanations. They may have collected bad evidence. Or they may have collected adequate evidence but reasoned badly from it. The authors test this by giving models standardised exploration histories generated by different sources: GPT-4o exploration, InfoGain Oracle exploration, random exploration, and a count-based exploration policy.

This experiment is doing diagnostic work, not just adding another result. It asks: if we fix the evidence stream, does the bias disappear?

The answer is: partly, but not fully.

In conjunctive settings, models improve when given oracle exploration data rather than LM-generated exploration data. That shows bad exploration is causally involved in poor performance. However, most models still do not reach near-perfect accuracy even with oracle data. The paper notes that DeepSeek-R1 is the exception in this particular inference-only comparison, but the broader pattern remains: for models such as GPT-4o, conjunctive inference remains worse than disjunctive inference even when both receive oracle data.

This is the mechanism in miniature:

The agent explores inefficiently, especially in conjunctive worlds.
Better exploration data improves performance.
Yet biased inference remains for most models even when the data are strong.
Therefore the failure is not only data collection or only reasoning. It is the coupling of biased priors with active exploration.

That coupling is precisely what makes agentic systems risky. A passive classifier can be benchmarked on fixed inputs. An agent creates part of its own evidence. Once it starts asking biased questions, the downstream reasoning task has already been quietly sabotaged.

The human comparison is evidence about shape, not identity

The paper then compares LM behaviour with human developmental findings from Blicket experiments. This section should be read carefully. It does not prove that LMs and humans have the same internal cognitive mechanisms. The authors say as much. The comparison is about behavioural profile.

In passive inference settings with ambiguous evidence, most language models show adult-like disjunctive bias. Four-year-old children, by contrast, are more willing to infer conjunctive rules when the evidence supports them. This echoes the developmental-psychology idea of children as more flexible hypothesis searchers in certain causal-learning tasks.

The active exploration comparison is also telling. Prior work with children found that their exploration was not significantly shaped by whether the underlying machine was conjunctive or disjunctive. In the LM experiments, exploration is more rule-sensitive. Models generally attempt fewer unique combinations and spend less time exploring in disjunctive settings.

The point is not that children are magic and adults are broken. The point is that adult-like priors can be efficient in ordinary environments and harmful in scientific discovery. Language models trained on adult-generated text may inherit not only knowledge but also the shape of adult plausibility.

That is a sobering design lesson. Alignment to human preference does not automatically mean alignment to scientific inquiry. Humans often prefer explanations that feel clean, familiar, and single-cause. The universe, in a continuing act of poor stakeholder management, often does not.

Hypothesis sampling works because it attacks the prior

The proposed fix is inference-time hypothesis sampling. It does not require weight updates. Instead, it changes what the agent explicitly represents during exploration.

The method can be understood in four steps:

Step	What happens	Why it helps
Sample candidate hypotheses	The LM generates possible causal functions.	The agent is forced to articulate alternatives, not just follow intuition.
Reject duplicates	Repeated hypotheses are removed.	The represented prior becomes flatter over its support.
Eliminate inconsistent hypotheses	Observations are used to discard hypotheses that no longer fit.	Evidence changes the working hypothesis set explicitly.
Choose actions to disprove hypotheses	The LM is prompted to take actions that eliminate as many active hypotheses as possible.	Exploration becomes adversarial against uncertainty rather than confirmatory.

The technical idea is that the model’s internal prior $p(F)$ may be skewed toward disjunctive hypotheses. The procedure constructs an empirical distribution $q(F)$ over sampled unique hypotheses and gives each accepted hypothesis equal weight. As more unique hypotheses are included, this explicit working prior becomes less dominated by whichever hypothesis the model would otherwise keep regenerating.
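The flattening step is simple enough to show directly. In this sketch the raw samples are invented strings standing in for model outputs, and `flatten_prior` is a hypothetical helper name. It illustrates the idea of equal weight over unique hypotheses and makes no claim to reproduce the authors' code.

```python
from collections import Counter


def flatten_prior(samples):
    """Deduplicate sampled hypotheses and give each unique one equal weight."""
    unique = list(dict.fromkeys(samples))   # keeps first-seen order
    return {h: 1.0 / len(unique) for h in unique}


# Hypothetical raw samples from a model that keeps regenerating OR-style rules.
raw_samples = ["A or B", "A or B", "A or C", "A or B", "A and B", "A or C"]

# How often each hypothesis was generated: a rough proxy for the skewed prior p(F).
sampling_freq = {h: n / len(raw_samples) for h, n in Counter(raw_samples).items()}

# The explicit working prior q(F): uniform over the unique hypotheses.
q = flatten_prior(raw_samples)

print(sampling_freq["A and B"])   # one sixth of the raw samples
print(q["A and B"])               # one third of the working prior
```

The conjunctive hypothesis goes from one sixth of the raw samples to one third of the working prior. Nothing about the model changed. Only the bookkeeping did.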

The exploration prompt then changes from “figure out the answer” to “take actions that eliminate hypotheses.” That is a small wording change with a large behavioural implication. The agent no longer gets rewarded, even implicitly, for settling quickly on a plausible causal story. It has to keep track of what would still be true if it were wrong.

In the eight-object environment, the hypothesis-sampling agent improves exploration and Q&A performance as the number of initial unique hypotheses increases. The paper reports that with enough samples, the method closes the earlier gap between conjunctive and disjunctive performance. Appendix results further show that the biased exploration pattern seen in the naive LM is no longer significant under hypothesis sampling.

This is not a miracle. It is a scaffold. The agent is being made more scientific by outsourcing part of scientific discipline into the runtime procedure.

The appendix is mostly robustness, not a second thesis

The appendix matters because it clarifies what the main claims do and do not rest on.

The model-access details show that the authors tested a mix of model families and sizes, including GPT-4o, GPT-4o-mini, DeepSeek-V3, DeepSeek-R1, QwQ, and Gemma variants. This supports the claim that the bias is not a single-model quirk.

The statistical validation section describes the repeated trials and prompt/system-message variants. For the main results, the paper uses at least 16 independent trials per model, prompt variant, and system prompt combination. That is not industrial-scale evaluation, but it is enough to make the reported patterns more than anecdotal prompt archaeology.

The reasoning-effort appendix is especially useful. Tests with OpenAI o-series mini reasoning models at different reasoning efforts show that increasing reasoning effort does not cleanly solve hypothesis elimination. In some conjunctive settings, higher reasoning effort still leaves many hypotheses unresolved. This supports the paper’s broader warning: more reasoning tokens are not the same as better experimental design.

The human-exploration appendix reinforces the comparison with children. Children’s exploration statistics remain broadly similar across conjunctive and disjunctive machines, while LM exploration varies more with rule type. Again, this supports claims about behavioural profiles, not claims about identical cognition.

Finally, the hypothesis-sampling appendix gives implementation detail: hypotheses are sampled as executable Python-style functions, checked against observations, moved into active or eliminated sets, and used to condition subsequent actions. This is important because it shows the method is not mystical “better prompting”. It is a structured runtime loop.
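A rough sketch of what such a loop might look like follows. The `HypothesisLedger` class, its method names, and the three lambda hypotheses are assumptions made for illustration. In the paper the hypotheses come from the LM, and the prompt format used to condition the next action is not reproduced here.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

# A hypothesis is an executable function: objects on the machine -> light on?
Hypothesis = Callable[[FrozenSet[str]], bool]


class HypothesisLedger:
    """Tracks which sampled hypotheses still fit the observations (hypothetical design)."""

    def __init__(self, hypotheses: Dict[str, Hypothesis]):
        self.active: Dict[str, Hypothesis] = dict(hypotheses)
        # name -> the observation that ruled the hypothesis out
        self.eliminated: Dict[str, Tuple[FrozenSet[str], bool]] = {}

    def update(self, on_machine: Iterable[str], light_on: bool) -> None:
        """Move every hypothesis that mispredicts this observation to `eliminated`."""
        state = frozenset(on_machine)
        for name, predicts in list(self.active.items()):
            if predicts(state) != light_on:
                self.eliminated[name] = (state, light_on)
                del self.active[name]

    def summary(self) -> str:
        """Text that could be fed back into the next exploration prompt."""
        return f"active={sorted(self.active)} eliminated={sorted(self.eliminated)}"


# Stand-ins for hypotheses an LM might sample as Python-style functions.
sampled = {
    "A or B": lambda s: "A" in s or "B" in s,
    "A and B": lambda s: {"A", "B"} <= s,
    "A or C": lambda s: "A" in s or "C" in s,
}

ledger = HypothesisLedger(sampled)
ledger.update({"A"}, light_on=False)       # rules out both OR-style hypotheses
ledger.update({"A", "B"}, light_on=True)   # consistent with "A and B"
print(ledger.summary())
# active=['A and B'] eliminated=['A or B', 'A or C']
```

Recording the observation that eliminated each hypothesis is a design choice of this sketch. It gives an auditor a reason for every dismissal, which a running narrative rarely does.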

What this implies for enterprise agent design

The paper directly shows a causal-discovery bias in a simplified text environment. Cognaptus’ business inference is broader but bounded: any agent that must investigate unknown systems may fail if its exploration policy inherits strong priors about what causes “usually” look like.

The risk is highest in domains where important causes are conjunctive, conditional, or rare:

Domain	Disjunctive shortcut	Missed conjunctive reality
IT operations	“The database is slow.”	Slowdown occurs only under a region, cache state, and deployment version combination.
Fraud review	“This transaction feature is suspicious.”	Fraud pattern requires several weak signals jointly.
Compliance	“This vendor category is risky.”	Risk appears only when vendor type, approval path, and contract structure combine.
Clinical or safety triage	“One factor explains the alert.”	The dangerous state requires multiple interacting conditions.
Marketing experiments	“Campaign A caused uplift.”	Uplift depends on segment, timing, channel, and offer interaction.

The design response is not to ban intuition. Priors are useful. The design response is to make priors inspectable and contestable.

A practical agent workflow should include:

A hypothesis ledger. The agent should maintain a list of active causal explanations, not just a running narrative.
Explicit elimination criteria. Each observation should say which hypotheses it rules out and which remain possible.
Information-gain action selection. The next step should be chosen for its ability to reduce uncertainty, not its ability to confirm the favourite story.
Conjunctive stress tests. Evaluation sets should include cases where no single feature is sufficient.
Reasoning-token scepticism. Longer explanations should be treated as audit material, not as evidence of correctness.
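For the conjunctive stress tests in particular, a check along these lines can confirm that an evaluation case really has no single sufficient feature. Everything here is hypothetical: the feature names, the `outage` rule, and both helper functions are invented for illustration, and the brute-force search only suits small feature sets.

```python
from itertools import combinations


def is_conjunctive_case(outcome, features):
    """True when all features together trigger the outcome but no single feature does."""
    features = frozenset(features)
    any_alone = any(outcome(frozenset({f})) for f in features)
    return outcome(features) and not any_alone


def minimal_triggers(outcome, features):
    """Smallest feature combinations that trigger the outcome (brute force)."""
    ordered = sorted(features)
    for size in range(1, len(ordered) + 1):
        hits = [set(c) for c in combinations(ordered, size) if outcome(frozenset(c))]
        if hits:
            return hits
    return []


# Hypothetical incident: the outage needs three conditions at once.
FEATURES = {"eu_config", "version_mismatch", "traffic_spike", "cache_expired"}


def outage(present):
    return {"eu_config", "version_mismatch", "traffic_spike"} <= present


def slow_cache(present):
    return "cache_expired" in present


print(is_conjunctive_case(outage, FEATURES))      # True: worth adding to the stress set
print(is_conjunctive_case(slow_cache, FEATURES))  # False: a single feature suffices
print(minimal_triggers(outage, FEATURES))         # one combination of three features
```

Cases that pass `is_conjunctive_case` are the ones where a single-cause explanation is guaranteed to be wrong, which is the situation the paper's agents handle worst.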

This is where many agent demos are still too shallow. They show an agent using tools. They do not show whether the tool calls were scientifically useful. Tool use is not experimentation. It is merely clicking with a better vocabulary.

What remains uncertain

The paper’s limits are real and specific.

First, the Blicket environment is simplified. It has a small discrete hypothesis space, controlled observations, and two rule families. Enterprise environments contain noisy data, hidden confounders, changing processes, partial observability, and incentives that are more annoying than a light turning on.

Second, the comparison to humans is behavioural. The fact that LMs resemble adults in certain Blicket patterns does not prove that they share human cognitive mechanisms. It suggests overlapping outward tendencies under a translated task.

Third, the hypothesis-sampling method depends on the model’s ability to generate useful candidate hypotheses. If the correct hypothesis is outside the sampled support, flattening the sampled set does not save you. A uniform distribution over the wrong menu is still the wrong menu, just democratically wrong.

Fourth, runtime hypothesis management introduces cost and complexity. Sampling, validating, and eliminating hypotheses may be worthwhile in high-stakes diagnosis or experimentation. It may be unnecessary for low-risk automation where speed matters more than causal depth.

The right conclusion is not “all agents need full scientific scaffolding all the time.” The right conclusion is that agents assigned investigative work need procedures that force them to behave less like confident adults and more like disciplined experimenters.

The useful upgrade is procedural humility

The paper’s most valuable contribution is not the discovery that language models have biases. That sentence could have been printed on a mug in 2023. The valuable contribution is the mechanism: biased priors shape exploration, exploration shapes evidence, and evidence then gets interpreted through the same biased prior.

That is why the proposed fix is interesting. It does not ask the model to be humbler in the abstract. It gives the model a procedure that makes humility operational: list hypotheses, test them, eliminate them, and only then answer.

For Cognaptus readers building agentic systems, the design lesson is crisp. Do not ask whether your agent can explain its conclusion. Ask whether it tried hard enough to disprove it.

Cognaptus: Automate the Present, Incubate the Future.

¹ Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake Aaron Richards, Rob Fergus, and Kenneth Marino, “Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?”, arXiv:2505.09614, 2025, https://arxiv.org/abs/2505.09614.