In a recent paper, “Language Agents Mirror Human Causal Reasoning Biases” (Chen et al., 2025), researchers uncovered a persistent issue affecting even the most advanced language model (LM) agents: a disjunctive bias, that is, a tendency to prefer “OR”-type causal explanations over equally valid or even stronger “AND”-type ones. Strikingly, this mirrors adult human reasoning patterns and undermines the agents’ ability to draw correct conclusions in scientific-style causal discovery tasks.

This post breaks down the findings and walks through how the test-time hypothesis sampling method proposed in the paper helps overcome these limitations. We’ll then explore how the method can be integrated into XAgent, our modular R-based multi-agent framework.


🔍 The Problem: LM Agents Fail the Blicket Test

The researchers adapted the classic developmental psychology experiment called the Blicket Test into a text-based game. In this game, agents interact with a virtual machine that responds to certain object combinations. The rule governing activation could be disjunctive (any blicket triggers the light) or conjunctive (all blickets must be present).
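
To make the two rule types concrete, here is a toy illustration in R (the object indices are ours, not the paper’s):

# Suppose objects 1 and 3 are the blickets.
blickets <- c(1, 3)

# Disjunctive rule: the machine lights up if ANY blicket is on it.
disjunctive_rule <- function(placed) any(blickets %in% placed)

# Conjunctive rule: the machine lights up only if ALL blickets are on it.
conjunctive_rule <- function(placed) all(blickets %in% placed)

disjunctive_rule(1)        # TRUE: one blicket is enough
conjunctive_rule(1)        # FALSE: both blickets are required
conjunctive_rule(c(1, 3))  # TRUE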

Findings:

  • LM agents (including GPT-4o and DeepSeek models) performed well under disjunctive rules.
  • But under conjunctive conditions, their accuracy dropped significantly—even when fed perfect data.
  • Even with sophisticated prompting (Chain-of-Thought, ReAct, Reflexion), this bias persisted.

🧠 Why? Disjunctive Priors in LMs

Language models are trained on human-generated text, mostly from adults. Decades of psychological research show that adults tend to prefer simpler, disjunctive causal explanations. As such, LMs appear to internalize and reproduce these reasoning shortcuts.


🧪 The Solution: Hypothesis Sampling at Inference Time

To fix this, the authors introduce a clever inference-time method that doesn’t require model fine-tuning. Here’s how it works:

  1. Sample Hypotheses: Prompt the LM to generate a diverse set of candidate causal hypotheses (functions like object[1] OR object[2]).
  2. Flatten the Prior: Reject duplicate or overly similar hypotheses to construct a more uniform belief distribution $q(F)$.
  3. Prompt for Elimination: Ask the LM to take actions that eliminate the most remaining hypotheses (i.e., maximize information gain).
  4. Iterate: After each action, update the hypothesis set and repeat.

This loop nudges the LM to adopt a more scientific, Bayesian hypothesis-testing approach, rather than relying on intuition.
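
To see what steps 1–3 look like in practice, hypotheses can be represented as executable predicates and discarded whenever they mispredict an observation. The snippet below is our own simplified illustration in R, not code from the paper:

# Each hypothesis is a candidate activation rule over the objects placed on the machine.
hypotheses <- list(
  "obj1 OR obj2"  = function(placed) any(c(1, 2) %in% placed),
  "obj1 AND obj2" = function(placed) all(c(1, 2) %in% placed),
  "obj3 only"     = function(placed) 3 %in% placed
)

# One observation: object 1 was placed on the machine and the light stayed off.
observation <- list(placed = c(1), lit = FALSE)

# Eliminate every hypothesis whose prediction contradicts the observation.
consistent <- Filter(
  function(h) h(observation$placed) == observation$lit,
  hypotheses
)

names(consistent)  # only "obj1 AND obj2" and "obj3 only" survive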

Result:

When the method was applied in the harder 8-object Blicket environment, the LM agents’ accuracy rose significantly, closing the gap between conjunctive and disjunctive inference.


🧩 Integrating into XAgent Framework

Our XAgent R package supports flexible agent construction and reasoning pipelines. Here’s how we can incorporate the test-time hypothesis sampling idea:

Step 1: Add a New Policy Module

Create R/policy_hypothesis_sampling.R and define the policy below. The helper functions and the memory$last_* fields it uses are placeholders; sketches of the helpers follow the listing.

hypothesis_sampling_policy <- function(agent, memory) {
  # Step 1: Sample a diverse set of candidate causal hypotheses via the LLM
  hypotheses <- sample_hypotheses_llm(agent$name, memory$context)

  # Step 2: Flatten the prior by dropping duplicate or near-identical hypotheses
  hypotheses <- deduplicate_hypotheses(hypotheses)

  # Step 3: Discard hypotheses contradicted by the latest observation, if one exists
  # (memory$last_action / memory$last_observation are assumed to be set by the update step)
  if (!is.null(memory$last_observation)) {
    hypotheses <- update_hypotheses(hypotheses, memory$last_action, memory$last_observation)
  }

  # Step 4: Choose the action expected to eliminate the most remaining hypotheses
  action <- choose_elimination_action(hypotheses, memory$history)

  # Step 5: Persist the hypothesis set and the chosen action, then return
  memory$hypotheses <- hypotheses
  memory$last_action <- action
  return(list(action = action, memory = memory))
}
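
For reference, here is one way the placeholder helpers could be fleshed out. Everything below is a sketch under assumptions: llm_complete(), parse_rule(), and candidate_actions() are illustrative names, not existing XAgent functions, and each hypothesis is represented as list(rule = <string>, predict = <function>).

# Ask the LLM for candidate rules ("object[1] OR object[2]", ...), one per line,
# and compile each into a predicate over the objects placed on the machine.
sample_hypotheses_llm <- function(agent_name, context, n = 20) {
  prompt <- paste0(
    "You are ", agent_name, ". Given these observations:\n", context,
    "\nPropose ", n, " distinct rules for what activates the machine, one per line."
  )
  rules <- strsplit(llm_complete(prompt), "\n")[[1]]   # llm_complete() is an assumed LLM wrapper
  lapply(rules, function(r) list(rule = r, predict = parse_rule(r)))  # parse_rule() is assumed
}

# Flatten the prior: keep only one copy of each rule string.
deduplicate_hypotheses <- function(hypotheses) {
  hypotheses[!duplicated(vapply(hypotheses, `[[`, character(1), "rule"))]
}

# Greedy stand-in for information gain: pick the untried action whose predicted
# outcomes split the surviving hypotheses most evenly.
choose_elimination_action <- function(hypotheses, history) {
  candidates <- candidate_actions(history)  # assumed: object combinations not yet tried
  scores <- vapply(candidates, function(a) {
    preds <- vapply(hypotheses, function(h) h$predict(a), logical(1))
    abs(sum(preds) - length(preds) / 2)     # 0 = even split = most informative
  }, numeric(1))
  candidates[[which.min(scores)]]
}

# Keep only hypotheses consistent with the observed outcome of the last action.
update_hypotheses <- function(hypotheses, action, observation) {
  Filter(function(h) h$predict(action) == observation$lit, hypotheses)
}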

Step 2: Add to Pipeline

Update the agent’s pipeline in agent_schedule.yaml or wherever the agent’s config is stored:

pipeline:
  - think: hypothesis_sampling_policy
  - act: execute_action
  - update: update_memory
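
How the named policy gets picked up depends on the scheduler. If it resolves step names at runtime, the dispatch can be as simple as the sketch below, which assumes the yaml package and that the policy function is already defined in scope:

library(yaml)

# Read the agent's pipeline and resolve each step name to an R function.
config <- yaml::read_yaml("agent_schedule.yaml")
pipeline <- lapply(config$pipeline, function(step) {
  list(stage = names(step), fn = match.fun(step[[1]]))
})

# pipeline[[1]]$fn now points at hypothesis_sampling_policy for the "think" stage.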

Step 3: Allow Dynamic Prior Sampling

In memory_io.R, ensure memory$hypotheses is saved and loaded with the rest of the agent’s memory, so the hypothesis set persists across iterations of the loop.
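
A minimal way to do this, assuming the memory list is serialized to disk as an RDS file (the file path and function names below are illustrative):

# Persist the full memory list, including memory$hypotheses, between agent steps.
save_memory <- function(memory, path = "memory/agent_memory.rds") {
  saveRDS(memory, path)
}

load_memory <- function(path = "memory/agent_memory.rds") {
  memory <- if (file.exists(path)) readRDS(path) else list()
  if (is.null(memory$hypotheses)) memory$hypotheses <- list()  # start from an empty hypothesis set
  memory
}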


🔄 From Bias to Bayesian

Rather than letting LM agents blindly imitate flawed adult heuristics, the hypothesis sampling method pushes them to behave more like careful scientists: generating competing theories and eliminating the ones the evidence rules out. For any project aiming to build trustworthy autonomous AI agents, integrating this method is a must.

At Cognaptus, we’re embedding this upgrade into XAgent’s reasoning module starting next release.


Cognaptus: Automate the Present, Incubate the Future.