Wrong on Purpose: FalsifyBench and the Agent Skill We Keep Forgetting

A good analyst should occasionally try to break their own idea.

Not performatively. Not with a decorative “on the other hand” paragraph. Actually break it. Ask the kind of question that could make the current hypothesis collapse, then watch whether the evidence forces a better one.

That simple discipline is the center of “FalsifyBench: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games,” a new paper by Leonardo Bertolazzi, Katya Tentori, and Raffaella Bernardi.¹ The paper is framed around scientific reasoning, but its practical message travels well beyond science. If an AI agent cannot test outside its own current belief, it may look careful while doing something much less impressive: confirming the first plausible story it invented.

That is not intelligence. That is PowerPoint with API access.

The paper introduces FalsifyBench, a benchmark inspired by Wason’s classic 2-4-6 task. Instead of asking models to infer a numerical rule, the benchmark asks them to discover hidden semantic categories. A model sees three examples, proposes more examples to test, receives oracle feedback, and eventually guesses the hidden rule.

The trick is that the initial examples are sampled from a narrower category than the real target. For example, the player might see examples from a specific animal subgroup, while the hidden rule is simply “animal.” The model’s natural first hypothesis is therefore too narrow. If it only tests more things that fit that narrow hypothesis, every answer looks reassuring. The model learns almost nothing.

The useful move is to test something outside the current hypothesis.

That is the mechanism worth understanding before looking at model rankings.

The benchmark is built around one uncomfortable relation: the hypothesis is too narrow

FalsifyBench formalizes the game using two sets:

$H$: the model’s current hypothesis;
$R$: the true hidden rule.

The most important case is $H \subset R$: the hypothesis is a strict subset of the real rule. The model is not completely wrong. It is just too specific. That is exactly why the failure is dangerous.

If the model thinks the rule is “mammal” while the true rule is “animal,” then testing more mammals will always receive positive feedback. The feedback is true, but unhelpful. It confirms the narrower hypothesis without revealing the broader one.

The only way to discover that the rule is broader is to test something outside $H$ but inside $R$ — for instance, a bird or a fish. If the oracle says it conforms, the model must revise upward.

Relation between model hypothesis and true rule	Useful falsification route	What it means in plain English
$H = R$	None; guess and finish	The model already has the rule.
$H \subset R$	Negative test: test outside $H$ and get “Conform”	The model’s idea is too narrow. It must generalize.
$H \supset R$	Positive test: test inside $H$ and get “Do not conform”	The model’s idea is too broad. It must narrow.
Partial overlap	Positive or negative tests may help	The model is near the right region but not aligned.
Disjoint	Positive or targeted negative tests may help	The model is off-track. It needs a reset, not more polish.

This table matters because “confirmation bias” is often treated as a vague psychological insult. The paper makes it operational. In the dominant FalsifyBench setup, positive tests are not merely biased; they are structurally uninformative.
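The table can be made mechanical with ordinary set operations. The sketch below is ours, not the paper's code: it treats $H$ and $R$ as finite sets of items, and the names `Relation`, `classify`, `falsifying_items`, and the toy word lists are all invented for illustration.

```python
from enum import Enum


class Relation(Enum):
    EQUAL = "H = R"
    TOO_NARROW = "H subset of R"
    TOO_BROAD = "H superset of R"
    PARTIAL_OVERLAP = "partial overlap"
    DISJOINT = "disjoint"


def classify(h: set[str], r: set[str]) -> Relation:
    """Classify how the hypothesis extension h relates to the rule extension r."""
    if h == r:
        return Relation.EQUAL
    if h < r:  # strict subset: hypothesis too narrow
        return Relation.TOO_NARROW
    if h > r:  # strict superset: hypothesis too broad
        return Relation.TOO_BROAD
    if h & r:
        return Relation.PARTIAL_OVERLAP
    return Relation.DISJOINT


def falsifying_items(h: set[str], r: set[str], universe: set[str]) -> dict[str, set[str]]:
    """Items whose oracle feedback would contradict h.

    positive: inside h but outside r (oracle answers "Do not conform")
    negative: outside h but inside r (oracle answers "Conform")
    """
    return {
        "positive": (h - r) & universe,
        "negative": (r - h) & universe,
    }


# Toy universe, invented for illustration only.
UNIVERSE = {"dog", "cat", "sparrow", "trout", "oak", "hammer"}
MAMMAL = {"dog", "cat"}
ANIMAL = {"dog", "cat", "sparrow", "trout"}

relation = classify(MAMMAL, ANIMAL)
routes = falsifying_items(MAMMAL, ANIMAL, UNIVERSE)
print(relation.value)
print(sorted(routes["positive"]), sorted(routes["negative"]))
```

For the “mammal” versus “animal” case, the set of positive tests that could falsify the hypothesis is empty. Only the items outside “mammal” can move the game forward, which is the structural point the table makes.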

That distinction is useful for business AI. A research assistant that keeps asking questions compatible with its current interpretation can look diligent while staying trapped. A compliance agent can keep finding examples that fit a narrow reading of a policy. A market-intelligence bot can keep validating a customer segment definition that should have been generalized two steps earlier.

The problem is not that the agent is lazy. The problem is that the evaluation never required it to risk being wrong.

What FalsifyBench actually asks the models to do

The benchmark uses WordNet as a semantic taxonomy. The authors begin with seven high-level candidate categories (“animal,” “artifact,” “body part,” “food,” “location,” “plant,” and “worker”) and then manually curate a final benchmark of 100 games across five target categories: “animal,” “artifact,” “body part,” “food,” and “plant.”

Each game has two hidden elements:

a broad target rule $R$, such as “animal”;
a narrower sampling category $S$, such as “vertebrate.”

The player sees only three initial examples drawn from $S$. It does not see $S$ or $R$. On each turn, the player can either:

Test three new items, while stating its current hypothesis and reasoning; or
Guess the hidden property.

The oracle answers whether the tested items conform to the hidden rule, or whether the guess is correct. Games end when the model guesses correctly or hits the 20-turn limit.

A nice design choice is that each LLM is evaluated in two roles: player and oracle. This lets the authors separate two possible failure explanations. Maybe models fail because they reason badly as players. Or maybe they fail because the oracle gives poor feedback. The paper then tests that distinction instead of waving vaguely at “model limitations,” a phrase that has done enough damage already.

The experiment evaluates 12 LLMs across model families and scales. The player is stateful, accumulating the conversation history. The oracle is stateless, judged call by call. Both are constrained to JSON outputs. The authors also annotate player strategy as positive or negative testing and later annotate the set relation between the model’s current hypothesis and the target rule.

This is not just a pass/fail benchmark. It records the path by which the model succeeds or fails. That path is the point.
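The protocol described in this section can be sketched in a few lines. This is a toy reconstruction under loud assumptions: the real player and oracle are LLMs, whereas here the oracle is plain set membership with exact string matching on guesses, the player replays a fixed script, and the JSON field names (`action`, `items`, `guess`, `feedback`, `correct`) are invented rather than taken from the paper.

```python
import json
from dataclasses import dataclass, field

MAX_TURNS = 20  # turn limit stated in the article


def oracle(rule_members: set[str], rule_name: str, move: dict) -> dict:
    """Stateless oracle: judges one move with no memory of earlier calls."""
    if move["action"] == "test":
        return {"feedback": [item in rule_members for item in move["items"]]}
    return {"correct": move["guess"] == rule_name}


@dataclass
class ScriptedPlayer:
    """Stateful player: replays a fixed script and accumulates the history."""
    script: list[dict]
    history: list[dict] = field(default_factory=list)

    def act(self) -> dict:
        return self.script[len(self.history)]

    def observe(self, move: dict, reply: dict) -> None:
        self.history.append({"move": move, "reply": reply})


def play(player: ScriptedPlayer, rule_members: set[str], rule_name: str) -> bool:
    """Run one game until a correct guess, the turn limit, or the script ends."""
    for _ in range(min(MAX_TURNS, len(player.script))):
        move = player.act()
        reply = oracle(rule_members, rule_name, move)
        player.observe(move, reply)
        if reply.get("correct"):
            return True
    return False


# Toy rule and script, invented for illustration.
ANIMAL = {"dog", "cat", "sparrow", "trout"}
player = ScriptedPlayer(script=[
    {"action": "test", "hypothesis": "mammal", "items": ["sparrow", "trout", "oak"]},
    {"action": "guess", "guess": "animal"},
])
solved = play(player, ANIMAL, "animal")
print(solved)
print(json.dumps(player.history, indent=2))
```

The design choice worth noticing is that `history` keeps every move and reply, not just the outcome. That record is what makes the later turn-level analysis possible.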

The main result: stronger models do better, but negative testing explains the motion

The headline performance pattern is straightforward. Reasoning models generally outperform instruction-tuned models, but performance is far from saturated.

The best instruction-tuned model, DeepSeek-V3.1, reaches 41% success. The strongest reasoning models do better: GPT-5.2-Chat reaches 75%, and GPT-OSS-120B reaches 68%. But smaller reasoning models do not automatically benefit from “thinking longer”: GPT-5-Nano reaches 12%, and Qwen3.5-9B reaches 21%, both below the instruction-tuned cluster.

So yes, model strength matters. But the more interesting question is how the successful models move.

The paper finds a strong negative correlation between success rate and confirmation bias: models that relied more on positive testing performed worse, with Spearman $\rho = -0.779$ and $p = 0.003$. Stronger reasoning models also had much lower positive-testing ratios. GPT-5-Mini is reported at 39.0%, GPT-5.2-Chat at 27.3%, and GPT-OSS-20B at 24.2%, while instruction-tuned models cluster between 66% and 82%.

The interpretation is not “reasoning models are magically scientific.” A cleaner interpretation is this:

Better players more often generated tests that could force their current hypothesis to change.

That is a narrower claim, and therefore a more useful one.
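For readers who want to compute a comparable number on their own agent logs, here is a minimal sketch of a positive-testing ratio. It assumes each turn has already been labeled `positive`, `negative`, or `guess`; the labels, the function name, and the example game are hypothetical, and the paper's exact scoring may differ.

```python
def positive_testing_ratio(turn_labels: list[str]) -> float:
    """Share of test turns annotated as positive (confirming) tests.

    Guess turns are ignored, since they are not tests.
    """
    tests = [label for label in turn_labels if label in ("positive", "negative")]
    if not tests:
        raise ValueError("no test turns to score")
    return sum(label == "positive" for label in tests) / len(tests)


# Hypothetical annotations for a single game.
labels = ["positive", "positive", "negative", "guess", "positive", "negative", "guess"]
ratio = positive_testing_ratio(labels)
print(f"{ratio:.1%}")
```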

The oracle is not the main scapegoat

A tempting explanation would be that models fail because the oracle gives bad feedback. The authors test this directly.

Oracle accuracy against human annotations is moderately high across models, mostly between 85% and 95%. Cohen’s $\kappa$ varies more, from 0.34 for Mistral-Small-24B on Guess turns to 0.98 for GPT-5.2-Chat in both conditions. So oracle quality is not perfect. Still, it does not explain the main player outcomes.

The paper gives a useful contrast: MiniMax-M2.5, GLM-5, and GPT-5-Nano have broadly similar oracle $\kappa$ ranges, yet their player success rates are 35%, 58%, and 12%, respectively. Similar oracle quality; very different game performance.

The regression analysis makes the same point more formally. The authors fit a Bayesian mixed-effects logistic regression over 1,200 game-level observations, using confirmation bias and oracle error as fixed-effect predictors, with model identity as a random effect.

The result is blunt:

confirmation bias has a strong negative association with success: posterior mean $-4.15$, odds ratio $0.016$, posterior SD $0.116$;
oracle error has no credible independent effect: posterior mean $-0.12$, odds ratio $0.885$, posterior SD $0.147$, with a credible interval spanning zero.

The appendix adds a helpful interpretation: moving from no confirmation bias to maximum confirmation bias reduces the odds of success to about $1/63$ of their original value. Oracle error, by contrast, corresponds to an estimated 11.5% reduction in odds, but the uncertainty interval crosses zero.
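The arithmetic behind those two readings can be reconstructed as follows, assuming the reported odds ratios are exponentiated posterior means on the log-odds scale (the usual convention for logistic regression) and that confirmation bias is scaled from 0 to 1. Both assumptions are ours, not statements from the paper.

$$
\mathrm{OR}_{\text{bias}} = e^{-4.15} \approx 0.016, \qquad \frac{1}{\mathrm{OR}_{\text{bias}}} = e^{4.15} \approx 63
$$

$$
\mathrm{OR}_{\text{oracle}} = 0.885 \quad\Rightarrow\quad 1 - 0.885 = 0.115 \;\; (\text{an } 11.5\% \text{ reduction in odds})
$$

Exponentiating the rounded coefficient $-0.12$ gives roughly $0.887$, so the reported $0.885$ presumably comes from the unrounded posterior mean.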

This regression is not a license to ignore tool quality in real systems. In a production agent, bad retrieval, bad APIs, and bad databases can absolutely poison reasoning. But in this benchmark, the dominant failure is not the oracle. It is the player’s testing strategy.

That is a valuable diagnostic separation. Before blaming the tool, check whether the agent ever asked a question that could have overturned its own working theory.

The turn-level analysis is where the paper earns its argument

The paper’s strongest section is not the model leaderboard. It is the turn-level analysis.

A simple confirmation-bias score tells us whether the model intended to confirm or falsify its hypothesis. But that is not enough. Negative testing is not always optimal. If $H \supset R$, positive testing can be the correct way to falsify the hypothesis. The authors therefore annotate the relation between $H$ and $R$ across all test turns, using GPT-5-Mini as an offline annotator.

This matters because it checks whether FalsifyBench really creates the intended Wason-like structure.

It does. For almost all models, the dominant relation is $H \subset R$, meaning the hypothesis is too narrow, and it appears in more than 50% of test turns. The major exception is GPT-5.2-Chat, which has a lower share of $H \subset R$ turns and a higher share of partial-overlap relations.

Now the mechanism becomes visible. In the dominant $H \subset R$ case, positive testing is uninformative. Negative testing is uniquely capable of forcing upward revision.

The authors then compute conclusive falsification rate: the proportion of test turns that actually yield a conclusive falsification given the relation between the current hypothesis and the true rule. This is better than merely counting “negative” tests. It asks whether the test actually had the logical power to reject the current hypothesis.

Here, reasoning models again lead:

GPT-OSS-20B: 59.2%;
GPT-OSS-120B: 53.4%;
GPT-5.2-Chat: 49.5%.

The lowest rates are Llama-4-Maverick at 18.0% and GPT-5-Nano at 26.6%. Confirmation bias and conclusive falsification rate are strongly negatively correlated, with Spearman $\rho = -0.937$ and $p < 0.001$.

That result is the core of the paper. Models do not fail simply because they guess too early, or because the task is semantically weird, or because the oracle is imperfect. They fail because they do not reliably generate tests that can make their current hypothesis untenable.

In business terms: the agent lacks an internal adversary.
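The conclusive falsification rate described above can be approximated on toy data. The sketch assumes a turn counts as conclusive when at least one tested item receives feedback that contradicts the current hypothesis; that reading, along with the `Turn` record and the word sets, is our illustration and not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    hypothesis: frozenset[str]  # items the player currently believes conform
    rule: frozenset[str]        # items that truly conform
    items: tuple[str, ...]      # items tested on this turn


def is_conclusive(turn: Turn) -> bool:
    """True if some tested item's true status contradicts the hypothesis."""
    return any((item in turn.hypothesis) != (item in turn.rule) for item in turn.items)


def conclusive_falsification_rate(turns: list[Turn]) -> float:
    """Share of test turns that conclusively falsify the current hypothesis."""
    if not turns:
        raise ValueError("no test turns to score")
    return sum(is_conclusive(t) for t in turns) / len(turns)


# Toy sets, invented for illustration.
MAMMAL = frozenset({"dog", "cat", "horse"})
ANIMAL = MAMMAL | {"sparrow", "trout"}

turns = [
    Turn(MAMMAL, ANIMAL, ("dog", "cat", "horse")),       # positive test, confirms only
    Turn(MAMMAL, ANIMAL, ("sparrow", "oak", "cat")),     # "sparrow" conforms: falsifies
    Turn(MAMMAL, ANIMAL, ("oak", "hammer", "granite")),  # negative test, still uninformative
]
rate = conclusive_falsification_rate(turns)
print(f"{rate:.1%}")
```

The third turn is the reason this metric is stricter than counting negative tests: probing outside the hypothesis only helps when the probe lands inside the true rule.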

The appendix is not decoration; it tells us which evidence is doing what

The paper’s appendix is unusually important because it clarifies what kind of support each analysis provides. Not every appendix result should be read as a second main claim. Some parts define the normative logic of the task; others are implementation details; others are robustness or exploratory failure analyses.

Paper component	Likely purpose	What it supports	What it does not prove
Main model performance and confirmation-bias results	Main evidence	Reasoning models generally perform better, and positive-testing-heavy models perform worse.	It does not alone prove why positive testing fails.
Regression separating confirmation bias and oracle error	Main explanatory evidence	Player strategy matters more than sampled oracle error in this setup.	It does not prove oracle quality is irrelevant in all agent systems.
Turn-level relation annotation	Mechanism evidence	The benchmark is dominated by $H \subset R$, where negative testing is the informative route.	It depends on the quality of the offline relation annotator.
Conclusive falsification rate	Mechanism evidence	Successful models more often generate tests that logically force revision.	It is still measured inside a curated WordNet game space.
Target-rule decomposition	Robustness / sensitivity test	Some semantic domains, especially “artifact,” are much harder than others.	It does not establish a universal hierarchy of business-domain difficulty.
Linguistic-feature classifier	Exploratory failure analysis	Failed games often contain surface-level word-pattern hypotheses.	The classifier is heuristic and can miss subtle linguistic drift.
Qualitative game traces	Interpretive example	Shows how upward revision succeeds and how drift can derail the game.	One trace does not quantify prevalence by itself.

This distinction matters for readers who are allergic to benchmark papers, usually for understandable reasons. The benchmark score is only the surface. The paper’s stronger contribution is the instrumented path: hypothesis, test, feedback, revision, failure mode.

For agent evaluation, that path is more useful than another bar chart wearing a suit.

Failure is often a path problem, not a knowledge problem

The paper’s failure analysis goes beyond confirmation bias. Failed games show more turns where the model’s hypothesis is either partially overlapping with or disjoint from the true rule. This means the model is not simply stuck at a narrow special case; it is drifting into nearby-but-wrong or fully off-track concepts.

For stronger reasoning models, partial overlap is especially revealing. Failed games show substantially higher partial-overlap turns than successful games, with reported gaps of +27.0 percentage points for GPT-5.2-Chat, +22.3 points for GPT-OSS-120B, and +21.4 points for GLM-5.

This is the sophisticated failure mode. The model is not babbling. It is near the target, but its revision path is messy. It circles adjacent concepts without committing to the clean upward move.

The qualitative traces make this concrete.

In the successful example, the target rule is “body part,” while the initial examples come from “sense organ.” The model first considers specialized sensory structures. It tests boundary cases. It receives feedback showing that non-sensory organs such as heart, kidney, and lung also conform. It then revises upward and guesses “anatomical body parts.”

That is exactly the desired mechanism: narrow hypothesis, negative test, upward generalization.
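The upward move can be sketched as a walk up a taxonomy: when an item outside the hypothesis conforms, generalize to the nearest category that covers both. The tree below is a hand-made toy loosely echoing this example, not WordNet, and `revise_upward` is a hypothetical helper rather than anything the paper specifies.

```python
# Toy child -> parent taxonomy, invented for illustration (not WordNet itself).
PARENT = {
    "sense organ": "organ",
    "internal organ": "organ",
    "organ": "body part",
    "limb": "body part",
    "body part": "entity",
    "eye": "sense organ",
    "ear": "sense organ",
    "heart": "internal organ",
    "kidney": "internal organ",
    "arm": "limb",
}


def ancestors(node: str) -> list[str]:
    """Return the node followed by its ancestors, nearest first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def revise_upward(hypothesis: str, conforming_item: str) -> str:
    """Smallest category on the hypothesis's ancestor chain that also covers the item."""
    covers_item = set(ancestors(conforming_item))
    for category in ancestors(hypothesis):
        if category in covers_item:
            return category
    raise ValueError("no shared ancestor in the toy taxonomy")


h = "sense organ"
for item in ["heart", "arm"]:  # out-of-hypothesis items the oracle accepted
    h = revise_upward(h, item)
    print(item, "->", h)
```

Each conforming counterexample lifts the hypothesis exactly as far as the evidence requires and no further, which is the disciplined version of the revision path.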

The failed example has target rule “artifact” and initial examples from “electrical device.” The model begins sensibly with coil-based devices, then electrical components, then mechanical parts, then tangible physical objects. It is searching. But it drifts into engineering-related terms, components of larger systems, non-living things, countable nouns, and even letter or syllable patterns. It reaches the 20-turn limit without identifying “artifact.”

The embarrassing part is not that the model tried a strange hypothesis. Strange hypotheses are allowed. The embarrassing part is the failure to maintain a disciplined search path through the semantic hierarchy.

Surface-level rules are a warning light for agent drift

One of the paper’s sharper observations is that failed games often contain hypotheses based on surface-level linguistic properties: starting letters, syllables, vowels, word count, pronunciation, and similar features.

This failure mode is particularly visible for high-performing models. At least one linguistic hypothesis appears in 97.6% of GLM-5’s failed games, 92.0% of GPT-5.2-Chat’s, and 84.4% of GPT-OSS-120B’s. Weaker models such as GPT-5-Nano and Qwen3.5-9B do not show the same significant contrast between successful and failed games, likely because their failure is more general.

This distinction is useful. Stronger models may fail in more diagnostic ways. They get close enough to have identifiable derailments.

For business systems, this suggests a practical monitoring rule: do not only log the final answer. Log the kind of hypothesis the agent is entertaining. If a semantic reasoning agent suddenly starts using surface-form rules, it may be compensating for uncertainty with pattern trivia.

That is not always wrong. Sometimes surface form matters. But in a semantic task, it is a warning light.
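A crude version of that warning light fits in a few lines. The cue list below is hypothetical and deliberately simple; it is not the paper's classifier, and it will produce false positives (a postal “letter,” for instance) as well as misses.

```python
import re
from typing import Optional

# Hypothetical cue list for surface-form hypotheses; tune it for your own domain.
SURFACE_CUES = re.compile(
    r"\b(letters?|syllables?|vowels?|consonants?|rhym\w*|pronunciation|"
    r"spell\w*|word count|starts? with|ends? with)\b",
    re.IGNORECASE,
)


def is_surface_hypothesis(hypothesis: str) -> bool:
    """Flag hypotheses phrased in terms of word form rather than meaning."""
    return SURFACE_CUES.search(hypothesis) is not None


def first_drift_turn(hypotheses: list[str]) -> Optional[int]:
    """Index of the first flagged hypothesis, or None if nothing is flagged."""
    for turn, hypothesis in enumerate(hypotheses):
        if is_surface_hypothesis(hypothesis):
            return turn
    return None


# Hypothesis sequence paraphrased from the failed trace described earlier.
trace = [
    "coil-based devices",
    "electrical components",
    "mechanical parts",
    "tangible physical objects",
    "countable nouns",
    "words with two syllables",
]
print(first_drift_turn(trace))
```

A flag like this is a prompt for human review, not a verdict. Its value is in telling you at which turn the agent stopped reasoning about meaning.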

“Artifact” is hard because some categories do not generalize cleanly

The appendix target-rule decomposition adds an important boundary to the main story. The hardest target category is “artifact.” Across models and metrics, artifact games have low success rates, high confirmation bias, and low conclusive falsification rates. Most models are in the 0–30% success range for this target category.

The authors suggest a plausible reason: artifact subcategories are semantically heterogeneous. “Artifact” contains electrical devices, utensils, musical instruments, weapons, vessels, paintings, tubes, lamps, robes, and many other man-made things. The upward path is not as clean as moving from “vertebrate” to “animal.”

By contrast, “animal” is the easiest target for most reasoning models. GPT-5.2-Chat reaches 100% success on animal games, and several other reasoning models exceed 75%.

This matters because business taxonomies often look more like “artifact” than “animal.”

A corporate policy category, a customer segment, a procurement risk label, or an operational incident class may contain heterogeneous subtypes. In these cases, the agent may need more than negative testing. It may need an explicit ontology, stronger retrieval, human-defined boundary cases, and a mechanism for representing multiple candidate generalizations at once.

Cognaptus inference: FalsifyBench is not telling businesses that negative testing solves all reasoning. It is telling them that without negative testing, even the first step of disciplined hypothesis revision is missing.

What this changes for business AI evaluation

The practical lesson is not “use FalsifyBench as your enterprise benchmark.” The benchmark is built on English WordNet categories, curated semantic games, and a specific narrow-to-broad discovery pattern. It is valuable, but it is not a direct simulation of legal review, credit analysis, customer research, or strategy consulting.

The better takeaway is an evaluation design principle:

Do not evaluate agent reasoning only by final-answer correctness. Evaluate whether the agent can construct tests that would falsify its own current hypothesis.

That principle leads to a different kind of agent audit.

Evaluation question	Weak agent behavior	Stronger agent behavior
Does the agent state its current hypothesis?	It jumps straight to an answer.	It makes the working hypothesis explicit.
Does it test outside the hypothesis?	It gathers compatible examples.	It deliberately probes boundary and counterexample cases.
Does it distinguish tool error from reasoning error?	It blames bad data generically.	It separates feedback reliability from its own search strategy.
Does it revise upward when evidence demands it?	It patches the narrow hypothesis.	It generalizes to a broader category when boundary tests conform.
Does it recognize drift?	It keeps inventing adjacent explanations.	It detects partial-overlap and disjoint hypotheses as failure states.
Does it rely on guessing?	It submits repeated guesses and learns from “Incorrect.”	It uses targeted tests before committing.
Does it produce surface-pattern hypotheses in semantic tasks?	It grabs letters, syllables, or naming quirks.	It stays aligned with the semantic level of the task.

This kind of audit is especially relevant for research agents, compliance review agents, diagnostic assistants, market-intelligence systems, and due-diligence workflows. In all of these settings, false confidence is expensive. The agent does not merely need to answer. It needs to know how to unsettle its own answer.

What the paper directly shows, and what Cognaptus infers

The paper directly shows four things.

First, FalsifyBench can operationalize a semantic version of the Wason task using WordNet categories and multi-turn LLM agents.

Second, across 12 models and 100 curated games, reasoning models generally outperform instruction-tuned models, but even the best model remains far from perfect.

Third, success is strongly associated with lower confirmation bias and higher conclusive falsification rate.

Fourth, failure is not random. Failed games often involve partial-overlap drift, disjoint hypotheses, passive guessing, and surface-level linguistic rules.

Cognaptus infers several business design lessons from those results.

One, agent evaluations should include hypothesis-path logging, not just answer scoring. A final answer hides whether the agent reasoned well or stumbled into correctness.

Two, agent prompts and workflows should include an explicit counterexample budget: before finalizing an interpretation, the agent must test cases that would be awkward for its current view.

Three, evaluators should separate feedback-channel quality from agent strategy quality. Bad retrieval and bad tools matter, but an agent that never asks a falsifying question can fail even with adequate feedback.

Four, organizations should monitor semantic drift. If an agent moves from a business concept to wording quirks, naming patterns, or arbitrary adjacent categories, it may be substituting pattern search for conceptual revision.
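The first two lessons can be combined into one small data structure: a hypothesis log that refuses to finalize until enough outside-the-hypothesis probes have been run and none of the recorded feedback contradicts the hypothesis. Everything here is a hypothetical sketch, including the names `Step` and `HypothesisLog`, the default budget of two probes, and the finalization rule itself.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    hypothesis: str
    test_item: str
    inside_hypothesis: bool  # did the agent expect this item to fit?
    conforms: bool           # what the feedback channel reported

    @property
    def contradicts(self) -> bool:
        """Feedback disagrees with what the hypothesis predicted."""
        return self.inside_hypothesis != self.conforms


@dataclass
class HypothesisLog:
    counterexample_budget: int = 2  # hypothetical threshold, tune per workflow
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def may_finalize(self, hypothesis: str) -> bool:
        """Allow a final answer only if the hypothesis survived enough outside probes."""
        relevant = [s for s in self.steps if s.hypothesis == hypothesis]
        outside_probes = sum(not s.inside_hypothesis for s in relevant)
        survived = not any(s.contradicts for s in relevant)
        return survived and outside_probes >= self.counterexample_budget


log = HypothesisLog()
log.record(Step("mammal", "dog", inside_hypothesis=True, conforms=True))
log.record(Step("mammal", "sparrow", inside_hypothesis=False, conforms=True))
log.record(Step("animal", "oak", inside_hypothesis=False, conforms=False))
log.record(Step("animal", "hammer", inside_hypothesis=False, conforms=False))

print(log.may_finalize("mammal"))
print(log.may_finalize("animal"))
```

In this toy run “mammal” is blocked because the sparrow probe contradicted it, while “animal” passes after two outside probes came back as predicted. The same log doubles as the audit trail for reviewing how the agent got there.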

What remains uncertain is equally important. FalsifyBench is a controlled semantic benchmark. It does not prove that today’s strongest models can act as autonomous scientists. It does not cover all forms of scientific reasoning. It does not test messy real-world data pipelines, political incentives, strategic deception, or organizational ambiguity. It also relies on WordNet, an English-centric taxonomy with its own lexicographic assumptions.

Those boundaries do not weaken the paper. They keep the lesson precise.

The real benchmark is whether the agent can leave its favorite answer behind

The best way to read FalsifyBench is not as another contest between model names. Model names change. The mechanism is more durable.

A model sees narrow evidence. It forms a narrow hypothesis. It can either keep collecting evidence that fits, or it can ask a question that might force the hypothesis to expand. Successful models do more of the second. Failed models either confirm themselves into a corner, drift into adjacent concepts, guess passively, or retreat into surface patterns.

This is exactly the failure mode businesses should worry about as AI agents move from chat interfaces into workflows. The danger is not always hallucination in the cartoon sense. Sometimes the answer is locally coherent, supported by examples, and still too narrow.

That is worse, because it looks responsible.

A serious agent should be able to say: “Here is my current theory. Here is the test that could break it. Here is what changed after the test.”

Until then, many “reasoning agents” are still doing what humans have always done very elegantly: protecting the first plausible idea from the inconvenience of evidence.

Cognaptus: Automate the Present, Incubate the Future.

¹ Leonardo Bertolazzi, Katya Tentori, and Raffaella Bernardi, “FalsifyBench: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games,” arXiv:2606.04751v1, 3 June 2026, https://arxiv.org/abs/2606.04751.
