A promise is rarely just a logical operator.
“If you mow the lawn, I’ll give you 50 dollars” does not sound like a philosophical exercise in truth tables. It sounds like a deal. Most people hear it as: no mowing, no money. By contrast, “If you’re hungry, there’s pizza in the oven” does not mean the pizza appears only under the metaphysical condition of your hunger. It means the pizza is there, and your hunger merely explains why I am telling you.
This is the small, ordinary distinction that Large Language Models still struggle to make reliably. Not because they cannot process an “if.” They often can. The problem is sharper: they can treat conditionals as formal objects while missing the human habit of enriching them with context, relevance, speaker intention, and world knowledge.
That distinction is the center of Paolo Morosi and colleagues’ paper, “Tracing the ongoing emergence of human-like reasoning in Large Language Models” [1]. The paper tests 25 LLMs against human participants across Catalan, English, Italian, and Spanish using two kinds of conditionals: standard conditionals, where humans often infer a strengthened “if and only if” reading, and biscuit conditionals, where humans often infer that the consequent is true regardless of the antecedent.
The business lesson is not “LLMs cannot reason.” That would be too easy, and therefore probably wrong. The better lesson is more inconvenient: benchmark-friendly reasoning and operational meaning are not the same thing.
The important split is not true versus false, but semantic versus pragmatic
Classical logic gives the conditional $p \rightarrow q$ a clean definition: it is false only when $p$ is true and $q$ is false. If $p$ is false, the conditional is treated as vacuously true.
That is elegant. It is also not how people usually interpret business language.
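The truth-table definition can be written out in a few lines. This is a minimal sketch of the classical material conditional only; the function name `material_conditional` is ours, not the paper's.

```python
from itertools import product


def material_conditional(p: bool, q: bool) -> bool:
    """Classical 'if p then q': false only when p is true and q is false."""
    return (not p) or q


# Enumerate all four rows of the truth table.
rows = [(p, q, material_conditional(p, q)) for p, q in product([True, False], repeat=2)]

for p, q, value in rows:
    print(f"p={p!s:<5} q={q!s:<5} p->q={value}")
```

Both rows with a false antecedent come out true, which is exactly the vacuous-truth behavior described above.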
In ordinary communication, people do not merely ask whether a sentence satisfies a truth table. They ask why the speaker said it, what alternatives were left unsaid, what causal or contractual relation is implied, and whether the statement is relevant in the situation. This is where pragmatic enrichment enters.
The paper focuses on two forms of enrichment:
| Conditional type | Example | Formal-logical reading | Human pragmatic reading |
|---|---|---|---|
| Standard conditional | “If you mow the lawn, I’ll give you 50 dollars.” | The sentence is false only if mowing happens and payment does not. | The payment depends on mowing; no mowing, no payment. |
| Biscuit conditional | “If you’re hungry, there’s pizza in the oven.” | The sentence is false only if you are hungry and there is no pizza; if you are not hungry, it is vacuously true. | The pizza is there regardless of hunger; hunger explains relevance, not causality. |
The first case is often called a perfected conditional: humans strengthen “if $p$, then $q$” into something close to “$q$ if and only if $p$.” The second is the classic biscuit conditional: the consequent is interpreted as independent of the antecedent. Hunger does not manufacture pizza. A useful reminder, apparently necessary.
This distinction matters because a model can be “logical” in one sense and still behave badly in realistic language tasks. A model that rigidly applies truth-table semantics may look disciplined, but it will miss the implied dependency in a promise, rule, policy, workflow condition, or customer-support instruction. A model that rigidly converts every conditional into a biconditional may look more human on standard promises, but it will fail when the conditional is only a relevance cue.
So the problem is not whether the model knows the word “if.” The problem is whether it knows which layer of meaning is active.
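One way to see the three layers side by side is to ask what each reading lets you conclude about the consequent once the antecedent is known to be false. The sketch below is our own simplification, not the paper's formalization: each reading is a Boolean constraint on $p$ and $q$, and the reading names are hypothetical labels.

```python
from typing import Callable, Dict, Optional

# Each reading is modeled as a constraint on (p, q) that holds when the
# speaker's sentence is taken to be true. These encodings are assumptions.
READINGS: Dict[str, Callable[[bool, bool], bool]] = {
    "material": lambda p, q: (not p) or q,  # truth-table conditional
    "perfected": lambda p, q: p == q,       # strengthened to "q if and only if p"
    "biscuit": lambda p, q: q,              # q holds regardless of p
}


def consequent_when_antecedent_false(reading: str) -> Optional[bool]:
    """Return the value of q forced by the reading when p is false, or None if q is left open."""
    holds = READINGS[reading]
    possible = {q for q in (True, False) if holds(False, q)}
    return possible.pop() if len(possible) == 1 else None


for name in READINGS:
    print(name, "->", consequent_when_antecedent_false(name))
```

Under the material reading the consequent stays open, under the perfected reading it is false (no mowing, no money), and under the biscuit reading it is true (the pizza is there anyway).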
The benchmark isolates the layer where humans add meaning
The study’s design is useful because it does not simply ask models to solve abstract logic puzzles. It places conditionals inside short contexts and asks whether the consequent should be judged true.
For each language, the authors built 54 experimental prompts: 27 standard conditionals and 27 biscuit conditionals. These were divided into true controls, false controls, and critical cases. The critical cases are where the task becomes interesting.
In a standard conditional critical case, the antecedent is false. For example: the lawn was not mowed, and Mary had said, “If Paul mows the lawn, he will get 50 euros.” The question asks whether Paul got 50 euros. The human pragmatic answer tends to be “no,” because the deal is interpreted as dependent on mowing.
In a biscuit conditional critical case, the antecedent is also false. John is not hungry, and Mary says, “If John is hungry, there is pizza in the oven.” The question asks whether there is pizza in the oven. The human pragmatic answer tends to be “yes,” because the pizza’s existence is independent of John’s hunger.
That gives the benchmark its diagnostic power. Both critical cases involve a false antecedent. But the target human interpretation goes in opposite directions depending on the conditional type.
| Test component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| True and false controls | Check whether humans and models can handle basic truth-value judgments. | Whether failures are specific to pragmatic enrichment rather than general task confusion. | Real-world reasoning competence in open-ended settings. |
| Critical standard conditionals | Test whether agents infer a perfected, dependency-like reading. | Sensitivity to implied exclusivity or conditional dependence. | Legal or contractual reasoning in full documents. |
| Critical biscuit conditionals | Test whether agents infer independence between antecedent and consequent. | Sensitivity to relevance-based pragmatic meaning. | General pragmatic competence across all conversational acts. |
| Cross-linguistic testing | Check whether the pattern is stable across four related languages. | Some evidence that the phenomenon is not English-only. | Generalization to unrelated or low-resource languages. |
| Model category comparisons | Examine whether open/closed status, dense/MoE design, or generative/reasoning orientation predicts performance. | Whether broad vendor-facing labels explain results. | Causal claims about architecture or training strategy. |
The human sample after exclusions included 313 participants. The model side included 25 LLMs, spanning commercial and open systems, dense and mixture-of-experts architectures, and generative or reasoning-oriented labels. The models were given the same prompts as humans, without extra task-specific instructions.
That last detail matters. In deployment, users rarely phrase every prompt as a formal instruction to distinguish truth-conditional semantics from pragmatic enrichment. They just write normal language. Cruel, but realistic.
Humans do not merely apply the truth table
The human results largely replicate what linguistic theory would predict.
Humans were near ceiling in true control conditions and less perfect in false controls, but the key result appears in the experimental critical cases. Participants interpreted standard conditionals pragmatically at a high rate: the paper reports that humans perfected standard conditionals at more than 88% on average across languages. Biscuit readings appeared at about 57% on average.
That difference is important. Human pragmatic reasoning is not a single magic switch. Standard conditionals receive strong and stable pragmatic strengthening. Biscuit conditionals receive the independence reading only a little more than half the time. That means the paper is not claiming that humans are perfectly consistent pragmatic machines. Humans are not tiny theorem provers wearing sweaters. They show a structured bias toward context-sensitive enrichment, but the strength of that enrichment depends on the construction.
This is exactly why the benchmark is more informative than a generic “reasoning accuracy” score. It separates three layers:
- Can the agent handle the task structure?
- Can it process the conditional’s formal semantics?
- Can it enrich the sentence pragmatically in the human-like direction?
Many model evaluations compress those layers into one number. This paper makes the compression harder to get away with.
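To make the compression problem concrete, here is a small sketch that reports accuracy per cell (conditional type by condition) next to the single headline number. The trial data is invented for illustration and does not come from the paper; "match" means the answer agrees with the typical human response.

```python
from collections import defaultdict

# Hypothetical scored trials: (conditional_type, condition, matches_human_answer)
trials = [
    ("standard", "true_control", True),
    ("standard", "false_control", True),
    ("standard", "critical", True),
    ("standard", "critical", False),
    ("biscuit", "true_control", True),
    ("biscuit", "false_control", True),
    ("biscuit", "critical", False),
    ("biscuit", "critical", False),
]


def accuracy_by_cell(trials):
    """Group trials by (conditional_type, condition) and return accuracy per group."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [matches, total]
    for conditional_type, condition, match in trials:
        cell = counts[(conditional_type, condition)]
        cell[0] += int(match)
        cell[1] += 1
    return {cell: hits / total for cell, (hits, total) in counts.items()}


overall = sum(match for _, _, match in trials) / len(trials)
by_cell = accuracy_by_cell(trials)

print(f"overall accuracy: {overall:.3f}")
for cell, acc in sorted(by_cell.items()):
    print(cell, f"{acc:.3f}")
```

In this toy data the headline number looks middling, while the breakdown shows perfect controls and a collapse on biscuit critical cases. That is the pattern a single score hides.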
LLMs look competent in controls, then stumble where context matters
The paper’s main statistical result is straightforward: humans and LLMs do not differ meaningfully in the control conditions, but they diverge sharply in the experimental conditions requiring pragmatic interpretation.
The logistic mixed-effects model shows that the experimental manipulation affected humans and LLMs differently. In a focused analysis of experimental trials, LLMs performed at a significantly lower level than humans, while both groups showed higher accuracy for standard than biscuit conditionals.
The practical interpretation is simple: LLMs are not failing everywhere. They fail where the task requires moving from the literal conditional to the intended communicative meaning.
That is more damaging than a generic failure. A generic failure can often be solved by better prompting, retrieval, or narrower task design. A selective failure at the semantics-pragmatics boundary is harder because the model may look reliable under basic tests, then misread the very cases that matter in operations.
Consider these business sentences:
| Business sentence | Literal structure | Likely operational meaning | Risk if the model misses pragmatics |
|---|---|---|---|
| “If the invoice is approved, release payment.” | Conditional rule | Approval is a necessary condition for payment. | Payment may be treated as allowed under unrelated conditions. |
| “If the customer asks, the warranty form is in the portal.” | Biscuit-like relevance cue | The form exists regardless of whether the customer asks. | The model may infer the form is conditional on the request. |
| “If the vendor misses the deadline, escalate to procurement.” | Conditional trigger | Deadline failure activates escalation. | The model may miss the implied exclusivity or trigger logic. |
| “If you need the audit trail, it is attached to the ticket.” | Biscuit-like support statement | Attachment exists; need explains relevance. | The model may answer as though attachment existence is uncertain. |
In workflow automation, this is not a philosophical annoyance. It affects escalation, compliance routing, payment release, support resolution, and contract interpretation. The edge cases are not exotic; they are written every day by people who think language is obvious because they are not machines.
Two model failure modes: truth-table rigidity and biconditional rigidity
The most useful part of the paper is not merely that LLMs underperform humans. It is the behavioral profile.
The authors identify two broad model patterns.
The first group adheres closely to truth-table semantics. These models treat false-antecedent conditionals as vacuously true and therefore fail when the task requires contextual reasoning about the consequent. This is the “formal operator” failure mode. It is clean, internally consistent, and wrong for many human communication settings.
The second group strengthens conditionals into biconditionals across the board. These models perform better on perfected standard conditionals, because “if $p$, then $q$” is often pragmatically understood to also mean “$q$ only if $p$.” But the same rule breaks biscuit conditionals. If every conditional becomes a deal, then pizza starts depending on hunger. This is how a superficially human-like rule turns into a brittle shortcut.
There were also less tidy cases: some models showed variable behavior, and one model tended toward positive responses across conditions. But the two dominant profiles explain the broader pattern. The models are often not flexibly choosing between semantic and pragmatic readings. They are applying a single interpretive strategy too widely.
That is the key mechanism. A model can be wrong not because it lacks a rule, but because it has the wrong level of rule.
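A rough way to turn these profiles into a diagnostic is to look at the rate of human-like answers on the two critical cell types separately. The sketch below is our own heuristic, not the authors' classification procedure: the thresholds (0.75 and 0.25) and the profile labels are assumptions chosen for illustration.

```python
def classify_profile(
    standard_rate: float,
    biscuit_rate: float,
    high: float = 0.75,  # assumed cutoff for "mostly human-like"
    low: float = 0.25,   # assumed cutoff for "rarely human-like"
) -> str:
    """Label a model from its human-like answer rates on critical standard and biscuit items."""
    if standard_rate >= high and biscuit_rate >= high:
        return "flexible"                # switches readings as humans tend to
    if standard_rate >= high and biscuit_rate <= low:
        return "biconditional_rigidity"  # every conditional treated as a deal
    if standard_rate <= low and biscuit_rate >= high:
        return "yes_bias"                # affirms the consequent everywhere
    if standard_rate <= low and biscuit_rate <= low:
        return "no_enrichment"           # consistent with truth-table rigidity
    return "mixed"                       # variable behavior, no clear strategy


print(classify_profile(0.95, 0.10))
print(classify_profile(0.10, 0.05))
print(classify_profile(0.60, 0.50))
```

The point of the sketch is that two models with the same average can land in different buckets, and the bucket tells you more about deployment risk than the average does.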
Model rankings are less useful than diagnostic profiles
The paper reports large variation across individual models. In experimental trials, mean accuracies ranged from 0.000 for Distil-BERT to 0.833 for Llama 3.3. Some models, including Llama 3.3 70B Instruct and Kimi K2-Instruct-0905, maintained accuracies above 0.75 across languages. Others, including DeepSeek-v3.1 and GPT-5 in the authors’ test, remained near chance regardless of language.
A weak reading of this result would turn it into a leaderboard.
That would miss the point.
The stronger reading is diagnostic: models can reach similar headline performance for different reasons, and high performance on one conditional type may be purchased by rigidity on another. Llama 3.3, for example, is reported as highly accurate on standard conditionals but much weaker on biscuit conditionals. That profile suggests strong biconditional strengthening, not necessarily flexible pragmatic reasoning.
This distinction is essential for business model selection. A procurement team should not ask only, “Which model scored highest?” It should ask:
- Did the model succeed by applying a brittle shortcut?
- Does the model distinguish dependency conditionals from relevance conditionals?
- Does performance survive changes in domain, language, and document style?
- Does the model explain uncertainty when the intended interpretation is underdetermined?
A model that wins a narrow task for the wrong reason can become expensive in production. The bill arrives later, usually disguised as “human review backlog.”
The architecture labels do not rescue us
The paper also tests whether broad model categories explain the variation: open versus closed, dense versus mixture-of-experts, and generative versus reasoning-oriented.
They do not.
The mean differences are small and statistically indistinguishable. Open models averaged 0.626 accuracy and closed models 0.668. Dense models averaged 0.636 and MoE models 0.664. Generative, hybrid, and reasoning-oriented models clustered around 0.638, 0.653, and 0.648 respectively. None of these high-level design labels significantly improved model fit.
This is a useful antidote to procurement folklore.
“Reasoning model” is not a guarantee of pragmatic reasoning. “MoE” is not a magic wand. “Closed” is not automatically better. “Open” is not automatically more interpretable in behavioral terms. These categories may matter for cost, deployment control, latency, privacy, and governance. But in this task, they do not explain the pragmatic competence that the benchmark measures.
For business users, the implication is blunt: do not outsource evaluation to architecture labels. Test the behavior you need.
Language was not the main bottleneck, but the leakage is a warning
The study finds similar mean model accuracy across Catalan, English, Italian, and Spanish, ranging from roughly 0.639 in English to 0.652 in Catalan. Adding language as a random intercept did not improve model fit, suggesting that the pragmatic limitation was not primarily language-specific within this set.
That finding should be interpreted carefully.
The four languages are related, and the paper itself notes cross-linguistic leakage: some models answered in English, Spanish, or another language when queried in a different target language. Catalan was especially vulnerable, with some responses shifting into Spanish or English.
So the practical message is not “multilingual deployment is safe.” The better message is: in this controlled task, the main failure signal was model-specific rather than language-specific, but multilingual behavior still needs separate evaluation. A model can preserve the right answer distribution while still showing operational language instability. For customer support, legal intake, public-sector services, or multilingual back-office automation, that instability matters.
Language compliance is part of task compliance. The model does not get extra credit for being correct in the wrong language. Charming, perhaps. Useful, less so.
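In an evaluation harness, that principle amounts to scoring the answer and the response language separately, then requiring both. The following is a minimal sketch under stated assumptions: `score_response` and `toy_detect_language` are hypothetical names, and the toy detector is a keyword stub standing in for a real language-identification tool.

```python
from typing import Callable, Dict


def toy_detect_language(text: str) -> str:
    """Placeholder detector for this demo only; replace with a real language-ID tool."""
    lowered = f" {text.lower()} "
    if " hi ha " in lowered:  # Catalan "there is"
        return "ca"
    if " hay " in lowered:    # Spanish "there is"
        return "es"
    return "unknown"


def score_response(
    answer: str,
    expected: str,
    response_text: str,
    target_lang: str,
    detect_language: Callable[[str], str],
) -> Dict[str, bool]:
    """Score answer correctness and language fidelity separately, then combine them."""
    answer_ok = answer == expected
    language_ok = detect_language(response_text) == target_lang
    return {
        "answer_ok": answer_ok,
        "language_ok": language_ok,
        "task_ok": answer_ok and language_ok,
    }


# Prompt was in Catalan, but the model replied in Spanish with the right answer.
leaky = score_response("yes", "yes", "Sí, hay pizza en el horno.", "ca", toy_detect_language)
clean = score_response("yes", "yes", "Sí, hi ha pizza al forn.", "ca", toy_detect_language)
print(leaky)
print(clean)
```

Keeping the two flags separate lets a team see whether a model is wrong, or right in the wrong language, which usually calls for different fixes.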
What this means for AI evaluation inside firms
The paper’s business relevance is clearest when translated into evaluation design.
Most firms already know they should test accuracy. Fewer test whether the model is using the right interpretive strategy. That is the gap this paper exposes.
A useful evaluation set for business conditionals should include at least three categories:
| Evaluation category | Example | What it tests |
|---|---|---|
| Dependency conditionals | “If approval is missing, block payment.” | Whether the model treats the antecedent as operationally necessary. |
| Relevance conditionals | “If the auditor asks, the supporting file is in SharePoint.” | Whether the model recognizes that the consequent may hold independently. |
| Ambiguous conditionals | “If the client requests escalation, a senior analyst reviews the file.” | Whether the model asks for clarification or states assumptions when policy intent is unclear. |
The third category is especially important. Real business language is often under-specified. A good model should not always force a reading. Sometimes the correct operational behavior is to preserve ambiguity and request clarification.
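The three categories can be encoded as a small test set where the ambiguous cases accept clarification as a passing behavior. This is an illustrative sketch: the behavior labels are hypothetical and would in practice be assigned by a human reviewer or a separate grading step, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    sentence: str
    category: str  # "dependency", "relevance", or "ambiguous"


# Assumed mapping from category to behaviors that count as a pass.
ACCEPTABLE = {
    "dependency": {"consequent_depends_on_antecedent"},
    "relevance": {"consequent_holds_independently"},
    "ambiguous": {"asks_for_clarification", "states_assumptions"},
}

CASES = [
    EvalCase("If approval is missing, block payment.", "dependency"),
    EvalCase("If the auditor asks, the supporting file is in SharePoint.", "relevance"),
    EvalCase("If the client requests escalation, a senior analyst reviews the file.", "ambiguous"),
]


def passes(case: EvalCase, observed_behavior: str) -> bool:
    """A case passes when the observed behavior is acceptable for its category."""
    return observed_behavior in ACCEPTABLE[case.category]


# Example: a model that forces a dependency reading on every conditional.
observed = ["consequent_depends_on_antecedent"] * len(CASES)
results = [passes(case, behavior) for case, behavior in zip(CASES, observed)]
print(results)
```

A model that forces one reading everywhere passes the first case and fails the other two, including the ambiguous case where the better behavior is to ask.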
This gives firms a better testing framework:
| Question | Bad evaluation habit | Better evaluation habit |
|---|---|---|
| Can the model reason over policies? | Test abstract logic puzzles. | Test policy-like conditionals with dependency, relevance, and ambiguity. |
| Can the model handle multilingual workflows? | Translate the same prompt and compare final answers only. | Check answer, language fidelity, explanation stability, and error mode. |
| Can the model support contracts or compliance? | Ask for summaries. | Test implied conditions, exceptions, and trigger logic. |
| Is a reasoning model enough? | Trust model label or benchmark marketing. | Run domain-specific pragmatic tests before deployment. |
| Is the model safe for automation? | Measure average answer correctness. | Identify failure modes that cause irreversible actions. |
The ROI relevance is not that pragmatic testing sounds intellectually refined. The ROI relevance is that it catches expensive misinterpretations before they become workflow errors.
In accounting, a missed condition can release a payment too early. In compliance, it can misroute an exception. In customer support, it can create a promise the firm did not intend to make. In HR, it can misread eligibility conditions. In procurement, it can confuse a relevance note with a contractual dependency.
These are not “AI hallucinations” in the usual sense. They are interpretation failures. Different disease, different treatment.
What the paper directly shows, and what Cognaptus infers
The paper directly shows that, in a controlled truth-value judgment task, humans and LLMs diverge most in conditions requiring pragmatic enrichment. It shows that LLMs often behave like reliable semantic operators but less reliable pragmatic reasoners. It also shows substantial model-level heterogeneity that is not explained by broad categories such as open/closed status, dense/MoE architecture, or generative/reasoning label.
Cognaptus infers that business AI evaluations should include pragmatic conditional tests before models are deployed into workflows involving policies, contracts, customer promises, support instructions, or compliance rules. This is an inference, not a direct experimental claim, because the paper does not test real enterprise documents or production workflows.
What remains uncertain is how much richer context would help. The experiment uses controlled sentence-level prompts. In real documents, surrounding clauses, headings, examples, and historical records may help models infer speaker intent. Or they may introduce new ambiguity. The paper does not settle that. It gives us a clean diagnostic signal, not the final word on enterprise language understanding.
The boundary: this is not a universal model ranking
The limitations matter because they constrain how the result should be used.
First, the language sample covers four related languages. The cross-linguistic stability is informative, but it does not establish generalization to unrelated languages or lower-resource settings.
Second, the model categories are not a balanced causal experiment. The finding that broad labels do not predict performance is useful, but it does not prove that architecture or training never matters. It means these coarse labels failed to explain performance in this benchmark.
Third, the task is sentence-level. That is a strength for isolating mechanism and a limitation for deployment inference. Enterprise settings involve longer documents, retrieval context, user histories, tool calls, and sometimes explicit procedural rules.
Fourth, the benchmark uses manually mapped free-text responses. That is reasonable for this kind of study, but production systems often need calibrated structured outputs, not just plausible explanations.
So the right business use is not to declare a winner or loser among models. The right use is to improve the test suite.
The practical lesson: test meaning, not just logic
The most tempting misconception is that a model that handles logical conditionals is reasoning like a human.
This paper shows why that is too generous. Human interpretation does not stop at the truth table. It adds relevance, dependence, exhaustivity, speaker intention, and world knowledge. LLMs can approximate some of this, sometimes impressively, but the behavior remains uneven and often rule-bound.
For businesses, the lesson is not to avoid LLMs in conditional workflows. That would be lazy pessimism. The lesson is to stop treating “reasoning” as a single capability. A model may be good at formal consistency, weak at pragmatic enrichment, strong in one language, leaky in another, accurate under one conditional pattern, and brittle under another.
The next generation of enterprise AI evaluation should therefore include small, sharp tests of ordinary language: promises, exceptions, eligibility rules, support notes, contractual triggers, and relevance cues.
Because in business, “if” is rarely just “if.”
Sometimes it is a condition. Sometimes it is a hint. Sometimes it is a promise. Sometimes it is a trap.
And if the model cannot tell the difference, there is probably pizza in the oven — but it may still deny lunch.
Cognaptus: Automate the Present, Incubate the Future.
[1] Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, and Evelina Leivada, “Tracing the ongoing emergence of human-like reasoning in Large Language Models,” arXiv:2605.21299, https://arxiv.org/pdf/2605.21299.