Contract.

A supplier writes, “If payment is received by Friday, the discount applies.” Most business readers do not treat this as a detached logic puzzle. They hear a practical rule: pay by Friday, get the discount; miss Friday, probably no discount. The phrase carries intent, relevance, and a small but important threat wrapped in polite operational language.

Now change the sentence: “If you need the invoice, it is attached.” Here the attachment does not depend on your need. The invoice is attached either way. The condition is not a gate; it is a courtesy.

Both sentences use “if.” Humans do not interpret them the same way. That is the entire problem.

The arXiv paper “Tracing the ongoing emergence of human-like reasoning in Large Language Models” tests whether 25 large language models can make this distinction in the same way as 313 human participants across Catalan, English, Italian, and Spanish [1]. The answer is not the usual cartoon version of “LLMs cannot reason.” The more interesting answer is worse for lazy deployment: many models can be good logical operators while still being weak pragmatic interpreters. They can process the formal skeleton and miss the social muscle.

That distinction matters because business language is rarely pure logic. Contracts, support policies, compliance instructions, sales promises, HR rules, insurance clauses, and procurement messages all live in the space between what was literally said and what a reasonable participant understands the speaker to mean. If an AI system handles the literal sentence but not the communicative act, it may look competent exactly until the expensive misunderstanding arrives.

The paper is not testing “logic” in the abstract; it is testing the seam between logic and use

The study focuses on conditionals because they are a useful stress test. A conditional has a clean formal shape: if $p$, then $q$. In classical propositional logic, $p \rightarrow q$ is false only when $p$ is true and $q$ is false. When $p$ is false, the conditional is formally true because the falsifying case never occurs.

That is the neat classroom version. Natural language immediately makes it messier.

The paper separates two pragmatic enrichments that humans often apply to conditional sentences:

| Conditional type | Example intuition | Formal baseline | Human pragmatic enrichment |
| --- | --- | --- | --- |
| Standard conditional | “If you mow the lawn, I will give you €50.” | If $p$, then $q$ | Often strengthened into “€50 if and only if the lawn is mowed.” |
| Biscuit conditional | “If you are hungry, there is pizza in the oven.” | If $p$, then $q$ | The pizza is in the oven independently of whether you are hungry. |

The first pattern is called conditional perfection. The listener strengthens “if $p$, then $q$” into something closer to $p \leftrightarrow q$: the named condition is treated as the relevant condition. The second pattern is the biscuit reading: the consequent is understood as independently true. The pizza is not caused by hunger. Convenient, yes. Metaphysically dependent, no.
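The three readings can be made concrete with a small truth-table sketch. This is an illustration, not code from the paper: the function names and the `"undetermined"` label are assumptions introduced here. For each reading, the sketch fixes the antecedent and asks which values of the consequent remain compatible with the utterance.

```python
def material(p: bool, q: bool) -> bool:
    """Classical material conditional: false only when p is true and q is false."""
    return (not p) or q


def perfected(p: bool, q: bool) -> bool:
    """Conditional perfection: 'if p, then q' strengthened to 'p if and only if q'."""
    return p == q


def biscuit(p: bool, q: bool) -> bool:
    """Biscuit reading: the consequent is asserted regardless of the antecedent."""
    return q


# Hypothetical registry of readings, used only for this illustration.
READINGS = {"material": material, "perfected": perfected, "biscuit": biscuit}


def consequent_given(reading: str, p: bool) -> str:
    """Report what a reading implies about q once the antecedent p is fixed."""
    holds = READINGS[reading]
    possible_q = {q for q in (True, False) if holds(p, q)}
    if possible_q == {True}:
        return "true"
    if possible_q == {False}:
        return "false"
    return "undetermined"  # both values of q are compatible with the utterance


for name in READINGS:
    print(f"{name:10s} p=False -> q is {consequent_given(name, False)}")
# material   p=False -> q is undetermined
# perfected  p=False -> q is false
# biscuit    p=False -> q is true
```

With a false antecedent, the three readings give three different answers about the consequent, which is why false-antecedent cases are informative.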

This is why the paper’s mechanism-first framing is useful. The main issue is not whether a model can answer “true” or “false” on a benchmark. The issue is whether it can decide which interpretive mechanism is appropriate:

Input sentence with "if"
        |
        v
Truth-conditional semantics
        |
        +--> Standard conditional: infer dependence and possible exclusivity
        |
        +--> Biscuit conditional: infer independence of antecedent and consequent

A model that treats all false-antecedent cases as vacuously true may be formally disciplined and pragmatically tone-deaf. A model that treats all conditionals as biconditionals may look impressive on standard conditionals while failing biscuit conditionals. Both failure modes can produce high benchmark scores and low business reliability. Elegant, in the way a glass door is elegant before someone walks into it.

The experiment asks the model to judge consequences, not to write explanations

The experimental design is deliberately simple. The authors built a truth-value judgment task in four languages: Catalan, English, Italian, and Spanish. For each language, they created 54 experimental prompts: 27 standard conditionals and 27 biscuit conditionals. They also included 18 filler trials to reduce the transparency of the task.

Each prompt gave a short context, a distractor sentence, and a conditional statement. Participants then judged whether the consequent was true based on the utterance.

For a critical standard conditional, the setup is like this:

The lawn has not been mowed. Mary said: “If Paul mows the lawn, he will get €50.” Is it true that Paul got €50?

The target human-like pragmatic answer is “no.” The listener treats the €50 as depending on mowing the lawn.

For a critical biscuit conditional, the setup is like this:

John is not hungry. Mary says: “If John is hungry, there is pizza in the oven.” Is it true that there is pizza in the oven?

The target pragmatic answer is “yes.” The pizza’s existence is not conditional on John’s hunger.

This design matters because it does not reward long decorative reasoning. The task is not “please sound thoughtful about conditionals.” It asks whether the agent chooses the right interpretation when literal semantics and pragmatic enrichment pull in different directions.

The human side included 364 recruited participants before exclusions and 313 participants in the final analysis. The model side included 25 LLMs spanning commercial and open models, dense and mixture-of-experts architectures, and generative, hybrid, and reasoning-oriented systems. The models were given the same stimuli, without extra task-specific instructions beyond the question. Their free-text responses were manually mapped into truth judgments and scored against the target answers.
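The scoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Trial` record, the `"other"` label for responses that map to neither answer, and the toy data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    condition: str  # "standard" or "biscuit"
    target: str     # pragmatic target answer: "yes" or "no"
    judgment: str   # response after manual mapping: "yes", "no", or "other" (assumed label)


def accuracy_by_condition(trials):
    """Share of trials per condition whose mapped judgment matches the target."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in trials:
        totals[t.condition] += 1
        hits[t.condition] += int(t.judgment == t.target)
    return {c: hits[c] / totals[c] for c in totals}


# Toy data for illustration only; these are not results from the paper.
trials = [
    Trial("standard", "no", "no"),
    Trial("standard", "no", "no"),
    Trial("biscuit", "yes", "no"),
    Trial("biscuit", "yes", "yes"),
]
print(accuracy_by_condition(trials))
# {'standard': 1.0, 'biscuit': 0.5}
```

Reporting accuracy per condition rather than as one pooled number is what lets a standard-versus-biscuit contrast show up at all.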

That last detail is important. The paper is closer to a controlled cognitive probe than to a leaderboard. The authors are not asking which brand sounds cleverest. They are asking which systems reproduce the human pattern at the semantics-pragmatics interface.

The main result: models survive the controls and stumble on pragmatic enrichment

The first major finding is not that LLMs fail everywhere. In control conditions, humans and LLMs do not meaningfully diverge. The models can handle the cases where the truth conditions are straightforward.

The break appears in the experimental conditions, where the right answer depends on pragmatic interpretation. The mixed-effects model shows a significant Prompt × Agent interaction: the experimental manipulation affects humans and LLMs differently. When the analysis focuses only on experimental trials, LLMs perform at an overall lower level than humans, while both humans and models show higher accuracy for standard than biscuit conditionals.
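For readers less familiar with the statistical vocabulary, a Prompt × Agent interaction can be written schematically as a logistic mixed-effects model. The form below is a generic sketch, not the paper's reported specification: the coefficient names, the coding of the predictors, and the random intercepts $u_i$ (respondent) and $w_j$ (item) are assumptions made for illustration.

$$
\operatorname{logit} P(\text{correct}_{ij}) = \beta_0 + \beta_1\,\text{Prompt}_j + \beta_2\,\text{Agent}_i + \beta_3\,(\text{Prompt}_j \times \text{Agent}_i) + u_i + w_j
$$

Here $\text{Prompt}_j$ marks whether item $j$ is a control or an experimental prompt, and $\text{Agent}_i$ marks whether respondent $i$ is a human or an LLM. A reliably non-zero $\beta_3$ is what "interaction" means in this setting: the drop from control to experimental prompts differs between humans and models.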

The useful interpretation is this: the models are not random. They are sensitive to some structure. But their sensitivity is weaker, less stable, and less human-like where context and speaker intention matter.

Figure 1 in the paper is the main evidence for this mechanism. It does more than report accuracy bars: it shows that the gap is concentrated in the pragmatic cases rather than spread evenly across all conditions. That supports the paper’s core claim: LLMs can behave as reliable semantic operators while failing the pragmatic enrichments that humans naturally apply.

For business users, this is the uncomfortable part. If a model fails trivial logic, people notice quickly. If it succeeds at explicit logic but misses pragmatic force, the failure can be quieter. It may produce an answer that is defensible under one reading and wrong under the reading that matters operationally.

A policy chatbot that says “the clause does not explicitly prohibit this” may be technically correct and practically dangerous. A procurement assistant that treats “if needed, we can provide documentation” as a conditional availability statement may miss that the documentation already exists. A compliance assistant that reads a rule literally may ignore the purpose of the rule. The software will not look broken. It will look precise. That is the annoying version of broken.

Human readers enrich conditionals differently; LLMs often collapse them into one rule

The human data replicate a familiar asymmetry. Participants endorsed perfected readings for standard conditionals at more than 88% on average across languages. Biscuit readings appeared at around 57% on average. So humans do not apply pragmatic enrichment as a single mechanical switch. They strongly perfect standard conditionals and derive biscuit readings more moderately.

That distinction matters. Human reasoning here is not simply “always strengthen the conditional.” It is conditional-sensitive. The hearer asks, implicitly: is the antecedent functioning as a real precondition, or is it just a relevance cue for the listener?

Many LLMs do not make that distinction reliably. The authors identify two broad model behavior profiles.

The first group adheres closely to truth-table semantics. These models treat false-antecedent conditionals as vacuously true. That can look formally correct, but it fails when the task requires reasoning about whether the consequent should be inferred in context.

The second group adopts a broad biconditional strategy. These models strengthen conditionals into “if and only if” across the board. That helps with standard conditionals but blocks biscuit interpretations. In other words, they pass one pragmatic door by carrying a universal key that breaks the next lock.

A few models show more variable behavior, and one shows a positive response bias. But the dominant pattern is that models tend to simplify the interpretive space. Either everything is literal logic, or everything is biconditional. The paper calls the broader tendency Decontextualization Bias: a preference for formal or literal aspects of input over the contextual cues that guide human interpretation.

This term is useful because it avoids a vague complaint about “lack of understanding.” The proposed mechanism is more specific:

Human interpretation:
formal sentence + context + speaker goals + relevance assumptions -> enriched meaning

Common LLM pattern:
formal sentence + learned response regularity -> one dominant interpretation

The difference is not cosmetic. Business communication is full of relevance assumptions. “If you have questions, contact legal” means legal is available for questions, not that legal exists only under a question condition. “If the shipment is delayed, notify the customer” means delay triggers notification, not that notification is impossible for other reasons. “If approved by finance, the purchase may proceed” is a gate. “If you need the form, it is in the portal” is not.

A useful business AI system needs to know which kind of “if” it is seeing.

Model rankings are the tempting story, but not the durable one

The paper does report strong variation across individual models. In experimental trials, mean accuracy ranges from 0.000 for Distil-Bert to 0.833 for Llama3.3. Some models, including Llama3.3 70B Instruct and Kimi K2-Instruct-0905, maintain accuracy above 0.75 across languages. Others remain near chance regardless of language.

Figure 2 supports the model-heterogeneity finding. Its purpose is comparison across individual systems and languages. It shows that model identity matters much more than the four-language split in this task.

Figure 3 goes one level deeper. It compares model accuracy in critical standard and critical biscuit conditions. Its purpose is not just to rank models; it reveals interpretive strategy. A model may be high on standard conditionals and low on biscuit conditionals, which suggests biconditional overgeneralization rather than human-like pragmatics. Conversely, a weak or reversed contrast may signal a different response bias or a failure to track the conditional distinction.

This is why “model X scored higher than model Y” is the shallow takeaway. The operational question is what kind of error profile a model has.

| Observed model pattern | What it can look like | What it probably means | Business risk |
| --- | --- | --- | --- |
| Truth-table loyalty | Formally consistent | Treats false antecedents as vacuously true | Misses implied dependence or speaker intent |
| Blanket biconditional reading | Strong on standard conditionals | Over-strengthens all “if” statements | Treats courtesy/context cues as hard conditions |
| Positive response bias | Cooperative and agreeable | Says “yes” too often | Confirms false assumptions in support or compliance workflows |
| Variable model-specific behavior | Flexible at first glance | Inconsistent strategy across cases | Hard to certify without task-specific testing |
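The patterns above can be turned into a rough triage rule for a team's own test results. The sketch below is illustrative only: the function name, the labels, and the `high` and `low` thresholds are assumptions chosen for this example, not values from the paper. It relies on the task design described earlier, where the target answer is “no” for critical standard items and “yes” for critical biscuit items.

```python
def diagnose_profile(standard_acc: float, biscuit_acc: float,
                     high: float = 0.75, low: float = 0.40) -> str:
    """Map per-condition accuracy to a tentative error profile.

    Thresholds are illustrative assumptions; tune them on your own data.
    """
    if standard_acc >= high and biscuit_acc >= high:
        return "tracks both readings"
    if standard_acc >= high and biscuit_acc <= low:
        # Strong on perfection, weak on biscuits: everything read as "if and only if".
        return "possible blanket biconditional reading"
    if standard_acc <= low and biscuit_acc >= high:
        # Saying "yes" everywhere is correct on biscuit items and wrong on standard ones.
        return "possible positive response bias"
    return "mixed or unclear: inspect item-level responses"


print(diagnose_profile(0.90, 0.20))  # possible blanket biconditional reading
print(diagnose_profile(0.10, 0.95))  # possible positive response bias
```

Two accuracy numbers cannot identify every profile. Truth-table loyalty and variable behavior, for example, land in the last bucket here and need item-level inspection.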

This is also where evaluation needs to become less leaderboard-shaped. A business does not only need “the best model.” It needs to know whether the model’s failures match the organization’s risk surface. A legal assistant, a customer-support bot, and a sales-operations copilot do not share the same tolerance for pragmatic misreadings.

Reasoning branding, architecture labels, and open-source status do not explain the gap

The study also tests whether broad model categories predict performance. The answer is mostly no.

Open versus closed status does not significantly improve model fit. Dense versus mixture-of-experts architecture does not significantly improve model fit. Generative versus reasoning-oriented training does not significantly improve model fit. Table 4 reports small mean differences across these categories, but none becomes a reliable predictor of accuracy.

This part is a comparison across design labels, not a causal explanation of architecture. Its likely purpose is to test whether coarse labels can explain the observed model variation. They cannot.

That does not mean architecture, training data, or fine-tuning are irrelevant. It means the labels available to users are too coarse to forecast pragmatic competence. “Reasoning model” is not a warranty. “MoE” is not a pragmatic-reasoning certificate. “Open” is not a cognitive guarantee. These are procurement descriptors, not behavioral proofs.

For enterprise selection, this finding has a direct implication: do not buy reasoning competence by category. Test it on your own language acts.

This is especially important because pragmatic failures are domain-shaped. A model that handles conditional reasoning in general-purpose examples may still fail in a company’s internal policy style, contract templates, customer-service scripts, insurance exclusions, or regulatory correspondence. The paper’s result says that broad model metadata is a weak proxy. The business version is blunter: your vendor slide is not an evaluation suite.

The multilingual result is stable, but not a license to relax multilingual testing

The paper tests Catalan, English, Italian, and Spanish. Mean model accuracy is very similar across the four languages, ranging from 0.639 in English to 0.652 in Catalan. Adding language as a random intercept does not improve model fit, suggesting that the core pragmatic limitation in this task is more model-internal than language-specific.

This is a useful robustness-style result: the observed pattern is not simply an English artifact. It appears across the four tested languages.

But the paper also notes cross-linguistic leakage. Some models respond in English, and sometimes Spanish or other languages, when queried in another language. Catalan is especially vulnerable to this pattern, likely because it has the smallest speaker community among the tested languages and is close to Spanish.

So the practical lesson is two-sided:

| Finding | Business interpretation | Boundary |
| --- | --- | --- |
| Similar accuracy across the four tested languages | The pragmatic issue is not confined to English | The languages are typologically related |
| Cross-linguistic leakage in outputs | Language control remains an operational risk | Accuracy alone may hide unacceptable response-language behavior |
| Smaller language effect than model effect | Model-specific testing matters | Broader low-resource and non-Indo-European coverage is still needed |

For multilingual business automation, this distinction is familiar. A model may get the answer approximately right while violating the required language, register, legal terminology, or customer-facing tone. That is not a footnote in production. It is a workflow failure wearing a linguistics hat.

The business problem is not that LLMs lack logic; it is that business meaning is not just logic

A common misconception is that a model capable of formal reasoning, benchmark performance, or chain-of-thought-style explanation must reason like humans in context-sensitive language tasks. This paper pushes against that assumption.

The direct result is narrow and controlled: in sentence-level conditional judgment tasks across four related languages, humans reliably enrich conditionals pragmatically, while LLMs show lower, more variable, and often rule-like behavior in the critical pragmatic cases.

The business inference is broader but should be made carefully. The paper does not prove that LLMs will fail every contract, support ticket, or policy task. It does show why high-stakes deployments should test pragmatic interpretation directly rather than assuming it from general reasoning ability.

A practical evaluation program should include at least three layers:

| Evaluation layer | What to test | Example business question |
| --- | --- | --- |
| Formal rule handling | Can the model identify explicit conditions and consequences? | “If approval is missing, should the request proceed?” |
| Pragmatic enrichment | Can the model distinguish hard conditions from relevance cues? | “Does ‘if you need the form, it is attached’ imply the form may not be attached?” |
| Escalation discipline | Does the model flag ambiguity instead of forcing a convenient reading? | “Could this clause support more than one operational interpretation?” |

The third layer is often the neglected one. Humans do not always resolve pragmatic meaning with certainty. They also ask follow-up questions, infer speaker intent, inspect surrounding documents, and sometimes escalate to legal, compliance, or management. A useful AI system should not merely choose an interpretation; it should know when the interpretation is doing too much work.

For Cognaptus-style business automation, the lesson is to build pragmatic checks into the workflow:

  1. Classify the conditional type. Is the “if” statement a gate, a promise, a relevance cue, or an informational convenience?
  2. Separate literal consequence from inferred consequence. Make the model state which part follows formally and which part is inferred from context.
  3. Require evidence from surrounding text. A clause-level interpretation should not float away from the document, even if it sounds sensible.
  4. Use domain-specific red-team prompts. Test policy, contract, and customer-support examples where literal and pragmatic readings diverge.
  5. Escalate when the inferred meaning changes liability, eligibility, money, or customer rights. This is not philosophical caution. It is cheaper than cleaning the mess later.
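Steps 1, 2, 3, and 5 can be sketched as a small record plus an escalation rule. This is a minimal illustration under stated assumptions, not a reference implementation: the class names, field names, and the `HIGH_STAKES` set are hypothetical, and in practice a model or a reviewer would fill in the record.

```python
from dataclasses import dataclass, field
from enum import Enum


class ConditionalType(Enum):
    GATE = "gate"
    PROMISE = "promise"
    RELEVANCE_CUE = "relevance cue"
    INFORMATIONAL = "informational convenience"


# Areas where a changed reading should trigger review (taken from step 5).
HIGH_STAKES = {"liability", "eligibility", "money", "customer rights"}


@dataclass
class Reading:
    sentence: str
    conditional_type: ConditionalType                    # step 1
    literal_consequence: str                             # step 2: what follows formally
    inferred_consequence: str                            # step 2: what is inferred from context
    evidence: list[str] = field(default_factory=list)    # step 3: quotes from surrounding text
    affected_areas: set[str] = field(default_factory=set)


def needs_escalation(r: Reading) -> tuple[bool, str]:
    """Decide whether a human should review the reading (step 5)."""
    if not r.evidence:
        return True, "no supporting evidence from surrounding text"
    stakes = r.affected_areas & HIGH_STAKES
    if r.inferred_consequence != r.literal_consequence and stakes:
        return True, "inferred meaning affects: " + ", ".join(sorted(stakes))
    return False, "reading can proceed"


reading = Reading(
    sentence="If payment is received by Friday, the discount applies.",
    conditional_type=ConditionalType.GATE,
    literal_consequence="Payment by Friday gets the discount.",
    inferred_consequence="Payment after Friday does not get the discount.",
    evidence=["Invoice terms, discount clause"],  # hypothetical citation
    affected_areas={"money"},
)
print(needs_escalation(reading))
# (True, 'inferred meaning affects: money')
```

Keeping the literal and inferred consequences in separate fields makes the gap between them visible, so it can be audited instead of being buried inside a single answer.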

The ROI is not “AI becomes human.” That phrase should be retired with other minor corporate crimes. The ROI is earlier detection of cases where formal logic and business meaning diverge.

What each piece of evidence supports, and what it does not

The paper is strongest when read as a mechanism study, not as a universal verdict on intelligence. The following table keeps the evidence in its proper lane.

| Evidence component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Truth-value judgment task | Main evidence design | Separates semantic logic from pragmatic enrichment | Full performance in open-ended business dialogue |
| Human vs LLM mixed-effects comparison | Main evidence | LLMs diverge from humans in experimental pragmatic conditions | That all LLM reasoning is weak |
| Experimental-only agent comparison | Main evidence | The gap is concentrated where pragmatic interpretation is required | That models have zero sensitivity to conditional type |
| Figure 1 | Main evidence visualization | Controls are easier; pragmatic cases expose the gap | The exact cause inside model internals |
| Figure 2 | Model comparison | Individual model identity matters substantially | Stable ranking across all reasoning tasks |
| Figure 3 | Strategy diagnosis | Some models over-favor standard/biconditional readings while biscuit readings remain hard | A complete taxonomy of model reasoning strategies |
| Table 4 | Design-label comparison | Open/closed, dense/MoE, and training type do not reliably predict performance | That training and architecture details are irrelevant |
| Four-language comparison | Robustness across tested languages | Pattern is not only English-specific | Generalization to all languages or low-resource settings |

This is the right level of confidence. The paper gives a sharp diagnostic of a particular reasoning seam. It does not give a final theory of all LLM cognition, and it does not need to. A good diagnostic does not explain the universe. It finds the leak.

The boundary: controlled sentences are a warning signal, not a production benchmark

The limitations are not decorative. They shape how the result should be used.

First, the tested languages are Catalan, English, Italian, and Spanish. They are not identical, but they are still typologically related. The study cannot settle how models handle pragmatic conditionals in non-Indo-European languages, lower-resource languages, or specialized business registers.

Second, the model categories are not perfectly balanced. The finding that broad labels do not predict performance is valuable, but it should not be read as a causal claim that architecture or training never matters. More likely, the relevant causal factors are buried inside data composition, instruction tuning, reinforcement learning, model-family decisions, and inference behavior that users cannot observe from the label.

Third, the task uses controlled sentence-level judgments. This is a strength for isolating the mechanism, but it is not the same as a full enterprise workflow. In longer documents, additional context might help the model. Or it might create new failure modes by adding distraction, conflicting clauses, and retrieval noise. Anyone who has seen a policy document knows context is not always a gift.

The practical boundary is simple: use this paper as a design warning, not as a plug-and-play benchmark. It tells teams what to test, not which model to buy tomorrow morning.

The article’s takeaway: “if” is a small word with expensive consequences

The paper’s most useful contribution is not the model leaderboard. It is the mechanism: LLMs can process the formal shape of conditionals while failing to reproduce the pragmatic enrichments that humans use when interpreting real communication.

That gap is easy to underestimate because formal logic feels more rigorous than pragmatics. But in business, rigor without context can become a polite form of error. A model may be perfectly literal and practically wrong. It may strengthen every conditional and become accidentally restrictive. It may follow a truth table and miss the speaker’s intention. None of these failures requires the model to be obviously stupid. That is why they matter.

The right response is not to abandon LLMs for language-heavy work. It is to stop treating “reasoning” as a product label and start treating it as a testable behavior under task-specific conditions. For contract review, policy automation, compliance support, customer operations, and multilingual workflows, the relevant question is not whether the model can reason in general.

The relevant question is whether it knows when “if” means a gate, when it means a courtesy, and when the safest answer is: ask for context before pretending the sentence was simpler than it was.

Cognaptus: Automate the Present, Incubate the Future.


  1. Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, and Evelina Leivada, “Tracing the ongoing emergence of human-like reasoning in Large Language Models,” arXiv:2605.21299, submitted May 20, 2026. https://arxiv.org/abs/2605.21299