Opening — Why this matters now
AI reasoning has become the software industry’s favorite magic word. Every product now claims to “reason,” usually after adding a longer prompt, a larger model, and a pricing page with the emotional warmth of a hospital bill.
But three recent arXiv papers point to a more useful conclusion: reasoning is not a single capability that lives inside one heroic model. It is becoming a system architecture.
That matters for business because most companies do not need an AI model that wins philosophical arguments with itself for 8,000 tokens. They need systems that can interpret messy inputs, solve the right problem, know when enough reasoning has been done, and verify whether the output is actually correct. In other words: not just “smarter AI,” but a disciplined reasoning workflow.
The three papers in this cluster approach the problem from different angles:
- Tandem studies how a large model and a smaller model can collaborate so that expensive reasoning is used only where it adds value.[^1]
- Rethinking Math Reasoning Evaluation argues that brittle symbolic answer checking misjudges model performance and proposes a more robust LLM-as-a-judge framework.[^2]
- Grounding vs. Compositionality challenges a core assumption in neuro-symbolic AI: that once symbols are grounded, compositional reasoning will naturally follow.[^3]
Read together, they suggest a practical shift: AI reasoning should be designed as a stack, not purchased as a monolith.
Naturally, this is less glamorous than saying “the model thinks.” It is also much closer to how deployable systems actually get built.
The Research Cluster — What these papers are collectively asking
These papers are not about the same benchmark, model family, or architectural tradition. One is about large-small model collaboration, one about evaluation, and one about neuro-symbolic compositional generalization. The common question is deeper:
What has to be made explicit for AI reasoning to become reliable, efficient, and useful outside demos?
The answer differs by layer.
Tandem makes reasoning effort explicit. Instead of asking a large model to produce a full long reasoning trace every time, it asks the large model to generate compact “thinking insights” and lets a smaller model complete the task. A classifier watches the smaller model’s uncertainty signals to decide whether more guidance is needed.
The evaluation paper makes correctness judgment explicit. It argues that symbolic answer matching often fails when mathematically correct answers appear in different formats, units, approximations, or notations. Its proposed judge pipeline independently solves the problem, validates reference answers, and then evaluates model predictions with randomized grouping and voting.
The neuro-symbolic paper makes reasoning objectives explicit. It shows that grounding symbols is necessary but not sufficient for compositional generalization. A model trained only to map visual inputs to symbols does not automatically learn how those symbols interact through rules. The proposed Iterative Logic Tensor Network performs better because multi-step reasoning is directly trained and architecturally supported.
This gives us a useful business translation:
Practical AI reasoning is not “ask a bigger model.” It is “design the right handoff between perception, reasoning, control, and verification.”
The Shared Problem — What the papers are reacting to
The shared enemy is not stupidity. It is hidden assumptions.
The AI industry often assumes that one sufficiently capable model can handle everything: understand the input, infer the intent, perform the reasoning, format the output, and judge correctness. These papers attack that assumption from three sides.
| Hidden assumption | Why it breaks | Paper that exposes it |
|---|---|---|
| A larger reasoning model should do the whole task | Long reasoning traces are expensive, slow, and often unnecessary | Tandem |
| Correctness can be checked by exact symbolic matching | Equivalent answers can appear in different forms, units, or notation | Rethinking Math Reasoning Evaluation |
| Grounding symbols automatically creates compositional reasoning | Recognition does not teach the model how symbols interact through rules | Grounding vs. Compositionality |
This is why the papers fit together despite their surface differences. They all say: if a capability matters, do not hide it inside a vague model call. Expose it. Instrument it. Train it. Evaluate it.
That sounds boring. Good. Boring is where production systems begin.
What Each Paper Adds
| Paper | Direct research problem | Main mechanism | Key finding | Best role in the article |
|---|---|---|---|---|
| Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning | How to preserve strong reasoning while reducing inference cost | Large model generates structured Goal / Planning / Retrieval / Action insights; small model completes the answer; sufficiency classifier controls when to stop | A 32B–7B collaboration outperforms the 32B model on MATH while using about 59% of its computational cost; classifier transfers to HumanEval without retraining | Technical implementation example; cost-control layer |
| Rethinking Math Reasoning Evaluation | How to evaluate math reasoning when symbolic answer matching is brittle | LLM judge independently solves the question, validates the dataset answer, then judges predictions using grouped/randomized evaluation | Meta-evaluation F1 rises from 0.741 for the symbolic baseline to 0.969 for the proposed pipeline | Governance and verification layer |
| Grounding vs. Compositionality | Whether symbol grounding naturally produces compositional reasoning | Iterative Logic Tensor Network with explicit multi-step reasoning objective | Full LTN achieves 51.2% overall accuracy vs. 11.3% for grounding-only baseline across compositional generalization tasks | Conceptual foundation; architecture warning |
A useful way to read this table is not to ask which paper is better; that would be the least interesting question. The better question is: what layer of the reasoning stack does each paper force us to take seriously?
The Bigger Pattern — What emerges when we read them together
The bigger pattern is that AI reasoning is becoming modular.
Not modular in the old enterprise sense, where “modular” means six vendors, three dashboards, and one procurement migraine. Modular in the engineering sense: separate capabilities should have separate responsibilities, metrics, and failure modes.
Layer 1: Grounding — what does the system think it is seeing?
The neuro-symbolic paper starts at the bottom of the stack. Its task domain uses visual logic puzzles rendered as grayscale images. The system must map perceptual input into symbolic representations, then reason over those symbols.
The important result is not simply that the proposed Iterative Logic Tensor Network performs better. The important result is why it performs better.
The paper shows that both the grounding-only baseline and the full model can learn the basic perceptual task on training digits. Yet the grounding-only model fails badly under compositional generalization. The full LTN, trained jointly on grounding and multi-step reasoning, performs much better across entity, relational, and rule composition tasks. In one entity-composition test, the full model solves 31 out of 50 puzzles while the grounding-only baseline solves 4 out of 50. In aggregate, the full model reaches 51.2% overall accuracy versus 11.3% for the grounding-only baseline.
The direct research conclusion is clear: grounding is necessary, but not enough.
The business interpretation is broader: classification is not understanding. An invoice parser that extracts fields is not yet an accounting assistant. A customer-service classifier that labels “refund request” is not yet a resolution agent. A document-review model that identifies clauses is not yet a legal reasoning system.
The symbols are only the beginning. The real question is what the system can do with them.
Layer 2: Reasoning execution — who should do the hard thinking?
Tandem moves from representation to execution. Its premise is practical: modern reasoning models can produce long traces, and those traces are expensive. If every task triggers a full premium-model reasoning path, the deployment economics become unpleasant very quickly. “Unpleasant,” in this context, means the CFO eventually discovers the token bill.
Tandem’s design splits labor:
- The large model acts as a mentor, producing compact thinking insights: Goal, Planning, Retrieval, and Action.
- The smaller model processes those insights and generates the final response.
- A sufficiency classifier decides whether the current guidance is enough or whether the large model should continue.
The key insight is not merely “use a smaller model.” Many systems already do crude routing. Tandem is more interesting because it uses the smaller model’s own uncertainty signals—perplexity and entropy statistics—to decide whether the guidance is sufficient.
That is a subtle but important shift. The smaller model is not just a cheap worker. It becomes part of the control loop.
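To make that control loop concrete, here is a minimal sketch, not Tandem's actual implementation: `mentor` and `student` are hypothetical wrappers around a large and a small model, and a single perplexity cutoff (with an invented threshold) stands in for the paper's trained classifier over perplexity and entropy features.

```python
import math
from typing import Callable, List, Tuple

def uncertainty_features(token_logprobs: List[float]) -> dict:
    """Summarize the student's confidence from its per-token log-probs.
    A single perplexity value stands in for Tandem's richer feature set."""
    avg_nll = -sum(token_logprobs) / max(len(token_logprobs), 1)
    return {"perplexity": math.exp(avg_nll)}

def guidance_sufficient(features: dict, threshold: float = 1.8) -> bool:
    """In the paper this is a trained classifier; the cutoff here is an
    invented illustration, not a value reported by the authors."""
    return features["perplexity"] < threshold

def tandem_answer(
    question: str,
    mentor: Callable[[str, List[str]], str],   # hypothetical: large model returns one compact insight
    student: Callable[[str, List[str]], Tuple[str, List[float]]],  # hypothetical: small model answers
    max_rounds: int = 3,
) -> str:
    """Accumulate mentor insights until the student looks confident enough."""
    insights: List[str] = []
    answer = ""
    for _ in range(max_rounds):
        insights.append(mentor(question, insights))     # pay for a little expensive guidance
        answer, logprobs = student(question, insights)  # the cheap model does the actual work
        if guidance_sufficient(uncertainty_features(logprobs)):
            return answer                               # stop buying mentor tokens
    return answer  # guidance budget exhausted: return the best attempt, or escalate
```

The design choice worth noticing: the stopping signal comes from the worker, not the mentor, which is exactly what makes the small model part of the control loop rather than just a cost-saving measure.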
For business systems, this suggests a practical design principle:
Expensive reasoning should be allocated dynamically, not uniformly.
A claims-processing assistant should not invoke the most expensive model for every routine reimbursement. A legal intake system should not generate a full legal memo for every contact-form submission. A finance automation tool should not perform deep anomaly analysis on every ordinary invoice.
The question is not “large model or small model?” The question is: what evidence tells us that the cheap path is enough?
Tandem’s paper provides one answer for reasoning benchmarks. Production systems will need domain-specific versions of the same idea: confidence signals, exception thresholds, escalation policies, and audit logs.
Layer 3: Verification — how do we know the answer is actually right?
The evaluation paper attacks a different weakness: even when a model gives a correct answer, the evaluator may mark it wrong.
In mathematical reasoning, standard evaluation pipelines often use symbolic comparison. That works when outputs are clean and exactly formatted. It breaks when the answer is mathematically equivalent but expressed differently: decimals versus fractions, units included versus omitted, textual time formats, equivalent equations, or acceptable approximations.
The paper’s proposed pipeline uses an LLM judge, but it is not the lazy version of “ask another model if this looks okay.” It includes safeguards:
- The judge first answers the question independently without seeing the dataset answer.
- It then validates the generated answer against the ground-truth answer.
- It evaluates model predictions against the validated answer.
- It uses grouped prediction evaluation, randomization, and shuffling to reduce positional bias.
- It tests the pipeline through manually labeled meta-evaluation.
The reported F1 improvement is large: 0.741 for the SimpleRL symbolic baseline versus 0.969 for the proposed configuration.
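To see how those safeguards fit together, here is a minimal sketch under stated assumptions: `solve` and `equivalent` are hypothetical wrappers around a judge LLM, and the paper's grouped judging prompts are reduced to per-prediction calls here.

```python
import random
from typing import Callable, List

def judge_predictions(
    question: str,
    reference_answer: str,
    predictions: List[str],
    solve: Callable[[str], str],                  # hypothetical: judge LLM solves the question itself
    equivalent: Callable[[str, str, str], bool],  # hypothetical: judge LLM checks answer equivalence
    votes: int = 3,
) -> List[bool]:
    """Mirror the paper's safeguards in miniature."""
    # 1. The judge answers independently, without seeing the dataset answer.
    independent = solve(question)
    # 2. Validate the dataset's reference answer against that solution.
    if not equivalent(question, independent, reference_answer):
        raise ValueError("Reference answer failed validation; route to human review.")
    # 3. Shuffle to mirror the paper's randomized grouping. (In this per-item
    #    sketch the order has no effect; in grouped judging prompts it does.)
    order = list(range(len(predictions)))
    random.shuffle(order)
    verdicts = [False] * len(predictions)
    for i in order:
        # 4. Majority voting over repeated judgments damps judge non-determinism.
        tally = sum(equivalent(question, predictions[i], reference_answer) for _ in range(votes))
        verdicts[i] = tally * 2 > votes
    return verdicts
```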
The direct research conclusion is that robust semantic evaluation can correct many failures of brittle symbolic matching.
The business interpretation is more delicate: LLM judges are useful, but not magic auditors.
For business workflows, this matters because many AI systems will not output neat, exact answers. They will output contract interpretations, reconciliation explanations, customer-response drafts, risk summaries, escalation reasons, and compliance notes. Exact matching is useless there. But pure LLM judgment is also risky unless the evaluation process itself is governed.
The paper is valuable precisely because it does not pretend that “LLM-as-a-judge” is automatically trustworthy. It explicitly addresses confirmation bias, positional bias, non-determinism, and reference-answer ambiguity.
That is the lesson for business deployment: if AI checks AI, the checking system needs its own controls.
A reasoning-system stack
Putting the three papers together gives a stack like this:
| Reasoning-system layer | Core question | Research signal | Business design equivalent |
|---|---|---|---|
| Grounding / representation | Does the system represent the task in a usable structure? | Grounding alone does not produce compositional reasoning | Data extraction, schema design, entity linking, document structure, process state |
| Reasoning objective | Has the system been trained or designed to perform multi-step inference? | Explicit iterative reasoning improves compositional generalization | Workflow logic, dependency tracking, stepwise plans, decision policies |
| Reasoning allocation | How much expensive reasoning is needed for this case? | Tandem dynamically stops large-model guidance when sufficient | Model routing, escalation thresholds, cost-aware orchestration |
| Verification / judgment | How is correctness checked beyond brittle rules? | LLM-based evaluation handles diverse valid answers better than symbolic matching | QA agents, audit review, exception validation, semantic consistency checks |
| Governance / monitoring | Are the control signals themselves reliable? | LLM judges and classifiers have their own biases and limits | Evaluation logs, human review, drift monitoring, error taxonomy |
This stack is the real story. The industry talks about model intelligence. These papers talk about system discipline.
That distinction is not academic hair-splitting. It is the difference between a demo that impresses a board meeting and a workflow that survives Monday morning operations.
Business Interpretation — What changes in practice
Here is the clean business takeaway:
Companies should stop asking, “Which model should we use?” as the first question. They should ask, “Which reasoning functions must be explicit in this workflow?”
The model choice still matters. But it is downstream of architecture.
1. AI automation should be designed around task states, not prompts
The neuro-symbolic paper shows that a model needs more than recognition. It needs a representation that supports reasoning. In business terms, this means AI workflows need structured task states.
For example, a customer support agent should not merely summarize a complaint. It should maintain a state such as:
| Field | Example |
|---|---|
| Customer intent | Refund request |
| Evidence supplied | Order ID, photo, delivery timestamp |
| Policy constraints | Refund window, damaged-goods rule |
| Missing information | Whether the item was used |
| Risk level | Medium |
| Next action | Ask for photo of packaging or escalate |
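Rendered as code, such a state might look like the following sketch; the schema, class, and field names are invented for illustration and do not come from any of the papers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class SupportCaseState:
    """Structured task state for one support case (illustrative schema)."""
    customer_intent: str                                  # e.g. "refund_request"
    evidence_supplied: List[str] = field(default_factory=list)
    policy_constraints: List[str] = field(default_factory=list)
    missing_information: List[str] = field(default_factory=list)
    risk_level: RiskLevel = RiskLevel.MEDIUM
    next_action: Optional[str] = None

    def is_actionable(self) -> bool:
        """Reasoning should proceed only once nothing required is missing."""
        return not self.missing_information
```

The point of the `is_actionable` check is that the workflow, not the prompt, decides when reasoning may proceed.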
This is not glamorous. It is much more valuable than a beautiful paragraph that quietly forgets the refund policy.
2. Expensive reasoning should be reserved for uncertainty, ambiguity, and stakes
Tandem suggests that not every problem deserves the full reasoning budget. In a business system, this becomes a routing design.
| Case type | Cheap path | Expensive path | Human path |
|---|---|---|---|
| Routine, low-risk | Small model + rules | Not needed | Sample audit only |
| Ambiguous but low-stakes | Small model + structured guidance | Large model provides plan or critique | Optional review |
| High-value or regulated | Not used | Large model + retrieval + verifier (required) | Human approval required |
| Novel or adversarial | Triage only | Diagnostic reasoning | Escalation required |
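The same routing table, expressed as a sketch of a cost-aware dispatch policy; the case attributes, path labels, and thresholds are assumptions, not anything the papers specify.

```python
from dataclasses import dataclass

@dataclass
class Case:
    risk: str          # "low" | "medium" | "high"
    ambiguous: bool
    regulated: bool
    novel: bool

def route(case: Case) -> dict:
    """Map a case to a model path and a human-involvement level."""
    if case.novel:
        return {"path": "triage_only", "human": "escalation_required"}
    if case.regulated or case.risk == "high":
        return {"path": "large_model+retrieval+verifier", "human": "approval_required"}
    if case.ambiguous:
        return {"path": "small_model+large_model_plan", "human": "optional_review"}
    return {"path": "small_model+rules", "human": "sample_audit"}
```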
The point is not to worship small models. The point is to avoid using premium reasoning as a substitute for workflow design.
3. Evaluation needs to match the output type
The evaluation paper is directly about math, but the lesson travels well. Different business outputs require different checking methods.
| Output type | Bad evaluation method | Better evaluation method |
|---|---|---|
| Extracted invoice fields | “Looks good” review | Exact match + tolerance + source-location evidence |
| Customer email draft | Keyword matching | Policy compliance, tone, factual consistency, escalation check |
| Financial variance explanation | Single LLM opinion | Data reconciliation + narrative consistency + reviewer sign-off |
| Contract clause summary | Generic summary score | Clause coverage, obligation extraction, risk flag validation |
| Forecast commentary | Accuracy of prose | Assumption tracking, scenario consistency, source freshness |
The uncomfortable lesson is that evaluation is itself a product feature. If a vendor cannot explain how its AI output is evaluated, it is not selling automation. It is selling confidence theater.
4. “AI agents” need internal division of labor
These papers also clarify what a serious agentic system should look like. It should not be a single agent with a heroic system prompt and a vague instruction to “think step by step.”
A production-grade business agent usually needs at least five roles:
| Agent role | Function | Paper connection |
|---|---|---|
| Perception / intake agent | Converts messy inputs into structured state | Grounding layer |
| Reasoning planner | Determines strategy and dependencies | Tandem’s planning insight; LTN reasoning objective |
| Execution agent | Performs routine steps efficiently | Tandem’s small-model executor |
| Verifier / judge | Checks correctness and policy compliance | LLM-as-a-judge evaluation |
| Escalation controller | Decides when to ask for more reasoning or human review | Tandem’s sufficiency classifier; governance layer |
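One way to picture the wiring, as a deliberately thin sketch: each role is a callable the caller supplies, and none of the names correspond to a real framework.

```python
from typing import Callable, Dict

def run_case(raw_input: str, agents: Dict[str, Callable]) -> str:
    """Wire the five roles together (all names invented for illustration)."""
    state = agents["intake"](raw_input)            # perception: messy input -> structured state
    plan = agents["planner"](state)                # strategy and dependencies
    draft = agents["executor"](state, plan)        # routine steps on the cheap path
    verdict = agents["verifier"](state, draft)     # correctness and policy compliance
    if agents["escalation"](state, verdict):       # sufficiency / governance signal
        return "escalated_to_human_review"
    return draft
```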
This is where the business ROI appears. Not in replacing every employee with one giant model, which is both crude and legally exciting in the worst way. The ROI comes from narrowing human attention to the cases where judgment, accountability, and exception handling actually matter.
5. ROI should be measured by avoided reasoning waste, not just labor saved
Many AI automation pitches still measure value as “hours saved.” That is useful, but incomplete.
This research cluster points to another metric: reasoning waste avoided.
Reasoning waste appears when:
- a large model performs deep reasoning for trivial cases;
- a small model handles cases it cannot reliably understand;
- outputs are accepted without semantic verification;
- humans review everything because the AI system cannot rank uncertainty;
- brittle evaluation rejects correct outputs or accepts wrong ones;
- the workflow has no state representation, so every step starts from scratch.
A better ROI model should include:
$$ \text{AI Workflow Value} = \text{Labor Saved} + \text{Error Cost Avoided} + \text{Cycle Time Reduced} - \text{Inference Cost} - \text{Review Cost} - \text{Failure Cost} $$
The papers do not provide this business equation. This is my extrapolation. But it follows naturally from their shared message: reasoning has costs, and those costs must be controlled by architecture.
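Plugging illustrative numbers into the equation shows how the terms interact; every figure below is hypothetical, chosen only to exercise the formula.

```python
# Entirely hypothetical monthly figures, chosen only to exercise the formula.
labor_saved        = 40_000  # handling time eliminated ($)
error_cost_avoided = 12_000  # prevented rework and chargebacks ($)
cycle_time_value   =  5_000  # value of faster resolution ($)
inference_cost     =  9_000  # token and compute spend ($)
review_cost        =  6_000  # human verification time ($)
failure_cost       =  4_000  # residual error impact ($)

workflow_value = (labor_saved + error_cost_avoided + cycle_time_value
                  - inference_cost - review_cost - failure_cost)
print(workflow_value)  # 38000 -- positive only while reasoning waste stays controlled
```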
Limits and Open Questions
The papers are useful, but they do not solve deployment by themselves.
1. Benchmark reasoning is not business reasoning
Math benchmarks and synthetic visual logic puzzles are controlled environments. Business workflows are not. Real inputs are incomplete, political, contradictory, multilingual, and occasionally written by people who treat email subject lines as avant-garde poetry.
The research shows mechanisms, not turnkey enterprise systems.
2. LLM judges reduce brittleness but introduce new governance problems
The evaluation paper is careful about judge bias, but business systems will need more. A verifier used in finance, healthcare, legal operations, or regulated customer communications must have audit trails, reviewer override, version control, and clear failure categories.
An AI judge without governance is just another unaccountable model, now wearing a tiny judge wig.
3. Cost-aware routing needs domain-specific calibration
Tandem uses perplexity and entropy features from the smaller model to estimate whether guidance is sufficient. That is elegant for the experimental setting. But enterprise deployment will need calibration against business-specific risk: transaction value, customer importance, regulatory exposure, novelty, confidence, and historical error patterns.
The technical classifier is only one part of the control system.
4. Grounding and reasoning may need to be jointly designed per workflow
The neuro-symbolic result warns against treating extraction and reasoning as separate afterthoughts. In business workflows, the schema you choose determines what the AI can reason about.
A poor invoice schema creates poor payment reasoning. A weak customer-intent taxonomy creates weak support automation. A shallow contract representation creates shallow legal review.
Data structure is not a clerical detail. It is the skeleton of reasoning.
5. We still need better operational metrics for reasoning systems
Accuracy is not enough. Cost is not enough. Latency is not enough. Human review rate is not enough.
A practical reasoning system needs a dashboard that tracks:
| Metric | Why it matters |
|---|---|
| Escalation rate | Whether the system knows when it is uncertain |
| False confidence rate | Whether it acts certain while wrong |
| Verification disagreement | Whether judges, rules, and humans conflict |
| Reasoning cost per resolved case | Whether orchestration is economically sane |
| Rework rate | Whether downstream users trust and reuse outputs |
| Exception aging | Whether difficult cases get stuck |
| Policy override frequency | Whether rules or prompts are misaligned with operations |
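As a sketch of how such a dashboard could be computed from per-case records; the record fields are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CaseRecord:
    escalated: bool
    confident: bool        # system reported high confidence
    correct: bool          # ground truth after review
    judge_agreed: bool     # verifier and human reviewer agreed
    reasoning_cost: float  # inference spend for this case ($)
    reworked: bool         # downstream users had to redo the output

def dashboard(cases: List[CaseRecord]) -> Dict[str, float]:
    n = max(len(cases), 1)
    resolved = [c for c in cases if not c.escalated]
    return {
        "escalation_rate": sum(c.escalated for c in cases) / n,
        "false_confidence_rate": sum(c.confident and not c.correct for c in cases) / n,
        "verification_disagreement": sum(not c.judge_agreed for c in cases) / n,
        "cost_per_resolved_case": sum(c.reasoning_cost for c in cases) / max(len(resolved), 1),
        "rework_rate": sum(c.reworked for c in cases) / n,
    }
```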
This is where many AI pilots quietly fail. They measure output volume, not reasoning quality.
Conclusion
The most important message from this research cluster is simple: AI reasoning is moving from model capability to system architecture.
The neuro-symbolic paper shows that grounding does not automatically produce compositional reasoning. Tandem shows that reasoning can be divided between large and small models if the system knows when guidance is sufficient. The evaluation paper shows that correctness itself needs a robust semantic verification layer, because brittle symbolic checks can misread valid answers.
Together, they point to a more mature design philosophy:
- represent the task explicitly;
- train or design the reasoning process explicitly;
- allocate expensive reasoning dynamically;
- verify outputs semantically;
- govern the control signals, not just the final answer.
For businesses, this means the future of AI automation will not be one giant model sitting on top of a messy workflow, radiating confidence. It will be a coordinated system of intake, grounding, reasoning, execution, verification, and escalation.
Less myth. More machinery.
That is not less intelligent. That is what intelligence looks like when it has to show up for work.
Cognaptus: Automate the Present, Incubate the Future.
[^1]: Zichuan Fu et al., “Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning,” arXiv:2604.23623, 2026. https://arxiv.org/abs/2604.23623

[^2]: Erez Yosef et al., “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity,” arXiv:2604.22597, 2026. https://arxiv.org/abs/2604.22597

[^3]: Mahnoor Shahid and Hannes Rothe, “Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems,” arXiv:2604.26521, 2026. https://arxiv.org/abs/2604.26521