Balance Sheets Meet Brain Cells: Why Financial Reasoning Still Trips Up AI

A balance sheet does not care how confident a model sounds.

That is the useful cruelty of accounting. A number reconciles or it does not. A subtotal sits where it belongs or it does not. Treasury stock is treated correctly or it is not. A rule applies or it does not. Fluent explanation is welcome, but it is not evidence. It is the garnish. The meal is verification.

That distinction is exactly why FinRule-Bench is interesting. The paper introduces a benchmark for testing whether large language models can audit structured financial statements under explicit accounting principles, using real 2024 Form 10-K statements, human-curated rules, deterministic validators, and controlled rule-aware error injection.¹ The headline finding is not that LLMs are useless in finance. That would be lazy and, worse, comforting. The sharper finding is that models can often detect that something is wrong while still failing to say exactly what is wrong, which rule was violated, and where the violation occurs.

For financial automation, that difference is not cosmetic. It is the difference between a helpful assistant and an operational risk wearing a nice interface.

The important result is the gap between detection and diagnosis

The paper evaluates four models—GPT-4o, Gemini 2.5 Pro, Gemini 2.0 Flash, and LLaMA 3.3—across three task types. The easiest task is rule verification: given one financial table and one accounting rule, decide whether the table complies. The next task is rule identification: given a table and a set of rules, identify the single violated rule. The hardest task is joint rule diagnosis: detect whether violations exist and then identify all violated rules with record-level attribution.

The overall results tell a neat, slightly uncomfortable story.

Task stage	What the model must do	Strong observed result	Business reading
Rule verification	Check one table against one supplied rule	GPT-4o few-shot + causal-counterfactual reasoning (CR) reaches 0.738 accuracy	Single-rule compliance checks are plausible as assisted review
Rule identification	Select the violated rule from a supplied rule set	GPT-4o few-shot reaches 0.689 accuracy; few-shot + CR falls to 0.661	Choosing the right rule is harder than noticing non-compliance
Diagnosis Step-1	Decide whether a multi-violation case has violations	GPT-4o few-shot + CR reaches 0.801 accuracy	Models can often raise the alarm
Diagnosis Step-2	Identify all violated rules with record-level localization	GPT-4o few-shot + CR reaches only 0.342 exact match	Audit-ready completeness remains weak

This is the core evidence-first reading of the paper: detection survives better than diagnosis.

That sounds obvious until we remember how financial AI products are usually sold. Many product demos show a model reading a filing, explaining financial metrics, or flagging inconsistencies. Those demos test whether the model can be directionally useful. FinRule-Bench tests whether it can be complete under a rule system. Directionally useful is not the same as audit-ready. The spreadsheet has no patience for vibes.

The difference between Step-1 and Step-2 in joint diagnosis is especially important. A model that says “yes, there is a violation” may still miss one of several violated rules, confuse the rule category, or point to the wrong record. In a compliance workflow, that is not a minor formatting issue. It changes what the human reviewer investigates, what gets escalated, and what risk remains hidden.

FinRule-Bench tests rule-governed auditing, not generic financial chat

The benchmark is built around four canonical financial statement types: Balance Sheets, Cash Flow Statements, Income Statements, and Statements of Equity. The dataset includes 16 statement-specific rules, 1,117 ground-truth records, 4,385 single-violation records, and 400 multi-violation records. The rules cover accounting patterns such as arithmetic consistency, hierarchical aggregation, structural classification, conditional applicability, and multi-section dependencies.

The construction matters because the paper is not simply asking, “Can an LLM answer finance questions?” It is asking a narrower and more operational question: given a structured financial table and explicit rules, can the model enforce those rules completely?

That is a different problem from common financial QA. A QA benchmark may reward extracting the right number from a passage or performing a calculation. A summarization benchmark may reward accurately condensing disclosures. Those are useful tasks, but they do not require exhaustive rule enforcement. Auditing does.

FinRule-Bench therefore uses deterministic validators as the source of ground truth. A validator checks whether a table satisfies a rule. Error-injected records are created through minimal, controlled edits that flip validator outcomes while preserving the original table structure. In plain terms: the benchmark starts from clean financial statements, introduces targeted rule violations, and then asks whether models can find them.
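To make the validator-plus-injection idea concrete, here is a minimal sketch in Python. It is not the paper's code: the table layout, the function names, and the choice of the basic accounting equation as the rule are all assumptions made for illustration. It shows a deterministic check passing on a clean table, then failing after one small edit that leaves every row and label in place.

```python
from copy import deepcopy

# Hypothetical table: one record per line item, amounts in USD millions.
clean_table = [
    {"item": "Total assets", "amount": 500},
    {"item": "Total liabilities", "amount": 320},
    {"item": "Total shareholders' equity", "amount": 180},
]

def amount_of(table, item):
    """Return the amount for a named line item."""
    for row in table:
        if row["item"] == item:
            return row["amount"]
    raise KeyError(item)

def validate_accounting_equation(table):
    """Deterministic check: assets must equal liabilities plus equity."""
    return amount_of(table, "Total assets") == (
        amount_of(table, "Total liabilities")
        + amount_of(table, "Total shareholders' equity")
    )

def inject_violation(table, item, delta):
    """Return a copy with one amount shifted; rows and labels are unchanged."""
    edited = deepcopy(table)
    for row in edited:
        if row["item"] == item:
            row["amount"] += delta
    return edited

violated_table = inject_violation(clean_table, "Total liabilities", 5)

print(validate_accounting_equation(clean_table))     # True
print(validate_accounting_equation(violated_table))  # False
```

The point of the design is that the label on each record comes from code, not from a human judgment call: a record counts as violated exactly when the validator's answer flips.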

That design removes one common escape route. If a model fails, the failure is not mainly about messy OCR, strange formatting, or invented financial data. The tables are represented in Markdown, the rules are supplied, and the evaluation is deterministic. The model’s job is to reason over the table and the rule set.

The benchmark also deliberately stays within individual statement types. It does not test cross-statement reconciliation, such as tying net income from the income statement into retained earnings or reconciling cash movements across statements. That boundary is important. The results are already sobering before the task expands to full audit workflows.

The task ladder exposes where “financial reasoning” breaks

The three tasks are not just three benchmark rows. They are a ladder of operational responsibility.

First, rule verification asks for a binary judgment under one rule. This is closest to a checklist item. If the model receives a balance sheet and a rule such as “Treasury stock should be reported as a deduction from total shareholders’ equity,” it must decide whether the statement complies.

Second, rule identification removes that comfort. The model sees the table and the rule set, then must choose which rule has been violated. This requires discrimination among competing accounting principles. It is no longer enough to sense that something looks wrong; the model must name the governing rule.

Third, joint rule diagnosis adds the part auditors actually care about: multiple violations can occur simultaneously, and the model must identify all of them with record-level localization. This is where many LLM results degrade sharply. The model may find part of the answer, but partial coverage is not complete diagnosis. A junior auditor who finds one error and misses the other two is not “almost done.” They are still creating review work.

The paper’s strict exact-match metric for Step-2 diagnosis is therefore not unfairly harsh. It reflects the operational nature of the task. In auditing and compliance, the question is rarely “Did the system notice any smoke?” The question is “Did it identify every material fire, connect each one to the right rule, and tell us where to inspect?”
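A rough sketch of why exact match is so unforgiving, assuming each violation is represented as a (rule, record) pair. The rule and row identifiers below are made up, and the `coverage` helper is included only for contrast; it is not a metric taken from the paper.

```python
def exact_match(predicted, gold):
    """Strict scoring: the predicted (rule, record) pairs must equal gold."""
    return set(predicted) == set(gold)

def coverage(predicted, gold):
    """Fraction of gold violations recovered; shown only for contrast."""
    gold = set(gold)
    return len(gold & set(predicted)) / len(gold) if gold else 1.0

# Hypothetical three-violation case.
gold = [("R3", "row_12"), ("R7", "row_4"), ("R9", "row_21")]

partial = [("R3", "row_12")]                                         # finds one of three
mislocalized = [("R3", "row_12"), ("R7", "row_5"), ("R9", "row_21")]  # one wrong row

for name, predicted in [("partial", partial), ("mislocalized", mislocalized), ("complete", gold)]:
    print(name, exact_match(predicted, gold), round(coverage(predicted, gold), 2))
```

Under this scoring, the answer that gets two of three violations right earns the same exact-match credit as the answer that gets one: none.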

Causal-counterfactual prompting helps—but not everywhere

One of the more useful parts of the paper is its treatment of causal-counterfactual prompting. The authors compare zero-shot prompting, few-shot prompting, and few-shot prompting with causal-counterfactual reasoning. In the third setting, exemplars show the causal condition that creates a violation and a minimal counterfactual repair that would remove it.

This is not a new model architecture. It is a prompt design strategy. The paper is careful about that. Causal-counterfactual prompting does not enforce reasoning consistency, does not guarantee better accuracy, and does not score explanations after the fact. It conditions the model through examples.

That distinction matters because the results are mixed in an informative way.

For rule verification, few-shot prompting generally helps. Adding causal-counterfactual structure can improve some cases, but it also adds token cost and does not uniformly improve performance. For rule identification, the extra reasoning scaffold can even hurt discrimination. The model needs to choose one violated rule, and a verbose causal frame may add noise rather than clarity. Apparently, making a model think longer is not the same as making it think better. A shocking development, except not really.

For joint rule diagnosis, causal-counterfactual prompting is more defensible. The paper reports that it gives larger relative gains for Step-2 exact-match localization, especially where multiple constraints interact. That makes sense. When the task requires identifying several linked rule failures, examples that show causes and minimal repairs can guide the model toward more systematic coverage.

The practical lesson is not “always use more elaborate prompting.” It is “match the reasoning scaffold to the failure mode.” If the task is simple compliance checking, extra prompt machinery may be wasteful. If the task is exhaustive multi-rule diagnosis, the extra cost may buy useful coverage.
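For readers who want to picture the third prompting setting, the sketch below assembles a few-shot prompt whose exemplar carries a cause and a minimal counterfactual repair. The field names, the prompt layout, and the example table are assumptions for illustration; the paper's actual prompt templates are not reproduced here. The rule text is the treasury stock example quoted earlier.

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    table_md: str        # financial table rendered as Markdown
    rule: str            # accounting rule under test
    verdict: str         # "VIOLATION" or "COMPLIANT"
    cause: str           # condition that creates the violation
    counterfactual: str  # minimal edit that would remove it

def render_exemplar(ex):
    """Format one worked example, ending with its cause and repair."""
    return "\n".join([
        "Table:",
        ex.table_md,
        f"Rule: {ex.rule}",
        f"Verdict: {ex.verdict}",
        f"Cause: {ex.cause}",
        f"Counterfactual repair: {ex.counterfactual}",
    ])

def build_prompt(exemplars, table_md, rule):
    """Concatenate worked examples, then the unanswered query."""
    shots = "\n\n".join(render_exemplar(ex) for ex in exemplars)
    query = "\n".join(["Table:", table_md, f"Rule: {rule}", "Verdict:"])
    return f"{shots}\n\n{query}"

TREASURY_RULE = (
    "Treasury stock should be reported as a deduction from "
    "total shareholders' equity."
)

example = Exemplar(
    table_md="| Section | Item | Amount |\n|---|---|---|\n| Assets | Treasury stock | 40 |",
    rule=TREASURY_RULE,
    verdict="VIOLATION",
    cause="Treasury stock appears under Assets instead of reducing equity.",
    counterfactual="Move the Treasury stock row to Shareholders' equity as a deduction.",
)

prompt = build_prompt(
    [example],
    "| Section | Item | Amount |\n|---|---|---|\n| Shareholders' equity | Treasury stock | (25) |",
    TREASURY_RULE,
)
print(prompt)
```

Every exemplar adds two extra lines of reasoning text, which is where the additional token cost comes from. Whether that cost pays off depends on the task, as the results above suggest.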

The dominant failures are incomplete coverage and mislocalization

The paper’s error analysis is more valuable than the aggregate leaderboard. Aggregate accuracy says who won. Error analysis says what kind of system you are allowed to build.

In joint diagnosis, the models frequently detect that violations exist but fail to identify all violated rules or localize them correctly. The paper identifies partial detection, mixed errors, false positives, false negatives, and mislocalization as recurring failure modes. Conditional and multi-record constraints are especially difficult because they require the model to determine when a rule applies and how records depend on each other.

This is a familiar pattern in enterprise AI. The system is good enough to appear useful, but not consistent enough to be trusted without workflow design. It produces a helpful first pass, then leaves the expensive part—verification, completeness, and exception handling—to humans.

That does not make the system worthless. It changes the product design.

A finance AI system built around these results should not be positioned as an autonomous auditor. It should be positioned as a triage layer, reviewer assistant, or compliance workbench component. The model can help surface candidate violations, propose rule mappings, and generate review notes. But the system should preserve deterministic checks, structured outputs, rule-level confidence signals, and human approval for final judgment.

The key is to treat LLM output as a hypothesis, not a conclusion.
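One way to act on that, sketched under assumptions: represent each model claim as a (rule, record) finding and reconcile it against whatever the deterministic validators flag. The `Finding` type, the bucket names, and the identifiers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Finding:
    rule_id: str    # hypothetical rule identifier
    record_id: str  # hypothetical row identifier

def reconcile(llm_findings, validator_findings):
    """Sort model hypotheses by whether a deterministic validator agrees."""
    llm, validated = set(llm_findings), set(validator_findings)
    return {
        "confirmed": sorted(llm & validated),         # model and validator agree
        "needs_review": sorted(llm - validated),      # model claim with no validator support
        "missed_by_model": sorted(validated - llm),   # validator caught it, model did not
    }

llm_output = [Finding("R3", "row_12"), Finding("R7", "row_5")]
validator_output = [Finding("R3", "row_12"), Finding("R7", "row_4")]

report = reconcile(llm_output, validator_output)
for bucket, findings in report.items():
    print(bucket, [(f.rule_id, f.record_id) for f in findings])
```

Here the model named the right rule for R7 but pointed at the wrong row, so the same violation shows up twice: once as an unsupported claim and once as a miss. That is the kind of disagreement a reviewer should see, not a merged final answer.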

Rule complexity changes the business value of the model

Not all financial rules are equally hard. The paper’s rule complexity analysis separates arithmetic, structural, conditional, and multi-record rules. That distinction is useful for business deployment because it maps directly to automation scope.

Rule type	What makes it hard or easy	Reasonable automation posture
Arithmetic consistency	Often reducible to explicit calculations	Use deterministic code first; use LLMs mainly for table interpretation or explanation
Structural classification	Requires understanding sections, labels, and hierarchy	Use LLMs as assistants, but validate against schema rules
Conditional applicability	Requires deciding whether a rule is triggered	Require stronger review, because the hard part is applicability rather than calculation
Multi-record or multi-section constraints	Requires tracking dependencies across rows, sections, or periods	Treat as high-risk diagnosis; combine LLMs with validators and human review

This table is where the benchmark becomes business-relevant. It suggests that “AI for financial reasoning” is too broad a product category. A useful product roadmap should separate tasks by rule type and risk.

Arithmetic checks should rarely be delegated to an LLM as the primary engine. Deterministic computation is cheaper, more reliable, and easier to audit. The LLM may still help extract table structure, explain a failed check, or route the issue to the correct reviewer.

Structural and conditional rules are more interesting. They involve semantic interpretation: what section does a line item belong to, whether a label is equivalent to a required concept, or whether a disclosure condition has been triggered. This is closer to the zone where LLMs provide value, but also where they can quietly misclassify context.

Multi-record diagnosis is the hardest and potentially the most valuable, because it resembles real compliance review. It is also exactly where the paper shows current models struggle to achieve complete exact-match localization. That creates a product opportunity, but not the naive one. The value is not “replace auditors.” The value is “reduce reviewer search cost while making incompleteness visible.”

The benchmark is a warning against the wrong evaluation metric

The likely misconception FinRule-Bench attacks is simple: if a model performs well on financial QA, table reasoning, or anomaly detection, it must be ready for accounting-rule auditing.

No. That inference skips the hardest part.

Financial QA tests whether a model can retrieve or compute an answer. Anomaly detection tests whether something looks abnormal. Rule-based auditing tests whether the model can apply a formal rule system exhaustively and attribute violations precisely. These capabilities overlap, but they are not the same.

This is why Step-2 exact match matters. Many businesses would be tempted to evaluate a financial AI assistant by asking whether it flagged the right document, produced a plausible explanation, or detected at least one inconsistency. Those metrics are too forgiving. They can hide partial coverage.

A better evaluation stack for financial AI should include at least four layers:

Detection: Did the system notice that a violation exists?
Rule attribution: Did it identify the violated accounting principle?
Coverage: Did it find all simultaneous violations?
Localization: Did it map each violation to the correct record or line item?

Only the first layer looks impressive in a demo. The other three decide whether the system is useful in production.
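The four layers can be scored separately for each case. The sketch below is one possible operationalization, not the paper's metric suite: it assumes the case is known to contain at least one violation, and it assumes predictions and ground truth are sets of (rule, record) pairs with made-up identifiers.

```python
def evaluate_case(predicted, gold):
    """Score one violating case; both arguments are sets of (rule_id, record_id)."""
    pred_rules = {rule for rule, _ in predicted}
    gold_rules = {rule for rule, _ in gold}
    shared = pred_rules & gold_rules
    return {
        # Layer 1: any violation flagged at all.
        "detection": bool(predicted),
        # Layer 2: at least one rule named, and no rule named that was not violated.
        "rule_attribution": bool(pred_rules) and pred_rules <= gold_rules,
        # Layer 3: every violated rule was named.
        "coverage": gold_rules <= pred_rules,
        # Layer 4: each correctly named rule points at exactly the right records.
        "localization": bool(shared) and all(
            {rec for r, rec in predicted if r == rule}
            == {rec for r, rec in gold if r == rule}
            for rule in shared
        ),
    }

gold = {("R3", "row_12"), ("R7", "row_4")}

incomplete = evaluate_case({("R3", "row_12")}, gold)
wrong_row = evaluate_case({("R3", "row_12"), ("R7", "row_9")}, gold)

print(incomplete)  # passes every layer except coverage
print(wrong_row)   # passes every layer except localization
```

Reporting the layers side by side keeps a strong detection score from hiding a weak coverage or localization score.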

Cost is part of the reasoning problem

The paper also tracks token usage and cost, which is not a side issue. In enterprise workflows, a method that improves accuracy by a few points while tripling prompt length may be unacceptable unless the task is high-risk enough to justify it.

The results show that structured prompting is not uniformly cost-effective. For rule verification and rule identification, additional reasoning overhead often produces limited or inconsistent gains. For joint diagnosis, especially Step-2 localization, the gains are more meaningful but come with substantial token cost.

This should push teams away from one-size-fits-all prompting. A sensible financial AI system would route tasks by risk and complexity:

simple arithmetic checks go to deterministic validators;
ordinary single-rule checks use short prompts plus structured output;
ambiguous structural or conditional cases receive richer examples;
multi-violation diagnosis gets the heavier reasoning scaffold and human review.

That architecture is less glamorous than “one agent reads the filing and audits everything.” It is also less likely to embarrass you in front of a partner, regulator, or CFO, which is traditionally considered a product feature.
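The routing list above can be written down as a small lookup table. This is a sketch under assumptions: the task categories mirror the four list items, while the step names are hypothetical labels, not components of any real system or of the paper.

```python
from enum import Enum

class Task(Enum):
    ARITHMETIC_CHECK = "arithmetic_check"
    SINGLE_RULE_CHECK = "single_rule_check"
    AMBIGUOUS_STRUCTURAL_OR_CONDITIONAL = "ambiguous_structural_or_conditional"
    MULTI_VIOLATION_DIAGNOSIS = "multi_violation_diagnosis"

# Hypothetical step names; each tuple mirrors one line of the routing list.
ROUTES = {
    Task.ARITHMETIC_CHECK: ("deterministic_validator",),
    Task.SINGLE_RULE_CHECK: ("short_prompt", "structured_output"),
    Task.AMBIGUOUS_STRUCTURAL_OR_CONDITIONAL: ("few_shot_prompt_with_richer_examples",),
    Task.MULTI_VIOLATION_DIAGNOSIS: ("heavier_reasoning_prompt", "human_review"),
}

def route(task):
    """Return the ordered processing steps for a task."""
    return ROUTES[task]

def uses_llm(task):
    """True when any step in the route is a prompt to a model."""
    return any("prompt" in step for step in route(task))

for task in Task:
    print(task.value, route(task), "LLM" if uses_llm(task) else "no LLM")
```

Keeping the routes in data rather than in branching logic makes the policy easy to audit and easy to tighten when a task category turns out to be riskier than expected.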

What this paper directly shows, and what Cognaptus infers

The paper directly shows that current LLMs handle isolated rule verification better than rule identification, and far better than complete Step-2 diagnosis of multiple violations. It shows that diagnosis failures are often driven by incomplete coverage and mislocalization rather than total inability to detect anomalies. It also shows that causal-counterfactual prompting can help in some settings, especially diagnosis, but is not a free improvement across all tasks.

Cognaptus infers three practical design principles from this.

First, financial AI evaluation should be task-specific. A model that is acceptable for summarizing filings may be unacceptable for auditing rule compliance. Procurement teams should not let a vendor substitute general financial reasoning benchmarks for rule-governed compliance tests.

Second, production systems should separate language interpretation from deterministic validation. The LLM can help interpret line items, generate hypotheses, and explain rule implications. Validators should handle anything that can be expressed as a deterministic check. This is not anti-LLM. It is pro-not-getting-fired.

Third, human review should focus on the failure modes the benchmark exposes: missed rules, wrong rule selection, and wrong record localization. A reviewer interface should not simply show a model’s final answer. It should show which rules were checked, which records were implicated, which rules were not confidently resolved, and whether a deterministic validator agrees.

Boundaries: this is not yet real-world audit automation

FinRule-Bench is a focused benchmark, not a full audit simulation. The rules are supplied. The tables are clean and complete. The benchmark does not test rule discovery, noisy or missing data, domain knowledge retrieval, supervised fine-tuning, or cross-statement reconciliation. It also reflects the institutional context of publicly available corporate disclosures, primarily U.S.-centric Form 10-K filings.

Those boundaries do not weaken the paper. They make the result more pointed. Even in a controlled setting—where the rules are given, the tables are structured, and deterministic validators define the ground truth—models still struggle with complete multi-rule diagnosis.

A messier real-world workflow would add more difficulty, not less.

That is the part businesses should take seriously. The benchmark does not prove that LLMs cannot be useful in financial compliance. It proves that usefulness must be engineered around their failure modes.

The future financial AI stack will be diagnostic, not just conversational

The old financial AI demo was conversational: upload a filing, ask a question, receive an answer. That still has value. But FinRule-Bench points toward a more serious architecture: diagnostic financial AI.

A diagnostic system does not merely respond. It checks. It maps rules to records. It separates detection from attribution. It distinguishes partial coverage from complete coverage. It knows when a deterministic validator should overrule a fluent explanation. It treats “I found something” as the beginning of review, not the end.

That is the real business lesson of the paper. Financial reasoning is not one capability. It is a chain of capabilities, and the chain breaks where models must be complete, not merely plausible.

LLMs can talk about balance sheets. They can often find suspicious patterns. They can sometimes explain the governing rule. But when the job is to identify every violated rule and point to the exact record, the brain cells still trip.

The spreadsheet, as usual, remains unimpressed.

Cognaptus: Automate the Present, Incubate the Future.

¹ Arun Vignesh Malarkkan, Manan Roy Choudhury, Guangwei Zhang, Vivek Gupta, Qingyun Wang, Yanjie Fu, and Denghui Zhang, “FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles,” arXiv:2603.11339, 2026, https://arxiv.org/pdf/2603.11339.
