TL;DR for operators

Tax AI should not be judged by whether the model can produce a confident answer in fluent prose. That is how one builds a very polite liability machine.

The useful pattern in this paper is architectural: let the language model translate statutory text and taxpayer facts into executable Prolog; let a symbolic solver compute the result; reject outputs that fail execution or disagree across independent attempts; then evaluate the system using an error-cost ledger, not just accuracy.1 The paper’s strongest practical message is therefore not “LLMs can do tax”. It is: high-stakes rule automation becomes more credible when the model is demoted from final authority to structured translator.

The headline numbers matter because they change product economics. On the 100 numerical SARA tax cases, direct answering is often expensive once errors are priced as liability. A symbolic route with o3, without gold statute encodings, reached 75 correct answers, 15 incorrect answers, 10 abstentions, and a break-even price of $47.43 ± $22.16. With manually translated gold statutes and retrieved exemplar cases, GPT-5 few-shot parsing reached 86 successes, 9 failures, 5 abstentions, and a break-even price of $15.78 ± $10.34. That is the sort of result operators notice, preferably before the compliance department does.

The catch is equally important. The best results depend on synthetic cases, clean statutory scope, manually curated formal logic, retrieved examples, and a U.S.-specific cost model. This is pilot architecture, not a green light to replace tax professionals with a chatbot and a prayer.

Tax software fails when the answer cannot defend itself

Tax work is not just arithmetic with paperwork attached. It is a chain of eligibility rules, exceptions, definitions, numerical thresholds, document interpretation, and audit exposure. A wrong answer is not merely “low benchmark performance”. It can mean overpayment, underpayment, penalties, customer complaints, and an awkward meeting with legal.

That is why the common LLM pitch is weak here. A model can explain its answer in a voice that sounds sober enough to wear a tie, but fluent explanation is not the same as executable proof. Chain-of-thought may be persuasive to a reader while still being causally loose, selectively faithful, or just wrong in a better font.

The paper tackles that gap by reframing statutory tax reasoning as semantic parsing. The model’s job is not to “reason about tax” in open-ended prose. Its job is to translate rules and facts into a formal program. The solver’s job is to execute that program. The system’s job is to refuse when the program does not run, times out, or fails a consistency check.

That change sounds technical. Operationally, it is the whole article.

The mechanism has three gears: translate, execute, abstain

The paper compares three core ways to solve the numerical subset of SARA, a benchmark built from edited U.S. federal tax-code sections and hand-crafted cases. SARA contains 376 cases overall; the authors focus on the 100 cases requiring numerical tax-obligation calculations because they are harder than binary “does this rule apply?” judgments.

The first method is direct calculation. The model receives the statute text and the case facts, then outputs the tax obligation. This is the familiar chatbot route: read, reason, answer. It is also the route most likely to confuse confident language with operational control.

The second method is zero-shot parsing into a symbolic solver. The model receives the plain-text statutes and case facts, but instead of returning the answer directly, it generates a Prolog program encoding the relevant rules and facts. SWI-Prolog then executes the query. If the generated program fails to execute into the proper format or hangs for more than 10 seconds, the system abstains.

The third method is few-shot parsing with gold statutes. Here, the system already has manually translated Prolog versions of the statutes. It also retrieves the five most logically similar cases and gives the model their gold Prolog translations as examples. The model then parses the new case facts into the same formal style. The paper calls these examples “precedents” only informally. They are not binding legal precedent; they are structured exemplars showing how similar fact patterns were formalised.

The mechanism can be written as a product pipeline:

Stage What happens Operational meaning
Intake User facts and statutory text enter the system The model sees the raw mess, not the final authority
Parsing LLM converts rules or facts into Prolog The model becomes a translator into a controlled representation
Execution Prolog computes the tax obligation The answer must follow the executable trace
Refusal Invalid programs, timeouts, or disagreements are rejected Silence is priced as a feature, not a failure
Audit The generated program and query can be inspected Review moves from “trust me” to “run the trace”

This is why mechanism-first is the right reading. The paper is not mainly a leaderboard story. It is a division-of-labour story.

The cost model makes wrong confidence expensive

Accuracy alone is too clean for tax. It treats all wrong answers as equal, which is convenient for papers and not how invoices, penalties, or angry customers work.

The authors build a break-even price metric that approximates the cost an automated tax advisor would need to absorb if it were accountable for mistakes. They use U.S. Internal Revenue Code §6662: a 20% penalty on substantial understatement of income tax. For personal or household cases, the paper treats understatement as substantial when it exceeds the greater of 10% of the true obligation or $5,000. Overstatement costs the excess tax paid. Refusal costs $270, standing in for the average U.S. filing cost reported by the IRS and used as a proxy for human/professional completion cost.

In simplified form, the per-case cost is:

$$ \text{cost} = \frac{1}{N}\sum_{i=1}^{N} \begin{cases} -(\Delta y_i), & \Delta y_i < 0 \ 0.2(\Delta y_i), & \Delta y_i > \max(5000,\ 0.1y_i) \ 270, & \text{if refused} \ 0, & \text{otherwise} \end{cases} $$

where $\Delta y_i = y_i - \hat{y}_i$, so positive values mean the system understated tax owed and negative values mean it overstated tax owed.

This metric is not perfect, but it is much more useful than raw exact match for business interpretation. It asks the deployment question: if the vendor had to eat the cost of its mistakes and hand off uncertain cases, what would it need to charge?

That framing also explains why abstention is not an embarrassment. In ordinary benchmarks, refusal looks like missing output. In a liability-bearing product, refusal can be cheaper than false confidence. The “always abstain” baseline costs $270. The absurd “always predict $0” baseline gets 5 cases right, 95 wrong, and produces a break-even price of $16,227.11 ± $7,805.94. A system that never admits uncertainty may look decisive. So does a cliff.

Without gold statutes, Prolog already changes the economics

The first main evidence block is Table 2: methods without manually supplied gold statute encodings. This is the more immediately interesting setting for operators because it does not assume the organisation has already formalised its rulebook.

The results are uneven, which is precisely why they are useful. Some smaller models fail badly when asked to generate executable Prolog. Llama-70B parsed, for example, produced only 1 correct answer, 43 incorrect answers, 56 abstentions, and a break-even price of $252,027.73 ± $414,049.97. That is not a typo; it is a reminder that “add symbolic solver” is not magic dust. Bad formalisation can be worse than direct answering because the wrong program still executes with great confidence.

But the stronger models show why the architecture is worth attention. OpenAI o3 in parsed mode achieved 75 correct, 15 incorrect, and 10 abstentions, with a $47.43 ± $22.16 break-even price. That beat o3 direct answering, which produced 56 correct and 44 incorrect with a far higher $6,431.84 ± $2,637.94 break-even price. The difference is not just accuracy. It is the combination of better structured computation and a refusal path.

DeepSeek-R1 shows a slightly different lesson. Its direct mode reached 74 correct and 26 incorrect, with a break-even price of $304.29 ± $225.57. But requiring two direct attempts to agree — Direct + Direct — produced 66 correct, 12 incorrect, 22 abstentions, and a much lower $94.20 ± $59.76. Self-consistency here acts as a selectivity filter: fewer answers, fewer expensive mistakes.

So the first result is not “Prolog always wins”. It is sharper: symbolic execution plus abstention can convert model uncertainty into operational selectivity. That selectivity is what lowers liability-adjusted cost.

Gold statutes turn tax reasoning into event extraction

The second main evidence block, Table 3, changes the task. With gold statutes and retrieved exemplar cases, the model no longer has to invent the statutory formalisation from scratch. It must map the new case facts into an existing Prolog representation.

That is a different job. It is closer to structured extraction than legal reasoning. And once the task is reframed that way, performance improves dramatically for the strongest systems.

GPT-5 few-shot achieved 86 successes, 9 failures, 5 abstentions, and a break-even price of $15.78 ± $10.34. GPT-4.1 Few-Shot + Few-Shot achieved 81 successes, 5 failures, 14 abstentions, and $40.08 ± $15.87. o3 few-shot achieved 81 successes, 13 failures, 6 abstentions, and $60.26 ± $58.93.

These results should be read with the right boundary. Table 3 is not directly comparable to Table 2, because Table 3 assumes up-front human work: manually translated statutes and representative cases. That work is expensive, slow, and domain-specific. The paper is explicit about this.

Still, for a business with repeated filings, repeated claims, repeated eligibility checks, or repeated policy calculations, the economics may be attractive. The up-front cost of formalising rules becomes infrastructure. Once built, each new case becomes a cheaper parsing-and-execution problem.

That is the strategic point. The value is not in asking a model to rediscover the law every time. The value is in building a formal rule layer once, then using language models to map messy user facts into it.

The chat-versus-reasoning split is a deployment lesson

One of the paper’s more useful findings is counterintuitive: reasoning models are not always better.

The authors compare chat-optimised and reasoning-optimised models across direct solving, zero-shot parsing, and few-shot parsing. Reasoning models tend to do better at direct calculation and at zero-shot parsing, where the model must identify relevant statutory structure and produce a complete program. That makes sense. These settings require more inferential work.

But in the few-shot gold-statute setting, chat models can outperform reasoning models. The paper’s explanation is plausible: once the formal statute layer exists, the task is no longer “think deeply through the law”. It is “imitate the formalisation pattern faithfully”. Long reasoning traces may distract from that kind of close translation.

That matters for product design. A system does not need one heroic model doing everything. It may need different model behaviours at different points in the workflow:

Workflow task Better-suited behaviour Why
Direct tax calculation from raw text Reasoning-heavy model More arithmetic and rule selection
Zero-shot statute/case formalisation Strong reasoning model Must infer formal structure without gold examples
Case-fact parsing into existing gold statutes Chat or instruction-following model Faithful pattern imitation can beat elaborate reasoning
Acceptance decision Independent agreement or executable validation Selectivity matters more than eloquence

The obvious enterprise mistake would be to standardise on one model because it wins a broad benchmark. The paper suggests a less glamorous but better answer: assign models by function. Use reasoning where reasoning is actually the bottleneck. Use smaller or chat-tuned parsers where faithful translation is the bottleneck. Let the solver settle the arithmetic.

The figures mostly support the mechanism, not a second thesis

The paper’s figures and tables serve different purposes, and mixing them together would blur the argument.

Paper component Likely purpose What it supports What it does not prove
Figure 2: method diagram Implementation detail and conceptual map The architecture shifts from direct answering to parsing, execution, and refusal It does not show performance by itself
Table 2: no gold statutes Main evidence Symbolic parsing and self-consistency can lower break-even cost without manual gold rules It does not show that all models can formalise statutes reliably
Table 3: gold statutes and exemplars Main evidence with deployment boundary Up-front formalisation plus exemplar retrieval can sharply improve cost and success rates It does not remove the cost of building the formal rule base
Figure 3: chat vs reasoning performance Comparison across model types Reasoning and chat models excel in different task variants It does not mean chat models are generally better
Figure 4: method mixtures Robustness/selectivity test Independent agreement can reduce failures by accepting fewer cases It does not prove the optimal gating policy for real tax products
Figure 5: parsing success by reasoning-model size Exploratory scaling signal Larger reasoning models improve at zero-shot parsing into Prolog It does not prove that scale alone will solve legal formalisation

Figure 5 is especially worth handling carefully. It shows standalone parsing success improving with model size among reasoning models: the smaller Qwen and Llama reasoning variants solve almost none in that setting, while DeepSeek-R1 and o3 improve substantially, with o3 reaching 75 correct in parsed mode. That suggests future gains in semi-automated rule formalisation. It does not prove that entire tax codes can now be safely distilled into Prolog by a model. The jump is promising. The production conclusion still needs human review, regression tests, and versioned rule governance. Tedious, yes. Also the difference between software and litigation.

What this means for compliance products

The natural business extension is not limited to tax. The same pattern appears anywhere rules, calculations, evidence, and accountability collide.

In tax filing, the rule layer is statutory code and filing guidance. In benefits eligibility, it is programme rules and income thresholds. In insurance, it is underwriting logic, policy exclusions, and claims rules. In procurement or customs, it is classification rules, tariffs, documentation requirements, and exceptions. The shared shape is simple: inputs are messy, rules are formal-ish, mistakes cost money, and the organisation needs to explain the decision after the fact.

Cognaptus would infer three product principles from the paper.

First, separate translation from authority. The model may parse facts, retrieve comparable cases, draft explanations, and propose formal encodings. But the executable rule layer should remain the authority for calculation where possible.

Second, make refusal economically visible. A refusal is not just a UX event. It is a cost. Track refusal rates, handoff costs, error costs, and downstream review time as one ledger. That is how operators decide whether the system is cheaper than the manual workflow.

Third, store the trace. A tax assistant that cannot reproduce the program, rule version, input facts, execution query, and acceptance decision is not an auditable assistant. It is a chatbot with a memory problem.

A practical architecture would look like this:

  1. Structured intake: collect user facts with validation, entity normalisation, and document references.
  2. Rule retrieval: select relevant statutory or policy modules.
  3. Case retrieval: retrieve structurally similar gold-formalised examples where available.
  4. LLM parser: translate facts into Prolog predicates aligned to the rule base.
  5. Symbolic execution: run the query with timeouts and deterministic logging.
  6. Agreement gate: require independent parse agreement, direct/parsed agreement, or targeted human review for high-impact cases.
  7. Explanation layer: render a user-facing answer from the executable trace, not from improvised prose.
  8. Liability ledger: price errors, refusals, review time, and inference cost continuously.

The point is not to worship Prolog specifically. Prolog is useful here because legal rules and tax predicates map naturally into declarative logic. In other domains, the solver might be a rules engine, a constraint solver, a spreadsheet-calculation engine, Datalog, Catala, or another formal backend. The pattern is what matters: natural language in, executable representation in the middle, accountable answer out.

Where the evidence stops

The paper’s limitations are not ornamental. They define where the result can be used.

The benchmark is SARA, not the full tax world. SARA uses nine edited sections of the U.S. federal tax code and synthetic, hand-crafted cases. That is valuable because it creates a solvable and inspectable benchmark. It is also far cleaner than real filing, where forms are incomplete, users misunderstand questions, statutes interact with administrative guidance, and edge cases arrive wearing ordinary clothes.

The strongest results require gold statute programs and gold case exemplars. That means humans have already done a significant amount of formalisation work. For large-volume domains, this may be justified. For fast-changing or low-volume domains, the economics are less obvious.

The cost model is U.S.-specific and simplified. It includes overstatement, substantial-understatement penalties, refusal cost, and non-substantial understatement as zero in the formula. It excludes many real-world frictions: interest, amended returns, audit stress, customer support, reputational damage, professional liability insurance, and jurisdictional variation. That does not make the metric useless. It makes it a disciplined approximation, not an actuarial product model.

The model stack is also frontier-heavy. The best results come from OpenAI models, including GPT-5 as an auxiliary experiment, and from large DeepSeek variants. The authors note the concern that reliance on closed-source systems may create choke points. Operators should treat open-model replication and domain-specific smaller parsers as part of the roadmap, not a charitable side quest.

Finally, auditability is improved but not solved. A Prolog trace proves that the solver followed the generated program. It does not prove that the generated program perfectly captured the statute or the taxpayer’s facts. The trust boundary moves upstream: from “did the model reason correctly?” to “did the model formalise correctly?” That is progress. It is not absolution.

The real lesson is not bigger reasoning, but bounded authority

The misconception to kill is that better chain-of-thought tax answers are enough. They are not. High-stakes domains need systems that can show how a result was produced, reject fragile cases, and price the cost of being wrong.

This paper’s useful contribution is to make that architecture concrete. It turns tax reasoning into semantic parsing, routes calculation through Prolog, introduces refusal through execution failure and self-consistency, and evaluates the result using liability-adjusted break-even price. That combination is more operationally meaningful than another “LLM beats benchmark” headline. We have enough of those. Some of them should probably file estimated taxes.

For businesses, the near-term opportunity is not autonomous tax advice with no human in the loop. It is human-supervised, solver-backed automation for repetitive rule-and-calculation workflows. The model handles language. The symbolic layer handles obligation. The gate handles doubt. The audit trail handles the inevitable question: “Why did the system say that?”

That is a quieter claim than “AI will replace tax professionals”. It is also a better one.

Cognaptus: Automate the Present, Incubate the Future.


  1. William Jurayj, Nils Holzenberger, and Benjamin Van Durme, “Language Models and Logic Programs for Trustworthy Tax Reasoning,” arXiv:2508.21051, https://arxiv.org/abs/2508.21051↩︎