TL;DR
A neuro‑symbolic architecture (LLM + Prolog) turns tax calculation from vibes into verifiable logic. The paper we analyze shows that adding a symbolic solver, selective refusal, and exemplar‑guided parsing can lower the break‑even cost of an AI tax assistant to a fraction of average U.S. filing costs. Even more interesting: chat‑tuned models often beat reasoning‑tuned models at few‑shot translation into logic — a counterintuitive result with big product implications.
Why this matters for operators (not just researchers)
Most back‑office finance work is a chain of (1) rules lookup, (2) calculations, and (3) audit trails. Generic LLMs are great at (1), decent at (2), and historically bad at (3). This work shows a practical path to auditable automation: translate rules and facts into Prolog, compute with a trusted engine, and price the risk of being wrong directly into your product economics.
Design principle: If you can’t show your work, you can’t ship the answer.
The setup (in plain English)
- Task: Compute tax liabilities from statute snippets and case facts.
- Symbolic backbone: SWI‑Prolog executes logic programs that encode statutes and case facts.
- Models tested: Open‑weight families (Qwen2.5 32B, Llama 3.3 70B, DeepSeek‑V3/R1) and OpenAI models (GPT‑4.1, o3, GPT‑5).
- Methods compared:
  - Direct: LLM reads the statutes + case and outputs the number.
  - Parsed: LLM turns text → Prolog; Prolog computes the number (with timeouts/refusals).
  - Few‑shot with gold statutes: LLM sees correct statute logic + several solved precedent cases, then parses the new case into Prolog.
  - Self‑consistency: Accept only if two independent attempts agree (e.g., Direct + Parsed).
- Economic metric: Break‑even price — what you’d need to charge per user if your product pays for customers’ over/under‑payment penalties plus the cost of refusals (standing in for human prep time).
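The break‑even metric above reduces to a simple expected‑cost calculation. A minimal sketch, with illustrative placeholder rates and costs (not the paper's parameters):

```python
def break_even_price(error_rate: float, refusal_rate: float,
                     avg_penalty: float, refusal_cost: float) -> float:
    """Price per user at which expected liabilities are covered.

    error_rate:   share of answered cases with a wrong liability
    refusal_rate: share of cases the system abstains on
    avg_penalty:  average over/under-payment penalty the product reimburses
    refusal_cost: cost of handing an abstained case to a human preparer
    """
    answered = 1.0 - refusal_rate
    expected_liability = answered * error_rate * avg_penalty
    expected_handoff = refusal_rate * refusal_cost
    return expected_liability + expected_handoff

# Illustrative numbers only: a selective system that refuses 30% of cases
# but is rarely wrong when it answers can still undercut average filing costs.
price = break_even_price(error_rate=0.02, refusal_rate=0.30,
                         avg_penalty=1000.0, refusal_cost=150.0)
```

Note how refusals trade a large uncertain penalty for a small fixed hand‑off cost — this is why selectivity dominates the economics.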
The punchlines you can use tomorrow
- Make refusal a feature, not a bug. When Prolog execution fails or two methods disagree, defer. Selectivity cuts downside costs dramatically.
- Let the symbolic core carry the burden of proof. Force the model to produce a logic program; the Prolog engine guarantees the computation matches the trace.
- Chat models are sleeper champs for parsing. In few‑shot parsing with gold statutes + exemplars, chat‑tuned models often outperform reasoning‑tuned ones — they imitate formal patterns with fewer distractions.
- Exemplar retrieval is leverage. Ranking similar precedent cases and showing their gold parses materially improves success rates and lowers cost.
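The exemplar‑retrieval idea can be sketched as ranking precedents by structural overlap of their fact predicates. A toy version, assuming Jaccard similarity over predicate signatures (the paper's actual ranking method may differ):

```python
def jaccard(a: set, b: set) -> float:
    """Structural overlap between two sets of Prolog predicate signatures."""
    return len(a & b) / len(a | b) if a | b else 0.0

def top_k_precedents(case_facts: set, precedents: dict, k: int = 2) -> list:
    """Rank solved precedent cases by overlap with the new case's facts;
    the top-k cases' gold parses would be shown as few-shot exemplars."""
    scored = sorted(precedents.items(),
                    key=lambda kv: jaccard(case_facts, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical predicate signatures for a new case and three precedents:
new_case = {"wages/2", "filing_status/2", "dependents/2"}
precedents = {
    "case_a": {"wages/2", "filing_status/2"},
    "case_b": {"capital_gains/2"},
    "case_c": {"wages/2", "filing_status/2", "dependents/2", "itemized/2"},
}
best = top_k_precedents(new_case, precedents, k=2)  # structurally closest first
```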
Numbers that change product decisions
Below are distilled results showing the lowest break‑even price each family achieved under different setups. (Lower = better economics.)
| Model family / setup | No gold statutes (best) | With gold statutes + exemplars (best) |
|---|---|---|
| OpenAI (o3/GPT‑4.1/GPT‑5) | o3 Parsed ≈ $47 | GPT‑5 Few‑shot ≈ $16; GPT‑5 FS+FS ≈ $29 |
| DeepSeek (V3/R1) | R1 Direct+Direct ≈ $94 | V3 Few‑shot ≈ $469; R1 Direct+FS ≈ $178 |
| Llama 3.3 | R1‑70B Direct ≈ $1,257 | Llama‑70B Few‑shot ≈ $1,917 |
| Qwen 2.5 | R1‑32B Direct ≈ $505 | R1‑32B Few‑shot ≈ $7,783 |
Interpretation: Even without any gold encodings, a neuro‑symbolic pipeline (LLM→Prolog) with strict refusals can beat the U.S. average filing cost by a wide margin. With gold statute programs + precedents, best‑in‑class few‑shot parsers push expected costs down another order of magnitude.
Product heuristic: Ship Direct+Parsed with self‑consistency now; add gold encodings and precedent retrieval where volumes justify the upfront cost.
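The Direct+Parsed self‑consistency gate is small enough to sketch directly. The tolerance value below is an assumed threshold, not taken from the paper:

```python
from typing import Optional

TOLERANCE = 1.0  # dollars; an assumed agreement threshold

def adjudicate(direct: Optional[float], parsed: Optional[float]) -> Optional[float]:
    """Accept only when both independent methods return a number and agree
    within tolerance; otherwise abstain (None) and defer to a human."""
    if direct is None or parsed is None:    # timeout or non-execution
        return None
    if abs(direct - parsed) <= TOLERANCE:   # agreement -> ship the answer
        return parsed                       # prefer the Prolog-computed value
    return None                             # disagreement -> refuse

answer = adjudicate(4520.0, 4520.0)   # both agree: answer shipped
refused = adjudicate(4520.0, 4980.0)  # methods disagree: abstain
```

Returning the Prolog‑computed value on agreement keeps the shipped number tied to an auditable trace.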
Why “chat beats reasoning” (sometimes)
- Parsing ≠ deep math. Few‑shot parsing is pattern imitation under long context. Chat models lean toward faithful mimicry; reasoning models generate long chains that can push the key text “into the middle,” hurting focus.
- Latency & UX edge. Chat‑tuned models typically answer faster — useful for interactive intake flows (“Does this W‑2 belong here?”) while Prolog validates.
Takeaway: Use chat‑tuned models for text→logic translation and reasoning‑tuned models for raw calculation or zero‑shot statute selection. Let Prolog arbitrate.
A minimal architecture you can implement
- Intake & facts → normalize entities and amounts.
- Retriever → fetch top‑K precedent cases by structural similarity (when gold exists).
- Parser (chat‑tuned) → emit Prolog facts for this case.
- Rules → Prolog module of statutes (gold where available; else LLM‑parsed with guardrails).
- Executor → SWI‑Prolog runs the query.
- Self‑check → require agreement between two paths (e.g., Parsed + Direct) or abstain.
- Explainer → render the final number with the exact Prolog clauses used.
- Pricing → meter refusals and error liabilities into your per‑filing fee.
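The steps above compose into one orchestration function. A minimal sketch with injected callables standing in for the LLM parser, the Prolog executor, and the direct model; all names and interfaces here are illustrative, not the paper's:

```python
def file_tax_case(case_text, parser, executor, direct, tolerance=1.0):
    """Pipeline sketch: parse -> execute -> self-check -> explain or abstain."""
    program = parser(case_text)           # chat-tuned LLM: text -> Prolog facts
    if program is None:
        return {"status": "abstain", "reason": "parse_failure"}
    parsed_answer = executor(program)     # trusted engine computes the number
    direct_answer = direct(case_text)     # independent direct estimate
    if parsed_answer is None or direct_answer is None:
        return {"status": "abstain", "reason": "non_execution"}
    if abs(parsed_answer - direct_answer) > tolerance:
        return {"status": "abstain", "reason": "disagreement"}
    return {"status": "answered", "liability": parsed_answer,
            "trace": program}             # clause-level provenance for the explainer

# Stub components stand in for the real LLM and SWI-Prolog subprocess:
result = file_tax_case(
    "W-2 wages $50,000, single filer",
    parser=lambda t: "liability(4520).",
    executor=lambda p: 4520.0,
    direct=lambda t: 4520.0)
```

Keeping the components injectable lets you swap models or tighten refusal thresholds without touching the pipeline, and the returned `trace` feeds both the explainer UI and the audit log.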
Risk & governance checklist (shiproom‑ready)
- Refusal policy: Define timeouts, non‑execution, and disagreement thresholds.
- Trace retention: Store the exact Prolog program and query; make it user‑visible.
- Drift tests: Re‑run a fixed SARA‑like suite nightly; alert on deltas.
- Edge‑case routing: Long or compositional cases trigger “human in loop.”
- Statute refresh: Version rules; tie answers to rule snapshots.
- Cost ledger: Track per‑case economic outcome against your guarantee.
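The nightly drift test from the checklist can be a straight comparison against a stored baseline. A sketch with made‑up case names and values:

```python
def drift_report(baseline: dict, latest: dict) -> list:
    """Compare tonight's run of a fixed SARA-like suite against the stored
    baseline; any changed or missing result is flagged for alerting."""
    alerts = []
    for case, expected in baseline.items():
        got = latest.get(case)
        if got != expected:
            alerts.append((case, expected, got))
    return alerts

# Illustrative suite: one case drifted after a statute or model update.
baseline = {"tax_case_1": 4520.0, "tax_case_2": 0.0, "tax_case_3": 1210.0}
latest   = {"tax_case_1": 4520.0, "tax_case_2": 35.0, "tax_case_3": 1210.0}
alerts = drift_report(baseline, latest)
```

Pairing each alert with the rule snapshot version (per the statute‑refresh item) tells you whether the delta came from a model change or a rules change.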
Where else this pattern wins
- Benefits eligibility & claims (social programs, rebates) — heavy on rules, high audit needs.
- Insurance underwriting/pricing — tariff tables + conditions + provable calculation.
- Procurement & tax classification (HS codes, VAT) — precedence and exceptions abound.
If the domain is “rules + arithmetic + liability,” a neuro‑symbolic core is your default choice.
What we would build at Cognaptus
- TaxLogic Core: a Prolog module template + ingestion tooling.
- Case Parser: Chat‑tuned LLM prompt pack for text→facts with structure checks.
- Precedent Retriever: Instruction‑tuned reranker with domain descriptors.
- Adjudicator: Self‑consistency orchestrator; confidence gating + deferral.
- Explainer UI: Clause‑level provenance + downloadable rationale.
Business model: price per successful filing with a liability‑backed guarantee; add a low fee for abstentions that hand off to a human pro.
Final thought
The leap here isn’t a bigger model — it’s accepting that provability is a feature. By making the model speak Prolog and letting a symbolic engine compute, you align UX, compliance, and unit economics.
Cognaptus: Automate the Present, Incubate the Future