TL;DR
Tax law is full of brackets, caps, cliffs, phase-outs, and exceptions. Conveniently, those are also the places where software quietly breaks.
The paper behind this article introduces Synedrion, a multi-agent LLM framework for translating legal tax documents into executable software.[1] Its most useful idea is not “use agents” in the vague conference-demo sense. It is more specific: split legal interpretation, code generation, senior review, and behavioural testing into separate roles, then use higher-order metamorphic testing to catch systematic errors that normal test cases and pairwise comparisons can miss.
On the hardest benchmark, involving 1099-R retirement distributions and penalties, GPT-4o-mini with chain-of-thought prompting reached 9% Partial Pass@10 and 0% worst@10. The same smaller model inside Synedrion with higher-order metamorphic testing reached 75% Partial Pass@10 and 69% worst@10. That is not a magic wand. It is an expensive test-and-repair loop. But for regulated software, expensive diagnosis is often cheaper than confident wrongness wearing a product badge.
The familiar failure: the software gets the direction right and the law wrong
A tax calculator can be wrong without looking obviously wrong.
If income rises and tax owed rises, a basic test may pass. If education expenses increase and a credit increases, another test may pass. If a deduction makes liability fall, lovely, the dashboard glows green. Everyone goes home to enjoy a small illusion of compliance.
The problem is that legal rules rarely ask only for direction. They ask for shape.
A progressive tax bracket is not merely monotonic. It has marginal rates. A credit is not merely increasing. It may rise quickly, then slowly, then stop. A penalty is not merely conditional. It may disappear under one of several exception codes. Tax law is not a smooth motivational poster; it is a collection of thresholds pretending to be a policy system.
That is the paper’s central insight. In legal-critical software, the hardest bugs often live in the relationship between cases, not in a single case. A test that checks whether “more input gives more output” can miss the bug where the software applies a flat rate across all income. Directionally correct. Legally wrong. Beautifully monotonic. Still broken.
Synedrion is built around that distinction.
Synedrion is not one clever prompt; it is a small software team made of agents
The framework uses LLM agents to mimic a development workflow for legal-critical software. The division of labour matters because the task itself is not one task.
A legal document must first be interpreted. Its rules must then be represented in a structured form. Functions must be generated from that representation. Those functions must be reviewed. Then the resulting behaviour must be tested against legal expectations. Lumping all of that into one prompt is less “automation” than “asking one intern to be counsel, engineer, QA, and scapegoat.”
Synedrion separates the work:
| Agent role | What it does | Operational meaning |
|---|---|---|
| TaxExpertAgent | Converts tax documents into structured JSON policy specifications and function descriptions | Turns messy legal text into a typed intermediate representation |
| CoderAgents | Generate Python functions from the JSON specifications | Keeps code generation grounded in explicit rules |
| SeniorCoderAgent | Reviews candidate implementations and coordinates revisions | Adds a software-review loop rather than accepting the first fluent answer |
| MetamorphicAgent | Generates behavioural tests and counterexamples | Tests whether the program obeys legal relationships across related cases |
This is the first important business translation: the policy spec becomes a first-class asset.
Instead of burying rates, caps, and thresholds inside generated code, Synedrion pushes them into structured JSON. That makes the legal rule layer inspectable, versionable, and easier to update. For tax software, that matters because the law changes. For benefits, lending, insurance, healthcare reimbursement, and compliance workflows, it matters for the same dull but expensive reason: yesterday’s correct constant is tomorrow’s incident report.
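A minimal sketch of what "policy as data" can look like, using the American Opportunity Tax Credit figures quoted later in this article. The JSON field names (`rule_id`, `segments`, `up_to`, `rate`, `max_credit`) and the function `education_credit` are hypothetical illustrations, not the schema Synedrion actually emits, and income phase-outs are deliberately left out.

```python
import json

# Hypothetical policy spec: field names and layout are illustrative only,
# not the paper's actual JSON schema.
POLICY_JSON = """
{
  "rule_id": "aotc",
  "tax_year": 2021,
  "segments": [
    {"up_to": 2000, "rate": 1.00},
    {"up_to": 4000, "rate": 0.25}
  ],
  "max_credit": 2500
}
"""

def education_credit(expenses: float, policy: dict) -> float:
    """Compute a tiered, capped credit using only values read from the policy."""
    credit = 0.0
    lower = 0.0
    for seg in policy["segments"]:
        # Portion of the expenses that falls inside this segment.
        portion = max(0.0, min(expenses, seg["up_to"]) - lower)
        credit += portion * seg["rate"]
        lower = seg["up_to"]
    return min(credit, policy["max_credit"])

policy = json.loads(POLICY_JSON)
print(education_credit(3000, policy))  # 2000 * 1.00 + 1000 * 0.25 = 2250.0
```

The function contains no rates or caps of its own. Under this arrangement, a rule change is a reviewed edit to the JSON, not a hunt through generated code.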
The agents are not interesting because they have names. They are interesting because they create checkpoints. The TaxExpertAgent produces structured specifications. The coders implement. The senior coder reviews. The testing agent attacks the behavioural surface. It is less “AI genius” and more “finally, someone gave the model a job description.”
The real mechanism: testing relationships, not answers
The paper’s mechanism-first contribution is higher-order metamorphic testing.
Metamorphic testing is used when the exact answer is hard to know, but the relationship between answers is knowable. In tax software, the exact liability for a complicated taxpayer profile may require legal interpretation. But we may know that two otherwise identical taxpayers should behave consistently when one crosses a known threshold, receives a capped credit, or qualifies for an additional deduction.
Traditional metamorphic testing often compares pairs of cases. For example:
If income increases, tax should not decrease.
That is useful. It is also insufficient.
A broken program can pass that test while still applying the wrong marginal rate. Imagine a flat 12% tax implementation. Higher income still produces higher tax. Pairwise monotonicity smiles approvingly. The statute does not.
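The flat-rate failure can be sketched in a few lines. The two-bracket schedule below (10% up to 10,000, then 22%) is a made-up illustration rather than a real tax table, and `pairwise_monotonic` is a hypothetical helper, not the paper's test harness.

```python
def progressive_tax(income: float) -> float:
    """Hypothetical two-bracket schedule: 10% up to 10,000, then 22% above."""
    lower = min(income, 10_000)
    upper = max(income - 10_000, 0)
    return 0.10 * lower + 0.22 * upper

def flat_tax(income: float) -> float:
    """Buggy implementation: a flat 12% on all income."""
    return 0.12 * income

def pairwise_monotonic(fn, incomes) -> bool:
    """Pairwise relation: if income increases, tax should not decrease."""
    ordered = sorted(incomes)
    return all(fn(a) <= fn(b) for a, b in zip(ordered, ordered[1:]))

incomes = [0, 5_000, 10_000, 20_000, 50_000]
print(pairwise_monotonic(progressive_tax, incomes))  # True
print(pairwise_monotonic(flat_tax, incomes))         # True
```

Both implementations pass, even though they disagree by thousands at the top of the range. The pairwise relation simply has no opinion about rates.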
Higher-order metamorphic testing compares multiple related cases and examines the rate of change across them. The paper focuses on three behavioural categories:
| Higher-order relation | What it checks | Example legal pattern |
|---|---|---|
| Proportional increase | Whether output changes at the expected rate across increments | A benefit or liability should scale consistently within a rule segment |
| Threshold jump | Whether behaviour changes when an input crosses a legal boundary | Marginal tax rate changes at a bracket threshold |
| Saturation | Whether output stops changing after a cap is reached | A credit reaches its maximum and should no longer increase |
The American Opportunity Tax Credit gives a clean example. It covers 100% of the first $2,000 of qualified education expenses, 25% of the next $2,000, and then caps at $2,500. A pairwise test can verify that more qualifying expense does not reduce the credit. But the legally meaningful test is about slope: the first segment should rise faster than the second, and after the cap the line should flatten.
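The slope idea can be sketched directly from the numbers above. This is an illustration under stated assumptions, not the paper's MetamorphicAgent: `aotc`, `buggy_aotc`, `slope`, and `has_aotc_shape` are hypothetical names, the probe points are arbitrary, and income-based phase-outs are ignored.

```python
def aotc(expenses: float) -> float:
    """Shape from the text: 100% of first $2,000, 25% of next $2,000, cap $2,500."""
    first = min(expenses, 2000)
    second = min(max(expenses - 2000, 0), 2000)
    return min(first + 0.25 * second, 2500)

def buggy_aotc(expenses: float) -> float:
    """Monotonic but wrong: one blended rate all the way up to the cap."""
    return min(0.625 * expenses, 2500)

def slope(fn, x: float, step: float = 100.0) -> float:
    """Finite-difference rate of change of fn over [x, x + step]."""
    return (fn(x + step) - fn(x)) / step

def has_aotc_shape(fn, tol: float = 1e-9) -> bool:
    """Higher-order relation: steep segment, shallow segment, then flat."""
    s1 = slope(fn, 500)    # inside the first $2,000
    s2 = slope(fn, 2500)   # inside the next $2,000
    s3 = slope(fn, 5000)   # past the cap
    return abs(s1 - 1.0) < tol and abs(s2 - 0.25) < tol and abs(s3) < tol

print(has_aotc_shape(aotc))        # True
print(has_aotc_shape(buggy_aotc))  # False
```

The buggy version never decreases and still reaches $2,500 at $4,000 of expenses, so a directional test would wave it through. Only the comparison of slopes across three regions exposes it.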
That is why this paper should not be read as “agents write tax code.” The more precise reading is: **agents generate code, then another agent generates structured counterexamples that expose whether the code has learned the law’s geometry**.
Very glamorous. Also, actually useful.
Read the experiments as a layered argument, not a leaderboard
The paper’s experiments use six IRS-derived benchmarks, progressing from simpler bracket and standard deduction calculations to more complex rules around EITC, child credits, education credits, itemized deductions, and finally 1099-R retirement distributions and penalties.
The evaluation uses a hand-authored Tax Year 2021 reference implementation and symbolic execution to generate test cases. The authors report **Partial Pass@10**, **Partial Pass@1**, and **worst@10**. The important one for business readers is worst@10, because production risk is rarely governed by the best-looking run in a lab notebook.
The experiments should be interpreted by purpose:
| Evidence component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Zero-shot and chain-of-thought baselines | Main evidence against direct prompting | Legal tax code generation degrades sharply as rule complexity increases | That all prompt-only methods are hopeless |
| Agentic framework without metamorphic testing | Main evidence for role decomposition | Structured agents improve correctness substantially, including for smaller models | That role splitting alone gives dependable compliance |
| MT versus HMT comparison | Mechanism evidence / ablation | Higher-order behavioural tests improve robustness beyond pairwise directional tests | That the three HMT categories cover all legal behaviours |
| Agent ablation table | Ablation | TaxExpertAgent and MetamorphicAgent are major contributors | That the exact agent design is optimal |
| Token usage figure | Implementation cost evidence | Verification sharply increases token consumption | That the approach is economically justified in every domain |
That last column matters. A paper can demonstrate a promising engineering pattern without proving a procurement case. We do not need to pretend otherwise. This is Cognaptus, not a vendor webinar with better lighting.
The scoreboard: small models improve because the system around them improves
The baseline results are unsurprising in the broad direction and still useful in the details.
With zero-shot prompting, strong models do well on simpler scenarios but struggle on the hardest one. Claude 3.5 reaches **29% PP@10** and **14% worst@10** on Scenario 6. GPT-4o reaches **18% PP@10** and **5% worst@10**. GPT-4o-mini reaches only **2% PP@10** and **0% worst@10**.
Chain-of-thought helps some larger models, especially on consistency, but it does not rescue smaller models under complex legal logic. On Scenario 6, GPT-4o-mini improves to **9% PP@10**, while worst@10 remains **0%**. Claude 3.5 reaches **31% PP@10** and **15% worst@10**.
Then the agentic design changes the shape of the results.
With Coder + SeniorCoder + TaxExpert agents, GPT-4o-mini reaches **62% PP@10** and **45% worst@10** on Scenario 6. That is already a large jump before adding metamorphic testing. The implication is clear: much of the failure is not raw model intelligence alone. It is task decomposition, structured intermediate representation, and review.
Adding regular 4-ary metamorphic testing improves GPT-4o-mini on Scenario 6 to **68% PP@10** and **55% worst@10**. Adding higher-order metamorphic testing improves it further to **75% PP@10** and **69% worst@10**.
For stronger models, the same pattern holds. GPT-4o with HMT reaches **93% PP@10** and **88% worst@10** on Scenario 6. Claude 3.5 with HMT reaches **95% PP@10** and **93% worst@10**.
The tempting headline is that smaller models can outperform frontier baselines. True, but incomplete. The better headline is: **a weaker model inside a disciplined validation loop can beat a stronger model asked to freehand the job**.
That is a very different lesson. One is about model shopping. The other is about system design.
The TaxExpertAgent is doing more than “summarising the law”
The ablation results show that the TaxExpertAgent is not decorative domain flavour. It is doing serious engineering work.
In the GPT-4o-mini baseline, Scenario 3 has **12% PP@1** under coder-only generation. With Coder + SeniorCoder + TaxExpert, the same scenario reaches **97% PP@1**. On the harder Scenario 6, GPT-4o-mini moves from **4% PP@1** in coder-only mode to **78% PP@1** with the agentic structure before metamorphic testing.
The mechanism is not mysterious. The TaxExpertAgent converts legal text into structured JSON and function-level descriptions. That means the coders do not need to simultaneously infer the statute, decide the data structure, remember the edge cases, and implement the function.
For business teams, this suggests a useful architecture principle: **put the strongest model, or the most human review, at the policy extraction layer**.
The paper includes a heterogeneous-agent result that supports this direction without fully resolving it. Using GPT-4o for the TaxExpertAgent while keeping the CoderAgent and SeniorCoderAgent on GPT-4o-mini produced **73% PP@10** on Scenario 6, compared with **62%** for all GPT-4o-mini agents and **83%** for all GPT-4o agents. The authors do not exhaustively test all combinations, so the result is directional. Still, it matches operational intuition: spend intelligence where legal interpretation is hardest, not equally across every step because the budget spreadsheet enjoys symmetry.
The MetamorphicAgent turns QA into counterexample generation
The MetamorphicAgent is the paper’s sharpest operational idea.
Instead of merely scoring generated code after the fact, it generates test cases around legally meaningful behavioural structures. When it finds a violation, it produces counterexamples: the inputs, the observed outputs, and the reason the behaviour does not match the expected relation. The SeniorCoderAgent then uses those counterexamples to repair the code.
This creates a loop:
- Legal text becomes structured policy.
- Policy becomes code.
- Code is tested against relational expectations.
- Violations become concrete repair instructions.
- The revised code is tested again.
That loop is more important than the individual LLM calls. It changes the system from “generate once and hope” to “generate, attack, repair.” For regulated domains, that is the difference between a demo and an engineering process.
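The loop above can be sketched without any LLM in it. Everything here is a stand-in under stated assumptions: `Counterexample`, `check_saturation`, and `repair_loop` are hypothetical names, the cap of 1,000 is invented, and a fixed list of candidate functions replaces the step where a reviewer agent would read the counterexample and revise the code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Counterexample:
    inputs: tuple    # the related inputs that were compared
    outputs: tuple   # what the candidate returned for them
    reason: str      # which relation was violated

def check_saturation(fn: Callable[[float], float], cap_at: float) -> Optional[Counterexample]:
    """Relation: once the cap is reached, more input must not change the output."""
    a, b = cap_at + 100, cap_at + 1000
    if fn(a) != fn(b):
        return Counterexample((a, b), (fn(a), fn(b)), "output keeps changing after the cap")
    return None

def repair_loop(candidates: List[Callable[[float], float]],
                cap_at: float) -> Tuple[Optional[Callable], List[Counterexample]]:
    """Test each candidate in turn; every failure is logged as a counterexample.

    The list of candidates stands in for successive revisions that a real
    system would produce in response to the logged counterexamples.
    """
    log: List[Counterexample] = []
    for fn in candidates:
        cx = check_saturation(fn, cap_at)
        if cx is None:
            return fn, log
        log.append(cx)
    return None, log

def uncapped(x: float) -> float:
    return 0.25 * x                  # forgets the cap entirely

def capped(x: float) -> float:
    return min(0.25 * x, 1000.0)     # hypothetical cap of 1,000, reached at 4,000

accepted, log = repair_loop([uncapped, capped], cap_at=4000)
print(accepted.__name__, len(log))   # capped 1
print(log[0].reason)
```

The useful artefact is the log, not the verdict. Each entry names the inputs, the outputs, and the broken relation, which is the kind of record a reviewer can act on.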
It also changes what “explainability” can mean. Instead of explaining only why a model produced a line of code, the system can explain that a family of cases violates a threshold-jump relation or fails to saturate after a cap. That is closer to how auditors, QA teams, and policy specialists think.
A compliance officer may not care whether an LLM “reasoned deeply.” Quite right. But they may care that cases just below, at, and above a statutory threshold produce the expected change in marginal behaviour. One is a vibe. The other is evidence.
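A below/at/above probe of that kind might look like the sketch here. The bracket boundary (10,000), the rates (10% and 22%), and the helper names are all hypothetical; the point is only the form of the evidence.

```python
def two_bracket_tax(income: float) -> float:
    """Hypothetical schedule: 10% up to 10,000, then 22% above."""
    return 0.10 * min(income, 10_000) + 0.22 * max(income - 10_000, 0)

def misplaced_threshold_tax(income: float) -> float:
    """Same rates, but the boundary is wrongly placed at 12,000."""
    return 0.10 * min(income, 12_000) + 0.22 * max(income - 12_000, 0)

def marginal_rate(fn, x: float, step: float = 1.0) -> float:
    """Extra tax owed on the next `step` of income starting at x."""
    return (fn(x + step) - fn(x)) / step

def threshold_jump_ok(fn, threshold: float, rate_below: float,
                      rate_above: float, tol: float = 1e-6) -> bool:
    """Compare marginal rates just below, at, and just above the boundary."""
    below = marginal_rate(fn, threshold - 1.0)  # interval ending at the threshold
    at = marginal_rate(fn, threshold)           # interval starting at the threshold
    above = marginal_rate(fn, threshold + 1.0)
    return (abs(below - rate_below) < tol
            and abs(at - rate_above) < tol
            and abs(above - rate_above) < tol)

print(threshold_jump_ok(two_bracket_tax, 10_000, 0.10, 0.22))          # True
print(threshold_jump_ok(misplaced_threshold_tax, 10_000, 0.10, 0.22))  # False
```

The misplaced-threshold version is monotonic and uses the right rates, so it looks healthy almost everywhere. It fails only in the narrow band the probe is designed to examine.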
The business pattern: policy-as-data, code-as-implementation, tests-as-behavioural contracts
The practical value of the paper is not limited to tax preparation. The mechanism generalises wherever rules are legalistic, threshold-heavy, and difficult to exhaustively oracle-test.
A business-ready adaptation would look like this:
| Layer | What to build | Why it matters |
|---|---|---|
| Policy ingestion | Convert statutes, guidance, contracts, forms, or internal rules into typed policy data | Separates legal content from implementation logic |
| Rule versioning | Tie every policy spec to a tax year, policy date, jurisdiction, or contract version | Makes updates auditable and regression testing narrower |
| Code generation | Generate functions that read from policy data rather than hardcoding thresholds | Reduces silent drift when rules change |
| Behavioural test families | Generate cases around thresholds, phase-outs, caps, exception codes, and eligibility cliffs | Tests the legal shape of the software, not only isolated examples |
| Counterexample repair | Feed failing relational cases back into the implementation loop | Turns QA output into structured developer input |
| Human review | Review policy specs, high-risk relations, and unresolved counterexamples | Keeps accountability where it belongs: with the organisation |
The inference for business use is straightforward but bounded. Cognaptus would not describe this as a way to remove legal experts or QA teams. That would be adorable, in the way a forklift manual is adorable when used as a parachute.
The better interpretation is that agentic systems can **amplify expert review** by generating candidate implementations and surfacing relationship-level failures. The human expert no longer has to invent every test case manually. They can review the policy representation, approve the metamorphic relations, inspect counterexamples, and decide whether the generated repair is legally correct.
That is a more credible path to productivity than “the LLM read the statute, so we’re compliant now.”
Where this applies beyond tax
The paper briefly points toward poverty-management systems and other legal-critical software. That extension is reasonable, provided the domain has enough structured behavioural expectations to test.
Good candidate domains include:
- **Benefits eligibility and means testing**, where thresholds, household composition, income bands, and exception rules determine outcomes.
- **Insurance and lending rules**, where pricing tiers, caps, grace periods, exclusions, and eligibility cut-offs must behave consistently.
- **Healthcare claims adjudication**, where coverage limits, deductibles, authorisation rules, and benefit caps create saturation and threshold patterns.
- **Payroll and employment compliance**, where overtime, tax withholding, leave entitlements, and contribution ceilings must obey jurisdiction-specific rules.
- **Financial compliance workflows**, where reporting obligations and risk classifications change at specific transaction or customer thresholds.
The common structure is not “law.” It is **rule-bound computation under incomplete oracle knowledge**.
If every input has a simple, known correct answer, ordinary unit tests may be enough. If the correct answer is hard but relationships between answers are clear, metamorphic testing becomes attractive. If those relationships involve slopes, cliffs, caps, and discontinuities, higher-order metamorphic testing becomes more attractive.
Legal systems, being run by humans and then encoded by other humans, contain plenty of cliffs. Naturally.
The cost is real: verification eats tokens
The paper is admirably clear that the stronger testing loop is not free.
In Scenario 1, basic agentic code generation uses **18,457 tokens**, while adding metamorphic testing increases this to **77,438 tokens**. In Scenario 6, basic code generation uses **111,081 tokens**, while higher-order metamorphic testing raises the total to **450,134 tokens**.
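As a quick piece of arithmetic on the token counts quoted above, the overhead works out to roughly four times the generation-only cost in both scenarios. The dictionary layout and the `overhead` helper are just a convenient way to show the division; note that the two rows are not like-for-like, since the text describes the Scenario 1 figure as metamorphic testing and the Scenario 6 figure as higher-order metamorphic testing.

```python
# Token counts as quoted in the article text.
tokens = {
    "Scenario 1": {"generation_only": 18_457, "with_testing": 77_438},
    "Scenario 6": {"generation_only": 111_081, "with_testing": 450_134},
}

def overhead(generation_only: int, with_testing: int) -> float:
    """How many times more tokens the tested pipeline uses."""
    return with_testing / generation_only

for name, counts in tokens.items():
    print(f"{name}: {overhead(**counts):.2f}x")
# Scenario 1: 4.20x
# Scenario 6: 4.05x
```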
This matters for implementation planning. The budget spike is not mainly “the model wrote code.” It is the cost of generating, describing, running, and repairing against richer behavioural tests. In other words, the expensive part is the part that makes the system less foolish.
For production teams, the answer is not to avoid the cost. It is to allocate it intelligently.
Run higher-order metamorphic testing around high-risk rules: thresholds, caps, exception codes, eligibility cliffs, and historically buggy areas. Cache policy slices. Version test families by rule. Re-run only the affected suites when the law changes. Use smaller models where structured specs are already reliable, and reserve stronger models or human review for policy extraction and ambiguous legal interpretation.
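One way to re-run only the affected suites is to fingerprint each rule's policy slice and compare fingerprints between versions. This is a sketch of that idea, not something the paper describes: the rule identifiers, suite names, and numbers below are invented for illustration.

```python
import hashlib
import json

def fingerprint(policy_slice: dict) -> str:
    """Stable hash of one rule's policy data (keys sorted so ordering is irrelevant)."""
    canonical = json.dumps(policy_slice, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def suites_to_rerun(old_policy: dict, new_policy: dict, suites_by_rule: dict) -> list:
    """Return the test suites whose rule changed, was added, or was removed."""
    affected = []
    for rule_id in sorted(set(old_policy) | set(new_policy)):
        old = old_policy.get(rule_id)
        new = new_policy.get(rule_id)
        if old is None or new is None or fingerprint(old) != fingerprint(new):
            affected.extend(suites_by_rule.get(rule_id, []))
    return affected

# Illustrative data only: rule ids, values, and suite names are hypothetical.
old_policy = {
    "deduction": {"single": 12_000},
    "education_credit": {"max_credit": 2500},
}
new_policy = {
    "deduction": {"single": 12_400},           # this rule changed
    "education_credit": {"max_credit": 2500},  # this one did not
}
suites_by_rule = {
    "deduction": ["deduction_threshold", "deduction_saturation"],
    "education_credit": ["credit_slope"],
}

print(suites_to_rerun(old_policy, new_policy, suites_by_rule))
# ['deduction_threshold', 'deduction_saturation']
```

The design choice is to key test families by rule rather than by file or function, so the expensive relational suites follow the law that changed and nothing else.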
Do not spend frontier-model money on every token just because the architecture diagram looks lonely without a large invoice.
The boundaries: this improves robustness, not legal truth itself
The paper’s limitations matter because they directly affect deployment.
First, the evaluation depends on a hand-authored Tax Year 2021 reference implementation. The authors stress-test boundaries and cross-reference overlapping cases with open-source tools, but they do not claim formal correctness. If the reference implementation has a bug, the evaluation inherits that weakness.
Second, higher-order metamorphic testing covers three categories: proportional increase, threshold jump, and saturation. Those are important, but they are not the full universe of legal behaviour. Interactions between credits, sequencing rules, multi-form dependencies, and jurisdiction-specific exceptions may require additional relation families.
Third, the experiments use six U.S. federal tax scenarios. They are meaningful and increasingly complex, but they are not proof that the method automatically transfers to every legal domain. Benefits eligibility, healthcare reimbursement, and lending compliance may have different data dependencies, ambiguity patterns, and institutional constraints.
Fourth, the paper reports the outcome of a single run of the full framework. That does not invalidate the result, but it does remind us to be careful about over-reading stability. In business language: promising prototype, not yet a universal benchmark standard.
Finally, none of this removes the need for governance. The system can generate policy specs, code, tests, and counterexamples. It cannot decide organisational risk appetite, legal accountability, or whether a policy interpretation is acceptable under the relevant authority. Those burdens remain stubbornly human. How inconsiderate of reality.
What leaders should take from the paper
The most useful lesson is not that GPT-4o-mini can beat a frontier baseline under the right setup, although that will be the snackable headline.
The lesson is that **regulated AI engineering should be built around verifiable behavioural contracts**.
For legal-critical software, correctness is not just a number on a case. It is a pattern across cases. The system must behave properly below a threshold, at the threshold, above it, during phase-out, after saturation, and under exceptions. If an LLM-generated system cannot be tested across those relationships, then its fluent legal interpretation is not an asset. It is a liability with nicer syntax.
Synedrion shows a credible architecture for turning this problem into a workflow:
- Convert law into structured policy data.
- Generate code from that data.
- Test the behavioural shape of the generated functions.
- Use counterexamples to repair the implementation.
- Measure worst-case reliability, not just best-run success.
That is a serious contribution. Not because it makes agents look clever, but because it gives them a useful constraint: obey the law’s geometry.
Closing take: the bracket is the bug detector
The quiet genius of this paper is that it treats legal weirdness as a testing resource.
Brackets, caps, phase-outs, and exception codes are usually where developers get hurt. Synedrion turns them into probes. The ledges in the law become ledges in the test suite. The same structures that make tax software annoying also make it testable—if the system knows to look for rates of change rather than isolated answers.
That is the business message. Agentic LLMs become more useful when they stop pretending to be omniscient and start behaving like a disciplined engineering team: one agent structures the rule, another writes the code, another reviews it, and another tries to break it with legally meaningful counterexamples.
A little less magic. A lot more machinery. Exactly the right direction.
**Cognaptus: Automate the Present, Incubate the Future.**
[1] Sina Gogani-Khiabani, Ashutosh Trivedi, Diptikalyan Saha, and Saeid Tizpaz-Niari, “An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software,” arXiv:2509.13471, https://arxiv.org/abs/2509.13471.