TL;DR

Agentic LLMs can translate legal rules into working software and audit themselves using higher‑order metamorphic tests. This combo improves worst‑case reliability (not just best‑case demos), making it a practical pattern for tax prep, benefits eligibility, and other compliance‑bound systems.


The Business Problem

Legal‑critical software (tax prep, benefits screening, healthcare claims) fails in precisely the ways that cause the most reputational and regulatory damage: subtle misinterpretations around thresholds, phase‑ins/outs, caps, and exception codes. Traditional testing stumbles here because you rarely know the “correct” output for every real‑world case (the oracle problem). What you do know: similar cases should behave consistently.

Implication for leaders: Your risk isn’t a flashy hallucination—it’s a quiet miscalculation that passes unit tests and ships to thousands of customers.


The Core Idea: Agents + Higher‑Order Metamorphic Testing

The paper’s framework, Synedrion, organizes LLMs into roles:

  • TaxExpertAgent turns statutes/publications into a validated JSON policy spec (brackets, caps, add‑ons, eligibility, exception codes). No hard‑coding.
  • CoderAgents (+ Senior Coder) generate Python functions that read from the JSON spec.
  • MetamorphicAgent doesn’t ask “is this case’s number correct?” It asks whether relationships across cases hold (e.g., monotonicity, threshold jumps, saturation).
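To make the first two roles concrete, here is a minimal sketch of the pattern, assuming an illustrative spec shape (the field names and the compute_income_tax helper are not the paper's actual schema): rates and thresholds live only in the spec, never in the generated code.

```python
# Illustrative slice of a validated policy spec (normally loaded from JSON).
# Field names are assumptions for this sketch, not the paper's schema.
POLICY_SPEC = {
    "tax_year": 2023,
    "brackets": [  # lower bound + marginal rate, ascending
        {"threshold": 0,      "rate": 0.10},
        {"threshold": 11_000, "rate": 0.12},
        {"threshold": 44_725, "rate": 0.22},
    ],
}

def compute_income_tax(taxable_income: float, spec: dict = POLICY_SPEC) -> float:
    """Progressive tax computed purely from the spec -- no hard-coded constants."""
    tax = 0.0
    brackets = spec["brackets"]
    for i, bracket in enumerate(brackets):
        lower = bracket["threshold"]
        upper = brackets[i + 1]["threshold"] if i + 1 < len(brackets) else float("inf")
        if taxable_income > lower:
            tax += (min(taxable_income, upper) - lower) * bracket["rate"]
    return round(tax, 2)
```

Under this discipline, shipping a new tax year means shipping a new JSON file, not a code change.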

Why “higher‑order” matters

Pairwise checks (A vs B) catch directionality (“higher income ⇒ higher tax”). But many bugs are systematic and still pass pairwise checks (e.g., a flat‑rate tax that’s monotonic but not progressive). Higher‑order relations compare rates of change across multiple points to expose:

  • Threshold jumps (marginal rate changes right after a bracket boundary)
  • Proportionality (slope stays within a bracket’s rate)
  • Saturation (credits stop increasing after the cap)

Concrete example: The American Opportunity Tax Credit pays 100% of the first $2k in expenses and 25% of the next $2k (max $2.5k). A higher‑order test checks that the slope drops from 100% to 25%, then flattens after $4k—not just that “more expense ⇒ more credit.”
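A sketch of what such a check could look like in code (the aotc_credit stand-in and the probe points are assumptions for this example); note that the assertions constrain slopes across three expense regions rather than any single "correct" output.

```python
def aotc_credit(expenses: float) -> float:
    # Stand-in implementation under test (illustrative only):
    # 100% of the first $2,000, 25% of the next $2,000, capped at $2,500.
    return min(expenses, 2_000) + 0.25 * max(0.0, min(expenses, 4_000) - 2_000)

def slope(f, x1: float, x2: float) -> float:
    """Rate of change of the credit between two expense levels."""
    return (f(x2) - f(x1)) / (x2 - x1)

def test_aotc_higher_order_relations():
    # Within the first $2,000: credit grows dollar-for-dollar (slope ~ 1.0).
    assert abs(slope(aotc_credit, 500, 1_500) - 1.00) < 1e-6
    # Between $2,000 and $4,000: slope drops to ~0.25.
    assert abs(slope(aotc_credit, 2_500, 3_500) - 0.25) < 1e-6
    # Past the $4,000 cap: credit saturates at $2,500 (slope ~ 0).
    assert abs(slope(aotc_credit, 5_000, 8_000) - 0.00) < 1e-6

test_aotc_higher_order_relations()
```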


What’s New vs. “Prompt the Model to Code”

Most “LLM‑codes‑from‑laws” demos optimize for best‑run accuracy and look great in a blog. In production, leaders need worst‑case guarantees. The agentic + HMT approach shifts the metric:

Results on the most complex scenario (retirement distributions), pass rates across 10 runs:

| Approach | Model | Worst‑case | Average | Best practical takeaway |
| --- | --- | --- | --- | --- |
| Baseline CoT prompting | Small model | ~0% | ~9% | Unreliable under complexity |
| Agentic (no MT) | Small model | ~45% | ~62% | Structure + roles already help |
| Agentic + Metamorphic (4‑ary) | Small model | ~55% | ~68% | Directional relations add robustness |
| Agentic + Higher‑Order MT | Small model | ~69% | ~75% | Rates/threshold checks catch silent logic bugs |
| Agentic + Higher‑Order MT | Frontier model | ~88–93% | ~93–95% | Near‑production reliability on hardest case |

Why you care: Worst‑case uplift is the difference between “wow demo” and “deploy with confidence.”


The Operating Model You Can Borrow

Use this as an internal blueprint for any rule‑bound domain:

  1. Source → Policy JSON (single source of truth)
  • Parse statutes, guidance, forms, and edge cases into a typed JSON schema: brackets, thresholds, phase‑in/out, caps, add‑ons, exceptions, definitions.
  • Enforce schema validation (units, ranges, enumerations). Reject non‑conforming extractions.
  2. Code from Spec (no constants in code)
  • Functions read policy JSON; they never embed rates or thresholds.
  • Senior review loop selects between two independently generated implementations.
  3. Higher‑Order Metamorphic Test Harness
  • Curate families of cases around knees and ledges (just below/at/above thresholds; within phase‑out ramps; past caps; with/without exception codes).
  • Assert slope properties (within‑bracket proportionality), jump conditions (marginal rate increases at boundary), and flatlines (post‑cap saturation).
  • On violation, emit a counterexample (inputs + offending outputs) that feeds the repair loop.
  4. Counterexample‑Guided Repair Loop
  • Treat the testing layer as an auditor: fail fast, patch the JSON spec or the function, re‑run only the affected families.
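To illustrate steps 3 and 4, here is a minimal harness sketch reusing the illustrative POLICY_SPEC and compute_income_tax from earlier (the relation names and counterexample format are assumptions): it asserts within‑bracket proportionality and boundary jumps, and returns a counterexample dict the repair loop can consume.

```python
from typing import Callable, Optional

def check_bracket_relations(tax_fn: Callable[[float], float],
                            spec: dict) -> Optional[dict]:
    """Return None if all higher-order relations hold, else a counterexample."""
    brackets = spec["brackets"]
    for i, b in enumerate(brackets):
        lo = b["threshold"]
        hi = brackets[i + 1]["threshold"] if i + 1 < len(brackets) else lo + 10_000
        # Proportionality: within a bracket, the observed slope equals the spec rate.
        x1, x2 = lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)
        observed = (tax_fn(x2) - tax_fn(x1)) / (x2 - x1)
        if abs(observed - b["rate"]) > 1e-6:
            return {"relation": "proportionality", "bracket": i,
                    "inputs": (x1, x2), "expected_rate": b["rate"],
                    "observed_rate": observed}
        # Threshold jump: the marginal rate must increase just past the boundary.
        if i + 1 < len(brackets):
            eps = 1.0
            below = tax_fn(hi) - tax_fn(hi - eps)
            above = tax_fn(hi + eps) - tax_fn(hi)
            if above <= below:
                return {"relation": "threshold_jump", "boundary": hi,
                        "inputs": (hi - eps, hi, hi + eps),
                        "marginal_below": below, "marginal_above": above}
    return None

# Repair loop hook: a non-None result is handed back to the agents as a failing case.
counterexample = check_bracket_relations(compute_income_tax, POLICY_SPEC)
if counterexample is not None:
    print("violation:", counterexample)
```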

Implementation Notes (that save money and time)

  • Heterogeneous agents pay off: Put your strongest model on the TaxExpertAgent (policy extraction) and run the CoderAgents on lighter models. This boosts robustness without paying frontier‑model costs everywhere.
  • Token budgets spike at verification, not coding: Expect verification prompts/cases to dominate spend—especially at complex boundaries. Plan budget and caching accordingly.
  • Spec drift is inevitable: Lock a tax‑year/version into the JSON and version functions against it. Diff the JSON between releases; re‑run only affected metamorphic suites.
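One way to scope those re‑runs, assuming the spec is versioned as JSON and each metamorphic family is tagged with the spec paths it depends on (both assumptions for this sketch):

```python
import json

# Hypothetical mapping from spec paths to the metamorphic families they affect.
FAMILIES_BY_PATH = {
    "brackets": ["proportionality", "threshold_jump"],
    "credits.aotc.cap": ["saturation"],
}

def changed_paths(old_spec: dict, new_spec: dict, prefix: str = "") -> set[str]:
    """Diff two spec versions; return dotted paths whose values changed."""
    changed = set()
    for key in old_spec.keys() | new_spec.keys():
        path = f"{prefix}{key}"
        old_val, new_val = old_spec.get(key), new_spec.get(key)
        if isinstance(old_val, dict) and isinstance(new_val, dict):
            changed |= changed_paths(old_val, new_val, prefix=f"{path}.")
        elif old_val != new_val:
            changed.add(path)
    return changed

def affected_families(old_file: str, new_file: str) -> set[str]:
    """Only the metamorphic families touched by the spec diff need to re-run."""
    with open(old_file) as f_old, open(new_file) as f_new:
        old_spec, new_spec = json.load(f_old), json.load(f_new)
    return {family
            for path in changed_paths(old_spec, new_spec)
            for spec_path, families in FAMILIES_BY_PATH.items()
            if path.startswith(spec_path)
            for family in families}
```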

Where to Apply Beyond Taxes

  • Benefits eligibility/means testing: Detect off‑by‑one cap errors that systematically exclude people who should qualify.
  • Lending/insurance pricing: Verify tiered pricing, grace periods, and exception codes behave correctly across the whole surface, not just happy paths.
  • Healthcare claims/adjudication: Encode coverage rules and ensure saturation/limits are enforced without regressions.

Regulators will love this: Higher‑order tests create explainable, relationship‑level guarantees (“these five cases obey the same marginal rate function”), which is far easier to audit than thousands of one‑off case assertions.


Pragmatic Risks & How to Mitigate

  • Oracle still partial: You can’t know every ground truth; design metamorphic suites to cover relationships that matter to law and fairness.
  • Coverage gaps: Start with three families (proportionality, threshold jump, saturation), then extend to interaction effects (e.g., credit stacking, sequencing rules) as incidents surface.
  • Compute/latency: Batch‑generate cases around shared thresholds and cache JSON‑spec slices to reduce repeated context.
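A small sketch of the batching/caching idea (the ±$50 probe step and function names are illustrative): generate each threshold's neighborhood once, and cache the spec slice a verification prompt needs instead of resending the full document.

```python
from functools import lru_cache
import json

def threshold_family(threshold: float, step: float = 50.0) -> list[float]:
    """Batch of probe incomes just below, at, and just above a shared threshold."""
    return [threshold - step, threshold, threshold + step]

@lru_cache(maxsize=None)
def spec_slice(spec_json: str, section: str) -> str:
    """Cache the slice of the policy spec that a verification prompt actually needs."""
    return json.dumps(json.loads(spec_json)[section])
```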

A Closing Take

Agentic LLMs are not just better coders—they’re institutionalizing compliance as code. The decisive upgrade here is auditable relationships, not just accurate outputs. If your product lives under statutes, shift your QA from answers to invariants. That’s how you stop silent bugs before customers and regulators find them.

Cognaptus: Automate the Present, Incubate the Future