Cover image

Bracket Busters: When Agentic LLMs Turn Law into Code (and Catch Their Own Mistakes)

TL;DR Agentic LLMs can translate legal rules into working software and audit themselves using higher‑order metamorphic tests. This combo improves worst‑case reliability (not just best‑case demos), making it a practical pattern for tax prep, benefits eligibility, and other compliance‑bound systems. The Business Problem Legal‑critical software (tax prep, benefits screening, healthcare claims) fails in precisely the ways that cause the most reputational and regulatory damage: subtle misinterpretations around thresholds, phase‑ins/outs, caps, and exception codes. Traditional testing stumbles here because you rarely know the “correct” output for every real‑world case (the oracle problem). What you do know: similar cases should behave consistently. ...

October 1, 2025 · 5 min · Zelina
Cover image

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR — L‑MARS replaces single‑pass RAG with a judge‑in‑the‑loop multi‑agent workflow that iteratively searches, checks sufficiency (jurisdiction, date, authority), and only then answers. On a 200‑question LegalSearchQA benchmark of current‑year questions, it reports major gains vs. pure LLMs, at the cost of latency. For regulated industries, the architecture—not just the model—does the heavy lifting. What’s actually new here Most legal QA failures aren’t from weak language skills—they’re from missing or outdated authority. L‑MARS tackles this with three design commitments: ...

September 4, 2025 · 4 min · Zelina
Cover image

Put It on the GLARE: How Agentic Reasoning Makes Legal AI Actually Think

Legal judgment prediction (LJP) is one of those problems that exposes the difference between looking smart and being useful. Most models memorize patterns; judges demand reasons. Today’s paper introduces GLARE—an agentic framework that forces the model to widen its hypothesis space, learn from real precedent logic, and fetch targeted legal knowledge only when it needs it. The result isn’t just higher accuracy; it’s a more auditable chain of reasoning. TL;DR What it is: GLARE = Gent Legal Agentic Reasoning Engine for LJP. Why it matters: It turns “guess the label” into compare-and-justify—exactly how lawyers reason. How it works: Three modules—Charge Expansion (CEM), Precedents Reasoning Demonstrations (PRD), and Legal Search–Augmented Reasoning (LSAR)—cooperate in a loop. Proof: Gains of +7.7 F1 (charges) and +11.5 F1 (articles) over direct reasoning; +1.5 to +3.1 F1 over strong precedent‑RAG; double‑digit gains on difficult, long‑tail charges. So what: If you’re deploying LLMs into legal ops or compliance, agentic structure > bigger base model. Why “agentic” beats bigger The usual upgrades—bigger models, more RAG, longer context—don’t address the core failure mode in LJP: premature closure on a familiar charge and surface‑level precedent matching. GLARE enforces a discipline: ...

August 25, 2025 · 4 min · Zelina
Cover image

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates Large Language Models are increasingly touted as decision-making aides in policy and governance. But what happens when we let them loose together in a legislative sandbox? NomicLaw — an open-source multi-agent simulation inspired by the self-amending game Nomic — offers a glimpse into how AI agents argue, form alliances, and shape collective rules without human scripts. The Experiment NomicLaw pits LLM agents against legally charged vignettes — from self-driving car collisions to algorithmic discrimination — in a propose → justify → vote loop. Each agent crafts a legal rule, defends it, and votes on a peer’s proposal. Scoring is simple: 10 points for a win, 5 for a tie. Two configurations were tested: ...

August 8, 2025 · 3 min · Zelina