Law & Order(ly Data): How LLMs Are Learning to Read Regulations Like Machines

Compliance has a familiar little horror story: everyone can find the rule, but nobody can safely operationalize it.

The document is searchable. The PDF is indexed. The chatbot can quote the right paragraph with the confidence of a junior associate who has just discovered Ctrl+F. And yet the actual business question still hangs in the air: who must do what, under which condition, subject to which exception, and with what consequence?

That is where most regulation-grounded AI systems quietly break. They treat legal text as content to retrieve, when the harder problem is preserving legal structure. A regulation is not merely a paragraph with keywords. It is a compressed bundle of targets, permissions, prohibitions, definitions, thresholds, exceptions, dates, citations, and sometimes penalties. Remove the structure and the rule may still sound correct while becoming operationally useless. Beautifully fluent nonsense, now with compliance risk attached.

The paper De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules takes this bottleneck seriously.¹ Its contribution is not another “ask the model to summarize the law” workflow. We have enough of those. Please stop feeding entire statutes to chatbots and calling it governance. De Jure proposes a pipeline that converts raw regulatory documents into structured, machine-readable rule units, then evaluates and repairs those units through a hierarchical LLM-as-judge process.

The useful business lesson is mechanism-first: better compliance QA comes not from retrieval alone, but from giving retrieval a better substrate. De Jure works because it changes the shape of the knowledge base before the question-answering system ever sees it.

The real bottleneck is rule shape, not document access

The tempting misconception is that regulatory AI is mainly a retrieval problem. Put HIPAA, the SEC Investment Advisers Act, or the EU AI Act into a vector database; retrieve the top passages; let a model answer. This works well enough for demos because demos reward plausible references. Compliance, unfortunately, is not a demo economy.

A regulatory answer fails when it drops a condition, flattens an exception, confuses a permission with an obligation, or loses the defined scope of a term. These are not cosmetic errors. If an obligation is retrieved without its exemption, the system overstates the rule. If a permission is classified as a requirement, the system invents duties. If a definition is paraphrased loosely, every downstream rule that depends on that term inherits the distortion. Law is annoyingly precise for a reason.

De Jure’s core move is to extract each section into three layers:

Layer	What it captures	Why it matters operationally
Section metadata	Citation, title, effective dates, notes	Anchors every rule to an auditable source provision.
Definitions	Terms, scope, text, cross-references	Prevents later rules from being interpreted with the wrong vocabulary.
Rule units	Rule type, targets, actions, conditions, constraints, exceptions, penalties, purpose, verbatim span	Turns legal prose into components that QA, policy checking, or workflow systems can inspect.

This is the difference between storing regulation as “text” and storing regulation as an asset with fields. The first can be searched. The second can be validated, versioned, compared, and queried. That is not a small distinction; it is the boundary between a compliance chatbot and a compliance system.
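To make "an asset with fields" concrete, here is a minimal sketch of the three layers as Python dataclasses. The class names, field names, rule-type labels, and the example provision are all illustrative assumptions; the paper's actual schema may differ.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SectionMetadata:
    citation: str
    title: str
    effective_date: Optional[str] = None
    notes: Optional[str] = None


@dataclass
class Definition:
    term: str
    text: str
    scope: Optional[str] = None
    cross_references: list[str] = field(default_factory=list)


@dataclass
class RuleUnit:
    rule_type: str  # assumed labels, e.g. "obligation" or "permission"
    targets: list[str]
    action: str
    verbatim_span: str  # exact source text, kept for auditability
    conditions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    exceptions: list[str] = field(default_factory=list)
    penalties: list[str] = field(default_factory=list)
    purpose: Optional[str] = None


@dataclass
class SectionExtraction:
    metadata: SectionMetadata
    definitions: list[Definition] = field(default_factory=list)
    rules: list[RuleUnit] = field(default_factory=list)


def rules_with_exceptions(extraction: SectionExtraction) -> list[RuleUnit]:
    """Return the rules that carry at least one exception."""
    return [rule for rule in extraction.rules if rule.exceptions]


# Invented example text, not a real provision.
example = SectionExtraction(
    metadata=SectionMetadata(citation="Example Act § 1.2", title="Record keeping"),
    definitions=[Definition(term="covered party", text="Any party named in § 1.1.")],
    rules=[
        RuleUnit(
            rule_type="obligation",
            targets=["covered party"],
            action="keep records",
            verbatim_span="A covered party must keep records unless exempted.",
            exceptions=["party is exempted"],
        )
    ],
)

as_json = json.dumps(asdict(example), ensure_ascii=False, indent=2)
```

Once the rule is an object, "which obligations have exceptions?" is a one-line filter rather than a reading assignment.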

De Jure does not require a human-annotated gold dataset, domain-specific prompts, or a pre-designed logical formalism. That is important because many firms have plenty of documents and very little appetite for hiring experts to label every clause before automation begins. The system instead relies on a schema-driven extraction prompt and a separate judge model that scores output against explicit criteria.

The paper’s strongest claim is not that LLMs “understand” regulation. It is more practical: if the task is decomposed properly, an LLM pipeline can transform dense regulatory text into structured rule units with enough fidelity to improve downstream compliance QA. Less mystical. More useful.

The pipeline works because errors are repaired in dependency order

De Jure has four stages: pre-processing, rule generation, multi-criteria judgment, and selective repair.

Pre-processing converts source documents into structured Markdown, preserves section boundaries, removes formatting noise, splits the document by regulatory section markers, attaches metadata, and adds a SHA-256 fingerprint for traceability. This sounds like plumbing because it is plumbing. It is also the part of the building that prevents sewage from entering the boardroom.
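The plumbing can be sketched in a few lines. The example below splits a document on section markers and attaches a SHA-256 fingerprint to each chunk. The marker pattern (a line starting with "§" followed by a number) and the sample text are assumptions for illustration; real corpora would need their own marker rules.

```python
from __future__ import annotations

import hashlib
import re

# Assumption: each section starts on a line beginning with "§ <number>".
SECTION_MARKER = re.compile(r"^§\s*[\d.]+", re.MULTILINE)


def split_sections(document: str) -> list[dict]:
    """Split a document at section markers and fingerprint each chunk.

    Text before the first marker (for example a preamble) is dropped here;
    a real pipeline might keep it as its own non-actionable chunk.
    """
    starts = [match.start() for match in SECTION_MARKER.finditer(document)]
    chunks = []
    for index, start in enumerate(starts):
        end = starts[index + 1] if index + 1 < len(starts) else len(document)
        text = document[start:end].strip()
        chunks.append(
            {
                "citation": SECTION_MARKER.match(text).group(0),
                "text": text,
                "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            }
        )
    return chunks


# Invented sample text, not a real regulation.
sample = (
    "Preamble text.\n"
    "§ 1.1 Scope\nThis part applies to examples.\n"
    "§ 1.2 Duties\nA covered party must keep records.\n"
)
chunks = split_sections(sample)
```

Because the fingerprint is computed over the chunk text, any later extraction can be traced back to the exact bytes it was derived from.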

Rule generation then asks a backbone LLM to produce JSON-like structured output. Each rule unit is decomposed into fields such as action, action object, method, conditions, constraints, exceptions, penalties, purpose, and verbatim source span. The prompt also instructs the model to return null for non-actionable sections, which matters because regulatory documents contain plenty of preambles, cross-reference tables, and explanatory material that should not inflate the rule base.

The judgment stage is where the paper becomes more interesting. Instead of using one holistic quality score, De Jure evaluates three layers in sequence:

metadata, across 6 criteria;
definitions, across 5 criteria;
rule units, across 8 criteria.

That gives 19 criteria in total. The judge scores each layer and provides natural-language critiques. If the normalized stage average falls below 90%, the system regenerates that stage, using the original source, the current extraction, the scores, and the judge’s critique. The retry budget is bounded, with a default maximum of three attempts. The best-scoring output is retained rather than blindly accepting the final attempt.
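The control flow of one stage can be sketched as a bounded loop with best-of selection. This is a sketch under stated assumptions, not the paper's implementation: the `generate` and `judge` callables are hypothetical stand-ins for the backbone and judge models, criteria are assumed to be scored out of 5, and the budget of three is treated as total attempts.

```python
from __future__ import annotations

from typing import Callable, Optional

THRESHOLD = 0.90  # acceptance threshold from the text
MAX_ATTEMPTS = 3  # default budget from the text (assumed to mean total attempts)
MAX_SCORE = 5.0   # assumption: each criterion is scored out of 5


def normalized_average(scores: dict[str, float]) -> float:
    """Mean criterion score rescaled to the 0..1 range."""
    return sum(scores.values()) / (MAX_SCORE * len(scores))


def refine_stage(source: str, generate: Callable, judge: Callable) -> tuple:
    """Generate, judge, and regenerate one stage; keep the best-scoring output."""
    best_output, best_score = None, -1.0
    output: Optional[str] = None
    scores: Optional[dict] = None
    critique: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        # Regeneration sees the source, the current extraction, scores, and critique.
        output = generate(source, output, scores, critique)
        scores, critique = judge(source, output)
        score = normalized_average(scores)
        if score > best_score:
            best_output, best_score = output, score
        if score >= THRESHOLD:
            break
    return best_output, best_score


# Deterministic stand-ins for the two models, for illustration only.
_drafts = iter(["draft-1", "draft-2", "draft-3"])
_fake_scores = {
    "draft-1": {"a": 3.0, "b": 4.0},
    "draft-2": {"a": 4.0, "b": 4.0},
    "draft-3": {"a": 4.0, "b": 3.5},
}


def fake_generate(source, previous, scores, critique):
    return next(_drafts)


def fake_judge(source, output):
    return _fake_scores[output], "illustrative critique"


best, best_score = refine_stage("§ 1.1 Scope", fake_generate, fake_judge)
```

In this toy run no draft clears the threshold, so the loop exhausts its budget and returns the second draft, which scored highest, rather than the third, which merely came last.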

The ordering matters. Metadata and definitions are not decorative fields; they are the context on which rule extraction depends. A bad citation or a fabricated definition can quietly poison every later rule unit. De Jure therefore repairs upstream components before evaluating downstream rules. In compliance terms: fix the glossary before arguing about the obligation. Revolutionary, in the way that locking the door before leaving the house is revolutionary.

The appendix example makes this concrete. In a HIPAA § 164.306 extraction, the initial rule is structurally complete and non-hallucinated, but two fields are wrong: the label omits a multi-factor balancing requirement, and the rule type is misclassified as “clarification” rather than “definition-application.” The judge gives the extraction a failing normalized score of 0.55 and identifies the affected criteria; after repair, the score rises to 0.90. Importantly, the repair changes only the defective fields. This example is not statistical evidence; it reads as implementation-level validation, showing that the judge can localize defects instead of merely saying, “bad vibes, try again.”
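The "changes only the defective fields" property can be enforced mechanically. The sketch below applies a repair patch but rejects any attempt to touch a field the judge did not flag. The function name, the dictionary layout, and the placeholder strings are assumptions; only the two rule-type labels come from the example above.

```python
from __future__ import annotations


def apply_repair(rule: dict, patch: dict, flagged: set[str]) -> dict:
    """Return a repaired copy of `rule`, changing only judge-flagged fields."""
    unexpected = set(patch) - flagged
    if unexpected:
        raise ValueError(f"repair touched unflagged fields: {sorted(unexpected)}")
    repaired = dict(rule)  # shallow copy so the original extraction is preserved
    repaired.update(patch)
    return repaired


# Placeholder strings stand in for the real label and source span.
initial = {
    "label": "<label without the balancing requirement>",
    "rule_type": "clarification",
    "verbatim_span": "<source text>",
}
patch = {
    "label": "<label including the balancing requirement>",
    "rule_type": "definition-application",
}
repaired = apply_repair(initial, patch, flagged={"label", "rule_type"})
```

Keeping the original and the repaired copy side by side also gives reviewers a field-level diff to inspect.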

The main evidence says fine-grained rule decomposition is the hard part

The paper evaluates De Jure on three regulatory corpora: the SEC Investment Advisers Act, the HIPAA Privacy Rule, and the EU AI Act. These are useful test cases because they stress different structures. SEC rules are rigidly indexed. HIPAA is exception-heavy. The EU AI Act is more discursive and principle-laden. In other words, the system is not only tested on one clean statute wearing a lab coat.

The first experiment uses the SEC corpus to compare four backbone models: Llama-3.1-8B-Instruct, Qwen3-VL-8B-Instruct, Claude-3.5-Sonnet, and GPT-5-mini, while keeping a separate fixed judge model. The pattern is more informative than the model leaderboard. Average performance declines from metadata to definitions to per-rule quality. Metadata is easiest because headings, citations, and dates are structurally visible. Definitions are harder because they require identifying genuine regulatory primitives. Rule units are hardest because they require recovering conditional triggers, constraints, exceptions, and legal function.

That gradient is the paper’s most important internal diagnostic. It confirms that the bottleneck is not generic extraction. It is the semantic decomposition of operational rules.

The model comparison is also practically relevant. GPT-5-mini leads the SEC overall result with an average of 4.85, while Qwen3-VL-8B is close at 4.82 and even leads on definition extraction. The paper interprets this as evidence that capable open-source models, when placed inside a structured judge-and-repair pipeline, can approach proprietary model performance. For firms with privacy, data residency, or vendor-risk constraints, this matters more than another “closed model wins benchmark” paragraph. Architecture can narrow the capability gap.

Non-hallucination scores are near-ceiling across the experiments. That sounds excellent, but it should be read carefully. The paper is not proving that hallucination has been solved in legal AI, a sentence that should cause immediate professional suspicion. It shows that schema-constrained extraction with verbatim source grounding and multi-criteria judging can reduce fabrication as a measured failure mode within this setup. The remaining difficulty is subtler: completeness, fidelity, accuracy, and actionability in the face of nested legal structure.

Cross-domain generalization is encouraging, but the decline pattern is the real signal

The second experiment applies De Jure without changing the prompts, schema, or model configuration across SEC, HIPAA, and the EU AI Act, using GPT-5-mini and Qwen3-VL-8B. Overall scores remain above 4.70 out of 5.00 across all domain-model combinations.

That headline is useful, but the score pattern tells the better story. The paper reports a monotonic decline from SEC to HIPAA to the EU AI Act, matching the expected increase in structural difficulty. SEC text is more rigidly indexed. HIPAA contains looser, exception-heavy permission structures. The EU AI Act interleaves binding obligations with broader governance language. A permissive judge might have given everything near-perfect scores. The fact that scores vary in a plausible direction suggests the judge is not merely rubber-stamping its friends. A low bar, yes, but an important one.

The per-stage results again point to the same mechanism. Definitions are a key source of domain variability, especially when boundaries are less rigid. Rule-unit extraction remains the universal hard stage. This is exactly what a compliance architect should expect: section metadata changes with document format, definitions change with legal style, but extracting “who must do what unless which exception applies” is hard everywhere.

Evidence item	Likely purpose	What it supports	What it does not prove
SEC extraction quality across four models	Main evidence	The structured pipeline improves rule extraction and exposes where model capability matters.	That all regulatory domains will perform similarly.
Cross-domain test on SEC, HIPAA, EU AI Act	Robustness / generalization test	The same schema and prompts transfer across three structurally different corpora.	That every jurisdiction, language, or document type is covered.
HIPAA RAG comparison against Datla et al.	Comparison with prior work / downstream utility	Better structured rules improve compliance QA against a prior extraction baseline.	That De Jure answers are legally authoritative.
Acceptance threshold, retry budget, chunking, trigger ablations	Ablation	Quality depends on threshold choice, retry budget, and coherent section-aware inputs.	That the selected parameters are globally optimal.
HIPAA repair example from 0.55 to 0.90	Implementation detail / qualitative validation	The judge can localize defects and guide targeted repair.	That every repair will be so clean.

This distinction matters because enterprise readers often ask the wrong question: “Which model should we use?” The paper’s better question is: “Which failure modes must the system isolate before the model’s answer becomes operational?”

The RAG result matters because the advantage grows with retrieval depth

The third experiment tests downstream utility through HIPAA compliance question answering. De Jure’s extracted rules are compared against the prior rule-extraction approach by Datla et al., with both rule sets used as knowledge bases in separate RAG systems. The authors restrict the comparison to overlapping HIPAA sections, generate 100 evaluation questions, retrieve at depths $k \in \{1, 5, 10\}$, generate answers with Claude 3.5 Sonnet, and use a pairwise judge with swapped answer ordering to reduce positional bias.
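Swapped-order judging is simple to sketch. In the version below, each pair is judged twice with the answer positions exchanged, and an answer earns full credit only if it wins in both positions. The half-credit aggregation rule and the `judge` signature are assumptions; the text does not say how the paper combines the two orderings.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returning the string "first" or "second".
Judge = Callable[[str, str, str], str]


def preference_credit(question: str, answer_a: str, answer_b: str, judge: Judge) -> float:
    """Credit for answer_a: 1.0 if it wins in both orders, 0.5 if split, 0.0 if it loses both."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    wins = int(forward == "first") + int(backward == "second")
    return wins / 2


def win_rate(credits: list[float]) -> float:
    """Average credit expressed as a percentage."""
    return 100.0 * sum(credits) / len(credits)


def position_biased_judge(question: str, first: str, second: str) -> str:
    return "first"  # always prefers whatever is shown first


def longer_answer_judge(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


biased = preference_credit("q", "long answer", "short", position_biased_judge)
content = preference_credit("q", "long answer", "short", longer_answer_judge)
```

A judge that only rewards position nets out to a 0.5 split, which is the point of the swap: positional preference cancels instead of counting as a win.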

De Jure wins across every criterion and retrieval depth. The aggregated win rates are 73.75% at $k=1$, 77.42% at $k=5$, and 84.00% at $k=10$.

The increasing win rate is the key. If De Jure were merely producing nicer-looking snippets, its advantage might fade as more material is retrieved. Instead, the advantage grows as retrieval becomes broader. That suggests the rule units are complementary rather than redundant. The system is not only retrieving more; it is retrieving pieces that combine better.

Handling ambiguity is the most revealing criterion. At $k=1$, De Jure’s win rate on ambiguity is 53.50%, close to parity. At $k=10$, it reaches 84.00%. This makes sense. Ambiguous compliance questions often require multiple provisions: one rule sets the general requirement, another defines scope, another supplies an exception. A single retrieved rule cannot resolve that structure. Broader retrieval helps only if the retrieved objects are differentiated enough to compose. De Jure’s representation appears to make that composition easier.

Criterion	Win rate at $k=1$ (%)	Win rate at $k=5$ (%)	Win rate at $k=10$ (%)	Interpretation
Completeness	78.00	80.50	83.50	Broader retrieval improves coverage.
Factual grounding	80.50	76.00	85.50	Grounding remains strong, though not perfectly monotonic.
Handling ambiguity	53.50	66.50	84.00	The clearest sign that structured rules compose across provisions.
Practical actionability	78.00	80.50	84.00	Rule decomposition improves usable guidance.
Regulatory precision	74.50	80.50	83.50	Scope, conditions, and exceptions are better preserved.
Overall preference	78.00	80.50	83.50	The judge consistently prefers De Jure-grounded answers.
Aggregated	73.75	77.42	84.00	The advantage widens with retrieval depth.
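A quick arithmetic check on the table: the aggregated row matches the unweighted mean of the six criterion rows at each depth. That is an observation from the numbers themselves, not something the text states about how the paper computes it.

```python
# Win rates (%) copied from the table above, ordered as k = 1, 5, 10.
win_rates = {
    "Completeness": (78.00, 80.50, 83.50),
    "Factual grounding": (80.50, 76.00, 85.50),
    "Handling ambiguity": (53.50, 66.50, 84.00),
    "Practical actionability": (78.00, 80.50, 84.00),
    "Regulatory precision": (74.50, 80.50, 83.50),
    "Overall preference": (78.00, 80.50, 83.50),
}
depths = (1, 5, 10)

# Unweighted mean across the six criteria at each retrieval depth.
aggregated = {
    k: round(sum(row[i] for row in win_rates.values()) / len(win_rates), 2)
    for i, k in enumerate(depths)
}
print(aggregated)  # {1: 73.75, 5: 77.42, 10: 84.0}
```

The ambiguity criterion therefore does most of the work in the widening gap: it moves by 30.5 points between $k=1$ and $k=10$, while no other criterion moves by more than 9.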

This is where the paper’s business relevance becomes concrete. A compliance assistant that retrieves paragraphs may be adequate for “show me the source.” It is not enough for “what are we allowed to do in this edge case?” Edge cases require structured composition. De Jure improves the object being retrieved, not just the retriever.

The ablations reveal the control knobs, not just footnote trivia

The ablation section is easy to skim. That would be a mistake. It identifies which parts of the pipeline actually move quality.

First, the acceptance threshold matters. Raising the threshold from 0.6 to 0.9 increases total quality from 4.48 to 4.67, with the largest gain in definitions: +0.30 points. This supports the mechanism-first interpretation. Definitions are where small quality changes matter because they influence later rule interpretation.

Second, retry budget has a non-linear effect. A single retry gives almost no improvement. The qualitative jump occurs at $r=2$, where Step 2 (the definitions stage) recovers 1.25 points, or 25.0%. After that, gains saturate, and the paper adopts $r=3$ as a conservative default. The practical message is not “retry forever.” It is “one retry may be fake thrift.” The candidate pool needs enough diversity for best-of selection to escape a bad local basin, but additional retries eventually become expensive decoration.

Third, chunking matters early. De Jure’s section-aware chunking beats the Datla et al. chunking strategy by +0.16 overall, with the benefit concentrated in Step 1 metadata (+0.33). Step 3, the rule-unit stage, barely changes (+0.01). That tells us chunking does not magically solve rule reasoning. It ensures the model receives coherent inputs. Downstream refinement can polish coherent material; it cannot resurrect context that was amputated before extraction began.

Fourth, regeneration trigger granularity is second-order. Whether regeneration is triggered by average score or by any individual criterion below a threshold matters little once retry budget is adequate. That is useful engineering information: spend effort on bounded candidate diversity and clean input segmentation before obsessing over trigger micro-design.
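The two trigger designs differ by one line. The sketch below assumes criterion scores already normalized to the 0..1 range and reuses the 0.9 threshold from the text; the function names and the example scores are illustrative.

```python
from __future__ import annotations


def trigger_on_average(scores: dict[str, float], threshold: float = 0.9) -> bool:
    """Regenerate when the mean criterion score is below the threshold."""
    return sum(scores.values()) / len(scores) < threshold


def trigger_on_any(scores: dict[str, float], threshold: float = 0.9) -> bool:
    """Regenerate when any single criterion is below the threshold."""
    return min(scores.values()) < threshold


# One weak criterion hidden behind two strong ones.
scores = {"completeness": 1.0, "fidelity": 1.0, "actionability": 0.8}
average_fires = trigger_on_average(scores)  # False: the mean is about 0.93
any_fires = trigger_on_any(scores)          # True: actionability is 0.8
```

The any-criterion trigger is stricter, so it spends more retries. The ablation suggests that extra strictness buys little once the retry budget is adequate.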

For implementation, the ablations translate into a simple hierarchy of attention:

Design choice	Operational consequence	ROI relevance
Section-aware chunking	Prevents incoherent inputs before extraction begins.	High leverage early; cheap compared with repairing downstream confusion.
Schema decomposition	Forces the model to separate obligations, conditions, exceptions, and source spans.	Converts text into auditable assets.
Hierarchical judging	Repairs metadata and definitions before rule units.	Reduces silent error propagation.
Retry budget	Provides enough candidate diversity for repair to work.	Main cost-quality knob.
Trigger granularity	Changes when repair starts.	Lower priority once retry budget is adequate.

This is a useful reminder for enterprise AI: system reliability often comes from boring sequencing decisions. Not every problem needs a new model. Some problems need the pipeline to stop eating soup with a fork.

What compliance teams can infer, and what they should not

The paper directly shows three things. First, De Jure can extract structured regulatory rule units across three high-stakes corpora without human annotations or domain-specific prompting. Second, hierarchical LLM judging and bounded repair can improve extraction quality, especially in the definition layer. Third, when De Jure’s rules are used in a HIPAA RAG QA task, the resulting answers are preferred over a prior extraction baseline, with the advantage increasing at broader retrieval depth.

Cognaptus would infer a practical architecture from this, but not a legal miracle.

A compliance-heavy firm could treat regulations as a versioned rule asset layer. Raw PDFs or HTML documents would be normalized into section-level chunks. Rule extraction would create structured JSON objects. Each object would retain citations, source spans, definitions, and rule type. A judging layer would score and repair fields. Human reviewers would then inspect high-risk or low-confidence outputs before rules are pushed into compliance QA, policy-checking workflows, AI governance documentation, or internal controls.
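The review gate in that architecture could be as simple as the routing sketch below. The risk rules are hypothetical: here a rule is "high-risk" if it carries penalties or is a prohibition, and "low-confidence" if its judge score is below 0.9. A real deployment would set both with its compliance team.

```python
from __future__ import annotations


def needs_human_review(rule: dict, judge_score: float, threshold: float = 0.90) -> bool:
    """Route low-confidence or high-risk rule units to a reviewer."""
    low_confidence = judge_score < threshold
    # Hypothetical risk policy: penalties or prohibitions always get a human look.
    high_risk = bool(rule.get("penalties")) or rule.get("rule_type") == "prohibition"
    return low_confidence or high_risk


def route(scored_rules: list[tuple[dict, float]]) -> tuple[list[dict], list[dict]]:
    """Split (rule, score) pairs into auto-approved and review queues."""
    approved, review = [], []
    for rule, score in scored_rules:
        (review if needs_human_review(rule, score) else approved).append(rule)
    return approved, review


# Invented rule units for illustration.
scored = [
    ({"id": "r1", "rule_type": "permission", "penalties": []}, 0.95),
    ({"id": "r2", "rule_type": "obligation", "penalties": ["fine"]}, 0.97),
    ({"id": "r3", "rule_type": "permission", "penalties": []}, 0.70),
]
approved, review = route(scored)
```

Only `r1` skips review in this example; `r2` is confident but carries a penalty, and `r3` is low-risk but scored poorly.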

The business value is not “replace lawyers.” That phrase belongs in the same drawer as “blockchain for toothbrushes.” The value is cheaper first-pass structuring, better traceability, and more reliable retrieval inputs. Lawyers and compliance officers should spend less time manually decomposing obvious clauses and more time reviewing edge cases, conflict handling, and operational interpretation.

The operational pathway looks like this:

Paper result	Business interpretation	Boundary
No annotated gold data required	Lower setup cost for new regulatory corpora.	Expert review is still needed for high-stakes deployment.
Domain-agnostic schema transfers across SEC, HIPAA, EU AI Act	A single extraction architecture may support multiple compliance domains.	Tested on three corpora, not every document style or jurisdiction.
Hierarchical repair improves structured extraction	Quality control can be embedded into the pipeline rather than added as a final checklist.	LLM-as-judge reliability remains a dependency.
De Jure wins HIPAA QA against a prior baseline	Better rule assets can improve RAG answers.	The downstream test is still judge-based and limited to overlapping HIPAA sections.
Retry budget and chunking drive quality	Cost and performance can be tuned through pipeline design.	Optimal settings may shift by model, corpus, language, and risk tolerance.

This is especially relevant for AI governance. As AI systems become subject to AI-specific regulation, organizations will need internal systems that can map obligations to controls, documents, model inventories, risk assessments, and audit evidence. De Jure points toward that infrastructure layer: regulation first becomes structured data, then the structured data supports governance workflows.

The boundary is legal authority, judge reliability, and production cost

The limitations are not decorative. They define how the result should be used.

First, the evaluation relies heavily on LLM judges. The paper mitigates this by using decomposed criteria, a separate fixed judge model, pairwise answer comparison, and swapped answer ordering in the RAG experiment. Still, LLM-as-judge is not the same as expert legal adjudication. It is a scalable evaluation proxy, not a court.

Second, the corpora are strong but limited: finance, healthcare privacy, and AI governance. That gives meaningful structural variety, but not universal coverage. Tax regulation, multilingual legislation, local administrative rules, enforcement guidance, contracts, and case-law-heavy domains may behave differently.

Third, De Jure improves structured extraction; it does not solve full legal reasoning. A rule unit can preserve conditions and exceptions, but conflict resolution, jurisdictional hierarchy, temporal applicability, and business-specific factual mapping remain separate problems. “The system extracted the rule correctly” is not identical to “the company is compliant.” Anyone who confuses those two has probably also described a dashboard as a strategy.

Fourth, production economics matter. The pipeline uses generation, judging, and possible regeneration across multiple stages. The retry budget is bounded, and the ablations suggest where cost is worth spending, but deployment teams still need cost controls, caching, versioning, change detection, audit logs, and escalation thresholds.
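Change detection is one place where the pre-processing fingerprint pays for itself. The sketch below compares stored section fingerprints against a fresh run and returns only the sections that need re-extraction. The storage layout (a dictionary from citation to SHA-256 digest) and the sample text are assumptions.

```python
from __future__ import annotations

import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 digest of a section's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sections_to_reprocess(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Citations whose text is new or changed since the last run."""
    return sorted(
        citation for citation, digest in current.items() if previous.get(citation) != digest
    )


# Invented section text for illustration.
previous = {"§ 1.1": fingerprint("Scope text."), "§ 1.2": fingerprint("Duties text.")}
current = {
    "§ 1.1": fingerprint("Scope text."),
    "§ 1.2": fingerprint("Duties text, as amended."),
    "§ 1.3": fingerprint("New section."),
}
stale = sections_to_reprocess(previous, current)
```

Unchanged sections skip generation, judging, and repair entirely, which keeps an amendment from triggering a full re-run of the corpus.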

Finally, human review does not disappear. It moves. Instead of manually reading every provision from scratch, experts review structured outputs, inspect low-confidence cases, resolve conflicts, and approve operational mappings. That is a better use of expensive judgment, not the end of judgment.

Machine-readable regulation begins before the chatbot

The existing conversation around enterprise AI still over-focuses on the final answer: did the assistant respond well? De Jure reminds us that the answer is downstream of the representation. If the knowledge base flattens obligations, conditions, and exceptions into generic chunks, the model is already handicapped before generation begins.

The paper’s real contribution is therefore infrastructural. It shows how regulatory text can be transformed into auditable rule units through schema-driven extraction, hierarchical judging, and selective repair. The strongest downstream result—the HIPAA QA win rate rising from 73.75% at $k=1$ to 84.00% at $k=10$—makes sense only when viewed through that mechanism. Better rule structure makes broader retrieval more useful because retrieved units can compose rather than collide.

For businesses, that points to a sober roadmap. Start by making regulation machine-readable. Preserve citations, definitions, targets, rule types, conditions, constraints, exceptions, and verbatim spans. Use judge-and-repair loops to catch field-level defects. Keep humans in the approval path. Then, and only then, ask a chatbot to answer compliance questions.

Law does not become operational because an LLM can quote it. It becomes operational when the organization can trace a question back through structured rules to source text and then to a reviewed business control.

Less courtroom drama. More orderly data.

Cognaptus: Automate the Present, Incubate the Future.

¹ Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, and Lovedeep Gondara, “De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules,” arXiv:2604.02276, 2026. https://arxiv.org/abs/2604.02276
