Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

Audit queues have a special talent for turning urgency into fog.

A product team wants to ship. Legal wants assurance. Governance wants evidence. The vendor has supplied a beautifully formatted technical document, full of dataset sizes, risk controls, model validation steps, and the usual confidence perfume. Somewhere inside that document may be a real compliance gap. Or it may simply be written by someone who knows how to sound compliant. Naturally, someone asks the modern executive question: can we let an LLM take the first pass?

AIReg-Bench is interesting because it does not answer that question with vibes. It builds a ruler.¹ The benchmark asks whether language models can read technical documentation excerpts for high-risk AI systems and score their likely compliance with selected EU AI Act requirements in roughly the same way legally trained human experts do.

That is already more useful than another “LLMs in law” demo where a chatbot summarises a regulation and everyone applauds politely, as if PDF digestion were governance. The sharper result is this: several frontier models align surprisingly well with median expert judgments, but the paper also shows exactly why that alignment is not the same thing as legal authority.

In other words, AIReg-Bench does not crown LLMs as compliance cops. It gives compliance teams a calibrated paper tiger detector. Quite useful. Still not a badge.

The headline result is strong, but not as simple as “AI beats lawyers”

The benchmark contains 120 synthetic but expert-vetted technical documentation excerpts. Each excerpt describes a fictional but plausible high-risk AI system and is tied to one of five EU AI Act articles: Article 9 on risk management, Article 10 on data and data governance, Article 12 on record keeping, Article 14 on human oversight, and Article 15 on accuracy, robustness, and cybersecurity.

Six legal experts participated in annotation. Each excerpt received three human compliance scores on a 1–5 Likert scale, where 1 means very low probability of compliance and 5 means very high probability. The models were then given the same system descriptions, documentation excerpts, relevant AI Act article text, and task instructions, and asked to produce the same kind of score and justification.

The best-performing frontier models did not merely produce plausible-sounding legal prose. They often matched the median human score closely.

Model	Weighted agreement ($κ_w$)	Rank correlation ($ρ$)	Bias (model − human)	MAE	Exact / over / under (% of excerpts)
Gemini 2.5 Pro	0.863	0.856	-0.225	0.458	60.0% / 10.0% / 30.0%
GPT-5	0.849	0.838	-0.067	0.450	57.5% / 16.7% / 25.8%
Grok 4	0.829	0.829	+0.242	0.475	58.3% / 30.8% / 10.8%
GPT-4o	0.775	0.842	+0.458	0.558	52.5% / 42.5% / 5.0%
o3 mini	0.624	0.798	+0.742	0.775	44.2% / 54.2% / 1.7%

The first useful interpretation is ordinal: the better models are not just guessing labels. They tend to preserve the human ranking of “more compliant” versus “less compliant” cases. Gemini 2.5 Pro leads on weighted agreement and rank correlation, and its score is within one point of the median expert score for all but 7 of 120 excerpts. That is not a parlour trick. For queueing, escalation, and second-review prioritisation, it is operationally meaningful.
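For readers who want to see what a weighted agreement figure rewards, here is a minimal sketch of Cohen's kappa with quadratic weights on a 1–5 scale. The quadratic weighting and the function name `weighted_kappa` are assumptions made for illustration; this article does not state which weighting scheme the paper uses.

```python
from collections import Counter


def weighted_kappa(a, b, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic weights for two lists of ordinal labels.

    Assumption: quadratic weights. A two-point miss is penalised four
    times as heavily as a one-point miss.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    observed = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)  # marginal counts for rater a
    col = Counter(index[y] for y in b)  # marginal counts for rater b

    num = 0.0  # observed weighted disagreement
    den = 0.0  # weighted disagreement expected if raters were independent
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2
            num += w * observed[(i, j)] / n
            den += w * row[i] * col[j] / (n * n)

    if den == 0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return 1.0 - num / den


if __name__ == "__main__":
    median_human = [1, 2, 3, 4, 5, 3]
    model = [1, 2, 4, 4, 5, 2]
    print(round(weighted_kappa(model, median_human), 3))
```

The point of the weighting is that a model scoring 4 where experts said 3 is treated very differently from one scoring 5 where experts said 1, which is the behaviour a triage queue cares about.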

The second interpretation is less flattering and more important. Models have different compliance personalities. Some are conservative; some are dangerously generous. GPT-4o and o3 mini often score documentation higher than the median human expert. o3 mini over-scores in 54.2% of excerpts and under-scores in only 1.7%. That is not “helpful optimism.” In compliance, that is how a green light becomes a liability with better typography.
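The error-profile columns in the table are simple to reproduce on an internal sample. The sketch below assumes the target is the median of the human scores for each excerpt and that "over" means the model scored above that median; `error_profile` and its input layout are illustrative names, not the paper's code.

```python
from statistics import mean, median


def error_profile(model_scores, human_scores):
    """Summarise direction and size of model error against median human scores.

    model_scores: one Likert score per excerpt.
    human_scores: one list of human Likert scores per excerpt.
    """
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("need matching, non-empty inputs")
    targets = [median(h) for h in human_scores]
    diffs = [m - t for m, t in zip(model_scores, targets)]
    n = len(diffs)
    return {
        "bias": mean(diffs),                      # signed: model minus human
        "mae": mean(abs(d) for d in diffs),       # unsigned size of the miss
        "exact": sum(d == 0 for d in diffs) / n,  # share scored identically
        "over": sum(d > 0 for d in diffs) / n,    # share scored too generously
        "under": sum(d < 0 for d in diffs) / n,   # share scored too harshly
    }


if __name__ == "__main__":
    humans = [[3, 3, 4], [2, 3, 3], [2, 4, 5], [5, 5, 4]]
    print(error_profile([3, 4, 2, 5], humans))
```

Reporting "over" and "under" separately matters because a bias near zero can hide a model that is wrong in both directions equally often.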

The paper therefore gives procurement teams a better question than “which model is smartest?” The better question is: which error profile can our governance process tolerate?

A conservative model may create more false alarms and reviewer fatigue. An over-generous model may let weak documentation glide through because it sounds competent. The first problem wastes time. The second problem creates audit exposure. One is annoying. The other arrives with lawyers.

The real benchmark is not against perfection; it is against expert disagreement

The most quietly useful part of AIReg-Bench is that the human benchmark is not presented as divine truth carved into marble. The legal experts themselves disagree.

The paper reports Krippendorff’s alpha of 0.651 across annotators: moderate agreement, but not strong enough to pretend there is a single obvious answer in every case. Two annotators show opposing biases: one scores noticeably lower than the rest, the other noticeably higher. Removing them would raise agreement to 0.786, but the authors keep all annotations to avoid post hoc cleansing of inconvenient humans. Very inconsiderate of them to preserve reality.
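A team that wants to measure its own reviewers the same way can compute Krippendorff's alpha directly. The sketch below uses the interval (squared-difference) distance, which is an assumption: this article does not say which distance metric the paper applied to its Likert scores. The function name is also illustrative.

```python
from itertools import combinations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    units: one list of scores per excerpt. Excerpts with fewer than two
    scores carry no agreement information and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one excerpt with two or more scores")

    # Observed disagreement: squared differences between scores on the same excerpt.
    d_o = sum(
        2 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences across all pooled scores.
    d_e = 2 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        return float("nan")  # every score identical: alpha is undefined
    return 1.0 - d_o / d_e


if __name__ == "__main__":
    ratings = [[4, 4, 5], [2, 3, 2], [1, 1, 2], [5, 4, 4]]
    print(round(krippendorff_alpha_interval(ratings), 3))
```

Alpha compares disagreement within an excerpt to the disagreement you would expect from shuffling all scores together, so a value of 1 means perfect agreement and 0 means no better than chance.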

This matters because it changes what “model alignment” means. The target is not a hard legal fact like “Article 10 says X.” The target is a median expert judgment under uncertainty, based on documentation that may be incomplete, ambiguous, subtly non-compliant, or simply hard to interpret.

That is exactly what makes the benchmark business-relevant. Real compliance work is rarely binary. It is structured uncertainty management. A provider document may acknowledge a data gap but describe mitigation. Is that enough? A system may rely on deployer oversight but provide limited built-in controls. Is that a reasonable allocation of responsibility or a disguised risk transfer? The answer often depends on legal interpretation, technical context, standards maturity, and institutional risk appetite.

AIReg-Bench uses Likert scores because “compliant / non-compliant” would be too clean for the actual problem. A 3 is not a failed model output. It may be the honest shape of the case.

For organisations, this points to the right deployment pattern. Do not use LLMs as verdict engines. Use them as consistency instruments. They can help identify which documents deserve faster review, which ones show obvious gaps, and which ones are ambiguous enough to require senior counsel. The value is not replacing judgment. It is making judgment less randomly distributed across tired humans, inconsistent templates, and Friday-afternoon document dumps.

The synthetic documents are a feature, and also the boundary

A benchmark for EU AI Act compliance has an immediate data problem: real technical documentation for high-risk AI systems is not conveniently lying around in public repositories. It is confidential, legally sensitive, and often commercially inconvenient. A charming combination.

AIReg-Bench solves this by generating synthetic documentation through a staged LLM pipeline. The pipeline first creates high-level system overviews across eight use cases: traffic safety, gas delivery, education, exam proctoring, job hiring, job termination, emergency dispatch, and credit scoring. It then creates compliance profiles for a selected AI Act article. Finally, it generates provider-style technical documentation excerpts that embody those profiles without simply announcing “this is compliant” or “this is broken.”

That design choice is not just data fabrication for convenience. It is an attempt to create controlled evaluation material where the distribution of use cases, articles, and compliance scenarios can be steered. The authors deliberately aim for subtle, realistic violations rather than cartoon failures. The generated excerpts are also reviewed for plausibility by legal experts. The median plausibility score is 4 out of 5, and 276 of the 360 plausibility annotations are either 4 or 5.

That said, synthetic plausibility is not real-world messiness. A generated technical document may be coherent, realistic, and difficult enough to test model reasoning, while still lacking the glorious ugliness of actual enterprise evidence: contradictory annexes, stale model cards, internal screenshots, missing lineage tables, vendor claims copy-pasted from sales decks, and version histories that look like someone lost a fight with SharePoint.

So the right interpretation is narrow but useful. AIReg-Bench tells us something about model performance on plausible provider-style excerpts under controlled conditions. It does not prove performance on complete conformity assessment packages, adversarial documentation, multi-document evidence rooms, or live regulator dialogue.

That distinction is not pedantry. It is deployment design.

The ablations say context beats attitude

One of the more practical tests in the paper is the GPT-4o ablation suite. This is not the main evidence; it is a sensitivity test. Its purpose is to ask whether performance depends on prompt framing and access to legal context.

The answer: yes, materially.

The baseline GPT-4o prompt tells the model to be calibrated, objective, rigorous, and fair. Removing the tone instruction slightly weakens performance. Replacing it with a harsher instruction reduces positive bias, but it also reduces weighted agreement and rank correlation while increasing mean absolute error. Withholding the relevant AI Act article text causes a more severe drop: weighted agreement falls to 0.654 and MAE rises to 0.717.

This is the operational lesson hiding in plain sight. You cannot fix compliance review by telling the model to “be strict.” Strictness is not legal reasoning. It is a mood.

Good compliance assessment requires the model to anchor its score to the actual governing provision, the system description, and concrete evidence in the document. Without the article text, the model falls back toward generic AI governance instincts. Those instincts may sound responsible, but they are not the same as evaluating Article 9, 10, 12, 14, or 15.

For a company building internal AI governance tooling, the design implication is clear:

System design choice	What AIReg-Bench suggests	Business consequence
Include the exact legal provision in context	Performance drops when article text is withheld	Compliance templates should inject the relevant rule, not rely on model memory
Use “harsh” prompting to reduce generosity	Bias improves, but agreement and error worsen	Calibration cannot be solved by tone theatre
Require explanations alongside scores	Both humans and models produce justifications	Reviewers need inspectable reasoning, not just numbers
Measure over-scoring and under-scoring separately	Models differ sharply in bias direction	Procurement should evaluate risk posture, not only leaderboard rank

The boring version of this finding is “prompting matters.” The useful version is: legal context matters more than legal attitude.
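In tooling terms, "legal context over legal attitude" can be enforced at the point where the prompt is assembled. The sketch below is a hypothetical template, not the prompt used in the paper; `build_assessment_prompt`, its parameters, and the section headings are all assumptions. The one design decision it illustrates is refusing to build a prompt when the article text is missing.

```python
def build_assessment_prompt(system_overview, excerpt, article_label, article_text):
    """Assemble a single-turn scoring prompt that always carries the governing provision."""
    if not article_text.strip():
        # Refuse to fall back on whatever the model remembers about the Act.
        raise ValueError("article_text is empty; supply the provision verbatim")
    sections = [
        "You are assessing technical documentation for a high-risk AI system.",
        f"## Governing provision: {article_label}\n{article_text.strip()}",
        f"## System overview\n{system_overview.strip()}",
        f"## Documentation excerpt\n{excerpt.strip()}",
        "## Task\n"
        f"Rate the probability that the excerpt complies with {article_label} "
        "on a scale of 1 to 5 (1 = very low, 5 = very high). "
        "Quote the passages your score relies on, then give the score.",
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    prompt = build_assessment_prompt(
        system_overview="Credit scoring model used by a retail lender.",
        excerpt="The system logs final decisions only.",
        article_label="Article 12",
        article_text="<paste the official Article 12 text here>",
    )
    print(prompt)
```

Note that the template contains no tone instruction at all. Whether to add one is a calibration question to settle against human scores, not a default.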

Legal fine-tuning is not a magic robe

The paper also evaluates two Saul legal language models, Saul-7B-Instruct and Saul-54B-Instruct. This is a comparison with a specialised model class, not the main benchmark result.

The results are sobering. Saul-7B-Instruct performs poorly, with weighted agreement of 0.183 and MAE of 1.167. Saul-54B-Instruct improves substantially, reaching weighted agreement of 0.596 and rank correlation of 0.813, but still falls below the weakest general-purpose frontier model in the main evaluation on weighted agreement.

The lesson is not that legal fine-tuning is useless. The larger Saul model improves dramatically over the smaller one, and legal specialisation may matter more in future systems with better scale, retrieval, and task-specific tuning. But AIReg-Bench punctures the lazy procurement assumption that “legal model” automatically means “better legal judgment.”

A model badge is not an assessment method. Fine-tuning on legal material may help, but the compliance task here requires reading technical documentation, mapping facts to a specific statutory obligation, calibrating probability, and producing a structured justification. That is a composite capability. It is not solved by sprinkling legal text over a small model and calling it counsel.

The alternative annotator test is promising, not permission

The appendix includes an alternative annotator test based on whether a model could substitute for a human annotator while preserving overall labels. Treat this as a robustness or exploratory extension, not a second thesis.

The results are striking. Gemini 2.5 Pro has a winning rate of 1.0000 and an average advantage probability of 0.9111. GPT-5 and Grok 4 each reach a winning rate of 0.6667. Most other models score 0.0000 on winning rate, despite some having respectable average advantage probabilities.

This supports the broader finding that top models can behave like useful annotators under benchmark conditions. It does not imply that a company should replace legal reviewers with Gemini and a dashboard. The test operates over a discrete label setup, with a fixed dataset, a subset of items, and comparisons against remaining annotators. It asks whether labels stay close in a benchmark procedure. It does not test accountability, professional responsibility, adversarial manipulation, document completeness, or the ability to ask follow-up questions.

That last point is large. Real compliance assessments are not single-turn scoring exercises. They are investigations. Reviewers ask for missing logs, clarify intended use, challenge data representativeness, inspect fallback controls, and compare provider claims against engineering reality. AIReg-Bench intentionally narrows the task so it can be measured. Measurement requires simplification. Deployment punishes forgetting that simplification happened.

What the failure patterns teach compliance teams

Appendix J aggregates expert annotation themes and uses Gemini 2.5 Pro to identify recurring shortcomings by article. This is exploratory, but practically useful because it translates abstract compliance into common document failure modes.

For Article 9, weak documents often fail to show continuous lifecycle risk management and shift monitoring responsibility to users. For Article 10, they fail to mitigate known data issues such as representativeness gaps and bias, or they justify those gaps weakly. For Article 12, they omit intermediate decision logging, often citing performance or privacy concerns. For Article 14, they lack built-in human oversight controls and shift responsibility to deployers. For Article 15, they under-describe continuous robustness, cybersecurity, fallback, and monitoring measures.

Those patterns are exactly where business review should focus. The useful question is not “does the document sound mature?” It is: does the document show that the provider has operationalised the obligation?

A mature Article 10 discussion should not merely admit a data gap; it should show mitigation, evidence, and residual risk handling. A credible Article 14 section should not simply say “humans remain in control”; it should specify intervention points, override mechanisms, risk information, and escalation procedures. Article 12 logging is not a decorative audit trail. It is the difference between traceability and vibes with timestamps.

This is where LLMs can be valuable as reviewer assistants. They can be trained or prompted to look for recurring evidence patterns: missing mitigations, responsibility shifts, unsupported claims, untested fallbacks, and thin monitoring plans. That is narrower than “legal compliance AI.” It is also far more deployable.

The business case is triage, procurement, and governance memory

The paper directly shows that some frontier models can approximate median expert scoring on AIReg-Bench. Cognaptus would infer three practical uses from that result.

First, triage. A compliance team can use a high-performing, calibrated model to sort incoming documentation into likely low-risk, likely problematic, and ambiguous queues. Humans still decide. The model reduces review latency and helps scarce experts spend time where marginal judgment matters.

Second, procurement. AIReg-Bench provides a template for comparing models on a task that actually resembles compliance work. Weighted agreement, rank correlation, bias, MAE, exact match rate, over-scoring, and under-scoring are all more useful than a generic “legal reasoning” score. The Pareto frontier analysis in the paper also matters: model choice is not only quality; it is quality per unit cost.
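A quality-versus-cost comparison of this kind reduces to finding the candidates that no other candidate beats on both axes. The sketch below shows that filter; the model names, quality values, and costs are made-up placeholders, not figures from the paper, and `pareto_frontier` is an illustrative name.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated on (quality, cost).

    candidates: list of (name, quality, cost) tuples. Higher quality is
    better, lower cost is better. A candidate is dominated when another
    is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, quality, cost in candidates:
        dominated = any(
            (q2 >= quality and c2 <= cost) and (q2 > quality or c2 < cost)
            for _, q2, c2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Hypothetical agreement scores and per-document costs, for illustration only.
    options = [
        ("model_a", 0.86, 10.0),
        ("model_b", 0.85, 4.0),
        ("model_c", 0.80, 6.0),
        ("model_d", 0.62, 1.0),
    ]
    print(pareto_frontier(options))
```

In this toy input `model_c` drops out because `model_b` is both better and cheaper. A procurement team would still want to apply the bias filter separately, since a cheap model on the frontier can be an over-scorer.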

Third, governance memory. Because the benchmark includes scores and textual justifications, it points toward a practical internal pattern: store model assessments, human overrides, rationale deltas, and final review outcomes. Over time, that creates a house calibration set. The organisation can measure whether its AI reviewer is drifting, over-approving, over-escalating, or failing on particular obligations.
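A house calibration set can start as a very plain record type. The sketch below is one possible shape, assuming each review stores the model's score next to the final human score; the field names and the `over_approval_by_article` helper are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRecord:
    document_id: str
    article: str            # e.g. "Article 10"
    model_version: str      # which model and prompt version produced the score
    model_score: int        # 1-5 score from the model
    human_score: int        # 1-5 score from the final human reviewer
    rationale_delta: str = ""  # short note on where the human disagreed


def over_approval_by_article(records):
    """Share of reviews per article where the model scored above the human."""
    totals = defaultdict(int)
    overs = defaultdict(int)
    for r in records:
        totals[r.article] += 1
        if r.model_score > r.human_score:
            overs[r.article] += 1
    return {article: overs[article] / totals[article] for article in totals}


if __name__ == "__main__":
    log = [
        ReviewRecord("doc-1", "Article 10", "v1", 4, 3, "mitigation not evidenced"),
        ReviewRecord("doc-2", "Article 10", "v1", 3, 3),
        ReviewRecord("doc-3", "Article 14", "v1", 2, 4, "override controls present"),
    ]
    print(over_approval_by_article(log))
```

Tracking the rate per article rather than overall is deliberate: a model can look calibrated on average while being consistently generous on one obligation.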

That last point matters more than many executives realise. The first deployment question is not “can the model review documents?” It is “can we prove how the model behaved, when it was wrong, and how our humans corrected it?” A compliance assistant without audit memory is just a confident intern with API access. Adorable, until discovery.

A disciplined deployment pattern

A reasonable enterprise workflow would look like this:

1. Plausibility gate. Before compliance scoring, assess whether the document is coherent, complete enough, technically credible, and internally consistent.
2. Article-specific scoring. Inject the exact relevant legal provision and require a 1–5 probability-of-compliance score with evidence-linked rationale.
3. Bias-aware model selection. Prefer models whose over-scoring risk is acceptable for the organisation’s risk appetite; do not optimise only for average agreement.
4. Human escalation bands. Send low scores, mid-range ambiguity, thin rationales, and model disagreements to expert review.
5. House calibration. Compare model scores against internal legal reviewers on a recurring sample of the company’s own documentation.
6. Versioned governance. Re-run calibration when model versions, legal standards, templates, or product categories change.
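The escalation step in that workflow can be written as a small routing rule. The sketch below is a minimal illustration; the thresholds (a score of 2 or below, a score of exactly 3, fewer than 40 words of rationale, a gap of two points between models) are assumptions that each organisation would need to set from its own calibration data, and `route` is a hypothetical name.

```python
def route(score, rationale, other_model_score=None, min_rationale_words=40):
    """Decide which human queue a model assessment goes to, and why.

    Returns (queue, reasons). Every document still reaches a human; the
    routing only decides how senior and how soon.
    """
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")

    reasons = []
    if score <= 2:
        reasons.append("low_score")
    elif score == 3:
        reasons.append("ambiguous_score")
    if len(rationale.split()) < min_rationale_words:
        reasons.append("thin_rationale")
    if other_model_score is not None and abs(score - other_model_score) >= 2:
        reasons.append("model_disagreement")

    queue = "expert_review" if reasons else "standard_review"
    return queue, reasons


if __name__ == "__main__":
    print(route(2, "Logging is limited to final outputs."))
    print(route(4, "word " * 60, other_model_score=2))
```

Returning the reasons along with the queue keeps the routing inspectable, which is what lets thresholds be adjusted later against recorded human overrides.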

This is not glamorous. That is why it might work.

The governance value comes from turning compliance AI into a measured control process. Inputs are logged. Prompts are standardised. Article text is versioned. Scores are compared against humans. Overrides are recorded. Thresholds are adjusted. Nobody declares a document compliant because a model emitted a tidy paragraph with “therefore” in it.

The boundary conditions are not footnotes; they are the product requirements

AIReg-Bench is scoped deliberately, and those boundaries should become product requirements for anyone using it as inspiration.

The documentation is synthetic, although expert-vetted for plausibility. The task is single-turn, although real assessments are multi-turn. The input is technical documentation excerpts, although real conformity assessment may involve source code, logs, staff interviews, testing evidence, risk registers, and post-market monitoring plans. The models are restricted to supplied materials, while human annotators may consult external sources. The legal target is also time-bound: the authors frame the benchmark as a snapshot of AI Act conformity assessment as of September 2025, before later standards, guidance, amendments, or court interpretations can fully settle the terrain.

None of these limitations makes the benchmark weak. They make it honest.

The mistake would be to treat a narrow benchmark as a broad deployment licence. AIReg-Bench gives us evidence that LLMs can approximate expert compliance judgments under controlled conditions. It does not show that they can assume professional responsibility, resist adversarial documentation, resolve unsettled legal interpretation, or conduct a complete conformity assessment.

Executives should like the paper precisely because it refuses to hand them a magic wand. Magic wands are difficult to audit.

The verdict: useful ruler, not rubber stamp

AIReg-Bench moves AI compliance tooling from anecdote to measurement. That is the contribution. The dataset gives researchers a way to compare models on EU AI Act compliance scoring. The generation pipeline offers a reusable method for creating plausible, controlled compliance samples where real documentation is scarce. The model evaluation shows that frontier systems can align closely with expert judgments, while the ablations and appendices show why the alignment depends on context, calibration, and scope.

For business users, the conclusion is neither “LLMs are useless for compliance” nor “LLMs can replace lawyers.” Both positions are comfortably lazy.

The better conclusion is more operational: LLMs can become useful compliance triage infrastructure when they are constrained by article-specific context, measured against human reviewers, monitored for bias, and kept away from final sign-off authority. They can help teams find weak documentation earlier, route ambiguous cases faster, and build a memory of review decisions across products and vendors.

That is not a compliance cop. It is a very fast junior reviewer with a scoreboard, a tendency profile, and hopefully an adult in the room.

Which, frankly, is already progress.

Cognaptus: Automate the Present, Incubate the Future.

¹ Bill Marino, Rosco Hunter, Christoph Schnabl, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Felix Steffek, Hongkai Wen, and Nicholas D. Lane, “AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance,” arXiv:2510.01474, 2025, https://arxiv.org/abs/2510.01474.
