When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

TL;DR for operators

NomicLaw is best read as an audit harness, not as a prototype parliament for machines. The paper puts ten open-source LLMs into a simplified lawmaking game: propose a rule, justify it, vote on one proposal, accumulate points, repeat. That mechanism turns vague questions about “AI deliberation” into measurable traces: self-voting, reciprocity, coalition switching, vote volatility, first-mover effects, winner mentions, and shifts in legal-rhetorical framing.¹

The operational result is useful because it cuts against a common demo illusion. A room full of AI agents can appear deliberative while merely echoing one model’s favourite rhetoric. In NomicLaw, heterogeneous groups self-voted less, switched votes more often, and used a wider mix of jurisprudential themes than homogeneous same-model groups. That is not proof of wisdom. It is proof that model diversity can make synthetic agreement harder to hide. Progress, admittedly, but let us not confuse “less mirror-like” with “legally sound.”

The strongest business use is pre-deployment stress testing for legaltech, regtech, policy drafting, compliance review, and governance tooling. A firm experimenting with multi-model drafting assistants could use NomicLaw-style metrics to ask: Are the models endorsing themselves? Are they anchoring on the first proposal? Are they changing reasons when they change votes? Are they converging on justice-sounding language while ignoring accountability, transparency, or harm? These are audit questions, not procurement slogans.

The boundary is equally important. The paper does not show that LLMs understand law, generate legally valid rules, or can replace human legal judgment. The experiments use simplified incentives, four AI-governance vignettes, a fixed open-source model pool, limited runs, automated thematic coding with partial human validation, and no expert assessment of legal quality. NomicLaw is a lens. It is not a legislature, thank heavens.

The real product is the loop, not the legal prose

Legal AI demos usually show the output: a policy clause, a risk memo, a draft rule, a neat paragraph that sounds as though it has survived three committees and a judicial review. NomicLaw asks a more useful question: what happens before the paragraph wins?

The framework is built around a deliberately simple loop. Each game starts with a legal vignette, drawn from AI-governance dilemmas such as an AI-created symphony, patterned discrimination, a self-driving collision, and social graph scanning. In each round, every agent proposes a new legal rule, justifies the rule, votes for exactly one proposal, and explains the vote. Self-voting is allowed. Winning proposals receive 10 points; tied or undecided outcomes receive 5 points. All agents can see prior proposals, votes, justifications, and cumulative scores. The game runs for five rounds per vignette.
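The scoring rule above is small enough to sketch. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the tie handling (every proposer tied for first place receives 5 points) is an assumption, since the article only says that tied or undecided outcomes receive 5 points.

```python
from collections import Counter

WIN_POINTS = 10  # from the article: a winning proposal earns 10 points
TIE_POINTS = 5   # from the article: tied or undecided outcomes earn 5 points


def score_round(votes):
    """Score one round; `votes` maps each voter to the agent whose proposal it backed.

    Assumption: a proposal wins on a strict plurality, and on a tie for
    first place every tied proposer receives TIE_POINTS.
    """
    tally = Counter(votes.values())
    top = max(tally.values())
    leaders = [agent for agent, count in tally.items() if count == top]
    if len(leaders) == 1:
        return {leaders[0]: WIN_POINTS}
    return {agent: TIE_POINTS for agent in leaders}


def play_game(rounds):
    """Accumulate scores over the rounds of one vignette."""
    totals = Counter()
    for votes in rounds:
        totals.update(score_round(votes))
    return dict(totals)
```

Because self-voting is allowed, a voter may appear as its own target in `votes`; the tally treats that vote like any other.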

That design matters because it creates a miniature incentive economy. The agents are not simply asked to answer a legal question. They are asked to persuade, endorse, defect, repeat, and adapt under visibility. The paper’s value begins there. Without the loop, “LLMs debate law” is just more prompt theatre. With the loop, the authors can observe whether agents back themselves, reward supporters, shift blocs, defer to winners, or reframe their arguments when voting.

The experiment compares two configurations. In the homogeneous setting, five agents use the same underlying LLM. Each of ten open-source models is tested once per vignette, producing 1,000 observations. In the heterogeneous setting, ten agents each use a different model, evaluated across six runs per vignette, for 24 runs and 1,200 agent-round observations. The model pool includes Phi4, Phi4-Reasoning, Phi4-Mini-Reasoning, Gemma3, Gemma2, Llama3, Llama2, Qwen3, Granite3.3, and DeepSeek-R1, orchestrated through Ollama under identical settings.
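The observation counts follow directly from the design. A quick arithmetic check, assuming one observation means one agent in one round (the article's “agent-round” wording) and using the five-round game length stated earlier:

```python
ROUNDS_PER_GAME = 5
VIGNETTES = 4
MODELS = 10

# Homogeneous: each model plays each vignette once with five same-model agents.
homogeneous_games = MODELS * VIGNETTES
homogeneous_obs = homogeneous_games * ROUNDS_PER_GAME * 5

# Heterogeneous: six runs per vignette, ten agents per run, one agent per model.
heterogeneous_runs = 6 * VIGNETTES
heterogeneous_rounds = heterogeneous_runs * ROUNDS_PER_GAME
heterogeneous_obs = heterogeneous_rounds * MODELS

print(homogeneous_obs, heterogeneous_runs, heterogeneous_rounds, heterogeneous_obs)
# prints: 1000 24 120 1200
```

The 120 heterogeneous rounds are the denominator behind the win rates reported later in the article.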

The paper is careful about inference. Homogeneous sessions are treated descriptively because same-model duplication does not provide the between-agent variation needed for formal hypothesis testing. Statistical tests are reserved for the heterogeneous condition. That distinction is not a minor methodological footnote. It stops the reader from treating every side-by-side chart as equally causal, which is a surprisingly useful habit in AI evaluation.

NomicLaw turns “trust” into observable behaviour

Trust is a dangerous word around machines. It tends to arrive wearing a lab coat and leave with someone approving an automated workflow they do not understand.

NomicLaw avoids some of that fog by translating trust-like behaviour into measurable interaction patterns. The metrics are not perfect proxies for human trust, but they are inspectable. That is already better than asking whether an LLM “seems collaborative”, a test that mostly measures the reader’s tolerance for polished prose.

Metric	What it measures in the game	What operators can learn	What it does not prove
Self-Vote Rate	How often an agent votes for its own proposal	Whether the system is producing self-serving or insular endorsement patterns	That low self-voting equals good legal judgment
Win Rate	How often a model’s proposal is selected	Which models persuade the group under the tested protocol	That the winning proposal is legally correct
Reciprocity Index	Whether agents return votes to prior supporters	Whether endorsement is becoming tit-for-tat	That reciprocity reflects genuine trust
Coalition Switch Rate	How often agents move in or out of winning blocs	Whether coalitions are fluid or locked-in	That fluidity improves rule quality
Vote Volatility	How often agents change vote targets between rounds	Whether deliberation disrupts earlier preferences	That changed votes are well reasoned
Edge Density	How dense the voting graph becomes	Whether the group is broadly cross-endorsing or clustering tightly	That dense agreement is reliable
Theme Consistency	Whether legal themes persist from proposal to vote	Whether agents reframe arguments across roles	That the themes are substantively valid
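Two of the metrics in the table can be computed from nothing more than a vote log. The sketch below assumes a hypothetical log format (one dict per round, mapping voter to vote target) and uses plain-reading definitions of the table rows. The paper's exact formulas may differ.

```python
def self_vote_rate(rounds, agent):
    """Share of rounds in which `agent` voted for its own proposal."""
    return sum(votes[agent] == agent for votes in rounds) / len(rounds)


def vote_volatility(rounds, agent):
    """Share of consecutive round pairs in which `agent` changed its vote target.

    Needs at least two rounds, otherwise there is no pair to compare.
    """
    pairs = list(zip(rounds, rounds[1:]))
    return sum(prev[agent] != cur[agent] for prev, cur in pairs) / len(pairs)


# Hypothetical three-agent, three-round log for illustration only.
log = [
    {"a": "a", "b": "a", "c": "c"},
    {"a": "b", "b": "a", "c": "c"},
    {"a": "a", "b": "c", "c": "c"},
]
print(self_vote_rate(log, "c"), vote_volatility(log, "b"))
# prints: 1.0 0.5
```

Agent `c` here is the mirror case: it always backs itself and never moves. Agent `b` never self-votes and changes target in one of its two transitions.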

This is the mechanism-first reading of the paper: the numbers only make sense once the game is understood as an instrument. NomicLaw is not measuring legal wisdom directly. It is measuring the behavioural residue left by models trying to win a simplified lawmaking game.

That residue is where the interesting findings sit.

Heterogeneous groups behave less like mirrors

The most operationally relevant result is that model diversity changes the deliberation pattern.

In heterogeneous cohorts, Self-Vote Rates remain low, from 0.03 ± 0.07 for Gemma3 to 0.44 ± 0.00 for Llama3, with most models between 0.15 and 0.33. In homogeneous sessions, self-voting rises sharply for some models: Llama2 reaches 0.87 ± 0.30, Qwen3 0.74 ± 0.12, and Gemma2 0.55 ± 0.46.

The interpretation is not that heterogeneous panels become noble little democrats. It is that diversity disrupts reflexive self-endorsement. When every agent shares the same model backbone, the group has a stronger tendency to endorse familiar patterns. The committee becomes a mirror with name tags.

Coalition metrics tell a similar story, though with a useful twist. Heterogeneous groups show moderate reciprocity, with RI = 0.16 ± 0.06. Homogeneous groups jump to RI = 0.45 ± 0.14. At first glance, more reciprocity might sound better. In this game, it may also mean more tit-for-tat. Same-model agents are more likely to return support, creating a denser endorsement pattern that can look like trust but may simply be behavioural lock-in.
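One way to make “returning support” concrete is sketched below. The definition is an assumption made for illustration: an agent is eligible in a round if at least one other agent voted for it in the previous round, and it reciprocates if its vote goes to one of those supporters. The paper's Reciprocity Index may be defined differently.

```python
def reciprocity_index(rounds):
    """Share of eligible votes that go back to a previous-round supporter.

    `rounds` is a list of dicts mapping voter -> vote target (hypothetical format).
    """
    eligible = 0
    reciprocated = 0
    for prev, cur in zip(rounds, rounds[1:]):
        for agent, target in cur.items():
            # Other agents who backed `agent` in the previous round.
            supporters = {v for v, t in prev.items() if t == agent and v != agent}
            if not supporters:
                continue
            eligible += 1
            reciprocated += target in supporters
    return reciprocated / eligible if eligible else 0.0


# Hypothetical two-round log: a and c return support, b votes for itself instead.
log = [
    {"a": "b", "b": "c", "c": "a"},
    {"a": "c", "b": "b", "c": "b"},
]
print(round(reciprocity_index(log), 2))
# prints: 0.67
```

Under this reading, a high index in same-model groups is exactly the tit-for-tat pattern described above: votes follow earlier votes, whatever the proposals say.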

Vote volatility is higher under heterogeneity: agents revise their votes in 0.72 ± 0.24 of consecutive rounds, compared with 0.43 ± 0.28 in homogeneous groups. Again, this is not automatically good. A group that changes its mind constantly may be adaptive, confused, or easily swayed. But for audit purposes, volatility is informative. It tells operators whether the system is actually responding to peer proposals or merely settling into a comfortable consensus after the first round.

Network density also separates the configurations. Heterogeneous edge density is low at 0.11 ± 0.00, while homogeneous edge density more than doubles to 0.25 ± 0.00. Dense voting in same-model groups may look collaborative, but the paper’s evidence suggests it can reflect intra-model endorsement rather than broad deliberative diversity.
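Edge density deserves a sanity check before it is read as behaviour. The sketch below assumes a per-round directed definition (distinct voter-to-target edges divided by the number of ordered pairs of distinct agents). That definition is an assumption, not something the article confirms. Under it, one vote per agent caps the density at 1 / (n − 1).

```python
def round_edge_density(votes):
    """Directed density of one round's voting graph, self-votes excluded.

    `votes` maps voter -> vote target (hypothetical format).
    """
    n = len(votes)
    if n < 2:
        return 0.0
    edges = {(voter, target) for voter, target in votes.items() if voter != target}
    return len(edges) / (n * (n - 1))


def density_ceiling(n_agents):
    """Maximum per-round density when every agent casts exactly one vote."""
    return 1 / (n_agents - 1)


print(round(density_ceiling(10), 2), round(density_ceiling(5), 2))
# prints: 0.11 0.25
```

The reported 0.11 and 0.25 equal these ceilings for ten-agent and five-agent groups. If the paper uses a per-round definition like this one, much of the gap would follow from group size rather than from how the agents behave. That reading is an inference from the arithmetic, so the paper's own definition should be checked before the metric is used to compare groups of different sizes.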

The first-mover result completes the picture. In heterogeneous cohorts, first-mover win rate remains low at 0.12 ± 0.05. In homogeneous groups it rises to 0.25 ± 0.08. Same-model groups are more vulnerable to anchoring on early proposals. Apparently, when five copies of a model hear a familiar-sounding first argument, they do not always become a legislature. Sometimes they become a very agreeable meeting.

Persuasion is uneven, and the hierarchy is measurable

The heterogeneous condition produces a clear win-rate hierarchy. Out of 120 voting rounds, DeepSeek-R1 wins 21 times, for a win rate of 0.175. Llama2 follows with 16 wins, or 0.133. Phi4-Reasoning wins 13 times; Granite3.3 and Phi4-Mini-Reasoning win 12 each. At the lower end, Qwen3 wins twice, while Gemma3 and Llama3 win once each. Thirty rounds are undecided.
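The win rates are simple ratios over the 120 rounds, and the counts also reveal what the summary leaves out. The check below uses only the figures quoted above. The single assumption is that every decided round has exactly one winner.

```python
TOTAL_ROUNDS = 120
UNDECIDED = 30

# Win counts as quoted in the article; Gemma2 and Phi4 are not itemised there.
reported_wins = {
    "DeepSeek-R1": 21,
    "Llama2": 16,
    "Phi4-Reasoning": 13,
    "Granite3.3": 12,
    "Phi4-Mini-Reasoning": 12,
    "Qwen3": 2,
    "Gemma3": 1,
    "Llama3": 1,
}

win_rates = {model: round(wins / TOTAL_ROUNDS, 3) for model, wins in reported_wins.items()}

# Rounds not accounted for by the listed models or by undecided outcomes.
unlisted_wins = TOTAL_ROUNDS - UNDECIDED - sum(reported_wins.values())

print(win_rates["DeepSeek-R1"], win_rates["Llama2"], unlisted_wins)
# prints: 0.175 0.133 12
```

Under that assumption, the remaining 12 wins are shared between Gemma2 and Phi4, the two models whose counts are not listed individually.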

The authors test whether wins are uniformly distributed and reject that null strongly: $\chi^2(9)=48$, $p=3 \times 10^{-7}$. Pairwise comparisons, adjusted with Benjamini–Hochberg, show DeepSeek-R1 significantly outperforming Gemma2, Gemma3, Llama3, Phi4, and Qwen3. Llama2 also outperforms Gemma2 and Qwen3, while both Phi4-Mini-Reasoning and Phi4-Reasoning significantly outperform Qwen3.
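The reported p-value can be reproduced from the test statistic alone. The helper below is a standard-library sketch with a hypothetical name. It works only for odd degrees of freedom and treats the rounded statistic of 48 as exact, so the result is approximate.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-square variable with odd df.

    Uses Q(1/2, z) = erfc(sqrt(z)) and the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1), with z = x / 2,
    where Q is the regularised upper incomplete gamma function.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this helper only handles odd, positive df")
    z = x / 2
    q = math.erfc(math.sqrt(z))
    a = 0.5
    while a < df / 2:
        q += z**a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


p_value = chi2_sf_odd_df(48, 9)
print(f"{p_value:.0e}")
# prints: 3e-07
```

The unrounded value is about 2.55e-07, which matches the reported figure to one significant figure. Where SciPy is available, `scipy.stats.chi2.sf(48, 9)` computes the same tail probability.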

The logistic regression sharpens the point. Using DeepSeek-R1 as the reference, Gemma2, Gemma3, Llama3, Phi4, and Qwen3 have significantly lower odds of winning. Llama2 and Phi4-Reasoning do not differ significantly from DeepSeek-R1 in that model. A generalised estimating equations (GEE) robustness check adds vignette as a covariate and clusters on run; none of the vignette coefficients reach significance, and the estimated intra-cluster correlation is effectively zero. In plain English: within this simplified setting, model identity matters more than which of the four legal vignettes is being debated.

For business readers, the lesson is not “use DeepSeek-R1 for law”. That would be a procurement conclusion trying to escape from a sandbox. The better lesson is that multi-model legal workflows will not be neutral just because they contain multiple models. Some models may dominate the persuasive process. Others may rarely shape the final output. A system architect who averages outputs or lets agents vote may accidentally create a rhetorical leaderboard, then mistake the winner for truth.

This matters for governance workflows. In compliance review, policy drafting, or legal risk triage, the most persuasive model may become the system’s de facto senior partner. That seniority may be conferred by the scoring protocol rather than earned through the quality of legal reasoning. A grimly efficient way to manufacture authority, but manufacturing authority is still not the same as earning it.

The rhetoric shifts when models face unlike peers

The paper’s qualitative analysis asks what kinds of legal reasons the agents use. The authors classify proposed rules, proposal reasoning, and vote justifications into ten jurisprudential themes: justice, legality, accountability, transparency, consent, harm, rights, utility, responsibility, and solidarity. The coding is done with LLMs and checked against human annotations on a 10% sample of 220 observations. Agreement is strongest for voting justifications, with $\kappa \geq 0.82$; rule labelling is around $\kappa \approx 0.74$ on average; proposal reasoning is more ambiguous, with $\kappa = 0.71$ for Llama3 and $\kappa = 0.61$ for Gemma3.
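For readers who have not met κ before, Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch follows. The labels are invented for illustration and are not taken from the paper's annotation sample.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical theme labels from a human annotator and an LLM coder.
human = ["justice", "justice", "harm", "legality", "harm", "justice"]
model = ["justice", "legality", "harm", "legality", "harm", "justice"]
print(round(cohens_kappa(human, model), 2))
# prints: 0.75
```

In this toy case the annotators agree on five of six items (0.83 raw agreement), but a third of that agreement is expected by chance, so κ lands at 0.75. The same gap between raw and chance-corrected agreement is why the lower κ values for proposal reasoning deserve attention.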

This is a reasonable exploratory coding pipeline, not a divine jurisprudence detector. The human validation helps, but the coding still depends partly on model-based classification. Treat the themes as structured signals, not as final legal interpretation.

The aggregate pattern is still revealing. Across setups, justice and legality dominate. In heterogeneous groups, justice accounts for roughly 40–60% of themes across proposals, reasoning, and voting justifications, while legality contributes another 15–25%. Homogeneous runs amplify the pattern: justice alone often exceeds 70% in proposal stages, with legality at 20–30%.

That is exactly the kind of result legal AI teams should care about. “Fairness” and “rule of law” are comfortable rhetorical furniture. Models can sit there for hours. The question is whether they move when the facts demand it.

Heterogeneous groups do move more. In the self-driving collision vignette, harm rises to 30–40% of reasoning themes. In patterned discrimination, harm falls below 10%. Accountability appears most in social graph scanning, where traceability and responsibility concerns drive nearly 20% of justifications. Homogeneous groups dampen those context effects: harm rarely exceeds 15%, and accountability stays under 10% across vignettes.

Consent and solidarity are rarer but informative. In heterogeneous runs, consent appears in 10–15% of proposal reasoning, peaking in personal-choice vignettes, while solidarity stays below 5%. Under homogeneity, both nearly vanish. Utility and transparency are rare across conditions, generally below 5%, though transparency reaches up to 8% during voting in heterogeneous settings and utility reaches 7% in AI-created symphony proposals.

The proposal-to-vote theme results make the mechanism even clearer. Under heterogeneity, agents frequently change their normative frame when moving from proposing to voting. The paper reports theme-change rates of 78–99% across models. In homogeneous groups, theme-change rates drop to 11.2–17.2%. In other words, diverse groups force reframing. Uniform groups preserve stance.
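A theme-change rate of this kind is easy to compute once every proposal and vote justification carries a theme label. The sketch below assumes a hypothetical record format of (model, proposal theme, vote theme) with a single dominant theme per text. The paper's coding scheme may allow more than one.

```python
from collections import defaultdict


def theme_change_rate_by_model(records):
    """Per-model share of agent-rounds where the vote theme differs from the proposal theme."""
    counts = defaultdict(lambda: [0, 0])  # model -> [changed, total]
    for model, proposal_theme, vote_theme in records:
        counts[model][0] += proposal_theme != vote_theme
        counts[model][1] += 1
    return {model: changed / total for model, (changed, total) in counts.items()}


# Invented records for illustration only.
records = [
    ("model-a", "justice", "harm"),
    ("model-a", "justice", "justice"),
    ("model-b", "harm", "legality"),
    ("model-b", "consent", "justice"),
]
print(theme_change_rate_by_model(records))
# prints: {'model-a': 0.5, 'model-b': 1.0}
```

A rate near 1.0 means an agent almost never votes in the same normative frame it proposed in. A rate near 0.1 means it rarely leaves that frame.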

This does not prove that heterogeneous panels reason better. It does suggest they expose more rhetorical surface area. For operators, that is valuable. A legal AI workflow that never leaves fairness-and-legality language may sound principled while missing harm, consent, accountability, or transparency. The model may be wearing a judge’s robe over a slogan generator. Stylish, perhaps. Not sufficient.

The appendix-style analyses are triangulation, not a second thesis

The paper also reports PCA and Ward hierarchical clustering over standardised voting-behaviour metrics. These analyses group models into three strategic families: “Collaborative Builders” such as DeepSeek-R1 and Llama2, with high reciprocity, coalition switching, and top win rates; “Competitive Soloists” such as Gemma2, Gemma3, and Llama3, with heavy self-voting, unstable alliances, and low wins; and “Stable Consistentists” such as Phi4 variants, Qwen3, and Granite3.3, clustering near the centre with cautious minority-position strategies.

The likely purpose of these analyses is exploratory triangulation. They help show that the behavioural metrics cohere into recognisable patterns. They do not replace the main evidence on self-voting, volatility, win rates, and thematic diversity. Nor do they prove stable model personalities. A “strategic family” in this paper means a pattern under one protocol, one incentive scheme, one model pool, and four vignettes. It should not be promoted into a universal personality test for LLMs. The industry already has enough pseudo-psychology, and most of it comes with dashboards.

Used carefully, however, the clustering idea is practical. A company deploying multiple models in legal or compliance workflows could classify agents by observed behaviour: dominant persuaders, self-endorsers, stable minority voices, volatile reframers, or low-engagement participants. That classification could inform orchestration. For example, a system might require a high-performing persuader’s proposal to be challenged by a model with a different thematic profile, or require human review whenever all agents converge too quickly around one rhetorical theme.
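The convergence rule in that example could be as simple as the guard below. The function name and both thresholds are hypothetical placeholders and are not taken from the paper. An operator would need to calibrate them against their own runs.

```python
from collections import Counter


def needs_human_review(vote_themes, round_index, *, dominance=0.8, early_round=2):
    """Flag a round in which agents converge early on a single rhetorical theme.

    vote_themes: one theme label per agent's vote justification in this round.
    round_index: 1-based round number within the game.
    dominance and early_round are illustrative thresholds, not validated values.
    """
    if not vote_themes:
        return False
    top_count = Counter(vote_themes).most_common(1)[0][1]
    top_share = top_count / len(vote_themes)
    return round_index <= early_round and top_share >= dominance


print(needs_human_review(["justice"] * 9 + ["harm"], round_index=1))
# prints: True
```

The check is deliberately crude: it flags the pattern for a person to inspect and does not try to judge whether the convergence is justified.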

That is the business move: not to rank models once and freeze the workflow, but to detect unhealthy deliberation patterns as they emerge.

What the paper directly shows, and what business should infer

NomicLaw’s practical value comes from keeping three layers separate: the paper’s evidence, Cognaptus’s operational inference, and the remaining uncertainty.

Layer	What can be said responsibly
Direct paper result	Heterogeneous LLM groups in NomicLaw self-vote less, switch votes more, show lower first-mover advantage, and use a wider range of jurisprudential themes than homogeneous same-model groups.
Direct paper result	Win rates are uneven in the heterogeneous condition, with DeepSeek-R1 and Llama2 leading under the tested protocol and several models rarely winning.
Direct paper result	Homogeneous groups show more self-support, denser voting links, lower vote volatility, and narrower rhetorical patterns centred heavily on justice and legality.
Cognaptus inference	Multi-model legal AI systems should be audited for synthetic consensus, anchoring, self-endorsement, model dominance, and thematic blind spots before being used in serious drafting or review workflows.
Cognaptus inference	Diversity among models is useful only if it changes observable deliberation; a nominal “panel” of agents can still behave like one overconfident voice in surround sound.
Remaining uncertainty	The study does not assess whether winning proposals are legally valid, practically implementable, or preferred by legal experts.
Remaining uncertainty	The results may change with other models, incentive schemes, legal domains, jurisdictions, amendment procedures, human participants, or stronger semantic evaluation.

For a legaltech or regtech operator, the immediate use case is not to automate lawmaking. It is to build a diagnostic layer around AI-assisted drafting.

A NomicLaw-inspired internal evaluation could ask five questions before any AI-drafted policy leaves the sandbox:

1. Does the system converge because multiple models independently support an argument, or because the same reasoning pattern is being echoed?
2. Does a specific model dominate outcomes across scenarios, and is that dominance justified by expert review?
3. Do agents change their votes after seeing better arguments, or merely reward prior supporters?
4. Are the legal themes responsive to the facts, or do they default to safe-sounding fairness language?
5. Does early framing anchor the entire debate?

These questions turn “multi-agent deliberation” from a slideware phrase into an auditable process. Not glamorous, but governance rarely is.

The boundary: this is a simulator, not a bench trial

The paper’s limitations are not decorative. They materially affect how the findings should be used.

First, NomicLaw simplifies lawmaking. There are no amendment cycles, appeals, institutional constraints, procedural rules, public comments, lobby pressures, judicial review, or jurisdiction-specific doctrine. Real legal systems are not five rounds of proposal and voting, though one understands the temptation to simplify them.

Second, the incentive structure is artificial. Agents are rewarded for winning or tying, not for legal validity, enforceability, democratic legitimacy, or downstream welfare. That makes the framework excellent for studying persuasion and coalition behaviour. It makes it weak as a direct measure of rule quality.

Third, formal statistical inference is confined to the heterogeneous condition. Homogeneous runs are treated descriptively because same-model duplication limits independent variation. The heterogeneous setting has six runs per vignette, which is useful but still limited.

Fourth, the thematic analysis is partly automated. Human validation on 10% of observations improves credibility, especially with strong agreement in voting justifications, but proposal reasoning is more nuanced. Automated labels may miss subtle legal strategies or inherit model-specific biases.

Fifth, the study does not yet determine whether proposals are substantively distinct or merely rephrased. That matters. A system may appear diverse because ten models use different rhetorical wrappers around the same policy idea. Semantic clustering and expert review would be needed to separate genuine substantive diversity from paraphrase theatre.

Finally, the paper does not show legal understanding. The authors explicitly frame NomicLaw as a research framework and audit lens. High win rates, coalitions, and persuasive justifications should not be mistaken for legal reasoning in the human sense. They are behavioural signals from statistical systems under a simplified protocol.

That boundary is not a weakness if the tool is used correctly. In fact, it is the point. The safest use of NomicLaw is to reveal when legal AI systems are persuasive, insular, volatile, anchored, or rhetorically narrow before those traits reach a client, regulator, court, or board memo.

The useful lesson is not “AI lawmakers”; it is “audit the room”

The tempting headline is that LLMs can play lawmaker. The better headline is that LLMs can make agreement look more meaningful than it is.

NomicLaw gives researchers and operators a way to inspect the room: who proposes, who votes, who follows, who reframes, who wins, and who merely sounds principled while avoiding harder themes. Its findings suggest that model heterogeneity can reduce self-support and broaden debate, but also that persuasive dominance remains uneven and may be driven by the protocol rather than by legal merit.

For business, this points to a sober design principle. Multi-agent legal AI should not be trusted because several models agree. It should be trusted, if at all, because its agreement has survived structured challenge, theme-level audit, expert review, and tests for anchoring, self-endorsement, and rhetorical monoculture.

A committee of machines is still made of machines. NomicLaw’s contribution is to make that committee leave fingerprints.

Cognaptus: Automate the Present, Incubate the Future.

¹ Asutosh Hota and Jussi P.P. Jokinen, “NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making,” arXiv:2508.05344, 2025, https://arxiv.org/html/2508.05344.
