When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

Large Language Models are increasingly touted as decision-making aides in policy and governance. But what happens when we let them loose together in a legislative sandbox? NomicLaw — an open-source multi-agent simulation inspired by the self-amending game Nomic — offers a glimpse into how AI agents argue, form alliances, and shape collective rules without human scripts.

The Experiment

NomicLaw pits LLM agents against legally charged vignettes — from self-driving car collisions to algorithmic discrimination — in a propose → justify → vote loop. Each agent crafts a legal rule, defends it, and votes on a peer’s proposal. Scoring is simple: 10 points for a win, 5 for a tie. Two configurations were tested:

  • Homogeneous: All agents are the same model.
  • Heterogeneous: Each agent uses a different LLM.

Ten open-source models participated, including DeepSeek-R1, Llama2, Phi4 variants, Gemma, Qwen3, and Granite3.3.
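The propose → justify → vote loop and its win/tie scoring can be sketched in a few lines. This is a minimal illustration, not NomicLaw's actual code: the agent names are placeholders, and `llm_propose`/`llm_vote` are stubs standing in for real model calls.

```python
import random
from collections import Counter

WIN_POINTS, TIE_POINTS = 10, 5  # scoring rule described above

def llm_propose(agent, vignette):
    # Stand-in for a real LLM call; a real agent drafts and justifies a rule.
    return f"{agent}: proposed rule for '{vignette}'"

def llm_vote(agent, proposals):
    # Stand-in: a real agent would weigh the justifications before voting.
    return random.choice(list(proposals))

def run_round(agents, vignette, seed=0):
    """One propose -> justify -> vote round; returns points per agent."""
    random.seed(seed)
    proposals = {a: llm_propose(a, vignette) for a in agents}
    votes = Counter(llm_vote(a, proposals) for a in agents)
    top = max(votes.values())
    winners = [a for a, v in votes.items() if v == top]
    points = WIN_POINTS if len(winners) == 1 else TIE_POINTS
    return {a: (points if a in winners else 0) for a in agents}
```

Swapping the stubs for calls to different model backends per agent is what distinguishes the heterogeneous configuration from the homogeneous one.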

What Emerged

1. Diversity disrupts echo chambers.

  • Heterogeneous groups had lower self-voting, more coalition changes, and richer thematic variety.
  • Homogeneous groups converged on justice/rule-of-law arguments, often ignoring scenario-specific nuances.

2. A clear performance hierarchy.

  • In mixed groups, DeepSeek-R1 and Llama2 dominated win rates.
  • Low performers (Gemma3, Llama3, Qwen3) rarely succeeded unless shielded by model uniformity.

3. Argument styles adapt to context — when models differ.

  • Diverse groups invoked Harm and Accountability more in risk-heavy cases.
  • Uniform groups stuck to procedural fairness, even when context called for other angles.

4. Strategic archetypes emerged.

  • Collaborative Builders: High reciprocity and wins (DeepSeek-R1, Llama2).
  • Competitive Soloists: Heavy self-voting, low wins (Gemma2/3, Llama3).
  • Stable Consistentists: Cautious agents holding consistent minority positions (Phi4 variants, Qwen3, Granite3.3).

5. First-mover advantage is real — but only in uniform groups.

  • Homogeneous sessions gave early proposals a 25% win rate vs. 12% in mixed.

Why It Matters for AI Governance

If LLMs are ever to assist in policy drafting, which models we put in the room matters. Model diversity:

  • Reduces groupthink.
  • Increases adaptability to case-specific risks.
  • Surfaces more balanced jurisprudential perspectives.

At the same time, high win rates don’t imply sound legal reasoning — LLMs still rely on statistical mimicry, not true legal understanding. This makes audit metrics (like those in NomicLaw) vital for spotting shallow consensus and ensuring AI remains an assistant, not an arbiter.
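Two such audit metrics are easy to compute from a round's ballots: the self-voting rate (how often agents back their own proposals) and vote concentration (how much of the vote the most popular proposal captures, a crude flag for shallow consensus). The sketch below assumes a simple `voter -> proposal author` mapping; it is an illustration of the idea, not NomicLaw's own metric code.

```python
from collections import Counter

def self_vote_rate(ballots):
    """Fraction of ballots where an agent voted for its own proposal.

    `ballots` maps each voter to the author of the proposal it backed.
    """
    return sum(voter == choice for voter, choice in ballots.items()) / len(ballots)

def vote_concentration(ballots):
    """Share of ballots captured by the most popular proposal.

    Values near 1.0 across many rounds can signal shallow consensus
    rather than genuine deliberation.
    """
    counts = Counter(ballots.values())
    return max(counts.values()) / len(ballots)
```

Tracked over many rounds, these numbers make patterns like the homogeneous groups' echo-chamber convergence visible without reading a single transcript.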

The Business Takeaway

For organisations exploring AI-mediated decision tools — whether in compliance, governance, or internal policy-making — NomicLaw’s results suggest:

  • Mix your models to broaden the debate and limit bias.
  • Measure interaction patterns to detect strategic dominance or alliance lock-in.
  • Keep humans in the loop to interpret and validate outcomes.

The future of AI in governance isn’t about replacing lawmakers. It’s about designing systems where diverse AI perspectives help humans deliberate more effectively — and knowing when those systems fall into the traps of their own making.


Cognaptus: Automate the Present, Incubate the Future