## Opening — Why this matters now
There is a quiet bottleneck in modern mathematics—and by extension, in any field that depends on complex reasoning: verification doesn’t scale.
The academic system still runs on trust. Referees don’t fully check proofs; they check whether proofs feel correct. That works—until it doesn’t. As the paper notes, even widely accepted results can contain errors simply because no one has the time (or incentive) to verify them line by line.
Now insert AI into that equation.
Not as a theorem prover. Not as a co-pilot.
But as a distributed workforce of 30,000 agents, collectively formalizing an entire graduate-level textbook—in one week.
That’s not incremental progress. That’s a structural shift.
## Background — From proving theorems to engineering proofs
Formal mathematics has always had an elegant promise: reduce trust to code.
Instead of believing a proof, you verify it through a proof assistant like Lean. The catch? It’s painfully expensive. Decades of effort have produced only partial coverage of mathematical knowledge (e.g., mathlib’s ~2.2 million lines).
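The promise "reduce trust to code" is literal: if Lean's kernel accepts a proof, no referee has to take the author's word for it. A minimal standalone Lean 4 illustration (my own example, not from the paper or mathlib):

```lean
-- A statement plus a machine-checked proof. If this compiles,
-- the result is verified; there is nothing left to "trust".
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A gap cannot hide: an unfinished proof must use `sorry`,
-- which the compiler flags explicitly.
-- theorem unchecked (a : Nat) : a + 1 = a := sorry
```

The cost the paper highlights is that writing such proofs for real mathematics, rather than toy lemmas, has historically taken expert-years per textbook chapter.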
Historically, progress followed three phases:
| Generation | Approach | Limitation |
|---|---|---|
| 1. Specialized models | Solve isolated problems | No scalability to real repositories |
| 2. Single-agent coding | Use LLMs as theorem provers | Bottlenecked by sequential work |
| 3. Multi-agent systems | Parallel agents on shared codebase | Coordination chaos |
The paper focuses on solving the last problem: coordination.
Because scaling intelligence is easy. Scaling collaboration is not.
## Analysis — The system that actually works
Instead of inventing a novel AI architecture, the authors did something almost disappointingly pragmatic:
They copied software engineering.
### The core insight
Treat formalization as a software project, not a reasoning problem.
This leads to a surprisingly simple architecture:
| Component | Role |
|---|---|
| Sketcher agents | Convert textbook into definitions + theorem statements |
| Prover agents | Fill in proofs |
| Reviewer agents | Enforce correctness and code quality |
| Maintainer agents | Refactor and resolve structural issues |
| Triage / Scan / Progress agents | Manage coordination and direction |
And the infrastructure?
- Git-based version control (with branches and PRs)
- Merge queues
- Issue tracker (as a filesystem)
- Continuous integration (build must pass)
In other words: GitHub, but populated entirely by AI workers.
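The "issue tracker as a filesystem" idea is easy to emulate: each issue is a file, and claiming a task is an atomic file creation, so two agents can never grab the same work item. A toy sketch under my own assumptions about how claiming might work (the paper does not specify this mechanism in detail):

```python
import os
import tempfile

def claim(issue_dir: str, issue: str, agent: str) -> bool:
    """Atomically claim an issue for one agent.

    O_CREAT | O_EXCL guarantees the open fails if the claim file
    already exists, so exactly one agent wins the race.
    """
    lock = os.path.join(issue_dir, f"{issue}.claimed")
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, agent.encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

issue_dir = tempfile.mkdtemp()
assert claim(issue_dir, "formalize-ch3", "agent-17") is True
assert claim(issue_dir, "formalize-ch3", "agent-42") is False  # already claimed
```

Everything else in the stack (branches, PRs, merge queues, CI) is off-the-shelf software-engineering infrastructure, which is precisely the paper's point.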
### Why this matters
Most “agentic AI” discussions obsess over reasoning chains.
This system proves something more uncomfortable:
You don’t need better reasoning. You need better organization of mediocre reasoning at scale.
## Findings — What happens when you scale agents
### 1. The headline result
| Metric | Result |
|---|---|
| Textbook size | ~500 pages |
| Time | 1 week |
| Agents | ~30,000 |
| Code generated | ~130,000 lines |
| Theorems formalized | 340 / 340 |
| Cost | ~$100K |
All target theorems were successfully formalized and verified.
Not approximated. Not summarized.
Proved.
### 2. The real story: efficiency vs chaos
The system wasn’t clean. It was… industrial.
From the data (Tables 2 and 3 in the paper):
| Outcome Type | Observation |
|---|---|
| Successful merges | Thousands of small PRs (~100 lines each) |
| Aborted agents | Massive early inefficiency due to system issues |
| Blocked agents | Dependency issues created bottlenecks |
| Reviewer loops | Iterative refinement was essential |
A striking insight:
Nearly half of compute was wasted early on due to orchestration inefficiencies.
This isn’t an AI problem.
It’s an operations problem.
### 3. The cost curve is already competitive
The system cost roughly:
- ~$200 per page
- ~$300 per theorem
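Those per-unit figures follow directly from the headline numbers above:

```python
# Headline figures from the paper's result table.
total_cost = 100_000   # ~$100K total spend
pages = 500            # ~500-page textbook
theorems = 340         # all 340 target theorems formalized

cost_per_page = total_cost / pages        # 200.0
cost_per_theorem = total_cost / theorems  # ~294.1

print(f"${cost_per_page:.0f} per page, ${cost_per_theorem:.0f} per theorem")
```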
That’s comparable to, or cheaper than, hiring expert mathematicians.
And the authors estimate 3–10x efficiency improvements are achievable without better models.
Translation: we’re still early in the cost curve.
### 4. The biggest failure mode: coherence
Scaling agents introduces a subtle but critical issue: semantic fragmentation.
Examples from the paper:
- The same concept defined multiple times across different modules
- Slightly incorrect definitions still producing valid local proofs
- Agents solving the wrong problems (“rabbit holes”)
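A hypothetical Lean sketch of the first failure mode, duplicate definitions (my own illustration, not code from the paper): two agents formalize "even" independently, and each version admits perfectly valid local proofs.

```lean
-- Agent 1's definition:
def EvenA (n : Nat) : Prop := ∃ k, n = 2 * k
-- Agent 2's definition of the same concept:
def EvenB (n : Nat) : Prop := n % 2 = 0

-- Both check locally:
theorem four_evenA : EvenA 4 := ⟨2, rfl⟩
theorem four_evenB : EvenB 4 := rfl

-- But every lemma proved about `EvenA` is unusable on `EvenB` goals
-- until someone notices the duplication and proves a bridging lemma
-- `EvenA n ↔ EvenB n`. Each module is correct; the repository drifts.
```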
This leads to a paradox:
The system can be locally correct and globally inconsistent.
Which is… uncomfortably similar to large organizations.
## Implications — This isn’t about math
It’s tempting to view this as a niche result in formal mathematics.
It isn’t.
This is a blueprint for AI-native production systems.
### 1. Knowledge work becomes orchestration
The bottleneck is no longer intelligence.
It’s:
- Task decomposition
- Dependency tracking
- Coordination protocols
In short: management.
Except your workforce is non-human.
### 2. Verification becomes programmable
If proofs can be verified this way, so can:
- Financial models
- Legal contracts
- Regulatory compliance
- Engineering systems
The implication is profound:
Trust shifts from institutions to verification pipelines.
### 3. Infrastructure > models
The paper quietly makes a strategic claim:
The limiting factor is not model intelligence, but system orchestration.
This aligns with what we’re already seeing in AI markets:
- Models commoditize
- Workflows differentiate
The winners won’t be those with the smartest models.
They’ll be the ones with the best agent architectures.
### 4. The compounding effect: data → models → data
Formalized mathematics creates:
- High-quality training data
- Better reasoning models
- More formalization
A feedback loop.
Which suggests something slightly unsettling:
AI systems may eventually become the primary producers and validators of knowledge.
## Conclusion — From mathematicians to machine-managed knowledge
The paper ends with a deceptively simple statement:
The bottleneck is no longer reasoning—it’s coordination.
That sentence should make you pause.
Because it reframes the entire AI landscape.
We’re not moving toward smarter individual systems.
We’re moving toward structured collectives of moderately capable systems that, when orchestrated correctly, outperform human teams in both speed and cost.
Mathematics is just the test case.
The real target is everything else.
Cognaptus: Automate the Present, Incubate the Future.