## Opening — Why this matters now
There is a quiet bottleneck in modern mathematics—and by extension, in any field that depends on complex reasoning: verification doesn’t scale.
The academic system still runs on trust. Referees don’t fully check proofs; they check whether proofs feel correct. That works—until it doesn’t. As the paper notes, even widely accepted results can contain errors simply because no one has the time (or incentive) to verify them line by line.
Now insert AI into that equation.
Not as a theorem prover. Not as a co-pilot.
But as a distributed workforce of 30,000 agents, collectively formalizing an entire graduate-level textbook—in one week.
That’s not incremental progress. That’s a structural shift.
## Background — From proving theorems to engineering proofs
Formal mathematics has always had an elegant promise: reduce trust to code.
Instead of believing a proof, you verify it through a proof assistant like Lean. The catch? It’s painfully expensive. Decades of effort have produced only partial coverage of mathematical knowledge (e.g., mathlib’s ~2.2 million lines).
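The promise "reduce trust to code" is literal: if Lean's kernel accepts a proof, no referee has to take the author's word for it. A minimal standalone Lean 4 illustration (my own example, not from the paper or mathlib):

```lean
-- A statement plus a machine-checked proof. If this compiles,
-- the result is verified; there is nothing left to "trust".
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A gap cannot hide: an unfinished proof must use `sorry`,
-- which the compiler flags explicitly.
-- theorem unchecked (a : Nat) : a + 1 = a := sorry
```

The cost the paper highlights is that writing such proofs for real mathematics, rather than toy lemmas, has historically taken expert-years per textbook chapter.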
Historically, progress followed three phases:
| Generation | Approach | Limitation |
|---|---|---|
| 1. Specialized models | Solve isolated problems | No scalability to real repositories |
| 2. Single-agent coding | Use LLMs as theorem provers | Bottlenecked by sequential work |
| 3. Multi-agent systems | Parallel agents on shared codebase | Coordination chaos |
The paper focuses on solving the last problem: coordination.
Because scaling intelligence is easy. Scaling collaboration is not.
## Analysis — The system that actually works
Instead of inventing a novel AI architecture, the authors did something almost disappointingly pragmatic:
They copied software engineering.
### The core insight
Treat formalization as a software project, not a reasoning problem.
This leads to a surprisingly simple architecture:
| Component | Role |
|---|---|
| Sketcher agents | Convert textbook into definitions + theorem statements |
| Prover agents | Fill in proofs |
| Reviewer agents | Enforce correctness and code quality |
| Maintainer agents | Refactor and resolve structural issues |
| Triage / Scan / Progress agents | Manage coordination and direction |
And the infrastructure?
- Git-based version control (with branches and PRs)
- Merge queues
- Issue tracker (as a filesystem)
- Continuous integration (build must pass)
In other words: GitHub, but populated entirely by AI workers.
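The "issue tracker as a filesystem" idea is easy to emulate: each issue is a file, and claiming a task is an atomic file creation, so two agents can never grab the same work item. A toy sketch under my own assumptions about how claiming might work (the paper does not specify this mechanism in detail):

```python
import os
import tempfile

def claim(issue_dir: str, issue: str, agent: str) -> bool:
    """Atomically claim an issue for one agent.

    O_CREAT | O_EXCL guarantees the open fails if the claim file
    already exists, so exactly one agent wins the race.
    """
    lock = os.path.join(issue_dir, f"{issue}.claimed")
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, agent.encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

issue_dir = tempfile.mkdtemp()
assert claim(issue_dir, "formalize-ch3", "agent-17") is True
assert claim(issue_dir, "formalize-ch3", "agent-42") is False  # already claimed
```

Everything else in the stack (branches, PRs, merge queues, CI) is off-the-shelf software-engineering infrastructure, which is precisely the paper's point.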
### Why this matters
Most “agentic AI” discussions obsess over reasoning chains.
This system proves something more uncomfortable:
You don’t need better reasoning. You need better organization of mediocre reasoning at scale.
## Findings — What happens when you scale agents
### 1. The headline result
| Metric | Result |
|---|---|
| Textbook size | ~500 pages |
| Time | 1 week |
| Agents | ~30,000 |
| Code generated | ~130,000 lines |
| Theorems formalized | 340 / 340 |
| Cost | ~$100K |
All target theorems were successfully formalized and verified.
Not approximated. Not summarized.
Proved.
### 2. The real story: efficiency vs chaos
The system wasn’t clean. It was… industrial.
From the data (Tables 2 and 3 in the paper):
| Outcome Type | Observation |
|---|---|
| Successful merges | Thousands of small PRs (~100 lines each) |
| Aborted agents | Massive early inefficiency due to system issues |
| Blocked agents | Dependency issues created bottlenecks |
| Reviewer loops | Iterative refinement was essential |
A striking insight:
Nearly half of compute was wasted early on due to orchestration inefficiencies.
This isn’t an AI problem.
It’s an operations problem.
### 3. The cost curve is already competitive
The system cost roughly:
- ~$200 per page
- ~$300 per theorem
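Those per-unit figures follow directly from the headline numbers above:

```python
# Headline figures from the paper's result table.
total_cost = 100_000   # ~$100K total spend
pages = 500            # ~500-page textbook
theorems = 340         # all 340 target theorems formalized

cost_per_page = total_cost / pages        # 200.0
cost_per_theorem = total_cost / theorems  # ~294.1

print(f"${cost_per_page:.0f} per page, ${cost_per_theorem:.0f} per theorem")
```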
That’s comparable to, or cheaper than, hiring expert mathematicians.
And the authors estimate 3–10x efficiency improvements are achievable without better models.
Translation: we’re still early in the cost curve.
### 4. The biggest failure mode: coherence
Scaling agents introduces a subtle but critical issue: semantic fragmentation.
Examples from the paper:
- The same concept defined multiple times across different modules
- Slightly incorrect definitions still producing valid local proofs
- Agents solving the wrong problems (“rabbit holes”)
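A hypothetical Lean sketch of the first failure mode, duplicate definitions (my own illustration, not code from the paper): two agents formalize "even" independently, and each version admits perfectly valid local proofs.

```lean
-- Agent 1's definition:
def EvenA (n : Nat) : Prop := ∃ k, n = 2 * k
-- Agent 2's definition of the same concept:
def EvenB (n : Nat) : Prop := n % 2 = 0

-- Both check locally:
theorem four_evenA : EvenA 4 := ⟨2, rfl⟩
theorem four_evenB : EvenB 4 := rfl

-- But every lemma proved about `EvenA` is unusable on `EvenB` goals
-- until someone notices the duplication and proves a bridging lemma
-- `EvenA n ↔ EvenB n`. Each module is correct; the repository drifts.
```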
This leads to a paradox:
The system can be locally correct and globally inconsistent.
Which is… uncomfortably similar to large organizations.
## Implications — This isn’t about math
It’s tempting to view this as a niche result in formal mathematics.
It isn’t.
This is a blueprint for AI-native production systems.
### 1. Knowledge work becomes orchestration
The bottleneck is no longer intelligence.
It’s:
- Task decomposition
- Dependency tracking
- Coordination protocols
In short: management.
Except your workforce is non-human.
### 2. Verification becomes programmable
If proofs can be verified this way, so can:
- Financial models
- Legal contracts
- Regulatory compliance
- Engineering systems
The implication is profound:
Trust shifts from institutions to verification pipelines.
### 3. Infrastructure > models
The paper quietly makes a strategic claim:
The limiting factor is not model intelligence, but system orchestration.
This aligns with what we’re already seeing in AI markets:
- Models commoditize
- Workflows differentiate
The winners won’t be those with the smartest models.
They’ll be the ones with the best agent architectures.
### 4. The compounding effect: data → models → data
Formalized mathematics creates:
- High-quality training data
- Better reasoning models
- More formalization
A feedback loop.
Which suggests something slightly unsettling:
AI systems may eventually become the primary producers and validators of knowledge.
## Conclusion — From mathematicians to machine-managed knowledge
The paper ends with a deceptively simple statement:
The bottleneck is no longer reasoning—it’s coordination.
That sentence should make you pause.
Because it reframes the entire AI landscape.
We’re not moving toward smarter individual systems.
We’re moving toward structured collectives of moderately capable systems that, when orchestrated correctly, outperform human teams in both speed and cost.
Mathematics is just the test case.
The real target is everything else.
Cognaptus: Automate the Present, Incubate the Future.