Opening — Why this matters now

There is a quiet bottleneck in modern mathematics—and by extension, in any field that depends on complex reasoning: verification doesn’t scale.

The academic system still runs on trust. Referees don’t fully check proofs; they check whether proofs feel correct. That works—until it doesn’t. As the paper notes, even widely accepted results can contain errors simply because no one has the time (or incentive) to verify them line by line.

Now insert AI into that equation.

Not as a theorem prover. Not as a co-pilot.

But as a distributed workforce of 30,000 agents, collectively formalizing an entire graduate-level textbook—in one week.

That’s not incremental progress. That’s a structural shift.


Background — From proving theorems to engineering proofs

Formal mathematics has always had an elegant promise: reduce trust to code.

Instead of believing a proof, you verify it through a proof assistant like Lean. The catch? It’s painfully expensive. Decades of effort have produced only partial coverage of mathematical knowledge (e.g., mathlib’s ~2.2 million lines).
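To make "reduce trust to code" concrete, here is a minimal Lean 4 sketch (not from the paper): the kernel either accepts the proof term or rejects the file at build time, with no room for "feels correct."

```lean
-- A proof the compiler checks mechanically: no referee judgment involved.
-- `Nat.add_comm` is the library lemma that addition on Nat commutes.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the proof term were wrong, the build would fail, which is exactly the property the CI gate below relies on.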

Historically, progress followed three phases:

| Generation | Approach | Limitation |
|---|---|---|
| 1. Specialized models | Solve isolated problems | No scalability to real repositories |
| 2. Single-agent coding | Use LLMs as theorem provers | Bottlenecked by sequential work |
| 3. Multi-agent systems | Parallel agents on a shared codebase | Coordination chaos |

The paper focuses on solving the last problem: coordination.

Because scaling intelligence is easy. Scaling collaboration is not.


Analysis — The system that actually works

Instead of inventing a novel AI architecture, the authors did something almost disappointingly pragmatic:

They copied software engineering.

The core insight

Treat formalization as a software project, not a reasoning problem.

This leads to a surprisingly simple architecture:

| Component | Role |
|---|---|
| Sketcher agents | Convert the textbook into definitions and theorem statements |
| Prover agents | Fill in proofs |
| Reviewer agents | Enforce correctness and code quality |
| Maintainer agents | Refactor and resolve structural issues |
| Triage / Scan / Progress agents | Manage coordination and direction |

And the infrastructure?

  • Git-based version control (with branches and PRs)
  • Merge queues
  • Issue tracker (as a filesystem)
  • Continuous integration (build must pass)

In other words: GitHub, but populated entirely by AI workers.
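The role-based pipeline can be sketched in a few lines. This is a hedged toy model, not the paper's implementation: `PullRequest`, `ci_passes`, and `run_pipeline` are hypothetical names chosen for illustration.

```python
# Minimal sketch of the sketcher -> prover -> reviewer/CI -> merge flow.
# All names here are illustrative, not the system's actual API.
from dataclasses import dataclass

@dataclass
class PullRequest:
    theorem: str          # statement produced by a Sketcher agent
    proof: str = ""       # body filled in by a Prover agent
    approved: bool = False

def ci_passes(pr: PullRequest) -> bool:
    # Stand-in for "the build must pass": here, a proof body must exist.
    return bool(pr.proof)

def run_pipeline(statements: list[str]) -> list[PullRequest]:
    merged: list[PullRequest] = []
    for stmt in statements:                # Sketcher output: one PR per statement
        pr = PullRequest(theorem=stmt)
        pr.proof = f"proof of {stmt}"      # Prover fills in the proof
        pr.approved = ci_passes(pr)        # Reviewer + CI gate the merge
        if pr.approved:
            merged.append(pr)              # merge queue admits passing PRs
    return merged

merged = run_pipeline(["add_comm", "mul_assoc"])
print(len(merged))  # 2
```

The design point is that correctness lives in the gate (`ci_passes`), not in any single agent: a mediocre prover plus a strict merge queue still yields a sound repository.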

Why this matters

Most “agentic AI” discussions obsess over reasoning chains.

This system proves something more uncomfortable:

You don’t need better reasoning. You need better organization of mediocre reasoning at scale.


Findings — What happens when you scale agents

1. The headline result

| Metric | Result |
|---|---|
| Textbook size | ~500 pages |
| Time | 1 week |
| Agents | ~30,000 |
| Code generated | ~130,000 lines |
| Theorems formalized | 340 / 340 |
| Cost | ~$100K |

All target theorems were successfully formalized and verified.

Not approximated. Not summarized.

Proved.


2. The real story: efficiency vs chaos

The system wasn’t clean. It was… industrial.

From the data (Tables 2 and 3 in the paper):

| Outcome type | Observation |
|---|---|
| Successful merges | Thousands of small PRs (~100 lines each) |
| Aborted agents | Massive early inefficiency due to system issues |
| Blocked agents | Dependency issues created bottlenecks |
| Reviewer loops | Iterative refinement was essential |
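The "reviewer loops" and "aborted agents" rows describe the same mechanism from two sides: a draft is revised until review passes or the retry budget runs out. A hedged sketch, with `reviewer_loop` and its parameters invented for illustration:

```python
# Bounded reviewer loop: revise until review passes or the budget is spent.
# An exhausted budget corresponds to an "aborted agent" in the tables above.
def reviewer_loop(draft: str, passes_review, revise, max_rounds: int = 3):
    for round_no in range(max_rounds):
        if passes_review(draft):
            return ("merged", round_no)
        draft = revise(draft)          # prover revises against feedback
    return ("aborted", max_rounds)     # compute spent, nothing merged

# Toy check: a draft that needs two revisions before it passes review.
status, rounds = reviewer_loop(
    "v0",
    passes_review=lambda d: d == "v2",
    revise=lambda d: f"v{int(d[1:]) + 1}",
)
print(status, rounds)  # merged 2
```

The retry cap is the operational lever: set it too low and good work is aborted, too high and bad work burns compute, which is exactly the waste described next.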

A striking insight:

Nearly half of compute was wasted early on due to orchestration inefficiencies.

This isn’t an AI problem.

It’s an operations problem.


3. The cost curve is already competitive

The system cost roughly:

  • ~$200 per page
  • ~$300 per theorem

That’s comparable to, or cheaper than, hiring expert mathematicians.
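The per-unit figures follow directly from the headline numbers (~$100K total, ~500 pages, 340 theorems):

```python
# Deriving the per-unit costs from the paper's headline figures.
total_cost = 100_000   # ~$100K total spend
pages = 500            # ~500-page textbook
theorems = 340         # all target theorems formalized

cost_per_page = total_cost / pages        # 200.0 -> "~$200 per page"
cost_per_theorem = total_cost / theorems  # ~294  -> "~$300 per theorem"
print(round(cost_per_page), round(cost_per_theorem))  # 200 294
```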

And the authors estimate 3–10x efficiency improvements are achievable without better models.

Translation: we’re still early in the cost curve.


4. The biggest failure mode: coherence

Scaling agents introduces a subtle but critical issue: semantic fragmentation.

Examples from the paper:

  • The same concept defined multiple times across different modules
  • Slightly incorrect definitions still producing valid local proofs
  • Agents solving the wrong problems (“rabbit holes”)

This leads to a paradox:

The system can be locally correct and globally inconsistent.

Which is… uncomfortably similar to large organizations.
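Semantic fragmentation is easy to reproduce in miniature. In this toy Lean sketch (hypothetical modules, not from the paper), two agents define "even" differently; each supports locally valid proofs, yet the two notions do not line up without extra glue:

```lean
-- Two agents, two "even" predicates: locally fine, globally fragmented.
namespace ModuleA
def Even (n : Nat) : Prop := ∃ k, n = 2 * k
theorem even_zero : Even 0 := ⟨0, rfl⟩    -- a valid local proof
end ModuleA

namespace ModuleB
def Even (n : Nat) : Prop := n % 2 = 0    -- same idea, different shape
theorem even_zero : Even 0 := rfl         -- also valid locally
end ModuleB

-- Any theorem mixing the two now needs a bridge lemma such as
-- `ModuleA.Even n ↔ ModuleB.Even n`, or the library quietly drifts apart.
```

Every local proof type-checks, so CI stays green; the inconsistency only surfaces when someone tries to compose work across modules.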


Implications — This isn’t about math

It’s tempting to view this as a niche result in formal mathematics.

It isn’t.

This is a blueprint for AI-native production systems.

1. Knowledge work becomes orchestration

The bottleneck is no longer intelligence.

It’s:

  • Task decomposition
  • Dependency tracking
  • Coordination protocols

In short: management.

Except your workforce is non-human.


2. Verification becomes programmable

If proofs can be verified this way, so can:

  • Financial models
  • Legal contracts
  • Regulatory compliance
  • Engineering systems

The implication is profound:

Trust shifts from institutions to verification pipelines.


3. Infrastructure > models

The paper quietly makes a strategic claim:

The limiting factor is not model intelligence, but system orchestration.

This aligns with what we’re already seeing in AI markets:

  • Models commoditize
  • Workflows differentiate

The winners won’t be those with the smartest models.

They’ll be the ones with the best agent architectures.


4. The compounding effect: data → models → data

Formalized mathematics creates:

  1. High-quality training data
  2. Better reasoning models
  3. More formalization

A feedback loop.

Which suggests something slightly unsettling:

AI systems may eventually become the primary producers and validators of knowledge.


Conclusion — From mathematicians to machine-managed knowledge

The paper ends with a deceptively simple statement:

The bottleneck is no longer reasoning—it’s coordination.

That sentence should make you pause.

Because it reframes the entire AI landscape.

We’re not moving toward smarter individual systems.

We’re moving toward structured collectives of moderately capable systems that, when orchestrated correctly, outperform human teams in both speed and cost.

Mathematics is just the test case.

The real target is everything else.


Cognaptus: Automate the Present, Incubate the Future.