Proofs at Scale: When 30,000 Agents Replace the Referee

Mathematics has a management problem.

That sounds less romantic than saying it has a reasoning problem, but romance is not usually where bottlenecks hide. A proof can be brilliant, a referee can be diligent, and still the verification system can fail for the boring reason that nobody has enough time to check everything line by line. The paper Automatic Textbook Formalization takes that bottleneck seriously and then does something unusually concrete: it reports a multi-agent system that formalized a 500-plus-page graduate algebraic combinatorics textbook into Lean, with all 340 target definitions and theorems proved, in about one week.¹

The headline number is almost too convenient for public discussion: roughly 30,000 Claude 4.5 Opus agents, 130,000 lines of Lean, 5,900 declarations, and an estimated cost of about $100,000 with caching. It is tempting to read this as a story about AI models becoming mathematical geniuses. That would be tidy. It would also miss the useful part.

The paper is not mainly a triumph of isolated theorem-proving intelligence. It is a case study in converting model capability into production through repository-level discipline: task decomposition, git branches, pull requests, independent review, issue tracking, merge queues, status scans, and late-stage coherence cleanup. In other words, the breakthrough looks less like a lonely mathematician and more like a strangely efficient software company staffed by tireless interns who occasionally duplicate a core definition three times and wander off to build Pfaffians when nobody asked.

That is not a criticism. That is the point.

The real unit of progress is the repository, not the proof

Lean formalization turns mathematical verification into a software-engineering task. A theorem statement becomes code. A proof becomes code. A build either passes or it does not. That creates a valuable property most business workflows do not have: an unforgiving verifier.

But the paper makes clear that a verifier alone is not enough. A single agent can explore a proof, search a library, edit files, and run Lean. That works for small problems. It does not scale naturally to a full textbook because the hard part is no longer only proving a local statement. The hard part is keeping the whole repository coherent while thousands of local changes are made in parallel.

The authors therefore reject a purely single-file decomposition approach. In earlier experiments, they asked an agent to scaffold theorem statements and supervise prover subagents. That worked only for simpler material because the context became too limited. The final system instead treats formalization like a collaborative software project.

The design is almost aggressively ordinary:

Mechanism	What it does in the paper	Why it matters beyond mathematics
Short-lived git branches	Agents work on small changes without directly corrupting main	Parallel work needs isolation before integration
Pull requests	Agent contributions are reviewed before merge	Output quality is managed as a process, not assumed from the model
Mathematical review	Checks whether statements match the source material	Semantic fidelity needs a separate control layer
Engineering review	Checks naming, conventions, documentation, and API design	Maintainability is a first-class quality dimension
Merge queue	Keeps the main branch buildable	Coordination must protect the shared state
File-system issue tracker	Agents report blockers, refactors, and unresolved tasks	Communication can be simple if it is inspectable and versioned
Scan, triage, and progress roles	Agents monitor global issues and target completion	Large workflows need monitoring agents, not only doing agents

The elegance here is not architectural novelty. It is choosing a substrate that agents already know how to use. Git, files, terminal tools, builds, and issue lists are not glamorous. They are, however, legible to current coding agents. The system does not require agents to understand an exotic coordination protocol. It makes them participate in a familiar one.

This is the first business-relevant lesson. When people ask whether agents can automate complex work, they often imagine one super-agent performing an entire job. The paper suggests a different answer: complex work becomes automatable when it can be represented as many small, reviewable, mergeable changes against a shared artifact.

The future of agentic automation may look less like a chatbot and more like continuous integration with opinions.

The paper’s main result is large, but the mechanism explains why it was possible

The experiment formalized a graduate textbook in algebraic combinatorics. The source was chosen carefully: it was more than 500 pages, largely not already present in Lean’s mathlib, but still accessible from existing Lean infrastructure. The authors also emphasize contamination controls: agents did not have internet access, and the selected textbook was intended to avoid the trivial result of copying existing formalizations.

The system designated 340 definitions and theorems in the LaTeX source as target items. That target list matters because a textbook does not have one central theorem. Without explicit targets, activity can masquerade as progress. With explicit targets, progress can be measured.

The final numbers are straightforward:

Result	Reported value	Interpretation
Target definitions and theorems	340 / 340 proved	The system completed the designated formalization targets
Source scale	500+ pages	This is textbook-scale, not contest-problem scale
Lean output	130K lines	The output is a substantial repository, not a demo file
Lean declarations	5,900	The project created a reusable formal library around the text
Runtime	About one week	Parallelism changed the project timeline
Agent count	30,046 by role table	The result depends on massive distributed execution
Estimated cost	About $100K with caching	Already close to human expert cost, but still expensive to scale directly

The 340/340 result is the obvious headline. The more interesting detail is how granular the work became. The paper reports that pull requests were kept small, with agent-submitted changes generally around the scale of a few hundred lines or less. That is not incidental. Small changes make automated review possible. They also reduce the blast radius when an agent does something locally plausible and globally unfortunate.

This is where the mechanism-first reading matters. A summary can say “30,000 agents formalized a textbook.” A useful interpretation asks why this did not collapse into chaos. The answer is not that every agent was reliable. The answer is that the system made unreliability survivable.

Agent roles are not decoration; they encode different failure modes

The role design in the paper is easy to skim past, but it is one of the central contributions.

Sketcher agents translate chunks of the textbook into Lean definitions and theorem statements, usually leaving proof gaps. Prover agents fill in those gaps. Reviewer agents are split into mathematical and engineering review. Maintainer agents handle broader issues and refactoring. Triage agents mark stale or resolved issues. Scan agents look for global codebase problems. Progress agents monitor whether the system is still advancing on the intended target list.

That division is not bureaucratic ornament. Each role corresponds to a different kind of failure.

A proof can fail because the local tactic does not work. That is a prover problem.

A proof can fail because the theorem statement was translated incorrectly. That is a mathematical review problem.

A proof can pass but leave behind awkward APIs, duplicated definitions, or conventions that make future proofs harder. That is an engineering review or maintainer problem.

The whole system can appear busy while drifting toward irrelevant tasks. That is a progress-monitoring problem.

The paper’s token table shows how heavily the system depended on non-prover labor. Maintainer agents consumed the largest total token volume among agent types, more than prover agents. Reviewers, triage, scan, and progress agents also consumed meaningful compute. The takeaway is uncomfortable for anyone selling “autonomous agents” as a simple replacement for human workers: once the task becomes real, the support structure becomes part of the product.

In normal business language, the system did not just hire workers. It hired workers, reviewers, coordinators, project managers, quality assurance, and operations staff. Then it paid all of them in tokens.

The cost story is really an operations story

The paper estimates about 83 billion input tokens and 561 million output tokens across the full formalization. Without token caching, the authors estimate the run would cost about $430,000. With a heuristic caching estimate, they put the cost at roughly $100,000, including about $14,000 for output tokens.

Those numbers need two readings.

The first reading is economic. About $100,000 for a full graduate-textbook formalization may already be comparable to, or cheaper than, assembling a team of human experts for a project of similar scale. The paper frames this as evidence that automatic formalization is moving toward economic viability.

The second reading is operational. The run was not efficient. The authors explicitly describe major waste: early restarts, NFS bottlenecks, git worktree timeouts, merge queue congestion, aborted agents, duplicated work, unnecessary attempts at exercises and cited theorems, and poor dependency tracking. Around half of the input tokens were spent on aborted agents during early orchestration problems.

That makes the reported cost both impressive and misleading. It is impressive because the system worked despite waste. It is misleading if treated as a stable cost floor. The authors believe a 3–10x cost reduction is plausible without better models, simply by improving orchestration: cleaner dependency tracking, better prompts against unnecessary work, more conservative parallelism, improved relaunch policies, and a more mature coordination layer.

This is one of the most important business implications of the paper. The near-term ROI frontier for agents may not come from waiting for a smarter model. It may come from reducing operational waste around the model.

Cost driver in the paper	Likely business analogue	Practical control
Aborted agents after restarts	Failed automation runs after tool or data errors	Resumable workflows and checkpointing
Agents blocked by unresolved dependencies	Staff or bots waiting on upstream documents, approvals, or data	Explicit dependency graphs
Agents working on exercises or cited theorems	Automation solving adjacent but non-required tasks	Scope guards and target lists
Merge queue congestion	Too many parallel changes competing for one approval path	Batch review, staging, prioritization
Duplicate definitions	Teams reinventing the same entity, KPI, data model, or policy term	Shared ontology and schema governance

The mildly embarrassing part is also the useful part: the paper’s inefficiencies look familiar. They are not exotic AI problems. They are operations problems wearing Lean syntax.

Coherence is the bottleneck after local correctness improves

A Lean proof that passes is locally verified. That does not mean the whole project is semantically beautiful, reusable, or even consistently designed.

The paper’s most instructive failure cases involve coherence. In one case, an $n$-partition data type was defined independently in three different chapters. The definitions were mathematically equivalent and locally useful, but the duplication damaged repository coherence. A shared definition was later factored out, temporarily increasing the count of versions before migration reduced the mess. There is a joke in there about standards, but the paper already noticed it.

The harder example is Bender–Knuth involutions. The source textbook uses informal combinatorial explanation, examples, and diagrams. The agents produced two different definitions in different proof contexts. These were not simply equivalent duplicates. One version applied to full rows; another could be restricted to an initial prefix of columns. Worse, one version ignored matching conditions, while another implemented matching but selected entries in the wrong direction.

Here the system encountered a dangerous pattern: broken definitions can still support some locally valid proofs. Where lemmas remained true under a broken definition, prover agents happily proved them. Where lemmas became false, agents reported issues or counterexamples, but they did not automatically infer the global conceptual repair.

This is the paper’s quiet warning. Verification checks the code you wrote, not the intention you failed to encode.

For mathematics, that means the review surface shrinks from every proof line to the semantic fidelity of definitions and theorem statements. That is still a major gain. But it is not zero work. The authors say they manually spot-checked key statements and definitions and released a side-by-side blueprint website so the community can compare the formalization against the source. This is a serious boundary condition: 340/340 proved does not imply every statement was translated with perfect semantic fidelity.

For business automation, the analogue is direct. If an AI system correctly executes the wrong policy, validates the wrong KPI, or summarizes the wrong contractual obligation, the pipeline can be internally consistent and externally wrong. Formal verification helps only when the thing being verified matches the thing the organization actually means.

The rabbit-hole problem is what happens when agents are helpful in the wrong direction

The paper reports another failure mode that deserves more attention than the headline result: agents pursued irrelevant formalizations despite instructions not to.

In a textbook, not every mentioned result is a target. Some theorems are cited. Some statements are exercises. Some background material is useful context but not part of the formalization goal. The agents sometimes ignored that boundary. Once an irrelevant statement was formalized and left with proof gaps, prover agents picked it up, issues were created, and compute was spent on tasks that were not part of the intended project.

The paper’s example is Kasteleyn’s formula for domino tilings. Agents started developing broad and incomplete theory around Pfaffians and the FKT algorithm, pulling resources away from target proof translation.

This is not a “model hallucination” in the usual sense. The work was mathematically adjacent. It was plausibly useful. It was also out of scope.

That distinction matters. In business workflows, agents will not only fail by producing nonsense. They will fail by producing plausible, polished, and unnecessary work. They will research the background instead of filing the compliance report. They will build a richer dashboard instead of reconciling the metric. They will write a strategic memo when the required output is a signed exception log.

The cure is not merely stricter prompting. The paper suggests a process answer: explicit target lists, progress agents, issue triage, status reports, and a staged tightening of project coherence. Early in the project, some freedom helps parallel exploration. Near release, the screws need tightening. The authors describe this as a kind of simulated annealing: broad exploration first, harmonization later.

That is a useful operating principle for agentic systems. Do not demand perfect global coherence at the start. Do not allow creative wandering at the end.

The paper’s evidence is a case study, not a universal law

The paper’s figures and tables serve different purposes, and mixing them together would overstate the result.

Evidence in the paper	Likely purpose	What it supports	What it does not prove
340/340 target items proved	Main evidence	The system completed the designated formalization target list	Perfect semantic translation of every source statement
130K Lean lines and 5,900 declarations	Scale evidence	The output is a substantial repository-scale formalization	That all code is optimally designed or reusable
Agent role token table	Implementation and cost analysis	Large-scale success required non-prover roles, especially maintenance and review	That this exact role mix is optimal
Outcome table	Operational diagnosis	Waste came from aborted, blocked, and failed agents as well as successful merges	A clean estimate of mature-system cost
Prover vs maintainer comparison	Exploratory system analysis	Maintainers handled broader refactoring and coordination-file work	Whether prover agents should be eliminated
Coherence failure examples	Qualitative failure analysis	Local correctness can coexist with global semantic inconsistency	A quantified failure rate across all definitions
Cost caching appendix	Heuristic economic estimate	Caching plausibly reduces cost from no-cache estimates	Exact billing reconstruction from logs

This matters because the paper is unusually open about what was learned during the run. It is not a clean benchmark where a frozen system is compared against a fixed baseline. The authors adjusted prompts, added roles, clarified review policies, tuned orchestration, and launched status agents manually at points. That makes the study more like an engineering report than a controlled ablation suite.

That is not a weakness for our purposes. Engineering reports often teach more about deployment than polished benchmarks do. But it does mean the result should be read as: “Here is a working architecture and its observed failure modes,” not “Here is the final optimal algorithm for all agent collaboration.”

The business lesson is an agent operating system, not an AI miracle

The immediate domain is formal mathematics. The broader lesson is about complex knowledge work where many small actions must converge into one auditable artifact.

A law firm producing a due-diligence memo, a bank validating model-risk documentation, an insurer reviewing claims policy changes, a manufacturer reconciling engineering specifications, or a public agency checking regulatory submissions all face versions of the same structure:

The work is too large for one pass.
Many local tasks can be separated.
Local correctness is not enough.
The final artifact must be coherent.
Review must focus on the few points where errors are expensive.
Progress must be measured against explicit targets, not against activity.

The paper’s workflow maps surprisingly well onto that structure.

Paper mechanism	Business translation	Value created	Boundary
Target theorem list	Explicit output acceptance checklist	Prevents activity from replacing progress	Requires expert design upfront
Sketcher agents	First-pass extraction or drafting	Converts raw source material into structured work items	Draft quality can distort downstream work
Prover agents	Evidence builders or task completers	Fill gaps using tools and source context	Can optimize locally against wrong assumptions
Mathematical review	Domain-faithfulness review	Checks whether output means the right thing	Still needs expert judgment
Engineering review	Process and maintainability review	Keeps artifacts reusable and consistent	May slow throughput if overdone
Issue tracker	Shared blocker memory	Makes coordination inspectable	Can become bureaucracy if not actively cleaned
Progress agents	Target completion monitoring	Reduces drift and rabbit holes	Needs reliable target definitions
Merge queue	Controlled integration	Protects the main artifact	Can become a bottleneck

The inference for Cognaptus-style business automation is clear but bounded. The product is not “ask an agent to do the job.” The product is a workflow architecture where agents operate against files, tests, review gates, dependency maps, and human-readable issue trails.

That is less magical. It is also much more likely to survive contact with an actual organization.

The verifier boundary is where mathematics differs from ordinary business work

Lean gives this project a decisive advantage: the proof checker can reject invalid proofs. In many business workflows, there is no equivalent kernel. A compliance memo, investment report, procurement review, or legal-risk summary can be plausible without being formally verifiable.

This boundary should not be softened. The paper’s success depends on a domain where correctness can be machine-checked after the right statement has been encoded. Business workflows often lack that final binary test. They may have partial validators: schema checks, reconciliation rules, audit trails, numerical tests, contract clause matchers, policy checklists, or human review. These are useful, but they do not equal Lean.

The right business response is not despair. It is layering.

Use deterministic checks wherever possible. Use review agents for consistency and style. Use specialist agents for domain issues. Use human reviewers at the semantic choke points. Use logs and version control so failures can be traced. Use explicit target lists so the system cannot declare victory by generating impressive debris.

The paper shows what happens when AI agents are embedded in a domain with a hard verifier. The business opportunity is to approximate that verifier with narrower, domain-specific controls.

That is where practical advantage will sit: not in claiming that agents are trustworthy, but in designing systems where trust is not a personality trait.

What this changes about AI strategy

The paper ends with an argument that large-scale formalization may become constrained less by human expert labor and more by inference compute, capital, and orchestration. For mathematics, that is a serious claim. For business, it points to a more general shift.

When model capability is scarce, strategy centers on access to the best model. When model capability becomes broadly available, strategy moves to process design: how work is decomposed, what tools agents can use, what they are forbidden to touch, how review is sequenced, how shared state is protected, and how progress is measured.

That shift is already visible in this case study. The system’s likely 3–10x efficiency gain does not require Claude 5, Lean 6, or a new theorem-proving breakthrough. It requires less wasted work, fewer blocked agents, better dependency tracking, and cleaner orchestration.

In less polite language: the agents were expensive partly because the office was messy.

The best near-term AI systems will not be the ones that merely summon more agents. They will be the ones that know when not to launch them, what task to give them, how to detect blocked work, when to merge, when to refactor, and when to stop an eager worker from building a cathedral next to the requested shed.

Conclusion: the referee is not replaced by one agent, but by a workflow

The phrase “30,000 agents replace the referee” is memorable, but slightly misleading if taken literally. The referee is not replaced by a single artificial judge. The referee’s impossible job is redistributed across a formal proof checker, source-statement review, engineering review, issue tracking, status monitoring, and a repository process that keeps local work from destroying global coherence.

That is more interesting than the slogan.

The paper’s achievement is not that AI suddenly understands an entire graduate textbook the way a human expert does. It is that a large swarm of agents, when placed inside a disciplined software-engineering scaffold, can produce a complete machine-checked formalization at a scale and speed that changes the economics of verification.

The remaining bottleneck is also clear. Models can prove local statements. Proof assistants can verify formal code. What still needs careful design is the layer between them: the operating system of agentic knowledge work.

For mathematics, that operating system may help build the formal library needed for routine verification of research. For business, it offers a more sober template for automation: do not buy an agent and hope for intelligence. Build a process where small pieces of intelligence can be coordinated, checked, merged, and corrected.

That is less cinematic than the AI replacing the referee.

It is also how the work actually gets done.

Cognaptus: Automate the Present, Incubate the Future.

Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat, “Automatic Textbook Formalization,” arXiv:2604.03071v1, April 2026. https://arxiv.org/abs/2604.03071 ↩︎

The real unit of progress is the repository, not the proof#

The paper’s main result is large, but the mechanism explains why it was possible#

Agent roles are not decoration; they encode different failure modes#

The cost story is really an operations story#

Coherence is the bottleneck after local correctness improves#

The rabbit-hole problem is what happens when agents are helpful in the wrong direction#

The paper’s evidence is a case study, not a universal law#

The business lesson is an agent operating system, not an AI miracle#

The verifier boundary is where mathematics differs from ordinary business work#

What this changes about AI strategy#

Conclusion: the referee is not replaced by one agent, but by a workflow#