Formal Verification

Unsolvable by Design: Turning AI Plans Into Security Guarantees

Failure should be boring. Approval workflows are supposed to be boring. A client submits documents, a system checks the required conditions, and an approval either happens or does not happen. Boring is good. Boring means the process does not accidentally approve a case while also escalating it as problematic. The trouble begins when a workflow is written as a best-effort model of reality. Someone encodes the actions. Someone else adds an exception. A third person adds a shortcut because the quarterly dashboard prefers speed over philosophy. Eventually, a sequence exists that should not exist. It does not look like a bug when inspected locally. Each action seems defensible. The path as a whole is the problem. ...

Proofs at Scale: When 30,000 Agents Replace the Referee

Mathematics has a management problem. That sounds less romantic than saying it has a reasoning problem, but romance is not usually where bottlenecks hide. A proof can be brilliant, a referee can be diligent, and still the verification system can fail for the boring reason that nobody has enough time to check everything line by line. The paper "Automatic Textbook Formalization" takes that bottleneck seriously and then does something unusually concrete: it reports a multi-agent system that formalized a 500-plus-page graduate algebraic combinatorics textbook into Lean, with all 340 target definitions and theorems proved, in about one week. ...

When Less Proves More: The Case for Minimalist AI Theorem Provers

Proof is a good place to test AI humility. In ordinary business writing, a model can sound confident, cite familiar patterns, and still be quietly wrong. The error may not surface until the contract is signed, the policy memo is circulated, or the spreadsheet has already acquired the authority of a sacred object. In formal theorem proving, the arrangement is less polite. The model writes code. Lean compiles it. The compiler either accepts the proof or sends it back covered in red ink. ...

Proof Over Probabilities: Why AI Oversight Needs a Judge That Can Do Math

Agents now do things. That sounds obvious, but it is the entire problem. A chatbot can be wrong and mostly embarrass itself. An agent can book the wrong hotel, leak the wrong file, fabricate the wrong report, or move through a workflow with the quiet confidence of a junior employee who has just discovered automation and has not yet discovered liability. ...

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Compute is a very convenient alibi. When an AI system fails, the modern reflex is to ask for more of it: more samples, more tokens, more search, more GPUs, more patience from whoever is paying the invoice. This habit is not always wrong. Sometimes the model really does need another attempt. Sometimes the winning answer is hiding in sample number 47. ...

When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

A coder does not trust a program because it sounds plausible. A coder runs it, reads the error message, changes the implementation, tests again, searches the library, asks a colleague, splits the problem, and keeps going until the machine stops complaining. That mundane loop is the interesting part of "Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics". The headline result is easy to market: with Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 Putnam 2025 problems in Lean, matching the reported perfect score of AxiomProver. Nice. The trophy cabinet sparkles. ...

Vibe Coding a Theorem Prover: When LLMs Prove (and Break) Themselves

A theorem prover is a terrible place to let an LLM improvise. Code review is forgiving compared with theorem proving. In ordinary software, a language model can produce code that looks clean, passes a few tests, and still hides a slow-burning defect somewhere behind an edge case. Annoying, yes. Catastrophic, sometimes. But the social contract is familiar: tests catch some errors, humans catch others, production catches the rest. Very elegant. Very modern. Very expensive. ...

When Solvers Guess Smarter: Teaching SMT to Think in Functions

Timeouts are where formal verification quietly loses its glamour. A team writes a specification. A solver receives the formula. Everyone expects the machine to answer a clean question: is this system safe, satisfiable, contradictory, or not? Then the solver thinks. And thinks. And returns nothing useful before the clock runs out. ...

When Fairness Fails in Groups: From Lone Counterexamples to Discrimination Clusters

Imagine two fairness bugs. In the first, changing a protected attribute while holding everything else constant shifts a model’s output enough to trigger one unfair decision. In the second, the same underlying applicant profile can fracture into nineteen meaningfully different score bands as protected attributes change. A conventional pairwise fairness test records both as violations. One counterexample each. Very tidy. Also not especially useful. ...

How to Make Neural Networks Talk: Register Automata as Their Unexpected Interpreters

Prices move. Sensors drift. Users click, pause, return, disappear, and sometimes behave exactly like a Markov chain with a caffeine problem. Modern sequence models are good at turning such streams into decisions. A recurrent network or transformer can look at a run of numbers and say: buy, flag, reject, approve, alert. What it usually cannot do is explain the rule it has learned in a form that a risk team, engineer, or auditor can actually inspect. ...