What happens when we ask the smartest AI models to do something truly difficult—like solve a real math problem and prove their answer is correct?
That’s the question tackled by a group of researchers in their paper “Mathematical Proof as a Litmus Test.” Instead of testing AI with casual tasks like summarizing news or answering trivia, they asked it to write formal mathematical proofs—the kind that leave no room for error. And the results? Surprisingly poor.
🧠 Why Use Math Proofs to Test AI?
Math proofs are the ultimate test of careful thinking. There’s no room for vague answers. To pass, an AI must show it truly understands the logic step-by-step—just like a top student in a math competition. This makes proofs a perfect “litmus test” for checking if AI really knows what it’s talking about.
The researchers used a system called Lean, where every step of the proof must be correct and verified by a computer. This way, we don’t have to guess if the answer just sounds smart—it either works or it doesn’t.
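Here is a taste of what that looks like. This toy theorem is not from the paper; it simply shows the mechanics: each tactic transforms the current proof state, and if any step is wrong, the file does not compile.

```lean
-- Each tactic below is machine-checked. A single wrong step
-- makes the whole file fail to compile.
theorem imp_trans (p q r : Prop) (h₁ : p → q) (h₂ : q → r) : p → r := by
  intro hp       -- assume p; the goal becomes r
  apply h₂       -- reduce the goal r to proving q
  exact h₁ hp    -- prove q by applying h₁ to hp
```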
| Concept | Description |
|---|---|
| Lean | A formal proof system that checks logic rigorously, step by step |
| Tactic Prediction | Asking the AI to generate the next step of a proof |
| Full Proof Generation | Asking the AI to solve the whole problem from scratch |
| Hallucination | When the AI makes up fake math steps or lemmas that don't exist |
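To see why hallucination is fatal in this setting, consider a hypothetical proof attempt (ours, not one from the paper) that cites a lemma which does not exist. In plain English the step might sound perfectly reasonable; in Lean, the compiler rejects it outright:

```lean
-- `Nat.add_commutes` is a made-up name; the real lemma is `Nat.add_comm`.
-- Lean fails with "unknown identifier" rather than letting a
-- plausible-sounding step slip through.
theorem add_comm_broken (a b : Nat) : a + b = b + a := by
  exact Nat.add_commutes a b   -- error: unknown identifier
```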
📉 So How Did the AIs Do?
In this study, the researchers tested several advanced AI models available in early-to-mid 2025. Because AI models improve rapidly, these results are a snapshot in time and may not reflect the latest versions.
| Model | Full-Proof Accuracy (Realistic Setting) | Notes |
|---|---|---|
| GPT-4 (early 2025) | 17% | Closed-source, commercial version |
| Claude Opus (Anthropic) | 14% | Strong on general reasoning |
| Gemini 1.5 Pro (Google DeepMind) | 9% | Newest multimodal model at the time |
| DeepSeek Coder | 4% | Open-source, coding-focused |
| CodeGemma | 2% | Lightweight model with open weights |
Newer models such as GPT-4o or Claude 3 Opus may have since raised the bar, but the key insight remains: even top-tier models struggle when logic must be precise, step-by-step, and verifiable.
Common errors included:
- Writing proofs that looked right but had hidden logic gaps
- Repeating the same step over and over
- Forgetting what was already proven
🧪 A Smarter Way to Test Them
To make things fair, the team built a special test set called LeanDojoEval. It includes math problems of different levels: easy, medium, and hard. The AI was tested under three different settings:
| Prompt Type | Description | Example Task |
|---|---|---|
| One-step (Tactic) | Predict the next tactic given the current proof state | "What comes after `intro h`?" |
| Multi-step | Write the full proof with some hints | "Prove a basic lemma with support" |
| Realistic | Solve the full problem with no hints at all | "Prove a theorem from scratch" |
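For a concrete sense of the one-step setting, here is an illustrative example (ours, not the paper's) of the proof state a model might see right after `intro h`, along with one correct next-step prediction:

```lean
example (p q : Prop) (hpq : p → q) : p → q := by
  intro h
  -- Proof state shown to the model at this point:
  --   h   : p
  --   hpq : p → q
  --   ⊢ q
  exact hpq h   -- a correct one-step prediction
```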
They found that AIs can often guess a single next step correctly, much like answering multiple choice, but their performance collapses when they must construct an entire proof on their own.
🔍 What Does This Tell Us?
It tells us that today’s AI models are still far from reliable thinkers.
- They copy the style of reasoning, but don’t always understand the logic.
- They often sound confident even when wrong.
- They forget things they just saw or don’t use the right info.
In simple terms, they’re great at writing, but not yet at thinking.
🧭 What Needs to Improve?
To make AI truly helpful for tasks like science, law, or math, we need:
- Smarter ways for AI to remember and check its own work
- Hybrid systems that combine natural language with formal logic
- Tools that can catch mistakes instead of just trusting the AI
Rather than expecting one model to do everything, the future may involve teams of tools working together—one to generate, one to verify, one to plan.
🤖 Imagine an AI team: the Writer generates steps, the Checker verifies logic, and the Planner keeps the goal in sight.
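Below is a minimal Python sketch of that loop. Everything here is hypothetical scaffolding rather than code from the paper: `generate_proof` is a stub standing in for an LLM call, the Lean compiler (assumed to be installed and on the PATH) plays the Checker, and a simple retry loop plays the Planner.

```python
import subprocess
import tempfile

def run_lean(proof_source: str) -> tuple[bool, str]:
    """Checker: compile the candidate with Lean; success means verified."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def generate_proof(theorem: str, feedback: str = "") -> str:
    """Writer: draft a candidate proof. A real system would call an LLM
    here, passing along any compiler feedback; this stub just tries
    `simp` as a naive first attempt."""
    return f"{theorem} := by simp\n"

def prove(theorem: str, max_attempts: int = 3) -> str | None:
    """Planner: loop Writer -> Checker, feeding errors back on failure."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_proof(theorem, feedback)
        ok, errors = run_lean(candidate)
        if ok:
            return candidate   # machine-verified proof
        feedback = errors      # let the Writer see why Lean said no
    return None                # no verified proof within the budget
```

The design point is the division of labor: the Writer can hallucinate freely, because nothing counts as an answer until the Checker signs off.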
Cognaptus: Automate the Present, Incubate the Future.