What happens when we ask the smartest AI models to do something truly difficult—like solve a real math problem and prove their answer is correct?

That’s the question tackled by a group of researchers in their paper “Mathematical Proof as a Litmus Test.” Instead of testing AI with casual tasks like summarizing news or answering trivia, they asked it to write formal mathematical proofs—the kind that leave no room for error. And the results? Surprisingly poor.

🧠 Why Use Math Proofs to Test AI?

Math proofs are the ultimate test of careful thinking. There’s no room for vague answers. To pass, an AI must show it truly understands the logic step-by-step—just like a top student in a math competition. This makes proofs a perfect “litmus test” for checking if AI really knows what it’s talking about.

The researchers used a system called Lean, where every step of the proof must be correct and verified by a computer. This way, we don’t have to guess if the answer just sounds smart—it either works or it doesn’t.
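To see what "verified by a computer" means in practice, here is a minimal illustrative sketch in Lean 4 (my own example, not one from the paper's benchmark). If any step were wrong, Lean would reject the entire proof:

```lean
-- A machine-checked proof that addition of natural numbers commutes.
-- Lean verifies every step; there is no partial credit for "sounding right".
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b  -- appeal to a lemma from Lean's standard library
```

Replace `Nat.add_comm` with anything that doesn't exactly fit the goal and Lean refuses to accept the proof, which is precisely what makes it a strict judge of AI output.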

| Concept | Description |
| --- | --- |
| Lean | A formal proof system that checks logic rigorously, step by step |
| Tactic prediction | Asking the AI to generate the next step of a proof |
| Full proof generation | Asking the AI to solve the whole problem from scratch |
| Hallucination | When the AI makes up fake math steps or lemmas that don't exist |

📉 So How Did the AIs Do?

In this study, the researchers tested several advanced AI models available in early-to-mid 2025. Since AI models improve rapidly, it’s important to understand that these results reflect a snapshot in time and may not represent the current state of the latest versions.

| Model | Full-Proof Accuracy (Realistic Setting) | Notes |
| --- | --- | --- |
| GPT-4 (early 2025) | 17% | Closed-source, commercial version |
| Claude Opus (Anthropic) | 14% | Strong on general reasoning |
| Gemini 1.5 Pro (Google DeepMind) | 9% | Newest multimodal model at the time |
| DeepSeek Coder | 4% | Open-source, coding-focused |
| CodeGemma | 2% | Lightweight model with open weights |

While newer models like GPT-4o or Claude 3 Opus may now exist, the key insight remains: even top-tier models struggle when logic must be precise, step-by-step, and verifiable.

Common errors included:

  • Writing proofs that looked right but had hidden logic gaps
  • Repeating the same step over and over
  • Forgetting what was already proven

🧪 A Smarter Way to Test Them

To make things fair, the team built a special test set called LeanDojoEval. It includes math problems of different levels: easy, medium, and hard. The AI was tested under three different settings:

| Prompt Type | Description | Example Task |
| --- | --- | --- |
| One-step (tactic) | Predict the next tactic given the current proof state | "What comes after `intro h`?" |
| Multi-step | Write the full proof with some hints | "Prove a basic lemma with support" |
| Realistic | Solve the full problem with no hints at all | "Prove a theorem from scratch" |
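To make the one-step setting concrete, here is an illustrative Lean 4 sketch of my own (not a task from LeanDojoEval itself). After `intro h`, the proof state becomes the "question," and the model must supply the next tactic:

```lean
-- One-step (tactic) setting: given the proof state reached after
-- `intro h`, the model is asked to predict the next tactic.
example (p q : Prop) (hpq : p → q) : p → q := by
  intro h       -- proof state is now:  h : p  ⊢ q
  exact hpq h   -- the "next step" a model would need to predict
```

Predicting one such step in isolation is far easier than producing the whole chain, which is why accuracy drops so sharply in the realistic setting.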

They found that AIs can sometimes guess the next step correctly, much like answering a multiple-choice question, but when they must solve the whole problem on their own, their performance collapses.

🔍 What Does This Tell Us?

It tells us that today’s AI models are still far from reliable thinkers.

  • They copy the style of reasoning, but don’t always understand the logic.
  • They often sound confident even when wrong.
  • They forget things they just saw or don’t use the right info.

In simple terms, they’re great at writing, but not yet at thinking.

🧭 What Needs to Improve?

To make AI truly helpful for tasks like science, law, or math, we need:

  • Smarter ways for AI to remember and check its own work
  • Hybrid systems that combine natural language with formal logic
  • Tools that can catch mistakes instead of just trusting the AI

Rather than expecting one model to do everything, the future may involve teams of tools working together—one to generate, one to verify, one to plan.

🤖 Imagine an AI team: the Writer generates steps, the Checker verifies logic, and the Planner keeps the goal in sight.


Cognaptus: Automate the Present, Incubate the Future.