Opening — Why this matters now

In an age when AI can outscore most humans at the International Mathematical Olympiad, a subtler question has emerged: can machines care about what they solve? The new study A Matter of Interest (Mishra et al., 2025) probes the fault line between mechanical brilliance and genuine curiosity. If future AI partners are to co‑invent mathematics, not just compute it, they must first learn what humans deem worth inventing.

Background — Context and prior art

Historically, “automated mathematical discovery” (AMD) systems—from Lenat’s AM in the 1970s to DeepMind’s AlphaTensor and AlphaEvolve—focused on generating or improving results, not judging whether the questions themselves merited attention. In contrast, human mathematicians routinely discriminate: which problems are elegant, fruitful, or beautiful? Those judgments of interestingness shape the evolution of mathematics itself.

Large language models (LLMs) have recently reached Olympiad‑gold‑level reasoning ability. Yet they remain glorified problem solvers, fed pre‑selected tasks and never asked, "Why this problem?" The authors of this study asked precisely that, comparing how humans and models assess the "interestingness" and "difficulty" of math problems.

Analysis — What the paper does

The researchers collected two complementary datasets:

  1. Crowdsourced judgments (Prolific participants): 63 adults rated modified AMC problems on a 0–100 interestingness scale, explaining their rationale.
  2. Expert judgments (IMO contestants): 48 Olympiad participants rated IMO‑level problems and identified reasons for interest or disinterest—simplicity, elegance, novelty, usefulness, etc.

They then tested 12 AI models across 5 families (OpenAI, Mistral, Meta's Llama, Qwen, DeepSeek) to see whether their evaluations aligned with human ones. Alignment was measured in two ways: the coefficient of determination ($R^2$) between human and model mean ratings, and the Wasserstein Distance (WD) between the full rating distributions, a measure of how similarly shaped two opinion distributions are.
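To make the two measures concrete, here is a minimal Python sketch using SciPy; the rating arrays are illustrative assumptions, not data from the paper.

```python
# Minimal sketch of the two alignment measures, on toy data.
import numpy as np
from scipy.stats import pearsonr, wasserstein_distance

# Hypothetical per-problem mean interestingness ratings (0-100 scale)
human_means = np.array([72.0, 40.0, 55.0, 88.0, 31.0])
model_means = np.array([70.0, 45.0, 50.0, 90.0, 38.0])

# R^2: squared Pearson correlation between human and model mean ratings
r, _ = pearsonr(human_means, model_means)
print(f"R^2 = {r**2:.2f}")

# WD: compares the full shape of two rating distributions,
# not just their means (here, toy ratings pooled across raters)
human_ratings = [72, 95, 40, 61, 55, 88, 20, 31, 77, 50]
model_ratings = [70, 71, 45, 68, 50, 90, 65, 38, 72, 60]
print(f"WD  = {wasserstein_distance(human_ratings, model_ratings):.1f}")
```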

Findings — What the numbers say

| Model Family | R² (Human Alignment) | Distributional Alignment (WD ↓) | Notable Traits |
|---|---|---|---|
| Mistral 7B | 0.78 | 12.4 | Closest to human spread of opinions |
| Mistral 24B | 0.74 | 15.6 | Consistent, human‑like variability |
| DeepSeek R1 | 0.66 | 16.4 | Strong reasoning but stiffer preferences |
| GPT‑5 | 0.52 | 21.2 | Accurate yet distributionally narrow |
| Llama‑4 Maverick | 0.55 | 20.7 | Over‑confident judgments |

(Lower WD = closer to human variability; human–human baseline ≈ 9.5)

Despite decent correlations, only Mistral‑family models produced rating distributions statistically close to human ones. Most others exhibited "flattened curiosity": they could approximate the human average judgment but failed to capture its diversity.
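A simulated toy example shows why WD catches what mean‑level agreement misses: a model can match the human average exactly while collapsing the spread of opinions. The distributions below are assumptions for illustration, not the paper's data.

```python
# Toy illustration of "flattened curiosity": identical means,
# very different spreads. Averages agree perfectly, yet WD is large.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
human = rng.normal(55, 20, 1000).clip(0, 100)  # wide spread of human opinions
model = rng.normal(55, 4, 1000).clip(0, 100)   # narrow, "flattened" model ratings

print(f"human mean = {human.mean():.1f}, model mean = {model.mean():.1f}")
print(f"WD = {wasserstein_distance(human, model):.1f}")  # large despite equal means
```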

Further, when IMO participants justified their interest, three reasons dominated:

  1. “The problem statement is simple and elegant.”
  2. “The solution does not require sophisticated theorems.”
  3. “The solution itself is elegant.”

Yet most models failed to mirror this pattern. They either fixated on novelty or defaulted to high complexity as a proxy for interest. In short, LLMs still conflate “hard” with “interesting.”

Implications — For education, AI partners, and design

The findings strike at the heart of AI alignment with human aesthetics. If models are to assist mathematicians—or design learning curricula—they must internalize the nuanced spectrum between trivial, challenging, and beautiful. Over‑weighting difficulty risks producing uninspired tutors and unimaginative conjecture generators.

This also invites a more philosophical reflection: should AI mimic the average human’s curiosity, or the expert’s? A model tuned to Olympiad elegance may alienate beginners, while one calibrated to student curiosity might bore professionals. “Alignment,” in this sense, becomes multidimensional—spanning expertise, motivation, and even taste.

For practical adoption:

  • Educational AI should learn to scaffold curiosity—proposing problems that feel just reachable enough to sustain engagement.
  • Research copilots should embed aesthetic priors—preferring conjectures with elegant formulations or surprising symmetry.
  • Evaluation frameworks may evolve from correctness‑driven to curiosity‑driven metrics, rewarding models that select, not just solve; a toy sketch follows below.
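As a sketch of that last point, here is one hypothetical way to blend correctness with interest alignment. The `curiosity_score` function, its `alpha` weighting, and the `interest_alignment` term are illustrative assumptions, not a metric proposed by the paper.

```python
# Hypothetical curiosity-aware score: reward a model both for solving
# problems and for rating their interestingness the way humans do.
from scipy.stats import wasserstein_distance

def curiosity_score(accuracy: float,
                    model_ratings: list[float],
                    human_ratings: list[float],
                    alpha: float = 0.5) -> float:
    """Blend solving ability with alignment to human interest ratings."""
    wd = wasserstein_distance(human_ratings, model_ratings)
    interest_alignment = 1.0 / (1.0 + wd)  # 1.0 when distributions coincide
    return alpha * accuracy + (1 - alpha) * interest_alignment

# A strong solver whose interest ratings sit far from human ones
# still loses points on the curiosity component.
print(curiosity_score(accuracy=0.92,
                      model_ratings=[80, 82, 85],
                      human_ratings=[30, 55, 90]))
```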

Conclusion — When reasoning meets wonder

Mathematics advances not only by proof, but by preference. The next frontier for reasoning models lies in modeling the human sense of fascination—the emotional calculus that decides what’s worth solving in the first place. Until LLMs can feel a flicker of delight at a clever trick or a clean symmetry, they may remain problem solvers, not problem choosers.

Cognaptus: Automate the Present, Incubate the Future.