The Problem with Problems: Why LLMs Still Don’t Know What’s Interesting

A tutoring system has one deceptively simple job: give the learner the next problem.

Not the hardest problem. Not the flashiest problem. Not the one that makes the model feel terribly pleased with itself after a 4,000-token monologue. The next problem: the one that keeps a student engaged, teaches the right structure, and feels worth the effort.

Research copilots face the same problem wearing a more expensive suit. When an AI system proposes conjectures, examples, lemmas, or search directions, it is no longer just solving problems. It is choosing what deserves attention. That is a much more dangerous form of intelligence, because bad taste scales very efficiently.

The paper “A Matter of Interest: Understanding Interestingness Judgments of Math Problems in Humans and Language Models” asks whether language models can judge which mathematics problems humans find interesting.¹ The answer is not the usual tidy binary. Models can often track average human ratings surprisingly well. They also fail, in important ways, to reproduce the diversity of human judgement, the reasons behind those judgements, and the relationship between evaluation and generation.

So the useful conclusion is not “LLMs understand interestingness” or “LLMs do not understand interestingness”. That would be too convenient, and therefore suspicious. The useful conclusion is narrower: LLMs can imitate some central tendency of human taste, but they are still unreliable as standalone selectors of what is worth solving.

That distinction matters.

The evidence starts promising, which is exactly why it becomes interesting

The study begins with a controlled comparison between human and model judgements of mathematical problems. The authors collected ratings from two human populations: 63 Prolific participants with college-level mathematics experience rating AMC-style contest problems, and 48 International Mathematical Olympiad participants rating IMO-level problems and selecting reasons for their interest or disinterest.

The model side covered 12 language models from five families: OpenAI, Mistral, Llama, Qwen, and DeepSeek. The models rated the same problems for interestingness and difficulty, and in some settings supplied rationales or selected criteria such as elegance, novelty, usefulness, simplicity, and surprise.

The first result looks good for the models. On the Prolific AMC-style dataset, model-human agreement on mean per-problem interestingness ratings ranged from about $R^2 = 0.48$ to $R^2 = 0.78$. Mistral-family models were especially strong on interestingness alignment. This is not trivial. It means that, at the level of average ratings, several models can detect something about which contest problems mathematically interested participants tend to prefer.
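To make "agreement on mean per-problem ratings" concrete, here is a minimal sketch. It assumes $R^2$ is the squared Pearson correlation between per-problem means, which may not match the paper's exact procedure. The ratings, problem ids, and function names are invented for illustration.

```python
from statistics import mean

def per_problem_means(ratings):
    """Map each problem id to the mean of its ratings."""
    return {pid: mean(values) for pid, values in ratings.items()}

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical 0-100 ratings, invented for illustration only.
human = {"p1": [70, 80, 60], "p2": [30, 50, 40], "p3": [90, 85, 95], "p4": [55, 45, 65]}
model = {"p1": [72, 74], "p2": [48, 52], "p3": [80, 84], "p4": [60, 62]}

human_means = per_problem_means(human)
model_means = per_problem_means(model)
problems = sorted(human_means)
score = r_squared(
    [human_means[p] for p in problems],
    [model_means[p] for p in problems],
)
print(f"R^2 on per-problem means: {score:.2f}")
```

Note what the averaging step throws away: every individual rating is collapsed to one number per problem before any comparison happens.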

That is the part a vendor deck would stop at. It should not.

Mean alignment is a useful signal, but it is not the whole judgement. If 30 humans rate a problem, they do not merely produce an average. They produce a distribution: some bored, some delighted, some indifferent, some annoyed because the problem looks like a tired trick in a fake moustache. A system that matches the average but compresses the spread is not modelling a population. It is modelling a committee summary.

The authors therefore ask a second question: do model rating distributions resemble human rating distributions? They use Wasserstein distance (WD), where lower values indicate closer distributional alignment. The human-human split-half baseline is approximately 9.5. Most individual models sit well above that. At temperature 1.0, Mistral 7B Instruct is the closest individual model, with a WD of 12.4 and a confidence interval overlapping the human baseline. Others drift much further away: Mistral 24B at 15.6, DeepSeek R1 at 16.4, QwQ 32B at 18.1, GPT-OSS-120B at 18.3, Mixtral 8x7B at 20.2, Llama 4 Maverick at 20.7, GPT-5 at 21.2, Qwen 235B Instruct at 21.3, and o3 at 21.9.
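A small sketch shows why a model can match the mean and still miss the distribution. It computes the one-dimensional Wasserstein distance as the area between two empirical CDFs, plus a split-half baseline obtained by repeatedly halving one sample at random. The ratings are hypothetical, the helper names are mine, and how the paper aggregates distances across problems is not reproduced here.

```python
import random

def wasserstein_1d(a, b):
    """1-Wasserstein distance between two empirical samples of 1D ratings.

    Computed as the area between the two empirical CDFs.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    area, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance each counter to the number of samples <= left.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        area += abs(i / len(a) - j / len(b)) * (right - left)
    return area

def split_half_baseline(ratings, n_splits=200, seed=0):
    """Mean distance between random halves of the same rating sample."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_splits):
        shuffled = list(ratings)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        distances.append(wasserstein_1d(shuffled[:half], shuffled[half:]))
    return sum(distances) / len(distances)

# Hypothetical ratings: both samples have mean 50, but very different spread.
human = list(range(5, 100, 5)) * 2
model = [46, 48, 50, 52, 54]

model_distance = wasserstein_1d(human, model)
baseline = split_half_baseline(human)
print(f"model vs human: {model_distance:.2f}, human split-half: {baseline:.2f}")
```

The split-half figure is the yardstick: a model whose distance to humans is close to the distance between two random halves of the humans is about as human-like, distributionally, as the data can certify.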

That is the first reversal. Models can agree with the average human judgement while failing to reproduce the human population’s spread of judgement. In education, that distinction is the difference between “students like this kind of problem on average” and “this group of students will split sharply, and you should adapt”. One is content ranking. The other is curriculum design.

| Question | What the paper tests | What the evidence supports | What it does not prove |
| --- | --- | --- | --- |
| Do models agree with average human interest ratings? | Correlation between mean human and model ratings on AMC-style problems | Many models broadly track average human interestingness, with strongest results around the Mistral family | That a model captures individual variation or user-specific taste |
| Do models match the distribution of human ratings? | Wasserstein distance between human and model rating distributions | Most individual models do not reproduce the diversity of human judgements | That higher temperature alone solves population-level alignment |
| Can model pooling help? | Combinatorial subsets of models compared against human rating distributions | Selective multi-model pools can approach the human split-half baseline | That “more models” is always better |
| Do models share human reasons for interest? | IMO participant criteria compared with LLM criteria distributions | Most models poorly match the diversity of human rationale patterns | That rating alignment implies rationale alignment |
| Can models generate interesting problems? | Pilot study rating filtered LLM-generated problems | Humans can find filtered LLM-generated problems interesting | That one model can generate, validate, rank, and personalise problems reliably |

Interesting does not mean difficult, despite what models keep implying

A common mistake in AI-for-math discussions is to treat difficulty as a rough substitute for interestingness. Harder must mean more interesting. Sophisticated must mean more valuable. Longer reasoning must mean deeper insight. This is a very machine-like mistake, which is probably why machines are so fond of it.

Humans in the study do not behave that way. Across participants, the correlation between their own interestingness and difficulty ratings averaged 0.47, with a 95% confidence interval of [0.39, 0.55]. That is moderate, not negligible, but far from identity. A problem can be easy and elegant. It can be hard and boring. It can be technically demanding but aesthetically dead on arrival.

Many models, however, show much tighter coupling between difficulty and interest. Reasoning models are especially prone to this, with interestingness-difficulty correlations consistently at or above 0.9. Other LLMs vary more, but most still correlate the two dimensions strongly; Mistral 7B Instruct is one of the notable exceptions at temperature 1.0, with a much lower correlation of 0.41.
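One way to monitor this coupling is to compute, for each rater, the correlation between that rater's interestingness and difficulty ratings, then average across raters. The sketch below does that with a hand-rolled Pearson correlation. The ratings are invented. Using 0.9 as a warning threshold is my assumption, borrowed from the figure quoted above, not something the paper prescribes.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def coupling(raters):
    """Mean per-rater correlation between interest and difficulty."""
    return mean(pearson(r["interest"], r["difficulty"]) for r in raters)

def flags_conflation(value, threshold=0.9):
    """True when interest and difficulty move almost in lockstep."""
    return value >= threshold

# Hypothetical ratings of the same four problems.
human_like = [{"interest": [70, 30, 80, 40], "difficulty": [40, 30, 60, 70]}]
model_like = [{"interest": [20, 40, 60, 80], "difficulty": [25, 45, 55, 85]}]

human_coupling = coupling(human_like)
model_coupling = coupling(model_like)
print(f"human-like: {human_coupling:.2f}, model-like: {model_coupling:.2f}")
```

A judge that trips the flag is, for practical purposes, returning a difficulty score under a different label.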

This is operationally important. A tutoring product that treats interest as difficulty will over-assign hard problems to maintain engagement. A research assistant that treats interest as difficulty will favour complicated directions even when a simple formulation is more generative. A corporate learning platform that treats interest as difficulty will confuse cognitive load with motivation, which is how one ends up with a workforce that is technically “challenged” and spiritually absent.

The paper’s useful lesson is that interestingness should be treated as its own variable. Not as prettified difficulty. Not as usefulness with nicer branding. Not as a proxy for “requires more tokens”. It needs to be measured, calibrated, and monitored separately.

The distribution problem is where product teams should pay attention

The strongest practical result in the paper is not that one model wins. It is that selective pooling of models improves distributional alignment.

The authors evaluate combinations of the 12 models and compare pooled model judgements against human judgement distributions. The best-performing minimal subset consists of Mistral 24B, Mistral 7B, Mixtral 8x7B, and OpenAI o3. This pool reaches a WD of 9.07 with a 95% confidence interval of [8.34, 10.9], roughly matching the human-human split-half baseline of 9.5 [7.8, 11.5]. Other strong pools also include a mix of model families, such as Mistral 7B, o3, and Qwen Instruct.

The disappointing but useful detail: adding more models does not automatically help. Some of the least aligned pools are OpenAI-heavy combinations, including {GPT-5, o3}, which has a WD of 20.6. The authors suggest that intra-family similarities may reduce useful judgement diversity rather than expand it.

That should sound familiar to anyone building AI workflows. Ensemble systems are not magical because they have more models. They work when their errors are usefully different. A room full of confident clones is not a committee. It is just a more expensive echo.

For business systems, the design implication is clear: if AI is selecting educational problems, research directions, product ideas, strategy options, or candidate experiments, a single model judge is a fragile architecture. But a naive multi-model vote is not enough either. The model pool needs to be selected for complementary judgement patterns and calibrated against the target human population.

In other words, “use several models” is not the recommendation. “Use several models whose disagreements resemble the disagreements of the users you care about” is closer.

That is less catchy. It is also more likely to work.
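The selection step can be sketched as an exhaustive search over model subsets, scoring each pooled set of ratings against the human ratings by Wasserstein distance. This is a toy version under stated assumptions: pooling here means concatenating ratings with equal weight, the three "models" and their ratings are invented, and the paper's actual pooling and evaluation details may differ.

```python
from itertools import combinations

def wasserstein_1d(a, b):
    """Area between the empirical CDFs of two 1D samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    area, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        area += abs(i / len(a) - j / len(b)) * (right - left)
    return area

def best_pool(model_ratings, human):
    """Return (distance, subset) for the pool closest to the human ratings."""
    names = sorted(model_ratings)
    best = None
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            pooled = [r for name in subset for r in model_ratings[name]]
            distance = wasserstein_1d(pooled, human)
            # Strict comparison keeps the smallest pool when distances tie.
            if best is None or distance < best[0]:
                best = (distance, subset)
    return best

# Hypothetical ratings: humans split into a low camp and a high camp.
human = [10, 20, 30, 70, 90, 95]
model_ratings = {
    "low_rater": [10, 20, 30],
    "high_rater": [70, 90, 95],
    "high_rater_clone": [72, 88, 96],
}

distance, pool = best_pool(model_ratings, human)
everyone = [r for ratings in model_ratings.values() for r in ratings]
print(pool, distance, wasserstein_1d(everyone, human))
```

In this toy setup the clone adds nothing but weight on the high camp, so the full three-model pool sits further from the humans than the two complementary raters alone.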

Human reasons are more varied than model reasons

The IMO part of the study shifts from ratings to rationales. Participants selected criteria explaining why a problem was interesting or uninteresting. The most frequently selected interestingness reasons were: the problem statement is simple and elegant; the solution does not require sophisticated techniques or theorems; and the solution is elegant.

That list is revealing. Experts were not merely chasing novelty, difficulty, or obscurity. They valued elegance, simplicity, transferability, playfulness, and naturalness. Good mathematics is often interesting because it compresses complexity into a clean idea. Models, by contrast, frequently behave as though complexity itself deserves applause. It does not. Complexity is cheap. Clarity is expensive.

When the same rationale-selection task was given to models, most LLMs showed poor distributional match with human criterion choices. The authors report that only Mistral 7B Instruct and Mistral 24B Instruct reflected the human distributions of interestingness rationales well. Most models selected only one or two importance levels per criterion despite repeated sampling, while human answers spanned the available range.
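A cheap diagnostic for this collapse is to count how many distinct importance levels appear per criterion across repeated samples. The sketch below assumes a 1 to 5 importance scale and uses invented responses. The criterion names come from the list earlier in this article, but the data and the function name are hypothetical.

```python
from collections import Counter

def levels_used(responses):
    """For each criterion, count the distinct importance levels selected."""
    by_criterion = {}
    for response in responses:
        for criterion, level in response.items():
            by_criterion.setdefault(criterion, Counter())[level] += 1
    return {criterion: len(counts) for criterion, counts in by_criterion.items()}

# Hypothetical responses on an assumed 1-5 importance scale.
human_responses = [
    {"elegance": 5, "novelty": 2},
    {"elegance": 3, "novelty": 4},
    {"elegance": 1, "novelty": 5},
    {"elegance": 4, "novelty": 1},
]
model_samples = [
    {"elegance": 4, "novelty": 4},
    {"elegance": 4, "novelty": 4},
    {"elegance": 5, "novelty": 4},
    {"elegance": 4, "novelty": 4},
]

human_levels = levels_used(human_responses)
model_levels = levels_used(model_samples)
print(human_levels, model_levels)
```

A model that keeps landing on one or two levels, however often it is sampled, is reporting a single opinion many times rather than a population of opinions.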

This matters because rationales are not decorative. In a deployed system, the rationale often becomes the interface.

A tutor does not merely choose a problem; it explains why the problem is worth trying. A research copilot does not merely propose a direction; it argues why the direction may be fruitful. A curriculum generator does not merely rank exercises; it needs to justify sequencing. If the model’s stated reasons do not match the reasons humans actually care about, the system may recommend the right item for the wrong reason. That can still work in a benchmark. It fails more quietly in a product.

The paper therefore separates two capabilities that are often lazily bundled together:

1. rating what humans will find interesting;
2. explaining why humans will find it interesting.

The first can be moderately good. The second remains much weaker and more model-dependent.

Reasoning tokens are a signal, not a soul

The study also examines whether large reasoning models spend more reasoning effort on problems they later rate as interesting. This is best read as an exploratory mechanism probe, not the main thesis.

On the Prolific AMC-style dataset, reasoning models tend to make faster judgements for problems they label as uninteresting and produce longer reasoning chains for problems they label as more interesting. That suggests reasoning-token usage might sometimes serve as a proxy for internal interest assessment. The model lingers where it finds something worth processing.

But this pattern breaks down at the IMO level. For harder Olympiad-style problems, longer reasoning is no longer clearly tied to higher interestingness ratings. The authors offer a plausible explanation: for difficult problems, the model may spend most of its effort simply parsing and understanding the task. The resource signal becomes confounded by comprehension difficulty.

That boundary is important. Reasoning-time telemetry may help diagnose model judgement on moderately difficult tasks. It should not be mistaken for a general-purpose curiosity meter. Once tasks become hard enough, “the model thought longer” may just mean “the model was lost in the foyer”.

For product teams, the operational lesson is modest but useful: reasoning-token length can be monitored as a behavioural feature, but it should not be used alone to infer interest, value, novelty, or pedagogical quality.
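As a monitoring feature, this can be as simple as comparing mean reasoning-token counts between problems the model rated as interesting and those it did not, tracked separately per dataset. The sketch below assumes a 0 to 100 rating scale with a cutoff of 50. The records, field names, and cutoff are all invented for illustration.

```python
from statistics import mean

def token_gap(records, cutoff=50):
    """Mean reasoning tokens on high-interest minus low-interest problems.

    Returns None when either group is empty.
    """
    low = [r["reasoning_tokens"] for r in records if r["interest"] < cutoff]
    high = [r["reasoning_tokens"] for r in records if r["interest"] >= cutoff]
    if not low or not high:
        return None
    return mean(high) - mean(low)

# Hypothetical telemetry from one reasoning model on two problem sets.
amc_like = [
    {"interest": 20, "reasoning_tokens": 300},
    {"interest": 35, "reasoning_tokens": 450},
    {"interest": 70, "reasoning_tokens": 900},
    {"interest": 85, "reasoning_tokens": 1200},
]
imo_like = [
    {"interest": 25, "reasoning_tokens": 5200},
    {"interest": 40, "reasoning_tokens": 4800},
    {"interest": 75, "reasoning_tokens": 5100},
    {"interest": 90, "reasoning_tokens": 4700},
]

amc_gap = token_gap(amc_like)
imo_gap = token_gap(imo_like)
print(f"AMC-like gap: {amc_gap}, IMO-like gap: {imo_gap}")
```

Splitting by dataset is the point of the design: a gap that appears on moderate problems and vanishes on hard ones is a reason to treat the signal as conditional, not as a general meter.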

Generation works only after the adults inspect the output

The final part of the paper asks whether models can generate problems humans find interesting. This is where the result becomes both encouraging and awkward.

The authors prompt three models (Mistral 7B Instruct, Qwen 235B Thinking, and OpenAI o3) to generate 90 “interesting, high-school level competition math problems”. They then manually filter invalid problems. That filtering step removes 12 problems, all generated by Mistral 7B Instruct. A pilot group of 30 Prolific participants then rates 24 filtered LLM-generated problems, producing 360 judgements.

Humans do find some generated problems interesting. Mean ratings by generator are broadly similar: Mistral-generated problems receive 59.36 [52.35, 68.09], o3-generated problems 62.46 [56.76, 69.00], and Qwen 235B Thinking problems 62.84 [52.25, 69.83]. There is no significant difference across generators.
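For readers who want to reproduce this style of reporting on their own data, here is a sketch of a percentile bootstrap interval for a mean rating, plus a simple overlap check. The bootstrap is my assumption; the paper's interval method is not stated here. The sample ratings are invented. Overlapping intervals are only an informal cue, not a significance test.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def intervals_overlap(first, second):
    """True when two closed intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Hypothetical ratings for one generator's problems.
ratings = [40, 55, 60, 62, 65, 70, 72, 80, 58, 66]
lower, upper = bootstrap_ci(ratings)
print(f"mean {mean(ratings):.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")

# Intervals reported above for Mistral-generated and o3-generated problems.
print(intervals_overlap((52.35, 68.09), (56.76, 69.00)))
```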

This is promising, but not self-sufficient. The generated problems are evaluated after validity filtering. The study also finds no significant relationship between a model’s alignment as a judge on human-written problems and its performance on LLM-generated problems. Even the model most aligned with human ratings on LLM-generated problems lags far behind the human-model correlations observed on human-written problems.

The paper’s t-SNE analysis adds another clue: human-written and LLM-written problems cluster separately in semantic embedding space. That does not prove LLM problems are worse. It does show they are systematically different. A generator may produce problems from a different region of the problem space than human contest writers, and a judge trained or prompted against human-written problems may not transfer cleanly to judging those model-written problems.

This creates a practical architecture problem. The system business users want is not “a model can generate a few interesting problems after filtering”. They want something closer to:

$$ \text{generate} \rightarrow \text{validate} \rightarrow \text{rank} \rightarrow \text{personalise} \rightarrow \text{adapt after feedback} $$

The paper provides evidence for parts of that chain, not the whole chain. Models can generate candidates. Validity filtering remains essential. Human-aligned judging does not automatically transfer from human-authored problems to model-authored ones. Generator quality and judge quality are not the same capability.

That is a deeply useful negative result. It prevents the obvious but flawed product shortcut: use one strong model to generate the problems and the same model to choose the best ones. Convenient, yes. Reliable, not yet.
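The separation of roles can be sketched as a pipeline in which generation, validation, and ranking are independent components passed in as functions. Everything below is a toy stand-in: the arithmetic "problems", the lookup-table judge, and all names are hypothetical, and the personalise and adapt stages are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    a: int
    b: int
    claimed_sum: int  # the generator's own answer, which may be wrong

def generate():
    """Stand-in generator: returns candidates, one with a wrong answer."""
    return [Candidate(2, 3, 5), Candidate(4, 4, 9), Candidate(10, 7, 17)]

def validate(candidate):
    """Independent check that does not trust the generator's answer."""
    return candidate.a + candidate.b == candidate.claimed_sum

# Hypothetical scores from a separate judge component.
JUDGE_SCORES = {Candidate(2, 3, 5): 0.8, Candidate(10, 7, 17): 0.4}

def judge_score(candidate):
    """Stand-in judge: looks up a score instead of calling a real model."""
    return JUDGE_SCORES.get(candidate, 0.0)

def run_pipeline(generate, validate, score, top_k=2):
    """Generate candidates, drop invalid ones, then rank the survivors."""
    candidates = list(generate())
    valid = [c for c in candidates if validate(c)]
    return sorted(valid, key=score, reverse=True)[:top_k]

selected = run_pipeline(generate, validate, judge_score)
print(selected)
```

Because each stage is a parameter, the generator and the judge can be different models, and the validator can be something that is not a language model at all.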

What the paper directly shows, and what business should infer

The business relevance is not limited to mathematics. Mathematics is the controlled testbed. The underlying problem is selection under subjective value.

Every AI workflow that recommends what to do next faces a version of interestingness judgement. Which lead deserves attention? Which product experiment is promising? Which research direction is worth exploring? Which training module will keep a learner engaged? Which exception case should an analyst inspect? Correctness matters, but so does the system’s model of human value, curiosity, and opportunity.

The paper directly shows that in contest mathematics:

- several LLMs correlate reasonably well with average human interestingness ratings;
- most individual models fail to match the diversity of human judgement distributions;
- selective model pooling can improve distributional alignment to around a human split-half baseline;
- models often conflate interestingness with difficulty more than humans do;
- rationale alignment is weaker and uneven across model families;
- filtered LLM-generated problems can be interesting to humans, but generation, validation, and selection remain distinct problems.

Cognaptus would infer the following for applied AI systems:

| Design decision | Practical implication |
| --- | --- |
| Do not use difficulty as the sole proxy for interest | Engagement systems need separate signals for challenge, curiosity, novelty, elegance, and relevance |
| Calibrate against the target user group | Beginner learners, Olympiad competitors, corporate trainees, and researchers will not share the same taste distribution |
| Prefer selected diversity over raw model count | Multi-model judging is useful only when the model pool contributes complementary judgement patterns |
| Separate generator and judge roles | A model that proposes good candidates may not be the best model to select them |
| Keep validation outside the charm radius | Generated tasks must be checked for correctness, solvability, and suitability before being shown to users |
| Monitor rationales, not just ratings | A correct recommendation with a mismatched explanation can erode user trust and distort learning behaviour |

The uncertainty boundary is equally important. The study uses AMC-style and IMO-style contest mathematics. The Prolific participants were recruited from people with a baseline interest in math. The IMO participants represent a highly specialised population. The generation study is a pilot, and the final evaluated set contains filtered problems rather than raw model output.

So no, this paper does not prove that the same findings transfer neatly to corporate training, scientific discovery, legal research, sales prioritisation, or software architecture. It does, however, expose a pattern those domains should take seriously: judging what is worth attention is not the same as solving what has already been selected.

The real alignment problem is not taste; it is whose taste

The paper ends near a question that deserves more commercial attention than it usually gets: should models align to the variability of human responses, and if so, to which humans?

That is not philosophical garnish. It is product strategy.

An AI tutor aligned to Olympiad contestants may recommend elegant, compact problems that beginners experience as hostile little riddles. A system aligned to average learners may bore advanced students. A research copilot aligned to “popular” interestingness may avoid strange, unfashionable directions that later prove valuable. A corporate learning system optimised for engagement may accidentally underweight productive discomfort.

There is no universal interestingness function waiting politely to be discovered. There are populations, contexts, goals, and expertise levels. The business objective is not to build an AI with “good taste” in the abstract. The objective is to build systems whose selection behaviour is calibrated to the people and outcomes they are meant to serve.

That means collecting preference data from the actual user base. It means testing whether recommendations diversify or narrow attention. It means measuring not only immediate ratings, but downstream effects: persistence, learning gain, discovery quality, user trust, and whether the system keeps proposing the same flavour of problem because one metric smiled at it once.

The future AI assistant will not merely answer questions. It will shape which questions get asked.

That power requires more than benchmark performance. It requires calibrated judgement, population-aware evaluation, and the humility to admit that “interesting” is not a scalar hiding inside the model. It is a relationship between a problem, a person, and a purpose.

LLMs are getting better at solving problems. This paper shows they are only beginning to understand the problem with problems.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shubhra Mishra et al., “A Matter of Interest: Understanding Interestingness Judgments of Math Problems in Humans and Language Models,” arXiv:2511.08548, https://arxiv.org/abs/2511.08548.
