Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

TL;DR for operators

Self-Questioning Language Models, or SQLM, tests a tempting idea: can a language model improve its reasoning ability without being handed a curated training set of questions and answers? The answer in this paper is: partly, in narrow settings, if the training loop is engineered carefully enough.¹

The mechanism is not mystical self-awareness. A model is split into two roles. One role proposes questions from a single topic prompt. The other tries to solve them. Reinforcement learning then updates the system using proxy rewards: majority-vote agreement for arithmetic and algebra, and proposer-generated unit tests for coding. The proposer is rewarded for problems that are not too easy and not too hard; the solver is rewarded for answers that pass the available proxy.

The reported gains are real enough to pay attention to. On Qwen2.5-3B-Instruct, self-play improves three-digit multiplication accuracy from 0.791 to 0.948 and OMEGA linear-equation accuracy from 0.440 to 0.600. On Qwen2.5-Coder-3B-Instruct, Codeforces accuracy rises from 0.320 to 0.391. A format-only reward baseline underperforms, which matters because it suggests the model is learning more than how to wrap an answer in the right tags. Small mercy; sometimes the bar is literally XML-shaped.

For business use, the interesting path is not “models will train themselves, everyone go home.” It is “some domains may need fewer hand-authored practice problems during reasoning post-training.” That could reduce the cost of creating internal training curricula for structured tasks, especially where verification is cheap or proxy scoring is acceptable.

The boundary is equally important. SQLM still needs prompt iteration, parseable output formats, task constraints, and proxy rewards. Majority voting can reinforce shared mistakes. Unit tests can be incomplete or badly generated. The experiments are limited to arithmetic, linear-equation word problems, and constrained list-processing coding tasks on small open models. The paper is promising, but it is not a blank cheque for autonomous model improvement. Blank cheques, as usual, are where engineering teams go to die.

The expensive part is not always the answer

A company fine-tuning a model for reasoning work usually thinks about labels first. Correct answers, grader rubrics, preference comparisons, expert annotations: these are the visible costs. SQLM starts from a different observation. Even if answer labels disappear, the questions themselves still have to come from somewhere.

That is the bottleneck the paper targets. Existing unsupervised reward methods can sometimes train models without ground-truth answers, using confidence, entropy, or majority agreement as a signal. But those methods still assume there is a dataset of prompts or problems to train on. SQLM asks whether the model can generate those prompts too.

The authors frame the input as deliberately minimal: a single high-level topic prompt, such as “generate a three-digit arithmetic problem” or “generate algebra word problems involving linear equations.” From there, the model enters a loop. It asks itself questions, answers them, receives a reward, and updates.

The phrase “self-questioning” sounds like therapy for GPUs. Mechanically, it is more prosaic and more useful: it is online synthetic data generation coupled to reinforcement learning. The proposer creates the curriculum. The solver struggles through it. The reward function decides whether the struggle was useful.

That mechanism is the story. Without it, the paper collapses into a familiar headline: “LLMs improve without labels.” With it, the result becomes more interesting: the paper is trying to replace curated training data with an adaptive question-generation game.

SQLM turns post-training into a two-role game

SQLM uses asymmetric self-play. In classic asymmetric self-play, one agent proposes a task and another tries to solve it. The proposer should not generate impossible tasks, because those teach nothing. It should not generate trivial tasks either, because those also teach nothing. The useful region is the awkward middle: hard enough to expose weakness, easy enough that learning has a foothold.

SQLM imports that idea into language-model post-training.

The loop has two roles:

Role	What it does	What it is rewarded for	Operational interpretation
Proposer	Generates a problem from a topic prompt	Producing problems that are neither unanimously solved nor completely unproductive	Curriculum generator
Solver	Attempts the generated problem	Matching the majority answer, or passing generated unit tests	Trainable problem-solver
Reward function	Converts self-play outcomes into RL feedback	Domain-dependent proxy correctness	Cheap grader, not an oracle

The proposer and solver are both language-model policies. The paper notes that the two roles are initialised from the same pretrained model, and in the experiments they also share weights. This matters because it rules out a lazy interpretation in which a superior teacher model simply spoon-feeds a weaker student. SQLM is closer to a single model being trained through role separation.

The reward design is the hinge.

For arithmetic and algebra, the generator-verifier gap is small. If the model can verify the answer to a three-digit multiplication problem, it is already doing much of the same work required to solve it. So the authors do not ask the proposer to generate a trusted solution. Instead, they sample multiple solver outputs and use majority agreement as a proxy for correctness. Solver outputs that match the majority answer receive reward.

The proposer is then judged by the distribution of solver answers. If every sampled answer agrees, the problem is too easy. If the answers scatter uselessly, the problem is too hard. The intended sweet spot is disagreement with some structure: the solver is not fully confident, but the task is not pure noise.
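A minimal sketch of the two majority-vote rewards, assuming solver outputs have already been parsed into comparable strings. The function names are hypothetical, and the cut-off for "scattered" (no answer appears more than once) is an illustrative assumption, not the paper's exact rule.

```python
from collections import Counter

def majority_answer(answers):
    """Return (answer, count) for the most common sampled answer."""
    return Counter(answers).most_common(1)[0]

def solver_rewards(answers):
    """1.0 for each sample that matches the majority answer, else 0.0."""
    top, _ = majority_answer(answers)
    return [1.0 if a == top else 0.0 for a in answers]

def proposer_reward(answers):
    """1.0 when samples partly agree: not unanimous, not fully scattered."""
    _, top_count = majority_answer(answers)
    unanimous = top_count == len(answers)
    scattered = top_count == 1  # assumed cut-off: no answer repeats
    return 0.0 if unanimous or scattered else 1.0

samples = ["556", "556", "556", "548"]
print(solver_rewards(samples))   # [1.0, 1.0, 1.0, 0.0]
print(proposer_reward(samples))  # 1.0
```

Note that neither function ever sees a ground-truth answer. That is the whole point, and also the whole risk.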

For coding, the generator-verifier gap is larger. Writing correct code is hard; checking code against tests is cheaper. So the proposer generates both a programming problem and five unit tests. The solver receives a reward equal to the fraction of unit tests passed. The proposer receives reward when the solver passes some, but not all, tests.
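The coding rewards can be sketched the same way. This is an illustration under stated assumptions: the solver's program is represented as a plain Python callable, the task and its five tests are hypothetical, and a real pipeline would execute generated code in a sandbox rather than call it directly.

```python
def run_tests(solution, tests):
    """Return the fraction of (input, expected) pairs the solution passes."""
    passed = 0
    for inputs, expected in tests:
        try:
            if solution(list(inputs)) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)

def coding_rewards(solution, tests):
    """Solver reward is the pass fraction; proposer reward needs a partial pass."""
    solver_reward = run_tests(solution, tests)
    proposer_reward = 1.0 if 0.0 < solver_reward < 1.0 else 0.0
    return solver_reward, proposer_reward

# Hypothetical proposer output: "sum all even numbers" with five tests.
tests = [([1, 2, 3, 4], 6), ([], 0), ([7, 9], 0), ([10], 10), ([-2, 5], -2)]

def candidate(xs):
    # Deliberately flawed solver attempt: ignores negative even numbers.
    return sum(x for x in xs if x % 2 == 0 and x > 0)

print(coding_rewards(candidate, tests))  # (0.8, 1.0)
```

The flawed candidate passes four of five tests, so the solver earns 0.8 and the proposer is rewarded for a problem that was neither trivial nor hopeless.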

That split is the paper’s most practical design choice. SQLM is not pretending that one reward trick fits every domain. It is saying: use internal agreement when verification is as hard as generation, and use executable checks when verification is cheaper. For operators, that distinction is the difference between a research toy and a possible workflow pattern.

The main evidence shows gains, but the setting is deliberately narrow

The paper evaluates SQLM on three settings:

Procedurally generated three-digit multiplication, with 4,096 test problems.
Linear-equation word problems from the OMEGA benchmark, with a 100-question test set.
A 123-example Codeforces subset, using the Eurus-2 examples, with Qwen2.5-Coder-3B-Instruct.

The main result table is doing the core evidentiary work.

Model / training setup	Multiplication	Linear equations	Codeforces
Qwen2.5-(Coder)-3B-Instruct	0.791	0.440	0.320
+ self-play	0.948 ± 0.009	0.600 ± 0.010	0.391 ± 0.019
+ self-play, format reward	0.826 ± 0.079	0.553 ± 0.015	N/A

The multiplication improvement is large in absolute terms: +0.157. Algebra improves by +0.160. Coding improves by +0.071. The paper describes these as roughly 14%, 16%, and 7% gains, respectively, without curated training data.
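For readers who want to check the arithmetic, the short sketch below recomputes the gains from the table values. The dictionary keys and the `gains` helper are illustrative names; the relative gain is shown only for comparison and is not a figure the paper reports.

```python
baseline = {"multiplication": 0.791, "linear_equations": 0.440, "codeforces": 0.320}
self_play = {"multiplication": 0.948, "linear_equations": 0.600, "codeforces": 0.391}

def gains(before, after):
    """Return {task: (absolute gain, relative gain)} rounded to 3 decimals."""
    return {
        task: (
            round(after[task] - before[task], 3),
            round((after[task] - before[task]) / before[task], 3),
        )
        for task in before
    }

for task, (absolute, relative) in gains(baseline, self_play).items():
    print(f"{task}: +{absolute:.3f} absolute, +{relative:.1%} relative")
# multiplication: +0.157 absolute, +19.8% relative
# linear_equations: +0.160 absolute, +36.4% relative
# codeforces: +0.071 absolute, +22.2% relative
```

The absolute column matches the differences quoted above. The relative column is a reminder that a "percent gain" means very different things depending on the denominator.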

The format-reward baseline is important because it controls for a very boring failure mode. In arithmetic and algebra, the model must put the final answer in a specific format. A model could appear to improve simply by learning the wrapper. The format baseline rewards only correct formatting, not answer correctness. It reaches 0.826 on multiplication and 0.553 on linear equations, below the full self-play method. That does not prove deep mathematical enlightenment has occurred. It does show that the main gains are not merely the model learning to dress its answer properly for the evaluator, which is a useful anti-nonsense check.

The coding result is more modest and should be read differently. The coding setup is constrained: the proposer is prompted to generate LeetCode-easy-style problems that take a list of integers and output either an integer or another list. That constraint makes the loop parseable and testable. It also means the result is not evidence that SQLM can suddenly generate a rich software-engineering curriculum from thin air. It is evidence that, inside a narrow input-output regime, generated unit tests can provide enough signal to improve a small coder model on a related benchmark.

The appendix adds a small generality check for coding. Llama-3.2-3B-Instruct improves from 0.211 to 0.243, and Llama-3.1-8B-Instruct improves from 0.231 to 0.382. This supports the idea that the method is not exclusive to Qwen, but it is still an exploratory extension, not a broad scaling law. Three models on one coding subset do not make a doctrine.

The proposer is not just making data; it is shaping difficulty

The qualitative examples are easy to dismiss because they look anecdotal. They are not the main evidence. Their likely role is to show how the proposer’s curriculum changes during training.

In arithmetic, the proposer starts with a simple expression:

563 + 247 − 189

By step 10, it generates:

673 − 145 + 98 × 2 ÷ 7

By step 20, it generates:

384 ÷ (52 × 2) + 73 − 111

The paper’s point is not that these are magnificent problems. They are arithmetic snacks. The point is that the generated tasks become more structurally involved: more operations, more ordering constraints, more chances for the solver to stumble.
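One rough way to make "more structurally involved" concrete is to count binary operators and evaluate each expression exactly. The sketch below does that with Python's `ast` and `fractions` modules; operator count is an illustrative proxy chosen here, not a difficulty measure used in the paper.

```python
import ast
from fractions import Fraction

OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}

def normalise(expr):
    """Map the typographic operators to Python's ASCII ones."""
    return expr.replace("−", "-").replace("×", "*").replace("÷", "/")

def evaluate(node):
    """Exactly evaluate a parsed arithmetic expression with Fractions."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant):
        return Fraction(node.value)
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def describe(expr):
    """Return (number of binary operators, exact value)."""
    tree = ast.parse(normalise(expr), mode="eval")
    n_ops = sum(isinstance(n, ast.BinOp) for n in ast.walk(tree))
    return n_ops, evaluate(tree)

for expr in ["563 + 247 − 189", "673 − 145 + 98 × 2 ÷ 7", "384 ÷ (52 × 2) + 73 − 111"]:
    print(expr, "->", describe(expr))
```

Under standard precedence the three expressions have 2, 4, and 4 operators and evaluate to 621, 556, and -446/13 (about -34.3). The last one adds parentheses and a non-integer result, which is exactly the sort of place a small solver stumbles.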

The coding appendix shows a similar pattern. The proposer begins with “square each element of a list,” moves to “sum all even numbers,” and later generates “find the length of the longest contiguous subarray with all unique elements.” That is a meaningful rise in algorithmic structure under the paper’s constrained list-processing format.
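To show how much more algorithmic structure the last task carries than "square each element," here is one possible solution with five tests. Both the sliding-window solution and the tests are written for this article as an illustration; they are not taken from the paper's appendix.

```python
def longest_unique_run(nums):
    """Length of the longest contiguous subarray with no repeated value."""
    last_seen = {}  # value -> most recent index
    start = 0       # left edge of the current window
    best = 0
    for i, value in enumerate(nums):
        if value in last_seen and last_seen[value] >= start:
            start = last_seen[value] + 1  # jump past the earlier duplicate
        last_seen[value] = i
        best = max(best, i - start + 1)
    return best

# Five hypothetical tests, mirroring the five-test format the proposer must emit.
tests = [
    ([1, 2, 3, 1, 2], 3),
    ([5, 5, 5], 1),
    ([], 0),
    ([1, 2, 3, 4], 4),
    ([2, 1, 2, 3, 4, 3], 4),
]
assert all(longest_unique_run(xs) == want for xs, want in tests)
```

Squaring a list is a one-line comprehension. This task needs state, a moving window, and an off-by-one that is easy to get wrong, which is what a useful curriculum step looks like.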

This is where the “self-questioning” metaphor becomes useful. A static synthetic dataset is like printing 6,400 worksheets and hoping they cover the right gradient of difficulty. SQLM instead asks the model to keep adjusting the worksheet as the student changes. Since the student is also the teacher, this is educationally questionable but computationally efficient. A small tragedy of pedagogy; a useful trick for post-training.

The update-frequency test is a sensitivity check, not a second thesis

The proposer update frequency experiment asks how often the proposer should be updated relative to the solver. This is not the main result. It is a sensitivity test for the self-play loop.

Proposer update frequency	Multiplication	Linear equations	Codeforces
Base model	0.791	0.440	0.320
Every 1 step	0.937 ± 0.019	0.556 ± 0.051	0.375 ± 0.050
Every 5 steps	0.948 ± 0.009	0.600 ± 0.010	0.391 ± 0.019
Every 10 steps	0.951 ± 0.012	0.546 ± 0.005	0.324 ± 0.014
Never	0.934 ± 0.025	0.563 ± 0.023	0.343 ± 0.022

The cleanest interpretation: updating the proposer matters, but more frequent updating is not automatically better. Every five steps gives the best algebra and coding result, and a near-best multiplication result. Every ten steps gives the best multiplication number, 0.951, but weaker algebra and coding. Never updating the proposer still improves over the base model, which suggests that simply using model-generated problems plus RL can help. But the adaptive proposer gives a stronger and more stable pattern across tasks.
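The reading above can be checked mechanically from the table means. This sketch uses hypothetical helper names and deliberately ignores the reported ± spreads, so small differences such as 0.951 versus 0.948 should not be treated as decisive.

```python
base = {"multiplication": 0.791, "linear_equations": 0.440, "codeforces": 0.320}
results = {
    "every 1 step": {"multiplication": 0.937, "linear_equations": 0.556, "codeforces": 0.375},
    "every 5 steps": {"multiplication": 0.948, "linear_equations": 0.600, "codeforces": 0.391},
    "every 10 steps": {"multiplication": 0.951, "linear_equations": 0.546, "codeforces": 0.324},
    "never": {"multiplication": 0.934, "linear_equations": 0.563, "codeforces": 0.343},
}

def best_setting(results, task):
    """Setting with the highest mean accuracy on one task."""
    return max(results, key=lambda setting: results[setting][task])

def mean_gain(results, base):
    """Average accuracy gain over the base model, across tasks, per setting."""
    return {
        setting: sum(scores[t] - base[t] for t in base) / len(base)
        for setting, scores in results.items()
    }

for task in base:
    print(task, "->", best_setting(results, task))

overall = mean_gain(results, base)
print(max(overall, key=overall.get))  # every 5 steps
```

Averaged across the three tasks, every five steps comes out ahead, with every ten steps winning only on multiplication.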

For implementation teams, this is the first hint of operational fragility. SQLM is not a one-line recipe. The curriculum generator is itself a moving component. Move it too slowly, and it may not create useful pressure. Move it too quickly, and the solver may chase a shifting distribution before learning much from the last one. Congratulations, we have reinvented lesson planning, but with stochastic policies.
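The cadence itself is simple to express. The sketch below assumes the solver is updated on every step and the proposer on every k-th step, counting from 1; the function names and the step-counting convention are assumptions for illustration, not details confirmed by the paper.

```python
def training_schedule(total_steps, proposer_every):
    """Yield (step, update_proposer) pairs; proposer_every=None means never."""
    for step in range(1, total_steps + 1):
        update_proposer = proposer_every is not None and step % proposer_every == 0
        yield step, update_proposer

def proposer_update_steps(total_steps, proposer_every):
    """List the steps on which the proposer would be updated."""
    return [s for s, update in training_schedule(total_steps, proposer_every) if update]

print(proposer_update_steps(20, 5))     # [5, 10, 15, 20]
print(proposer_update_steps(20, None))  # []
```

The hard part is not the modulo. It is choosing k, which the table suggests is task-dependent.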

Online generation beats “please be diverse” prompting

The diversity experiment is one of the paper’s more business-relevant details because it targets a common enterprise fantasy: ask the model to generate a large synthetic dataset upfront, add the phrase “make it diverse,” and call the procurement team.

The authors compare the main online proposer setup against pre-generating all problems at once. They generate 16 problems per inference call, repeat this 400 times, and create a dataset of 6,400 questions. The prompt explicitly asks for a wide range of difficulty. Despite that, the pre-generated dataset shows reduced diversity and impairs learning on the arithmetic task. A PCA analysis of the generated questions points the same way: online proposer updates produce a broader distribution than pre-generation.
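A team trying this at home can run a much cruder check than PCA before training on a pre-generated pool. The sketch below computes the share of distinct questions after light normalisation; it is a stand-in chosen for illustration, not the paper's diversity analysis, and the sample pool is made up.

```python
import re

def normalise(question):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", question.strip().lower())

def distinct_ratio(questions):
    """Fraction of questions that are unique after normalisation."""
    if not questions:
        return 0.0
    return len({normalise(q) for q in questions}) / len(questions)

calls, per_call = 400, 16
print(calls * per_call)  # 6400, the pre-generated pool size

pool = ["123 * 456", "123  *  456", "789 * 321", "123 * 456"]  # hypothetical pool
print(distinct_ratio(pool))  # 0.5
```

A low ratio is a cheap early warning. A high ratio proves little, since questions can differ on the surface while exercising the same skill.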

This is not proof that every offline synthetic-data strategy is inferior. It is evidence for a narrower claim: in this setup, adaptive generation produces a better learning distribution than asking the model to operationalise abstract instructions like “diverse” and “difficult” in one batch.

That is a useful lesson. Models are often better at responding to local feedback than obeying global dataset-design ideals. “Generate diverse examples” is a vibe. “Generate examples that land in the solver’s current zone of partial failure” is a control signal.

What each experiment is really doing

The paper includes several results, but they should not all be read with the same weight.

Evidence item	Likely purpose	What it supports	What it does not prove
Main Qwen/Qwen-Coder result table	Main evidence	SQLM improves small open models across three constrained reasoning settings	General autonomous learning across open-ended domains
Format reward baseline	Ablation/control	Gains are not only from learning answer formatting	Majority voting always tracks truth
Proposer update frequency	Sensitivity test	Curriculum update cadence affects performance and stability	A universal optimal frequency
Online vs pre-generated questions	Robustness / mechanism probe	Adaptive question generation helps preserve useful diversity	Offline synthetic data is always bad
Llama coding appendix	Exploratory extension	The coding loop can help non-Qwen models	Broad model-family generality
Prompt appendix	Implementation detail	The method still needs carefully constrained prompts and formats	Prompt-free self-training

This separation matters because the paper is easy to overread. The strongest claim is not “LLMs can self-improve indefinitely.” The strongest claim is: under constrained tasks, with carefully designed proxy rewards, a model can generate a useful online curriculum and improve without curated training examples.

That is already interesting. It does not need to be inflated. Inflated claims have a habit of bursting, usually in production.

The business value is cheaper curriculum generation, not magical autonomy

What the paper directly shows is limited but useful. Given a task prompt and a domain-specific reward proxy, SQLM can produce measurable improvements without a curated dataset of training questions and labels. It works across arithmetic, algebra, and constrained coding. It benefits from adaptive proposer updates. It appears to preserve more useful diversity than pre-generating a synthetic dataset in one batch.

What Cognaptus infers for business use is a workflow pattern:

Define a narrow skill domain.
Constrain the problem format.
Let a proposer generate practice tasks.
Use a cheap verifier or proxy reward.
Keep an external holdout set for evaluation.
Audit generated tasks for safety, relevance, and representativeness.

The ROI pathway is not about eliminating humans. It is about moving human effort upstream. Instead of writing thousands of questions and answers, experts define the domain, specify constraints, provide validation rules where possible, and review the generated curriculum distribution.

This could matter in domains such as internal coding assistants, analytics query generation, financial calculation workflows, compliance checklist reasoning, or customer-support troubleshooting flows. The common requirement is structure. SQLM is more plausible where tasks can be formatted, sampled, and checked. It is much less plausible where correctness is interpretive, context-heavy, legally sensitive, or dependent on changing external facts.

A practical enterprise version would not rely on majority vote alone. It would combine SQLM-style generation with external validators: symbolic calculators, unit tests, database checks, policy engines, retrieval-grounded answer checks, and human-reviewed audit sets. The self-play loop can create pressure; it should not be allowed to grade its own ethics homework.

The misconception: self-questioning is not self-grounding

The obvious misconception is that SQLM makes the model truth-grounded by introspection. It does not.

Majority voting measures agreement among sampled model outputs. Agreement can correlate with correctness when the model is reasonably calibrated and the task distribution behaves. It can also reinforce systematic errors. If four sampled answers confidently converge on the same wrong value, SQLM has no built-in external truth mechanism to object. The paper states this limitation plainly: unsupervised approaches are constrained by the lack of ground-truth rewards or perfect verifiers.
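The failure mode is easy to reproduce on paper. In this hypothetical example the solver samples are invented, and the true product is known only because the script computes it, which is precisely what the training loop does not do.

```python
from collections import Counter

true_answer = 123 * 456  # 56088, known here only because we compute it ourselves
samples = [56078, 56078, 56078, 56078, 56088]  # hypothetical solver samples

majority, votes = Counter(samples).most_common(1)[0]
rewards = [1.0 if s == majority else 0.0 for s in samples]

print(majority == true_answer)  # False
print(rewards)                  # [1.0, 1.0, 1.0, 1.0, 0.0]
```

Four wrong answers are rewarded and the one correct answer is penalised. Agreement did its job; it simply was not the job anyone wanted.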

Unit tests are stronger than majority voting but still imperfect. Generated tests can be shallow, incomplete, or biased toward the proposer’s own interpretation of the task. A solver can pass weak tests and still fail real cases. Anyone who has watched production code pass a cheerful little unit suite before detonating in deployment will recognise the genre.
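A small sketch of that gap, assuming a hypothetical task ("return the second-largest distinct value in a list"), a deliberately flawed solution, and two invented test suites: one standing in for proposer-generated tests and one for an independently written audit set.

```python
def pass_rate(solution, tests):
    """Fraction of (input, expected) pairs the solution gets right."""
    return sum(solution(list(xs)) == want for xs, want in tests) / len(tests)

def second_largest(xs):
    # Flawed on purpose: breaks when the largest value is duplicated.
    return sorted(xs)[-2]

# Stand-in for generated tests: every input happens to have distinct values.
generated_tests = [
    ([1, 2, 3], 2),
    ([10, 4], 4),
    ([7, 9, 8], 8),
    ([0, -1], -1),
    ([5, 6, 2, 4], 5),
]
# Stand-in for an independent audit set that probes duplicates.
audit_tests = [([5, 5, 3], 3), ([2, 2, 2, 1], 1)]

print(pass_rate(second_largest, generated_tests))  # 1.0
print(pass_rate(second_largest, audit_tests))      # 0.0
```

Under the generated suite this solution earns full solver reward. Only the audit set, written by someone with a different blind spot, shows the bug.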

The paper also does not eliminate hand-engineering. The authors report that prompt iteration was needed to constrain generation and enforce expected formatting. Coding was especially tricky because the proposer had to emit unit tests in a parseable format. The prompt appendix makes this visible: the coding prompt is not a casual “make me some programming problems.” It specifies input-output structure, requires five test cases, and gives an exact formatting pattern.

So the better mental model is not “the model teaches itself.” It is “the model can help generate its own practice distribution when engineers provide a constrained game and a reward proxy.” Less romantic, more deployable.

Where this applies, and where it does not

SQLM is most relevant when four conditions hold.

First, the target skill must be narrow enough that generated tasks remain on-distribution. Three-digit multiplication and list-processing programming problems qualify. Open-ended legal reasoning does not, unless heavily constrained.

Second, there must be a usable reward proxy. Majority vote is a weak proxy, acceptable mainly when the cost of error during training is manageable and external evaluation remains separate. Unit tests are better, but only in domains where tests are meaningful and cheap.

Third, the organisation must maintain an independent evaluation set. SQLM’s training loop can generate its own tasks, but business users still need external measurement. Otherwise the model can optimise the classroom it invented for itself. That is not learning; that is academic fraud with tensors.

Fourth, generated questions must be filtered. The paper explicitly notes that there is currently no safeguard ensuring that model-generated questions are reasonable, safe, relevant, or interesting. In enterprise settings, “interesting” is optional. Safe and relevant are not.

The immediate opportunity is therefore semi-supervised, not fully autonomous. Use self-questioning to expand practice coverage, surface weaknesses, and reduce the amount of manually authored curriculum. Keep validators and humans in the loop where correctness matters. Especially where money, compliance, safety, or customer trust are involved, which is inconveniently most places businesses care about.

The real contribution is a training pattern

The contribution of SQLM is not one benchmark table. It is a training pattern that connects three ideas: online synthetic data, asymmetric self-play, and unsupervised or cheap verification.

That pattern is valuable because many organisations have expertise but not datasets. They know the kind of reasoning they want from a model. They can describe the task. They may even have validators for pieces of it. But they do not have a large labelled corpus of representative questions and answers sitting politely in a folder named training_data_final_v7_really_final.

SQLM suggests a way to start from less. Not nothing, exactly. The “single prompt” still carries assumptions, constraints, and formatting decisions. The reward function still encodes a view of correctness. The benchmarks still need independent evaluation. But the amount of curated training content can shrink.

That is the commercial lesson: the future of post-training may depend less on collecting ever-larger piles of static examples and more on designing adaptive loops that manufacture useful failure cases. The model does not need to become wise. It needs to be made uncomfortable in the right way, repeatedly, with enough measurement to know whether the discomfort helped.

Conclusion: self-play is a curriculum engine, not a truth machine

SQLM is a clean demonstration of a narrow but important idea: a language model can generate its own practice problems and use proxy rewards to improve on downstream reasoning benchmarks without curated training data. The mechanism is credible because it does not depend on vague self-reflection. It depends on role separation, adaptive difficulty, majority agreement where verification is hard, and unit tests where verification is cheaper.

The paper’s results justify attention, not triumphalism. Multiplication, linear equations, and constrained coding are good proving grounds for a mechanism. They are not the world. The method still needs prompt engineering, output constraints, reward design, evaluation sets, and safeguards against systematic self-reinforcement.

For operators, SQLM should be read as a curriculum-generation architecture. It can reduce the cost of producing reasoning practice. It can help explore task distributions. It may make post-training less dependent on hand-curated examples. But it does not remove the need for external grounding.

The circle is useful. Just do not confuse it with a compass.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak, “Self-Questioning Language Models,” arXiv:2508.03682, 2025, https://arxiv.org/abs/2508.03682.
