Feedback is addictive.
Give an AI agent a tool, an API, a database, a browser, a simulator, or a workflow environment, and the temptation is obvious: let it keep poking the world until something works. It tries. It observes. It corrects. It tries again. Compared with a model sitting alone in a prompt box, imagining every possible transition in its head, this looks much healthier. Less hallucinated planning, more contact with reality. Very grown-up.
Except there is a catch. An agent that can always ask the environment what happened may never learn to predict what will happen. It may become not a planner, but a professional button-masher with better vocabulary.
That is the useful tension behind Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction, which introduces WMAct: world-model internalization through efficient interaction and active reasoning [1]. The paper is not merely saying that multi-turn interaction helps LLM agents. That part is now almost boring. The sharper claim is that interaction only becomes valuable when it is converted into internal structure. The model must not just use feedback to finish the current task. It must compress feedback into a reusable world model, so that later it can act with fewer external checks.
That distinction matters for enterprise AI. A customer-service agent that needs ten tool calls to solve every ticket is not the same product as one that learns the operational shape of the process. A logistics agent that repeatedly queries inventory, routes, and exceptions is not the same as one that forms a reliable internal model of constraints. A workflow agent that “recovers” from errors by repeatedly asking the system what broke is still expensive, slow, and occasionally charming in the way a leaking roof is charming.
WMAct is interesting because it treats interaction as a training medium, not a permanent crutch.
The real problem is not lack of interaction; it is bad interaction
The easy reading of the paper is: “LLM agents should learn by doing.” True, but too soft. The paper’s actual mechanism is more disciplined.
The authors start from a familiar failure mode. In monolithic reasoning, the model receives a task and must generate the entire plan before seeing any environmental feedback. This resembles the classic long chain-of-thought setup: reason first, answer later. For math and coding, this often works because the problem can be internally manipulated. For interactive environments, it becomes fragile. The model must simulate state transitions, remember constraints, anticipate consequences, and avoid compounding its own wrong assumptions.
In the paper’s framing, monolithic reasoning creates a heavy cognitive burden. The agent is forced to imagine the world instead of checking it. Once the internal simulation goes wrong, later reasoning may confidently build on the wrong state. This is not “deep reasoning.” It is a very elaborate wrong turn.
Multi-turn interaction lowers that burden. The agent can act, observe the result, and update its reasoning. In the paper’s environments—Maze, Sokoban, and Taxi—the agent sees text-rendered grid states and outputs actions. The environment validates moves and returns feedback. This gives the model a grounded loop: think, act, observe, revise.
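The think, act, observe, revise loop can be sketched in a few lines. This is a minimal illustration, not the paper's setup: the `Corridor` environment, its feedback strings, and `toy_policy` (a stand-in for the LLM) are all hypothetical names invented here.

```python
from dataclasses import dataclass

# Hypothetical 1-D corridor: the agent starts at cell 0 and must reach `goal`.
MOVES = {"left": -1, "right": 1}


@dataclass
class Corridor:
    length: int = 5
    goal: int = 4
    pos: int = 0

    def render(self) -> str:
        """Text rendering of the state, e.g. 'A...G'."""
        cells = ["."] * self.length
        cells[self.goal] = "G"
        cells[self.pos] = "A"
        return "".join(cells)

    def step(self, action: str) -> str:
        """Validate and apply one action, then return textual feedback."""
        if action not in MOVES:
            return f"invalid action: {action!r}"
        target = self.pos + MOVES[action]
        if not 0 <= target < self.length:
            return "blocked by wall"
        self.pos = target
        return "solved" if self.pos == self.goal else "moved"


def run_episode(env, policy, max_turns):
    """Multi-turn loop: each turn the policy sees the rendered state and the last feedback."""
    feedback, history = "start", []
    for _ in range(max_turns):
        action = policy(env.render(), feedback)
        feedback = env.step(action)
        history.append((action, feedback))
        if feedback == "solved":
            break
    return history


def toy_policy(observation: str, feedback: str) -> str:
    """Stand-in for the LLM: head toward 'G' based on the rendered text."""
    return "right" if observation.index("A") < observation.index("G") else "left"


history = run_episode(Corridor(), toy_policy, max_turns=10)
print(history)
```

The `max_turns` argument is the knob that matters later: it is the interaction budget that a training loop can tighten.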
But interaction introduces two new problems.
First, the agent may brute-force. If success is rewarded but inefficiency is not punished, the model can enumerate actions, stumble around, and still receive credit when it eventually succeeds. That produces trajectories full of redundant actions. The agent completes the task, but the knowledge acquired from the process is noisy and low-quality.
Second, the agent may become feedback-dependent. If the environment is always available, the model can rely on immediate observations instead of learning the underlying dynamics. It becomes reactive. It uses feedback for local correction, not for internalization.
So the paper’s core question is not “Should agents interact?” The answer is already yes. The real question is: how do we make interaction teach the model something portable?
WMAct works by squeezing useful experience out of action
WMAct has two main mechanisms, and they are best understood as two pressures applied to the training loop.
The first pressure rewards effective action. The second pressure gradually reduces dependence on interaction.
Together they create the training curriculum: first let the agent touch the world; then make each touch count; then make the agent need fewer touches.
Reward rescaling makes pointless action expensive
The reward-rescaling mechanism is simple enough to look almost suspicious. WMAct scales the outcome reward by the fraction of actions that actually change the environment state:

$$R_{scaled} = R_{outcome} \cdot \frac{N_{eff}}{N}$$

Here, $R_{outcome}$ is the unscaled outcome reward, $N$ is the total number of actions in an episode, and $N_{eff}$ is the number of effective actions, meaning actions that lead to a different state. If the model solves the task but wastes many moves, its reward is discounted. If it solves the task with purposeful moves, the reward remains high.
This matters because sparse outcome rewards can accidentally bless stupid paths. In an interactive environment, “success” alone does not distinguish between a clean plan and a frantic search that happened to terminate correctly. WMAct’s reward rescaling tells the model: finishing matters, but so does not flailing.
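The ratio described above is easy to compute once an episode's states are logged. The sketch below is an illustration under stated assumptions: the function name, the list-of-states representation, and the zero reward for an empty episode are choices made here, not details taken from the paper.

```python
def rescaled_reward(outcome_reward: float, states: list) -> float:
    """Scale the outcome reward by N_eff / N.

    `states` holds the state before the first action and the state after
    every action, so an episode with N actions has N + 1 entries.
    """
    n_actions = len(states) - 1
    if n_actions <= 0:
        return 0.0  # assumption: an episode with no actions earns nothing
    # An action is "effective" if the state after it differs from the state before it.
    n_effective = sum(
        1 for before, after in zip(states, states[1:]) if before != after
    )
    return outcome_reward * n_effective / n_actions


# Both episodes end in the same solved state "s3".
clean = ["s0", "s1", "s2", "s3"]                            # 3 actions, 3 effective
flailing = ["s0", "s0", "s1", "s1", "s1", "s2", "s3"]       # 6 actions, 3 effective

print(rescaled_reward(1.0, clean))
print(rescaled_reward(1.0, flailing))
```

One caveat of this literal reading: an agent that oscillates between two states changes the state on every move, so a pure state-change test would still count those moves as effective.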
For business readers, this is the first practical lesson. Many deployed agents are evaluated by task completion alone: did the ticket close, did the report generate, did the invoice match, did the customer receive an answer? That metric misses operational waste. An agent can complete a task while burning API calls, database queries, latency, compliance exposure, and human patience. A mature agent metric needs something closer to effective-action density: how many steps actually moved the process forward?
That does not mean every enterprise system should literally use the paper’s state-change ratio. A CRM workflow, fraud review, or procurement pipeline has richer states than a grid world. But the principle transfers cleanly: reward progress, not activity. Otherwise, congratulations, you have automated confusion.
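For a tool-calling agent, one hedged way to approximate effective-action density is to label each call in a trace with whether it advanced the workflow. The `ToolCall` record, the `advanced_workflow` flag, and the tool names below are hypothetical; how that flag gets set is exactly the domain-specific part each team has to define.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    advanced_workflow: bool  # hypothetical label produced by a team-defined progress check


def progress_density(calls: list) -> float:
    """Share of tool calls that moved the process forward (0.0 for an empty trace)."""
    if not calls:
        return 0.0
    return sum(call.advanced_workflow for call in calls) / len(calls)


ticket_log = [
    ToolCall("lookup_customer", True),
    ToolCall("lookup_customer", False),  # repeated query, nothing new learned
    ToolCall("fetch_invoice", True),
    ToolCall("fetch_invoice", False),
    ToolCall("fetch_invoice", False),
]

density = progress_density(ticket_log)
print(density)
```

Two agents can both close this ticket; the one with the higher density did it with less thrashing.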
Frequency annealing prevents feedback addiction
The second mechanism is interaction frequency annealing. WMAct periodically resets the maximum number of allowed interaction turns based on recent behavior, using two statistics: $\bar{L}$, the average number of interaction turns over recent episodes, and $L'_{max}$, the maximum number of turns observed. The new limit is set between the average and the observed maximum.
The idea is not to starve the agent from the beginning. Early in training, the agent needs interaction to explore and collect evidence about state transitions. Later, the allowed interaction budget tightens. This pressure forces the model to solve tasks with less external feedback. If it has learned only to react, performance should collapse. If it has internalized environmental dynamics, it should retain competence even with fewer turns.
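A budget update of this kind can be sketched as follows. Loud assumption: the text above only says the new limit lies between the recent average and the observed maximum, so the midpoint (rounded up) used here is an illustrative choice, as are the function name, the `floor` parameter, and the example windows.

```python
import math


def anneal_turn_limit(recent_turn_counts: list, floor: int = 1) -> int:
    """Return a new turn cap between the recent mean and the recent maximum.

    Assumption: the midpoint, rounded up, stands in for "between the average
    and the observed maximum". `floor` keeps the cap from reaching zero.
    """
    if not recent_turn_counts:
        raise ValueError("need at least one recent episode")
    average = sum(recent_turn_counts) / len(recent_turn_counts)
    observed_max = max(recent_turn_counts)
    return max(floor, math.ceil((average + observed_max) / 2))


# Hypothetical turn counts from three successive windows of training episodes.
windows = [[4, 6, 8, 10], [3, 4, 5, 9], [3, 3, 4, 4]]
schedule = [anneal_turn_limit(window) for window in windows]
print(schedule)
```

Because the cap tracks what the agent actually used, it tightens only as fast as the policy becomes more economical, which is the difference between annealing and a budget imposed from day one.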
This is the part that makes the paper more than another “agents need tools” story. WMAct does not worship interaction. It uses interaction and then slowly takes it away.
There is a useful business analogy here. Training a new employee by shadowing every action is reasonable. Allowing that employee to ask a supervisor every thirty seconds forever is not. The point of supervision is internalization. The same logic applies to agents. Tool use is valuable when it teaches the agent how the process behaves. Tool use becomes a tax when the agent never gets beyond checking everything.
The mechanism chain is the article, not a decoration
The paper’s argument can be compressed into a four-step causal chain:
| Problem in agent training | WMAct pressure | What the model is pushed to learn | Business analogue |
|---|---|---|---|
| Monolithic reasoning forces the model to simulate too much internally | Multi-turn interaction | Use environmental feedback to correct state assumptions | Let agents operate inside real workflow sandboxes, not just prompt-only evaluations |
| Sparse success rewards tolerate brute-force exploration | Reward rescaling by effective actions | Prefer concise, state-changing action sequences | Measure useful progress per tool call, not just final completion |
| Unlimited feedback encourages reactive policies | Interaction frequency annealing | Compress feedback into internal planning ability | Reduce supervision and tool budgets over training/evaluation |
| Weak model priors cannot extract general rules from experience | Dependence on existing reasoning behaviors | Reflection, self-correction, and foresight are prerequisites for internalization | Not every base model becomes an agent just because it gets an API key |
The last row is easy to miss, but it is important. The paper’s ablation on model prior shows that Qwen2.5-7B-Instruct did not exhibit the same trend of multi-turn interaction driving improved single-turn performance. The authors interpret this as evidence that advanced reasoning patterns—reflection, self-correction, strategic foresight—are needed for the model to treat feedback as evidence rather than isolated events.
That should cool down one common enterprise fantasy: “We can take a cheap model, attach tools, add a workflow loop, and get an agent.” Sometimes, yes. Often, no. Interaction supplies evidence. It does not automatically supply the cognitive machinery needed to compress that evidence into a usable world model.
Tiny but important detail. Reality continues to be annoyingly non-magical.
The main results show internalization, not just task completion
The paper evaluates WMAct on three text-rendered grid-world environments:
- Maze, where the agent must navigate through structured paths;
- Sokoban, where the agent must push boxes onto goals while avoiding irreversible deadlocks;
- Taxi, where the agent must follow sequential pickup and dropoff rules.
These are controlled environments, not enterprise systems. But they are not random toy tasks either. Each stresses a different part of sequential reasoning: pathfinding, irreversible planning, and ordered sub-goal execution.
The key evaluation choice is that the reported comparisons are conducted in a single-turn manner for fairness. This matters. If WMAct were evaluated only with many interactive turns, better performance might simply mean better use of feedback during inference. But the paper’s stronger claim is that interaction during training improves the model’s ability to solve tasks even when later evaluated with only one turn. That is the internalization thesis.
The main task results are substantial:
| Method | Sokoban Standard | Sokoban Hard-1 | Sokoban Hard-2 | Maze Standard | Maze Hard | Taxi Standard |
|---|---|---|---|---|---|---|
| Qwen3-8B-Own | 3.29 | 0.84 | 1.39 | 1.95 | 0.20 | 5.60 |
| PPO-EntirePlan | 49.12 | 2.34 | 0.35 | 75.04 | 26.51 | 38.92 |
| PPO-Interactive | 64.21 | 46.83 | 41.26 | 83.74 | 36.52 | 39.16 |
| WMAct | 78.57 | 52.68 | 49.90 | 88.14 | 50.59 | 62.16 |
The obvious headline is that WMAct beats PPO-EntirePlan and PPO-Interactive across these task settings. But the more interesting interpretation is where the gaps appear.
On standard Sokoban, PPO-EntirePlan reaches 49.12 while WMAct reaches 78.57. On Sokoban Hard-1, PPO-EntirePlan collapses to 2.34, while WMAct reaches 52.68. On Sokoban Hard-2, PPO-EntirePlan is nearly dead at 0.35, while WMAct reaches 49.90.
That is not merely a better score. It suggests that monolithic planning can learn something that works in the training-shaped setting, but fails when the environment grows more complex. Sokoban is especially revealing because actions can be irreversible. Push a box into the wrong position and the puzzle may be doomed. This punishes greedy local moves and shallow state tracking. WMAct’s advantage there supports the paper’s claim that interaction can help the model learn more robust long-horizon dynamics.
Maze tells a related but milder story. PPO-EntirePlan is already strong on the standard Maze task at 75.04, but WMAct still improves to 88.14. On the harder Maze variant, the gap widens: 26.51 for PPO-EntirePlan versus 50.59 for WMAct. Again, the interpretation is not “interaction is nice.” It is that training through disciplined interaction appears to transfer better when planning length or structural complexity increases.
Taxi is interesting for another reason. PPO-Interactive barely improves over PPO-EntirePlan on Taxi: 39.16 versus 38.92. WMAct reaches 62.16. This suggests that interaction alone is not sufficient. The structure of interaction matters. Without reward rescaling and frequency annealing, the model may interact but fail to internalize.
The training curves are the central evidence for the paper’s thesis
The task tables show performance. The training curves explain why the paper’s claim is stronger.
The authors compare PPO-EntirePlan, WMAct’s single-turn performance, and WMAct’s multi-turn performance during training. The key observation is that WMAct’s single-turn performance gradually approaches its own multi-turn performance across the environments. In Taxi, the authors report that both single-turn and multi-turn WMAct break through the PPO-EntirePlan performance ceiling.
This is the closest thing to the paper’s “smoking gun.” If multi-turn training only taught the model to rely on feedback, then single-turn evaluation should remain weak. Instead, the model becomes better at solving without the same amount of interaction. The environmental feedback appears to have been compressed into internal planning ability.
That is why “thinking by doing” is not just a slogan. The model first uses action to reduce uncertainty. Then the training pressure forces it to act as if it has learned the pattern behind that uncertainty.
For enterprise AI, this is the difference between two product architectures:
| Architecture | What improves | Hidden cost |
|---|---|---|
| Always-interactive agent | More opportunities to recover from errors | Higher latency, higher tool cost, stronger dependence on environment feedback |
| Internalized world-model agent | Better planning before acting, fewer corrective loops | Requires training/evaluation that rewards useful compression, not just completion |
| Scripted workflow agent | Predictable behavior in known cases | Brittle outside predefined paths |
| WMAct-style training philosophy | Flexible behavior shaped by incentives and interaction budgets | Needs controlled environments and carefully designed progress metrics |
A company deploying agents should care less about whether an agent can eventually finish a process and more about whether it needs fewer interventions over time. This is where the paper’s mechanism becomes operationally meaningful. The question becomes: after exposure to workflow feedback, can the agent perform with fewer tool calls, fewer retries, fewer escalations, and fewer state-inspection queries?
If not, the agent has learned how to consume feedback. It has not learned the process.
The ablations separate useful interaction from noisy activity
The ablation study is not a side dish. It is where the paper defends the claim that the two WMAct mechanisms are doing real work.
On Sokoban, the authors compare PPO-EntirePlan, PPO-Interactive, PPO-Interactive plus reward rescaling, and the full WMAct setup with frequency annealing:
| Method | Standard | Hard-1 | Hard-2 | Likely purpose of test |
|---|---|---|---|---|
| PPO-EntirePlan | 49.12 | 2.34 | 0.35 | Baseline for monolithic reasoning |
| PPO-Interactive | 64.21 | 46.83 | 41.26 | Main comparison showing interaction helps |
| + reward rescaling | 73.68 | 50.78 | 48.05 | Ablation showing effective-action reward adds value |
| + frequency annealing | 78.57 | 52.68 | 49.90 | Ablation showing reduced interaction dependence adds further value |
The first jump—from PPO-EntirePlan to PPO-Interactive—says that interaction itself helps, especially in harder Sokoban settings. The second jump—from PPO-Interactive to reward rescaling—says that purposeful interaction helps more. The final jump—from reward rescaling to frequency annealing—says that forcing reduced dependence improves the final policy further.
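The size of each jump can be read straight off the ablation table. The snippet below only does the subtraction, using the Sokoban numbers as they appear in that table; the variable and function names are made up for the example.

```python
# Sokoban scores copied from the ablation table: (Standard, Hard-1, Hard-2).
STAGES = [
    ("PPO-EntirePlan", (49.12, 2.34, 0.35)),
    ("PPO-Interactive", (64.21, 46.83, 41.26)),
    ("+ reward rescaling", (73.68, 50.78, 48.05)),
    ("+ frequency annealing", (78.57, 52.68, 49.90)),
]
SETTINGS = ("Standard", "Hard-1", "Hard-2")


def stage_gains(stages):
    """Gain of each stage over the previous one, per setting, rounded to 2 decimals."""
    gains = {}
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        gains[name] = tuple(round(c - p, 2) for p, c in zip(previous, current))
    return gains


gains = stage_gains(STAGES)
for name, deltas in gains.items():
    print(name, dict(zip(SETTINGS, deltas)))
```

On these numbers the first jump is by far the largest on the Hard settings (more than 40 points each), while the two WMAct mechanisms add smaller but consistent increments on top.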
This is an important correction to a lazy misconception: more interaction is not automatically better. Interaction can become noise, dependency, or brute force. WMAct improves because it changes the incentive structure around interaction.
The paper also compares frequency annealing with a fixed step penalty. A step penalty of -0.1 reaches 72.43 on Sokoban Standard, 49.32 on Hard-1, and 45.46 on Hard-2. Frequency annealing with $\tau = 100$ reaches 78.57, 52.68, and 49.90. The authors argue that a fixed step penalty can push the agent toward myopic efficiency, while annealing allows broader exploration early and compression later.
That distinction is directly relevant to business automation. Penalizing every action too early can produce agents that are cheap but brittle. Letting agents explore forever produces agents that are robust but expensive and dependent. The practical question is not whether to minimize steps. The practical question is when to minimize steps.
Early training: tolerate exploration. Later training: demand compression. Production: measure whether compression survived contact with the workflow.
Simple. Not easy. Annoying how often those travel together.
General benchmark gains are promising, but should not be oversold
The paper also evaluates a WMAct-Sokoban model on broader benchmarks, comparing it with Qwen3-8B-Own. The reported gains are broad but modest-to-moderate:
| Benchmark | Qwen3-8B-Own | WMAct-Sokoban | Gain |
|---|---|---|---|
| AIME24 | 85.10 | 86.56 | +1.46 |
| AIME25 | 77.92 | 79.48 | +1.56 |
| BeyondAIME | 52.88 | 55.14 | +2.26 |
| HMMT25 | 63.44 | 68.49 | +5.05 |
| GPQA-Diamond | 59.91 | 62.15 | +2.24 |
| LiveCodeBench v5 | 65.32 | 67.14 | +1.82 |
| LiveBench | 67.93 | 69.60 | +1.67 |
| MMLU-Pro | 72.35 | 73.14 | +0.79 |
This is not evidence that Sokoban magically teaches all reasoning. Please, no “box-pushing is the new AGI curriculum” LinkedIn post. Humanity has suffered enough.
The more grounded interpretation is that disciplined interaction training may strengthen reusable planning behaviors: state tracking, foresight, self-correction, and structured decomposition. Sokoban is a good stressor because it involves irreversible consequences and deadlock avoidance. The appendix extends this comparison across WMAct-Maze, WMAct-Taxi, and WMAct-Sokoban. All improve over the base model, but Sokoban leads across most reported metrics, especially HMMT25 and GPQA-Diamond. Taxi slightly leads on LiveCodeBench v5, which the authors plausibly connect to sequential sub-goal planning.
The likely purpose of these tests is exploratory extension, not the main proof. The main proof lives in the controlled agent environments and ablations. The general benchmarks suggest transfer, but they do not establish that WMAct will generalize to arbitrary enterprise tasks.
Still, the direction is interesting. If a training environment pressures the model to manage state, avoid irreversible mistakes, and plan beyond the next move, some of that behavior may appear outside the original environment. That is exactly the kind of transfer agent builders should want—but also exactly the kind they should verify before buying the champagne.
The appendix matters because it tells us what the result is not
The paper’s appendix clarifies several implementation and evaluation details that matter for interpretation.
The environments are converted into standardized text-based ASCII maps. The action spaces are textual and validated. The initial prompt includes a system prompt, environment description, and action prompt; later turns contain feedback and the action prompt. The authors report that the structural prompt components are held identical and not domain-engineered across the three environments.
This helps the paper’s case that the observed behaviors are learned from environment constraints rather than hand-scripted per task. It does not mean the setup is prompt-free or environment-free. It means the prompt structure is standardized, while the environment supplies the dynamics.
Training uses Qwen3-8B-Own and Qwen2.5-7B-Instruct instances. The paper samples 256 trajectories from 32 prompts, with each prompt replicated 8 times. It uses a vLLM decoding backend with temperature 1.0, top-p 1.0, top-k 0, maximum generation length of 16K tokens, and up to 12K tokens per turn. Responses contain `<think>` and `<action>` tags; actions are extracted and executed sequentially. The RL setup uses on-policy PPO over entire trajectories, with GAE configured using $\gamma = 1$ and $\lambda = 1$, and with KL penalty and entropy regularization disabled.
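The GAE setting is worth a second look. With $\gamma = 1$ and $\lambda = 1$, generalized advantage estimation collapses to the undiscounted return-to-go minus the critic's value estimate, so a sparse end-of-episode reward reaches every earlier step at full strength. The sketch below is a generic GAE implementation, not the paper's code; the reward and value numbers are invented for illustration.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one finished episode.

    `values[t]` is the critic's estimate at step t; the value after the
    final step is taken as 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative numbers: a single rescaled outcome reward arrives at the last step.
rewards = [0.0, 0.0, 0.0, 0.75]
values = [0.5, 0.25, 0.5, 0.5]

advantages = gae_advantages(rewards, values)
return_minus_value = [sum(rewards[t:]) - values[t] for t in range(len(rewards))]
print(advantages)
print(return_minus_value)
```

The two printed lists match, which is the point: under this configuration the advantage is driven entirely by the final, possibly rescaled, outcome and the critic's baseline.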
These are not minor footnotes. They define the conditions under which WMAct works. The result is not “any LLM plus any environment plus any RL loop equals internalized world model.” The result is closer to this:
| Evidence component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main Maze/Sokoban/Taxi results | Main evidence | WMAct outperforms monolithic and basic interactive PPO in controlled text-grid tasks | General enterprise reliability |
| Single-turn vs multi-turn curves | Main evidence for internalization | Interaction-trained knowledge can become usable without interaction | Full causal explanation of internal representations |
| Reward rescaling ablation | Ablation | Penalizing ineffective actions improves performance | Best possible reward design |
| Frequency annealing and $\tau$ tests | Ablation / sensitivity test | Gradual interaction reduction improves results and needs scheduling balance | Universal annealing schedule |
| Qwen2.5-7B-Instruct dynamics | Model-prior boundary test | Weaker reasoning priors may fail to internalize feedback | Exact model-size threshold |
| General benchmark results | Exploratory extension | Some reasoning gains transfer beyond training environments | Broad deployment transfer |
| Visual case studies | Qualitative interpretation | WMAct traces show recovery, foresight, and decomposition patterns | Statistical proof on their own |
This is how the paper should be read. The strongest claim is not that WMAct solves agentic AI. The strongest claim is that disciplined interaction can become internalized planning, and the paper provides controlled evidence for that claim.
Business implication: agent ROI depends on interaction discipline
For Cognaptus readers, the business lesson is not “add more tools.” That is the cheapest possible interpretation, and therefore naturally the one many teams will choose first.
The better lesson is that agent systems need an interaction discipline layer. This layer should ask four questions.
First, what counts as a meaningful state change? In the paper, an effective action changes the environment state. In enterprise systems, meaningful progress might be a validated data retrieval, a resolved mismatch, a successful policy check, a reduced uncertainty state, or a completed sub-task. If this is not defined, the agent may optimize for visible activity.
Second, how much exploration is allowed during learning? Agents need enough freedom to discover workflow dynamics. Over-constraining them early can produce brittle scripts dressed up as intelligence. But exploration should not remain unlimited.
Third, when is interaction reduced? WMAct’s annealing mechanism is a training-loop version of “we are taking off the training wheels now.” Enterprise agent evaluation can borrow this idea: after sandbox exposure, reduce tool budgets, context hints, retries, or human assistance and test whether performance survives.
Fourth, do improvements persist under harder variants? The paper’s Hard settings are not decorative. They test whether learned behavior generalizes when map size, object coordination, or structural complexity increases. Enterprise analogues include larger customer accounts, messier invoice histories, more exception types, incomplete records, multilingual documents, or cross-system dependencies.
The practical framework is straightforward:
| Deployment question | WMAct-inspired metric | Why it matters |
|---|---|---|
| Is the agent completing tasks or just thrashing until success? | Effective progress per action/tool call | Separates useful action from operational noise |
| Is the agent learning the workflow or relying on live feedback? | Performance after reduced interaction budget | Tests internalization |
| Does the agent generalize beyond easy cases? | Performance on harder workflow variants | Detects brittle overfitting |
| Does the base model have enough reasoning prior? | Improvement from interaction in single-turn evaluation | Avoids wasting tooling on models that cannot compress experience |
| Is the reward system shaping the right behavior? | Completion + efficiency + error recovery + compliance-safe state changes | Prevents “success” from hiding unsafe or expensive paths |
This is not marketing copy. It is a warning. Enterprise agents will not become reliable merely because we connect them to more systems. More tools can create better grounding, but they can also create more ways to be expensively confused. WMAct’s contribution is to show that feedback must be shaped into competence, not merely consumed as runtime assistance.
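The "performance after reduced interaction budget" row of the table can be turned into a single number. The sketch below is one possible formulation, not a metric from the paper: `retention_under_budget_cut` and the example outcome lists are hypothetical.

```python
import math


def success_rate(outcomes: list) -> float:
    """Fraction of successful runs; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def retention_under_budget_cut(full_budget: list, reduced_budget: list) -> float:
    """Success rate with a reduced tool budget, relative to the full-budget rate.

    Values near 1.0 suggest the agent internalized the workflow; values near
    0.0 suggest it was leaning on live feedback.
    """
    baseline = success_rate(full_budget)
    if baseline == 0.0:
        return 0.0  # assumption: nothing to retain if the agent never succeeds
    return success_rate(reduced_budget) / baseline


# Hypothetical evaluation: 10 tasks with the full tool budget, 10 with half of it.
full = [True] * 8 + [False] * 2
reduced = [True] * 6 + [False] * 4

retention = retention_under_budget_cut(full, reduced)
print(round(retention, 2))
assert math.isclose(retention, 0.75)
```

Tracking this ratio across releases is a cheap way to tell whether an agent is learning the process or only getting better at asking.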
The boundaries are real, and useful
The paper’s boundaries are not embarrassing. They are what make the result interpretable.
The environments are controlled grid worlds, rendered as text. They have clear state transitions, validated action spaces, and objectively measurable success. Enterprise environments are noisier. APIs fail. Data schemas drift. Users contradict themselves. Permissions block actions. Compliance constraints turn “try again” into “please stop before legal notices us.”
The base model setup also matters. The main experiments use Qwen3-8B-Own, which has supervised fine-tuning on proprietary general and reasoning datasets. The weaker Qwen2.5-7B-Instruct case suggests that interaction alone may not produce internalization unless the model already has enough reasoning behavior to interpret feedback productively.
The general benchmark improvements are promising but should be treated as transfer evidence, not a universal law. Gains on HMMT25, GPQA-Diamond, LiveBench, and related benchmarks suggest broader reasoning benefits, but they do not prove that WMAct-trained agents will handle enterprise workflows out of the box.
There is also a measurement boundary. “Effective action” is easy to define in a grid world: did the state change? In business systems, state changes can be harmful, reversible, irrelevant, or administratively noisy. A bad email sent to a client is definitely a state change. Very effective, in the worst possible sense. So enterprise reward design must include usefulness, safety, reversibility, and policy compliance—not just movement.
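In code, that means "effective" stops being a single state-change test and becomes a conjunction of checks. The fields and the gating rule below are illustrative assumptions about what such a definition might include, not a specification from the paper.

```python
from dataclasses import dataclass


@dataclass
class Action:
    changed_state: bool
    useful: bool             # hypothetical: judged by a domain-specific progress check
    policy_compliant: bool   # hypothetical: passed compliance and permission checks
    reversible: bool


def counts_as_effective(action: Action, allow_irreversible: bool = False) -> bool:
    """Grid-world rule (state changed) plus enterprise gates for usefulness and safety."""
    return (
        action.changed_state
        and action.useful
        and action.policy_compliant
        and (action.reversible or allow_irreversible)
    )


bad_client_email = Action(changed_state=True, useful=False,
                          policy_compliant=False, reversible=False)
resolved_mismatch = Action(changed_state=True, useful=True,
                           policy_compliant=True, reversible=True)

print(counts_as_effective(bad_client_email))
print(counts_as_effective(resolved_mismatch))
```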
That is the correct level of caution. Not “this may not work in the real world” as a lazy disclaimer. Rather: the mechanism is portable, but the measurement layer must be rebuilt for each operational domain.
The larger shift: from scripted cognition to shaped behavior
WMAct sits inside a broader shift in agent design. Earlier approaches often tried to improve agents by prescribing cognitive structure: perceive, plan, predict state changes, reflect, then act. Those scaffolds can help, especially when models are weak or environments are sensitive. But they are also rigid. They may teach the model to follow a form rather than discover an efficient strategy.
WMAct takes a different route. It does not force a detailed human-designed reasoning template. It shapes the incentives around action. The model is allowed to learn through interaction, but the training loop makes inefficient action less rewarding and gradually restricts feedback dependence.
That distinction is subtle but important. A scripted agent says: “Think in this format.” A WMAct-style agent-training philosophy says: “Act in the world, but only useful action is rewarded, and eventually you must need less help.”
For business automation, this suggests a future where agent governance is not only about rules and prompts. It will involve environment design, reward design, interaction budgets, escalation policies, and progressive reduction of assistance. The agent’s behavior will be shaped by the operational game it is trained to play.
This does not remove the need for human oversight. It changes where oversight should focus. Instead of obsessing over every reasoning phrase the model emits, builders should ask whether the agent’s training environment rewards the behaviors the business actually wants: fewer redundant actions, better state tracking, safer recovery, lower escalation load, and competence under constrained feedback.
The uncomfortable part is that this requires more than prompt engineering. It requires building realistic environments where agents can safely fail, receive feedback, and be pressured to internalize. The cheaper alternative is to ship agents that keep asking the world what to do next. Some will call that autonomy. The invoices will call it something else.
Conclusion: doing is useful only when it becomes knowing
The strongest idea in WMAct is not that agents should act. Everyone building agents already knows that. The stronger idea is that action must become knowledge.
Multi-turn interaction lowers cognitive burden by grounding the model in feedback. Reward rescaling prevents that feedback loop from degenerating into brute-force exploration. Frequency annealing prevents the model from becoming addicted to environmental cues. The resulting pressure encourages the model to compress experience into a more efficient internal world model.
That is why the paper’s mechanism-first reading matters. The benchmark results make sense only after the causal loop is clear: interaction teaches, reward filters, annealing compresses, and single-turn performance tests whether internalization actually happened.
For enterprise AI, the takeaway is precise. Do not merely count task completions. Count useful progress. Do not merely give agents tools. Measure whether they need fewer tools after learning. Do not merely celebrate recovery. Ask whether the agent is becoming less dependent on recovery.
The next useful generation of business agents will not be the ones that interact the most. They will be the ones that learn from interaction well enough to need less of it.
Cognaptus: Automate the Present, Incubate the Future.
[1] Bao Shu et al., Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction, arXiv:2511.23476, submitted November 28, 2025. https://arxiv.org/abs/2511.23476