Think Longer, Act Worse? What M2A Teaches About Reasoning Agents
A coding agent does not fail only because it cannot think.
Sometimes it fails because it keeps thinking after it should inspect the repository. Sometimes it writes a plausible explanation before checking the relevant file. Sometimes it burns the context window by wandering through hypotheses, each one almost reasonable, none of them decisive. The result is not stupidity in the familiar sense. It is a coordination failure: the model does not know when to reason, when to call a tool, when to absorb feedback, and when to edit.
That is the useful starting point for the paper “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models.”[1] The paper’s headline number is easy to quote: applying M2A to a Qwen3-8B coding agent improves SWE-Bench Verified resolved rate from 44.0% to 51.2%, without additional training. Nice. But the more interesting result is not “more reasoning helps agents.” That would be too convenient, and also wrong.
The paper’s real claim is narrower and more useful: mathematical reasoning can help agentic reasoning only if the integration mechanism preserves the agent’s interaction behavior. In plain terms, a software agent needs its action loop protected. Otherwise, importing a math-style reasoning capability can turn a tool-using agent into a verbose internal monologue with a keyboard nearby.
The mistake is treating “reasoning” as one transferable substance
The paper begins from a distinction that should be obvious but is often ignored in product discussions: mathematical reasoning and agentic reasoning are not the same behavior.
Mathematical reasoning is usually closed-world and single-response. The model receives a problem, reasons internally, and produces a solution. Even when the reasoning trace is long, the loop is mostly inside the model.
Agentic reasoning is different. A coding agent must work in an open environment. It reads an issue, searches a repository, inspects files, runs tests, edits code, receives observations, and revises its plan. The reasoning is not merely a longer text segment. It is interleaved with external action.
That difference matters because the two capabilities can interfere. The paper tests the obvious approach first: multi-task supervised fine-tuning on a mixture of coding-agent and math-reasoning data. The result is a useful warning. The multi-task model reasons longer, but performs worse as an agent.
| Method | Resolved rate | Avg. reasoning length per step | Avg. interaction steps | Interpretation |
|---|---|---|---|---|
| Agent-8B | 44.0% | 253.3 | 175.3 | Baseline coding agent behavior |
| Multi-Task-8B | 41.1% | 347.4 | 150.1 | Longer traces, fewer interactions, worse task resolution |
| M2A-Agent-8B | 51.2% | 327.4 | 178.0 | More reasoning while preserving interaction |
The multi-task result is the misconception trap. If you only look at reasoning length, the model looks improved. If you look at task completion, it is worse. The model has learned a surface pattern: talk more like a reasoning model. It has not learned to use that reasoning inside the inspect-act-observe loop.
This is the point where generic summaries of the paper tend to become useless. “The authors combine mathematical and agentic reasoning” is technically true and operationally vague. The important question is how they combine them without damaging the behavior that makes an agent an agent.
M2A protects the action loop before injecting reasoning
M2A treats the problem as behavior-preserving capability integration. That phrase sounds like academic packaging, but the mechanism is concrete.
The method starts with two models sharing the same base:
- an agent model, trained for coding-agent behavior;
- a reasoning model, stronger in mathematical reasoning.
A naive merge would add the reasoning model’s task vector to the agent model:
$$ \theta_{\text{merge}} = \theta_{\text{agent}} + \alpha \Delta\theta_{\text{reason}} $$
This is attractive because it is training-free. It is also dangerous because it assumes all parameter directions are equally safe to change. They are not. Some directions encode the behavioral transitions that tell the agent when to stop thinking and act, when to call a tool, and when to wait for observations.
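To make the naive merge concrete, here is a minimal sketch using NumPy arrays as stand-ins for model weights. It assumes the standard task-vector definition (fine-tuned weights minus the shared base weights); the function names, the single toy layer, and the value of `alpha` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {name: finetuned[name] - base[name] for name in base}

def naive_merge(agent, base, reasoning, alpha):
    """theta_merge = theta_agent + alpha * (theta_reason - theta_base)."""
    delta_reason = task_vector(reasoning, base)
    return {name: agent[name] + alpha * delta_reason[name] for name in agent}

# Toy weights: one 4x4 layer, with small random offsets playing the role of fine-tuning.
rng = np.random.default_rng(0)
base = {"layer0.weight": rng.normal(size=(4, 4))}
agent = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in base.items()}
reasoning = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in base.items()}

merged = naive_merge(agent, base, reasoning, alpha=0.5)
```

Note that every entry of every weight matrix is shifted by the same rule. Nothing in this update knows which directions carry the agent's turn-taking behavior, which is the weakness the rest of the method addresses.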
M2A therefore inserts a protection step.
First, it builds an agent-critical behavior subspace. The paper identifies behavior markers around the agent’s transitions, such as `<think>`, `</think>`, `<function=`, and `</function>` in the XML-style format, or equivalent tool-call markers in native tool formats. Around these markers, it extracts hidden states from local token neighborhoods. These activations are used to estimate the representation directions most relevant to the think-act-observe behavior.
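The marker-neighborhood idea can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: it assumes each marker arrives as a single token (real tokenizers often split them), and `marker_window_indices`, the toy token list, and the random hidden states are all hypothetical.

```python
import numpy as np

MARKERS = ("<think>", "</think>", "<function=", "</function>")

def marker_window_indices(tokens, markers, radius):
    """Return sorted token positions within `radius` of any marker token."""
    selected = set()
    for pos, tok in enumerate(tokens):
        if any(tok.startswith(m) for m in markers):
            lo = max(0, pos - radius)
            hi = min(len(tokens), pos + radius + 1)
            selected.update(range(lo, hi))
    return sorted(selected)

# Toy trajectory fragment; whitespace-level tokens are an assumption for readability.
tokens = ["I", "should", "check", "</think>", "<function=search>",
          "foo", "</function>", "ok", "done"]
idx = marker_window_indices(tokens, MARKERS, radius=1)

# Stand-in hidden states (one row per token); only rows near markers are kept.
hidden = np.random.default_rng(0).normal(size=(len(tokens), 8))
calibration_states = hidden[idx]
```

The radius controls how much local context around each transition is treated as behavior-critical. A wider window captures more context but also pulls in activations that have little to do with the transition itself.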
Second, it projects the mathematical reasoning update into the null space of that protected subspace. In simplified form, if $Q$ spans the agent-critical subspace, the null-space projector is:
$$ P_{\text{null}} = I - QQ^\top $$
The reasoning update is then modified so it does not perturb those protected directions:
$$ \widetilde{\Delta W}_{\text{reason}} = \Delta W_{\text{reason}} P_{\text{null}} $$
The intuition is not mysterious. Keep the part of the reasoning update that can fit around the agent’s action behavior. Remove the part that would overwrite the transition machinery. The model may think better, but it should not forget how to take turns with the environment.
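A small numerical sketch shows what the projection buys. It assumes weights act as $y = Wx$ with $W$ of shape (output, input), and it estimates $Q$ from the top right singular vectors of a matrix of activations. The SVD construction, the rank of 4, and the random data are assumptions for illustration; the paper's exact procedure for building $Q$ may differ.

```python
import numpy as np

def protected_basis(hidden_states, rank):
    """Orthonormal basis Q (d_in x rank) for the dominant activation directions."""
    # Rows are activations, so the right singular vectors live in input space.
    _, _, vt = np.linalg.svd(hidden_states, full_matrices=False)
    return vt[:rank].T

def project_update(delta_w, q):
    """Return delta_w @ (I - Q Q^T), which ignores inputs lying in span(Q)."""
    p_null = np.eye(q.shape[0]) - q @ q.T
    return delta_w @ p_null

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
hidden = rng.normal(size=(200, d_in))            # stand-in for activations near behavior markers
delta_w_reason = rng.normal(size=(d_out, d_in))  # stand-in for the reasoning update

q = protected_basis(hidden, rank=4)
delta_w_proj = project_update(delta_w_reason, q)

# Inputs inside the protected subspace see no change from the projected update.
effect_on_protected = delta_w_proj @ q
```

For any input in the span of `q`, the projected update contributes nothing, so the merged layer responds to those inputs exactly as the agent layer did. Inputs orthogonal to `q` still receive the full reasoning update.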
Third, M2A adds adaptive layer-wise merging. The authors observe that reasoning and agent task vectors vary in magnitude across layers. A uniform merge coefficient can let the reasoning update dominate the agent update in the wrong places. M2A therefore rescales the reasoning update layer by layer and uses a similarity-aware layer mask, merging only layers where the agent and reasoning task vectors are sufficiently aligned.
That last detail matters. The method is not simply “project then add.” It is “protect the behaviorally critical subspace, calibrate update size, and avoid low-alignment layers.” Less elegant as a slogan, more useful as engineering.
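The layer-wise step can be sketched as follows. The source describes rescaling and a similarity-aware mask but not their formulas, so everything specific here is an assumption: the norm-ratio scaling, the cosine-similarity test, the `sim_threshold` parameter, and the function names are hypothetical stand-ins for whatever the paper actually uses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two arrays, flattened."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def layerwise_merge(agent, delta_agent, delta_reason, beta, sim_threshold):
    """Merge per layer: skip poorly aligned layers, rescale the rest (assumed rule)."""
    merged = {}
    for name, w in agent.items():
        da, dr = delta_agent[name], delta_reason[name]
        if cosine(da, dr) < sim_threshold:
            merged[name] = w.copy()  # masked layer: keep agent weights untouched
            continue
        # Assumed calibration: match the reasoning update's norm to the agent update's norm.
        scale = np.linalg.norm(da) / (np.linalg.norm(dr) + 1e-12)
        merged[name] = w + beta * scale * dr
    return merged

agent = {"a": np.ones((2, 2)), "b": np.ones((2, 2))}
delta_agent = {"a": np.full((2, 2), 0.1), "b": np.full((2, 2), 0.1)}
delta_reason = {"a": np.full((2, 2), 1.0), "b": np.full((2, 2), -1.0)}

merged = layerwise_merge(agent, delta_agent, delta_reason, beta=0.5, sim_threshold=0.0)
```

In this toy case layer `a` is aligned and receives a reasoning update scaled down to the size of the agent update, while layer `b` points the opposite way and is left alone. In the full method, the update passed in would already be the null-space-projected one.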
The main evidence is not just the 51.2% result
The paper’s main benchmark is SWE-Bench Verified under the OpenHands scaffold. This is a coding-agent setting where the model must resolve real GitHub issues through multi-turn interaction. The reported metric is resolved rate, averaged over three independent runs.
The main result table compares M2A with SFT and several model-merging baselines.
| Model or method | Resolved rate (%) | Avg. reasoning length per step | Avg. interaction steps | What the result suggests |
|---|---|---|---|---|
| Agent-8B | 44.0 ± 0.9 | 253.3 | 175.3 | Strong baseline agent behavior |
| Reasoning-8B | 0.2 ± 0.1 | 1800.7 | 9.2 | Math reasoning alone is not agent behavior |
| Multi-Task-8B | 41.1 ± 1.4 | 347.4 | 150.1 | Longer reasoning can reduce agent effectiveness |
| Task Arithmetic | 47.6 ± 0.9 | 836.9 | 87.1 | Gains, but interaction frequency is nearly halved |
| TIES-Merging | 39.0 ± 0.9 | 494.6 | 112.1 | Merge interference hurts performance |
| DARE | 22.0 ± 0.8 | 1219.1 | 68.6 | Severe behavior collapse |
| SLERP | 47.2 ± 0.7 | 747.4 | 87.6 | Similar tradeoff: more internal reasoning, fewer turns |
| RAIN-Merging | 43.2 ± 1.1 | 262.7 | 170.4 | Preserves agent behavior but adds little useful reasoning |
| M2A-Agent-8B | 51.2 ± 0.6 | 327.4 | 178.0 | Stronger reasoning without suppressing interaction |
The contrast with the Reasoning-8B model is almost comic, in the dry way benchmark tables sometimes are. The reasoning model produces extremely long reasoning traces and barely interacts. Its resolved rate is 0.2%. This is not a small underperformance. It is a reminder that a single-response reasoning model dropped into an agent environment is not a software engineer. It is a lecturer trapped in a terminal.
The naive merging baselines are subtler. Task Arithmetic and SLERP do improve over the base agent, but they do so while sharply reducing interaction steps and increasing reasoning length. The model reasons much more per step, but acts far less often. This can help in some cases, but it is not stable agentic synergy. It is more like replacing many short inspect-edit cycles with fewer, heavier monologues.
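The tradeoff is easier to see as ratios against the base agent. The snippet below only re-expresses numbers from the table above; the last value, reasoning length times steps, is a rough proxy for total reasoning per trajectory and is an assumption of this sketch, since a product of averages is not the same as an average of per-trajectory totals.

```python
# (avg. reasoning length per step, avg. interaction steps), copied from the results table.
rows = {
    "Agent-8B": (253.3, 175.3),
    "Task Arithmetic": (836.9, 87.1),
    "SLERP": (747.4, 87.6),
    "M2A-Agent-8B": (327.4, 178.0),
}

base_len, base_steps = rows["Agent-8B"]
ratios = {}
for name, (length, steps) in rows.items():
    ratios[name] = (length / base_len, steps / base_steps, length * steps)
    len_ratio, step_ratio, volume = ratios[name]
    print(f"{name:16s} length x{len_ratio:.2f}  steps x{step_ratio:.2f}  "
          f"approx. volume {volume:,.0f}")
```

Task Arithmetic and SLERP reason roughly three times as much per step while taking about half as many steps. M2A reasons about 1.3 times as much per step with an essentially unchanged step count.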
RAIN-Merging shows the opposite failure mode. It preserves behavior but does not inject enough useful reasoning to improve resolved rate. That makes it a useful comparison: preserving behavior alone is not sufficient; injecting reasoning alone is not sufficient. M2A’s contribution is the combination.
The ablations identify the mechanism, not just the decoration
The paper’s ablation study is important because M2A has several moving parts. The authors remove the similarity-aware layer mask, remove merge coefficient calibration, and remove null-space projection.
| Variant | Resolved rate | Likely purpose of test | What it supports | What it does not prove |
|---|---|---|---|---|
| Base Agent | 44.0% | Reference point | Baseline coding-agent performance | Nothing about M2A components |
| Full M2A | 51.2% | Main method | Combined mechanism improves performance | Generality beyond tested setting |
| Without layer mask | 49.8% | Ablation | Selecting aligned layers helps | That this mask is globally optimal |
| Without coefficient calibration | 47.7% | Ablation | Layer-wise scale balancing matters | That this calibration is the only viable one |
| Without null-space projection | 45.6% | Core ablation | Protecting agent-critical behavior is central | That all agent behaviors are captured by the marker subspace |
The largest degradation comes from removing null-space projection. That is the paper’s strongest internal evidence for its mechanism. If the projection is removed, performance falls close to the base agent. This supports the claim that naive reasoning injection interferes with agent-critical behavior.
The layer mask and coefficient calibration also matter, but they look more like stabilizers. They improve performance by controlling where and how strongly reasoning is injected. The null-space projection is the conceptual core: it is the part that turns “add reasoning” into “add reasoning without damaging the action loop.”
Merge strength becomes a behavior control, not a roulette wheel
One of the more practical parts of the paper is the analysis of merge strength, denoted $\beta$.
In many model-merging workflows, the merge coefficient is a fragile hyperparameter. A slightly stronger merge may help, or it may collapse behavior. The paper shows this pattern for standard merging baselines: reasoning length can jump abruptly toward response-length limits, and performance can collapse when the reasoning model begins to dominate the agent model.
M2A behaves differently. As $\beta$ increases, average reasoning length grows approximately monotonically and becomes nearly linear in the high-performing region. The authors divide the behavior into three regimes:
| Regime | Merge strength behavior | Agent behavior | Practical reading |
|---|---|---|---|
| Low reasoning | Weak reasoning injection | Agent behavior mostly preserved, reasoning remains limited | Safe but underpowered |
| Medium reasoning | Balanced reasoning and action | Strongest performance region | The useful operating zone |
| High reasoning | Reasoning keeps growing | More context pressure and prompt-length failures | Verbosity starts taxing the system |
This matters because a business user does not want only the best benchmark number. They want a control surface. If the same parameter that improves reasoning also unpredictably destroys interaction, it is not a product knob; it is a liability with a Greek letter.
M2A’s $\beta$ is not a complete production policy. The paper itself notes that regime boundaries are empirically selected. Still, the result is useful: after behavior-preserving projection, merge strength becomes more interpretable. It adjusts reasoning intensity without immediately suppressing the external interaction loop.
The trajectory analysis explains what “better reasoning” actually means
The paper’s trajectory-level analysis is where the benchmark number becomes operationally meaningful.
After applying M2A, the agent does not simply produce more tokens everywhere. It reasons more at decision-critical moments, especially around edit steps and late-stage synthesis. It also searches and inspects more before the first edit, while issuing fewer edits overall.
| Metric | Base Agent | M2A | Relative change | Interpretation |
|---|---|---|---|---|
| Resolved rate | 44.0% | 51.2% | +16.4% | More issues solved |
| Reasoning at edit | 305.3 | 401.4 | +31.5% | More reasoning when modifying code |
| Late-stage synthesis | 189.6 | 230.7 | +21.7% | Better final integration before completion |
| Search before first edit | 19.5 | 25.3 | +30.0% | More evidence gathered before intervention |
| Inspect before first edit | 10.4 | 12.3 | +18.2% | More file-level grounding |
| Edit count | 20.1 | 16.7 | -17.0% | Fewer edit operations |
| Unique edit files | 6.3 | 6.0 | -5.0% | Slightly more concentrated edits |
This is the behavioral story: M2A shifts the agent from trial-and-error editing toward evidence-grounded action.
That phrase could easily become fluff, so the appendix case studies help. In a Sphinx issue, the baseline explores many related paths but fails after exceeding the prompt-length limit. M2A identifies the semantic mismatch between `:return:` and `:returns:`, applies a localized patch, and verifies it. In a scikit-learn issue, the baseline nearly finds the fix but replaces an existing constructor argument; M2A preserves the existing `fit_intercept` argument while adding `multi_class`.
The lesson is not that M2A makes agents “more thoughtful” in a vague human sense. It makes reasoning more useful at the point of commitment. The agent gathers evidence before editing, forms a tighter problem representation, applies fewer changes, and verifies the result. That is the difference between a verbose assistant and a working agent.
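Teams that want to track the same behavioral signals on their own agents can compute them from an action log. The sketch below assumes a hypothetical log schema, an ordered list of `(kind, target)` tuples with kinds such as `search`, `inspect`, `edit`, and `test`. It is not the OpenHands trajectory format, and the sample trajectory is invented for illustration.

```python
def trajectory_metrics(actions):
    """Summarize one trajectory given an ordered list of (kind, target) tuples."""
    # Index of the first edit; if the agent never edits, count the whole trajectory.
    first_edit = next(
        (i for i, (kind, _) in enumerate(actions) if kind == "edit"),
        len(actions),
    )
    before_edit = actions[:first_edit]
    edited_files = [target for kind, target in actions if kind == "edit"]
    return {
        "search_before_first_edit": sum(1 for kind, _ in before_edit if kind == "search"),
        "inspect_before_first_edit": sum(1 for kind, _ in before_edit if kind == "inspect"),
        "edit_count": len(edited_files),
        "unique_edit_files": len(set(edited_files)),
    }

# Invented example trajectory (hypothetical file names and queries).
actions = [
    ("search", "fit_intercept"),
    ("inspect", "a.py"),
    ("search", "multi_class"),
    ("edit", "a.py"),
    ("test", "tests/"),
    ("edit", "a.py"),
    ("inspect", "b.py"),
]
metrics = trajectory_metrics(actions)
```

Averaging these per-trajectory values before and after a model change gives a cheap check on whether an upgrade shifted the agent toward more grounding and fewer edits, or merely toward longer text.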
The business value is cheaper capability integration, not magic autonomy
For enterprise AI teams, the paper suggests a practical design principle: do not evaluate agent upgrades only by internal reasoning quality. Evaluate whether the upgrade preserves the action loop.
A coding or workflow agent is a coupled system:
observe → reason → act → observe again → revise → act again
If a capability upgrade improves the “reason” box but weakens the transitions between boxes, the agent may look smarter in logs while performing worse in production. This is exactly the kind of failure that can survive a demo and die in a real workflow.
M2A is interesting because it offers a lower-cost way to integrate reasoning capability. The paper reports that Agent-8B SFT used 32 × H200 GPUs for about 30 hours, Multi-Task-8B SFT used 32 × H200 GPUs for about 48 hours, while M2A calibration and merging used 8 × H200 GPUs for about 2 hours. That is still heavy infrastructure by normal business standards. But relative to full training or RL, it is a very different cost profile.
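The cost gap is easiest to read in GPU-hours. This is simple arithmetic on the figures quoted above, which the source gives as approximate durations, so the results are approximate too.

```python
# (number of H200 GPUs, approximate hours), as quoted from the paper.
runs = {
    "Agent-8B SFT": (32, 30),
    "Multi-Task-8B SFT": (32, 48),
    "M2A calibration + merge": (8, 2),
}

gpu_hours = {name: gpus * hours for name, (gpus, hours) in runs.items()}
m2a_cost = gpu_hours["M2A calibration + merge"]

for name, cost in gpu_hours.items():
    print(f"{name:26s} {cost:5d} GPU-hours  ({cost / m2a_cost:.0f}x M2A)")
```

That works out to about 960 and 1,536 GPU-hours for the two SFT runs against about 16 for M2A, a gap of roughly 60 to 96 times. M2A still presupposes that the agent and reasoning models already exist, so this compares the incremental integration step, not the total cost of building the agent.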
The practical implication is not “any company can now cheaply build frontier coding agents.” Calm down. The implication is more specific: for teams already maintaining specialized models, behavior-preserving merging could become a useful middle layer between prompt engineering and full retraining.
A sensible production translation would look like this:
| Paper result | Business interpretation | Boundary |
|---|---|---|
| M2A improves SWE-Bench Verified resolved rate from 44.0% to 51.2% | Reasoning capability can be integrated without full retraining | Shown for Qwen3-8B coding agents under this benchmark and scaffold |
| Multi-task SFT increases reasoning length but lowers resolved rate | Training mixtures can transfer surface style instead of useful behavior | Does not rule out better-designed multi-task training |
| Null-space projection is the most important ablated component | Protecting action-critical behavior matters | Marker-based subspaces may not capture all production behaviors |
| Medium merge strength performs best | Reasoning intensity needs tuning | Boundaries are empirical, not automatically derived |
| M2A uses fewer edits after more inspection | Better agents may act less impulsively, not merely think longer | Evidence comes from coding-agent trajectories |
For business process automation beyond coding, the same principle applies, but the implementation would need adaptation. A finance reconciliation agent, customer-support agent, or procurement workflow agent also has behavior-critical transitions: when to ask for clarification, when to call an API, when to escalate, when to write to a system of record, and when to stop. But the markers and calibration trajectories would not be identical to software-engineering tool calls.
That is the real business relevance: M2A gives a language for preserving operational behavior while injecting cognitive capability.
The boundary conditions are important and not especially mysterious
The paper is careful about its own boundaries, and those boundaries matter for interpretation.
First, the experiments are concentrated on Qwen3-8B and SWE-Bench Verified. This is a strong coding-agent benchmark, but it is still one domain. The paper does not prove that the same method will work equally well for browser agents, spreadsheet agents, finance agents, robotics agents, or larger model families.
Second, the method depends on calibration trajectories and behavior markers. The authors use 100 code-agent trajectories and show that 100 performs best among the tested sizes: 50 gives 49.1%, 100 gives 51.2%, 200 gives 51.0%, and 300 gives 50.8%. They also test marker window radius, with $r=3$ giving 51.0%, $r=5$ giving 51.2%, and $r=10$ giving 49.5%. These are reassuring robustness checks, not a universal recipe.
Third, the method is training-free but not compute-free. It avoids gradient updates, optimizer states, and RL rollouts, but it still requires forward-pass calibration, parameter-space merging, and serious GPU memory. “Training-free” is not the same thing as “laptop-friendly.” A small detail, but reality often lives in small details.
Fourth, stronger coding agents carry deployment risk. If agents can modify repositories more effectively, they can also make unintended modifications more effectively. For real production systems, sandboxing, access control, logging, human approval for consequential changes, and security-oriented evaluation remain part of the system design. The model merge does not merge away governance.
What Cognaptus would take from this paper
The direct paper result is clear: M2A improves one Qwen3-8B coding agent on SWE-Bench Verified by protecting behavior-critical directions while injecting mathematical reasoning through null-space projected model merging.
The Cognaptus inference is broader but bounded: in enterprise agent design, the central upgrade problem is not simply adding intelligence. It is adding intelligence without damaging workflow behavior.
This distinction changes what teams should measure. It is not enough to ask:
- Does the model produce longer reasoning?
- Does it score better on math?
- Does it explain itself more convincingly?
The better questions are:
- Does it inspect the right information before acting?
- Does it preserve tool-use timing?
- Does it escalate at the right boundary?
- Does it reduce unnecessary edits or actions?
- Does reasoning appear at decision-critical steps rather than everywhere?
- Does a tuning knob change behavior smoothly, or does it collapse the workflow?
This is where M2A is useful as more than a model-merging trick. It reframes reasoning integration as behavioral surgery. Add the new capability, but avoid cutting the nerves that control action.
Conclusion: the agent should think more, but only in the right places
The paper’s best lesson is almost anti-slogan: more reasoning helps only when the agent still behaves like an agent.
M2A succeeds because it does not worship longer chain-of-thought traces. It protects the think-act-observe loop, projects reasoning updates away from behavior-critical directions, calibrates layer-wise merge strength, and then tunes reasoning intensity through a more stable control knob. The result is not merely a model that talks more before editing. It is a model that searches more before the first edit, reasons more when changes matter, edits less often, and solves more tasks.
That is a better mental model for enterprise AI agents. The future is not just models that think harder. It is systems that know when thinking should become action, and when action should wait for evidence.
The glamorous part is reasoning. The valuable part is coordination. Naturally, the benchmark table hides that in the middle, because tables have no sense of drama.
Cognaptus: Automate the Present, Incubate the Future.
[1] Junjian Wang, Xin Zhou, Qiran Xu, and Kun Zhan, “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models,” arXiv:2605.09879, 2026.