Think Longer, Act Worse? What M2A Teaches About Reasoning Agents

A coding agent does not fail only because it cannot think.

Sometimes it fails because it keeps thinking after it should inspect the repository. Sometimes it writes a plausible explanation before checking the relevant file. Sometimes it burns the context window by wandering through hypotheses, each one almost reasonable, none of them decisive. The result is not stupidity in the familiar sense. It is a coordination failure: the model does not know when to reason, when to call a tool, when to absorb feedback, and when to edit.

That is the useful starting point for the paper “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models.”¹ The paper’s headline number is easy to quote: applying M2A to a Qwen3-8B coding agent improves SWE-Bench Verified resolved rate from 44.0% to 51.2%, without additional training. Nice. But the more interesting result is not “more reasoning helps agents.” That would be too convenient, and also wrong.

The paper’s real claim is narrower and more useful: mathematical reasoning can help agentic reasoning only if the integration mechanism preserves the agent’s interaction behavior. In plain terms, a software agent needs its action loop protected. Otherwise, importing a math-style reasoning capability can turn a tool-using agent into a verbose internal monologue with a keyboard nearby.

The mistake is treating “reasoning” as one transferable substance

The paper begins from a distinction that should be obvious but is often ignored in product discussions: mathematical reasoning and agentic reasoning are not the same behavior.

Mathematical reasoning is usually closed-world and single-response. The model receives a problem, reasons internally, and produces a solution. Even when the reasoning trace is long, the loop is mostly inside the model.

Agentic reasoning is different. A coding agent must work in an open environment. It reads an issue, searches a repository, inspects files, runs tests, edits code, receives observations, and revises its plan. The reasoning is not merely a longer text segment. It is interleaved with external action.

That difference matters because the two capabilities can interfere. The paper tests the obvious approach first: multi-task supervised fine-tuning on a mixture of coding-agent and math-reasoning data. The result is a useful warning. The multi-task model reasons longer, but performs worse as an agent.

Method	Resolved rate	Avg. reasoning length per step	Avg. interaction steps	Interpretation
Agent-8B	44.0%	253.3	175.3	Baseline coding agent behavior
Multi-Task-8B	41.1%	347.4	150.1	Longer traces, fewer interactions, worse task resolution
M2A-Agent-8B	51.2%	327.4	178.0	More reasoning while preserving interaction

The multi-task result is the misconception trap. If you only look at reasoning length, the model looks improved. If you look at task completion, it is worse. The model has learned a surface pattern: talk more like a reasoning model. It has not learned to use that reasoning inside the inspect-act-observe loop.

This is the point where generic summaries of the paper tend to become useless. “The authors combine mathematical and agentic reasoning” is technically true and operationally vague. The important question is how they combine them without damaging the behavior that makes an agent an agent.

M2A protects the action loop before injecting reasoning

M2A treats the problem as behavior-preserving capability integration. That phrase sounds like academic packaging, but the mechanism is concrete.

The method starts with two models sharing the same base:

an agent model, trained for coding-agent behavior;
a reasoning model, stronger in mathematical reasoning.

A naive merge would add the reasoning model’s task vector to the agent model:

$$ \theta_{\text{merge}} = \theta_{\text{agent}} + \alpha \Delta\theta_{\text{reason}} $$

This is attractive because it is training-free. It is also dangerous because it assumes all parameter directions are equally safe to change. They are not. Some directions encode the behavioral transitions that tell the agent when to stop thinking and act, when to call a tool, and when to wait for observations.
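To make the naive merge concrete, here is a minimal sketch with checkpoints represented as dictionaries of NumPy arrays. The function names and the toy 2×2 weights are illustrative assumptions, not anything taken from the paper's code.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {name: finetuned[name] - base[name] for name in base}

def naive_merge(agent, reason, base, alpha):
    """theta_merge = theta_agent + alpha * (theta_reason - theta_base)."""
    delta_reason = task_vector(reason, base)
    return {name: agent[name] + alpha * delta_reason[name] for name in agent}

# Toy stand-in for a checkpoint: a single 2x2 weight matrix (made-up values).
base = {"layer0.weight": np.zeros((2, 2))}
agent = {"layer0.weight": np.array([[1.0, 0.0], [0.0, 1.0]])}
reason = {"layer0.weight": np.array([[0.0, 2.0], [2.0, 0.0]])}

merged = naive_merge(agent, reason, base, alpha=0.5)
```

Every entry of every matrix is shifted by the same rule, which is exactly the problem: nothing in this arithmetic knows which directions the agent depends on.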

M2A therefore inserts a protection step.

First, it builds an agent-critical behavior subspace. The paper identifies behavior markers around the agent’s transitions, such as <think>, </think>, <function=, and </function> in the XML-style format, or equivalent tool-call markers in native tool formats. Around these markers, it extracts hidden states from local token neighborhoods. These activations are used to estimate the representation directions most relevant to the think-act-observe behavior.
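One plausible way to picture this step is sketched below. It assumes hidden states are already available as a NumPy array with one row per token, that markers can be recognized by token id, and that the dominant directions are taken from an SVD of the selected activations. The SVD choice, the function names, and the toy ids are assumptions for illustration; the paper's exact estimator may differ.

```python
import numpy as np

def marker_windows(token_ids, marker_ids, radius):
    """Sorted indices of tokens within `radius` positions of any marker token."""
    keep = set()
    for pos, tok in enumerate(token_ids):
        if tok in marker_ids:
            lo = max(0, pos - radius)
            hi = min(len(token_ids), pos + radius + 1)
            keep.update(range(lo, hi))
    return sorted(keep)

def behavior_subspace(hidden_states, positions, rank):
    """Orthonormal basis Q (d x rank) for the top directions of the selected activations."""
    acts = hidden_states[positions]  # shape (n_selected, d)
    # Right singular vectors are directions in hidden space, ordered by strength.
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:rank].T  # shape (d, rank), columns orthonormal

# Toy example: 99 and 98 stand in for marker token ids (hypothetical values).
rng = np.random.default_rng(0)
token_ids = [5, 7, 99, 8, 5, 6, 98, 4]
hidden = rng.normal(size=(len(token_ids), 16))  # random stand-in for real hidden states
positions = marker_windows(token_ids, {98, 99}, radius=1)
Q = behavior_subspace(hidden, positions, rank=3)
```

In a real pipeline the rows of `hidden` would come from forward passes over calibration trajectories, collected per layer.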

Second, it projects the mathematical reasoning update into the null space of that protected subspace. In simplified form, if $Q$ spans the agent-critical subspace, the null-space projector is:

$$ P_{\text{null}} = I - QQ^\top $$

The reasoning update is then modified so it does not perturb those protected directions:

$$ \widetilde{\Delta W}_{\text{reason}} = \Delta W_{\text{reason}} P_{\text{null}} $$

The intuition is not mysterious. Keep the part of the reasoning update that can fit around the agent’s action behavior. Remove the part that would overwrite the transition machinery. The model may think better, but it should not forget how to take turns with the environment.
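A small numerical check makes the projection tangible. The sketch assumes a weight matrix that acts on inputs as $Wx$ and a basis $Q$ with orthonormal columns; the random matrices are stand-ins, not real model weights.

```python
import numpy as np

def null_space_projector(Q):
    """P_null = I - Q Q^T for Q with orthonormal columns."""
    d = Q.shape[0]
    return np.eye(d) - Q @ Q.T

def project_update(delta_w, Q):
    """Right-multiply by P_null so inputs inside span(Q) see no change from the update."""
    return delta_w @ null_space_projector(Q)

rng = np.random.default_rng(0)
d_in, d_out, k = 6, 4, 2
Q, _ = np.linalg.qr(rng.normal(size=(d_in, k)))  # orthonormal basis, shape (d_in, k)
delta_w = rng.normal(size=(d_out, d_in))          # stand-in for the reasoning update

projected = project_update(delta_w, Q)
protected_input = Q @ rng.normal(size=k)          # a vector inside the protected subspace
effect_on_protected = projected @ protected_input # numerically zero
```

For any input in the span of `Q`, the projected update contributes nothing, so the merged layer responds to those activations exactly as the agent model did. Inputs orthogonal to `Q` still receive the full reasoning update.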

Third, M2A adds adaptive layer-wise merging. The authors observe that reasoning and agent task vectors vary in magnitude across layers. A uniform merge coefficient can let the reasoning update dominate the agent update in the wrong places. M2A therefore rescales the reasoning update layer by layer and uses a similarity-aware layer mask, merging only layers where the agent and reasoning task vectors are sufficiently aligned.

That last detail matters. The method is not simply “project then add.” It is “protect the behaviorally critical subspace, calibrate update size, and avoid low-alignment layers.” Less elegant as a slogan, more useful as engineering.
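The sketch below shows the shape of such a layer-wise rule. It is an assumption-heavy illustration: the norm-ratio rescaling, the cosine-similarity threshold `min_cosine`, and all names are hypothetical choices made here, not the paper's formulas. The reasoning update passed in is assumed to be already projected.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two arrays, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def layerwise_merge(agent, delta_agent, delta_reason, beta, min_cosine):
    """Merge layer by layer, skipping low-alignment layers and rescaling the rest."""
    merged = {}
    for name, w_agent in agent.items():
        d_a, d_r = delta_agent[name], delta_reason[name]
        if cosine(d_a, d_r) < min_cosine:
            # Mask: leave this layer exactly as the agent model has it.
            merged[name] = w_agent.copy()
            continue
        # Calibration (assumed form): match the reasoning update's norm to the agent's.
        scale = np.linalg.norm(d_a) / (np.linalg.norm(d_r) + 1e-12)
        merged[name] = w_agent + beta * scale * d_r
    return merged

# Toy layers with made-up values: "a" is aligned, "b" is orthogonal.
agent = {"a": np.array([1.0, 1.0]), "b": np.array([1.0, 1.0])}
delta_agent = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.0])}
delta_reason = {"a": np.array([10.0, 0.0]), "b": np.array([0.0, 5.0])}
merged = layerwise_merge(agent, delta_agent, delta_reason, beta=0.5, min_cosine=0.5)
```

In the toy case, layer `a` receives a rescaled update even though its raw reasoning vector is ten times larger than the agent's, while layer `b` is left untouched because the two task vectors point in unrelated directions.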

The main evidence is not just the 51.2% result

The paper’s main benchmark is SWE-Bench Verified under the OpenHands scaffold. This is a coding-agent setting where the model must resolve real GitHub issues through multi-turn interaction. The reported metric is resolved rate, averaged over three independent runs.

The main result table compares M2A with SFT and several model-merging baselines.

Model or method	Resolved rate (%)	Avg. reasoning length per step	Avg. interaction steps	What the result suggests
Agent-8B	44.0 ± 0.9	253.3	175.3	Strong baseline agent behavior
Reasoning-8B	0.2 ± 0.1	1800.7	9.2	Math reasoning alone is not agent behavior
Multi-Task-8B	41.1 ± 1.4	347.4	150.1	Longer reasoning can reduce agent effectiveness
Task Arithmetic	47.6 ± 0.9	836.9	87.1	Gains, but interaction frequency is nearly halved
TIES-Merging	39.0 ± 0.9	494.6	112.1	Merge interference hurts performance
DARE	22.0 ± 0.8	1219.1	68.6	Severe behavior collapse
SLERP	47.2 ± 0.7	747.4	87.6	Similar tradeoff: more internal reasoning, fewer turns
RAIN-Merging	43.2 ± 1.1	262.7	170.4	Preserves agent behavior but adds little useful reasoning
M2A-Agent-8B	51.2 ± 0.6	327.4	178.0	Stronger reasoning without suppressing interaction

The contrast with the Reasoning-8B model is almost comic, in the dry way benchmark tables sometimes are. The reasoning model produces extremely long reasoning traces and barely interacts. Its resolved rate is 0.2%. This is not a small underperformance. It is a reminder that a single-response reasoning model dropped into an agent environment is not a software engineer. It is a lecturer trapped in a terminal.

The naive merging baselines are subtler. Task Arithmetic and SLERP do improve over the base agent, but they do so while sharply reducing interaction steps and increasing reasoning length. The model reasons much more per step, but acts far less often. This can help in some cases, but it is not stable agentic synergy. It is more like replacing many short inspect-edit cycles with fewer, heavier monologues.

RAIN-Merging shows the opposite failure mode. It preserves behavior but does not inject enough useful reasoning to improve resolved rate. That makes it a useful comparison: preserving behavior alone is not sufficient; injecting reasoning alone is not sufficient. M2A’s contribution is the combination.
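The tradeoff is easier to see as ratios against the base agent. The snippet below only re-derives numbers from the table above (average reasoning length per step and average interaction steps); the variable names are arbitrary.

```python
# (avg reasoning length per step, avg interaction steps), copied from the table above
rows = {
    "Agent-8B": (253.3, 175.3),
    "Task Arithmetic": (836.9, 87.1),
    "SLERP": (747.4, 87.6),
    "RAIN-Merging": (262.7, 170.4),
    "M2A-Agent-8B": (327.4, 178.0),
}

base_len, base_steps = rows["Agent-8B"]

# For each method: (reasoning length relative to base, steps relative to base)
ratios = {
    name: (length / base_len, steps / base_steps)
    for name, (length, steps) in rows.items()
}

for name, (len_ratio, step_ratio) in ratios.items():
    print(f"{name:16s} reasoning x{len_ratio:.2f}  steps x{step_ratio:.2f}")
```

Task Arithmetic and SLERP land at roughly half the baseline step count with about three times the per-step reasoning, RAIN-Merging stays close to 1.0 on both axes, and M2A raises per-step reasoning by about 30% while keeping the step count within about 2% of the baseline.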

The ablations identify the mechanism, not just the decoration

The paper’s ablation study is important because M2A has several moving parts. The authors remove the similarity-aware layer mask, remove merge coefficient calibration, and remove null-space projection.

Variant	Resolved rate	Likely purpose of test	What it supports	What it does not prove
Base Agent	44.0%	Reference point	Baseline coding-agent performance	Nothing about M2A components
Full M2A	51.2%	Main method	Combined mechanism improves performance	Generality beyond tested setting
Without layer mask	49.8%	Ablation	Selecting aligned layers helps	That this mask is globally optimal
Without coefficient calibration	47.7%	Ablation	Layer-wise scale balancing matters	That this calibration is the only viable one
Without null-space projection	45.6%	Core ablation	Protecting agent-critical behavior is central	That all agent behaviors are captured by the marker subspace

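Removing null-space projection causes the largest degradation: the resolved rate drops from 51.2% to 45.6%, a 5.6-point loss, compared with 3.5 points without coefficient calibration and 1.4 points without the layer mask. That is the paper’s strongest internal evidence for its mechanism. Without the projection, performance lands within 1.6 points of the 44.0% base agent, which supports the claim that naive reasoning injection interferes with agent-critical behavior.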

The layer mask and coefficient calibration also matter, but they look more like stabilizers. They improve performance by controlling where and how strongly reasoning is injected. The null-space projection is the conceptual core: it is the part that turns “add reasoning” into “add reasoning without damaging the action loop.”

Merge strength becomes a behavior control, not a roulette wheel

One of the more practical parts of the paper is the analysis of merge strength, denoted $\beta$.

In many model-merging workflows, the merge coefficient is a fragile hyperparameter. A slightly stronger merge may help, or it may collapse behavior. The paper shows this pattern for standard merging baselines: reasoning length can jump abruptly toward response-length limits, and performance can collapse when the reasoning model begins to dominate the agent model.

M2A behaves differently. As $\beta$ increases, average reasoning length grows approximately monotonically and becomes nearly linear in the high-performing region. The authors divide the behavior into three regimes:

Regime	Merge strength behavior	Agent behavior	Practical reading
Low reasoning	Weak reasoning injection	Agent behavior mostly preserved, reasoning remains limited	Safe but underpowered
Medium reasoning	Balanced reasoning and action	Strongest performance region	The useful operating zone
High reasoning	Reasoning keeps growing	More context pressure and prompt-length failures	Verbosity starts taxing the system

This matters because a business user does not want only the best benchmark number. They want a control surface. If the same parameter that improves reasoning also unpredictably destroys interaction, it is not a product knob; it is a liability with a Greek letter.

M2A’s $\beta$ is not a complete production policy. The paper itself notes that regime boundaries are empirically selected. Still, the result is useful: after behavior-preserving projection, merge strength becomes more interpretable. It adjusts reasoning intensity without immediately suppressing the external interaction loop.
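In practice, choosing an operating point could look like the sketch below: sweep a few values of $\beta$, record the resolved rate and how often runs hit the prompt-length limit, and keep the best value that stays under a failure budget. The `SweepPoint` record, the `pick_beta` helper, and the failure-budget rule are hypothetical conventions introduced here, not a procedure from the paper.

```python
from typing import List, NamedTuple, Optional

class SweepPoint(NamedTuple):
    beta: float
    resolved_rate: float        # fraction of tasks resolved at this beta
    prompt_failure_rate: float  # fraction of runs that hit the prompt-length limit

def pick_beta(points: List[SweepPoint], max_prompt_failure_rate: float) -> Optional[float]:
    """Return the beta with the highest resolved rate among points within the failure budget."""
    eligible = [p for p in points if p.prompt_failure_rate <= max_prompt_failure_rate]
    if not eligible:
        return None  # no setting satisfies the budget
    return max(eligible, key=lambda p: p.resolved_rate).beta
```

The measurements have to come from your own evaluation runs. The point of the sketch is only that a smooth knob lets a simple rule like this behave sensibly, which is not true when performance collapses abruptly between neighboring values.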

The trajectory analysis explains what “better reasoning” actually means

The paper’s trajectory-level analysis is where the benchmark number becomes operationally meaningful.

After applying M2A, the agent does not simply produce more tokens everywhere. It reasons more at decision-critical moments, especially around edit steps and late-stage synthesis. It also searches and inspects more before the first edit, while issuing fewer edits overall.

Metric	Base Agent	M2A	Relative change	Interpretation
Resolved rate	44.0%	51.2%	+16.4%	More issues solved
Reasoning at edit	305.3	401.4	+31.5%	More reasoning when modifying code
Late-stage synthesis	189.6	230.7	+21.7%	Better final integration before completion
Search before first edit	19.5	25.3	+30.0%	More evidence gathered before intervention
Inspect before first edit	10.4	12.3	+18.2%	More file-level grounding
Edit count	20.1	16.7	-17.0%	Fewer edit operations
Unique edit files	6.3	6.0	-5.0%	Slightly more concentrated edits

This is the behavioral story: M2A shifts the agent from trial-and-error editing toward evidence-grounded action.
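The relative-change column can be re-derived from the two value columns. The snippet recomputes it from the rounded numbers shown in the table, so some results differ from the reported column in the last digit, presumably because the reported figures were computed from unrounded values.

```python
# (base agent, M2A), copied from the table above
rows = {
    "Resolved rate (%)": (44.0, 51.2),
    "Reasoning at edit": (305.3, 401.4),
    "Late-stage synthesis": (189.6, 230.7),
    "Search before first edit": (19.5, 25.3),
    "Inspect before first edit": (10.4, 12.3),
    "Edit count": (20.1, 16.7),
    "Unique edit files": (6.3, 6.0),
}

def relative_change(base: float, new: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0

changes = {name: relative_change(base, new) for name, (base, new) in rows.items()}

for name, pct in changes.items():
    print(f"{name:26s} {pct:+.1f}%")
```

Note that the resolved-rate gain is 7.2 percentage points in absolute terms; the +16.4% figure is relative to the 44.0% baseline.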

That phrase could easily become fluff, so the appendix case studies help. In a Sphinx issue, the baseline explores many related paths but fails after exceeding the prompt-length limit. M2A identifies the semantic mismatch between :return: and :returns:, applies a localized patch, and verifies it. In a scikit-learn issue, the baseline nearly finds the fix but replaces an existing constructor argument; M2A preserves the existing fit_intercept argument while adding multi_class.

The lesson is not that M2A makes agents “more thoughtful” in a vague human sense. It makes reasoning more useful at the point of commitment. The agent gathers evidence before editing, forms a tighter problem representation, applies fewer changes, and verifies the result. That is the difference between a verbose assistant and a working agent.

The business value is cheaper capability integration, not magic autonomy

For enterprise AI teams, the paper suggests a practical design principle: do not evaluate agent upgrades only by internal reasoning quality. Evaluate whether the upgrade preserves the action loop.

A coding or workflow agent is a coupled system:

observe → reason → act → observe again → revise → act again

If a capability upgrade improves the “reason” box but weakens the transitions between boxes, the agent may look smarter in logs while performing worse in production. This is exactly the kind of failure that can survive a demo and die in a real workflow.
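A stripped-down version of that loop shows where the transitions live. The `policy` and `execute` callables, the `Step` record, and the `"finish"` sentinel are hypothetical placeholders for a model call and a tool runtime; real scaffolds are considerably more involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    thought: str
    action: str
    observation: str

def run_agent(
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    execute: Callable[[str], str],
    task: str,
    max_steps: int,
) -> List[Step]:
    """Alternate reasoning and acting until the policy finishes or the step budget runs out."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action = policy(task, history)   # reason, then commit to an action
        if action == "finish":
            break
        observation = execute(action)             # act and observe the environment
        history.append(Step(thought, action, observation))
    return history
```

Each pass through the loop depends on the policy handing control back to `execute`. A model that produces a long `thought` but rarely emits a usable `action` still runs without errors here; it simply makes less progress per unit of context.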

M2A is interesting because it offers a lower-cost way to integrate reasoning capability. The paper reports that Agent-8B SFT used 32 × H200 GPUs for about 30 hours, Multi-Task-8B SFT used 32 × H200 GPUs for about 48 hours, while M2A calibration and merging used 8 × H200 GPUs for about 2 hours. That is still heavy infrastructure by normal business standards. But relative to full training or RL, it is a very different cost profile.
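Converted to GPU-hours, using the approximate durations quoted above:

$$ 32 \times 30 = 960, \qquad 32 \times 48 = 1536, \qquad 8 \times 2 = 16 $$

So the merge step costs roughly $16 / 960 \approx 1.7\%$ of the Agent-8B SFT budget and about $16 / 1536 \approx 1.0\%$ of the Multi-Task-8B budget. This compares only the integration step; it assumes the agent and reasoning models already exist.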

The practical implication is not “any company can now cheaply build frontier coding agents.” Calm down. The implication is more specific: for teams already maintaining specialized models, behavior-preserving merging could become a useful middle layer between prompt engineering and full retraining.

A sensible production translation would look like this:

Paper result	Business interpretation	Boundary
M2A improves SWE-Bench Verified resolved rate from 44.0% to 51.2%	Reasoning capability can be integrated without full retraining	Shown for Qwen3-8B coding agents under this benchmark and scaffold
Multi-task SFT increases reasoning length but lowers resolved rate	Training mixtures can transfer surface style instead of useful behavior	Does not rule out better-designed multi-task training
Null-space projection is the most important ablated component	Protecting action-critical behavior matters	Marker-based subspaces may not capture all production behaviors
Medium merge strength performs best	Reasoning intensity needs tuning	Boundaries are empirical, not automatically derived
M2A uses fewer edits after more inspection	Better agents may act less impulsively, not merely think longer	Evidence comes from coding-agent trajectories

For business process automation beyond coding, the same principle applies, but the implementation would need adaptation. A finance reconciliation agent, customer-support agent, or procurement workflow agent also has behavior-critical transitions: when to ask for clarification, when to call an API, when to escalate, when to write to a system of record, and when to stop. But the markers and calibration trajectories would not be identical to software-engineering tool calls.

That is the real business relevance: M2A gives a language for preserving operational behavior while injecting cognitive capability.

The boundary conditions are important and not especially mysterious

The paper is careful about its own boundaries, and those boundaries matter for interpretation.

First, the experiments are concentrated on Qwen3-8B and SWE-Bench Verified. This is a strong coding-agent benchmark, but it is still one domain. The paper does not prove that the same method will work equally well for browser agents, spreadsheet agents, finance agents, robotics agents, or larger model families.

Second, the method depends on calibration trajectories and behavior markers. The authors calibrate on 100 code-agent trajectories, which is the best of the tested set sizes: 50 trajectories give a 49.1% resolved rate, 100 give 51.2%, 200 give 51.0%, and 300 give 50.8%. They also vary the marker window radius: $r=3$ gives 51.0%, $r=5$ gives 51.2%, and $r=10$ gives 49.5%. These are reassuring robustness checks, not a universal recipe.

Third, the method is training-free but not compute-free. It avoids gradient updates, optimizer states, and RL rollouts, but it still requires forward-pass calibration, parameter-space merging, and serious GPU memory. “Training-free” is not the same thing as “laptop-friendly.” A small detail, but reality often lives in small details.

Fourth, stronger coding agents carry deployment risk. If agents can modify repositories more effectively, they can also make unintended modifications more effectively. For real production systems, sandboxing, access control, logging, human approval for consequential changes, and security-oriented evaluation remain part of the system design. The model merge does not merge away governance.

What Cognaptus would take from this paper

The direct paper result is clear: M2A improves one Qwen3-8B coding agent on SWE-Bench Verified by protecting behavior-critical directions while injecting mathematical reasoning through null-space projected model merging.

The Cognaptus inference is broader but bounded: in enterprise agent design, the central upgrade problem is not simply adding intelligence. It is adding intelligence without damaging workflow behavior.

This distinction changes what teams should measure. It is not enough to ask:

Does the model produce longer reasoning?
Does it score better on math?
Does it explain itself more convincingly?

The better questions are:

Does it inspect the right information before acting?
Does it preserve tool-use timing?
Does it escalate at the right boundary?
Does it reduce unnecessary edits or actions?
Does reasoning appear at decision-critical steps rather than everywhere?
Does a tuning knob change behavior smoothly, or does it collapse the workflow?
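Several of these questions can be answered from logged trajectories. The sketch below assumes each trajectory is an ordered list of `(kind, target)` pairs with kinds such as `"search"`, `"inspect"`, `"edit"`, and `"test"`. That event format and the metric names are assumptions made for illustration, loosely mirroring the trajectory table earlier in the article.

```python
from typing import Dict, List, Tuple

def trajectory_metrics(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count evidence-gathering actions before the first edit, plus edit volume."""
    kinds = [kind for kind, _ in events]
    # Index of the first edit, or the end of the trajectory if no edit happened.
    first_edit = kinds.index("edit") if "edit" in kinds else len(kinds)
    before = kinds[:first_edit]
    return {
        "search_before_first_edit": before.count("search"),
        "inspect_before_first_edit": before.count("inspect"),
        "edit_count": kinds.count("edit"),
        "unique_edit_files": len({target for kind, target in events if kind == "edit"}),
    }

# Made-up trajectory for illustration only.
example = [
    ("search", "returns"),
    ("inspect", "a.py"),
    ("search", "docstring"),
    ("edit", "a.py"),
    ("test", ""),
    ("edit", "a.py"),
    ("edit", "b.py"),
]
metrics = trajectory_metrics(example)
```

Tracking these counts before and after a model upgrade gives a quick signal of whether the agent is gathering more evidence before committing or merely talking more.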

This is where M2A is useful as more than a model-merging trick. It reframes reasoning integration as behavioral surgery. Add the new capability, but avoid cutting the nerves that control action.

Conclusion: the agent should think more, but only in the right places

The paper’s best lesson is almost anti-slogan: more reasoning helps only when the agent still behaves like an agent.

M2A succeeds because it does not worship longer chain-of-thought traces. It protects the think-act-observe loop, projects reasoning updates away from behavior-critical directions, calibrates layer-wise merge strength, and then tunes reasoning intensity through a more stable control knob. The result is not merely a model that talks more before editing. It is a model that searches more before the first edit, reasons more when changes matter, edits less often, and solves more tasks.

That is a better mental model for enterprise AI agents. The future is not just models that think harder. It is systems that know when thinking should become action, and when action should wait for evidence.

The glamorous part is reasoning. The valuable part is coordination. Naturally, the benchmark table hides that in the middle, because tables have no sense of drama.

Cognaptus: Automate the Present, Incubate the Future.

¹ Junjian Wang, Xin Zhou, Qiran Xu, and Kun Zhan, “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models,” arXiv:2605.09879, 2026.
