Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

Software agents fail in a familiar way.

They do not always fail because they are stupid. Sometimes they fail because they are busy. They search too widely, inspect too much, edit too early, revise the wrong file, run out of context, and then collapse under the weight of their own half-formed investigation. In enterprise language: they generate activity before they stabilize a diagnosis. We have seen humans do this too, usually in Slack threads with too many tabs open. The machines are catching up nicely.

The paper “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models” studies this problem through a sharp technical question: can mathematical reasoning improve coding agents without destroying the behavior pattern that makes an agent an agent?¹

That sounds narrower than the usual “reasoning makes models better” slogan. Good. The slogan is the problem. A coding agent is not a math contest participant wearing a terminal costume. It has to reason, call tools, observe results, inspect files, modify code, and verify patches. Longer internal thinking can help only if it arrives at the right moment and does not suppress the external interaction loop.

M2A’s contribution is therefore not simply “add math reasoning to a coding model.” The more interesting claim is: add mathematical reasoning in parameter directions that do not perturb the model’s think-act-observe behavior. That is the difference between giving an agent a better brain and giving it a monologue problem.

The mistake is treating “more reasoning” as a universal upgrade

The tempting belief is simple: mathematical chain-of-thought is hard, coding is hard, so a model with stronger mathematical reasoning should become a better coding agent. Mix math data with agent data, or merge a math-reasoning model into a code-agent model, and the system should improve.

The paper’s experiments say: not so fast.

Mathematical reasoning and agentic reasoning have different behavioral shapes. Mathematical reasoning is usually closed-world and single-response. The model receives a problem, works internally, and returns a complete answer. Agentic reasoning is open-world and multi-turn. The model alternates between internal thought and external action: inspect, call tool, read result, update hypothesis, edit, test, and sometimes retreat gracefully before breaking the repository.
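The multi-turn shape described above can be made concrete with a toy loop. This is a minimal sketch, not the OpenHands scaffold or the paper's agent: `run_agent`, `toy_policy`, and `toy_env` are hypothetical names, and the fixed plan stands in for a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only.
Policy = Callable[[List[str]], Tuple[str, str]]  # history -> (thought, action)
Environment = Callable[[str], str]               # action -> observation


def run_agent(policy: Policy, env: Environment, task: str, max_steps: int = 10) -> List[str]:
    """Alternate think -> act -> observe until the policy emits 'finish'."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        thought, action = policy(history)
        history.append(f"think: {thought}")
        history.append(f"act: {action}")
        if action == "finish":
            break
        # The observation feeds the next round of thinking.
        history.append(f"observe: {env(action)}")
    return history


def toy_policy(history: List[str]) -> Tuple[str, str]:
    """Fixed plan standing in for a model: search, edit, test, then finish."""
    plan = ["search", "edit", "test", "finish"]
    n_actions = sum(1 for h in history if h.startswith("act: "))
    return f"deciding step {n_actions}", plan[min(n_actions, len(plan) - 1)]


def toy_env(action: str) -> str:
    """Stand-in environment that echoes the action it received."""
    return f"result of {action}"


trace = run_agent(toy_policy, toy_env, "fix failing test")
```

A single-response math solver would correspond to one `think` followed directly by `finish`. The agent's value sits in the `observe` lines in between.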

The problem is not that math reasoning is useless. The problem is that the behavioral interface is different. A math-reasoning update can teach the model to think longer. It can also teach the model to delay action, ignore feedback, or continue generating when it should call a tool. For a coding agent, that is not deeper intelligence. It is operational constipation, but with tokens.

The paper makes this contrast concrete by comparing three routes:

| Approach | What it tries to do | Effect on agent behavior |
| --- | --- | --- |
| Multi-task SFT | Train on both coding-agent trajectories and math reasoning data | Learns the surface pattern of long reasoning but weakens the agent’s interaction behavior |
| Naive model merging | Add the math model’s task vector to the agent model | Injects reasoning but may overwrite directions needed for tool-use timing and environment interaction |
| M2A | Merge the reasoning task vector only after protecting agent-critical behavior directions | Strengthens reasoning while preserving the think-act-observe loop |

This is why a mechanism-first reading matters. If we jump directly to the benchmark improvement, M2A looks like another “new method beats baselines” paper. The useful idea is earlier: agent upgrades should preserve the switching behavior before optimizing reasoning length.

M2A treats agent behavior as something to protect, not something to hope for

The method starts by defining what must not be damaged.

In a ReAct-style coding agent, behavior is visible at transition points: entering a reasoning block, leaving it, starting a tool call, ending a tool call, and moving between internal thinking and external observation. M2A uses these transition markers to estimate an agent-critical subspace: representation directions associated with the model’s think-act-observe behavior.

The practical intuition is straightforward. Do not protect every parameter. Do not treat the whole model as sacred. Protect the part of the representation space that governs when the agent switches between thinking and acting.

The paper’s marker-based calibration uses code-agent trajectories. It collects hidden states around behavior-transition markers such as reasoning boundaries and tool-call boundaries. Those hidden states form a calibration matrix. The subspace spanned by that matrix is treated as behavior-critical.
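One way to picture the calibration step is a singular value decomposition of the stacked marker hidden states. The sketch below is an assumption about how such a subspace could be estimated, not the paper's exact procedure: the function name `agent_critical_basis` and the `energy` cutoff are hypothetical, and the toy matrix stands in for hidden states collected from real trajectories.

```python
import numpy as np


def agent_critical_basis(hidden_states: np.ndarray, energy: float = 0.99) -> np.ndarray:
    """Orthonormal basis U (d x k) for the dominant directions of marker hidden states.

    hidden_states: calibration matrix of shape (n_markers, d).
    energy: hypothetical cutoff on the cumulative squared singular values.
    """
    # Rows of vt are orthonormal directions in hidden-state space,
    # ordered by how much of the calibration matrix they explain.
    _, s, vt = np.linalg.svd(hidden_states, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return vt[:k].T


# Toy calibration matrix: 50 marker states that vary only along the
# first two of six hidden dimensions.
rng = np.random.default_rng(0)
H = np.zeros((50, 6))
H[:, :2] = rng.normal(size=(50, 2))
U = agent_critical_basis(H)
```

In this toy case the recovered basis has two columns and lies entirely in the first two coordinates, which is the "protected" part of the space.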

Then M2A modifies the mathematical-reasoning update before merging it into the agent model. In ordinary task-vector merging, one might write:

$$ \theta_{merge} = \theta_{agent} + \alpha \Delta\theta_{reason} $$

Here, $\theta_{agent}$ is the coding-agent model, and $\Delta\theta_{reason}$ is the reasoning task vector: the parameters of a reasoning model minus those of the shared base model, $\Delta\theta_{reason} = \theta_{reason} - \theta_{base}$. The scalar $\alpha$ controls merge strength.
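As a minimal sketch of ordinary task-vector merging, assume each model is a dictionary from parameter names to arrays. The `Params` alias, the function names, and the toy weights are illustrative assumptions, not the paper's code.

```python
from typing import Dict

import numpy as np

Params = Dict[str, np.ndarray]  # parameter name -> weight array


def task_vector(finetuned: Params, base: Params) -> Params:
    """Difference between a fine-tuned model and the shared base model."""
    return {name: finetuned[name] - base[name] for name in base}


def naive_merge(agent: Params, delta_reason: Params, alpha: float) -> Params:
    """theta_merge = theta_agent + alpha * delta_theta_reason, applied per parameter."""
    return {name: agent[name] + alpha * delta_reason[name] for name in agent}


# Toy single-parameter models.
base = {"w": np.array([0.0, 0.0])}
reason = {"w": np.array([1.0, 2.0])}
agent = {"w": np.array([10.0, 10.0])}

merged = naive_merge(agent, task_vector(reason, base), alpha=0.5)
```

Nothing in this update knows which directions the agent depends on, which is the weakness the next step addresses.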

The problem is that $\Delta\theta_{reason}$ may push the model along directions that matter for agent behavior. M2A instead projects the reasoning update into the null space of the agent-critical subspace. Conceptually:

$$ P_{\perp} = I - UU^\top $$

where $U$ is an orthonormal basis for the protected agent-critical subspace, so $P_{\perp}$ projects onto its orthogonal complement. Applying $P_{\perp}$ to the reasoning update removes the components that would perturb those protected directions.
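A small numerical sketch shows what the projector does to a weight update. It assumes the protected directions live in the input space of a linear layer, so the update is multiplied by the projector on the right. That convention, the function name `project_update`, and the toy shapes are assumptions for illustration.

```python
import numpy as np


def project_update(delta_w: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Return delta_w @ (I - U U^T).

    delta_w: weight update of shape (d_out, d_in).
    U: orthonormal basis of the protected subspace, shape (d_in, k).
    """
    p_perp = np.eye(U.shape[0]) - U @ U.T
    return delta_w @ p_perp


rng = np.random.default_rng(0)
delta_w = rng.normal(size=(3, 4))
U = np.eye(4)[:, :1]              # protect the first input direction only
projected = project_update(delta_w, U)
free_direction = np.eye(4)[:, 1]  # orthogonal to the protected direction
```

After projection, inputs along the protected direction see no change from the update, while inputs orthogonal to it see the update exactly as before.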

The business translation is unusually clean: add capability in the spare directions, not through the control wiring.

That metaphor is not decorative. It captures the operational risk. A coding agent’s value does not come from producing an impressive private essay before every edit. Its value comes from using reasoning to decide what evidence to collect, when to act, and how to verify. If the merge makes the agent think more but inspect less, the model may look more intelligent while becoming less useful.

Null-space projection is the core mechanism; layer control makes it usable

M2A has three moving parts, and they should not be treated as equal slogans.

First, it calibrates behavior markers to estimate the agent-critical subspace. Second, it projects the reasoning task vector into the null space of that subspace. Third, it applies adaptive layer-wise merging: it rescales the reasoning vector by layer and merges only in layers where the agent and reasoning task vectors are sufficiently aligned.

The null-space projection is the conceptual center. It is the part that directly answers the paper’s main conflict: how can the model absorb mathematical reasoning without overwriting interaction behavior?

The layer-wise mechanisms are more like engineering controls. They matter because task-vector magnitudes can differ substantially across layers. A uniform merge coefficient can make the reasoning update dominate in some layers and barely appear in others. M2A therefore normalizes the reasoning vector relative to the agent vector at each layer, and uses a similarity-aware mask to skip layers where the agent and reasoning updates point in conflicting directions.
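The two layer-wise controls can be sketched as a single per-layer coefficient. The norm-ratio rescaling and the cosine threshold below are assumptions chosen to match the description, not the paper's exact formulas. `layer_merge_coefficient` and `min_cosine` are hypothetical names.

```python
import numpy as np


def layer_merge_coefficient(agent_vec, reason_vec, beta, min_cosine=0.0):
    """Per-layer coefficient for the reasoning task vector (0.0 means skip the layer)."""
    a = np.ravel(agent_vec)
    r = np.ravel(reason_vec)
    a_norm, r_norm = np.linalg.norm(a), np.linalg.norm(r)
    if a_norm == 0.0 or r_norm == 0.0:
        return 0.0
    cosine = float(a @ r) / (a_norm * r_norm)
    if cosine < min_cosine:
        return 0.0  # conflicting directions: leave this layer untouched
    # Rescale so the reasoning update is measured relative to the agent update.
    return float(beta * a_norm / r_norm)


aligned = layer_merge_coefficient(np.array([3.0, 4.0]), np.array([6.0, 8.0]), beta=1.0)
conflicting = layer_merge_coefficient(np.array([3.0, 4.0]), np.array([-6.0, -8.0]), beta=1.0)
```

In the aligned case the reasoning vector is twice as large as the agent vector, so it is scaled by one half. In the conflicting case the layer is skipped.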

A simple way to read the method is:

| Component | Likely purpose | What it protects or controls |
| --- | --- | --- |
| Behavior-marker calibration | Identify where agent switching behavior lives | The think-act-observe transition pattern |
| Null-space projection | Remove reasoning-update components that interfere with protected directions | Agent behavior under reasoning injection |
| Merge coefficient calibration | Prevent scale imbalance across layers | Stability of merge strength |
| Similarity-aware layer mask | Avoid merging in incompatible layers | Layer-level conflict between agent and reasoning updates |

The appendix implementation detail is also worth noting. The paper presents the subspace projection geometrically, but the experiments use a matrix-free block-wise solver rather than explicitly materializing the full projector for every transformer block. This is not a second thesis; it is an implementation detail that makes the method feasible on real model components such as attention projections and feed-forward layers.
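The general idea of avoiding an explicit projector can be illustrated with a standard identity: multiplying by $I - UU^\top$ is the same as subtracting the component along $U$. This is only a sketch of that identity under the assumption of a right-multiplied update. It is not the paper's block-wise solver, and `project_update_matrix_free` is a hypothetical name.

```python
import numpy as np


def project_update_matrix_free(delta_w: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Compute delta_w @ (I - U U^T) without building the (d_in x d_in) projector."""
    # (d_out, d_in) @ (d_in, k) -> (d_out, k), then map back to (d_out, d_in).
    return delta_w - (delta_w @ U) @ U.T


rng = np.random.default_rng(1)
delta_w = rng.normal(size=(5, 8))
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))  # orthonormal columns, shape (8, 3)

explicit = delta_w @ (np.eye(8) - U @ U.T)
matrix_free = project_update_matrix_free(delta_w, U)
```

Both routes give the same result. The matrix-free form only ever stores arrays with `k` columns alongside the update itself, which matters when the hidden dimension is in the thousands.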

For practitioners, the important point is not the algebra itself. It is the design discipline: before injecting a new capability into an agent, define which behaviors must remain invariant. M2A gives one answer for coding agents. Other agents may need different behavioral markers.

The main result is not just higher SWE-Bench; it is a better behavior mix

The headline result is easy to state. On SWE-Bench Verified under the OpenHands scaffold, M2A improves Agent-8B from 44.0% to 51.2% resolved rate. Average reasoning length rises from 253.3 to 327.4 tokens per step, while average interaction steps remain high at 178.0 versus 175.3 for the base agent.

That last clause is doing a lot of work.

A naive reader may celebrate the longer reasoning trace. The paper shows why that would be sloppy. Multi-Task-8B also reasons longer: 347.4 tokens per step. Yet its resolved rate falls to 41.1%, and average interaction steps drop to 150.1. So longer reasoning alone is not the gain. It can even be part of the failure mode.

The model-merging baselines sharpen the same lesson:

| Model or method | Resolved rate | Avg. reasoning length (tokens per step) | Avg. interaction steps | Interpretation |
| --- | --- | --- | --- | --- |
| Agent-8B | 44.0% | 253.3 | 175.3 | Strong base agent behavior, weaker reasoning |
| Multi-Task-8B | 41.1% | 347.4 | 150.1 | Longer reasoning, worse agent performance |
| Task Arithmetic | 47.6% | 836.9 | 87.1 | Some gain, but interaction frequency nearly halves |
| SLERP | 47.2% | 747.4 | 87.6 | Similar overthinking/under-interacting pattern |
| TIES-Merging | 39.0% | 494.6 | 112.1 | Merge destabilizes behavior |
| DARE | 22.0% | 1219.1 | 68.6 | Reasoning-heavy collapse |
| RAIN-Merging | 43.2% | 262.7 | 170.4 | Preserves behavior but injects little useful reasoning |
| M2A-Agent-8B | 51.2% | 327.4 | 178.0 | More reasoning without losing interaction |

This is the most useful evidence pattern in the paper. The baseline failures are not random. They separate three bad outcomes:

Longer thought, weaker agent: Multi-task SFT increases reasoning length but hurts resolved rate.
Reasoning domination: naive merging methods make the model think much longer while interacting far less.
Over-preservation: RAIN-Merging keeps interaction behavior closer to the base agent but does not produce an improvement (43.2% resolved versus the base agent’s 44.0%).
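The three outcomes are easier to see as ratios against the base agent. The snippet below only re-expresses the reported numbers. The dictionary layout and variable names are illustrative.

```python
# (resolved rate %, avg reasoning tokens per step, avg interaction steps), as reported.
results = {
    "Agent-8B": (44.0, 253.3, 175.3),
    "Multi-Task-8B": (41.1, 347.4, 150.1),
    "Task Arithmetic": (47.6, 836.9, 87.1),
    "SLERP": (47.2, 747.4, 87.6),
    "TIES-Merging": (39.0, 494.6, 112.1),
    "DARE": (22.0, 1219.1, 68.6),
    "RAIN-Merging": (43.2, 262.7, 170.4),
    "M2A-Agent-8B": (51.2, 327.4, 178.0),
}

_, base_reasoning, base_steps = results["Agent-8B"]

# Ratios relative to the base agent: above 1.0 means more than Agent-8B.
reasoning_ratio = {name: r / base_reasoning for name, (_, r, _) in results.items()}
step_ratio = {name: s / base_steps for name, (_, _, s) in results.items()}

for name in results:
    print(f"{name:16s} reasoning x{reasoning_ratio[name]:.2f}  steps x{step_ratio[name]:.2f}")
```

Task Arithmetic reasons more than three times as long per step while taking about half as many steps. M2A is the only row where both ratios are above 1.0 and the resolved rate improves.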

M2A’s result sits in the missing middle. It adds reasoning while keeping interaction alive.

That distinction matters for software automation because repository-level debugging is not a single-shot answer task. The agent must repeatedly convert uncertainty into evidence. A model that solves fewer tasks after “thinking harder” is not a thoughtful engineer. It is a meeting that learned to type.

The ablations identify mechanism, not just decoration

The paper’s Figure 1 includes two useful pieces of evidence. One part shows that M2A improves mathematical benchmarks as well as SWE-Bench. The other part ablates the M2A components.

The math-side gains are sizable: the base agent scores 30.0 on AIME2024, 16.7 on AIME2025, and 79.3 on MATH500; M2A reaches 40.0, 51.7, and 85.4, respectively. IFEval stays nearly unchanged, moving from 23.1 to 24.0. This supports the claim that M2A does inject reasoning capability rather than merely tuning a coding benchmark.

The ablation is more important for the article’s argument. Removing the similarity-aware mask reduces SWE-Bench performance to 49.8%. Removing merge coefficient calibration reduces it to 47.7%. Removing null-space projection drops it further to 45.6%, close to the base agent’s 44.0%.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Math and general benchmarks | Main evidence for capability synergy beyond SWE-Bench | The merged model retains or improves intrinsic reasoning and does not visibly damage instruction-following in the reported setting | It does not prove general reasoning transfer across all non-coding domains |
| Component ablation | Ablation | Null-space projection is central; layer mask and coefficient calibration add further value | It does not isolate every possible interaction between components |
| Multi-task SFT epoch sweep | Robustness/sensitivity test for SFT baseline | The weak multi-task SFT result is not simply due to stopping training too early | It does not rule out all alternative training recipes |
| Merge-coefficient sweeps for baselines | Fairness and sensitivity test | Baseline numbers are chosen from tuned sweeps, and standard merging methods are fragile | It does not prove that no future merging baseline can do better |
| Calibration set size/window ablations | Robustness test for calibration | Behavior-critical subspace estimation is not extremely fragile in the tested range | It does not prove the marker strategy transfers unchanged to all agent formats |
| Trajectory case studies | Qualitative exploratory support | M2A changes how agents investigate, edit, and verify | It is illustrative, not statistical proof by itself |

This distinction between evidence types is important. The ablation is not a bonus chart. It is the mechanism check. The paper’s central claim would be much weaker if M2A improved the benchmark but the null-space component did not matter. The fact that removing null-space projection causes the largest degradation supports the paper’s interpretation: preserving agent-critical behavior is not a decorative constraint. It is the point.
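The size of each ablation effect follows directly from the reported SWE-Bench numbers. The snippet below is plain arithmetic on those figures, with illustrative variable names.

```python
full_m2a = 51.2    # resolved rate (%) with all components
base_agent = 44.0  # resolved rate (%) of Agent-8B

ablated = {
    "without similarity-aware mask": 49.8,
    "without merge coefficient calibration": 47.7,
    "without null-space projection": 45.6,
}

# Points lost relative to full M2A, and points still gained over the base agent.
drop = {name: round(full_m2a - score, 1) for name, score in ablated.items()}
remaining_gain = {name: round(score - base_agent, 1) for name, score in ablated.items()}
largest_drop = max(drop, key=drop.get)
```

Removing null-space projection costs 5.6 of the 7.2 points that M2A gains over the base agent, leaving a 1.6-point improvement. The other two ablations cost 1.4 and 3.5 points.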

The merge-strength knob is useful because it controls behavior, not just a score

M2A introduces merge strength as a control knob for reasoning behavior. This is easy to misunderstand.

A hyperparameter is not automatically a product control. Most hyperparameters are just tiny bureaucrats in the basement. You tune them, they annoy you, and nobody wants them in the user interface.

M2A’s merge strength is more interesting because the paper reports a relatively interpretable relationship between merge strength and reasoning length. As merge strength increases, average reasoning length grows approximately monotonically and becomes nearly linear in the high-performing region. The authors identify low, medium, and high reasoning regimes, with empirical transition points around $\beta = 0.8$ and $\beta = 1.2$.

The regimes are operationally meaningful:

| Regime | Behavior pattern | Practical reading |
| --- | --- | --- |
| Low reasoning | Agent behavior remains close to the base model, but reasoning is weak | Safe but underpowered |
| Medium reasoning | More internal analysis while preserving environment interaction | Best tested balance |
| High reasoning | Reasoning grows but performance stops improving; prompt-length failures increase | Overthinking starts to consume the operating budget |

This is a useful product idea. For coding agents, “reasoning intensity” should not simply mean “allow longer outputs.” It should mean a controlled trade-off among diagnosis depth, tool-use frequency, context budget, and verification discipline.

The appendix adds a helpful check: average interaction steps remain relatively stable across the low and medium merge-strength regimes. In the high-reasoning regime, steps decline mildly because more trajectories hit prompt-length limits. That supports the interpretation that, within the effective range, M2A controls internal reasoning depth more than it suppresses external action.

For business deployment, that is the difference between a knob and a hazard. A real engineering team may want a low-reasoning mode for routine lint-style issues, a medium-reasoning mode for repository bugs, and a high-reasoning mode only when context budget and execution safeguards permit it. The paper does not implement such a product policy; it is Cognaptus’s inference about a plausible design direction.
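A deployment policy along these lines might look like the sketch below. It is hypothetical throughout: the paper reports empirical transition points near 0.8 and 1.2 but no product policy, and the function names, boundary handling, and guard conditions are assumptions.

```python
def reasoning_regime(beta: float, low_to_medium: float = 0.8, medium_to_high: float = 1.2) -> str:
    """Map a merge strength to a regime using the reported empirical transition points.

    Which side of each boundary is inclusive is an assumption.
    """
    if beta < low_to_medium:
        return "low"
    if beta <= medium_to_high:
        return "medium"
    return "high"


def choose_mode(requested: str, has_context_headroom: bool, has_safeguards: bool) -> str:
    """Hypothetical guard: allow high reasoning only when budget and safeguards permit."""
    if requested == "high" and not (has_context_headroom and has_safeguards):
        return "medium"
    return requested


modes = [
    choose_mode("low", has_context_headroom=False, has_safeguards=True),
    choose_mode("high", has_context_headroom=False, has_safeguards=True),
    choose_mode("high", has_context_headroom=True, has_safeguards=True),
]
```

The guard encodes the point made above: the high-reasoning mode is something a system earns through spare context budget and sandboxing, not a default.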

The trajectory analysis explains where the gain probably comes from

The paper’s trajectory-level analysis is where the benchmark result becomes more interpretable.

After M2A, the model reasons more at decision-critical moments. Reasoning at edit steps rises from 305.3 to 401.4 tokens, a 31.5% increase. Late-stage synthesis rises from 189.6 to 230.7, a 21.7% increase. Before the first edit, search actions increase from 19.5 to 25.3, and file inspections increase from 10.4 to 12.3.

At the same time, edit count falls from 20.1 to 16.7, and unique edited files fall slightly from 6.3 to 6.0.

That combination is the behavioral signature: more evidence before intervention, fewer edits after diagnosis. The agent is not merely talking more. It is shifting from trial-and-error editing toward evidence-grounded action.
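The relative changes behind this signature can be recomputed from the reported before and after values. The percentages for search, inspection, and edit metrics are derived here rather than quoted from the paper.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (base agent, M2A) values as reported in the trajectory analysis.
metrics = {
    "reasoning tokens at edit steps": (305.3, 401.4),
    "late-stage synthesis": (189.6, 230.7),
    "searches before first edit": (19.5, 25.3),
    "file inspections before first edit": (10.4, 12.3),
    "edit count": (20.1, 16.7),
    "unique edited files": (6.3, 6.0),
}

changes = {name: round(percent_change(b, a), 1) for name, (b, a) in metrics.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

The evidence-gathering metrics all move up and both edit metrics move down, which is the pattern described above.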

The appendix case studies make this pattern concrete. In a Sphinx issue, the baseline agent explores broadly and repeatedly revises hypotheses until it exceeds the maximum prompt length. M2A identifies a specific mismatch between :return: and :returns:, applies a localized patch, and verifies it. In a scikit-learn issue, the baseline recognizes that multi_class is relevant but makes an incomplete constructor edit that drops fit_intercept. M2A traces the scoring path more carefully and preserves the existing argument while adding the needed one.

These cases should not be oversold. They are qualitative examples, not a new benchmark. But they illustrate the same mechanism suggested by the trajectory metrics: better agents do not just produce more reasoning. They place reasoning where it changes the next action.

That is the key managerial insight. The performance gain appears to come from better timing of cognition, not from a larger pile of internal text.

The business value is cheaper behavioral upgrade, not magic autonomy

M2A’s practical attraction is that it is training-free. It does not require additional SFT or RL after the source models are available. The paper reports that Agent-8B and Multi-Task-8B SFT used 32 H200 GPUs with 140 GB peak memory per GPU, while M2A calibration and merging used 8 H200 GPUs with 50 GB peak memory per GPU. The table reports hardware and memory, but wall-clock times are not clearly recoverable from the HTML version, so the conservative takeaway concerns the resource profile, not total cost.

For companies building software agents, this suggests a useful upgrade path:

1. Start with a competent coding-agent model.
2. Identify a reasoning-specialized model with the same base lineage.
3. Calibrate behavior markers from agent trajectories.
4. Merge reasoning capability while protecting agent-critical switching behavior.
5. Tune reasoning intensity within a monitored range.
6. Evaluate not only resolved rate, but search/edit/verify behavior.

The last point is not optional. A coding-agent benchmark score is useful, but production risk lives in the trajectory. Does the agent inspect before editing? Does it verify after modifying? Does it touch too many files? Does it preserve existing semantics when applying a fix? Does it fail by exhausting context, looping, or making confident near-correct edits?
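Several of these questions can be answered mechanically from a trajectory log. The sketch below assumes a hypothetical log schema, a list of steps with an `action` and an optional `file`. The schema and the name `trajectory_summary` are illustrative, not OpenHands or SWE-Bench formats.

```python
from typing import Dict, List, Optional


def trajectory_summary(steps: List[Dict[str, Optional[str]]]) -> Dict[str, object]:
    """Summarize investigation, editing, and verification behavior in one trajectory."""
    actions = [s["action"] for s in steps]
    first_edit = actions.index("edit") if "edit" in actions else len(actions)
    last_edit = max((i for i, a in enumerate(actions) if a == "edit"), default=-1)
    return {
        "searches_before_first_edit": actions[:first_edit].count("search"),
        "inspections_before_first_edit": actions[:first_edit].count("inspect"),
        "edit_count": actions.count("edit"),
        "unique_edited_files": len({s["file"] for s in steps if s["action"] == "edit"}),
        # True only if some test ran after the final edit.
        "verified_after_last_edit": last_edit >= 0 and "test" in actions[last_edit + 1:],
    }


# Toy trajectory in the assumed schema.
steps = [
    {"action": "search", "file": None},
    {"action": "search", "file": None},
    {"action": "inspect", "file": "a.py"},
    {"action": "edit", "file": "a.py"},
    {"action": "test", "file": None},
    {"action": "edit", "file": "a.py"},
    {"action": "test", "file": None},
]
summary = trajectory_summary(steps)
```

Aggregating such summaries across runs gives a team the same kind of search, edit, and verify picture the paper reports, adapted to its own tools.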

M2A pushes the evaluation conversation in the right direction. It treats agent behavior as a process-control object. That matters more for enterprise automation than another leaderboard decimal, although leaderboards do keep marketing departments hydrated.

A Cognaptus-style deployment interpretation would separate three layers:

| Layer | What the paper directly shows | Business interpretation | Boundary |
| --- | --- | --- | --- |
| Technical method | Null-space model merging can inject math reasoning while preserving agent-critical behavior | Upgrade agents without full retraining when compatible source models exist | Demonstrated mainly on Qwen3-8B and coding-agent tasks |
| Evaluation | SWE-Bench Verified improves from 44.0% to 51.2% under OpenHands | Repository-level issue resolution can benefit from controlled reasoning injection | Score may vary under other scaffolds, repositories, tool formats, and inference settings |
| Behavior | M2A increases reasoning at edit and synthesis stages, increases pre-edit evidence collection, and reduces edit count | Agent quality should be measured through investigation-edit-verification patterns | Trajectory metrics need adaptation for each enterprise workflow |
| Product control | Merge strength changes reasoning length in a predictable region | Reasoning intensity can become a deployment knob | Regime boundaries are empirically selected, not yet principled |

The implication is not “replace developers.” The implication is more sober and more useful: coding agents can be improved by protecting operational behavior while adding reasoning capability. That points toward agent engineering as process design, not merely model selection.

Where the result should not be stretched

The paper’s boundaries are clear enough if we do not force them to carry a press release.

First, the main evidence is built around Qwen3-8B, SWE-Bench Verified, and the OpenHands scaffold. SWE-Bench Verified is a challenging and relevant benchmark, but it is still one domain: real-world Python GitHub issue resolution. The method may generalize to other agentic settings, but this paper does not prove that it will work equally well for browser agents, spreadsheet agents, data-analysis agents, legal-document agents, or multi-agent workflow systems.

Second, M2A depends on behavior markers. In this paper, the markers correspond to the agent’s reasoning and tool-call format. Other agent frameworks may serialize tools differently, compress history differently, or hide state transitions behind orchestration code. The method’s spirit may transfer; the marker set probably should not be copied blindly.

Third, merge strength is empirically controlled. The paper finds useful regimes and reports a stable high-performing region, but it does not provide a principled rule for choosing merge strength before evaluation. For production systems, this means the knob must be validated, logged, and monitored. A knob without monitoring is just a future incident report with a nicer name.

Fourth, “training-free” does not mean “free.” M2A avoids additional gradient updates, but it still requires source models, calibration trajectories, forward-pass calibration, merging computation, and serious evaluation. The resource burden is lower than full SFT or RL in the reported setup, but it is not a weekend spreadsheet exercise.

Finally, stronger coding agents raise ordinary deployment risks: unintended file modifications, insecure patches, excessive permissions, and misuse for vulnerability discovery. The paper notes the need for sandboxed execution, human oversight, access control, logging, and security-oriented evaluation. That is not legal boilerplate. For agents that touch code, it is the minimum adult supervision package.

The real lesson: preserve the loop before lengthening the thought

M2A is valuable because it attacks a specific failure in how people discuss LLM reasoning.

The common story says: more reasoning is better. The paper’s better story says: more reasoning is useful only when it does not damage the behavior pattern required by the task. For coding agents, that behavior pattern is not a single answer. It is a loop: think, act, observe, revise, edit, verify.

The evidence fits that story. Multi-task SFT makes the model reason longer but solve fewer tasks. Some naive merging baselines produce very long reasoning and far fewer interactions. RAIN-Merging preserves behavior but fails to inject enough useful reasoning. M2A lands in the middle: reasoning gets deeper, interaction remains active, and resolved rate rises.

For business teams, the practical message is not to chase the longest chain-of-thought or the newest reasoning model. The practical message is to define the operational behaviors that matter, protect them during capability upgrades, and evaluate trajectories rather than just final answers.

In software automation, a good agent is not the one that thinks the longest. It is the one that knows when thinking has earned the right to edit.

Cognaptus: Automate the Present, Incubate the Future.

¹ Junjian Wang, Xin Zhou, Qiran Xu, and Kun Zhan, “M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models,” arXiv:2605.09879v1, 2026. https://arxiv.org/abs/2605.09879
