TL;DR for operators
AdaMame is a paper about a very practical failure: a model can answer a user in one language while doing its reasoning in another. That is not just inelegant. It is a product, trust, and governance problem wearing a linguistics hat. [1]
The paper’s useful move is to stop treating multilingual reasoning as a translation issue. The authors train for language fidelity directly. First, they run supervised fine-tuning on 30,000 naturally occurring reasoning traces across five languages. Then they run reinforcement learning with AdaMame-GRPO, a GRPO variant that gives extra reward when a correct rollout reasons in the query language. The extra reward grows during training, so the model first explores useful reasoning languages and later converges toward the user’s language.
The headline result is not merely “better multilingual math.” The operational result is a three-way improvement: answer accuracy, reasoning-language fidelity, and lower test-time token use move together more often than the usual trade-off story would suggest. On Distill-Qwen 1.5B, AdaMame reaches 67.9% Pass@4 accuracy and 70.1% LCPR on MGSM-Rev2, compared with 66.6% and 67.8% for standard GRPO. On Qwen3 4B, it reaches 85.9% accuracy and 89.9% LCPR on MGSM-Rev2, again beating standard GRPO on both. It also keeps test-time compute low.
The business interpretation is straightforward, and therefore easy to ignore: if your AI product operates across languages, “respond in the customer’s language” is not enough. You need to know whether the system’s intermediate reasoning, explanations, audit artifacts, and validation traces remain linguistically aligned. A prompt is a request. A training objective is policy. Guess which one survives contact with deployment.
The boundary is equally clear. This is mathematical reasoning, tested on two relatively small open models, using a language detector inside the reward loop and reasoning traces generated by GPT-5 nano. Useful, yes. Universal, no. Please do not staple it onto every multilingual workflow and call it global readiness. That would be the usual ceremony, not the actual control.
The problem is not translation; it is reasoning-language collapse
Most multilingual AI product failures are described too politely. The model “occasionally switches languages.” It “prefers English for complex reasoning.” It “may not fully localize intermediate outputs.” These phrases sound harmless, like a consultant avoiding eye contact with a dashboard.
The paper uses a sharper label: language collapse. In this setting, a large reasoning model receives a query in a non-English language but defaults to reasoning in English, or mixes languages inside the reasoning trace. The final answer may still be correct. The visible explanation may even look vaguely helpful. But the reasoning process has stopped respecting the linguistic context of the user.
That matters for three reasons.
First, language is not just display formatting. A translated answer can be correct while the model’s reasoning artifacts remain misaligned with the user’s language, culture, conventions, and domain vocabulary. This is especially visible in mathematical word problems, where the reasoning path may carry assumptions, quantities, and transformations that need to remain interpretable.
Second, language collapse makes auditing harder. If a multilingual customer support, compliance, education, or financial-advice system produces intermediate explanations in the wrong language, human review becomes slower and less reliable. The reviewer now has to check both the logic and the language drift. Wonderful. The audit process has become a translation exercise with liability attached.
Third, prompt-level control is weak. The authors test a Prompt baseline that prepends language-specific instructions telling the model to reason in the query language. It helps only marginally. For Distill-Qwen 1.5B, prompt steering improves LCPR by only 0.8 points on MGSM-Rev2 and 1.9 points on MSVAMP. For Qwen3 4B, the gains are 1.1 and 0.2 points. That is not an intervention. That is a polite note left on the refrigerator.
The paper’s central correction is that reasoning-language fidelity has to be trained as behavior. Not requested. Not hoped for. Trained.
AdaMame makes language fidelity part of the training mechanism
AdaMame uses a two-stage recipe. The structure matters because each stage solves a different failure.
The first stage is supervised fine-tuning. The authors construct a multilingual reasoning-trace corpus across five in-domain languages: French, Portuguese, Japanese, Korean, and Thai. For each language, they sample 6,000 training examples, producing a 30,000-example SFT corpus. The traces are generated by GPT-5 nano and filtered so that retained examples satisfy three conditions: the reasoning trace is in the query language, the final answer is correct, and the output follows the required format. The average retain rate is 72.2%.
This stage teaches the model that reasoning can be performed in languages other than English. That sounds obvious until one remembers that models do not learn from obviousness. They learn from training distributions. If their reasoning-heavy diet is English, English becomes the comfortable internal workspace.
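The three retention conditions for the SFT corpus can be sketched as a simple filter. This is a minimal sketch, not the authors' pipeline: the `Trace` record, the `well_formatted` flag, and the injected `detect_language` callable are hypothetical names introduced here for illustration, and exact-match answer checking is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    query_lang: str       # language code of the query, e.g. "th"
    reasoning: str        # generated reasoning text
    answer: str           # final answer extracted from the output
    gold_answer: str      # reference answer
    well_formatted: bool  # outcome of a format check (hypothetical flag)


def keep_trace(trace: Trace, detect_language: Callable[[str], str]) -> bool:
    """Keep a trace only if all three retention conditions hold."""
    same_language = detect_language(trace.reasoning) == trace.query_lang
    correct = trace.answer.strip() == trace.gold_answer.strip()
    return same_language and correct and trace.well_formatted


def filter_traces(
    traces: List[Trace], detect_language: Callable[[str], str]
) -> Tuple[List[Trace], float]:
    """Return the retained traces and the retain rate (kept / total)."""
    kept = [t for t in traces if keep_trace(t, detect_language)]
    retain_rate = len(kept) / len(traces) if traces else 0.0
    return kept, retain_rate
```

Passing the detector in as a callable keeps the filter independent of any particular language-identification library, which matters because the same detector later ends up inside the reward loop.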
The second stage is reinforcement learning. Standard GRPO rewards correctness. That improves generalizable reasoning, but it does not care which language the model reasons in. If English gives a more reliable path to correct answers, the reward happily nudges the model there. The reward is not xenophobic; it is just myopic. Incentives do not need bad intentions to produce bad behavior.
AdaMame-GRPO changes the reward structure. It keeps accuracy as the base objective, then scales the reward upward when the rollout’s detected reasoning language matches the query language. The scaling factor follows a cosine growth schedule: weak at the beginning of training, stronger later. The default query-alignment factor is 2.0.
The mechanism is important. AdaMame does not immediately force the model to reason in the query language at all costs. Early in training, the model can explore reasoning strategies across languages. Later, as the alignment factor grows, correct reasoning in the query language receives stronger reinforcement. The curriculum is roughly: first learn what works; then learn to make it work in the user’s language.
That is the paper’s main operational idea.
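The reward shaping is described here only in words, so the formula in the following sketch is an assumption rather than the paper's definition. It assumes a binary correctness reward, a multiplier that grows from 1.0 to $\beta$ along a half-cosine curve, and a multiplier that applies only when the detected reasoning language matches the query language. The function names are hypothetical.

```python
import math


def alignment_scale(step: int, total_steps: int, beta: float = 2.0) -> float:
    """Multiplier that grows from 1.0 (start) to beta (end) on a half-cosine curve.

    Assumed form: 1 + (beta - 1) * 0.5 * (1 - cos(pi * progress)).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    growth = 0.5 * (1.0 - math.cos(math.pi * progress))
    return 1.0 + (beta - 1.0) * growth


def shaped_reward(
    is_correct: bool,
    reasoning_lang: str,
    query_lang: str,
    step: int,
    total_steps: int,
    beta: float = 2.0,
) -> float:
    """Accuracy is the base reward; a language match scales it upward."""
    base = 1.0 if is_correct else 0.0
    if reasoning_lang == query_lang:
        return base * alignment_scale(step, total_steps, beta)
    return base
```

Under these assumptions an incorrect rollout earns nothing regardless of language, so language fidelity can never substitute for correctness. Early in training a correct English rollout and a correct query-language rollout earn nearly the same reward; by the end, the query-language rollout earns $\beta$ times as much.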
| Mechanism component | Technical purpose | Business interpretation | Boundary |
|---|---|---|---|
| Naturally occurring multilingual traces | Teach language-specific reasoning patterns during SFT | Localization starts inside the reasoning behavior, not only in output translation | Traces are generated and filtered, not collected from real enterprise tasks |
| LoRA-based SFT | Adapt without fully overwriting model behavior | Lower-cost specialization with less catastrophic forgetting risk | Demonstrated mainly on Distill-Qwen 1.5B in the appendix |
| Accuracy-first GRPO base | Preserve answer correctness as the core objective | Language compliance should not come at the cost of correctness | Accuracy is math-answer accuracy, not domain-task success |
| Query-conditioned reward scaling | Reward correct reasoning in the query language | Language fidelity becomes an optimization target | Depends on reliable language detection |
| Growing alignment schedule | Let exploration precede alignment | Avoids turning multilingual control into a brittle hard constraint | The best schedule may vary by model, domain, and language set |
The recipe is not philosophically grand. It is better than that: it is operationally specific.
The first stage teaches native-looking traces; the second stage makes them travel
The paper’s results support a clean division of labor between SFT and RL.
SFT is the first major step away from language collapse. Across both backbones and both datasets, SFT produces much larger LCPR gains than prompt steering. For Qwen3 4B, the jump is dramatic: compared with prompting, SFT improves LCPR by 64.1 points on MGSM-Rev2 and 56.2 points on MSVAMP. For Distill-Qwen 1.5B, the gains are smaller, 9.4 and 4.0 points, likely because that model already had multilingual reasoning exposure.
But SFT alone has a familiar weakness: it generalizes poorly outside the languages it has seen. The paper explicitly separates in-domain languages from out-of-domain languages, and this distinction is where the mechanism-first reading becomes useful. SFT can imitate and internalize the training languages. It is less reliable when asked to extend that behavior to Bengali, English, Spanish, Russian, Swahili, Telugu, Chinese, and German.
Standard GRPO helps with generalization because reinforcement learning rewards successful problem-solving over memorized trace style. In the authors’ framing, SFT gives the model multilingual reasoning capability; RL helps it generalize. But standard GRPO is blind to language fidelity. It optimizes correctness, and if correctness routes through English, then English wins. The spreadsheet smiles. The user does not.
AdaMame-GRPO is the bridge between these two needs. It retains the generalization benefits of RL while adding an adaptive pressure toward the query language. That is why the paper should not be read as “SFT versus RL.” The stronger reading is “SFT gives the model multilingual options; adaptive RL teaches it when to use the option the user actually asked for.”
This distinction matters for enterprise AI. Many organizations already have localized data, translated prompts, regional UI layers, and language-specific support scripts. Those are useful. They are not equivalent to a trained reasoning behavior. AdaMame’s contribution is to show where the control surface should sit: not only in the prompt or final answer, but inside the post-training objective.
The main result is a Pareto shift, not a leaderboard flourish
The paper evaluates AdaMame on two multilingual mathematical reasoning benchmarks, MGSM-Rev2 and MSVAMP, using two backbone models: Distill-Qwen 1.5B and Qwen3 4B. It reports three metrics: Pass@4 accuracy, LCPR, and test-time compute.
LCPR, or Language Confusion Pass Rate, is a useful metric choice because it penalizes code-switching within reasoning traces. This matters because a simpler “dominant detected language” metric can declare victory even when the trace contains ugly midstream language mixing. The appendix gives examples where top-1 language consistency is perfect while LCPR is zero. That is exactly the kind of metric loophole models enjoy walking through with the confidence of a tax attorney.
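The loophole is easy to reproduce with a toy example. The sketch below is not the paper's LCPR computation: `script_of` is a crude character-counting stand-in for a real detector such as Lingua, it only distinguishes Thai from Latin script, and the line-level check is a simplified proxy for a code-switching-aware metric. All function names are hypothetical.

```python
def script_of(text: str) -> str:
    """Toy detector: 'th' if Thai characters outnumber Latin letters, else 'en'.

    Text with no letters at all (digits only) falls back to 'en'.
    """
    thai = sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "th" if thai > latin else "en"


def top1_consistent(trace: str, query_lang: str) -> bool:
    """Dominant-language check over the whole trace."""
    return script_of(trace) == query_lang


def line_level_pass(trace: str, query_lang: str) -> bool:
    """Stricter check: every non-empty line must be in the query language."""
    lines = [ln for ln in trace.splitlines() if ln.strip()]
    return all(script_of(ln) == query_lang for ln in lines)


mixed_trace = "\n".join([
    "มีแอปเปิ้ล 3 ลูก และซื้อเพิ่มอีก 2 ลูก",
    "So the total is 3 + 2 = 5.",
    "ดังนั้นคำตอบคือ 5",
])

print(top1_consistent(mixed_trace, "th"))  # True: Thai dominates overall
print(line_level_pass(mixed_trace, "th"))  # False: one line switched to English
```

The same trace passes the dominant-language check and fails the stricter one, which is the gap a code-switching-aware metric is meant to close.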
The core results are best read as a three-objective comparison.
On Distill-Qwen 1.5B, AdaMame-GRPO achieves 67.9% accuracy and 70.1% LCPR on MGSM-Rev2, compared with 66.6% and 67.8% for standard GRPO. On MSVAMP, it reaches 77.0% accuracy and 60.7% LCPR, compared with 76.6% and 58.7% for standard GRPO. Test-time compute stays low: 1.9% of maximum context length on MGSM-Rev2 and 1.4% on MSVAMP.
On Qwen3 4B, AdaMame-GRPO reaches 85.9% accuracy and 89.9% LCPR on MGSM-Rev2, compared with 84.9% and 89.6% for standard GRPO. On MSVAMP, it reaches 89.0% accuracy and 86.0% LCPR, compared with 88.6% and 85.0% for standard GRPO. Test-time compute falls to 1.5% on MGSM-Rev2 and 0.9% on MSVAMP.
The gains over standard GRPO are not always enormous. They do not need to be. The significant point is directional: AdaMame generally improves language fidelity without giving up accuracy, while keeping token use low. In this area, avoiding the usual accuracy-language-fidelity trade-off is already meaningful.
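The directional claim can be checked with a few lines of arithmetic on the figures quoted above. This sketch uses only the accuracy and LCPR numbers reported in this article for AdaMame-GRPO and standard GRPO; the dictionary layout and function names are illustrative.

```python
# (accuracy %, LCPR %) pairs as quoted in this article
RESULTS = {
    ("Distill-Qwen 1.5B", "MGSM-Rev2"): {"adamame": (67.9, 70.1), "grpo": (66.6, 67.8)},
    ("Distill-Qwen 1.5B", "MSVAMP"): {"adamame": (77.0, 60.7), "grpo": (76.6, 58.7)},
    ("Qwen3 4B", "MGSM-Rev2"): {"adamame": (85.9, 89.9), "grpo": (84.9, 89.6)},
    ("Qwen3 4B", "MSVAMP"): {"adamame": (89.0, 86.0), "grpo": (88.6, 85.0)},
}


def deltas(results):
    """Point differences (AdaMame minus GRPO) for accuracy and LCPR."""
    out = {}
    for key, row in results.items():
        acc_a, lcpr_a = row["adamame"]
        acc_g, lcpr_g = row["grpo"]
        out[key] = (round(acc_a - acc_g, 1), round(lcpr_a - lcpr_g, 1))
    return out


def dominates_everywhere(results) -> bool:
    """True if AdaMame is strictly better on both metrics in every setting."""
    return all(d_acc > 0 and d_lcpr > 0 for d_acc, d_lcpr in deltas(results).values())


for (model, bench), (d_acc, d_lcpr) in deltas(RESULTS).items():
    print(f"{model:18s} {bench:10s} accuracy {d_acc:+.1f}  LCPR {d_lcpr:+.1f}")
print("Better on both metrics in all four settings:", dominates_everywhere(RESULTS))
```

The accuracy gains range from 0.4 to 1.3 points and the LCPR gains from 0.3 to 2.3 points. None is large, but none is negative, which is the sense in which the result is a Pareto shift.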
The contrast with M-Thinker is also instructive. On Distill-Qwen 1.5B, M-Thinker Iter2 can be competitive on accuracy, but its LCPR collapses badly: 11.8% on MGSM-Rev2 and 19.7% on MSVAMP. AdaMame-GRPO reaches 70.1% and 60.7% respectively, with fewer training instances than M-Thinker Iter2: 35,000 versus 50,000. The paper attributes the M-Thinker gap to heavy code-switching that prior language-fidelity metrics may overlook.
This is the business lesson: the metric decides what failure is visible. If your evaluation only checks final answer language, you can miss reasoning-language contamination. If your evaluation only checks accuracy, you can ship a model that solves the task while producing review artifacts your local operators cannot comfortably inspect. It will look efficient until something has to be explained.
The appendix explains why the recipe is so annoyingly specific
The appendix is not a second thesis. It is mostly a set of design validations: why this trace source, why this fine-tuning method, why this reward design, why this sampling strategy, and how reliable the language detector is.
That matters because AdaMame is a recipe paper. In recipe papers, the difference between “principled method” and “pile of knobs” is whether the authors show why the knobs are there.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Naturally occurring traces vs. naturally occurring plus machine-translated traces | Ablation | Adding 65,000 translated traces hurts both accuracy and LCPR versus 30,000 naturally occurring traces | It does not prove all machine translation is harmful in every domain |
| LoRA vs. full fine-tuning | Implementation validation | LoRA performs better on Distill-Qwen 1.5B: 60.6% accuracy and 67.7% LCPR versus 59.2% and 65.4% | It does not settle adapter strategy for larger or proprietary models |
| Accuracy reward vs. accuracy plus format reward | Reward-design ablation | Accuracy-only reward performs better: 67.9% accuracy and 70.1% LCPR versus 66.2% and 67.1% | It does not prove format rewards are generally bad |
| GRPO vs. Dr.GRPO | RL variant test | Dr.GRPO improves accuracy substantially while keeping comparable LCPR: 67.9% and 70.1% versus 63.4% and 69.9% | It does not isolate every optimization factor in RL training |
| Random, conditional, and rejection sampling | Training-data selection ablation | Rejection sampling wins: 67.9% accuracy and 70.1% LCPR versus roughly 60% accuracy for the alternatives | It does not prove rejection sampling is optimal under different rollout budgets |
| Lingua language-detector validation | Robustness check | Detector accuracy is 100% on short MGSM-Rev2 queries and 99.2% on sampled long reasoning traces | It does not guarantee reliability on noisy enterprise text, mixed dialects, or domain jargon |
The most important appendix result is the trace-source ablation. Adding machine-translated traces increases the training data from 30,000 to 95,000 examples, yet performance gets worse: accuracy drops from 60.6% to 56.4%, and LCPR drops from 67.7% to 63.0%. More data loses to better-aligned data. This should not shock anyone, but it will, because “more data” still has excellent lobbyists.
The implication is not that translation is useless. The implication is that translated English reasoning traces may carry English reasoning structure disguised in local-language tokens. For user-facing localization, that might be acceptable. For training a model to reason in a language, it may be a weak signal.
That distinction is commercially important. Enterprises often have translated content libraries, localized manuals, and multilingual support scripts. Those assets may be useful for retrieval and response generation. They should not automatically be treated as native reasoning supervision. A translated workflow is not always a local workflow. It is often an English workflow wearing a regional uniform.
Lower-resource gains are the result operators should not skip
The paper’s lower-resource language findings deserve attention because multilingual AI systems often improve first where data is already abundant. That is operationally convenient and socially unromantic. AdaMame’s gains are strongest in lower-resource settings.
The paper groups languages by resource level using speaker counts and Wikipedia article counts. Low-resource languages include Thai, Bengali, Swahili, and Telugu. AdaMame-GRPO performs especially well in this group. The authors highlight Bengali on MSVAMP: AdaMame improves accuracy over the Vanilla baseline by 10.8 points on Distill-Qwen 1.5B and by 28.4 points on Qwen3 4B.
The noteworthy detail is that only Thai appears in the SFT and RL training corpora among the low-resource group. Bengali, Swahili, and Telugu are out-of-domain. So the result is not merely “we trained on a language and got better at that language.” It suggests that the combination of multilingual trace exposure and adaptive RL can transfer some language-alignment behavior across languages.
That is useful, but not magical. The paper evaluates mathematical reasoning. The transfer may depend on shared scripts, model pretraining coverage, problem structure, and detector reliability. Still, for operators, this is the part of the paper that points toward practical deployment value: not perfect coverage of every language, but a training design that does not reserve quality improvements for English, French, Spanish, and other usual winners.
The business relevance is especially strong in markets where user trust depends on localized reasoning. Education, customer support, insurance, banking, procurement, legal intake, public-sector services, and healthcare-adjacent workflows all contain moments where the user or reviewer needs to understand how a conclusion was reached. If the explanation drifts into English or code-switches midstream, the system may still be technically functional but operationally annoying. In regulated settings, operationally annoying has a way of becoming expensive.
The beta knob is not cosmetic; it prices language fidelity
The paper includes a sensitivity test on the query alignment factor, $\beta$. This is not a side detail. It is the knob that makes the trade-off legible.
As $\beta$ increases, LCPR rises and accuracy tends to fall. On Distill-Qwen 1.5B with MGSM-Rev2, moving from $\beta = 1$ to $\beta = 5$ changes accuracy from 68.0% to 59.3%, while LCPR rises from 69.8% to 71.6%. On Qwen3 4B with MSVAMP, moving from $\beta = 1$ to $\beta = 5$ changes accuracy from 90.1% to 86.9%, while LCPR rises from 78.7% to 88.0%.
This is a clean reminder that language fidelity is not free. AdaMame’s default setting, $\beta = 2.0$, is a practical compromise rather than a universal constant handed down from the mountain. Higher alignment pressure buys stronger language fidelity, but it can also pull against answer accuracy.
For enterprise use, that is not a flaw. It is exactly what a control surface should expose. Different workflows should price the trade-off differently.
A multilingual tutoring system might tolerate a small accuracy cost if it substantially improves local-language reasoning and student comprehension. A compliance workflow might prioritize exactness and require additional validation before increasing language-alignment pressure. A customer support bot might tune differently by language, region, and product category. The point is not to choose one $\beta$ forever. The point is to stop pretending the trade-off is invisible.
The sarcastic version: congratulations, multilingual AI has a budget line now.
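To make the budget line concrete, the sensitivity figures quoted above can be turned into a rough price: accuracy points given up per LCPR point gained when moving from $\beta = 1$ to $\beta = 5$. This is a back-of-envelope sketch using only the two endpoints reported in this article. The ratio is an illustrative summary chosen here, not a metric from the paper, and the two settings differ in both model and benchmark, so the comparison is indicative only.

```python
# beta -> (accuracy %, LCPR %) as quoted in this article
BETA_SWEEP = {
    ("Distill-Qwen 1.5B", "MGSM-Rev2"): {1: (68.0, 69.8), 5: (59.3, 71.6)},
    ("Qwen3 4B", "MSVAMP"): {1: (90.1, 78.7), 5: (86.9, 88.0)},
}


def accuracy_cost_per_lcpr_point(low, high) -> float:
    """Accuracy points lost per LCPR point gained between two beta settings."""
    acc_lost = low[0] - high[0]
    lcpr_gained = high[1] - low[1]
    if lcpr_gained <= 0:
        raise ValueError("no LCPR gain to price")
    return acc_lost / lcpr_gained


for setting, sweep in BETA_SWEEP.items():
    price = accuracy_cost_per_lcpr_point(sweep[1], sweep[5])
    print(setting, round(price, 2))
```

On these endpoints the price is roughly 4.8 accuracy points per LCPR point for Distill-Qwen 1.5B on MGSM-Rev2, and roughly 0.3 for Qwen3 4B on MSVAMP. The same knob is expensive in one setting and cheap in another, which is the practical argument for tuning it per workflow.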
What Cognaptus infers for business use
The paper directly shows that AdaMame improves multilingual mathematical reasoning behavior across the tested models, datasets, and languages. It directly shows that prompt steering is weak, that naturally occurring traces outperform a much larger translated-trace mixture in one tested setting, that adaptive reward scaling improves LCPR over standard GRPO, and that the query-alignment factor creates a controllable accuracy-fidelity trade-off.
Cognaptus infers a broader operating principle: multilingual AI systems need reasoning-language governance, not just interface localization.
That means businesses should separate at least four layers:
| Layer | Common shortcut | Better operating question |
|---|---|---|
| User interface language | “The UI is translated.” | Does the user’s task context remain language-consistent end to end? |
| Final answer language | “The model responds in the right language.” | Does the explanation or rationale also remain in the right language? |
| Intermediate reasoning artifacts | “The reasoning is internal anyway.” | Are logs, traces, tool comments, and review artifacts inspectable by local operators? |
| Training objective | “We prompted it to comply.” | Is language fidelity rewarded, evaluated, and monitored as model behavior? |
The strongest business use case is not merely “better multilingual chatbots.” That is too small. The stronger use case is multilingual AI operations where explanation, review, and trust depend on language consistency. Examples include localized education, financial service explanations, claims handling, customer dispute resolution, field-service troubleshooting, and public-facing administrative workflows.
In these workflows, English-default reasoning creates a subtle hierarchy: the user operates in one language, the model reasons in another, and the organization audits in whichever language the logs happen to reveal. That is not localization. That is operational outsourcing to English.
AdaMame suggests a better design principle: train the system so that reasoning-language alignment is part of the task objective. Then evaluate it with a metric that catches code-switching rather than politely ignoring it.
Boundaries: this is math reasoning, not universal multilingual governance
The paper’s limitations are material, not ceremonial.
First, the task domain is mathematical reasoning. Math word problems are useful because they provide clear answer checking and multilingual benchmarks. They are not the same as legal reasoning, medical triage, contract review, claims adjudication, product troubleshooting, or multi-document enterprise analysis. Those tasks have messier evidence, more ambiguous outputs, and domain-specific language patterns.
Second, the tested models are relatively small open reasoning models: Distill-Qwen 1.5B and Qwen3 4B. Larger models may behave differently. Proprietary frontier models may already have stronger multilingual reasoning priors, or they may fail in more expensive ways. Either outcome would need testing.
Third, the reward mechanism depends on language detection. The paper validates the Lingua detector on short queries and sampled long traces, with high reported reliability. That is reassuring within the experimental setup. It does not eliminate risk in noisy enterprise text, mixed-language customer messages, dialectal variation, transliteration, abbreviations, or domain-specific terminology. A detector inside a reward loop is not just measurement; it becomes governance infrastructure. Treat it accordingly.
Fourth, the training traces are generated by GPT-5 nano. The authors chose it because it had the highest average retain rate among candidate generators, but generated traces can carry model-specific biases in how reasoning is expressed across languages. If the teacher model has uneven multilingual habits, the student may inherit them with a fresh coat of evaluation paint.
Fifth, the language set is broad but still finite. Thirteen languages (the five in-domain and eight out-of-domain languages listed above) is useful evidence, not planetary coverage. There are many languages, dialects, scripts, and mixed-language practices that remain outside this evaluation. Global deployment has a talent for finding the one case your benchmark did not include.
None of these boundaries weaken the paper’s core contribution. They define its scope. The correct takeaway is not “AdaMame solves multilingual reasoning.” The correct takeaway is “AdaMame shows where the control surface belongs, and demonstrates that it can move multiple operational metrics in the right direction under a bounded setup.”
That is enough to be useful. It is not enough to stop measuring.
The operating model: how teams should absorb this paper
A serious multilingual AI team should not copy AdaMame blindly. It should copy the discipline.
Start by separating answer accuracy from language fidelity. If the only metric is task success, English-default reasoning can hide behind correct outputs. Add a trace-level or explanation-level metric that detects code-switching and language drift. If the system does not expose full reasoning traces, evaluate whatever artifacts are available: explanations, rationales, tool-use summaries, generated work notes, reviewer-facing logs, or structured justifications.
Next, audit the training data source. Native or naturally occurring reasoning artifacts are not equivalent to machine-translated English traces. If the domain is insurance claims in Thai, translated English claim rationales may not teach the model how Thai claim handlers actually explain decisions. The paper’s translated-trace ablation is a warning against volume worship.
Then decide where language fidelity belongs in the objective. For low-risk applications, prompt steering may be enough. For high-trust workflows, it probably is not. The reward, fine-tuning objective, or evaluator should encode language fidelity explicitly. Otherwise the system will optimize for whatever is easiest, and “whatever is easiest” has a long and distinguished history of becoming English.
Finally, tune the trade-off by workflow. A single global setting for language alignment is convenient, but convenience is not governance. Some use cases need maximum correctness. Some need maximum inspectability. Some need a region-specific compromise. AdaMame’s $\beta$ sensitivity results show that this trade-off can be surfaced and managed instead of discovered accidentally after deployment.
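The first step, separating answer accuracy from language fidelity, can be sketched as an evaluation summary that reports the two rates side by side and counts the cases that a task-success metric alone would hide. This is a minimal sketch under stated assumptions: `EvalRecord` and `summarize` are hypothetical names, and `language_ok` is assumed to come from whatever trace-level or explanation-level checker a team already trusts.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class EvalRecord:
    correct: bool      # task-level correctness of the final answer
    language_ok: bool  # reviewed artifact stayed in the language of operation


def summarize(records: Sequence[EvalRecord]) -> Dict[str, float]:
    """Report accuracy and language fidelity separately, plus their blind spot."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    accuracy = sum(r.correct for r in records) / n
    fidelity = sum(r.language_ok for r in records) / n
    # Correct answers whose artifacts drifted: invisible to accuracy alone.
    hidden = sum(r.correct and not r.language_ok for r in records) / n
    return {
        "accuracy": accuracy,
        "language_fidelity": fidelity,
        "correct_but_drifted": hidden,
    }
```

The `correct_but_drifted` share is the number worth watching: it is the fraction of outputs that would pass a task-success gate while still handing local reviewers an artifact in the wrong language.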
Conclusion: prompts are manners; training is policy
AdaMame is valuable because it frames multilingual reasoning as an alignment problem inside the model’s learned behavior. Not a translation layer. Not a UI setting. Not a prompt template with an exclamation mark at the end.
The paper’s mechanism is modest and effective: teach the model with naturally occurring multilingual reasoning traces, then use adaptive reinforcement learning to reward correct reasoning in the query language. The resulting system improves language fidelity while preserving or improving accuracy and keeping token use low across the tested setting.
For business leaders, the lesson is simple enough to be dangerous: multilingual AI quality is not measured by whether the final answer looks localized. It is measured by whether the reasoning process, the explanation surface, and the review artifacts remain usable in the language of operation.
The model speaking your language is nice. The model reasoning in it is the product requirement.
Cognaptus: Automate the Present, Incubate the Future.
[1] Dayeon Ki, Kevin Duh, and Marine Carpuat, “AdaMame: A Training Recipe for Adaptive Multilingual Reasoning,” arXiv:2606.15080v1, June 13, 2026.