Reasoning-Models

The Model Spoke Your Language. Its Reasoning Did Not.

TL;DR for operators: AdaMame is a paper about a very practical failure: a model can answer a user in one language while doing its reasoning in another. That is not just inelegant. It is a product, trust, and governance problem wearing a linguistics hat.[1] The paper’s useful move is to stop treating multilingual reasoning as a translation issue. The authors train for language fidelity directly. First, they apply supervised fine-tuning on 30,000 naturally occurring reasoning traces across five languages. Then they run reinforcement learning with AdaMame-GRPO, a GRPO variant that gives extra reward when a correct rollout reasons in the query language. The extra reward grows during training, so the model first explores useful reasoning languages and later converges toward the user’s language. ...
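To make the reward idea in this excerpt concrete, here is a minimal sketch of a correctness reward plus a language-fidelity bonus that grows over training. It is an illustration under stated assumptions, not the paper's formula: the linear ramp, the `max_bonus` value of 0.5, and the function names `language_bonus_weight` and `rollout_reward` are all hypothetical.

```python
def language_bonus_weight(step: int, total_steps: int, max_bonus: float = 0.5) -> float:
    """Bonus weight at a given training step.

    Assumption: a linear ramp from 0 to max_bonus. The excerpt only says the
    extra reward grows during training; the actual schedule is not given.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return max_bonus * progress


def rollout_reward(
    is_correct: bool,
    reasons_in_query_language: bool,
    step: int,
    total_steps: int,
    max_bonus: float = 0.5,
) -> float:
    """Reward for one rollout: 1.0 if correct, plus the bonus only when the
    rollout is correct AND its reasoning is in the query language."""
    base = 1.0 if is_correct else 0.0
    if is_correct and reasons_in_query_language:
        return base + language_bonus_weight(step, total_steps, max_bonus)
    return base
```

Early in training the bonus is near zero, so a correct rollout earns about the same reward in any reasoning language. Late in training, a correct rollout in the query language earns more than a correct rollout in another language, which is the convergence pressure the excerpt describes. An incorrect rollout never receives the bonus.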

You Can’t Reweight a Dead End: TRD and the Prefix Failure Problem

TL;DR for operators: The paper’s main message is simple: if a reasoning model has already walked into a dead end, per-token distillation often keeps supervising it from inside the dead end. A clever loss cap is not a map. A top-k filter is not a tow truck. Trajectory-Refined Distillation, or TRD, repairs the student’s own rollout before using it for distillation. The pipeline is: sample the student’s attempt, ask a teacher or privileged self-teacher to rewrite the trajectory into a better one, then train on the refined trajectory rather than on the original failed rollout. The technical contribution is not “better prompting”, although prompts are used. It is the shift from token-level correction to trajectory-level correction. ...

Think Meter, Not Think Bigger: The New Control Layer for AI Reasoning

Most companies do not actually want an AI system that “thinks longer.” They want one that knows when extra thinking is worth the bill. That distinction is becoming more important. Reasoning models are moving from demo-stage math puzzles into document review, financial research, compliance analysis, customer support escalation, and agentic workflows. In these settings, reasoning has three costs: latency, compute, and misplaced confidence. A model that spends 30 seconds producing an elegant wrong answer has not reasoned. It has performed expensive theatre. Very fluent theatre, admittedly. ...

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

Look Who’s Reasoning Now: UpstreamQA and the Fine Print of Video AI

Opening: Why this matters now. Video is becoming one of the most tempting inputs for business AI. Warehouses have cameras. Clinics have consultation rooms. Retailers have shelves, queues, and checkout counters. Property managers have inspection footage. Factories have safety recordings. Everyone wants to ask the same beautifully dangerous question: Can the model just watch the video and tell us what happened? ...

When RL Needs a Tour Guide: OGER and the Business of Smarter Exploration

Training a reasoning model is starting to look less like feeding a student more textbooks and more like taking that student into a difficult city with a very opinionated guide. The guide should not carry the student through every street. That creates a tourist, not a navigator. But leaving the student alone with a reward signal that says only “correct” or “wrong” is not exactly enlightened pedagogy either. The student may find one narrow route, repeat it forever, and call that intelligence. We have all seen corporate training programs with roughly this level of imagination. ...

When AI Answers the Wrong Question — And Why That Matters More Than Being Wrong

A support ticket arrives with a simple request: “Can I cancel this order after the trial ends?” The AI assistant replies with a polished explanation of the company’s refund policy. The paragraph is fluent. The tone is calm. The answer is probably useful to someone. Unfortunately, it may not answer the question that was asked. ...

Don’t Train Harder—Train Smarter: The Hidden Economics of RL for LLMs

The GPU bill is not the strategy. The easiest way to make reinforcement learning for reasoning models sound impressive is to say: sample more responses, train longer, scale harder. It is also the easiest way to make the finance team develop a facial twitch. Modern reasoning-focused LLMs increasingly rely on reinforcement learning with verifiable rewards: generate multiple candidate answers, score them with a rule-based signal, and update the model toward better reasoning behavior. In mathematics and coding tasks, this has become one of the most important post-training recipes. But it has a small accounting problem, in the same way a leaking ship has a small moisture problem. ...
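The "generate, score, update" loop in this excerpt can be sketched in a few lines. What follows is a hedged illustration, not the method from the post: the exact-match checker `rule_based_reward` and the group-normalized scores in `group_advantages` are assumptions chosen because they are a common way to turn rule-based rewards into a training signal. The excerpt does not say which scoring rule or normalization is used.

```python
from statistics import mean, pstdev


def rule_based_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 on exact match after trimming whitespace.

    Assumption: exact string match stands in for whatever rule-based
    checker a real pipeline would use (unit tests, a math verifier, ...).
    """
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center and scale rewards within one group of sampled answers.

    Answers that beat the group average get a positive score and the rest
    get a negative one. eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42".
candidates = ["42", " 42 ", "41", "7"]
rewards = [rule_based_reward(c, "42") for c in candidates]
advantages = group_advantages(rewards)
```

Under this normalization, a group whose answers are all right or all wrong yields advantages of exactly zero, so those samples contribute no update signal even though generating them cost compute.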

The Cost of Knowing You’re Wrong: Why Two Samples Beat Eight in AI Reasoning

An AI system gives an answer. The answer looks plausible. The reasoning trace is long enough to seem serious. The user asks the next question, which is the one that actually matters: How sure is it? For ordinary software, this question is already annoying. For reasoning language models, it is worse. These models do not just emit a short response; they may spend thousands of tokens walking through a problem before landing on an answer. Asking them again is not free. Asking them eight times is not diligence. It is a budget line with philosophical decoration. ...

When Right Meets Wrong: Teaching LLMs by Letting Their Mistakes Talk

Training a reasoning model is often treated like running a classroom with a very impatient teacher: give the model a problem, let it produce several answers, mark each answer right or wrong, and push the policy toward the winners. That is already useful. It is also slightly wasteful. Because in a real classroom, the wrong answers are not just trash to be swept off the floor. They reveal what the student misunderstood. They show which shortcuts are tempting, which algebra step keeps breaking, and which false pattern looks suspiciously persuasive. A good teacher does not only praise the correct solution. A good teacher puts the correct and incorrect attempts side by side and asks: what exactly changed? ...