Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Traceability sounds simple until a reasoning model enters the room.

For ordinary generated text, watermarking usually means nudging token choices so the final output carries a statistical signature. That is already a delicate game. Push too weakly and the detector sees nothing. Push too hard and the writing starts to smell like machine-selected confetti.

Reasoning models make the trade-off nastier. They do not merely continue text. They think, or at least they produce a visible reasoning phase that behaves like thinking: intermediate steps, candidate paths, self-correction, sometimes a little theatrical introspection. If a watermarking method randomly biases token choices inside that phase, it may not just change style. It may change the path that leads to the answer. In math, code, legal analysis, or planning, one corrupted intermediate step can quietly poison the final result. Excellent. We solved provenance by making the model worse.

The paper behind today’s article, “Distilling the Thought, Watermarking the Answer,” proposes a cleaner split: leave the reasoning trace alone, use it to extract a semantic direction, and watermark only the final answer.¹ The method is called ReasonMark. Its central idea is not “make the watermark stronger.” It is “stop inserting the watermark where the model is still doing the fragile part.”

That placement decision is the whole story.

The old watermarking trade-off breaks differently in reasoning models

Classic text watermarking often works by dividing the vocabulary into “green” and “red” token sets, then giving green tokens a small logit bonus during generation. Later, a detector checks whether the output contains more green tokens than expected by chance. This is elegant because detection can be statistical and lightweight.
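To make the mechanics concrete, here is a minimal Python sketch of a green-list scheme in this style. It is an illustration under assumptions, not the paper's code: the names `green_list` and `add_green_bonus`, the SHA-256 seeding on the previous token, the green fraction `gamma`, and the bonus `delta` are all hypothetical choices made for the example.

```python
import hashlib
import random


def green_list(prev_token_id: int, vocab_size: int,
               gamma: float = 0.5, key: int = 42) -> set[int]:
    """Pseudo-randomly pick a gamma fraction of the vocabulary as 'green'.

    Assumption: the partition is seeded by a secret key plus the previous
    token id, so the same context always yields the same green list.
    """
    digest = hashlib.sha256(f"{key}:{prev_token_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def add_green_bonus(logits: list[float], green: set[int],
                    delta: float = 2.0) -> list[float]:
    """Add a constant logit bonus to every green token; red tokens are untouched."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


# Toy vocabulary of 10 tokens with flat logits.
green = green_list(prev_token_id=7, vocab_size=10)
boosted = add_green_bonus([0.0] * 10, green)
```

Because the partition depends only on the key and the preceding token, anyone holding the key can rebuild each green list later without touching the model. Note what the sketch does not do: it never asks whether a boosted token makes sense in context.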

The problem is that the green list is usually generated through a pseudo-random rule. That is useful for detection but indifferent to meaning. In a normal paragraph, the damage may be tolerable. In a reasoning trace, the damage can hit the part of generation where the model is deciding what matters.

ReasonMark starts from a simple observation: reasoning models often expose a two-phase output structure.

Prompt
  ↓
Thinking phase: internal reasoning / chain-of-thought-like trace
  ↓
Answering phase: final response shown as the answer

The paper’s move is to decouple these phases. The thinking phase is not watermarked. It is treated as evidence. The answering phase is watermarked. It is treated as the surface that needs provenance.

That sounds almost too obvious, which is usually where good engineering hides.

The misconception to kill early is that watermarking is inevitably a random bias sprayed across generation. ReasonMark still uses a green-list watermarking mechanism, but it does not apply the same blind pressure everywhere. It first reads the model’s own reasoning trace, extracts the most semantically important tokens, compresses them into a vector, and uses that vector to decide which green tokens deserve stronger nudging.

The watermark is still statistical. The guidance is semantic.

ReasonMark turns the reasoning trace into a semantic compass

The mechanism has four moving parts.

First, the model generates its thinking phase normally. ReasonMark assumes the system can identify the boundary between thinking and answer, such as through structural delimiters like <think> and </think>. This is not a small assumption, and we will return to it later.
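A boundary check of this kind can be sketched in a few lines of Python. Only the `<think>` and `</think>` delimiters come from the text above; the helper name `split_phases` and its fallback behavior are assumptions for illustration.

```python
def split_phases(text: str, open_tag: str = "<think>",
                 close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (thinking, answer) using structural delimiters."""
    start = text.find(open_tag)
    if start == -1:
        # Hypothetical fallback: no boundary found, so treat everything as answer.
        # A real deployment might prefer to refuse to watermark instead.
        return "", text.strip()
    end = text.find(close_tag, start + len(open_tag))
    if end == -1:
        # Unclosed thinking block: same hypothetical fallback.
        return "", text.strip()
    thinking = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return thinking, answer


thinking, answer = split_phases("<think>2 + 2 = 4</think>The answer is 4.")
```

The two fallback branches are where the assumption bites: if the delimiters are missing or malformed, there is no thinking phase to distill.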

Second, the method identifies Critical Tokens in the thinking phase. These are not merely frequent words. The paper defines a Criticality Score based on two ideas:

Component	What it tries to capture	Operational intuition
Global Causal Contribution	Whether a token appears at distributional turning points and remains influential later	Important concepts shape the trajectory, not just one local word choice
Competitive Persistence Scoring	Whether a token survives in high-competition candidate sets across nearby steps	Useful concepts keep reappearing when the model is deciding what to say next

The consolidated Criticality Score is roughly:

$$ \mathrm{CS}(w) = \mathrm{GCC}(w) \cdot \log\bigl(1 + \mathrm{CPS}(w)\bigr) $$

Do not let the notation obscure the point. The score is trying to find tokens that are both influential and persistent. A token matters if it appears when the model’s predictive distribution shifts and if it keeps competing across subsequent reasoning steps.
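A small worked example shows how the product behaves. This is a sketch under assumptions: the GCC and CPS values are invented toy numbers, the logarithm is taken to be natural, and `criticality_score` and `top_k_critical` are hypothetical helper names. How GCC and CPS are actually computed is not reproduced here.

```python
import math


def criticality_score(gcc: float, cps: float) -> float:
    """CS(w) = GCC(w) * log(1 + CPS(w)); natural log is an assumption."""
    return gcc * math.log(1.0 + cps)


def top_k_critical(scores: dict[str, float], k: int) -> list[str]:
    """Return the k tokens with the highest Criticality Score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented (GCC, CPS) pairs for three tokens in a geometry problem.
toy = {
    "triangle": (0.9, 4.0),    # influential and persistent
    "hypotenuse": (0.7, 2.0),  # influential, moderately persistent
    "the": (0.1, 6.0),         # very persistent, barely influential
}
scores = {tok: criticality_score(g, c) for tok, (g, c) in toy.items()}
critical = top_k_critical(scores, k=2)
```

In this toy setup the filler word keeps reappearing but never moves the distribution, so the multiplication pushes it to the bottom. The logarithm plays the opposite role: it stops sheer persistence from dominating the score.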

Third, these Critical Tokens are embedded and compressed into a Principal Semantic Vector. The paper stacks the embeddings of the selected Critical Tokens, applies PCA, and takes the first principal component as the initial semantic direction:

$$ R_0 = \mathrm{PCA}_1(H) $$

This is the “distilling the thought” part. The reasoning trace is not copied into the watermark. It is reduced into a dominant direction in embedding space.
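For readers who want to see the compression step, the following is a dependency-free Python sketch of taking a first principal component with power iteration. It rests on assumptions: the embeddings are mean-centered before the decomposition, the starting vector is uniform, and `first_principal_component` is a hypothetical name. In practice one would more likely call a library routine such as NumPy's `numpy.linalg.svd`.

```python
import math


def first_principal_component(H: list[list[float]], iters: int = 200) -> list[float]:
    """Approximate the top principal direction of the rows of H (n_tokens x dim)."""
    n, d = len(H), len(H[0])
    # Center each embedding dimension (an assumption about preprocessing).
    mean = [sum(row[j] for row in H) / n for j in range(d)]
    X = [[row[j] - mean[j] for j in range(d)] for row in H]
    # Uniform start; it must not be orthogonal to the true component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = X^T (X v), which is proportional to the covariance matrix times v.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in X]
        w = [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variance along v; keep the current estimate
        v = [c / norm for c in w]
    return v


# Toy 2-D "embeddings" that all lie along the direction (4, 1).
R0 = first_principal_component([[0.0, 0.0], [4.0, 1.0], [8.0, 2.0]])
```

One detail worth knowing: the sign of a principal component is arbitrary, so any pipeline that later takes cosine similarity against it needs a convention for orientation. The sketch simply inherits the sign of its starting vector.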

Fourth, during the answering phase, ReasonMark applies the watermark adaptively. It still partitions the vocabulary into green and red lists using a hash-based method, but not every green token receives the same bonus. A green token receives more support when its embedding aligns with the current Principal Semantic Vector:

$$ \delta_{i,w} = \delta_0 + \delta_\lambda \cdot s_{w,i-1} $$

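Here, $\delta_{i,w}$ is the bonus applied to green token $w$ at step $i$, and $s_{w,i-1}$ is the cosine similarity between that token’s embedding and the semantic vector as it stood after step $i-1$. The base bonus $\delta_0$ applies to every green token; $\delta_\lambda$ scales the extra support for aligned ones. The semantic vector is also updated as the answer unfolds, using a moving average anchored in the initial reasoning direction.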
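The adaptive bonus is easy to sketch in Python. The `adaptive_delta` function follows the formula above directly. The update rule in `update_semantic_vector` is a guess: the article only says the vector is a moving average anchored in the initial direction, so the blend weights `beta` and `anchor` and the exact form of the mix are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(c * c for c in v))
    return v if norm == 0.0 else [c / norm for c in v]


def cosine(a: list[float], b: list[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def adaptive_delta(token_emb: list[float], r_prev: list[float],
                   delta0: float = 1.0, delta_lambda: float = 1.0) -> float:
    """delta_{i,w} = delta_0 + delta_lambda * s_{w,i-1} for a green token w."""
    return delta0 + delta_lambda * cosine(token_emb, r_prev)


def update_semantic_vector(r_prev: list[float], r0: list[float],
                           chosen_emb: list[float],
                           beta: float = 0.9, anchor: float = 0.1) -> list[float]:
    """Hypothetical update: moving average of the running vector and the chosen
    token's embedding, with a constant pull back toward the initial vector r0."""
    mixed = [beta * p + (1.0 - beta) * e + anchor * a
             for p, e, a in zip(r_prev, chosen_emb, r0)]
    return normalize(mixed)


r0 = [1.0, 0.0]
aligned = adaptive_delta([1.0, 0.0], r0, delta0=1.0, delta_lambda=0.5)
orthogonal = adaptive_delta([0.0, 1.0], r0, delta0=1.0, delta_lambda=0.5)
r1 = update_semantic_vector(r0, r0, chosen_emb=[0.0, 1.0])
```

Cosine similarity can be negative, so in this sketch a green token pointing away from the semantic vector receives less than the base bonus. Whether the paper clips that case is not something the sketch assumes either way.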

The practical consequence is subtle but important. ReasonMark is not asking, “Which green token should we force?” It is asking, “Among green tokens, which ones are already semantically compatible with what the model reasoned?”

That is why the paper’s mechanism-first framing matters. If we only summarize the benchmark table, ReasonMark looks like another method with better numbers. The better interpretation is that the method changes where the watermark is allowed to interfere.

Detection stays ordinary; embedding becomes selective

One clever part of ReasonMark is that the detector does not need to reconstruct the Principal Semantic Vector. Detection still uses the standard statistical test for green-token overrepresentation. The dynamic semantic guidance happens during embedding, not detection.

That separation is commercially meaningful.

In an enterprise deployment, detection complexity matters. A watermarking system that requires the detector to replay the prompt, retrieve the hidden reasoning, rebuild vectors, or access internal model states would be awkward for audits and downstream verification. ReasonMark’s detector remains lightweight: it checks whether the final answer contains an unusually high proportion of green tokens.
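The detector's arithmetic is short enough to show. The sketch below uses the one-proportion z-test that is common for green-list watermarks; treating it as the exact statistic in the paper is an assumption, and `green_z_score` and the green fraction `gamma` are illustrative names.

```python
import math


def green_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """How many standard deviations the green count sits above chance.

    Under the null hypothesis (no watermark), each token is green with
    probability gamma, so the count has mean gamma * total and
    variance total * gamma * (1 - gamma).
    """
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# 75 green tokens out of 100 with a 50/50 partition.
z = green_z_score(green_count=75, total=100)
```

Nothing here needs the prompt, the reasoning trace, or the semantic vector. The detector only needs the answer tokens and whatever key rebuilds the green lists.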

The intelligence sits upstream, during generation. The detector remains boring. Boring is underrated in compliance systems.

There is also a deployment trade-off. ReasonMark needs more access during generation than a purely post-hoc detector. It needs the thinking phase, token probabilities or candidate information, model embeddings, and logit control. This makes it more suitable for model providers, self-hosted deployments, or enterprise systems with generation-layer control. It is less suitable for users calling a black-box API that hides reasoning traces and logits.

So the business category is not “universal watermark detector.” It is closer to “traceable generation layer for reasoning-model infrastructure.”

The main results support the placement argument

The main experimental table evaluates ReasonMark on two 32B reasoning models: Qwen3-32B and DeepSeek-R1-Distill-Qwen-32B. The tasks cover text continuation on C4, German-English translation on WMT16, and mathematical reasoning on AIME and GSM8K. Metrics include perplexity for text quality, BLEU for translation quality, mathematical accuracy for reasoning tasks, and AUC for watermark detection.

The high-level result is that ReasonMark keeps output quality close to the no-watermark baseline while maintaining strong detectability.

Evidence	Likely purpose	What it supports	What it does not prove
Main benchmark table across C4, WMT16, AIME, GSM8K	Main evidence	ReasonMark improves the quality-detectability balance across text, translation, and math tasks	It does not prove universal robustness across all domains or model families
Ablation on Critical Tokens, GCC, and CPS	Ablation	The Critical Token selection and its two scoring components contribute to quality and detection	It does not prove the scoring formula is uniquely optimal
Attack tests with deletion, insertion, synonym replacement, translation, and paraphrase	Robustness test	The watermark remains detectable under several perturbations, especially word-level attacks	It does not make the watermark immune to heavy rewriting
Hyperparameter sensitivity for $\beta_0$, top-k, $\delta_0$, and $\delta_\lambda$	Sensitivity test	Performance is not a knife-edge artifact of one exact setting	It does not remove the need for deployment tuning
Latency table on C4	Implementation detail	The method adds modest overhead relative to no watermark and is faster than several semantic-aware baselines	It does not establish production cost across long-context, multi-turn, or tool-using workflows

The reasoning-task results are the most important. On AIME, ReasonMark reports mathematical accuracy of 69.86 for Qwen3 and 71.34 for DeepSeek, close to the no-watermark baselines of 70.03 and 71.52. On GSM8K, it reports 93.96 and 95.14, again close to no-watermark results of 94.01 and 95.21.
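The gaps are small enough that it helps to compute them rather than eyeball them. The snippet below only subtracts the figures quoted in the paragraph above; the dictionary layout and variable names are illustrative.

```python
# (no-watermark accuracy, ReasonMark accuracy), as quoted above.
reported = {
    ("AIME", "Qwen3"): (70.03, 69.86),
    ("AIME", "DeepSeek"): (71.52, 71.34),
    ("GSM8K", "Qwen3"): (94.01, 93.96),
    ("GSM8K", "DeepSeek"): (95.21, 95.14),
}

# Accuracy lost to watermarking, in points, rounded to two decimals.
gaps = {key: round(base - marked, 2) for key, (base, marked) in reported.items()}

for (task, model), gap in gaps.items():
    print(f"{task:6s} {model:9s} accuracy gap: {gap:.2f} points")
```

Every gap comes out under 0.2 accuracy points. The article's figures do not include confidence intervals, so the subtraction says nothing about whether differences this small are distinguishable from run-to-run noise.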

That is the core evidence for the paper’s claim: the watermark does not meaningfully damage reasoning accuracy in these tested settings.

The C4 results are also strong. ReasonMark reports lower perplexity than the no-watermark baseline for both models: 10.31 versus 10.55 for Qwen3, and 10.54 versus 10.82 for DeepSeek. The paper interprets this as improved fluency. A more cautious reading is that the semantically guided sampling may regularize output under the experimental setup. I would not turn that into a universal claim that watermarking improves writing quality. That would be the sort of sentence that makes benchmarks roll their eyes.

For translation, ReasonMark reports the best BLEU scores among watermarking methods: 9.916 for Qwen3 and 9.653 for DeepSeek in the main table. The detection AUC on WMT is much lower than on C4 or math tasks, sitting in the mid-to-high 80s. That matters. The method is strong, but detectability under translation-style outputs is not the same story as near-perfect AUC on math reasoning.

There is also a small reporting wrinkle: the main table lists DeepSeek WMT AUC for ReasonMark as 85.10, while the appendix text and detailed translation table report 87.10. The safe business interpretation does not depend on which of the two figures is correct. Either way, translation quality holds up well against the other watermarking methods, while WMT detectability is materially lower than the near-99 AUC seen in several other settings.

The ablation tells us the semantic compass is doing real work

The ablation study removes or weakens core components on C4 with Qwen3.

The full ReasonMark configuration reports perplexity of 10.3080 and AUC of 99.31. Replacing Critical Token selection with random sampling increases perplexity to 12.8801 while keeping AUC high at 99.21. That is the important failure mode: the watermark remains detectable, but perplexity rises by roughly 25%, a clear loss of quality.

This is exactly what one would expect if the semantic guidance is doing the quality-preservation work. Random tokens can still guide a detectable signal, but they do not guide a coherent one. The detector may be happy. The reader may not be.

Removing the Global Causal Contribution component raises perplexity to 11.1510 with AUC of 99.11. Removing Competitive Persistence Scoring gives perplexity of 11.0597 and AUC of 98.69. The paper reads this as GCC being especially important for semantic coherence, while CPS contributes more to robust detectability.

That interpretation is plausible. GCC looks for concepts that shape the reasoning path; CPS rewards concepts that keep mattering across competitive generation moments. Together, they make the Critical Token set less like a bag of keywords and more like a compressed trace of what the model was trying to do.

For business readers, the ablation has a practical message: the value is not “PCA was used.” PCA is the compression step. The value begins earlier, in selecting tokens that are worth compressing.

Robustness is good, but paraphrase remains the hard wall

The attack experiments test watermark detection after word deletion, word insertion, synonym replacement, translation, and paraphrasing. On Qwen3, ReasonMark reports unattacked AUC of 99.31. Under word-level attacks, it stays around the low-to-mid 90s: 94.36 for deletion, 93.60 for insertion, and 93.52 for synonym replacement.

Under semantic attacks, the numbers drop. Translation gives 82.58. Paraphrase gives 70.54. On DeepSeek, the pattern is similar: 99.52 unattacked, 82.79 under translation, and 70.75 under paraphrase.

This is still useful robustness. It is also a reminder that watermarking is not magic ink. Heavy semantic rewriting remains a serious attack surface. A paraphrase AUC around 70 is better than the 50 a random guess would score, but it is not enough to treat the detector as courtroom-grade proof by itself.

The right operational framing is tiered evidence:

Detection scenario	Operational meaning
Unattacked or lightly edited output	Strong provenance signal
Word-level modified output	Still useful for audit triage
Translated and back-translated output	Moderately useful signal, especially as one component of a broader audit system
Heavily paraphrased output	Weak-to-moderate signal; should not be used alone for high-stakes attribution

This distinction matters because many business cases for watermarking are not about winning a philosophical debate over authorship. They are about triage: which documents need review, which model outputs may have leaked, which student submissions deserve closer inspection, which vendor content came from the approved model, and which generated reports can be traced back to an internal system.

For triage, a statistical signal can be valuable even when it is imperfect. For legal proof, imperfect is a polite way of saying “bring more evidence.”
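As a sketch of what tiered routing could look like in code, the function below maps a detector score to a review tier. Everything specific in it is hypothetical: the name `triage`, the tier labels, and the thresholds `review_at` and `strong_at` are placeholders that a real deployment would calibrate on its own data.

```python
def triage(detector_z: float, review_at: float = 2.0,
           strong_at: float = 4.0) -> str:
    """Map a watermark detector score to an audit tier (thresholds are placeholders)."""
    if detector_z >= strong_at:
        return "strong"      # strong provenance signal
    if detector_z >= review_at:
        return "review"      # ambiguous: route to a human or gather more evidence
    return "no_signal"       # not proof of absence; paraphrase can erase the mark


tiers = [triage(z) for z in (5.2, 3.1, 0.4)]
```

The lowest tier is deliberately named `no_signal` rather than anything like "human-written". Given how far detection falls under paraphrase, a missing watermark tells you little about where a text came from.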

The latency story is modest, not miraculous

ReasonMark is designed to avoid the heavy cost of semantic-aware watermarking methods that require auxiliary models or multiple sampling passes. The latency table reports average time per token on 200 C4 samples.

The no-watermark baseline is 0.06109 seconds per token. ReasonMark is 0.06613 seconds per token. That is roughly an 8.25% overhead relative to no watermark. It is slower than lightweight static methods such as KGW, but faster than several semantic-aware methods such as SemStamp, k-SemStamp, and SimMark.
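The overhead figure follows directly from the two per-token timings quoted above:

$$ \frac{0.06613 - 0.06109}{0.06109} = \frac{0.00504}{0.06109} \approx 0.0825 $$

At those rates, a 1,000-token answer would take about 5 seconds longer ($1000 \times 0.00504 \approx 5.04$ seconds), assuming the per-token averages hold at that length, which the C4 latency table alone does not establish.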

This is a reasonable deployment story. Not free. Not frightening.

The architectural reason is that ReasonMark front-loads the expensive semantic extraction after the thinking phase. Once it has the Principal Semantic Vector, answer-phase watermarking mainly uses vector similarity and moving-average updates. The method avoids repeatedly calling an auxiliary semantic model for every token.

In enterprise terms, that means ReasonMark may be acceptable where provenance matters enough to justify a single-digit-percentage latency premium. It is less attractive for ultra-low-latency chat where every millisecond is treated as a personal insult from the infrastructure team.

What this means for enterprise AI systems

The paper directly shows that, in the tested models and tasks, a reasoning-aware watermark can preserve answer quality and mathematical accuracy while embedding a detectable statistical signal.

Cognaptus’ business inference is narrower and more useful: ReasonMark points toward a generation-layer provenance architecture for reasoning-model deployments.

A possible enterprise workflow looks like this:

1. User sends task to reasoning model
2. Model generates internal reasoning trace
3. System extracts Critical Tokens from the trace
4. Critical Tokens are compressed into a Principal Semantic Vector
5. Final answer is generated with semantic-guided watermarking
6. Later detector checks final answer for green-token bias
7. Audit system treats detection as a probabilistic provenance signal

This architecture is relevant for several business settings.

In financial analysis, internal research memos generated by a reasoning model may need provenance controls before they enter investment committee workflows. In legal or compliance support, generated drafts may need to be traceable without damaging the reasoning process that determines the final answer. In education, watermarking is often discussed badly, usually with too much certainty and not enough humility; ReasonMark would still not solve student authorship detection by itself, but it could support controlled environments where the institution owns the generation stack. In enterprise knowledge automation, it could help distinguish approved model-generated content from external or unapproved generated text.

The ROI is not “we can now detect everything.” The ROI is cheaper audit routing. A strong watermark signal can reduce manual review load, strengthen internal content governance, and create a provenance trail for generated outputs. That is useful, even if it is not a silver bullet. Silver bullets are mostly useful in PowerPoint.

The boundary conditions are not minor footnotes

ReasonMark depends on several conditions that affect practical use.

First, the method assumes access to a distinct thinking phase. If the model does not expose reasoning, or if the provider hides it, the system cannot directly apply the paper’s method as described. One could induce a reasoning step through prompting, but that changes the pipeline and may change model behavior.

Second, the quality of the watermark guidance depends on the quality of the reasoning trace. If the trace is short, generic, or hallucinated, the Principal Semantic Vector becomes less reliable. A bad compass is still a compass; it just points confidently in the wrong direction, which is worse than asking for directions.

Third, the method requires generation-layer access. It needs token probabilities, embeddings, and logit modification. This is realistic for model providers and some self-hosted enterprise systems. It is not realistic for ordinary users interacting with a closed model through a high-level API that exposes only text.

Fourth, detection remains statistical. AUC is a population-level evaluation metric, not a promise that any individual document can be attributed with certainty. This matters for governance. A watermark detector should be part of an evidence bundle, not the whole trial.
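It can help to see what AUC actually measures. The sketch below computes it the pairwise way: the fraction of (watermarked, unwatermarked) pairs in which the watermarked text gets the higher detector score. The scores are invented toy values and the function name `auc` is illustrative.

```python
def auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented detector scores for four watermarked and four clean documents.
watermarked = [5.1, 4.8, 6.0, 1.2]
clean = [0.3, -0.5, 1.5, 0.9]
toy_auc = auc(watermarked, clean)
```

In this toy set the AUC is high, yet one watermarked document (score 1.2) sits below one clean document (score 1.5). A good population-level number and a wrong call on an individual document coexist without contradiction.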

Fifth, robustness against paraphrase is still limited. The paper’s paraphrase AUC around 70 shows that semantic rewriting weakens the signal substantially. In adversarial settings, watermarking should be combined with access logs, cryptographic signing, document lineage, policy controls, and possibly retrieval-side provenance.

The real contribution is phase discipline

ReasonMark’s contribution is not that it invented watermarking for LLMs. It did not. It also did not eliminate the detection-quality-robustness trade-off. Nothing in the paper justifies that level of celebration.

The contribution is more specific and more interesting: it shows that reasoning models need phase-aware watermarking. The thinking phase and the answering phase should not be treated as the same surface. The thinking phase is where the model builds the semantic basis for the answer. The answering phase is where provenance can be embedded with less risk to the reasoning path.

That idea generalizes beyond watermarking. Many enterprise controls for reasoning models will need the same discipline. Do not interrupt the fragile cognitive workspace unless you must. Extract signals from it. Apply controls where they do the least damage. Verify downstream with lightweight mechanisms.

ReasonMark is therefore best read not just as a watermarking algorithm, but as a design pattern for governing reasoning systems: observe the thought, protect the answer, audit the output.

For companies deploying reasoning models, that is the practical lesson. Traceability should not be bolted onto the final text with blind pressure. It should be engineered around how the model actually produces the answer.

The watermark does not need to think. But it should know where the thinking happened.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shuliang Liu, Xingyu Li, Hongyi Liu, Dong Fang, Yibo Yan, Bingchen Duan, Qi Zheng, Lingfeng Su, and Xuming Hu, “Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models,” arXiv:2601.05144v2, published as a conference paper at ICLR 2026, https://arxiv.org/pdf/2601.05144.
