Inked in the Code: Can Watermarks Save LLMs from Deepfake Dystopia?

TL;DR for operators

BiMark is a proposed watermarking method for large language models that tries to solve a practical trilemma: keep generated text quality intact, detect the watermark without access to the original model, and embed more than a yes/no signal.¹

The important part is not that it “detects AI text.” That is the shallow version, beloved by procurement decks and policy panels that have never met a paraphraser. The more useful claim is that BiMark can encode provenance-like metadata—model identity, timestamp, source label, policy context—inside the token sampling process, then recover that information later using statistical evidence and the right secret key.

Its mechanism has three moving parts. First, it uses a fair bit-flip to shift probability mass between two balanced token partitions while preserving the expected distribution. Second, it repeats that process across multiple independent layers, making the watermark easier to detect without simply cranking up distortion. Third, it uses XOR with random masks so message bits can be encoded while the reweighting signals still behave like fair coin flips.

The experiments support the core promise, with the usual academic boundary fence. On C4-style generation with Llama3-8B, BiMark beats MPAC on short-text multi-bit extraction, especially when messages are longer and text is shorter. For 50-token text, the paper reports relative gains over stronger MPAC settings of 20.87%, 29.50%, and 30.02% for 8-, 16-, and 32-bit messages. It also reports lower perplexity than MPAC, and downstream summarisation and translation scores comparable to those of non-watermarked and unbiased baselines.

For business use, the result points toward provenance infrastructure, not a magic deepfake detector. BiMark reduces one technical obstacle: carrying useful metadata without obviously damaging text. It does not remove the harder operational obstacles: key custody, standards, provider adoption, text editing, paraphrasing, sufficiently long samples, and the lovely human tradition of trying to break every control the moment it becomes useful.

The real problem is not “AI text”; it is useful provenance without visible damage

The existing article framed BiMark as a possible answer to the synthetic-content flood. That framing is directionally right, but too generous. The world does not merely need another detector that says, with theatrical confidence, “this might be AI.” We have plenty of those. Some are useful. Some are astrology with ROC curves.

The more serious business problem is provenance. A bank, news platform, law firm, marketplace, public agency, or enterprise software vendor may not simply need to know whether a passage was generated by a model. It may need to know which model, under which account, at what time, under which policy setting, for which workflow, and possibly with which client or tenant identifier. A binary watermark is a doorbell. Provenance metadata is a delivery manifest.

That is why multi-bit watermarking matters. A zero-bit watermark answers one question: does this text contain the watermark? A multi-bit watermark attempts to carry a message. That message could represent a model identifier, timestamp bucket, platform label, generation policy, or audit trail reference. Not the entire compliance database, obviously. Nobody is stuffing a full governance dossier into 100 tokens unless they also believe PDFs are a lifestyle. But even compact metadata can be operationally valuable if the watermark survives enough real-world handling.

The catch is that LLM text is fragile terrain. Traditional watermarking for images or documents can often hide signals in pixels, formatting, or metadata. LLM watermarking has to work through token probabilities. Push too hard, and the model writes worse. Push too softly, and the watermark vanishes. Require access to the generating model, and third-party verification becomes impractical. Require the detector to know the message space in advance, and extraction becomes operationally clumsy.

BiMark positions itself inside that trade-off. It is an inference-time method: the watermark is added while the model generates text, by changing how next tokens are sampled. It is designed to be model-agnostic at detection time: the verifier should not need the original model or its logits. And it is message-agnostic in the multi-bit sense: the detector can recover the embedded message without first enumerating all possible messages.

The last sentence needs one correction before it becomes sales literature. Model-agnostic does not mean trustless. Message-agnostic does not mean keyless. The detector still needs the watermarking scheme, vocabulary partitions, pseudorandom functions, and secret-key material. BiMark removes one deployment dependency—access to the model—not all deployment dependencies. Reality, as usual, remains annoyingly attached to cryptographic plumbing.

Fair coin flips are the quality-preservation trick

BiMark’s first mechanism is bit-flip unbiased reweighting. It sounds like a small technical detail. It is actually the conceptual hinge of the paper.

Earlier green-list watermarking schemes divide the vocabulary into favoured and disfavoured token sets. During generation, they increase the probability of green-list tokens. Later, the detector checks whether the generated text contains more green-list tokens than clean text should. Simple. Effective. Also capable of nudging the model away from its original distribution, which is where text quality starts paying the invoice.
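To make the biased baseline concrete, here is a minimal Python sketch of a soft green-list boost: add a constant to the logits of green tokens, then renormalise. The function name `soft_green_boost`, the four-token vocabulary, the fixed green list, and the value of `delta` are all illustrative assumptions. Real schemes derive the green list pseudorandomly from a key and the preceding context.

```python
import math
from typing import Dict, Set


def soft_green_boost(logits: Dict[str, float], green: Set[str], delta: float) -> Dict[str, float]:
    """Add `delta` to green-token logits, then apply a softmax."""
    boosted = {tok: z + (delta if tok in green else 0.0) for tok, z in logits.items()}
    peak = max(boosted.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(z - peak) for tok, z in boosted.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Toy example: four equally likely tokens, two of them on the green list.
logits = {"the": 1.0, "a": 1.0, "cat": 1.0, "dog": 1.0}
green = {"the", "cat"}

clean = soft_green_boost(logits, green, delta=0.0)
marked = soft_green_boost(logits, green, delta=2.0)

green_mass_clean = sum(clean[t] for t in green)    # 0.5 before the boost
green_mass_marked = sum(marked[t] for t in green)  # e^2 / (e^2 + 1), roughly 0.88
```

The boost always points the same way for a given green list, so the sampled distribution is pulled away from the model's own. That systematic pull is the quality cost the unbiased approach tries to avoid.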

Unbiased watermarking tries to avoid that invoice by preserving the expected output distribution. BiMark does this with a symmetric probability shift. Imagine the vocabulary is split into two balanced partitions. A fair coin decides which side gets boosted and which side gets reduced. Heads: boost one partition. Tails: boost the other. Because the direction is random and balanced, the expected probability distribution remains unchanged.

That does not mean every generated answer looks exactly like non-watermarked text. This is the common misconception worth killing early. “Unbiased” here is not a guarantee that every individual sample is indistinguishable, nor that the watermark survives every edit. It means the reweighting is designed so that, in expectation, the original distribution is preserved. Individual samples can still contain statistical evidence that a detector with the right key can accumulate.

This distinction matters for business interpretation. A watermarking vendor saying “quality-preserving” may sound like “no user will notice anything, ever.” That is not what the paper proves. The stronger and more precise claim is that the watermarking procedure avoids systematic distortion of the token distribution while still creating detectable token-membership patterns.

That is useful. It is also less cinematic. Good infrastructure often is.
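The symmetric shift described in this section can be sketched in a few lines of Python. This is an illustration of the idea, not the paper's exact reweighting formula: the shift rule (a fraction `delta` of the smaller partition's mass), the function name `bitflip_reweight`, and the toy vocabulary are assumptions made for the example.

```python
from typing import Dict, Set


def bitflip_reweight(probs: Dict[str, float], side_a: Set[str], coin: int,
                     delta: float = 0.5) -> Dict[str, float]:
    """Move probability mass toward side A (coin=1) or side B (coin=0)."""
    mass_a = sum(p for tok, p in probs.items() if tok in side_a)
    mass_b = 1.0 - mass_a
    if mass_a <= 0.0 or mass_b <= 0.0:
        return dict(probs)  # one side is empty, so there is nothing to shift
    shift = delta * min(mass_a, mass_b)  # same amount whichever way the coin lands
    signed = shift if coin == 1 else -shift
    scale_a = (mass_a + signed) / mass_a
    scale_b = (mass_b - signed) / mass_b
    return {tok: p * (scale_a if tok in side_a else scale_b) for tok, p in probs.items()}


probs = {"the": 0.4, "a": 0.3, "cat": 0.2, "dog": 0.1}
side_a = {"the", "dog"}  # a balanced split: each side holds 0.5 of the mass

heads = bitflip_reweight(probs, side_a, coin=1)  # "the" rises to about 0.6
tails = bitflip_reweight(probs, side_a, coin=0)  # "the" falls to about 0.2

# Averaging over the fair coin recovers the original distribution.
average = {tok: 0.5 * (heads[tok] + tails[tok]) for tok in probs}
```

Each individual outcome is visibly skewed, which is what a keyed detector later exploits. Only the average over the coin matches the original, which is the precise sense of "unbiased" used here.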

Multilayering turns a weak nudge into evidence

A single unbiased bit-flip layer is elegant, but elegance is not the same as detectability. The paper recognises the problem: if each individual adjustment is constrained by the need to preserve the expected distribution, the signal may be weak. Biased schemes such as Soft Red List can increase watermark strength simply by boosting the green list harder. BiMark cannot do that without damaging its own unbiasedness argument.

So it stacks layers.

Each layer uses an independent vocabulary bipartition and an independent fair coin flip. The output distribution is reweighted iteratively. Each layer adds another statistical opportunity for the detector to observe whether sampled tokens fall into the expected partition. The detector does not inspect model logits. It reconstructs the relevant partitions and counts token memberships.

This is the paper’s most important engineering move. It says: instead of one loud watermark, use several quieter ones. That gives the detector more evidence while keeping each probability adjustment bounded.
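A toy simulation may help show how stacked layers and model-free detection fit together. The sketch below is a simplified stand-in, not the paper's algorithm: the uniform "model", the SHA-256 seeding of partitions and coins from the previous token, the shift rule, and every name and constant (`VOCAB`, `LAYERS`, `layer_setup`, `detect_z`) are assumptions for illustration.

```python
import hashlib
import math
import random
from typing import List, Tuple

VOCAB = 64   # toy vocabulary of integer token ids
LAYERS = 4   # number of stacked reweighting layers


def layer_setup(key: bytes, prev_token: int, layer: int) -> Tuple[set, int]:
    """Rebuild one layer's balanced partition and fair coin from key and context."""
    seed = hashlib.sha256(key + f"|{prev_token}|{layer}".encode()).digest()
    rng = random.Random(seed)
    ids = list(range(VOCAB))
    rng.shuffle(ids)
    return set(ids[: VOCAB // 2]), rng.getrandbits(1)


def reweight(probs: List[float], side_a: set, coin: int, delta: float) -> List[float]:
    """Shift mass toward side A (coin=1) or side B (coin=0)."""
    mass_a = sum(probs[t] for t in side_a)
    mass_b = 1.0 - mass_a
    if mass_a <= 0.0 or mass_b <= 0.0:
        return probs
    shift = delta * min(mass_a, mass_b)
    signed = shift if coin == 1 else -shift
    sa, sb = (mass_a + signed) / mass_a, (mass_b - signed) / mass_b
    return [p * (sa if t in side_a else sb) for t, p in enumerate(probs)]


def generate(key: bytes, n_tokens: int, delta: float, rng: random.Random) -> List[int]:
    tokens = [0]  # hypothetical start-of-text token id
    for _ in range(n_tokens):
        probs = [1.0 / VOCAB] * VOCAB  # stand-in for the model's next-token distribution
        for layer in range(LAYERS):
            side_a, coin = layer_setup(key, tokens[-1], layer)
            probs = reweight(probs, side_a, coin, delta)
        tokens.append(rng.choices(range(VOCAB), weights=probs)[0])
    return tokens


def detect_z(key: bytes, tokens: List[int]) -> float:
    """Count tokens that landed on each layer's boosted side; no model needed."""
    hits = trials = 0
    for prev, tok in zip(tokens, tokens[1:]):
        for layer in range(LAYERS):
            side_a, coin = layer_setup(key, prev, layer)
            hits += int((tok in side_a) == (coin == 1))
            trials += 1
    return (hits - 0.5 * trials) / math.sqrt(0.25 * trials)


marked = generate(b"secret", 200, 0.5, random.Random(0))
plain = [0] + random.Random(1).choices(range(VOCAB), k=200)
z_marked = detect_z(b"secret", marked)  # large positive score
z_plain = detect_z(b"secret", plain)    # hovers around zero
```

The detector only needs the key, the partition rule, and the token ids. Each layer contributes its own membership check per token, which is where the extra evidence comes from.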

There is a nice business analogy here, though one should not overdo it. A single audit control can be bypassed or produce noisy evidence. Several independent controls can make misconduct harder to hide. BiMark’s multilayering is similar in spirit: it does not make each token scream “watermarked”; it makes many small decisions slightly informative.

But the paper also shows why “more layers” is not a free lunch. The ablation study finds that detectability improves with more layers up to a point, then declines when layers become excessive. The explanation is mechanical rather than mystical. Reweighting is iterative. Shallow layers affect the distribution more strongly; once early layers shrink some token probabilities close to zero, later layers cannot fully revive them. Too many layers can add noise instead of useful evidence.

That is a deployment-relevant result. The practical tuning problem is not “maximise watermark strength.” It is “find the layer count and scaling factor that give enough evidence without degrading generation or wasting inference compute.” Less exciting than a launch slogan, more useful than a launch slogan.

XOR lets the watermark carry a message without breaking the fairness

The third mechanism is where BiMark becomes multi-bit. The method borrows the general idea of position allocation: for each generated token, pseudorandomly select which message bit that token will help encode. Over many tokens, the detector aggregates votes for each bit and uses majority voting to reconstruct the message.

The obstacle is unbiasedness. If the watermark directly used message bits as coin flips, then the distribution of reweighting directions would depend on the message. A message with too many zeros or ones could bias the process. That would undermine the whole “fair coin” premise.

BiMark solves this with XOR-enhanced position allocation. For each token, it selects a message bit and generates random mask bits. It then XORs the message bit with the random mask. The result behaves like a fair coin flip, regardless of whether the message bit is zero or one. During detection, the same pseudorandom mask can be reconstructed using the key and context, allowing the detector to reverse the XOR and recover votes for the original message bit.
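The XOR step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the pseudorandom function is SHA-256 over the key and the token position (the scheme described above uses key and context), the detector's per-token evidence is modelled as a noisy copy of the coin, and all names (`prf_bits`, `encode_coins`, `decode_message`) are hypothetical.

```python
import hashlib
import random
from typing import List, Tuple


def prf_bits(key: bytes, position: int, n_message_bits: int) -> Tuple[int, int]:
    """Derive a message-bit index and a mask bit for one token position."""
    digest = hashlib.sha256(key + position.to_bytes(4, "big")).digest()
    index = int.from_bytes(digest[:4], "big") % n_message_bits
    mask = digest[4] & 1
    return index, mask


def encode_coins(key: bytes, message: List[int], n_tokens: int) -> List[int]:
    """Coin for each token: the allocated message bit XOR the pseudorandom mask."""
    coins = []
    for pos in range(n_tokens):
        index, mask = prf_bits(key, pos, len(message))
        coins.append(message[index] ^ mask)
    return coins


def decode_message(key: bytes, observed: List[int], n_message_bits: int) -> List[int]:
    """Undo the mask per token, then majority-vote each message bit."""
    votes = [[0, 0] for _ in range(n_message_bits)]
    for pos, coin in enumerate(observed):
        index, mask = prf_bits(key, pos, n_message_bits)
        votes[index][coin ^ mask] += 1
    return [1 if ones > zeros else 0 for zeros, ones in votes]


key = b"demo-key"
message = [1, 0, 1, 1, 0, 0, 1, 0]
coins = encode_coins(key, message, n_tokens=800)

# Model imperfect detection by flipping roughly 20% of the observed coins.
noise = random.Random(7)
observed = [c ^ int(noise.random() < 0.2) for c in coins]
recovered = decode_message(key, observed, len(message))

# Even an all-zero message yields coins that look like fair flips.
ones_share = sum(encode_coins(key, [0] * 8, 2000)) / 2000
```

The mask is what keeps the coins balanced for any message, and the same key lets the detector strip it off again. Majority voting then absorbs a fair amount of per-token noise, provided each bit collects enough votes.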

The mechanism is doing two jobs at once:

Mechanism	Technical role	Operational consequence
Bit-flip unbiased reweighting	Preserves the expected token distribution while creating token-count evidence	Quality degradation is reduced compared with biased green-list approaches
Multilayer reweighting	Adds multiple independent evidence channels per token	Shorter texts and harder messages become more recoverable
XOR-enhanced position allocation	Encodes message bits while keeping reweighting directions fair	Provenance metadata can be embedded without requiring the detector to know the message in advance

This is why the paper deserves a mechanism-first reading. If you start with the benchmark table, BiMark looks like “higher extraction, lower perplexity.” Fine. But the real contribution is the architecture of the compromise: fairness for quality, layers for evidence, XOR for payload.

The experiments test three different claims, not one big victory lap

The paper’s evidence is best read by purpose. Some experiments test the main claim. Some test robustness. Some explain the mechanism. Mixing them into one “BiMark works” bucket would be convenient and faintly lazy, so naturally we will not do that.

Evidence block	Likely purpose	What it supports	What it does not prove
Message extraction on C4-style prompts using Llama3-8B	Main evidence and comparison with prior work	BiMark improves multi-bit extraction over MPAC, especially on short text and longer messages	It does not prove universal performance across all models, languages, domains, or decoding settings
Perplexity using Gemma-9B as an oracle	Quality comparison	BiMark’s outputs appear less degraded than MPAC under tested settings	Perplexity is not a full proxy for business quality, legal reliability, tone, or factuality
Summarisation and translation tasks	Downstream quality preservation	BiMark performs comparably to non-watermarked text and other unbiased methods on these tasks	It does not prove quality preservation for all enterprise workflows
Synonym substitution tests	Robustness test	BiMark keeps higher extraction rates than MPAC under controlled word replacement	It does not prove resilience against adaptive adversaries
DIPPER paraphrasing appendix	Robustness/sensitivity test	BiMark detection remains competitive under paraphrasing, especially with longer text	It does not prove embedded message recovery under every paraphrase regime
Layer/scaling ablations	Mechanism explanation	Multilayering helps up to an optimum; too much layering can hurt	It does not produce a universal tuning rule for production deployments

The headline result is strongest in the multi-bit extraction table. With 8-bit messages, BiMark reports extraction rates of 95.26% at 50 tokens, 97.62% at 100 tokens, 98.15% at 200 tokens, and 97.88% at 300 tokens. MPAC improves when its logit boost is increased, but the quality cost rises as well. For 50-token text, BiMark’s relative gains over MPAC with the stronger setting are reported as 20.87% for 8-bit messages, 29.50% for 16-bit messages, and 30.02% for 32-bit messages.

The pattern is exactly where operators should care: short text and longer payloads. Long text gives watermark detectors more evidence. Short text is the annoying real world: social posts, customer-service replies, product blurbs, internal summaries, comments, captions, snippets. The paper’s claim that BiMark is stronger under shorter texts is therefore not a decorative result. It points to the deployment zone where watermarking usually becomes brittle.

The quality evidence is also relevant. In the downstream task table, BiMark reports summarisation BERTScore of 32.48 and ROUGE-1 of 38.32, compared with 32.45 and 38.32 for no watermark. For machine translation, BiMark reports BERTScore of 56.14 and BLEU of 22.15, compared with 56.21 and 21.93 for no watermark. These are not sweeping proof that users will never notice anything. They do show that, under these task settings, BiMark behaves much closer to clean or unbiased baselines than stronger biased watermarking.

The robustness evidence is promising but should be interpreted as controlled stress testing. Under synonym substitution, BiMark keeps higher 8-bit extraction rates than MPAC across substitution ratios. At 100 tokens, BiMark reports 91.02%, 81.71%, and 70.37% extraction under 0.1, 0.2, and 0.3 substitution ratios respectively, compared with 76.85%, 67.36%, and 58.11% for the stronger MPAC setting.

The appendix paraphrasing test is also useful, but it is not a second thesis. It evaluates watermark detection under DIPPER paraphrasing with different lexical and order diversity settings. BiMark remains competitive, especially at longer text lengths. For example, at 300 tokens it reports detection rates of 98.93%, 100%, and 98.35% under the three paraphrasing settings shown in the paper. At 50 tokens, the hardest paraphrasing condition is much rougher: BiMark reports 59.94%. That lower number is not a result to hide; it is a reminder that short, aggressively rewritten text is where watermark evidence becomes thin.

Good controls do not abolish physics. They bargain with it.

The business value is provenance routing, not courtroom certainty

The business interpretation should be separated from what the paper directly proves.

What the paper directly shows is a technical method for embedding and extracting multi-bit messages through inference-time watermarking, with experimental evidence that the method can improve extraction and preserve quality under the tested settings.

What Cognaptus infers is that methods like BiMark could become useful infrastructure for provenance routing. A platform could embed compact identifiers into model-generated text. A downstream verifier, regulator, marketplace, or internal audit system could extract those identifiers without querying the original model. That changes the operational shape of governance. Instead of asking “does this look synthetic?”, a verifier could ask “does this text carry a valid platform watermark, and what message does it encode?”

That opens several possible use cases:

Use case	What BiMark-like watermarking could support	Boundary
Platform provenance	Encode model or provider identifiers	Requires standardised keys, policies, and detector access
Enterprise audit trails	Link generated content to workflow, tenant, or policy state	Message capacity is limited; detailed records must live elsewhere
Marketplace moderation	Distinguish authorised synthetic content from unmarked content	Unmarked does not mean human-written
Regulatory reporting	Provide technical evidence of origin or generation channel	Legal acceptance depends on governance, not just algorithmic performance
Synthetic-data hygiene	Track generated text entering training or knowledge pipelines	Paraphrasing and transformation can weaken evidence

The deepest ROI is not “catch every fake.” That is fantasy procurement. The more plausible ROI is reducing ambiguity in high-volume systems. If a content platform, CRM, legal knowledge base, coding assistant, or government chatbot can tag generated text at source, later systems can route, audit, label, or quarantine it more cheaply.

That is a boring sentence. Boring sentences often describe valuable infrastructure.

The detector still needs a trust model

Watermarking conversations often quietly skip the trust model. BiMark should not be allowed to.

Detection is model-agnostic, not infrastructure-free. A verifier does not need the original LLM or its probability outputs. But it does need access to the relevant watermarking configuration and secret-key-derived reconstruction process. That means the deployment question quickly becomes organisational: who holds the keys, who can verify, who can revoke, who can rotate, and who is trusted not to leak or misuse the detector?

A public detector creates one set of risks. A private detector creates another. If verification is too open, attackers may use it to optimise evasions. If verification is too closed, third parties cannot audit provenance claims. If keys are shared too broadly, compromise becomes likely. If keys are held too tightly, the watermark becomes a vendor-controlled assertion rather than an ecosystem standard.

None of these issues invalidate BiMark. They simply move it from paper to operations, where good ideas are traditionally introduced to bureaucracy and promptly asked for three approval forms.

For enterprise use, a BiMark-like system would need a governance architecture around it:

key generation and rotation;
detector access policy;
audit logging for verification requests;
versioning across models and tokenizer changes;
message schema design;
handling of edited, paraphrased, or translated text;
legal rules for when watermark evidence is enough to trigger action.
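As one possible shape for the first few items on that list, the sketch below derives a separate watermark key per key version and tokenizer from a master secret, and supports rotation and revocation. Nothing here comes from the paper: the `KeyRing` class, the label format, and the HMAC-SHA256 derivation are assumptions chosen for illustration. A real deployment would keep the master secret in a managed key store.

```python
import hashlib
import hmac
from typing import Dict, Optional


def derive_key(master_secret: bytes, key_version: int, tokenizer_id: str) -> bytes:
    """Derive a per-version, per-tokenizer watermark key with HMAC-SHA256."""
    label = f"watermark|v{key_version}|{tokenizer_id}".encode()
    return hmac.new(master_secret, label, hashlib.sha256).digest()


class KeyRing:
    """Tracks which key versions a detector is still allowed to use."""

    def __init__(self, master_secret: bytes, tokenizer_id: str) -> None:
        self._master = master_secret
        self._tokenizer_id = tokenizer_id
        self._latest = 0
        self._active: Dict[int, bytes] = {}

    def rotate(self) -> int:
        """Issue the next key version and return its number."""
        self._latest += 1
        self._active[self._latest] = derive_key(self._master, self._latest, self._tokenizer_id)
        return self._latest

    def revoke(self, version: int) -> None:
        """Stop honouring a key version, for example after a suspected leak."""
        self._active.pop(version, None)

    def key_for(self, version: int) -> Optional[bytes]:
        return self._active.get(version)


ring = KeyRing(b"hypothetical-master-secret", tokenizer_id="tokenizer-a")
v1 = ring.rotate()
v2 = ring.rotate()
ring.revoke(v1)
```

Binding the tokenizer identifier into the derivation means a tokenizer change cannot silently reuse an old key. Who is allowed to call `rotate`, `revoke`, and `key_for` remains the organisational question.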

The paper is solving a sampling-and-detection problem. Businesses still have to solve the institutional problem. Confusing the two is how technical controls become compliance theatre.

The limits are precise, not fatal

The main limitation is not that BiMark is “only a paper.” That phrase is true but too lazy to be useful.

The first boundary is text length. Watermarking accumulates evidence over tokens. Shorter text gives fewer observations. BiMark improves the short-text setting relative to MPAC, but it does not make 10-token fragments magically reliable.
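A back-of-envelope calculation shows why length matters. The sketch below assumes each observation independently lands on the boosted side with a fixed hit rate, and that layers add independent observations. Both are optimistic simplifications, and the hit rate of 0.75 is an invented figure, not a number from the paper.

```python
import math


def expected_z(hit_rate: float, n_tokens: int, layers: int = 1) -> float:
    """Expected z-score against the null hypothesis of a 50% hit rate."""
    trials = n_tokens * layers
    return (hit_rate - 0.5) * trials / math.sqrt(0.25 * trials)


def tokens_needed(hit_rate: float, z_target: float, layers: int = 1) -> int:
    """Tokens required for the expected z-score to reach `z_target`."""
    if hit_rate <= 0.5:
        raise ValueError("hit_rate must exceed 0.5 for the evidence to accumulate")
    trials = (z_target / (2.0 * hit_rate - 1.0)) ** 2
    return math.ceil(trials / layers - 1e-9)  # small tolerance for float rounding


z_at_10 = expected_z(0.75, 10)     # well under 2: too thin to act on
z_at_100 = expected_z(0.75, 100)   # 5.0: strong evidence
need_single = tokens_needed(0.75, z_target=4.0)             # 64 tokens
need_layered = tokens_needed(0.75, z_target=4.0, layers=4)  # 16 tokens
```

Evidence grows with the square root of the number of observations, so a tenfold cut in length costs roughly a threefold drop in z-score. More layers help, but only to the extent that their observations are genuinely independent.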

The second boundary is transformation. Substitution and paraphrasing tests are encouraging, but adaptive attacks are different from benchmark perturbations. A determined adversary who can repeatedly paraphrase, translate, summarise, splice, or regenerate text may reduce or destroy the watermark. The paper’s robustness tests should be read as resilience evidence, not as proof of adversarial permanence.

The third boundary is domain coverage. The main generation experiments use Llama3-8B with C4-RealNewslike prompts, and quality experiments use standard summarisation and translation tasks. Those are reasonable tests. They are not the same as legal drafting, medical triage, financial disclosures, customer complaints, multilingual social posts, or code comments.

The fourth boundary is message design. Multi-bit capacity is valuable, but compact. Operators should not imagine embedding rich provenance records directly into generated text. The better design is to embed a short identifier that points to a secure external record.
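A short identifier plus an external lookup might look like the following sketch. The 16-bit schema, the field names, and the in-memory `AUDIT_LOG` are hypothetical. A real system would choose field widths to match its own identifiers and keep the records in a secured store.

```python
from typing import Dict, List

# Hypothetical 16-bit schema: field name -> width in bits.
SCHEMA = {"model_id": 4, "tenant_id": 6, "day_bucket": 6}


def pack(fields: Dict[str, int]) -> List[int]:
    """Turn field values into the bit list a watermark would carry."""
    bits: List[int] = []
    for name, width in SCHEMA.items():
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits.extend((value >> shift) & 1 for shift in reversed(range(width)))
    return bits


def unpack(bits: List[int]) -> Dict[str, int]:
    """Rebuild field values from an extracted bit list."""
    fields: Dict[str, int] = {}
    cursor = 0
    for name, width in SCHEMA.items():
        value = 0
        for bit in bits[cursor:cursor + width]:
            value = (value << 1) | bit
        fields[name] = value
        cursor += width
    return fields


# Stand-in for a secured external record store keyed by the compact identifier.
AUDIT_LOG = {(3, 17): {"workflow": "claims-summary", "policy": "v2"}}

payload = pack({"model_id": 3, "tenant_id": 17, "day_bucket": 42})
decoded = unpack(payload)
record = AUDIT_LOG.get((decoded["model_id"], decoded["tenant_id"]))
```

The watermark carries only 16 bits, and everything descriptive lives in the external record. That keeps the payload within reach of short texts and lets the record be corrected or access-controlled without regenerating anything.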

The fifth boundary is ecosystem adoption. Watermarking becomes materially more useful when providers, platforms, auditors, and regulators agree on standards. A clever watermark used by one model vendor is a local control. A compatible provenance ecosystem is infrastructure. The distance between those two is where many AI governance proposals go to become slideware.

BiMark’s lesson is bigger than watermarking

BiMark is interesting because it treats watermarking as a probability-design problem rather than a sticker slapped onto text after generation. That is the right instinct. LLM outputs are not static documents; they are sampled sequences. If provenance is going to survive in this medium, it probably has to be built into sampling behaviour, not bolted on after the paragraph has already escaped into the wild.

The paper’s most useful contribution is therefore not one number in a table. It is the compromise architecture: keep the expected distribution stable, multiply weak signals across layers, and encode message bits without breaking the fairness assumption. That is a disciplined way to handle the watermarking trilemma.

For operators, the conclusion is neither “watermarks will save us” nor “watermarks are useless.” Both positions are suspiciously convenient. BiMark suggests a narrower and more useful answer: watermarking can become part of provenance infrastructure if it carries metadata, preserves quality, survives enough normal editing, and fits inside a credible trust model.

Can watermarks save LLMs from deepfake dystopia? No, not alone. Dystopia, regrettably, has a diversified portfolio.

Can methods like BiMark make provenance cheaper, more inspectable, and less dependent on vibes-based detection? Yes. That is the more realistic promise—and, for serious organisations, the more valuable one.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, and Shirui Pan, “BiMark: Unbiased Multilayer Watermarking for Large Language Models,” arXiv:2506.21602, 2025. https://arxiv.org/abs/2506.21602
