The Receipt Is in the Pixels: Model Attribution After the Watermark Fantasy

TL;DR for operators

Generated images may carry a more durable signature than most teams assume. Not a cute watermark. Not a metadata tag. Not a visible logo hiding in the corner like a nervous intern. A model-level statistical signature.

The paper Guess the Unified Model: How Much Can We Recover from Generated Images? studies whether images produced by unified multimodal models can be attributed back to the model that generated them.¹ The authors train a ConvNeXt classifier to identify the generating model from images produced by five open-source unified models, then extend part of the analysis to include two closed-source systems. The core result is blunt: attribution works surprisingly well. With 100 training images per model, accuracy is already 36% in a five-way task where chance is 20%. With 3K images per model, it reaches 93.9%. With 25K images per model, it reaches 99.9%.

The operational lesson is not that synthetic-media provenance is solved. Please resist the urge. The result is strongest when the candidate model set is known, the defender can collect enough labeled examples, and the image distribution resembles the tested conditions. The paper does not show that one can reliably identify any generator on the internet, infer the exact prompt, recover the user’s intent, or defeat every post-processing workflow.

What it does show is more useful: generated images from unified models appear to contain stable, model-specific visual regularities. These regularities survive common corruptions, remain partly visible under structural transformations, and generalize across semantic domains better than a pure “it learned the subject matter” explanation would predict. In business terms, this points toward model-origin auditing as a practical monitoring layer: useful for platform integrity, vendor accountability, brand-protection workflows, internal creative asset governance, and provenance triage.

The less glamorous but more important conclusion is that attribution should be treated as measurement infrastructure, not oracle infrastructure. It can narrow suspicion. It can reveal generator fingerprints. It can audit controlled pipelines. It cannot turn every JPEG into a sworn affidavit.

The image is not just content; it is a production trace

Most business discussions about AI-generated images still orbit two familiar ideas. First, the visible content: what the image depicts, whether it looks fake, whether it violates policy, whether the hands have stopped resembling seafood. Second, the declared provenance: metadata, platform labels, watermarking, content credentials, or whatever governance layer the product team says will definitely be implemented after launch.

This paper asks a different question: if the declaration layer is absent, can the image itself reveal which model made it?

That framing matters because unified multimodal models are not just another image generator category. The paper focuses on systems that jointly handle language and vision, including open-source models such as BAGEL, Emu3.5, Janus-Pro-7B, MMaDA, and Show-o2, plus closed-source systems identified in the paper as Gemini 2.5 Flash Image and GPT-Image-1. These models do not merely pass text into an image pipeline. Their generation behavior is shaped by cross-modal representations, model architecture, training data, alignment procedures, decoding choices, and all the other wonderful little decisions that make production AI systems behave like products of engineering rather than acts of nature.

The paper’s practical premise is simple. If different unified models leave different statistical traces in their images, then attribution becomes a measurable classification problem. Given an image, predict the source model.
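To make that framing concrete, the toy sketch below treats attribution as closed-set classification over a known candidate list. It is not the paper's method: it swaps the trained ConvNeXt classifier for a simple nearest-centroid rule, and the feature tuples, model names, and values are all hypothetical.

```python
import math

# A feature vector summarizing one image. In this toy sketch it is a
# hypothetical hand-made tuple (e.g., mean R, G, B); the paper instead
# learns features end to end with a trained image classifier.
Features = tuple[float, ...]


def fit_centroids(train: dict[str, list[Features]]) -> dict[str, Features]:
    """Average the training feature vectors of each candidate model."""
    centroids = {}
    for model, vectors in train.items():
        dim = len(vectors[0])
        centroids[model] = tuple(
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        )
    return centroids


def attribute(x: Features, centroids: dict[str, Features]) -> str:
    """Closed-set attribution: return the candidate whose centroid is nearest."""
    return min(centroids, key=lambda model: math.dist(x, centroids[model]))


# Hypothetical training data for two made-up candidate generators.
train = {
    "model_a": [(0.9, 0.1, 0.1), (0.8, 0.2, 0.1)],
    "model_b": [(0.1, 0.2, 0.9), (0.2, 0.1, 0.8)],
}
centroids = fit_centroids(train)
print(attribute((0.85, 0.15, 0.1), centroids))  # model_a
```

Note that `attribute` always returns one of the known labels, even for an image from a generator outside the candidate set. That forced-choice behavior is the closed-set limitation this piece returns to later.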

That sounds almost too straightforward, which is usually where useful empirical papers begin.

The main result is a scaling curve, not a magic trick

The scaling experiment is the paper’s main evidence. The authors generate images from the same prompt set across five open-source unified models, then train a ConvNeXt-Tiny classifier from scratch to predict which model generated each image. They use MJHQ-30K prompts for the scaling study, with a held-out test set of 5K images per model.

The scaling curve is the first thing operators should notice.

Training images per model	Reported attribution accuracy	Interpretation
100	36%	Above five-way chance of 20%, but weak enough for caution
500	59.8%	Clear signal with limited data
3K	93.9%	Strong practical separability in the tested setting
25K	99.9%	Near-perfect attribution under a large controlled dataset

The most important number is not 99.9%. Big training sets often make demos look like destiny. The more interesting number is 93.9% at 3K images per model, because that starts to look like an achievable audit dataset rather than a research-lab indulgence. Three thousand labeled examples per model is still non-trivial, but for platforms, large enterprises, model vendors, agencies, and trust-and-safety teams, it is not fantasy territory.

The lower-data result also matters. At 100 images per model, the classifier reaches 36%, above the 20% chance level but nowhere near operational reliability. That is a useful warning. This technique is not “upload five samples and become the FBI.” The signal exists early, but dependable attribution needs enough examples.
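Because chance differs between the five-way scaling task and the seven-way experiments later on, a chance-normalized view can help when reading the curve. The rescaling below is our own illustration, not a metric the paper reports: it measures how much of the gap between chance and perfect accuracy the classifier closes. At 100 images per model it closes 0.20 of that gap; at 3K it closes about 0.92.

```python
def lift_over_chance(accuracy: float, n_classes: int) -> float:
    """Share of the gap between chance and perfect accuracy that is closed.

    0.0 means chance-level attribution; 1.0 means perfect attribution.
    """
    chance = 1.0 / n_classes
    return (accuracy - chance) / (1.0 - chance)


# Reported five-way scaling results: training images per model -> accuracy.
scaling = {100: 0.36, 500: 0.598, 3_000: 0.939, 25_000: 0.999}
for n_images, acc in scaling.items():
    print(f"{n_images:>6} images/model: lift = {lift_over_chance(acc, 5):.2f}")
```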

The classifier’s mistakes are also informative. Janus and Show-o2 are more distinctive in the low-data regime, with higher recall when the classifier is trained on only 100 images per model. Emu shows high precision but low recall: when the classifier predicts Emu, it is often right, but it does not consistently catch Emu outputs. That distinction is operationally important. A model can have identifiable traits that are sparse, inconsistent, or only visible under certain visual conditions. In governance terms, “detectable” is not one property. It is a profile.
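The precision/recall distinction falls straight out of a confusion matrix. The counts below are hypothetical, chosen only to reproduce the qualitative Emu pattern (rarely predicted wrongly, often missed); they are not the paper's numbers, and the label set is trimmed to three models for brevity.

```python
def per_class_precision_recall(confusion, labels):
    """confusion[i][j] counts images from model i that were predicted as model j.

    Returns {label: (precision, recall)}.
    """
    results = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        actually_k = sum(confusion[k])                     # row sum
        precision = true_pos / predicted_as_k if predicted_as_k else 0.0
        recall = true_pos / actually_k if actually_k else 0.0
        results[label] = (precision, recall)
    return results


# Hypothetical counts (not from the paper): 100 test images per model.
labels = ["Emu", "Janus", "Show-o2"]
confusion = [
    [40, 30, 30],  # true Emu: most outputs are missed (low recall)
    [5, 85, 10],   # true Janus
    [5, 10, 85],   # true Show-o2
]
for label, (p, r) in per_class_precision_recall(confusion, labels).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Here Emu gets precision 0.80 but recall 0.40: when the classifier says Emu it is usually right, yet most Emu images are assigned to other models.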

Off-the-shelf multimodal models are not attribution systems

The paper then asks a tempting question: if frontier multimodal models can interpret images, can they perform source attribution directly?

This is a baseline comparison against prompting-based attribution, not the paper’s main classifier result. The authors evaluate two frontier MLLMs as attributors, giving each a target image and asking which of five candidate models produced it. They test zero-shot, one-shot, and five-shot settings, with 1,000 images per source model per shot condition.

The result is politely embarrassing for the “just ask the big model” school of procurement.

Without exemplars, the MLLMs start near chance in the five-way task: 28.8% for one evaluator and 21.4% for the other. Few-shot examples help, but accuracy plateaus at around 50% across one to five exemplars per candidate model. That is much better than guessing. It is also far below the specialized classifier trained for the task.
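Sampling noise is small at this scale, which is why the gap to the trained classifier is meaningful. A minimal sketch, assuming each evaluator and shot condition used 5,000 target images (1,000 per source model across the five source models); that total is our reading of the setup, not a figure stated here. The Wilson score interval is a standard choice for binomial proportions.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed accuracy over n independent trials.

    z = 1.96 gives an approximate 95% interval.
    """
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 1,000 target images for each of five source models,
# i.e. 5,000 trials per evaluator and shot condition.
N_TRIALS = 5_000
for name, acc in [("evaluator 1", 0.288), ("evaluator 2", 0.214)]:
    lo, hi = wilson_interval(acc, N_TRIALS)
    print(f"{name}: {acc:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

With 5,000 trials the 95% interval around 28.8% is roughly 27.6% to 30.1%, so the distance between these starting points, the ~50% plateau, and the specialized classifier is far larger than sampling noise.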

This result should not be overread. It does not prove that MLLMs can never perform attribution well. It shows that, under the paper’s setup, in-context visual comparison is not a substitute for a trained attribution model. The specialized classifier appears to learn representational cues that are not easily extracted through a few reference examples and a prompt saying, in effect, “look carefully, please.”

For business use, this is the difference between a dashboard and a colleague squinting at a screenshot. Both may produce an answer. One is less likely to improvise its way into confidence.

The corruption tests rule out the cheap explanation

A natural objection is that the classifier might be exploiting low-level artifacts: blur, compression-like residue, edge patterns, color noise, resizing quirks, or other superficial crumbs left by image generation. If so, attribution would be brittle. A mild image transformation would break it.

The paper’s corruption experiment is best read as a robustness and sensitivity test. The authors apply low-level corruptions to both training and test images, using 25K training images and 5K test images per model. The corruptions include color jitter, Gaussian noise, Gaussian blur, and resizing.

The no-corruption baseline is 99.9%. Under color jitter, accuracy remains around 94–95%. Under Gaussian noise, around 95–96%. Under blur, 98.7–99.4%. Under resizing, one condition stays at 96.6%, while the more aggressive resize drops to 85.2%.
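The symmetric protocol is easy to miss, so here is a minimal sketch of it on grayscale images stored as nested lists. The corruption functions and their parameters (`sigma`, `factor`) are hypothetical stand-ins; the exact settings the paper used are not given in this summary. The key point is that `corrupt_split` applies the same corruption to both the training and test splits.

```python
import random

Image = list[list[float]]  # grayscale pixels in [0, 1]


def gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add i.i.d. Gaussian noise to every pixel, then clip back into [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def down_up_resize(img: Image, factor: int) -> Image:
    """Downsample by keeping every `factor`-th pixel, then upsample with
    nearest-neighbour repetition so the output has the original size."""
    h, w = len(img), len(img[0])
    return [[img[(y // factor) * factor][(x // factor) * factor] for x in range(w)]
            for y in range(h)]


def corrupt_split(images: list[Image], corruption, **params) -> list[Image]:
    """Apply one corruption, with identical parameters, to every image in a split."""
    return [corruption(img, **params) for img in images]


rng = random.Random(0)
train = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(4)]
test = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(2)]

# Symmetric protocol: both splits get the same corruption and parameters.
train_c = corrupt_split(train, gaussian_noise, sigma=0.05, rng=rng)
test_c = corrupt_split(test, gaussian_noise, sigma=0.05, rng=rng)
train_r = corrupt_split(train, down_up_resize, factor=2)
test_r = corrupt_split(test, down_up_resize, factor=2)
```

A harsher, asymmetric check closer to deployment would train on the clean `train` split and evaluate on `test_c`; the paper's corruption numbers do not describe that setting.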

Test type	Likely purpose	What the result supports	What it does not prove
Low-level corruptions	Robustness/sensitivity test	Attribution is not only driven by fragile pixel-level artifacts	It does not prove robustness to all real-world editing, recompression, cropping, or adversarial manipulation
Structural perturbations	Ablation-like probe	Depth, object boundaries, and color distributions each carry some source signal	It does not identify the full causal feature set
OOD semantic domains	Generalization test	Model-specific cues transfer across subject categories	It does not eliminate all semantic influence
Prompt-language attribution	Exploratory extension and boundary test	Model identity is easier to recover than prompt language	It does not cover all languages, prompt styles, or cultural effects

The corruption results are important because they weaken the most convenient misconception: that attribution works because the classifier memorizes cheap visual artifacts. It may use some low-level information, but that cannot be the whole story. The signal survives too much damage.

This does not mean the signature is indestructible. The paper’s transformations are controlled and symmetrical: the classifier is trained and tested on corrupted images. Real-world settings may involve unknown editing, platform compression, screenshots, crops, filters, or deliberate laundering. But the tested signature is not as flimsy as a metadata tag. It has depth.

Annoying for people who hoped provenance would be solved by deleting EXIF data. Convenient, perhaps, for everyone else.

Structural perturbations show that the signature is distributed

The next experiment probes structural bias. This is not a second thesis; it is a diagnostic layer. The authors transform images into depth maps using Depth-Anything-V2, into segmentation outputs using Segment Anything, and into random pixel shuffles. These transformations isolate different families of visual information: 3D structure, object-level boundaries, and color distribution.

The accuracies drop, but they do not collapse:

Transformation	Accuracy
No transformation	99.9%
Depth	83.2%
SAM segmentation	79.2%
Random pixel shuffle	72.7%

This is one of the paper’s more useful interpretive moments. If depth maps alone retain 83.2% accuracy, then model identity is partly encoded in spatial and structural tendencies. If segmentation retains 79.2%, object boundaries and composition also carry signal. If random pixel shuffle retains 72.7%, color distribution matters too, even after spatial arrangement is destroyed.
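Why the shuffle isolates color: a random permutation of pixel positions destroys spatial arrangement but leaves the color histogram untouched. A minimal sketch on a flattened RGB image; the helper name and the per-image seed are assumptions, and the paper's exact shuffling procedure (for example, whether one permutation is shared across images) is not described here.

```python
import random
from collections import Counter

Pixel = tuple[int, int, int]  # (R, G, B)


def pixel_shuffle(pixels: list[Pixel], seed: int) -> list[Pixel]:
    """Randomly permute pixel positions in a flattened image.

    Spatial arrangement (shapes, edges, composition) is destroyed, but the
    multiset of colours, and therefore the colour histogram, is unchanged.
    """
    shuffled = list(pixels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


image = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (10, 10, 240)]  # hypothetical 2x2 image, flattened
shuffled = pixel_shuffle(image, seed=42)
print(Counter(shuffled) == Counter(image))  # True: colour distribution preserved
```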

But none of these transformed views restores the full 99.9%. That matters. The source signature is not one neat fingerprint sitting in one layer of the image. It appears distributed across multiple feature families.

For operators, this changes how attribution tooling should be designed. A detector that relies on one artifact family may be fragile. A detector that integrates texture, color, structure, composition, and higher-level visual representations is more plausible. The business version: do not buy the tool that claims it found “the tell.” There probably is no single tell. There is a pattern family.
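One hypothetical way to act on this is late fusion: train separate attributors on different views and combine their outputs. The paper uses these views as diagnostic probes and does not propose a fused detector; the geometric-mean rule below is a common, simple choice, and the per-view probabilities are made up.

```python
import math


def fuse_views(view_probs: list[dict[str, float]], eps: float = 1e-12) -> dict[str, float]:
    """Late fusion: average per-view log-probabilities (a geometric mean of the
    probabilities), then renormalise so the fused scores sum to 1."""
    labels = list(view_probs[0])
    log_avg = {
        m: sum(math.log(max(p[m], eps)) for p in view_probs) / len(view_probs)
        for m in labels
    }
    total = sum(math.exp(v) for v in log_avg.values())
    return {m: math.exp(v) / total for m, v in log_avg.items()}


# Hypothetical outputs of three single-view attributors over two candidates.
views = [
    {"model_a": 0.6, "model_b": 0.4},  # texture / RGB view
    {"model_a": 0.7, "model_b": 0.3},  # depth-map view
    {"model_a": 0.4, "model_b": 0.6},  # shuffled-colour view
]
fused = fuse_views(views)
print(max(fused, key=fused.get))  # model_a
```

Here two of three views favor `model_a`, and the fused score follows. A model trained jointly on all views could capture interactions that this per-view combination misses.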

Cross-domain tests separate model identity from subject matter

The semantic-generalization experiment addresses another objection: perhaps the classifier is not learning model identity but learning subject-matter habits. Maybe one model renders vehicles differently, another tends to produce certain food layouts, and the classifier merely rides those category-specific shortcuts.

To test this, the authors build a 10D dataset across ten broad domains: animals, vehicles, arts and works, landscapes, foods and drinks, clothing, interior spaces, household items, buildings, and people. Within each domain, they generate 300 minimal prompts, avoiding adjectives, viewpoints, styles, and narrative flourishes. The point is to reduce subjective prompt variation and isolate broad semantic category effects.

They then train one classifier per domain using 200 images per model and test it across all domains using 100 images per model. This creates a 10-by-10 matrix: train on animals, test on vehicles; train on food, test on buildings; and so on.

This is a generalization test, and its interpretation should be careful. Same-domain results tend to be higher, as expected. A classifier trained on buildings and tested on buildings reaches 71.1%; one trained on people and tested on people reaches 72.3%; one trained on vehicles and tested on vehicles reaches 64.1%. But cross-domain performance often remains well above chance. Since this is now a seven-model task, chance is about 14.3%.

The paper also trains mixed-domain baselines with the same number of training images, producing an average accuracy of 46.7%. Cross-domain performance varies widely and is not symmetric. For example, training on vehicles and evaluating on food and drinks gives 24.4%, while training on food and drinks and evaluating on vehicles gives 39.1%. That asymmetry is not noise to wave away; it means domains expose different parts of the model signature.
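Asymmetry is easy to quantify once the 10-by-10 matrix is available. The sketch below takes a dictionary keyed by (train domain, test domain); the function names are hypothetical, and the example uses only the vehicles/food-and-drinks pair reported above. `summarize_transfer` is for readers who have the full matrix; it is not run on the paper's data here.

```python
def asymmetry(acc: dict[tuple[str, str], float], a: str, b: str) -> float:
    """Absolute gap between training on a / testing on b and the reverse."""
    return abs(acc[(a, b)] - acc[(b, a)])


def summarize_transfer(acc: dict[tuple[str, str], float], domains: list[str]):
    """Mean in-domain accuracy, mean cross-domain accuracy, and the most
    asymmetric domain pair for a full train-by-test accuracy matrix."""
    in_domain = [acc[(d, d)] for d in domains]
    cross = [acc[(a, b)] for a in domains for b in domains if a != b]
    worst_pair = max(
        ((a, b) for i, a in enumerate(domains) for b in domains[i + 1:]),
        key=lambda pair: asymmetry(acc, *pair),
    )
    return sum(in_domain) / len(in_domain), sum(cross) / len(cross), worst_pair


# The reported pair; keys are (train_domain, test_domain).
reported = {("vehicles", "food_drinks"): 0.244, ("food_drinks", "vehicles"): 0.391}
print(f"asymmetry: {asymmetry(reported, 'vehicles', 'food_drinks'):.3f}")  # 0.147
```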

The authors further test whether semantic co-occurrence might explain the results. They use Qwen3-VL to ask whether images from one prompt domain contain elements from another queried domain. The reported co-occurrence patterns show little correlation with attribution accuracy, which weakens the “it just learned the subject matter” explanation.
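The co-occurrence check amounts to correlating two quantities across domain pairs. A minimal sketch, assuming a Pearson correlation over off-diagonal cells; the paper's exact statistic is not stated in this summary, and the input values are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical values, one per off-diagonal (train, test) domain cell:
# how often images from one domain were judged to contain elements of the
# other, and the corresponding cross-domain attribution accuracy.
co_occurrence = [0.05, 0.30, 0.10, 0.22, 0.15]
accuracy = [0.31, 0.29, 0.35, 0.33, 0.28]
print(f"r = {pearson(co_occurrence, accuracy):.2f}")
```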

The business implication is subtle. A model-origin classifier trained only on one content category may generalize somewhat, but not uniformly. If a platform wants to audit AI-generated fashion images, training only on food prompts because food is easy to generate would be unwise. Domain coverage still matters. But the existence of cross-domain signal suggests that attribution systems need not be rebuilt from zero for every content vertical.

In other words: subject matter matters, but it is not the boss.

Prompt language is mostly not recoverable, and that boundary is useful

The language experiment is an exploratory extension and a boundary test. Once the generating model is fixed, can a classifier infer the language of the prompt from the generated image?

The authors translate 1,000 MJHQ-30K prompts into Spanish, Turkish, Japanese, and Simplified Chinese, alongside English. For each model, they train a classifier to predict prompt language from generated images using 700 images per language for training and 300 for evaluation. Five-way chance is 20%.

The results divide the models into two groups.

Model	Prompt-language classification accuracy
BAGEL	21.2%
Emu	21.9%
Janus	52.9%
MMaDA	34.2%
Show-o2	54.0%
Gemini	20.1%
ChatGPT	21.9%

For BAGEL, Emu, Gemini, and ChatGPT, language recovery is basically at chance. That is a major boundary on the attribution story. Model identity may be recoverable, but prompt language usually is not.

For Janus, MMaDA, and Show-o2, the higher language-attribution accuracies are not necessarily a flattering sign of multilingual sensitivity. The paper’s qualitative examples suggest failure modes. Janus often generates unrelated scenery for Japanese and Turkish prompts, sometimes with repeated tower, mountain, or temple-like motifs. MMaDA often fails to follow Japanese and Turkish prompts and produces abstract objects. Show-o2 frequently generates images of Asian women for Japanese or Chinese prompts regardless of prompt semantics.

That means high prompt-language attribution can indicate a language-conditioned artifact or failure mode, not sophisticated cultural grounding. A model that leaks the prompt language through stereotyped or collapsed outputs is not “more multilingual.” It is waving a flag from the ditch.
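“Near chance” can be made explicit with a one-sided exact binomial test against the 20% chance rate. A minimal sketch, assuming 1,500 evaluation images per model (300 for each of five languages, matching the split described above). The test can say whether a model's accuracy is above chance; it cannot say whether that signal reflects multilingual sensitivity or a failure mode like the ones just described.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    def log_pmf(i: int) -> float:
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


# Assumption: 300 evaluation images for each of five prompt languages.
N_EVAL = 5 * 300
for model, acc in [("Gemini", 0.201), ("MMaDA", 0.342), ("Janus", 0.529)]:
    k = round(acc * N_EVAL)  # number of correct predictions implied by the accuracy
    print(f"{model}: one-sided p-value vs. 20% chance = {binomial_tail(k, N_EVAL, 0.2):.2g}")
```

Gemini's 20.1% gives a p-value well above 0.05, consistent with chance, while MMaDA and Janus are far beyond it.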

For business readers, this distinction is important. Source-model attribution and prompt-context attribution are different tasks. A provenance system may tell you which model likely made an image while telling you almost nothing reliable about the language, intent, or cultural context of the prompt. That is not a bug in the paper. It is the paper doing the useful thing: drawing a boundary.

The evidence chain, in business terms

The value of this paper is not any single result. It is the sequence.

First, the scaling experiment shows that model attribution is feasible under controlled conditions and improves sharply with data. Second, the MLLM comparison shows that generic multimodal reasoning is not enough; specialized measurement still matters. Third, corruption tests show that attribution is not merely fragile low-level artifact detection. Fourth, structural tests show that the signal is distributed across depth, segmentation, and color statistics. Fifth, cross-domain tests show that the classifier is not simply memorizing semantic categories. Sixth, the language experiment shows that not every hidden property is recoverable.

That gives operators a more disciplined framework:

Paper finding	Directly shown	Cognaptus interpretation	Boundary
Model attribution scales rapidly with training data	A ConvNeXt classifier reaches 93.9% at 3K images/model and 99.9% at 25K in the tested five-model setting	Controlled candidate-set attribution can become an operational monitoring layer	Requires labeled examples and known candidates
MLLMs plateau around 50% with few-shot examples	Prompted frontier MLLMs underperform a specialized classifier	Attribution should be trained and evaluated as a task, not improvised through chat	Future MLLMs or better prompting may improve, but the paper does not show that
Corruptions have modest effect	Accuracy remains high under color jitter, noise, blur, and resizing	The signature is not just a trivial pixel artifact	Real-world transformations may be harsher and asymmetric
Structural transformations retain signal	Depth, segmentation, and shuffled color distributions still classify above chance	Model signatures are distributed across multiple visual feature families	These probes do not identify the complete causal mechanism
Cross-domain generalization persists	Domain-trained classifiers often transfer above chance across semantic domains	Model identity is not reducible to subject matter	Domain choice still affects accuracy
Prompt-language recovery is mostly near chance	Four of seven models are near 20%; three show artifacts or failures	Provenance is easier than prompt reconstruction	Does not support claims about user intent or exact prompt recovery

This is why an evidence-first reading is necessary. A normal paper summary would say “model attribution works.” That is true, but thin. The useful conclusion is more conditional: model attribution works because the generator leaves distributed visual regularities that survive several attempts to explain them away, but the recoverable signal is much stronger for model identity than for finer-grained prompt context.

That sentence is less fun for a pitch deck. It is also more likely to survive contact with reality.

What this means for provenance programs

For businesses, the paper points to a practical middle ground between two bad positions.

The first bad position is provenance fatalism: if metadata is stripped and watermarks are absent, source tracing is hopeless. This paper argues against that. Images may contain recoverable production traces even when explicit declarations disappear.

The second bad position is forensic overconfidence: if a classifier can hit 99.9% in a controlled benchmark, then every synthetic image can be confidently traced to its origin. The paper does not support that. The candidate set is controlled, training data is available, and the experiments are designed to isolate signals rather than simulate every hostile distribution shift on the open internet.

A serious provenance program should treat model attribution as one layer in a stack:

Declared provenance: metadata, content credentials, platform labels, internal asset lineage.
Cryptographic or watermark signals: when available and trustworthy.
Model-attribution classifiers: trained on candidate generators relevant to the organization’s exposure.
Human and policy review: especially for high-stakes claims.
Incident logging and feedback: so the attribution system improves as model versions and content patterns change.
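A minimal sketch of how the layers might be combined, with hypothetical names and a hypothetical confidence threshold. Declared and watermark signals take precedence, the classifier fills the gap when they are missing, and disagreement between a claimed source and a confident classifier goes to human review rather than being silently resolved. Incident logging, the fifth layer, would record every returned finding; it is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    declared_source: Optional[str] = None    # metadata, content credentials, asset lineage
    watermark_source: Optional[str] = None   # verified watermark, when available
    classifier_label: Optional[str] = None   # top label from an attribution classifier
    classifier_confidence: float = 0.0
    high_stakes: bool = False


def route(e: Evidence, min_confidence: float = 0.9) -> str:
    """Combine the provenance layers in order and say which one the finding rests on."""
    claimed = e.declared_source or e.watermark_source
    confident = e.classifier_label is not None and e.classifier_confidence >= min_confidence
    if claimed and confident and e.classifier_label != claimed:
        # Declared signals can be stripped or spoofed; disagreement goes to a person.
        return f"conflict: claimed {claimed}, classifier suggests {e.classifier_label} -> human review"
    if claimed:
        finding = f"claimed: {claimed}"
    elif confident:
        finding = f"likely {e.classifier_label} among candidate set"
    else:
        finding = "unattributed"
    return finding + (" -> human review" if e.high_stakes else "")


print(route(Evidence(classifier_label="Show-o2", classifier_confidence=0.95)))
```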

The paper’s results mostly strengthen the third layer. That layer is valuable precisely because the first two are often missing, stripped, spoofed, or never implemented because someone had a quarterly roadmap to beautify.

Vendor auditing becomes more interesting than public takedown theater

One of the more commercially relevant uses is not public misinformation detection. It is vendor and pipeline auditing.

Organizations increasingly use multiple image-generation systems across marketing, product, design, customer support, training data generation, and internal documentation. Even when the official policy says one model is approved, images may circulate from shadow tools, contractor workflows, agency pipelines, or model versions that were quietly swapped because someone liked the lighting better.

A trained attribution model could help answer narrower, practical questions:

Are creative assets in this campaign consistent with the approved generator?
Did a vendor use the model specified in the contract?
Are synthetic images in a dataset drawn from the claimed source models?
Did a model upgrade change visual distribution enough to affect downstream moderation or brand review?
Are internal teams mixing closed-source and open-source generators in ways that create compliance exposure?

These are controlled candidate-set problems. They are exactly where the paper’s assumptions look most operationally plausible. You know the likely models. You can collect training examples. You can test on the relevant content domain. You can define thresholds for triage rather than courtroom proof.
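For the campaign-consistency question, the core metric can be as simple as the share of assets attributed to a non-approved generator. A sketch with a hypothetical function name and threshold; `None` stands for an abstention from the classifier, and abstentions are excluded from the share rather than counted as violations.

```python
from collections import Counter
from typing import Optional


def audit_batch(predicted: list[Optional[str]], approved: str,
                max_off_policy_share: float = 0.05) -> tuple[float, bool]:
    """Share of attributed assets that point to a generator other than the
    approved one, and whether that share exceeds a triage threshold.

    `None` marks an abstention; abstentions are excluded from the share.
    """
    counts = Counter(label for label in predicted if label is not None)
    attributed = sum(counts.values())
    if attributed == 0:
        return 0.0, False
    share = (attributed - counts[approved]) / attributed
    return share, share > max_off_policy_share


# Hypothetical campaign: 90 assets attributed to the approved model,
# 8 to another candidate, 2 abstentions.
batch = ["BAGEL"] * 90 + ["Show-o2"] * 8 + [None] * 2
share, flag = audit_batch(batch, approved="BAGEL")
print(f"off-policy share = {share:.1%}, flag for review: {flag}")
```

Eight of 98 attributed assets point elsewhere (about 8.2%), which crosses the 5% triage line. The output is a reason to look, not proof of a contract breach.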

The ROI is not “we can identify every fake image.” The ROI is cheaper diagnosis of provenance drift.

The signature will not sit still

The paper’s limitation section is worth taking seriously because it affects deployment, not just academic neatness. The study examines seven unified models at one point in time. Model architectures, training procedures, safety filters, decoding settings, and post-processing layers change quickly. A signature learned today may weaken, shift, or become misleading after a model update.
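Version tracking can be made operational with a recurring probe: generate fresh images from each candidate after an update and check whether the existing classifier still recognizes them. A minimal sketch; the function names, baseline recall, tolerance, and probe counts are hypothetical.

```python
def recall_on_probe(predictions: list[str], true_model: str) -> float:
    """Fraction of fresh, known-source probe images attributed to their true model."""
    if not predictions:
        raise ValueError("probe set is empty")
    return sum(p == true_model for p in predictions) / len(predictions)


def needs_retraining(baseline_recall: float, predictions: list[str],
                     true_model: str, tolerance: float = 0.05) -> bool:
    """Flag retraining when recall drops more than `tolerance` below baseline."""
    return recall_on_probe(predictions, true_model) < baseline_recall - tolerance


# Hypothetical check after a vendor update: 100 new images from "Emu",
# of which the existing classifier now labels only 80 correctly.
probe = ["Emu"] * 80 + ["BAGEL"] * 20
print(needs_retraining(baseline_recall=0.95, predictions=probe, true_model="Emu"))  # True
```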

The corruption and perturbation experiments are also tied to MJHQ-30K. The out-of-distribution experiment covers ten broad semantic domains, which is useful but not exhaustive. The language study covers five languages, not the full range of linguistic families, dialects, scripts, translation artifacts, and culturally specific prompt styles.

There is another boundary not to miss: the classifiers are trained with images from known candidate models. Open-world attribution is harder. If an image comes from a model outside the candidate set, a closed-set classifier will still choose one of the known labels unless designed to abstain. That is not intelligence. That is how forced-choice classifiers earn their reputation for overconfidence.

Any business deployment should therefore include abstention thresholds, model-version tracking, recurring retraining, post-processing stress tests, and careful separation between “likely generated by X among our candidate set” and “definitively generated by X in the world.” The first can be useful. The second is a lawsuit wearing a lab coat.
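A minimal abstention rule on top of a closed-set classifier: answer only when the top probability is high and clearly ahead of the runner-up. The thresholds are hypothetical and would need calibration on held-out data that includes images from generators outside the candidate set. Maximum-softmax confidence is also an imperfect out-of-distribution signal, so this reduces forced-choice errors without eliminating them.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_or_abstain(logits: list[float], labels: list[str],
                         min_prob: float = 0.8, min_margin: float = 0.3):
    """Return the top candidate only if it is confident and clearly ahead of
    the runner-up; otherwise return None (abstain and route to review)."""
    ranked = sorted(zip(softmax(logits), labels), reverse=True)
    (p1, top), (p2, _) = ranked[0], ranked[1]
    if p1 >= min_prob and p1 - p2 >= min_margin:
        return top
    return None


labels = ["BAGEL", "Emu3.5", "Janus-Pro-7B"]
print(attribute_or_abstain([5.0, 0.0, 0.0], labels))   # BAGEL
print(attribute_or_abstain([2.0, 1.9, -5.0], labels))  # None
```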

The real contribution is a map of what can and cannot be inferred

The paper’s strongest contribution is not simply that model attribution works. Prior work has shown attribution signals in GANs, diffusion models, datasets, and text. The sharper contribution is that unified multimodal image generators also appear to leave separable visual signatures, and those signatures survive enough diagnostic pressure to be taken seriously.

The second contribution is the boundary. Model identity is recoverable. Prompt language is mostly not. Semantic content contributes but does not dominate. Low-level artifacts contribute but do not explain everything. Structural information matters but is not sufficient alone.

That boundary is exactly what business AI governance often lacks. Teams either overstate what detection can do or dismiss it because it is imperfect. This paper gives a more usable position: attribution can be operationally meaningful when scoped properly.

For Cognaptus readers, the practical takeaway is straightforward. If synthetic images matter to your platform, brand, compliance process, or content supply chain, you should not rely only on declared provenance. You should also measure the visual behavior of the generators in your ecosystem. Build candidate-specific reference sets. Test across your actual content domains. Stress the system under compression, resizing, editing, and screenshot workflows. Track model versions. Require abstention. Treat attribution as evidence, not verdict.

The receipt may be in the pixels. Just do not mistake the receipt for the entire transaction history.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jasin Cekinmez, Ryo Mitsuhashi, Addison J. Wu, and Yida Yin, “Guess the Unified Model: How Much Can We Recover from Generated Images?”, arXiv:2605.25254v1, 24 May 2026, https://arxiv.org/abs/2605.25254.
