Opening — Why this matters now
The current AI market has become very good at producing fluent answers and very bad at explaining where those answers came from. This is not a minor inconvenience. It is the difference between an assistant that can be trusted in an operational workflow and an assistant that merely performs confidence with attractive typography.
For businesses, the question is no longer only: Is the model correct? That question still matters, obviously. But in regulated, contractual, safety-sensitive, or reputationally exposed workflows, a second question is becoming equally important:
Which source most likely supports this answer?
This is the problem taken up by DataDignity: Training Data Attribution for Large Language Models, a May 2026 arXiv paper by Xiaomin Li, Andrzej Banburski-Fahey, and Jaron Lanier.[^1] The paper studies what it calls pinpoint provenance: given a prompt, a model response, and a candidate corpus, rank the documents that most likely support the knowledge expressed in that response.
That wording is careful. The authors are not claiming to solve the entire metaphysics of model memory, copyright causality, or philosophical responsibility. Sensibly, they leave that for people with worse hobbies. Instead, they frame provenance as an operational audit task: produce a short ranked list of candidate documents that a human inspector can review.
That is exactly the kind of middle layer enterprise AI needs. Not perfect omniscience. Not decorative explainability. A practical audit queue.
Background — Context and prior art
Most enterprise AI deployments already rely on some form of retrieval. A user asks a question, a system retrieves documents, the model answers, and the company hopes the answer is grounded. This is the familiar RAG pattern. It is useful. It is also frequently oversold.
Retrieval answers one question:
Which documents look similar or relevant to the query?
Provenance asks a harder question:
Which document most likely supplied the answer-critical fact expressed by the model?
Those are not the same. A document can look topically similar without supporting the answer. A document can share names, style, or keywords while missing the exact fact. A generated answer may be short, paraphrased, indirect, or wrapped in noisy context. In those cases, ordinary lexical or semantic retrieval can become a very polished guessing machine.
The paper situates itself against several related traditions:
| Area | Typical question | Why it is not enough for this task |
|---|---|---|
| Lexical retrieval | Which document shares words or phrases? | Breaks under paraphrase, obfuscation, and sparse answers. |
| Dense retrieval | Which document is semantically close? | Can reward topic similarity instead of answer support. |
| Influence methods | Which training examples affected model behavior? | Often targets causal training dynamics rather than inspectable document ranking. |
| Activation methods | What does the model’s hidden state encode? | Promising, but usually not packaged as a practical provenance workflow. |
DataDignity’s core move is to make provenance measurable under conditions where easy shortcuts are deliberately weakened. That is the important design choice. If a benchmark allows the source document, question, and answer to share rare phrases, ordinary retrieval may appear impressive. The business version of this mistake is even more common: a demo works beautifully because the test question is clean, the source document is obvious, and nobody asks what happens when a user asks the same thing badly.
In real workflows, users ask things badly. They paraphrase. They paste irrelevant text. They ask indirectly. They role-play. They use internal nicknames. They add noise because the copied email thread already contained noise. Enterprise AI systems that only work under clean prompting are not robust systems. They are polite prototypes.
Analysis — What the paper does
The paper contributes both a benchmark and two attribution methods. The benchmark is called FakeWiki. The methods are ScoringModel and SteerFuse.
FakeWiki: a controlled source-tracing environment
FakeWiki contains 3,537 fabricated Wikipedia-style articles about non-real entities and concepts. The fabricated design is not cosmetic. It solves a serious evaluation problem: if the documents were about real entities, a model might already know the facts from pretraining. Then attribution would be contaminated by pre-existing knowledge.
Instead, the authors inject the fabricated corpus into each target model through continued pretraining. Then they ask whether attribution methods can recover which injected training document supports a later response.
The benchmark has five important components:
| Benchmark component | What it contains | What it tests |
|---|---|---|
| Fabricated articles | 3,537 Wikipedia-style documents about fictional entities, places, artifacts, events, organizations, and technical concepts | Whether provenance can be evaluated with known ground truth rather than internet ambiguity. |
| QA probes | Five short question-answer probes per document | Whether attribution works when the response contains only sparse evidence. |
| Source variants | Paraphrases and retro-generated variants | Whether methods can recover source support despite changed wording and context. |
| Anti-documents | Topically similar documents with answer-critical facts removed or altered | Whether methods distinguish true support from mere resemblance. |
| Transformed queries | Clean, Obfuscate, RolePlay, NoiseInjection, and Indirect | Whether provenance survives prompt transformations. |
The anti-document idea is especially valuable. An anti-document keeps much of the same surface texture but removes the fact needed to answer the question. This is cruel in the correct way. It prevents a provenance method from winning by saying, “This document looks close enough.”
In business terms, anti-documents resemble near-miss evidence: the policy page that mentions the product but not the exception; the contract clause that names the client but not the liability trigger; the medical note that discusses the condition but not the dosage. Similarity is not support. This sentence should be printed above half the dashboards currently labeled “AI Governance.”
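To make these components concrete, one FakeWiki-style record might look like the sketch below. The field names and contents are illustrative assumptions for exposition, not the paper's actual data schema.

```python
# Illustrative sketch of one FakeWiki-style benchmark record.
# Field names and contents are assumptions, not the paper's schema.
record = {
    "doc_id": "fw-0421",
    "article": "The Veltrassian Concordat was a trade pact signed in Quillmarch ...",
    "qa_probes": [  # five sparse question-answer probes per document
        {"question": "Which city hosted the Concordat signing?",
         "answer": "Quillmarch"},
        # four more probes per document
    ],
    "source_variants": [  # paraphrased / retro-generated rewrites
        "A trade pact known as the Veltrassian Concordat ...",
    ],
    "anti_document": (  # same surface texture, answer-critical fact removed
        "The Veltrassian Concordat was a trade pact; the location of its "
        "signing remains undocumented ..."
    ),
    "query_conditions": ["Clean", "Obfuscate", "RolePlay",
                         "NoiseInjection", "Indirect"],
}
```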
The task: rank supporting documents
The paper formalizes the task as a ranking problem. Given a question, a transformed query, a model response, and a candidate corpus, the attribution system scores candidate documents. Evaluation uses Recall@10: did at least one valid supporting source appear in the top 10?
For an audit workflow, this metric is reasonable. A human reviewer can inspect a short candidate list. The point is not to make the model declare divine certainty. The point is to shrink the haystack.
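Recall@10 is easy to state precisely in code. A minimal sketch, assuming each query carries a set of valid supporting document IDs:

```python
def recall_at_k(ranked_doc_ids, valid_source_ids, k=10):
    """1.0 if any valid supporting source appears in the top-k, else 0.0."""
    return float(any(d in valid_source_ids for d in ranked_doc_ids[:k]))

def mean_recall_at_k(examples, k=10):
    """Average Recall@k over (ranking, valid_sources) pairs."""
    hits = [recall_at_k(ranking, valid, k) for ranking, valid in examples]
    return sum(hits) / len(hits)

# Example: one valid source at rank 7 still counts as a hit at k=10.
ranking = [f"doc-{i}" for i in range(100)]
print(recall_at_k(ranking, valid_source_ids={"doc-6"}, k=10))  # 1.0
```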
ScoringModel: supervised provenance scoring
ScoringModel is a supervised pairwise ranker. It maps response-side features and document-side features into a shared embedding space, then scores compatibility using temperature-scaled cosine similarity.
In simplified form:
$$ s(r,d) = \frac{\cos(z_r, z_d)}{\tau} $$
where $z_r$ is the projected response representation, $z_d$ is the projected document representation, and $\tau$ is a temperature parameter.
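In code, the scoring rule is a projection into a shared space followed by temperature-scaled cosine similarity. This is a minimal PyTorch sketch of the published formula; the projection architecture and feature dimensions are assumptions, since the paper's exact layers are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairScorer(nn.Module):
    """Score response-document compatibility: s(r, d) = cos(z_r, z_d) / tau."""

    def __init__(self, feat_dim: int, embed_dim: int = 256, tau: float = 0.05):
        super().__init__()
        # Separate projection heads for response-side and document-side features.
        self.proj_r = nn.Linear(feat_dim, embed_dim)
        self.proj_d = nn.Linear(feat_dim, embed_dim)
        self.tau = tau

    def forward(self, resp_feats: torch.Tensor, doc_feats: torch.Tensor):
        z_r = F.normalize(self.proj_r(resp_feats), dim=-1)  # unit-norm response embeddings
        z_d = F.normalize(self.proj_d(doc_feats), dim=-1)   # unit-norm document embeddings
        # Cosine similarity of unit vectors is a dot product; scale by 1/tau.
        return z_r @ z_d.T / self.tau  # [n_responses, n_docs] score matrix
```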
The training objective is contrastive. For each positive response-document pair, the model sees several kinds of negatives:
| Negative type | Business translation |
|---|---|
| In-batch negatives | Random irrelevant documents. Cheap contrast. |
| Retrieval-mined hard negatives | Documents that look semantically close. Harder contrast. |
| Curated anti-documents | Documents that look highly plausible but do not support the answer. The useful villains. |
The loss is based on InfoNCE:
$$ \mathcal{L} = -\log \frac{\exp(s(r,d^+))}{\exp(s(r,d^+)) + \sum_{d^-}\exp(s(r,d^-))} $$
The important operational detail is that ScoringModel is not an $N$-way classifier over fixed document labels. It learns a compatibility function between responses and documents. That matters because real enterprise corpora change. New documents arrive. Old documents are revised. A practical provenance scorer needs to rank candidates, not memorize a closed list of document IDs.
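The InfoNCE objective follows directly from the score matrix. A minimal sketch, assuming a scorer like the one above, with each row's positive document on the diagonal and every other column acting as a negative (hard negatives and anti-documents would be appended as extra columns):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(scores: torch.Tensor) -> torch.Tensor:
    """Contrastive loss over a [batch, batch + n_hard] score matrix.

    Row i's positive document sits at column i; every other column is a
    negative (in-batch random, retrieval-mined hard, or anti-document).
    """
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy at the positive column equals -log softmax(s(r, d+)),
    # which is exactly the InfoNCE loss written above.
    return F.cross_entropy(scores, targets)

# Usage sketch: scores = scorer(resp_feats, doc_feats); loss = info_nce_loss(scores)
```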
SteerFuse: activation evidence plus retrieval
The second method, SteerFuse, is training-free. It asks whether a candidate document provides model-internal evidence toward the observed response.
The intuition is simple enough: if reading a document induces an internal activation direction, and the observed response has a corresponding answer-side representation, then their alignment may indicate provenance support. Exact activation patching would be expensive, so the authors approximate the signal with cached document directions and a response-side proxy based on generated answer tokens.
SteerFuse then fuses this activation signal with SBERT retrieval. This is not a replacement for text retrieval. The paper is clear that activation evidence is noisy and works best when stabilized by retrieval fusion. That restraint is refreshing. The AI industry occasionally remembers that “complementary signal” is a phrase; it should use it more often.
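The fusion step can be sketched as a weighted combination of a normalized activation-alignment score and an SBERT retrieval score. The weighting and normalization below are assumptions for illustration; the paper's exact fusion rule is not reproduced here.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Normalize each signal so the two score scales are comparable."""
    return (x - x.mean()) / (x.std() + 1e-8)

def steerfuse_scores(activation_align: np.ndarray,
                     sbert_sim: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Fuse noisy activation evidence with retrieval similarity.

    activation_align[i]: cosine alignment between document i's cached
    activation direction and the response-side proxy representation.
    sbert_sim[i]: SBERT similarity between the response and document i.
    alpha: weight on the activation signal (illustrative value only).
    """
    return alpha * zscore(activation_align) + (1 - alpha) * zscore(sbert_sim)

# Ranking: candidate documents sorted by fused score, highest first.
# order = np.argsort(-steerfuse_scores(act, sim))
```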
Findings — Results with visualization
The headline result is straightforward: ordinary retrieval is useful but brittle; ScoringModel is substantially stronger; SteerFuse helps, but less consistently.
The paper evaluates nine open-weight instruction-tuned LLMs, five query conditions, eleven retrieval baselines, SteerFuse, and ScoringModel. The main aggregate result is Recall@10 averaged across the nine target models.
| Query condition | Best retrieval baseline | SteerFuse | ScoringModel | ScoringModel gain vs. baseline |
|---|---|---|---|---|
| Clean | 55.7 | 69.2 | 77.2 | +21.5 |
| Obfuscate | 39.1 | 30.5 | 44.4 | +5.3 |
| RolePlay | 42.9 | 50.1 | 62.5 | +19.6 |
| NoiseInjection | 36.7 | 47.0 | 59.2 | +22.5 |
| Indirect | 12.0 | 14.5 | 17.7 | +5.7 |
| Average | 37.3 | 42.3 | 52.2 | +14.9 |
A compact way to read the result:
```
Average Recall@10

Best baseline | █████████████████████████████████████ 37.3
SteerFuse     | ██████████████████████████████████████████ 42.3
ScoringModel  | ████████████████████████████████████████████████████ 52.2
```
The paper also reports that ScoringModel beats the strongest retrieval baseline in 41 out of 45 model-by-condition cells and beats SteerFuse in 40 out of 45. SteerFuse beats the best baseline in 32 out of 45 cells. That pattern matters: ScoringModel is not merely winning through one friendly condition.
The clean-prompt trap
The clean condition is where many demos live. It is also where risk hides. On clean prompts, the best baseline achieves 55.7 Recall@10. That sounds decent. Add ScoringModel and performance rises to 77.2.
But the more interesting result appears when the query is transformed. RolePlay and NoiseInjection expose how brittle retrieval can be when surface cues remain plausible but become less reliable. ScoringModel gains +19.6 under RolePlay and +22.5 under NoiseInjection.
Indirect prompting remains hard for everyone: ScoringModel reaches only 17.7 Recall@10. That should not be brushed aside. It tells us that provenance is not solved. It also tells us which class of workflows needs more cautious human review.
Larger models reveal more recoverable provenance signal
One of the paper’s more interesting findings is that ScoringModel’s gains on transformed prompts are largest for larger target models.
| Target model | Best baseline | SteerFuse | ScoringModel | Gain vs. baseline |
|---|---|---|---|---|
| Llama-3.1-8B | 24.2 | 32.2 | 51.1 | +26.9 |
| Qwen3-8B | 30.3 | 40.4 | 50.4 | +20.0 |
| Llama-2-7B | 23.0 | 37.3 | 42.0 | +19.0 |
| Mistral-7B | 24.6 | 40.2 | 41.5 | +17.0 |
| Qwen2-1.5B | 32.1 | 27.6 | 43.0 | +10.9 |
| Llama-3.2-3B | 39.9 | 34.6 | 49.4 | +9.5 |
| Qwen2.5-7B | 37.9 | 29.4 | 46.6 | +8.8 |
| Llama-3.2-1B | 44.4 | 40.0 | 49.2 | +4.8 |
| TinyLlama-1.1B | 37.9 | 38.2 | 40.3 | +2.4 |
The paper’s interpretation is that larger LLM hidden states may encode more recoverable provenance information. That is a direct empirical observation within this benchmark, not a universal law of model behavior.
My business interpretation: as enterprises adopt larger or more capable models, provenance instrumentation should not be treated as a bolt-on search feature. The model’s internal representations may become part of the audit surface. That does not mean every company should start poking hidden states tomorrow morning. It means governance architecture should not assume that keyword search plus embeddings will remain the ceiling of attribution.
What the paper directly shows vs. what business readers should infer
| Claim | Status | Practical reading |
|---|---|---|
| ScoringModel improves average Recall@10 from 37.3 to 52.2 over the strongest retrieval baseline. | Direct paper result. | Supervised provenance scoring can outperform generic retrieval under controlled attribution tests. |
| Anti-documents help test whether methods distinguish support from similarity. | Direct benchmark design. | Enterprise evaluations should include near-miss documents, not only obviously relevant or irrelevant files. |
| Activation-space evidence can complement retrieval. | Direct result, but with caveats. | Internal model signals may be useful, yet current activation-only evidence is not stable enough as a standalone audit mechanism. |
| Larger models may expose more recoverable provenance signal. | Direct result within FakeWiki. | Larger enterprise models may justify richer provenance instrumentation, but this needs domain-specific validation. |
| This solves copyright causality or legal responsibility. | Not shown. | Do not use provenance rankings as legal proof. Use them as audit evidence. The distinction is not optional. |
Implications — What changes in practice
The paper’s value for business is not that every company should reproduce FakeWiki. Most will not. Nor should they pretend that document-level Recall@10 is enough for production governance.
The useful lesson is architectural: AI systems need a provenance layer that is tested against shortcut failure.
1. Similarity search is not provenance
Many organizations currently treat retrieval logs as if they were explanation logs. This is convenient. It is also sloppy.
A retrieval log says which documents were fetched. A provenance system should estimate which documents actually support the generated answer. In a RAG system, those can diverge. In a fine-tuned or continually trained model, the divergence becomes even more serious because the model may express knowledge absorbed during training rather than retrieved at inference time.
The minimum enterprise upgrade is to separate these concepts:
| Layer | Question answered | Typical evidence |
|---|---|---|
| Retrieval | What did the system fetch? | Search scores, retrieved chunks, query-document similarity. |
| Grounding | What did the answer cite or use at inference time? | Quoted spans, generated citations, context window traces. |
| Provenance | What source likely supports the model’s expressed knowledge? | Ranked source candidates, hard-negative tested scores, human-auditable evidence. |
| Causality | Did this source cause the model behavior? | Much harder; may require training logs, influence methods, interventions, and legal interpretation. |
Pretending these are the same thing is how companies create dashboards that look sophisticated while answering the wrong question.
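One way to enforce the separation is at the schema level: log each layer as a distinct field, so a dashboard cannot quietly conflate them. A hypothetical audit-record sketch, with field names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerAuditRecord:
    """Hypothetical audit-log schema keeping the four layers distinct."""
    answer_id: str
    # Retrieval layer: what the system fetched, with similarity scores.
    retrieved: dict[str, float] = field(default_factory=dict)
    # Grounding layer: what the answer actually cited at inference time.
    cited_spans: list[str] = field(default_factory=list)
    # Provenance layer: ranked candidates from a dedicated scorer,
    # tested against hard negatives, queued for human review.
    provenance_candidates: list[tuple[str, float]] = field(default_factory=list)
    # Causality layer: deliberately absent. Anything stored here would
    # require training logs, influence methods, or interventions.
```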
2. Evaluation should include near-misses
The most business-relevant benchmark design choice is the anti-document. In operational settings, failures often come from plausible near-misses, not absurd distractors.
For example:
| Workflow | Plausible near-miss |
|---|---|
| HR policy assistant | Old policy version with similar wording but different eligibility rules. |
| Legal intake assistant | Similar contract template missing a jurisdiction-specific clause. |
| Insurance claims assistant | Related exclusion clause that does not apply to the claimant’s product line. |
| Medical admin assistant | Patient note mentioning the condition but not the approved dosage. |
| Finance research assistant | A report about the same issuer but from a different quarter. |
A proper AI evaluation set should include these near-misses. Otherwise the system can win by being approximately relevant. In business, approximately relevant is often just wrong with better lighting.
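A near-miss evaluation can be mechanically simple. A minimal sketch, assuming each test case pairs a true source with a curated near-miss and a scorer that rates answer support:

```python
def near_miss_pass_rate(cases, score_fn):
    """Fraction of cases where the true source outranks its near-miss.

    Each case: (response_text, true_source_doc, near_miss_doc).
    score_fn(response, document) -> higher means more answer support.
    """
    wins = 0
    for response, true_doc, near_miss in cases:
        if score_fn(response, true_doc) > score_fn(response, near_miss):
            wins += 1
    return wins / len(cases)

# A scorer that measures only topical similarity will fail this test:
# the near-miss shares the topic but lacks the answer-critical fact.
```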
3. Provenance should become part of ROI analysis
ROI discussions about AI automation usually focus on labor saved: fewer hours spent drafting, searching, summarizing, or routing information. That is only half the equation.
The other half is error cost. A faster answer that cannot be audited may increase downstream review cost, compliance risk, and customer escalation. Provenance systems reduce these costs by narrowing review scope.
A simple operational ROI frame:
| Variable | Meaning |
|---|---|
| $T_0$ | Time required for manual source search. |
| $T_a$ | Time required to inspect an AI-ranked candidate list. |
| $C_e$ | Expected cost of an unsupported answer escaping review. |
| $p_e$ | Probability of unsupported answer escaping review. |
| $C_s$ | Cost of building and maintaining provenance infrastructure. |
A rough decision rule is:
$$ \text{Net value} \approx (T_0 - T_a) \times \text{review volume} + \Delta(p_e C_e) - C_s $$
This is not from the paper; it is a business extrapolation. The paper supplies evidence that better provenance ranking is possible under controlled conditions. The business question is whether the reduction in review time and risk justifies implementation cost in a specific workflow.
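Plugging illustrative numbers into the decision rule makes the trade-off tangible. Every figure below is an assumption for exposition, not data from the paper:

```python
# Illustrative ROI estimate; all input values are assumed, not from the paper.
t0_hours = 0.5          # T_0: manual source search per answer
ta_hours = 0.1          # T_a: inspecting an AI-ranked candidate list
review_volume = 10_000  # audited answers per year
hourly_cost = 60        # reviewer cost per hour

c_e = 5_000             # C_e: expected cost of one unsupported answer escaping review
p_e_before, p_e_after = 0.01, 0.004  # p_e without / with provenance ranking
c_s = 120_000           # C_s: annual cost of provenance infrastructure

time_savings = (t0_hours - ta_hours) * review_volume * hourly_cost
risk_reduction = (p_e_before - p_e_after) * c_e * review_volume  # delta(p_e * C_e)
net_value = time_savings + risk_reduction - c_s
print(f"Net value: ${net_value:,.0f}")  # $420,000 under these assumptions
```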
4. Document-level provenance is useful, but not the finish line
The authors are explicit about limitations. The benchmark uses controlled prompt transformations, not a full taxonomy of real user behavior. The main metric is Recall@10, which supports audit-list evaluation but not calibrated confidence. The attribution is document-level, not sentence-level or span-level.
For enterprise use, the next practical step is a layered evidence interface:
| Stage | Current paper’s focus | Business extension |
|---|---|---|
| Document ranking | Return likely supporting documents. | Attach document version, owner, permissions, and policy status. |
| Evidence localization | Not solved in the main benchmark. | Highlight specific paragraphs or spans for reviewer confirmation. |
| Confidence calibration | Identified as future work. | Route low-confidence cases to humans automatically. |
| Audit logging | Not the central experimental focus. | Store provenance decisions, reviewer overrides, and final disposition. |
| Workflow integration | Outside the benchmark. | Connect provenance output to compliance, QA, legal, or customer support queues. |
This is where business systems become different from academic benchmarks. The model ranking is only the start. The workflow around the ranking determines whether the result creates measurable value.
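The "route low-confidence cases to humans" extension from the table above is a small amount of glue code once provenance scores exist. A hypothetical router sketch; the thresholds are illustrative and would need per-workflow calibration, since the paper identifies confidence calibration as future work:

```python
def route_answer(provenance_candidates, score_threshold=0.7, min_candidates=1):
    """Route an answer based on its provenance evidence.

    provenance_candidates: list of (doc_id, score), sorted best-first.
    Threshold values are assumptions, not calibrated quantities.
    """
    strong = [d for d, s in provenance_candidates if s >= score_threshold]
    if len(strong) >= min_candidates:
        return "auto_approve_with_evidence", strong[:10]
    if provenance_candidates:
        return "human_review", [d for d, _ in provenance_candidates[:10]]
    return "block_and_escalate", []
```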
5. The governance lesson: design for adversarial normality
The transformed prompts in the paper are “jailbreak-inspired,” but the broader lesson is not only about malicious users. Many enterprise prompts are accidentally adversarial. They are messy, indirect, overloaded, copied from email threads, or framed by role instructions.
DataDignity’s transformed query conditions are useful because they model this broader reality:
| Query condition | Enterprise analogue |
|---|---|
| Clean | A well-formed internal question. Rare, but pleasant. |
| Obfuscate | Internal shorthand, coded terms, or terminology drift. |
| RolePlay | Persona-based prompting or task framing. |
| NoiseInjection | Copied email threads, irrelevant ticket history, long chat context. |
| Indirect | Ambiguous executive requests, multi-hop questions, or soft phrasing. |
The practical message is simple: if your AI system is evaluated only on clean prompts, you do not know how it behaves. You know how it behaves in a brochure.
Conclusion
DataDignity is not a final answer to training data attribution. It does not prove legal causality. It does not localize exact evidence spans. It does not solve all real-world prompting behavior. It does something more useful: it makes the provenance problem harder in the right places and shows that generic retrieval is not enough.
The paper’s strongest contribution is the separation between source resemblance and answer support. That distinction will matter more as AI systems move from chat interfaces into operational workflows where answers trigger decisions, payments, compliance actions, customer communications, and managerial judgments.
For businesses, the lesson is not “buy a provenance model.” The lesson is sharper: build AI workflows where every important answer can produce inspectable evidence, and test that evidence against near-misses, paraphrases, noisy prompts, and indirect requests.
A model that answers quickly is useful. A model that answers quickly and shows where the answer probably came from is governable. The second one is harder to build. Naturally, it is also the one worth paying for.
Cognaptus: Automate the Present, Incubate the Future.
[^1]: Xiaomin Li, Andrzej Banburski-Fahey, and Jaron Lanier, “DataDignity: Training Data Attribution for Large Language Models,” arXiv:2605.05687, 2026. https://arxiv.org/abs/2605.05687