Opening — Why this matters now
The current AI market has become very good at producing fluent answers and very bad at explaining where those answers came from. This is not a minor inconvenience. It is the difference between an assistant that can be trusted in an operational workflow and an assistant that merely performs confidence with attractive typography.
For businesses, the question is no longer only: Is the model correct? That question still matters, obviously. But in regulated, contractual, safety-sensitive, or reputationally exposed workflows, a second question is becoming equally important:
Which source most likely supports this answer?
This is the problem taken up by DataDignity: Training Data Attribution for Large Language Models, a May 2026 arXiv paper by Xiaomin Li, Andrzej Banburski-Fahey, and Jaron Lanier.[^1] The paper studies what it calls pinpoint provenance: given a prompt, a model response, and a candidate corpus, rank the documents that most likely support the knowledge expressed in that response.
That wording is careful. The authors are not claiming to solve the entire metaphysics of model memory, copyright causality, or philosophical responsibility. Sensibly, they leave that for people with worse hobbies. Instead, they frame provenance as an operational audit task: produce a short ranked list of candidate documents that a human inspector can review.
That is exactly the kind of middle layer enterprise AI needs. Not perfect omniscience. Not decorative explainability. A practical audit queue.
Background — Context and prior art
Most enterprise AI deployments already rely on some form of retrieval. A user asks a question, a system retrieves documents, the model answers, and the company hopes the answer is grounded. This is the familiar RAG pattern. It is useful. It is also frequently oversold.
Retrieval answers one question:
Which documents look similar or relevant to the query?
Provenance asks a harder question:
Which document most likely supplied the answer-critical fact expressed by the model?
Those are not the same. A document can look topically similar without supporting the answer. A document can share names, style, or keywords while missing the exact fact. A generated answer may be short, paraphrased, indirect, or wrapped in noisy context. In those cases, ordinary lexical or semantic retrieval can become a very polished guessing machine.
The paper situates itself against several related traditions:
| Area | Typical question | Why it is not enough for this task |
|---|---|---|
| Lexical retrieval | Which document shares words or phrases? | Breaks under paraphrase, obfuscation, and sparse answers. |
| Dense retrieval | Which document is semantically close? | Can reward topic similarity instead of answer support. |
| Influence methods | Which training examples affected model behavior? | Often targets causal training dynamics rather than inspectable document ranking. |
| Activation methods | What does the model’s hidden state encode? | Promising, but usually not packaged as a practical provenance workflow. |
DataDignity’s core move is to make provenance measurable under conditions where easy shortcuts are deliberately weakened. That is the important design choice. If a benchmark allows the source document, question, and answer to share rare phrases, ordinary retrieval may appear impressive. The business version of this mistake is even more common: a demo works beautifully because the test question is clean, the source document is obvious, and nobody asks what happens when a user asks the same thing badly.
In real workflows, users ask things badly. They paraphrase. They paste irrelevant text. They ask indirectly. They role-play. They use internal nicknames. They add noise because the copied email thread already contained noise. Enterprise AI systems that only work under clean prompting are not robust systems. They are polite prototypes.
Analysis — What the paper does
The paper contributes both a benchmark and two attribution methods. The benchmark is called FakeWiki. The methods are ScoringModel and SteerFuse.
FakeWiki: a controlled source-tracing environment
FakeWiki contains 3,537 fabricated Wikipedia-style articles about non-real entities and concepts. The fabricated design is not cosmetic. It solves a serious evaluation problem: if the documents were about real entities, a model might already know the facts from pretraining. Then attribution would be contaminated by pre-existing knowledge.
Instead, the authors inject the fabricated corpus into each target model through continued pretraining. Then they ask whether attribution methods can recover which injected training document supports a later response.
The benchmark has five important components:
| Benchmark component | What it contains | What it tests |
|---|---|---|
| Fabricated articles | 3,537 Wikipedia-style documents about fictional entities, places, artifacts, events, organizations, and technical concepts | Whether provenance can be evaluated with known ground truth rather than internet ambiguity. |
| QA probes | Five short question-answer probes per document | Whether attribution works when the response contains only sparse evidence. |
| Source variants | Paraphrases and retro-generated variants | Whether methods can recover source support despite changed wording and context. |
| Anti-documents | Topically similar documents with answer-critical facts removed or altered | Whether methods distinguish true support from mere resemblance. |
| Transformed queries | Clean, Obfuscate, RolePlay, NoiseInjection, and Indirect | Whether provenance survives prompt transformations. |
The anti-document idea is especially valuable. An anti-document keeps much of the same surface texture but removes the fact needed to answer the question. This is cruel in the correct way. It prevents a provenance method from winning by saying, “This document looks close enough.”
In business terms, anti-documents resemble near-miss evidence: the policy page that mentions the product but not the exception; the contract clause that names the client but not the liability trigger; the medical note that discusses the condition but not the dosage. Similarity is not support. This sentence should be printed above half the dashboards currently labeled “AI Governance.”
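To make these components concrete, one FakeWiki-style record might look like the sketch below. The field names and contents are illustrative assumptions for exposition, not the paper's actual data schema.

```python
# Illustrative sketch of one FakeWiki-style benchmark record.
# Field names and contents are assumptions, not the paper's schema.
record = {
    "doc_id": "fw-0421",
    "article": "The Veltrassian Concordat was a trade pact signed in Quillmarch ...",
    "qa_probes": [  # five sparse question-answer probes per document
        {"question": "Which city hosted the Concordat signing?",
         "answer": "Quillmarch"},
        # four more probes per document
    ],
    "source_variants": [  # paraphrased / retro-generated rewrites
        "A trade pact known as the Veltrassian Concordat ...",
    ],
    "anti_document": (  # same surface texture, answer-critical fact removed
        "The Veltrassian Concordat was a trade pact; the location of its "
        "signing remains undocumented ..."
    ),
    "query_conditions": ["Clean", "Obfuscate", "RolePlay",
                         "NoiseInjection", "Indirect"],
}
```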
The task: rank supporting documents
The paper formalizes the task as a ranking problem. Given a question, a transformed query, a model response, and a candidate corpus, the attribution system scores candidate documents. Evaluation uses Recall@10: did at least one valid supporting source appear in the top 10?
For an audit workflow, this metric is reasonable. A human reviewer can inspect a short candidate list. The point is not to make the model declare divine certainty. The point is to shrink the haystack.
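Recall@10 is easy to state precisely in code. A minimal sketch, assuming each query carries a set of valid supporting document IDs:

```python
def recall_at_k(ranked_doc_ids, valid_source_ids, k=10):
    """1.0 if any valid supporting source appears in the top-k, else 0.0."""
    return float(any(d in valid_source_ids for d in ranked_doc_ids[:k]))

def mean_recall_at_k(examples, k=10):
    """Average Recall@k over (ranking, valid_sources) pairs."""
    hits = [recall_at_k(ranking, valid, k) for ranking, valid in examples]
    return sum(hits) / len(hits)

# Example: one valid source at rank 7 still counts as a hit at k=10.
ranking = [f"doc-{i}" for i in range(100)]
print(recall_at_k(ranking, valid_source_ids={"doc-6"}, k=10))  # 1.0
```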
ScoringModel: supervised provenance scoring
ScoringModel is a supervised pairwise ranker. It maps response-side features and document-side features into a shared embedding space, then scores compatibility using temperature-scaled cosine similarity.
In simplified form:
$$ s(r,d) = \frac{\cos(z_r, z_d)}{\tau} $$
where $z_r$ is the projected response representation, $z_d$ is the projected document representation, and $\tau$ is a temperature parameter.
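In code, the scoring rule is a projection into a shared space followed by temperature-scaled cosine similarity. This is a minimal PyTorch sketch of the published formula; the projection architecture and feature dimensions are assumptions, since the paper's exact layers are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairScorer(nn.Module):
    """Score response-document compatibility: s(r, d) = cos(z_r, z_d) / tau."""

    def __init__(self, feat_dim: int, embed_dim: int = 256, tau: float = 0.05):
        super().__init__()
        # Separate projection heads for response-side and document-side features.
        self.proj_r = nn.Linear(feat_dim, embed_dim)
        self.proj_d = nn.Linear(feat_dim, embed_dim)
        self.tau = tau

    def forward(self, resp_feats: torch.Tensor, doc_feats: torch.Tensor):
        z_r = F.normalize(self.proj_r(resp_feats), dim=-1)  # unit-norm response embeddings
        z_d = F.normalize(self.proj_d(doc_feats), dim=-1)   # unit-norm document embeddings
        # Cosine similarity of unit vectors is a dot product; scale by 1/tau.
        return z_r @ z_d.T / self.tau  # [n_responses, n_docs] score matrix
```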
The training objective is contrastive. For each positive response-document pair, the model sees several kinds of negatives:
| Negative type | Business translation |
|---|---|
| In-batch negatives | Random irrelevant documents. Cheap contrast. |
| Retrieval-mined hard negatives | Documents that look semantically close. Harder contrast. |
| Curated anti-documents | Documents that look highly plausible but do not support the answer. The useful villains. |
The loss is based on InfoNCE:
$$ \mathcal{L} = -\log \frac{\exp(s(r,d^+))}{\exp(s(r,d^+)) + \sum_{d^-}\exp(s(r,d^-))} $$
The important operational detail is that ScoringModel is not an $N$-way classifier over fixed document labels. It learns a compatibility function between responses and documents. That matters because real enterprise corpora change. New documents arrive. Old documents are revised. A practical provenance scorer needs to rank candidates, not memorize a closed list of document IDs.
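The InfoNCE objective follows directly from the score matrix. A minimal sketch, assuming a scorer like the one above, with each row's positive document on the diagonal and every other column acting as a negative (hard negatives and anti-documents would be appended as extra columns):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(scores: torch.Tensor) -> torch.Tensor:
    """Contrastive loss over a [batch, batch + n_hard] score matrix.

    Row i's positive document sits at column i; every other column is a
    negative (in-batch random, retrieval-mined hard, or anti-document).
    """
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy at the positive column equals -log softmax(s(r, d+)),
    # which is exactly the InfoNCE loss written above.
    return F.cross_entropy(scores, targets)

# Usage sketch: scores = scorer(resp_feats, doc_feats); loss = info_nce_loss(scores)
```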
SteerFuse: activation evidence plus retrieval
The second method, SteerFuse, is training-free. It asks whether a candidate document provides model-internal evidence toward the observed response.
The intuition is simple enough: if reading a document induces an internal activation direction, and the observed response has a corresponding answer-side representation, then their alignment may indicate provenance support. Exact activation patching would be expensive, so the authors approximate the signal with cached document directions and a response-side proxy based on generated answer tokens.
SteerFuse then fuses this activation signal with SBERT retrieval. This is not a replacement for text retrieval. The paper is clear that activation evidence is noisy and works best when stabilized by retrieval fusion. That restraint is refreshing. The AI industry occasionally remembers that “complementary signal” is a phrase; it should use it more often.
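The fusion step can be sketched as a weighted combination of a normalized activation-alignment score and an SBERT retrieval score. The weighting and normalization below are assumptions for illustration; the paper's exact fusion rule is not reproduced here.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Normalize each signal so the two score scales are comparable."""
    return (x - x.mean()) / (x.std() + 1e-8)

def steerfuse_scores(activation_align: np.ndarray,
                     sbert_sim: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Fuse noisy activation evidence with retrieval similarity.

    activation_align[i]: cosine alignment between document i's cached
    activation direction and the response-side proxy representation.
    sbert_sim[i]: SBERT similarity between the response and document i.
    alpha: weight on the activation signal (illustrative value only).
    """
    return alpha * zscore(activation_align) + (1 - alpha) * zscore(sbert_sim)

# Ranking: candidate documents sorted by fused score, highest first.
# order = np.argsort(-steerfuse_scores(act, sim))
```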
Findings — Results with visualization
The headline result is straightforward: ordinary retrieval is useful but brittle; ScoringModel is substantially stronger; SteerFuse helps, but less consistently.
The paper evaluates nine open-weight instruction-tuned LLMs, five query conditions, eleven retrieval baselines, SteerFuse, and ScoringModel. The main aggregate result is Recall@10 averaged across the nine target models.
| Query condition | Best retrieval baseline | SteerFuse | ScoringModel | ScoringModel gain vs. baseline |
|---|---|---|---|---|
| Clean | 55.7 | 69.2 | 77.2 | +21.5 |
| Obfuscate | 39.1 | 30.5 | 44.4 | +5.3 |
| RolePlay | 42.9 | 50.1 | 62.5 | +19.6 |
| NoiseInjection | 36.7 | 47.0 | 59.2 | +22.5 |
| Indirect | 12.0 | 14.5 | 17.7 | +5.7 |
| Average | 37.3 | 42.3 | 52.2 | +14.9 |
A compact way to read the result:
```
Average Recall@10

Best baseline | █████████████████████████████████████ 37.3
SteerFuse     | ██████████████████████████████████████████ 42.3
ScoringModel  | ████████████████████████████████████████████████████ 52.2
```
The paper also reports that ScoringModel beats the strongest retrieval baseline in 41 out of 45 model-by-condition cells and beats SteerFuse in 40 out of 45. SteerFuse beats the best baseline in 32 out of 45 cells. That pattern matters: ScoringModel is not merely winning through one friendly condition.
The clean-prompt trap
The clean condition is where many demos live. It is also where risk hides. On clean prompts, the best baseline achieves 55.7 Recall@10. That sounds decent. Add ScoringModel and performance rises to 77.2.
But the more interesting result appears when the query is transformed. RolePlay and NoiseInjection expose how brittle retrieval can be when surface cues remain plausible but become less reliable. ScoringModel gains +19.6 under RolePlay and +22.5 under NoiseInjection.
Indirect prompting remains hard for everyone: ScoringModel reaches only 17.7 Recall@10. That should not be brushed aside. It tells us that provenance is not solved. It also tells us which class of workflows needs more cautious human review.
Larger models reveal more recoverable provenance signal
One of the paper’s more interesting findings is that ScoringModel’s gains on transformed prompts are largest for larger target models.
| Target model | Best baseline | SteerFuse | ScoringModel | Gain vs. baseline |
|---|---|---|---|---|
| Llama-3.1-8B | 24.2 | 32.2 | 51.1 | +26.9 |
| Qwen3-8B | 30.3 | 40.4 | 50.4 | +20.0 |
| Llama-2-7B | 23.0 | 37.3 | 42.0 | +19.0 |
| Mistral-7B | 24.6 | 40.2 | 41.5 | +17.0 |
| Qwen2-1.5B | 32.1 | 27.6 | 43.0 | +10.9 |
| Llama-3.2-3B | 39.9 | 34.6 | 49.4 | +9.5 |
| Qwen2.5-7B | 37.9 | 29.4 | 46.6 | +8.8 |
| Llama-3.2-1B | 44.4 | 40.0 | 49.2 | +4.8 |
| TinyLlama-1.1B | 37.9 | 38.2 | 40.3 | +2.4 |
The paper’s interpretation is that larger LLM hidden states may encode more recoverable provenance information. That is a direct empirical observation within this benchmark, not a universal law of model behavior.
My business interpretation: as enterprises adopt larger or more capable models, provenance instrumentation should not be treated as a bolt-on search feature. The model’s internal representations may become part of the audit surface. That does not mean every company should start poking hidden states tomorrow morning. It means governance architecture should not assume that keyword search plus embeddings will remain the ceiling of attribution.
What the paper directly shows vs. what business readers should infer
| Claim | Status | Practical reading |
|---|---|---|
| ScoringModel improves average Recall@10 from 37.3 to 52.2 over the strongest retrieval baseline. | Direct paper result. | Supervised provenance scoring can outperform generic retrieval under controlled attribution tests. |
| Anti-documents help test whether methods distinguish support from similarity. | Direct benchmark design. | Enterprise evaluations should include near-miss documents, not only obviously relevant or irrelevant files. |
| Activation-space evidence can complement retrieval. | Direct result, but with caveats. | Internal model signals may be useful, yet current activation-only evidence is not stable enough as a standalone audit mechanism. |
| Larger models may expose more recoverable provenance signal. | Direct result within FakeWiki. | Larger enterprise models may justify richer provenance instrumentation, but this needs domain-specific validation. |
| This solves copyright causality or legal responsibility. | Not shown. | Do not use provenance rankings as legal proof. Use them as audit evidence. The distinction is not optional. |
Implications — What changes in practice
The paper’s value for business is not that every company should reproduce FakeWiki. Most will not. Nor should they pretend that document-level Recall@10 is enough for production governance.
The useful lesson is architectural: AI systems need a provenance layer that is tested against shortcut failure.
1. Similarity search is not provenance
Many organizations currently treat retrieval logs as if they were explanation logs. This is convenient. It is also sloppy.
A retrieval log says which documents were fetched. A provenance system should estimate which documents actually support the generated answer. In a RAG system, those can diverge. In a fine-tuned or continually trained model, the divergence becomes even more serious because the model may express knowledge absorbed during training rather than retrieved at inference time.
The minimum enterprise upgrade is to separate these concepts:
| Layer | Question answered | Typical evidence |
|---|---|---|
| Retrieval | What did the system fetch? | Search scores, retrieved chunks, query-document similarity. |
| Grounding | What did the answer cite or use at inference time? | Quoted spans, generated citations, context window traces. |
| Provenance | What source likely supports the model’s expressed knowledge? | Ranked source candidates, hard-negative tested scores, human-auditable evidence. |
| Causality | Did this source cause the model behavior? | Much harder; may require training logs, influence methods, interventions, and legal interpretation. |
Pretending these are the same thing is how companies create dashboards that look sophisticated while answering the wrong question.
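One way to enforce the separation is at the schema level: log each layer as a distinct field, so a dashboard cannot quietly conflate them. A hypothetical audit-record sketch, with field names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerAuditRecord:
    """Hypothetical audit-log schema keeping the four layers distinct."""
    answer_id: str
    # Retrieval layer: what the system fetched, with similarity scores.
    retrieved: dict[str, float] = field(default_factory=dict)
    # Grounding layer: what the answer actually cited at inference time.
    cited_spans: list[str] = field(default_factory=list)
    # Provenance layer: ranked candidates from a dedicated scorer,
    # tested against hard negatives, queued for human review.
    provenance_candidates: list[tuple[str, float]] = field(default_factory=list)
    # Causality layer: deliberately absent. Anything stored here would
    # require training logs, influence methods, or interventions.
```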
2. Evaluation should include near-misses
The most business-relevant benchmark design choice is the anti-document. In operational settings, failures often come from plausible near-misses, not absurd distractors.
For example:
| Workflow | Plausible near-miss |
|---|---|
| HR policy assistant | Old policy version with similar wording but different eligibility rules. |
| Legal intake assistant | Similar contract template missing a jurisdiction-specific clause. |
| Insurance claims assistant | Related exclusion clause that does not apply to the claimant’s product line. |
| Medical admin assistant | Patient note mentioning the condition but not the approved dosage. |
| Finance research assistant | A report about the same issuer but from a different quarter. |
A proper AI evaluation set should include these near-misses. Otherwise the system can win by being approximately relevant. In business, approximately relevant is often just wrong with better lighting.
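A near-miss evaluation can be mechanically simple. A minimal sketch, assuming each test case pairs a true source with a curated near-miss and a scorer that rates answer support:

```python
def near_miss_pass_rate(cases, score_fn):
    """Fraction of cases where the true source outranks its near-miss.

    Each case: (response_text, true_source_doc, near_miss_doc).
    score_fn(response, document) -> higher means more answer support.
    """
    wins = 0
    for response, true_doc, near_miss in cases:
        if score_fn(response, true_doc) > score_fn(response, near_miss):
            wins += 1
    return wins / len(cases)

# A scorer that measures only topical similarity will fail this test:
# the near-miss shares the topic but lacks the answer-critical fact.
```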
3. Provenance should become part of ROI analysis
ROI discussions about AI automation usually focus on labor saved: fewer hours spent drafting, searching, summarizing, or routing information. That is only half the equation.
The other half is error cost. A faster answer that cannot be audited may increase downstream review cost, compliance risk, and customer escalation. Provenance systems reduce these costs by narrowing review scope.
A simple operational ROI frame:
| Variable | Meaning |
|---|---|
| $T_0$ | Time required for manual source search. |
| $T_a$ | Time required to inspect an AI-ranked candidate list. |
| $C_e$ | Expected cost of an unsupported answer escaping review. |
| $p_e$ | Probability of unsupported answer escaping review. |
| $C_s$ | Cost of building and maintaining provenance infrastructure. |
A rough decision rule is:
$$ \text{Net value} \approx (T_0 - T_a) \times \text{review volume} + \Delta(p_e C_e) - C_s $$
This is not from the paper; it is a business extrapolation. The paper supplies evidence that better provenance ranking is possible under controlled conditions. The business question is whether the reduction in review time and risk justifies implementation cost in a specific workflow.
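Plugging illustrative numbers into the decision rule makes the trade-off tangible. Every figure below is an assumption for exposition, not data from the paper:

```python
# Illustrative ROI estimate; all input values are assumed, not from the paper.
t0_hours = 0.5          # T_0: manual source search per answer
ta_hours = 0.1          # T_a: inspecting an AI-ranked candidate list
review_volume = 10_000  # audited answers per year
hourly_cost = 60        # reviewer cost per hour

c_e = 5_000             # C_e: expected cost of one unsupported answer escaping review
p_e_before, p_e_after = 0.01, 0.004  # p_e without / with provenance ranking
c_s = 120_000           # C_s: annual cost of provenance infrastructure

time_savings = (t0_hours - ta_hours) * review_volume * hourly_cost
risk_reduction = (p_e_before - p_e_after) * c_e * review_volume  # delta(p_e * C_e)
net_value = time_savings + risk_reduction - c_s
print(f"Net value: ${net_value:,.0f}")  # $420,000 under these assumptions
```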
4. Document-level provenance is useful, but not the finish line
The authors are explicit about limitations. The benchmark uses controlled prompt transformations, not a full taxonomy of real user behavior. The main metric is Recall@10, which supports audit-list evaluation but not calibrated confidence. The attribution is document-level, not sentence-level or span-level.
For enterprise use, the next practical step is a layered evidence interface:
| Stage | Current paper’s focus | Business extension |
|---|---|---|
| Document ranking | Return likely supporting documents. | Attach document version, owner, permissions, and policy status. |
| Evidence localization | Not solved in the main benchmark. | Highlight specific paragraphs or spans for reviewer confirmation. |
| Confidence calibration | Identified as future work. | Route low-confidence cases to humans automatically. |
| Audit logging | Not the central experimental focus. | Store provenance decisions, reviewer overrides, and final disposition. |
| Workflow integration | Outside the benchmark. | Connect provenance output to compliance, QA, legal, or customer support queues. |
This is where business systems become different from academic benchmarks. The model ranking is only the start. The workflow around the ranking determines whether the result creates measurable value.
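The "route low-confidence cases to humans" extension from the table above is a small amount of glue code once provenance scores exist. A hypothetical router sketch; the thresholds are illustrative and would need per-workflow calibration, since the paper identifies confidence calibration as future work:

```python
def route_answer(provenance_candidates, score_threshold=0.7, min_candidates=1):
    """Route an answer based on its provenance evidence.

    provenance_candidates: list of (doc_id, score), sorted best-first.
    Threshold values are assumptions, not calibrated quantities.
    """
    strong = [d for d, s in provenance_candidates if s >= score_threshold]
    if len(strong) >= min_candidates:
        return "auto_approve_with_evidence", strong[:10]
    if provenance_candidates:
        return "human_review", [d for d, _ in provenance_candidates[:10]]
    return "block_and_escalate", []
```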
5. The governance lesson: design for adversarial normality
The transformed prompts in the paper are “jailbreak-inspired,” but the broader lesson is not only about malicious users. Many enterprise prompts are accidentally adversarial. They are messy, indirect, overloaded, copied from email threads, or framed by role instructions.
DataDignity’s transformed query conditions are useful because they model this broader reality:
| Query condition | Enterprise analogue |
|---|---|
| Clean | A well-formed internal question. Rare, but pleasant. |
| Obfuscate | Internal shorthand, coded terms, or terminology drift. |
| RolePlay | Persona-based prompting or task framing. |
| NoiseInjection | Copied email threads, irrelevant ticket history, long chat context. |
| Indirect | Ambiguous executive requests, multi-hop questions, or soft phrasing. |
The practical message is simple: if your AI system is evaluated only on clean prompts, you do not know how it behaves. You know how it behaves in a brochure.
Conclusion
DataDignity is not a final answer to training data attribution. It does not prove legal causality. It does not localize exact evidence spans. It does not solve all real-world prompting behavior. It does something more useful: it makes the provenance problem harder in the right places and shows that generic retrieval is not enough.
The paper’s strongest contribution is the separation between source resemblance and answer support. That distinction will matter more as AI systems move from chat interfaces into operational workflows where answers trigger decisions, payments, compliance actions, customer communications, and managerial judgments.
For businesses, the lesson is not “buy a provenance model.” The lesson is sharper: build AI workflows where every important answer can produce inspectable evidence, and test that evidence against near-misses, paraphrases, noisy prompts, and indirect requests.
A model that answers quickly is useful. A model that answers quickly and shows where the answer probably came from is governable. The second one is harder to build. Naturally, it is also the one worth paying for.
Cognaptus: Automate the Present, Incubate the Future.
[^1]: Xiaomin Li, Andrzej Banburski-Fahey, and Jaron Lanier, “DataDignity: Training Data Attribution for Large Language Models,” arXiv:2605.05687, 2026. https://arxiv.org/abs/2605.05687