Who Owns Your Words? Copyright, LLMs, and the Quiet Arms Race Over Training Data

The new copyright question is not “did the model copy me?” but “how would I know?”

A writer uploads a chapter. A publisher uploads a manuscript. A compliance team uploads a protected document. The question is simple enough to ask in one sentence: did this material end up inside a large language model’s training data?

Unfortunately, simple questions are where machine learning likes to become expensive, probabilistic, and legally inconvenient.

The paper “Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development” proposes an open-source, user-facing system for testing whether a target LLM shows signs of having memorized copyrighted text.¹ Its main contribution is not a new theory of copyright law. It is also not a dramatic benchmark table showing one model caught red-handed while another walks away whistling. The paper is better read as a mechanism: a pipeline that transforms the vague suspicion of “my work may have been used” into a sequence of testable steps.

That distinction matters. Copyright detection for LLMs sits in a strange middle ground. It borrows from information retrieval, adversarial evaluation, plagiarism detection, model auditing, and legal evidence gathering. But it is not identical to any of them. A model selecting the original passage over paraphrases may be evidence of memorization. It is not, by itself, a signed confession from the training-data department. Sadly, the model does not come with a little receipt printer.

The useful question is therefore more practical: can such a system help creators, publishers, AI vendors, and compliance teams decide what to investigate next?

This paper’s answer is yes, but with boundaries. It offers a more accessible workflow built around passage extraction, paraphrase generation, question-answer testing, multiple-choice evaluation, statistical analysis, vector-store logging, and dataset-cleaning improvements. The business relevance lies in triage: making copyright suspicion cheaper to test, easier to document, and less dependent on specialist engineering teams.

The core mechanism: make the model choose between the original and plausible impostors

The paper builds on the logic of DE-COP, a prior framework for detecting copyrighted content in language-model training data.² DE-COP’s intuition is elegant: take an original passage, generate paraphrased alternatives, and ask the target model to choose which answer is from the source text. If the model repeatedly selects the original verbatim passage, that behavior may suggest that the model has seen and memorized the passage during training.

This is different from a plagiarism checker. A plagiarism checker compares text against a database of documents. A copyright-detection test for LLMs probes the model’s behavior. The model may not reproduce a passage spontaneously, but it may still reveal familiarity when forced to choose between the exact text and near-equivalent paraphrases.

The paper’s pipeline follows this rough structure:

User text
  -> unique passage extraction
  -> paraphrase generation
  -> question generation
  -> multiple-choice probing
  -> statistical evaluation
  -> logging and similarity search
  -> user-facing dashboard

That pipeline is the article’s real subject. Each step tries to reduce one source of noise. Generic passages are filtered out. Weak paraphrases are cleaned. Answer-position bias is reduced. Duplicate submissions are logged. The goal is not mystical AI forensics. It is a structured test bench.

The authors frame their system as an open-source platform with a web-based interface, so users can submit content for evaluation without reproducing the full technical workflow themselves. The backend combines LangGraph-based workflows, Claude 3.5 Sonnet for paraphrase and question generation, GPT-4o for evaluation, BM25 for uniqueness selection, SBERT-style semantic checks, and Pinecone for vector logging.

That stack choice is less important than the operational idea: copyright detection becomes a service workflow, not a one-off research script.

Step one: test passages that are actually worth remembering

The first mechanism is passage extraction.

Not every sentence is useful for detecting memorization. A line like “She opened the door and looked outside” is not a strong probe. It could appear in thousands of works. If a model recognizes it, that proves very little. If it fails to recognize it, that also proves very little. Generic text is evidence confetti.

The paper therefore prioritizes unique passages. It uses BM25, a classic information-retrieval scoring method, to compare passages within a document. The passages with lower similarity to other passages in the same work are treated as more distinctive and selected for evaluation.

The mechanism is straightforward:

Pipeline step	What it filters	Why it matters
Passage extraction	Generic or repetitive text	Reduces false signals from common phrasing
BM25 scoring	Passages similar to many other passages	Prioritizes distinctive text more likely to reveal memorization
Uniqueness selection	Weak probes	Improves the quality of downstream multiple-choice tests

This is a sensible design choice. Copyright disputes often revolve around protected expression, not merely general facts or common phrasing. A technical test should therefore focus on text that has enough individuality to be informative.

The boundary is equally important. A unique passage is not automatically legally protectable, and BM25 uniqueness within one submitted document is not the same as uniqueness across the entire internet. The mechanism improves probe quality; it does not settle legal originality.
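
The uniqueness idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a plain Okapi BM25 scorer written from scratch, a naive regex tokenizer, and conventional default parameters (`k1=1.5`, `b=0.75`). The function names `tokenize`, `bm25_score`, and `rank_by_uniqueness` are hypothetical. Each passage is scored against every other passage in the same document, and the passages with the lowest mean similarity are surfaced first.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphanumeric words.
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    # Okapi BM25 score of one passage (doc) against another passage (query).
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        if term not in tf:
            continue
        df = doc_freq[term]
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def rank_by_uniqueness(passages):
    # Returns (mean similarity to the other passages, passage), most distinctive first.
    if not passages:
        return []
    tokenized = [tokenize(p) for p in passages]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for t in tokenized for term in set(t))
    results = []
    for i, query in enumerate(tokenized):
        others = [
            bm25_score(query, tokenized[j], doc_freq, n, avg_len)
            for j in range(n)
            if j != i
        ]
        mean_sim = sum(others) / len(others) if others else 0.0
        results.append((mean_sim, passages[i]))
    return sorted(results)


passages = [
    "She opened the door and looked outside.",
    "She opened the window and looked outside.",
    "He opened the door and looked inside.",
    "Quantum marmalade orbits cathedral spires nightly.",
]
for mean_sim, passage in rank_by_uniqueness(passages):
    print(f"{mean_sim:.3f}  {passage}")
```

In this toy input the three door-and-window sentences share most of their vocabulary, so they score as similar to one another, while the fourth passage shares nothing and rises to the top. As the text notes, this only measures distinctiveness within the submitted document.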

Step two: build distractors that are close enough to make the test meaningful

Once the system has selected passages, it generates paraphrases. This is where the test becomes more interesting.

A weak multiple-choice test is easy to game accidentally. If the original passage is polished, literary, and specific while the paraphrases are clumsy, generic, or semantically distorted, the target model may choose the original because it is simply better written. That would be a bad signal. It would confuse literary quality with memorization. It is the copyright-detection equivalent of testing whether someone recognizes a wine label by serving one glass of wine and three glasses of dishwater.

The paper addresses this by using more structured paraphrase strategies. It mentions passive-voice conversion, question-based restructuring, and language simplification. The paraphrase layer is implemented through LangGraph’s StateGraph and uses Claude 3.5 Sonnet with a temperature setting of 0.7.

The point is not that Claude has magical paraphrase authority. The point is that the test needs plausible impostors. The paraphrases must preserve meaning while varying surface form. If they drift too far, the original becomes obvious. If they are too close, the test may collapse into shallow lexical matching.

The paper also proposes XML formatting for paraphrases, mainly as an implementation detail for structured handling by instruction-following models. This is not a core scientific result. It is plumbing. But in real systems, plumbing is where many “AI governance tools” quietly drown.
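
To make the plumbing concrete, here is a hedged sketch of how XML-wrapped paraphrases might be parsed before they reach the multiple-choice layer. The tag names (`paraphrases`, `paraphrase`) and the `strategy` attribute are assumptions for illustration; the article does not give the paper's actual schema. The sketch uses Python's standard `xml.etree.ElementTree` and discards empty or malformed output rather than passing it downstream.

```python
import xml.etree.ElementTree as ET


def parse_paraphrases(xml_text):
    # Returns a list of {"strategy", "text"} dicts; empty list if the XML is malformed.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    parsed = []
    for node in root.findall("paraphrase"):
        text = (node.text or "").strip()
        if text:  # drop empty generations instead of using them as distractors
            parsed.append({"strategy": node.get("strategy", "unknown"), "text": text})
    return parsed


# Hypothetical model output: one paraphrase came back empty.
sample = """<paraphrases>
  <paraphrase strategy="passive">The door was opened by her.</paraphrase>
  <paraphrase strategy="simplified">She opened the door.</paraphrase>
  <paraphrase strategy="question"></paraphrase>
</paraphrases>"""

for item in parse_paraphrases(sample):
    print(item["strategy"], "->", item["text"])
```

The design choice worth noting is the failure mode: a truncated or unparseable response yields no distractors at all, which is easier to detect than a half-formed answer option.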

Step three: question-answering turns memorization into a controlled probe

The next layer generates questions. The paper says its QA layer supports both “create” and “format” modes, allowing custom questions that use exact text from the input content. The output is structured in JSON for downstream consistency.

This matters because the question is the test frame. A model may behave differently depending on how the passage is introduced, how answer choices are formatted, and whether the prompt asks for a concise response. Prompt format is not cosmetic. In black-box model evaluation, prompt format is part of the instrument.

The paper’s QA component is best interpreted as an implementation detail with methodological consequences. It does not provide a large experimental comparison showing which question styles are most diagnostic. But it recognizes a real issue: if the testing procedure is inconsistent, the result becomes harder to trust.

The more stable the test format, the easier it becomes to compare runs, aggregate scores, and distinguish model behavior from evaluation noise.
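
One way to keep the test frame stable is to validate every generated question against a fixed shape before it is used. The sketch below is an assumption-heavy illustration: the field names (`mode`, `question`, `options`, `answer_index`) are hypothetical, and only the two mode names, "create" and "format", come from the paper's description.

```python
import json

# Hypothetical record shape; only the mode names are taken from the paper.
REQUIRED_FIELDS = {"mode", "question", "options", "answer_index"}
ALLOWED_MODES = {"create", "format"}


def is_valid_question(raw):
    # True only if raw is JSON with the expected fields and a usable answer index.
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return False
    mode = record["mode"]
    if not isinstance(mode, str) or mode not in ALLOWED_MODES:
        return False
    options = record["options"]
    if not isinstance(options, list) or len(options) < 2:
        return False
    if not all(isinstance(o, str) and o.strip() for o in options):
        return False
    idx = record["answer_index"]
    return isinstance(idx, int) and not isinstance(idx, bool) and 0 <= idx < len(options)


good = json.dumps({
    "mode": "create",
    "question": "Which of the following passages is verbatim from the source text?",
    "options": ["option A text", "option B text", "option C text", "option D text"],
    "answer_index": 2,
})
print(is_valid_question(good))
print(is_valid_question("{not json"))
```

Rejecting a malformed record before evaluation costs nothing; discovering it after a paid API call costs a little, and discovering it in an audit report costs more.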

Step four: multiple-choice design lowers random success and answer-position bias

The multiple-choice layer is the heart of the method.

Suppose each test contains one original passage and $k$ paraphrased distractors. If the model guesses randomly, the probability of selecting the original is:

$$ P(\text{random correct}) = \frac{1}{k+1} $$

This is why answer design matters. The paper states that expanding from three to four paraphrased options reduces the probability that the original is selected randomly. Read as a move from one original plus three paraphrased distractors ($k=3$) to one original plus four ($k=4$), the random baseline falls from $1/4$ to $1/5$, or from 25% to 20%. That is a five-percentage-point drop, and a 20% relative reduction from the previous random-correct rate.

This does not automatically prove a 20% improvement in detection accuracy. It means the chance baseline becomes stricter. That is useful, but not the same as reporting a full empirical sensitivity analysis.
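
A short sketch shows why the stricter baseline matters for interpretation. It computes the chance baseline $1/(k+1)$ and a one-sided binomial tail: the probability of seeing at least a given number of correct picks if the model were guessing. Treating each test as an independent guess is a simplifying assumption of this sketch, and the function names are illustrative rather than taken from the paper.

```python
from math import comb


def random_baseline(k):
    # Chance of picking the original among 1 original + k paraphrased distractors.
    return 1 / (k + 1)


def binomial_tail(successes, trials, p):
    # P(X >= successes) for X ~ Binomial(trials, p): one-sided chance-level check.
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )


for k in (3, 4):
    p = random_baseline(k)
    tail = binomial_tail(12, 20, p)
    print(f"k={k}: baseline={p:.2f}, P(at least 12 of 20 correct by chance)={tail:.6f}")
```

With the same observed hit count, the tail probability is smaller under the 20% baseline than under the 25% baseline, so the same behavior is harder to explain away as guessing. That is a statement about the chance model, not about detection accuracy.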

The paper also discusses answer-order bias. DE-COP used an exhaustive approach that generated all permutations of answer choices. The proposed implementation initially uses a simplified randomization strategy and then proposes a dedicated permutation function to automate all answer orderings inside LangGraph.

This is where careful reading matters. The permutation function appears partly as an enhancement proposal, not a fully demonstrated ablation with reported before-and-after metrics. Its likely purpose is methodological robustness: reducing the chance that a model picks answer A, B, C, or D for reasons unrelated to copyright memorization.

The business translation is simple: if a system is going to produce audit-like evidence, it cannot let answer-position bias masquerade as infringement risk. That would be very efficient, in the same way a smoke alarm that screams whenever someone makes toast is efficient.
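
The exhaustive-permutation idea is simple to sketch with `itertools.permutations`. This is an illustration of the general technique, not the paper's LangGraph function: `all_orderings` and `detection_rate` are hypothetical names, and `choose` stands in for a call to the target model that returns the index of the option it picked.

```python
from itertools import permutations


def all_orderings(original, distractors):
    # Yield every ordering of the options plus the index where the original landed.
    options = [original] + list(distractors)
    for order in permutations(range(len(options))):
        yield [options[i] for i in order], order.index(0)


def detection_rate(original, distractors, choose):
    # Fraction of orderings in which choose(options) picks the original.
    hits = 0
    total = 0
    for ordered, correct_idx in all_orderings(original, distractors):
        if choose(ordered) == correct_idx:
            hits += 1
        total += 1
    return hits / total


distractors = ["paraphrase one", "paraphrase two", "paraphrase three"]

# A chooser with pure position bias: it always answers with the first option.
always_first = lambda options: 0
# A chooser that genuinely recognizes the original wherever it appears.
recognizer = lambda options: options.index("original passage")

print(detection_rate("original passage", distractors, always_first))
print(detection_rate("original passage", distractors, recognizer))
```

Across all orderings, the position-biased chooser collapses to the chance baseline, while the recognizer scores perfectly. The cost is the reason the approach is worth automating: four options produce 4! = 24 orderings and five produce 5! = 120, each of which is a separate prompt to the target model.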

Step five: cleaning the evaluation data may be the most practical contribution

The paper’s most operationally useful section may be its least glamorous one: data processing.

The authors report several issues in DE-COP’s dataset, including NULL values, API output errors, inconsistent formatting, unfinished sentences, and paraphrases ranging from 20% to 250% of the original passage length. They also state that such inconsistencies increased token usage by up to 50%.

This is not a side note. It changes the economics and reliability of the system.

If paraphrases are too short, the original may stand out. If paraphrases are too long, the model may avoid them for formatting reasons. If API errors are embedded as candidate answers, the test becomes a tragic little comedy. If invalid passages are sent to paid APIs, cost rises without improving evidence.

The proposed preprocessing pipeline uses SBERT embeddings and cosine similarity to check semantic integrity, filters invalid passages, and normalizes passage lengths. The likely role of this layer is quality control, not headline detection. It makes the test less embarrassing before the model ever sees it.

Data issue	Likely effect on evaluation	Proposed correction
NULL values	Invalid or unusable answer choices	Filter before evaluation
API output errors	Model may choose based on formatting artifacts	Remove malformed generations
Paraphrase length extremes	Original may become artificially salient	Normalize passage lengths
Semantic drift	Distractors stop meaning the same thing	Use embedding similarity checks
Duplicate or repeated submissions	Redundant API cost	Vector-store logging and lookup
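
The filters in the table can be sketched as a single gate that each paraphrase must pass. Several things here are assumptions: the bag-of-words `embed` function is a deliberately crude stand-in for SBERT embeddings so that the example runs without external packages, and the thresholds (`min_ratio`, `max_ratio`, `min_sim`) are illustrative values, not figures from the paper.

```python
import math
import re
from collections import Counter


def embed(text):
    # Stand-in for an SBERT embedding (assumption): a sparse bag-of-words vector.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_paraphrase(original, paraphrase, min_ratio=0.5, max_ratio=1.5, min_sim=0.5):
    # Reject NULL or empty values, length extremes, and semantic drift, in that order.
    if not original or not paraphrase or not paraphrase.strip():
        return False
    ratio = len(paraphrase) / len(original)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return cosine(embed(original), embed(paraphrase)) >= min_sim


original = "She opened the door and looked outside at the rain."
candidates = [
    "She opened the door and looked at the rain outside.",
    None,
    "Rain.",
    "Error: rate limit exceeded, please retry your request later.",
]
for candidate in candidates:
    print(keep_paraphrase(original, candidate), repr(candidate))
```

Only the first candidate survives. The cheap checks run first, so a NULL value or a length outlier never reaches the embedding step, which is where the real cost sits once a proper sentence-embedding model replaces the stand-in.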

This is where the system’s claimed efficiency gains become more plausible. The paper reports a 10-30% reduction in API consumption through streamlined randomization and question restructuring. It also argues that better preprocessing reduces wasted evaluations.

For businesses, that matters more than it may sound. Model audits rarely fail because one clever test is impossible. They fail because running enough tests, logging them, cleaning them, and explaining them to non-technical decision-makers becomes annoying and expensive. Annoying and expensive is where governance tools go to become slideware.

Step six: vector logging turns isolated tests into an audit trail

The paper’s logging layer uses Pinecone as a serverless vector database. Submitted documents are embedded using all-MiniLM-L6-v2, producing 384-dimensional embeddings for approximate nearest-neighbor search. Metadata such as copyright ownership, timestamps, evaluation results, and content type is stored alongside the vectors.

This layer has two practical roles.

First, it avoids redundant evaluations. If a user submits highly similar content twice, the system can detect that and avoid paying for the same test again. Second, it creates an evaluation history. That history matters for creators and organizations who may need to document what was tested, when it was tested, and what the system reported.
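
Both roles can be illustrated with a small in-memory stand-in for the vector store. This sketch does not use the Pinecone client; `EvaluationLog` is a hypothetical class, the similarity threshold is an assumed value, and the three-dimensional vectors are toy inputs in place of the 384-dimensional embeddings described above.

```python
import math
from datetime import datetime, timezone


def cosine(a, b):
    # Cosine similarity between two dense vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EvaluationLog:
    # In-memory stand-in for a vector database (assumption: linear scan, no ANN index).

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.records = []

    def lookup(self, vector):
        # Return metadata of the most similar prior submission above the threshold.
        best, best_sim = None, self.threshold
        for stored, metadata in self.records:
            sim = cosine(vector, stored)
            if sim >= best_sim:
                best, best_sim = metadata, sim
        return best

    def log(self, vector, metadata):
        # Store the vector with its metadata and a UTC timestamp.
        entry = dict(metadata, logged_at=datetime.now(timezone.utc).isoformat())
        self.records.append((list(vector), entry))
        return entry


log = EvaluationLog()
log.log([1.0, 0.0, 0.0], {"owner": "Example Press", "content_type": "book", "score": 0.8})

near_duplicate = log.lookup([0.99, 0.01, 0.0])
unrelated = log.lookup([0.0, 1.0, 0.0])
print(near_duplicate["owner"] if near_duplicate else "run a new evaluation")
print(unrelated["owner"] if unrelated else "run a new evaluation")
```

A lookup before each evaluation covers the cost-saving role, and the timestamped metadata covers the history role. A production vector database replaces the linear scan with approximate nearest-neighbor search, but the control flow is the same.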

This is not the same as a legal chain of custody. The paper does not solve evidentiary standards, adversarial tampering, identity verification, or jurisdictional differences. But it moves the workflow in the right direction: from “I asked a chatbot and took a screenshot” to “we ran a structured evaluation and logged the result.”

That may sound modest. It is modest. Modest is often what usable governance looks like before someone in a policy meeting adds seven decorative dashboards and a blockchain.

What the paper’s evidence supports, and what it does not

The paper’s results section is short. It states that the proposed framework improves detection accuracy, computational efficiency, and accessibility; that preprocessing catches errors and improves reproducibility; that the multiple-choice layer reduces API consumption by 10-30%; and that the vector store improves scalability and duplicate detection.

However, the paper does not provide a detailed benchmark table, model-by-model comparison, ROC curves, AUC values, confidence intervals, or an ablation study isolating each component’s contribution. It mentions statistical tools such as ROC analysis, AUC scoring, and hypothesis testing as part of the evaluation layer, but it does not report detailed numeric outputs from those methods.
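
For readers unfamiliar with the missing metric, here is what an AUC calculation would look like if labeled results were available. The sketch uses the pairwise definition: the probability that a passage known to be in the training data receives a higher detection score than one known to be absent, with ties counted as half. The scores below are invented toy values for illustration only, not results from the paper.

```python
def auc(member_scores, non_member_scores):
    # Pairwise AUC: P(member score > non-member score), ties count as 0.5.
    if not member_scores or not non_member_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Toy detection scores (invented): fraction of orderings where the original was chosen.
member_scores = [0.9, 0.3]       # passages assumed to be in the training data
non_member_scores = [0.5, 0.1]   # passages assumed to be absent from it

print(auc(member_scores, non_member_scores))  # 0.75
```

An AUC of 0.5 means the scores do not separate the two groups at all, and 1.0 means they separate them perfectly. Reporting it requires passages whose training status is actually known, which is exactly the kind of benchmark the paper does not yet supply.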

That creates a clear interpretation boundary.

Paper element	Likely purpose	What it supports	What it does not prove
Figure 1: detection pipeline	Conceptual overview	Shows the intended user workflow from passage extraction to probability score	Does not provide empirical validation
Figure 2: DE-COP overview	Comparison with prior framework	Positions the paper as an extension of DE-COP	Does not quantify improvement by itself
Figure 3: system architecture	Implementation diagram	Shows how paraphrase, QA, evaluator, and vector DB layers interact	Does not prove scalability under production load
10-30% API reduction claim	Efficiency result	Suggests operational cost improvement from workflow changes	Does not isolate which component caused the reduction
Dataset-cleaning discussion	Quality-control argument	Identifies realistic failure modes in paraphrase-based testing	Does not provide a full cleaned-vs-uncleaned benchmark table
Future work on unlearning	Research extension	Connects detection to possible remediation	Does not show content removal in this paper

The article-level takeaway should therefore be disciplined: the paper is valuable as a system design and implementation proposal for accessible copyright-detection workflows. It should not be oversold as definitive empirical proof that a given model trained on a given copyrighted work.

The common misconception: detection is not proof of dataset inclusion

The most tempting mistake is to treat the system’s output as a legal verdict.

If a model repeatedly identifies the original passage, that may indicate memorization. Memorization may indicate exposure during training. Exposure during training may support a copyright concern. But each “may” hides a gap.

A model could select the original because it has seen the passage. It could also select it because the original is stylistically superior, more coherent, more idiomatic, or more statistically likely under the model’s learned distribution. The system tries to reduce those confounds through paraphrase quality, uniqueness selection, answer permutation, and statistical testing. Reducing confounds is not the same as eliminating them.

The better mental model is risk scoring.

A strong result from this kind of system should trigger further investigation: legal review, licensing checks, model-provider inquiry, additional probing, comparison across models, or independent replication. It should not be treated as a standalone infringement judgment.

This is not a weakness unique to this paper. It is the nature of black-box model auditing. When the training dataset is inaccessible, evidence is behavioral. Behavioral evidence can be useful. It can also be noisy, prompt-sensitive, and context-dependent.

Business relevance: cheaper triage for a copyright arms race

For businesses, the paper’s relevance depends on which side of the training-data dispute they occupy.

For publishers and creators, the system offers a possible first-pass audit. Instead of hiring a technical team to reproduce a research framework, a user-facing platform could let them submit works, receive detection scores, and build a prioritized list of suspicious cases. This is especially relevant when the portfolio is large: books, articles, training manuals, proprietary reports, course material, scripts, or documentation.

For AI vendors, the same kind of system can become an internal compliance tool. Before releasing or fine-tuning a model, a vendor could test against high-risk corpora, licensed datasets, opt-out lists, or known protected works. The point is not only to avoid lawsuits. It is to understand where memorization risk concentrates.

For enterprise buyers, copyright detection becomes part of procurement due diligence. A company adopting a model for customer-facing generation may want to know whether the vendor has a process for detecting and mitigating memorized copyrighted content. “We have a responsible AI policy” is pleasant. “We run structured memorization probes against protected corpora and log results” is more useful.

For legal and compliance teams, the system provides a vocabulary for escalation. Instead of debating abstractly whether a model “knows” a work, teams can ask narrower questions: Which passages were tested? How unique were they? How were paraphrases generated? What was the random baseline? Were answer choices permuted? Were invalid samples filtered? Were results logged?

That is the practical pathway:

Suspicion
  -> structured test
  -> risk score
  -> evidence log
  -> prioritization
  -> legal, licensing, or technical follow-up

The value is not automatic enforcement. The value is lowering the cost of deciding where enforcement or negotiation may be worth pursuing.

ROI is hidden in fewer bad tests, not just fewer API calls

The paper reports a 10-30% reduction in API consumption. That is useful, but the deeper ROI story is broader.

A copyright-detection workflow has several cost centers: selecting passages, generating paraphrases, running target-model prompts, repeating tests for robustness, storing results, reviewing suspicious outputs, and translating findings into action. API calls are visible because they come with invoices. Human review time is less visible but often more expensive.

The paper’s architecture reduces cost in several ways:

Technical contribution	Operational consequence	ROI relevance
BM25-based unique passage selection	Fewer weak probes	Less spending on low-value tests
Structured paraphrase generation	More plausible distractors	Fewer misleading results needing manual review
SBERT/cosine preprocessing	Filters malformed or semantically bad samples	Lower wasted API usage
More distractors in multiple-choice tests	Lower random-correct baseline	Potentially fewer passages needed for statistical confidence
Vector-store logging	Duplicate detection and evaluation history	Avoids repeated work and supports audit documentation
Web-based interface	Lowers technical barrier	Makes testing feasible for non-engineering users

The most important business insight is that audit systems create value by reducing ambiguity. A tool that produces a perfect answer would be wonderful. A tool that consistently separates “ignore,” “monitor,” and “escalate” can still be economically valuable.

This paper belongs in the second category.

Where the system should be used carefully

The paper’s architecture is promising, but several boundaries matter.

First, the results depend on passage quality. If the selected passages are not distinctive enough, the test becomes weak. If they are too distinctive in style or formatting, the model may identify the original for reasons other than memorization.

Second, paraphrase quality is central. The whole test assumes that distractors preserve meaning while hiding surface form. Poor paraphrases can inflate the detection signal. Overly polished paraphrases can suppress it. The paper’s preprocessing layer helps, but paraphrase evaluation remains a difficult task.

Third, prompt sensitivity remains a concern. Different target models may respond differently to answer formatting, instruction wording, refusal behavior, and safety filters. A model’s choice in a multiple-choice setting is not a transparent window into its training set.

Fourth, the legal meaning of memorization is unsettled. Even if a model has memorized a passage, legal consequences depend on jurisdiction, licensing, fair use or fair dealing analysis, output behavior, market harm, and the facts of data acquisition. The system can support legal inquiry. It does not replace it.

Fifth, the paper does not yet provide the kind of detailed empirical validation that enterprise buyers would want before relying on the tool for high-stakes decisions. A production-grade version would need benchmark datasets, sensitivity tests, model comparisons, false-positive analysis, false-negative analysis, reproducibility checks, and clear confidence reporting.

The strongest use case is therefore not “press button, sue model provider.” The strongest use case is “press button, decide whether this case deserves serious attention.”

The quiet arms race over training data will be fought with audit workflows

Copyright disputes around LLMs are often narrated as courtroom drama: authors versus AI labs, publishers versus platforms, scraping versus consent. That drama matters. But behind it sits a quieter arms race.

Creators need ways to test whether their work appears to have been absorbed. Model developers need ways to detect memorization before deployment. Enterprise buyers need ways to assess vendor risk. Regulators need workflows that are more concrete than press releases and less fragile than anecdotes.

This paper’s contribution is to make that workflow more accessible. It extends DE-COP-style probing with a user-facing architecture, uniqueness selection, diversified paraphrasing, structured QA, multiple-choice evaluation, statistical hooks, vector logging, and preprocessing that reduces wasted API usage. Its claims are more architectural than decisively empirical, and that should shape how we read it.

The paper does not solve copyright in generative AI. Nobody should expect a four-page system paper to do what courts, regulators, publishers, AI labs, and licensing markets are still fighting over. What it does offer is a practical diagnostic layer: a way to turn suspicion into structured evidence.

In the emerging economy of model accountability, that may be the real product. Not certainty. Not absolution. A better queue for deciding which questions are worth asking next.

Cognaptus: Automate the Present, Incubate the Future.

¹ David Szczecina, Senan Gaffori, and Edmond Li, “Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development,” arXiv:2511.20623, 2025, https://arxiv.org/abs/2511.20623.
² A. V. Duarte, X. Zhao, A. L. Oliveira, and L. Li, “DE-COP: Detecting Copyrighted Content in Language Models Training Data,” arXiv:2402.09910, 2024, https://arxiv.org/abs/2402.09910.
