TL;DR for operators

Memorization audits usually start with the wrong question: “Which individual text snippets look memorized?” This paper suggests a better first diagnostic: group many snippets by how closely the model reproduces them, then measure the entropy of the token distribution inside each group. [1]

The result is an empirical pattern the authors call Entropy–Memorization Linearity. In plain English: when training examples are pooled by edit-distance score, their set-level entropy forms a strong linear relationship with how closely the model reproduces them. Since the paper’s “memorization score” is an edit distance, lower score means stronger verbatim reproduction; higher score means the generated continuation is farther from the ground truth.

The useful twist is the unit of analysis. Individual compressibility metrics are noisy because one sequence contains only a tiny slice of the vocabulary. Pooling examples into score-level sets gives the entropy estimator enough token-space coverage to reveal a pattern. This is the whole trick. As usual, the trick is simple after someone else has done the work.

For business use, the paper points toward cheaper diagnostics for privacy leakage, copyright exposure, and benchmark contamination. Its dataset-inference method compares the regression intercept of a candidate dataset against a calibrated threshold from known training data. The authors report promising results and at least 2× runtime speedups over a baseline in their comparison, but the method still produces false positives on some dataset slices. Treat it as an audit signal, not divine revelation in YAML.

The boundary is important: this does not predict whether one specific sentence was memorized. It works at set level, depends on access to the model and enough candidate samples, uses edit-distance-style reproduction rather than semantic similarity, and needs calibration per model. Useful, yes. Magic, no.

The familiar audit problem: the model says the quiet part verbatim

A company trains or evaluates an LLM. Someone asks the awkward question: did the model memorize the training data, or merely learn patterns from it?

That question is no longer academic housekeeping. It touches privacy, copyright, benchmark contamination, model governance, and vendor risk. If a model can reproduce a user record, a book passage, a proprietary code block, or an evaluation item, “we trained on lots of internet data” stops sounding like an explanation and starts sounding like a discovery request.

The usual instinct is to hunt for individual smoking guns. Prompt the model with a prefix, compare its continuation with the original text, and ask whether the output is close enough to count as memorized. That is useful for extraction studies. It is less satisfying for understanding what property of the training data made memorization more likely.

The paper takes a different route. It asks whether data compressibility can quantify memorization. Not model perplexity. Not gradient influence. Not a fragile prompt trick. A property of the data itself.

That sounds intuitive. Highly regular text should be easier to compress. Perhaps highly compressible text should also be easier for a model to memorize. Unfortunately, intuition has a hobby of embarrassing itself in public.

Prior attempts using instance-level compressibility did not find a strong, reliable pattern. The authors argue that the problem was not only the metric. It was the scale at which the metric was being measured.

The old compressibility story fails because one snippet is too small

At the individual-example level, compressibility is a bad witness.

The authors reproduce the older style of analysis using two compressibility-related metrics: a zlib compression ratio and an empirical entropy estimator. Each point in that setup represents one sampled training sequence and its memorization score. The result is weak: the relationship is positive but noisy, with poor explanatory force.

The reason is structural. A single sequence touches only a tiny subset of the model’s token space. The paper notes that even much longer contexts would still cover only a small fraction of the available vocabulary. So when one estimates entropy from one short text span, the estimator is trying to infer a distribution from a keyhole view. Admirable effort. Terrible optics.
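To make the keyhole problem concrete, here is a minimal sketch of the two instance-level metrics. It assumes the zlib ratio is compressed bytes over raw bytes and that entropy is the plain plug-in estimate over token IDs; the paper's exact definitions may differ, and `VOCAB_SIZE` is a hypothetical number chosen only for scale.

```python
import math
import zlib
from collections import Counter


def zlib_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more compressible."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


def empirical_entropy(tokens: list[int]) -> float:
    """Plug-in Shannon entropy (in bits) of the token frequencies in one sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A sequence of n tokens holds at most n distinct tokens, so its plug-in
# entropy can never exceed log2(n), however large the vocabulary is.
VOCAB_SIZE = 50_000  # hypothetical vocabulary size, for scale only
SEQ_LEN = 150        # sample length mentioned later in the article

print("entropy ceiling for one sequence:", math.log2(SEQ_LEN))
print("entropy ceiling for the vocabulary:", math.log2(VOCAB_SIZE))
print("zlib ratio, repetitive text:", zlib_ratio("ab" * 500))
```

The gap between the two ceilings is the structural issue: a single short span cannot express most of the distribution the estimator is supposed to describe.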

This matters because the research question is not really “what is the entropy of this one passage?” It is closer to:

$$ \text{What does the token distribution look like for passages that receive the same memorization score?} $$

That is a conditional distribution question. One instance is not enough to estimate it well.

This reframes the likely misconception. The useful claim is not:

More compressible individual text is always easier for an LLM to memorize.

The replacement claim is subtler:

When examples are grouped by their memorization score, the entropy of the pooled token distribution becomes a strong set-level proxy for memorization behaviour.

That distinction is the paper’s hinge. Miss it, and the whole result sounds like recycled “compression explains intelligence” wallpaper. Understand it, and the paper becomes an audit method.

The mechanism: group by score, then measure entropy

The authors use a discoverable memorization setup. They sample token sequences from known training corpora, split each sequence into a prompt prefix and a ground-truth continuation, generate a model continuation, and compare the generated continuation to the ground truth.

The comparison uses Levenshtein edit distance at token level. That choice is practical. If the model outputs almost the same sequence with one inserted token, a changed punctuation mark, or a shifted offset, edit distance still captures the near-verbatim match. A position-by-position comparison would be more brittle. Copyright and privacy disputes are generally about close reproduction, not vibes-based semantic resemblance.
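A short sketch of token-level Levenshtein distance shows why it is less brittle than positional matching. This is the textbook dynamic-programming version with unit costs, not the authors' code; `positional_mismatches` is a hypothetical helper included only for contrast.

```python
def token_edit_distance(generated: list[int], reference: list[int]) -> int:
    """Levenshtein distance over token IDs; insert, delete, substitute cost 1 each."""
    prev = list(range(len(reference) + 1))
    for i, g in enumerate(generated, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,             # delete g from the generated side
                curr[j - 1] + 1,         # insert r into the generated side
                prev[j - 1] + (g != r),  # substitute, or free match when equal
            ))
        prev = curr
    return prev[-1]


def positional_mismatches(generated: list[int], reference: list[int]) -> int:
    """Naive position-by-position comparison, shown only for contrast."""
    paired = sum(g != r for g, r in zip(generated, reference))
    return paired + abs(len(generated) - len(reference))


reference = [1, 2, 3, 4, 5]
generated = [9, 1, 2, 3, 4, 5]  # one stray token up front, then a verbatim copy

print(token_edit_distance(generated, reference))    # one deletion repairs it
print(positional_mismatches(generated, reference))  # every position looks wrong
```

One stray token shifts every later position, so the positional count treats a near-verbatim copy as a total miss, while edit distance charges for the single slip.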

The paper also filters out trivial copying where the model simply repeats material already present in the prompt. That is a good hygiene step. If the prompt contains a long URL and the model echoes it, the model may be continuing the prompt rather than revealing a memorized continuation. Nobody needs a leaderboard for copy-paste.
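The article does not spell out the paper's filtering rule, so the following is only one plausible way to flag prompt echo. It assumes a sample counts as trivial copying when a long contiguous run of the continuation already appears in the prompt; both the rule and the `max_shared_fraction` cut-off are hypothetical.

```python
def longest_shared_run(prompt: list[int], continuation: list[int]) -> int:
    """Length of the longest contiguous token run present in both sequences."""
    best = 0
    prev = [0] * (len(continuation) + 1)
    for p in prompt:
        curr = [0] * (len(continuation) + 1)
        for j, c in enumerate(continuation, start=1):
            if p == c:
                curr[j] = prev[j - 1] + 1  # extend the run ending at the previous pair
                best = max(best, curr[j])
        prev = curr
    return best


def is_trivial_copy(prompt: list[int], continuation: list[int],
                    max_shared_fraction: float = 0.5) -> bool:
    """Flag samples whose continuation is mostly an echo of the prompt.

    max_shared_fraction is an illustrative cut-off, not a value from the paper.
    """
    if not continuation:
        return False
    return longest_shared_run(prompt, continuation) / len(continuation) > max_shared_fraction


print(is_trivial_copy([5, 6, 7, 8], [5, 6, 7, 8]))  # continuation repeats the prompt
print(is_trivial_copy([1, 2, 3], [4, 5, 6]))        # nothing shared
```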

The central method then changes the unit of measurement:

  1. Compute a memorization score for each sampled prompt-continuation pair.
  2. Group all examples with the same score.
  3. Pool the tokens inside each score group.
  4. Estimate entropy over that pooled token set.
  5. Regress entropy against memorization score.

Informally, for a memorization-score group $d$, the entropy estimator has the familiar form:

$$ H_d = - \sum_{t \in V_d} p_d(t)\log_2 p_d(t) $$

where $V_d$ is the set of unique tokens observed in the group and $p_d(t)$ is the empirical probability of token $t$ in that group.

This is not a complicated algorithm. Its value comes from asking the metric to do the job it is suited for. Entropy needs a distribution. A single text span is not one. A grouped set is closer.
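The five steps fit in a few lines. The sketch below assumes each sample arrives as a `(score, tokens)` pair with the edit-distance score already computed, and that every token of the sample is pooled; which tokens the paper pools is not stated here, so treat that as an assumption. The toy data is invented for illustration.

```python
import math
from collections import Counter, defaultdict


def level_set_entropy(samples):
    """Map each memorization score d to H_d, the entropy (bits) of its pooled tokens."""
    pooled = defaultdict(Counter)
    for score, tokens in samples:
        pooled[score].update(tokens)  # steps 2-3: group by score, pool tokens
    entropies = {}
    for score, counts in pooled.items():
        total = sum(counts.values())
        entropies[score] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )  # step 4: plug-in entropy over the pooled counts
    return entropies


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


# Toy samples: (edit-distance score, token IDs). Not real data.
samples = [
    (0, [1, 1, 1, 1]), (0, [1, 1, 1, 1]),
    (1, [1, 1, 2, 2]),
    (2, [1, 2, 3, 4]),
]
entropies = level_set_entropy(samples)
scores = sorted(entropies)
slope, intercept = fit_line(scores, [entropies[s] for s in scores])  # step 5
print(entropies, slope, intercept)
```

The regression is fitted on one point per score group, not one point per sample, which is the change in unit of analysis that the method depends on.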

Entropy–Memorization Linearity is the main evidence, not the appendix garnish

The headline empirical result is that the level-set entropy estimator linearly tracks the edit-distance memorization score across several open LLM families and training corpora. The paper reports results on OLMo, OLMo-2, OpenLlama, and Pythia, using their corresponding open or reconstructed training datasets.

Set-level zlib is tested too, but entropy is the cleaner signal. This is not too surprising. zlib operates as a practical compression algorithm with implementation choices and sequence-order effects. The entropy estimator, crude though it is, more directly targets the token distribution the authors want to measure.

The paper’s evidence stack is best read as follows:

| Test or result | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Instance-wise entropy and zlib fail to show strong structure | Comparison with prior work | The old unit of analysis is too noisy | That individual examples contain no memorization signal |
| Set-level entropy aligns linearly with memorization score | Main evidence | Entropy–Memorization Linearity exists under the tested setup | A causal theory of memorization |
| Different continuation lengths retain high correlations, reported around 0.92–0.98 | Robustness test | The pattern is not tied to one generation length | That intercepts and slopes are universal |
| Temperature, top-p, and top-k variants retain strong correlations in appendix tests | Robustness/sensitivity test | Sampling choices do not destroy the relationship | That decoding strategy is irrelevant |
| Semantic clustering across 16 clusters preserves the broad pattern | Robustness by data domain | The result is not only an artefact of one mixed corpus | That all domains behave equally |
| Biomedical/genetic cluster weakens the correlation | Boundary case | Domain shift can degrade the signal | That the method fails generally |
| Instruct-model test on OLMo-2-Instruct shows reduced memorization but similar entropy correlation | Exploratory extension | The pattern can persist after instruction tuning in this tested setting | That post-training datasets are handled by the same prefix-continuation method |
| Dataset inference using intercept thresholds | Application | EM Linearity can support training/test membership audits | Perfect membership attribution |

This is a solid empirical package because the appendices mostly test fragility rather than inventing a second thesis. The main story remains the set-level entropy relationship. The appendix asks the mature question: how easily does it break?

The answer is: not immediately, but not never.

Higher entropy does not mean “more memorized” in the naive sense

A careful reading must keep the score direction straight.

The paper’s memorization score is an edit distance. A score of zero means the model continuation exactly matches the ground truth. A higher score means the model is farther away. So when entropy increases with memorization score, the relationship is between entropy and distance from verbatim reproduction, not between entropy and memorization in the casual English sense.

That sounds like a nuisance, but it is clarifying. Low-score groups—the more memorized examples—occupy a smaller token space. The paper’s token-space analysis finds that lower-score data contains exponentially fewer unique tokens. It also finds that, after normalising entropy by the maximum possible entropy for the observed support size, normalised entropy decreases as memorization score increases.

The authors define normalised entropy approximately as:

$$ \frac{H_d}{\log_2 |V_d|} $$

This separates two effects that ordinary entropy blends together:

  • the size of the token space available in the group;
  • how evenly probability mass is distributed inside that space.
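A small sketch makes the two effects visible. It divides the plug-in entropy by the base-2 log of the support size, so that the units match $H_d$; returning 0.0 for a group with fewer than two distinct tokens is a convention assumed here to avoid dividing by zero, not something taken from the paper.

```python
import math
from collections import Counter


def normalised_entropy(counts: Counter) -> float:
    """Plug-in entropy divided by its maximum for the observed support size."""
    support = len(counts)
    if support < 2:
        return 0.0  # assumed convention: log2(1) = 0 would make the ratio undefined
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(support)


even_small = Counter([1, 2])                    # 2 tokens, evenly used
even_large = Counter([1, 2, 3, 4])              # 4 tokens, evenly used
skewed_large = Counter([1, 1, 1, 1, 1, 2, 3, 4])  # 4 tokens, one dominates

for name, group in [("even_small", even_small),
                    ("even_large", even_large),
                    ("skewed_large", skewed_large)]:
    print(name, len(group), normalised_entropy(group))
```

The two even groups differ in raw entropy only because their supports differ, and normalisation removes that difference; the skewed group has the same support as `even_large` but a lower normalised value because its mass is concentrated.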

The interpretation is still exploratory. The paper does not offer a full theory for why the linearity appears. Its proposed intuition is that high-score samples may resemble ordinary natural language distributions, which are long-tailed, while low-score memorized samples may include specialised text such as code snippets, numerical strings, and structured content that behaves differently.

That is plausible, but still a hypothesis. The better business takeaway is simpler: the intercept and slope of the entropy–score line carry dataset-specific information. They become audit features.

The audit use case: dataset inference through intercepts

The paper’s practical application is dataset inference. Given a model and a candidate dataset, can an auditor infer whether that dataset was likely part of training?

This differs from classic membership inference. Membership inference often asks whether one individual example was in the training set. Dataset inference asks whether a collection—a book, benchmark slice, corpus subset, or proprietary document bundle—was included. That is more realistic for many business disputes. Companies rarely litigate over one token sequence in isolation. They argue over collections.

The authors’ method is intentionally simple:

  1. Run the set-level entropy procedure on a known reference training subset.
  2. Fit the entropy–memorization regression.
  3. Use the intercept as a membership signal.
  4. Set a model-specific threshold.
  5. Run the same procedure on the candidate dataset.
  6. Classify the candidate as member or non-member based on the threshold.
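The six steps reduce to fitting a line and comparing one number. In the sketch below the per-score entropies are assumed to be computed already, all numeric values are hypothetical, and the direction of the comparison is left as an explicit `member_if_below` argument because the article does not say on which side of the threshold members fall.

```python
def fit_intercept(scores, entropies):
    """Intercept of the least-squares line entropy = slope * score + intercept."""
    mean_x = sum(scores) / len(scores)
    mean_y = sum(entropies) / len(entropies)
    sxx = sum((x - mean_x) ** 2 for x in scores)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, entropies)) / sxx
    return mean_y - slope * mean_x


def classify_dataset(intercept: float, threshold: float, *, member_if_below: bool) -> str:
    """Compare a candidate intercept with a model-specific calibrated threshold.

    member_if_below must be set from your own calibration on known member
    and non-member data; it is not specified by this sketch.
    """
    is_member = intercept < threshold if member_if_below else intercept > threshold
    return "member" if is_member else "non-member"


# Hypothetical per-score entropies for a candidate dataset.
candidate_scores = [0, 1, 2, 3]
candidate_entropies = [3.50, 3.65, 3.80, 3.95]
THRESHOLD = 3.60  # hypothetical threshold from a reference training subset

intercept = fit_intercept(candidate_scores, candidate_entropies)
print(intercept, classify_dataset(intercept, THRESHOLD, member_if_below=True))
```

Keeping the threshold and its direction as explicit inputs reflects the point that calibration belongs to a specific model, not to the method.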

The paper uses LiveBench as temporally non-member data for OLMo-2 and MIMIR data for Pythia-related experiments. In the main dataset-inference table, the method correctly separates several member and non-member datasets, but it also predicts some non-member slices as members. In the appendix comparison against a baseline dataset-inference method from Maini et al., the authors report 5/6 correct predictions for their method versus 3/6 for the baseline on the selected comparison, along with at least a 2× runtime speedup and no need for the baseline’s additional regression-weight training step.

The operational message is not “solved.” It is “cheap enough to run early.”

That matters. Audit workflows are constrained by time, GPU budget, and access. A method that only requires inference on sampled data and does not require training shadow models is attractive as a first-pass screen.

What Cognaptus would infer for operators

Here is the clean separation.

What the paper directly shows: under the tested open-model settings, a set-level entropy estimator over score-grouped training examples linearly approximates edit-distance memorization score. The pattern survives several robustness checks, including decoding variations, continuation lengths, semantic clustering, and an instruct-model variant. The dataset-inference application is promising but imperfect.

What Cognaptus infers for business use: this could become a low-cost audit layer for organisations that need to assess whether sensitive, copyrighted, or evaluation data may have influenced a model. It is especially relevant where the unit of concern is a dataset, not a single example. Think: “Was this benchmark contaminated?” “Was this proprietary manual likely included?” “Does this model reproduce structured client records more than expected?”

What remains uncertain: the method needs calibration, sufficient sample size, and appropriate access. It does not establish causality. It does not produce instance-level membership proof. It may weaken under domain shift. It relies on edit-distance reproduction, so it is less informative for paraphrased or semantic memorization.

The business value is therefore diagnostic triage. Use it before expensive legal, forensic, or retraining decisions. Do not use it as the only reason to accuse a vendor of swallowing your corpus whole. Even if that would make for a livelier procurement meeting.

A practical operator framework

For an AI governance team, the method suggests a four-layer workflow.

| Layer | Question | EM-linearity role | Decision value |
| --- | --- | --- | --- |
| Training-data pre-screen | Which data subsets look structurally prone to reproduction? | Estimate entropy–score patterns on candidate training subsets | Prioritise deduplication, filtering, or access controls |
| Privacy audit | Does a model reproduce sensitive structured data unusually closely? | Compare low-score groups and intercept behaviour | Flag datasets for deeper extraction testing |
| Copyright exposure review | Does a protected corpus behave like known training data? | Dataset inference using calibrated thresholds | Produce an early risk signal, not legal proof |
| Benchmark hygiene | Does an evaluation set look like training data? | Test candidate benchmark splits against reference training subsets | Reduce contamination before performance claims |

The most important implementation detail is sample size. The paper’s dataset-inference appendix reports that small samples make intercept and slope unstable, and it empirically sets a minimum of 1,500 samples, with each sample being a 150-token sequence. That is not huge by modern standards, but it is not “just paste three paragraphs into a chatbot and squint.”
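One way to check stability on your own data is to refit the line on random subsamples and watch how much the intercept moves. This is a generic resampling sketch, not the paper's sensitivity test; it assumes samples are `(score, tokens)` pairs, and `trials` and `seed` are arbitrary illustrative defaults.

```python
import math
import random
import statistics
from collections import Counter, defaultdict

MIN_SAMPLES = 1500  # minimum sample count reported in the paper's appendix
SEQ_LEN = 150       # tokens per sample, as reported


def fitted_intercept(samples):
    """Intercept of the entropy-vs-score line for (score, tokens) samples."""
    pooled = defaultdict(Counter)
    for score, tokens in samples:
        pooled[score].update(tokens)
    xs, ys = [], []
    for score, counts in pooled.items():
        total = sum(counts.values())
        xs.append(score)
        ys.append(-sum((c / total) * math.log2(c / total) for c in counts.values()))
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct scores to fit a line")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x


def intercept_spread(samples, n, trials=20, seed=0):
    """Standard deviation of the fitted intercept over random subsamples of size n."""
    rng = random.Random(seed)
    fits = [fitted_intercept(rng.sample(samples, n)) for _ in range(trials)]
    return statistics.stdev(fits)
```

Comparing `intercept_spread(samples, 200)` with `intercept_spread(samples, MIN_SAMPLES)` on a scored corpus gives a rough sense of whether a candidate set is large enough for its intercept to be trusted against a threshold.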

Calibration is also model-specific. The authors use different intercept thresholds for Pythia and OLMo-2. That makes sense: tokenizers, corpora, model size, training recipes, and decoding choices all shape the fitted line. A universal threshold would be convenient. So would a universal CFO who enjoys uncertainty. We work with what exists.

The instruct-model result is interesting, but not a free pass

The appendix test on OLMo-2-1124-7B-Instruct is worth attention because many deployed systems are instruction-tuned or preference-aligned.

The authors test the set-level entropy estimator on the instruct variant using a pre-training subset, not a post-training instruction dataset. They find that measured memorization is markedly reduced: no perfect memorization is detected, and the minimum edit distance is 6. Yet the entropy–memorization relationship remains strong, with reported regression metrics close to the base model’s: intercept 3.500 versus 3.724, slope 0.150 versus 0.142, and correlation 0.955 versus 0.945.

This is a useful nuance. Alignment may reduce direct reproduction in the tested setup, but it does not necessarily erase the statistical relationship between set-level entropy and reproduction distance. The model may become less willing or less likely to reproduce, while the underlying data-conditioned pattern remains visible.

For operators, that means safety alignment should not be treated as a memorization audit. It may change behaviour. It does not automatically remove exposure.

Where the method bends

The paper’s limitations are not decorative. They change how the method should be used.

First, the method is set-level. It cannot reliably tell you whether one specific sentence was in training. If your workflow requires instance-level attribution, this is not sufficient.

Second, the memorization score is edit distance. That is appropriate for verbatim or near-verbatim reproduction, which is often the legally and operationally salient case. It is less helpful for semantic leakage, paraphrase, or concept-level retention.

Third, the experiments use a particular discoverable memorization setup: prefix prompts from sampled sequences and generated continuations. Other prompting strategies may expose different behaviour. The paper tests sampling variants, but it does not exhaust adversarial extraction or non-adversarial reproduction settings.

Fourth, the method requires enough candidate data. The authors’ sensitivity test supports their 1,500-sample threshold for dataset inference because smaller samples make slope and intercept unstable.

Fifth, domain shift matters. The semantic-clustering appendix finds broadly robust linearity across clusters, but the genetic and biomedical research cluster shows a noticeably weaker correlation. The authors suspect this reflects a specialised textual style with Latin-derived compound words and domain-specific token distributions. Translation: specialised corpora may need their own calibration before anyone starts waving intercepts around in a boardroom.

Finally, the theoretical why remains unfinished. The empirical line is there. The explanation is not fully nailed down. The paper points toward long-tail theory and multicalibration as possible future tools. That is refreshingly honest. A named law without a full theory is still useful, but it should not be mistaken for physics.

The strategic implication: audit by distribution, not anecdote

The most useful idea in the paper is not the word “entropy.” It is the shift from anecdotal examples to distributional audit.

Anecdotes will always matter. If a model reproduces a private record, that individual output is important. But anecdotes do not tell an organisation whether a model has a systematic exposure problem. They are too sparse, too prompt-sensitive, and too easy to dismiss as edge cases.

Set-level entropy gives auditors a different kind of instrument. It asks whether groups of examples with similar reproduction behaviour also share measurable distributional structure. Once that structure exists, it can be used for monitoring, comparison, and triage.

That is the business-relevant contribution. Not “we can predict every memorized string.” Not “compression explains LLMs.” Not “copyright audits are solved.” The contribution is narrower and more useful: a simple set-level statistic can expose regularity where instance-level metrics looked like noise.

In LLM governance, that is usually how progress arrives. Not as a grand unified theory. As a slightly better measuring stick that makes the next expensive decision less blind.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yizhan Huang, Zhe Yang, Meifang Chen, Huang Nianchen, Jianping Zhang, and Michael R. Lyu, “Data Compressibility Quantifies LLM Memorization,” arXiv:2507.06056v4, 2026, https://arxiv.org/abs/2507.06056