From Genes to Memes: The Evolutionary Biology of Hugging Face's 2 Million Models

TL;DR for operators

Open-model adoption is usually treated as procurement with a nicer download button: find a model, check the licence, skim the model card, run a few benchmarks, and move on. This paper makes that habit look under-specified.

Laufer, Oderinwale, and Kleinberg analyse 1.86 million models on Hugging Face and reconstruct family trees linking models to fine-tuned, adapted, quantised, or merged descendants.¹ The useful result is not merely that Hugging Face is large. We knew that. The useful result is that model lineages mutate their public governance signals as they spread.

The paper shows three things operators should care about.

First, related models do resemble each other, but not in the simple “child inherits parent” way. In the fine-tuning trees studied most closely, sibling models often look more similar to each other than parent-child pairs do. That suggests mutation is not just random noise around a base model. Fine-tuning tends to move derivatives in shared directions.

Second, the traits that mutate are operationally sensitive. Licences drift from restrictive or commercial terms toward permissive or copyleft labels. Documentation thins. Language support specialises, strongly toward English. Task labels shift in a pattern the authors interpret as loosely mirroring the machine-learning lifecycle.

Third, the evidence is about declared Hugging Face metadata and model cards, not direct inspection of weights or undisclosed dependencies. That boundary matters. The paper does not prove that every derivative model violates its upstream terms or that every English-labelled model lost actual multilingual capability. It does show that the public facts companies rely on for model intake can change quickly along derivative lineages. That is enough to make “we used the same base family” a weak governance answer.

The operational takeaway: model registries need lineage-aware review. A derivative model should not automatically inherit trust from its ancestor. It should inherit questions.

Model registries are becoming breeding grounds, not warehouses

A model registry sounds reassuringly dull. Warehouses are dull. Ledgers are dull. Procurement systems are dull by design. They tell the organisation what it has, where it came from, and which rules apply before someone embeds it inside a customer-facing product and later discovers the licence has opinions.

Hugging Face is not dull in that way. It is closer to an ecological habitat: models appear, branch, specialise, get remixed, and spawn derivatives. Some lineages become small shrubs. Some become sprawling forests. Some look like broadcasts from a popular base model. Others run through multiple generations, with each derivative becoming the parent of the next. The family tree stops being a metaphor and becomes a dependency graph with a better costume.

The paper’s central move is to take that metaphor seriously enough to measure it. The authors treat metadata and model cards as a kind of public genetic material. Not DNA in the literal sense — nobody sensible should confuse a JSON tag with a chromosome before coffee — but a structured record of inherited and mutated traits: licence, task, language support, library, linked datasets, model-card language, and parent-child relationships.

That framing matters because many business workflows still behave as if an open model is a discrete object. Pick model. Review model. Deploy model. Done.

The paper’s evidence points to a different object: a model is a point in a lineage. The lineage has direction. The direction changes the governance facts.

The dataset makes lineage measurable at ecosystem scale

The first contribution is infrastructure. The authors collect a snapshot of 1.86 million publicly available Hugging Face models. They pull two broad categories of information.

The first is structured metadata: model IDs, downloads, likes, trending scores, task labels, libraries, creation timestamps, tags, licences, linked arXiv papers, language codes, and parent-model relationships encoded in Hugging Face tags. The second is the model-card text, collected separately through per-model API calls.

That second part matters because model cards are the public documentation layer. They are where developers often explain intended use, limitations, performance, risks, training context, or nothing much at all, with great confidence and six words. In the dataset, 67.04% of models have a model card. Among the 1,247,149 available cards, the mean length is 3,575.60 characters, while the median is 2,073 characters. The gap between mean and median is a quiet warning: a few extremely long cards pull the mean upward, while the typical card is much thinner.

The paper also reports ecosystem-level summaries: permissive licences, especially Apache-2.0 and MIT, make up over 60% of reported licences; text-generation dominates task labels; English appears in over 75% of models that report any language compatibility; Chinese is second at 4.4%; and transformers is the most common library.

Those numbers are useful, but the more important move is not the census. It is the graph. The authors use logged Hugging Face relationships to connect models into family trees. Some edges represent fine-tuning, some quantisation, some adapters, and some merges. For the genetic-similarity analysis, they narrow the focus mainly to fine-tuning edges, excluding merges, adaptations, and quantisations so the graph behaves more like a tree with one parent per child.

This is not a pedantic modelling choice. It determines what the evidence can support. Fine-tuning lineages allow the authors to ask inheritance-style questions: do children resemble parents? Do siblings resemble each other? Does similarity decay across generations? Merges would complicate that with multiple parents. Useful complication, but still complication. The paper leaves that “sexual reproduction” problem largely for future work. Quite fair. Even evolutionary biology got a few centuries.
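To make the modelling choice concrete, here is a minimal sketch of filtering declared relationships down to fine-tune edges and reading off generations and siblings. The model IDs, the relation names, and the `(child, parent, relation)` record shape are all assumptions for illustration; the paper's actual pipeline is not reproduced here.

```python
# Hypothetical (child, parent, relation) records standing in for declared
# Hugging Face relationships; the IDs and relation names are made up.
EDGES = [
    ("org/base-ft-a", "org/base", "finetune"),
    ("org/base-ft-b", "org/base", "finetune"),
    ("org/base-ft-a-v2", "org/base-ft-a", "finetune"),
    ("org/base-gguf", "org/base", "quantized"),
    ("org/blend", "org/base-ft-a", "merge"),
    ("org/blend", "org/base-ft-b", "merge"),
]


def finetune_parents(edges):
    """Keep only fine-tune edges, so each child maps to a single parent."""
    return {child: parent for child, parent, rel in edges if rel == "finetune"}


def generation(model, parent_of):
    """Count fine-tune hops from a model up to its root (a root is generation 0)."""
    depth, seen = 0, {model}
    while model in parent_of:
        model = parent_of[model]
        if model in seen:  # declared metadata can be malformed; refuse to loop
            raise ValueError("cycle in declared lineage")
        seen.add(model)
        depth += 1
    return depth


def siblings(model, parent_of):
    """Other models that declare the same fine-tune parent."""
    parent = parent_of.get(model)
    if parent is None:
        return []
    return sorted(m for m, p in parent_of.items() if p == parent and m != model)


parent_of = finetune_parents(EDGES)
print(generation("org/base-ft-a-v2", parent_of))  # 2
print(siblings("org/base-ft-a", parent_of))       # ['org/base-ft-b']
```

The merged model disappears from the tree entirely, which is exactly the trade-off described above: a cleaner structure at the cost of leaving multi-parent reuse out of view.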

The mechanism: family resemblance exists, but inheritance is not simple

The obvious expectation is that a fine-tuned child should look more like its parent than like its sibling. In a simple asexual reproduction story, the child inherits traits from the parent, with some random mutation. Siblings share a parent, but each has its own mutations. Parent-child similarity should therefore be stronger than sibling-sibling similarity.

The paper finds something more interesting. Models in the same family do resemble each other more than random pairs. Similarity also generally declines as models are farther apart in the family graph or separated by more generations. So the family-tree metaphor is not decorative. Related models really do carry overlapping public markers.

But sibling models can be more similar to one another than either is to the parent. That is the mechanism hiding inside the paper.

The authors interpret this as evidence that mutation is fast and directed. Fine-tuned children do not merely drift randomly away from a parent. They often move away in similar directions. A base model may spawn many derivatives that add similar kinds of metadata, adopt similar documentation templates, narrow similar language claims, or shift similar task labels. The siblings look alike because they are shaped by common downstream tooling, platform conventions, developer incentives, and market demand.
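A small sketch can show how siblings end up closer to each other than to their parent under a TF-IDF comparison of metadata. The metadata strings below are toy examples constructed to exhibit the pattern, not data from the paper, and the smoothed idf formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter

# Toy metadata strings, built to illustrate the pattern; not data from the paper.
DOCS = {
    "parent": "license:llama3 task:text-generation lang:en lang:fr lang:de library:transformers",
    "child_a": "license:apache-2.0 task:text-generation lang:en library:peft autogenerated",
    "child_b": "license:apache-2.0 task:text-generation lang:en library:peft autogenerated dataset:custom",
    "unrelated": "license:mit task:image-classification library:timm",
}


def tfidf_vectors(docs):
    """Map each document to a {token: tf * idf} dict using a smoothed idf."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for c in counts.values() for tok in c)
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    return {name: {tok: tf * idf[tok] for tok, tf in c.items()} for name, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[tok] for tok, w in u.items() if tok in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vecs = tfidf_vectors(DOCS)
sibling_sim = cosine(vecs["child_a"], vecs["child_b"])
parent_child_sim = cosine(vecs["parent"], vecs["child_a"])
random_pair_sim = cosine(vecs["child_a"], vecs["unrelated"])
print(f"sibling={sibling_sim:.2f} parent-child={parent_child_sim:.2f} random={random_pair_sim:.2f}")
```

In this toy case both children swap the licence, narrow the language list, and pick up the same tooling markers, so they share more tokens with each other than with the parent they came from.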

For business readers, this is where the paper becomes more than an ecosystem map. It says public model facts are not stable inheritance properties. They are edited by the reproductive process itself.

A base model’s licence, documentation depth, multilingual promise, and task framing may not survive the trip through downstream adaptation. The derivative may still carry family resemblance, but the details procurement cares about can mutate. The family name is not the compliance file.

Licence drift turns provenance into a legal intake problem

The licence result is the paper’s sharpest governance finding.

The authors count 162 unique licence types on Hugging Face, including 98 standard platform categories once “unknown” and “other” are excluded. For licence inheritance, they observe 320,065 parent-child cases. The mutation rate is 14.98%, meaning the child reports a different licence from the parent in roughly one out of seven observed licence inheritances.

The direction is the uncomfortable part. Among the 20 most common licence traits studied in the drift graph, 132 of the 140 pairwise drift directions agree with the paper’s optimal ordering, and 84.26% of individual mutations move in that order. The ordering generally runs from commercial or restrictive licences toward permissive or copyleft licences.
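The two headline numbers, mutation rate and share of mutations following an ordering, are simple to compute once parent-child pairs are in hand. The sketch below uses a hypothetical four-licence ordering and eight toy pairs; the paper fits its own ordering to the data, and that fitting step is not shown.

```python
# Hypothetical ordering from more restrictive to more permissive labels.
# The paper fits its own ordering to data; this one is assumed for illustration.
ORDER = ["llama3", "cc-by-nc-4.0", "apache-2.0", "mit"]
RANK = {licence: i for i, licence in enumerate(ORDER)}

# Toy (parent_licence, child_licence) observations, not figures from the paper.
PAIRS = [
    ("llama3", "llama3"),
    ("llama3", "apache-2.0"),
    ("cc-by-nc-4.0", "mit"),
    ("apache-2.0", "apache-2.0"),
    ("mit", "llama3"),
    ("apache-2.0", "mit"),
    ("mit", "mit"),
    ("llama3", "llama3"),
]


def mutation_rate(pairs):
    """Share of parent-child pairs whose declared label differs."""
    return sum(p != c for p, c in pairs) / len(pairs)


def share_following_order(pairs, rank):
    """Among mutated pairs, the share that move forward in the ordering."""
    mutated = [(p, c) for p, c in pairs if p != c]
    forward = sum(rank[c] > rank[p] for p, c in mutated)
    return forward / len(mutated) if mutated else 0.0


print(mutation_rate(PAIRS))                # 0.5
print(share_following_order(PAIRS, RANK))  # 0.75
```

The same two functions apply to any single-valued trait, such as task labels; set-valued traits such as language lists need a different comparison.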

Examples matter here because “licence drift” sounds abstract until it meets a legal team’s pulse. The paper notes that Gemma and Llama licence variants often sit upstream of Apache-2.0 or MIT labels. It also notes Creative Commons non-commercial licences drifting toward licences that remove non-commercial restrictions. In plain English: derivatives sometimes present themselves as more open than their ancestors’ terms appear to allow.

The paper does not prove legal violation model by model. It is not a court, which is fortunate for everyone’s formatting. What it does show is systematic pressure toward more permissive self-description. That is enough to change the operating procedure.

A company cannot safely read a derivative model’s licence tag in isolation. It must ask whether the derivative’s declared licence is compatible with its parent’s licence. If the parent is restrictive and the child claims Apache-2.0, that is not a green light. It is a review queue.
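As a minimal sketch of that review queue, the check below compares a declared parent licence with a declared child licence and returns a routing decision. The grouping of licences into restrictive and permissive sets is an illustrative assumption, not a vetted legal classification, and the function name and messages are hypothetical.

```python
# Illustrative grouping only; which licences count as restrictive is a legal
# judgement, and these sets are assumptions rather than a vetted list.
RESTRICTIVE = {"llama3", "gemma", "cc-by-nc-4.0"}
PERMISSIVE = {"apache-2.0", "mit"}


def licence_review(parent_licence, child_licence):
    """Return a routing decision for a declared parent/child licence pair."""
    if parent_licence is None or child_licence is None:
        return "review: licence missing on parent or child"
    if parent_licence == child_licence:
        # An unchanged label only means no drift; it says nothing about
        # whether the intended use complies with that licence.
        return "ok: label unchanged"
    if parent_licence in RESTRICTIVE and child_licence in PERMISSIVE:
        return "review: child claims more permissive terms than parent"
    return "review: label changed"


print(licence_review("llama3", "apache-2.0"))
print(licence_review("mit", "mit"))
print(licence_review(None, "apache-2.0"))
```

The design choice is that any change defaults to review; only an unchanged, fully declared pair passes without a human looking at it.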

Documentation does not merely shrink; it industrialises

Documentation drift is less dramatic than licence drift, but more familiar to anyone who has seen internal AI governance in the wild.

The paper finds that among parent-child pairs with model cards, child cards are substantially shorter. The parent’s model card is roughly twice the size of the child’s on average. At the same time, derivative cards increasingly contain phrases such as “automatically generated” or “generated automatically.” About 30% of derivative models contain those bigrams.

This is not simply laziness. It is a mechanism.

Fine-tuning and adapter workflows reduce the marginal cost of producing a model. Documentation then becomes a by-product of tooling rather than a crafted explanation. Templates fill the space where contextual judgement should be. A derivative can be produced quickly, uploaded quickly, and documented just enough to satisfy the platform’s shape of documentation.

That is operationally dangerous because documentation is already where downstream users look for risk boundaries. Was the model trained on sensitive data? Is it suitable for medical use? Which languages were evaluated? What safety testing was performed? Does the model inherit known limitations from the base model or introduce new ones? A shorter, templated card may not answer those questions. It may simply be a receipt.

The business lesson is not “never use derivatives.” That would be theatrical and unhelpful. The lesson is that documentation quality should be scored at the derivative level, not inherited from the parent. A child model with a famous ancestor and a thin autogenerated card is not well documented. It is wearing a family crest on a paper napkin.
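Scoring at the derivative level can start with two crude signals the paper itself uses: relative card length and autogeneration phrases. The sketch below is a starting point under stated assumptions; the `min_ratio` threshold of 0.5 and the `needs_review` rule are hypothetical choices, and neither signal measures documentation quality directly.

```python
import re

# The two phrases the article cites as autogeneration markers.
AUTOGEN = re.compile(r"automatically\s+generated|generated\s+automatically", re.IGNORECASE)


def card_signals(parent_card, child_card, min_ratio=0.5):
    """Compare a child card with its parent card; min_ratio is an assumed threshold."""
    parent_len, child_len = len(parent_card), len(child_card)
    ratio = child_len / parent_len if parent_len else None
    autogenerated = bool(AUTOGEN.search(child_card))
    needs_review = (
        autogenerated
        or child_len == 0
        or (ratio is not None and ratio < min_ratio)
    )
    return {"length_ratio": ratio, "autogenerated": autogenerated, "needs_review": needs_review}


# Made-up card text for illustration.
parent_card = "Intended use, limitations, evaluation results, training data notes. " * 20
child_card = "This model card was automatically generated. Fine-tuned from org/base."
signals = card_signals(parent_card, child_card)
print(signals["autogenerated"], signals["needs_review"])
```

A flagged card is a prompt for a human to read it, not a verdict; a short card can be adequate and a long one can still omit the evaluation details that matter.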

Language drift quietly narrows the addressable world

Language support is operationally different from licence because a model can list multiple languages. The paper therefore treats language mutation as a partial shift across sets: languages can be added, dropped, or narrowed.

The result is a strong movement toward specialisation, especially English. In the language analysis, the authors observe 115,660 inheritances, a mutation rate of 12.80%, 186 of 190 drift directions following the optimal ordering, and 74.71% of mutations following that order. The endpoint of the ordering is English.

This is easy to misread. The paper is not saying the global open-model ecosystem literally cannot support other languages. Nor does a language tag fully measure capability. A model may retain latent multilingual behaviour even if its metadata says otherwise. Conversely, a model may claim language support that evaluation would not vindicate. Metadata is a signal, not a benchmark.

Still, the drift is commercially important. Base models often report broader multilingual compatibility. Derivatives tend to report one or a few languages, with strong movement toward English. That can produce a false sense of localisation coverage. A team adopting a derivative because its ancestor was multilingual may discover, too late, that the derivative’s public claims narrowed.

For operators in multilingual markets, the model-family name is therefore insufficient. Localisation review must ask: what does this exact derivative claim, what has it been evaluated on, and what did it drop relative to the parent?

This matters especially in compliance-heavy, customer-facing, or government contexts. Language support is not a UX flourish. It affects fairness, accessibility, support cost, error rates, and legal exposure. The English drift is not just cultural gravity. It is product risk with a default accent.
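Because language claims are sets, the comparison between parent and child is a set difference rather than an equality test. A minimal sketch, with made-up language lists and a hypothetical `needs_localisation_review` helper, might look like this. It compares declared metadata only and says nothing about measured capability.

```python
def language_shift(parent_langs, child_langs):
    """Split declared language codes into retained, dropped, and added lists."""
    parent, child = set(parent_langs), set(child_langs)
    return {
        "retained": sorted(parent & child),
        "dropped": sorted(parent - child),
        "added": sorted(child - parent),
    }


def needs_localisation_review(shift, target_langs):
    """True if any language the deployment needs was dropped relative to the parent."""
    return bool(set(target_langs) & set(shift["dropped"]))


shift = language_shift(["en", "fr", "de", "zh"], ["en"])
print(shift)
# {'retained': ['en'], 'dropped': ['de', 'fr', 'zh'], 'added': []}
print(needs_localisation_review(shift, ["fr"]))  # True
print(needs_localisation_review(shift, ["en"]))  # False
```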

Task drift looks like a lifecycle — but treat it as a hypothesis

The task-label analysis is the most conceptually interesting and the easiest to overstate.

The paper reports 251,060 task inheritances, a high mutation rate of 23.14%, 111 of 121 drift directions following the optimal order, and 95.16% of mutations following that order. The ordering appears to move from lower-level representation or feature tasks toward modality translation, generation, classification, and reinforcement-learning-style tasks.

The authors interpret this as a possible recapitulation of the machine-learning lifecycle: general-purpose representation first, then adaptation, then task-specific or human-facing outputs. They are careful here. This is an explanatory hypothesis, not a proven law of model development. Good. The graveyard of grand biological analogies has enough marble.

For business use, the task drift is still useful because it suggests that model lineages do not merely specialise by domain. They may also move along capability layers. A base model associated with representation learning or translation can spawn descendants labelled for generation, classification, or downstream human-facing applications.

That affects evaluation design. If a derivative’s task label changes, the parent benchmark is not the benchmark. A company evaluating a child model for classification cannot rely on the parent’s text-generation reputation. A task label is not a decorative tag; it tells the governance system which tests should run.
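One way to wire that into intake is a lookup from the child's task label to the evaluation suites it triggers. In the sketch below the task labels follow Hugging Face naming, but the suite names, the `manual_triage` fallback, and the mapping itself are hypothetical.

```python
# Hypothetical suite names keyed by Hugging Face-style task labels.
SUITES = {
    "text-generation": ["generation_quality", "safety_prompts"],
    "text-classification": ["label_accuracy", "class_balance"],
    "feature-extraction": ["retrieval_recall"],
}


def evaluation_plan(parent_task, child_task):
    """Choose suites from the child's label and flag a change from the parent."""
    return {
        "task_changed": parent_task != child_task,
        # Unknown or missing labels fall back to a human deciding what to run.
        "suites": SUITES.get(child_task, ["manual_triage"]),
    }


plan = evaluation_plan("text-generation", "text-classification")
print(plan)
# {'task_changed': True, 'suites': ['label_accuracy', 'class_balance']}
```

The parent's task appears only in the change flag; it never selects the tests.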

What each part of the evidence is actually doing

The paper uses several forms of evidence. They should not be treated as interchangeable. The main results establish the mechanism. The appendices mostly test measurement choices or extend the picture.

Evidence in the paper	Likely purpose	What it supports	What it does not prove
1.86M-model Hugging Face dataset	Main evidence	The open-model ecosystem can be studied as a large, structured lineage graph using platform metadata and model cards.	That all real-world model dependencies are visible, complete, or correctly declared.
Family-tree examples and growth curves	Main evidence / descriptive framing	Model diffusion varies widely, with broadcast-like and multi-generation structures.	That a particular family structure is better, safer, or more commercially useful.
TF-IDF metadata similarity across family structures	Main evidence	Related models share public metadata traits more than random pairs; similarity varies by family position.	That model weights, performance, or latent behaviour are equally similar.
Sibling similarity exceeding parent-child similarity	Main evidence	Mutation is fast and directed rather than simple random inheritance from a parent.	The precise cause of that directionality; incentives and tooling are plausible mechanisms, not directly isolated causal factors.
Licence drift graph	Main evidence	Licence labels systematically move from restrictive/commercial terms toward permissive or copyleft labels.	That every such case is legally invalid or intentionally non-compliant.
Model-card length and autogenerated-text analysis	Main evidence	Documentation tends to thin while templated or automatically generated documentation increases in derivatives.	That every short card is bad or every long card is trustworthy. Length is a proxy, not assurance.
Language drift graph	Main evidence	Language claims specialise and drift strongly toward English.	That actual multilingual capability disappears exactly as metadata claims change.
Task drift graph	Exploratory extension with strong descriptive signal	Task labels mutate in an ordered pattern that may mirror stages of model adaptation and deployment.	A universal lifecycle law. The paper itself treats this as a hypothesis needing more research.
Appendix similarity measures: bag-of-words, Levenshtein, model-card comparisons	Robustness / sensitivity test	The broad similarity pattern does not depend entirely on one text-similarity metric.	That text-based similarity captures all meaningful technical similarity.
Structural virality appendix	Exploratory extension	Many graphs are broadcast-like, but some lineages reach nearly 40 generations.	That deep lineage alone implies quality, adoption value, or risk.
Documentation availability and tooling appendix	Exploratory ecosystem context	Endpoint compatibility, safetensors use, Spaces automation, and DOI association give further signals about platform adoption and visibility.	That these features causally drive downloads or trust.

This evidence stack is useful because it prevents the lazy reading: “models evolve, therefore everything is biology now.” No. The paper’s strongest claim is narrower and more practical: declared model traits change directionally along recorded family trees.

The business implication is lineage-aware model governance

The paper directly shows that Hugging Face model families have measurable public resemblance and measurable public mutation. Cognaptus’s business inference is that model governance should become lineage-aware.

That means a model intake process should not ask only: “What does this model claim today?” It should ask: “What did its parent claim, what changed, and are those changes allowed, tested, and documented?”

A minimal operating framework looks like this:

Governance object	What the paper directly shows	Cognaptus inference for business use	Remaining uncertainty
Licence	Licence labels mutate in about 14.98% of observed licence inheritances, often toward more permissive labels.	Treat derivative licences as claims requiring parent-compatibility review, not as final answers.	Legal validity depends on actual licence text, derivative facts, jurisdiction, and use case.
Documentation	Child model cards are roughly half the parent-card size on average; autogenerated markers rise in derivatives.	Score documentation at the child-model level; require missing evaluation, risk, and intended-use details before production use.	Length and autogenerated wording are imperfect proxies for documentation quality.
Language support	Language claims specialise and drift strongly toward English.	Require derivative-specific multilingual testing for markets outside English-first use.	Metadata may understate or overstate actual language performance.
Task label	Task mutation is high and strongly ordered.	Trigger evaluation suites from the derivative’s actual task label, not the parent’s reputation.	Task labels may be coarse, inconsistent, or strategically chosen by uploaders.
Lineage	Related models resemble each other, but siblings can resemble each other more than they resemble their parent.	Store lineage as a first-class registry field and monitor sibling clusters for shared risks.	Hidden lineage and undisclosed reuse remain outside the observed graph.

This is not bureaucratic ornamentation. It changes practical review.

A derivative model with a permissive licence but a restrictive upstream parent should enter legal review. A derivative with a famous multilingual ancestor but English-only metadata should enter localisation review. A derivative with a short autogenerated model card should enter documentation review. A derivative with a changed task label should enter evaluation review.

The review should follow the mutation, not the marketing page.

The misconception: derivatives do not simply inherit governance from their ancestors

The reader misconception worth killing is simple: a fine-tuned model inherits the important public facts of its base model unless the uploader says otherwise.

The paper suggests the opposite default. The uploader often does say otherwise, directly or indirectly, through changed metadata, changed licence labels, shortened cards, narrower language tags, and new task labels. The derivative may still be related to the parent. It may even be technically useful because it is related to the parent. But the governance record is its own object.

This matters because open-model adoption often borrows trust from upstream reputation. A team sees a familiar base family and mentally imports the parent’s credibility. That is understandable. It is also how small governance errors get promoted into production.

A famous parent does not make the child compliant. A multilingual parent does not make the child multilingual. A well-documented parent does not make the child well documented. A permissive-looking child licence does not erase upstream restrictions. The family resemblance is real, but it is not a warranty.

What a better model registry would record

Most enterprise AI registries still look like asset inventories. A better one would look more like a lineage observatory.

At intake, each model entry should record:

declared parent model or base model;
relationship type, such as fine-tune, adapter, quantisation, merge, or unknown;
parent licence and child licence;
licence compatibility status;
parent and child task labels;
parent and child language claims;
parent and child documentation score;
evaluation suites triggered by the child’s task and deployment context;
whether the model card appears autogenerated or materially incomplete;
lineage confidence, especially when dependency claims are missing or ambiguous.
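The fields above can be sketched as a single intake record. The class name, field names, and default values below are illustrative assumptions rather than a standard schema, and the `changed_fields` helper is a hypothetical convenience for spotting declared traits that differ from the parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegistryEntry:
    """One intake record; field names are illustrative, not a standard schema."""
    model_id: str
    parent_id: Optional[str] = None
    relationship: str = "unknown"  # fine-tune, adapter, quantisation, merge, unknown
    parent_licence: Optional[str] = None
    child_licence: Optional[str] = None
    licence_compatibility: str = "unreviewed"
    parent_task: Optional[str] = None
    child_task: Optional[str] = None
    parent_languages: List[str] = field(default_factory=list)
    child_languages: List[str] = field(default_factory=list)
    parent_doc_score: Optional[float] = None
    child_doc_score: Optional[float] = None
    evaluation_suites: List[str] = field(default_factory=list)
    card_autogenerated: bool = False
    lineage_confidence: str = "low"  # assumed default until the parent is verified

    def changed_fields(self):
        """Names of declared traits that differ between parent and child."""
        changed = []
        if self.parent_licence != self.child_licence:
            changed.append("licence")
        if self.parent_task != self.child_task:
            changed.append("task")
        if set(self.parent_languages) != set(self.child_languages):
            changed.append("languages")
        return changed


entry = RegistryEntry(
    model_id="org/base-ft-a",  # made-up IDs
    parent_id="org/base",
    relationship="fine-tune",
    parent_licence="llama3",
    child_licence="apache-2.0",
    parent_task="text-generation",
    child_task="text-generation",
    parent_languages=["en", "fr"],
    child_languages=["en"],
)
print(entry.changed_fields())  # ['licence', 'languages']
```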

This does not require heroic tooling. Much of it can be implemented as structured intake plus automated checks. The hard part is cultural: teams must stop treating “open” as a substitute for “understood.”

The paper’s ecological metaphor helps here. In an ecosystem, risk is not only inside the organism. It is in transmission, adaptation, and niche pressure. In an open-model ecosystem, the same applies. Licences relax because openness is rewarded. Documentation thins because speed is rewarded. English dominates because market demand and platform defaults reward it. Task labels move because downstream use rewards specialisation.

The mutation is not a bug in the ecosystem. It is the ecosystem doing what incentives tell it to do. Naturally, the compliance department arrives afterwards with a clipboard and a headache.

Boundaries: what this paper cannot settle

The paper is strong because it is large and careful about what it measures. It is also bounded in ways that directly affect operational interpretation.

First, it captures only relationships logged on Hugging Face. Many models may be related without declaring that relationship. Some families may therefore be fragmented. Related models released in different sizes may be treated as separate base models. Hidden reuse remains hidden, which is inconveniently consistent with the word “hidden.”

Second, the genetic material here is metadata and model-card text. Those are important public artefacts, especially for governance, but they are not weights, training data, source code, or full repository configuration. The paper does not claim that metadata similarity equals behavioural similarity.

Third, platform changes can affect reporting. The authors note that Hugging Face’s createdAt field was introduced in March 2022 and back-filled for existing models. Interface changes, autogenerated documentation features, and evolving metadata conventions can change what developers report without necessarily changing the underlying model.

Fourth, the fine-tuning similarity analysis deliberately excludes merges, adaptations, and quantisations to preserve tree-like structure. That makes the analysis cleaner, but it leaves out important forms of open-model reuse. Merges are especially important because they can combine lineages. The authors suggest that merges may eventually produce a very different graph structure, potentially connecting much of the ecosystem.

Fifth, the paper is a snapshot. Ecosystems move. A lineage analysis in six months might expose different dominant families, licences, and documentation practices.

These limitations do not weaken the business takeaway. They sharpen it. If visible metadata already mutates this much, invisible dependencies and off-platform reuse are not reasons to relax governance. They are reasons to stop pretending the registry sees everything.

The useful biology is the cold kind

The evolutionary metaphor works because it explains why governance facts drift. It does not work if it becomes poetry.

The paper’s useful idea is not that AI models are alive, or that Hugging Face is a rainforest, or that your compliance team needs a field guide and khaki shorts. The useful idea is that derivative models form lineages, lineages transmit traits, and traits mutate under pressure.

For operators, that turns model selection into a lineage problem. You are not only choosing a model. You are choosing the consequences of its ancestry and the mutations introduced downstream.

That is the uncomfortable but productive lesson. Open-model ecosystems scale because they let people copy, adapt, and specialise quickly. The same mechanism that creates value also rewrites the public signals companies rely on for trust. Licence, language, documentation, and task claims are not fixed at birth. They evolve.

So the next time a team says, “It is based on a reputable model,” the correct answer is not congratulations.

It is: show the family tree.

Cognaptus: Automate the Present, Incubate the Future.

¹ Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg, “Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face,” arXiv:2508.06811, submitted 9 August 2025, https://doi.org/10.48550/arXiv.2508.06811.
