Clustering Without Amnesia: Why Abstraction Keeps Fighting Representation

A customer database looks harmless until someone asks for “natural segments.”

Then the ritual begins. Export the data. Pick a clustering algorithm. Reduce the dimensions. Make a pretty 2D plot. Give each blob a name. “Premium convenience buyers.” “Budget explorers.” “Dormant loyalists.” Everyone nods, because blobs are comforting. Business strategy has survived on worse.

The problem is that clustering is not blob decoration. It is a fight between two incompatible demands: forget enough detail to see groups, but remember enough detail not to merge different things into one convenient fiction. Claudia Plant, Lena G. M. Bauer, and Christian Böhm frame this fight directly in their AAAI 2026 tutorial, Clustering High-dimensional Data: Balancing Abstraction and Representation.¹ The paper is not trying to introduce one new champion algorithm. Its more useful contribution is diagnostic: it explains why classical clustering, subspace clustering, autoencoder-based methods, contrastive clustering, generative clustering, and future hybrids all sit at different points in the same tension.

That makes the paper more valuable for business users than another leaderboard would be. In real systems, clustering is usually used before anyone knows the labels. The model is asked to discover hidden structure in customers, images, documents, transactions, products, tickets, or anomalies. There is no teacher standing nearby with the answer key. Worse, high-dimensional data can often be grouped in multiple meaningful ways. A product catalog may cluster by price, use case, brand identity, material, margin, seasonality, or visual style. A customer base may cluster by spending level, product taste, acquisition channel, churn risk, or support burden. “The cluster structure” may not exist. Several plausible structures may compete for attention.

So the central question is not “Which clustering method is best?” That is too simple, and therefore dangerous. The better question is: what does this method force the model to forget, and what does it force the model to preserve?

Clustering Is Not Representation Learning With a Hat

The first misconception to kill is the polite one: if a deep model learns a richer representation, clustering should improve automatically.

Not necessarily. Representation learning and clustering want different things.

Representation learning tries to place objects in a space where meaningful differences remain visible. It creates separation. The paper describes this as a kind of repelling force: different objects should occupy different locations in the latent space because their differences may matter. Clustering, by contrast, creates an attractive force: objects in the same group should be pulled together, and unimportant differences should be suppressed. That suppression is not a bug. It is the point.

The conflict appears immediately in high-dimensional data. If the representation preserves every detail, grouping becomes hard because the model sees too much. Background, lighting, noise, rare attributes, and incidental correlations can dominate the geometry. If the representation abstracts too aggressively, grouping also fails because the model forgets distinctions that actually define the clusters.

The paper’s running example uses a subset of the German Traffic Sign Recognition Benchmark (GTSRB). Each image has roughly 3,000 pixel dimensions. The task is to recover groups corresponding to different sign types. This is a useful example because the human eye immediately understands the desired abstraction: background and lighting should matter less; sign identity should matter more. The algorithm does not receive that instruction. It has to infer the relevant level of forgetting.

That is why the paper’s mechanism-first framing matters. High-dimensional clustering fails not because algorithms are stupid in the abstract, but because each algorithm makes a specific bargain between abstraction and representation. Some bargains are cheap and interpretable. Some are expressive and expensive. Some look beautiful while quietly deleting the evidence.

K-Means Forgets Too Much, But in the Wrong Space

K-means is a useful starting point because its bargain is brutally simple. It represents each cluster by a mean vector. That creates strong abstraction: individual details are averaged away. For low-dimensional, roughly spherical clusters, this can be useful. For high-dimensional images, it is mostly optimism with arithmetic.
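To make the "mean vector" bargain concrete, here is a minimal sketch of Lloyd's algorithm in NumPy. It is an illustration on synthetic two-dimensional blobs, not the setup used in the paper; the function name `kmeans`, the seed, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: assign to the nearest mean, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (assumption): two compact, well-separated blobs in 2D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```

Everything a cluster "knows" about its members is the single row in `centers`. That works here because Euclidean distance in this toy space already matches the grouping; the sketch says nothing about spaces where it does not.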

In the paper’s traffic-sign example, K-means in the original high-dimensional pixel space reaches an NMI of 0.28. NMI, or Normalized Mutual Information, compares assigned clusters with known benchmark labels on a 0-to-1 scale, where 1 means perfect agreement. The number is not the main lesson; the mechanism is. K-means abstracts strongly, but it does so inside a poor representation. Pixel-level distance is not the same thing as sign-level similarity. The algorithm is trying to average its way out of a representation problem.
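For readers who want to see what an NMI score measures, the following is a small pure-Python sketch. It uses the arithmetic-mean normalization, which is one of several common variants; which variant the paper uses is not stated here, so treat the choice as an assumption. The function names and the toy labelings are illustrative.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

def nmi(a, b):
    """NMI with arithmetic-mean normalization: 2 * I(A;B) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labelings put everything in one group
    return 2 * mutual_information(a, b) / (h_a + h_b)

truth     = [0, 0, 0, 1, 1, 1]
renamed   = [1, 1, 1, 0, 0, 0]   # same grouping, different cluster ids
unrelated = [0, 1, 2, 0, 1, 2]   # statistically independent of truth

score_renamed = nmi(truth, renamed)
score_unrelated = nmi(truth, unrelated)
```

The score depends only on which points are grouped together, not on the cluster ids, so `renamed` agrees perfectly with `truth` while `unrelated` carries no information about it.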

PCA, the obvious rescue attempt, does not solve the issue. A global 2D PCA projection of the original image space does not reveal the desired cluster structure because PCA assumes a global linear simplification. The data is not one tidy Gaussian cloud waiting to be flattened. It contains several groups, and the useful directions for separating them may be local, nonlinear, or entangled with nuisance variation.
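A toy case shows how a global linear projection can follow nuisance variation instead of group structure. The sketch below is a constructed example, not the traffic-sign data: one synthetic feature has large variance and no group information, the other has small variance and all of it. The variable names and the `separation` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
group = np.repeat([0, 1], n // 2)
# Feature 0: high-variance nuisance (say, lighting). Feature 1: the group signal.
nuisance = rng.normal(0.0, 10.0, size=n)
signal = np.where(group == 0, -1.0, 1.0) + rng.normal(0.0, 0.2, size=n)
X = np.column_stack([nuisance, signal])

# PCA via SVD of the centered data; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first_direction = Vt[0]
projection = Xc @ first_direction  # 1D PCA projection

def separation(values, group):
    """Gap between group means, measured in pooled standard deviations."""
    a, b = values[group == 0], values[group == 1]
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

sep_pca = separation(projection, group)
sep_signal = separation(X[:, 1], group)
```

PCA ranks directions by variance, so `first_direction` lines up with the nuisance feature and the groups overlap in the projection, even though they are cleanly separated along the second feature.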

This is the first business warning. Simple clustering is not “bad.” It is often cheap, interpretable, and low-carbon. The paper explicitly notes that classical methods can run on a single CPU, have relatively small parameter burdens, and produce results in the original feature space. But those strengths become liabilities when the feature space itself is wrong. If your customer, document, or image representation is poor, K-means will confidently summarize the wrong geometry. Very efficient nonsense is still nonsense. It just invoices less.

Subspace Methods Try to Remember the Right Features

Subspace clustering improves the bargain by asking whether groups appear only in some dimensions or linear combinations of dimensions. Instead of treating all features as equally relevant, these methods search for spaces where clusters become visible.

Axis-parallel subspace methods, such as CLIQUE-style approaches, are attractive because they retain interpretability. If a cluster exists in a subset of original features, domain experts can inspect those features. In business terms, that matters. A segmentation based on “discount sensitivity + weekday purchase frequency + category breadth” is easier to use than a segmentation based on latent dimension 47 having a spiritual crisis.

But the paper is careful about the boundary. Axis-parallel subspace methods can face exponential worst-case runtime in the number of features, and they may miss clusters that do not align with original feature axes. Other methods search arbitrarily oriented linear subspaces, often integrating local PCA into clustering. These can capture more structure, but they may assign each cluster its own subspace and lose the relationships among clusters.
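The exponential worst case is simple arithmetic: with d original features there are 2^d - 1 non-empty axis-parallel subspaces to consider. The sketch below enumerates them for three hypothetical feature names; bottom-up methods in the CLIQUE family prune this search, but the worst case remains exponential.

```python
from itertools import combinations

def n_axis_parallel_subspaces(d):
    """Number of non-empty subsets of d original features."""
    return 2 ** d - 1

# Hypothetical feature names, used only for illustration.
features = ["discount_sensitivity", "weekday_frequency", "category_breadth"]
subspaces = [subset
             for size in range(1, len(features) + 1)
             for subset in combinations(features, size)]

count_small = n_axis_parallel_subspaces(len(features))
count_20 = n_axis_parallel_subspaces(20)
count_100 = n_axis_parallel_subspaces(100)
```

Three features give 7 candidate subspaces, 20 features give just over a million, and 100 features give a number with 31 digits.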

The paper discusses common subspace clustering and Sub-K-Means as an attempt to learn a K-means clustering together with a linear subspace that supports that clustering. This helps with visualization and interpretation because clusters can be compared in the same learned space. Yet in the traffic-sign example, Sub-K-Means still reaches its limit: the ground-truth classes look slightly better separated than under PCA, but the clustering quality does not improve beyond standard K-means.

That result is not a failure of the subspace idea. It is a boundary condition. Linear subspaces are often useful for moderate-dimensional structured data, and the paper states that subspace methods typically work well up to around 100 dimensions. But image-like data with thousands of dimensions and nonlinear variation asks for more expressive representation learning.

So the mechanism moves one step: K-means forgot too much in a bad space; subspace methods try to find a better space while keeping interpretability; deep methods then offer much richer spaces, at a price.

Autoencoders Remember More, Which Is Not the Same as Clustering Better

An autoencoder learns to compress high-dimensional input into a latent representation and reconstruct the original input from it. This looks promising for clustering because it produces a lower-dimensional representation. In the paper’s example, running K-means on the autoencoder latent space improves NMI to 0.56.

That improvement is real, but it is also incomplete. The autoencoder has learned a representation useful for reconstruction, not necessarily for grouping. Its job is to remember enough information to rebuild the image. Clustering, however, needs selective amnesia. It should forget lighting and background while preserving sign identity. Reconstruction loss alone does not know which differences are nuisance and which are semantic.

Deep clustering methods therefore add clustering losses to representation learning. These losses explicitly enforce abstraction. DEC and IDEC are representative centroid-based methods. They encourage embeddings to move toward cluster centroids and use soft assignments so the model can be trained through backpropagation. Their objective includes a form of “cluster hardening”: make assignments clearer, make clusters sharper, make the latent space look decisively grouped.
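A minimal NumPy sketch of soft assignment and cluster hardening follows. It uses the Student-t kernel and squared-and-renormalized target distribution from the original DEC formulation as I understand it, not anything reproduced from the tutorial; the function names, the toy embeddings, and the fixed centroids are assumptions. In a real model the embeddings and centroids would be trained by backpropagation, which this sketch omits.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student-t similarity between embeddings and centroids, row-normalized."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Harden q: square it, divide by soft cluster size, renormalize each row."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return float(np.sum(p * np.log(p / q)))

# Toy embeddings (assumption): two points near each of two centroids.
z = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 0.0], [3.8, 0.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

q = soft_assign(z, centroids)   # current soft assignments
p = target_distribution(q)      # sharper targets the model is pushed toward
loss = kl_divergence(p, q)      # clustering loss to minimize
```

Minimizing `loss` pulls `q` toward the sharper `p`, which is the hardening step. Nothing in this objective asks whether the distinctions being squeezed out were meaningful.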

This is where the paper delivers its most useful correction. A latent space that looks more clustered is not automatically a better clustering.

In the traffic-sign example, DEC produces very compact-looking clusters but reaches only NMI 0.58. The paper explains that DEC over-abstracts: clusters contain mixed ground-truth labels. The representation has collapsed too much information. IDEC, which preserves the autoencoder reconstruction term and balances it with clustering loss, looks less clean but performs slightly better, with NMI 0.60.
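The difference between the two objectives can be written compactly. The notation below is an assumption for illustration and is not copied from the tutorial: f is the encoder, g the decoder, q_ij the soft assignment of object i to cluster j, p_ij the sharpened target, and γ the weight on the clustering term.

```latex
\begin{aligned}
L_{\text{clu}}  &= \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \\
L_{\text{rec}}  &= \sum_i \lVert x_i - g(f(x_i)) \rVert^2 \\
L_{\text{DEC}}  &= L_{\text{clu}} \\
L_{\text{IDEC}} &= L_{\text{rec}} + \gamma \, L_{\text{clu}}
\end{aligned}
```

With only the clustering term, nothing stops the encoder from discarding information that the decoder would have needed. Keeping the reconstruction term puts a price on that collapse, and γ sets how high the price is.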

The eye likes compact clusters. The metric, and more importantly the reconstructed images, tell a less flattering story. In the paper’s reconstruction comparison, DEC tends to reconstruct images as cluster centroids, losing individual differences inside the same cluster. DeepECT, a hierarchical method, retains more image-specific characteristics while still abstracting unnecessary detail.

This matters operationally. In business clustering, “clean segmentation” can be a trap. A dashboard with neat groups may be the product of excessive compression. It may have removed the very differences that determine churn, fraud, product preference, complaint escalation, or medical risk. The danger is not that the model fails loudly. The danger is that it gives a segmentation so tidy that nobody asks what was erased.

The Evidence Is a Tutorial Demonstration, Not a Production Benchmark

The paper’s numeric results are best read as illustrative evidence for the mechanism, not as a universal ranking of algorithms. That distinction matters.

| Test or comparison in the paper | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| K-means on roughly 3,000-dimensional traffic-sign pixels reaches NMI 0.28 | Main evidence for the curse of dimensionality and poor original representation | Strong abstraction alone is not enough when the representation is unsuitable | K-means is always unsuitable for business clustering |
| PCA and Sub-K-Means visualizations | Comparison with linear representation learning | Linear transformations can help interpretation but may not expose nonlinear cluster structure in image data | Subspace methods are weak in all domains |
| K-means on autoencoder latent space reaches NMI 0.56 | Main evidence for representation learning helping clustering | A learned latent space can improve clusterability | Reconstruction-optimized representations are automatically cluster-friendly |
| DEC reaches NMI 0.58 while producing very compact latent clusters | Main evidence for over-abstraction | Sharper-looking clusters can mix true classes when clustering loss dominates | DEC is useless; the example only shows a failure mode under this setup |
| IDEC reaches NMI 0.60 while keeping reconstruction loss | Comparison within centroid-based deep clustering | Retaining representation pressure can reduce collapse | Reconstruction loss alone solves the abstraction problem |
| DeepECT reaches NMI 0.80 with a hierarchical cluster tree | Main evidence for local, hierarchical abstraction | Gradual hierarchical abstraction can better balance grouping and retained structure in this example | Hierarchical deep clustering always wins across domains and constraints |
| Table comparing classical, subspace, deep, and hybrid methods | Summary framework | Method choice involves high-dimensional capacity, interpretability, carbon footprint, and parameterization | The plus/minus ratings are precise cost estimates |

This table is the right level of seriousness for the evidence. The paper is a tutorial and synthesis. It uses the traffic-sign example to make the mechanism concrete, not to issue a procurement ranking. Anyone converting the NMI values into a universal method leaderboard has already missed the point, which is impressive given that the point is printed rather clearly.

Hierarchies Make Abstraction Local Instead of Violent

The strongest part of the paper’s argument is not “deep clustering is better.” It is more specific: abstraction should often be local, hierarchical, and adaptive.

Centroid-based losses are easy to formalize because attraction to a centroid is differentiable and convenient for mini-batch training. Convenience, however, is not the same as adequacy. Pulling all members of a cluster toward one center can erase substructure too aggressively. Density-based and hierarchical methods are harder to formulate as differentiable losses, but they can preserve relationships among clusters, outliers, and local structure.

DeepECT is the paper’s main example. It grows a hierarchical cluster tree in embedded space and reaches NMI 0.80 on the traffic-sign subset. The figure is instructive because different levels of the hierarchy remain meaningful. One level corresponds to four sign types, while higher and lower levels distinguish other variations such as bright and dark backgrounds or lighting differences. That is exactly what good clustering should do: not merely output a flat label, but expose the scale at which a grouping becomes meaningful.

For business systems, this is a major design clue. Many segmentation problems are not naturally flat. Customers can be grouped first by lifecycle stage, then by product preference, then by price sensitivity. Documents can be grouped first by domain, then by topic, then by workflow intent. Fraud cases can be grouped first by attack pattern, then by channel, then by operational signature. A single forced partition may be useful for a monthly slide deck. It is rarely enough for decision automation.

Hierarchical clustering is not automatically better. The paper notes that hierarchical, density-based, and non-parametric deep methods remain less abundant than centroid-based approaches and often rely on limiting assumptions or parameter choices. But the direction is important: instead of treating abstraction as one global knob, future methods should learn where and how much to abstract.

Contrastive Methods Depend on the Meaning of Augmentation

Contrastive deep clustering offers another route. It starts from the idea that positive pairs should be close and negative pairs should be far apart. In supervised learning, labels can define those pairs. In unsupervised clustering, labels are unavailable, so methods often use augmentations. Two transformed versions of the same image become a positive pair; different images become negative pairs.
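One common way to turn "positives close, negatives far" into a loss is NT-Xent (normalized temperature-scaled cross-entropy). The NumPy sketch below is an illustration under that assumption; the methods discussed in the paper may use a different formulation, and the function name, temperature, and random toy embeddings are not from the source.

```python
import numpy as np

def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss for two augmented views; row i of each view is a positive pair."""
    z = np.vstack([view_a, view_b]).astype(float)
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # unit length: dot product = cosine
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                  # a sample is never its own negative
    n = len(view_a)
    # Index of each row's positive partner in the stacked batch.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Cross-entropy of each row against its positive, via a stable log-sum-exp.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_norm - sim[np.arange(2 * n), positives]))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
# Views that really are perturbed copies of the same objects.
aligned = nt_xent(a, a + 0.01 * rng.normal(size=a.shape))
# "Views" that are unrelated objects paired by accident.
shuffled = nt_xent(a, rng.normal(size=a.shape))
```

The loss is low when the two views of each object land close together and high when the pairing is arbitrary, so whatever the augmentation changes is exactly what the embedding is trained to ignore.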

The paper discusses Contrastive Clustering as an example. It uses a ResNet34 encoder producing a 512-dimensional vector, then separates the architecture into an instance-level head and a cluster-level head. The instance-level head preserves discrimination among instances. The cluster-level head pushes stronger abstraction by projecting into a space whose dimension equals the number of clusters and using softmax outputs that can be interpreted as cluster assignments.

This is a neat architecture because it makes the mechanism visible. One part of the model says, “remember the instance.” Another says, “compress toward the group.” The system’s success depends on whether the augmentations preserve the right semantic identity. For natural images, augmentations such as color, crop, or transformation can be meaningful if they preserve object identity. For business data, the equivalent is less obvious.

What is an augmentation of a customer? Removing one transaction? Masking a demographic field? Perturbing purchase timing? Translating a support ticket? Summarizing a document differently? These choices encode assumptions about what should be invariant. A contrastive clustering pipeline is only as sensible as those assumptions.
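To see how such a choice becomes code, here is a hypothetical augmentation for a customer record. The record layout, the field names, and the function `augment_customer` are all invented for illustration; the point is only that every line encodes a claim about what should not change a customer's segment.

```python
import copy
import random

def augment_customer(customer, rng, mask_fields=("region",)):
    """Return a perturbed copy: drop one transaction and mask the chosen fields."""
    view = copy.deepcopy(customer)  # never mutate the original record
    if len(view["transactions"]) > 1:
        # Assumption: losing any single transaction keeps the customer's identity.
        view["transactions"].pop(rng.randrange(len(view["transactions"])))
    for field in mask_fields:
        if field in view:
            # Assumption: this field is irrelevant to the segmentation at hand.
            view[field] = None
    return view

customer = {
    "region": "north",
    "transactions": [
        {"sku": "A12", "amount": 30.0},
        {"sku": "B07", "amount": 12.5},
        {"sku": "A12", "amount": 28.0},
    ],
}
rng = random.Random(0)
view_1 = augment_customer(customer, rng)
view_2 = augment_customer(customer, rng)
```

Treating `view_1` and `view_2` as a positive pair tells the model that region and any one purchase are noise. For a logistics segmentation, masking region would be exactly the wrong invariance.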

This is where business users often underestimate unsupervised learning. “No labels” does not mean “no judgment.” It means judgment has moved into architecture, loss design, feature construction, augmentation rules, and evaluation strategy. The human decision did not disappear. It put on a lab coat.

Generative Methods Turn Clusters Into Assumptions About Data Creation

Generative deep clustering takes a more explicit probabilistic route. Methods such as VaDE assume that clusters correspond to components of a latent distribution, often a Gaussian mixture. The model learns both the latent representation and the cluster assignment. This can be powerful, especially for multi-view data where a probabilistic model helps identify shared information across views.
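Under a Gaussian mixture assumption, a cluster assignment is a posterior probability over mixture components. The sketch below computes those posteriors for fixed latent points and fixed diagonal-covariance components; it illustrates only the mixture step, not VaDE's encoder, decoder, or training, and all names and numbers are assumptions.

```python
import numpy as np

def responsibilities(z, means, variances, weights):
    """Posterior p(cluster | z) under a diagonal-covariance Gaussian mixture."""
    diff = z[:, None, :] - means[None, :, :]
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_joint = np.log(weights)[None, :] + log_density
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Two assumed latent components with unit variance and equal prior weight.
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])

# Three latent points: near component 0, near component 1, and exactly between.
z = np.array([[0.1, -0.2], [2.9, 3.2], [1.5, 1.5]])
gamma = responsibilities(z, means, variances, weights)
```

The midpoint gets a 50/50 posterior, which is the kind of uncertainty a hard partition would hide. It is also only as meaningful as the assumption that the latent clusters are Gaussian in the first place.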

But the bargain is again visible. A generative model improves structure by assuming how data is produced. VaDE uses a variational autoencoder framework, with reconstruction and regularization terms. ClusterGAN uses a GAN architecture and adds an encoder so generated objects must map back to latent codes that include cluster identifiers. In ClusterGAN, stronger weighting on the encoder enforces clearer cluster separation, but the paper notes that this can come at the expense of generating realistic samples.

That sentence is the whole abstraction-representation conflict in miniature. Push harder for clean cluster separation, and you may damage the fidelity of generated data. Preserve realism too much, and cluster separability may weaken. The loss weights are not merely technical hyperparameters. They are policy choices about what the system values.

For production teams, generative clustering can be attractive when the data has a plausible latent generative story, when uncertainty matters, or when multiple views must be reconciled. It is less attractive when the assumed prior is mainly a mathematical convenience and the business user will mistake it for discovered truth. A Gaussian mixture in latent space is a useful assumption. It is not a confession from reality.

The Operational Choice Is a Trade-Off Matrix, Not a Beauty Contest

The paper’s summary table compares classical, subspace, deep, and future hybrid methods across high-dimensional capacity, interpretability, carbon footprint, and parameterization. The exact plus/minus marks are qualitative, but the operational message is clear.

| Method family | What it tends to preserve | What it tends to forget | Operational strength | Operational risk |
| --- | --- | --- | --- | --- |
| Classical clustering | Original feature meaning and simple geometry | Complex nonlinear structure | Cheap, interpretable, easier to parameterize | Breaks down on high-dimensional complex data |
| Subspace clustering | Relevant feature subsets or linear projections | Nonlinear relationships and sometimes relationships across clusters | More interpretable than deep methods; no GPU requirement in many cases | Struggles with very high-dimensional data and non-axis-aligned structure |
| Autoencoder-based deep clustering | Nonlinear latent structure | Potentially either cluster structure or individual distinctions, depending on the loss | Stronger for thousands of dimensions | Lower interpretability, higher carbon cost, harder tuning |
| Contrastive clustering | Invariances induced by augmentation | Details treated as augmentation noise | Powerful for data types with meaningful augmentations | Bad augmentations encode bad assumptions |
| Generative clustering | A probabilistic story of latent data generation | Structures outside the chosen prior | Useful for uncertainty and multi-view settings | Training instability, overfitting risk, prior misspecification |
| Hybrid future methods | Adaptive local balance | Ideally less unnecessary information | Promise of better performance, efficiency, and interpretability | Still a research direction, not a packaged guarantee |

The business implication is not “use deep clustering.” It is more conditional.

Use classical methods when the original features are meaningful, dimensionality is manageable, and interpretability or cost dominates. Use subspace methods when clusters likely exist in feature subsets or linear projections that domain experts can inspect. Use deep clustering when the raw feature space is not semantically useful and nonlinear representation learning is necessary. Use contrastive methods when you can define augmentations that preserve the business identity you care about. Use generative methods when a latent probabilistic structure is plausible and uncertainty is worth modeling.

And when none of these statements feels safe, that is not indecision. That is the problem telling you it needs an evaluation protocol before it needs a model.

What Cognaptus Would Infer for Business Practice

The paper directly shows a conceptual framework and tutorial evidence: high-dimensional clustering requires balancing abstraction and representation; richer representations need explicit abstraction; too much abstraction can collapse meaningful distinctions; future methods should balance the trade-off more locally and adaptively.

For business use, Cognaptus would infer four practical rules.

First, evaluate representations, not just cluster labels. If a model produces clusters, inspect what information has been preserved and what has been erased. Reconstruction quality, nearest-neighbor examples, cluster exemplars, counterexamples, and local explanations can reveal whether the clustering is meaningful or merely compressed.

Second, treat clean visual separation as suspicious until validated. A 2D latent plot is a communication artifact, not a legal document. DEC’s example in the paper is the warning: compact clusters can still mix ground-truth classes. In business settings without labels, the equivalent validation must come from expert review, downstream task performance, stability checks, or intervention results.
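A stability check can be as simple as comparing the labelings from two runs, for example with different seeds or resampled data. The sketch below uses the plain Rand index in pure Python; it is not chance-corrected, and the adjusted variant is usually preferable in practice. The function name and the three toy runs are assumptions for illustration.

```python
from itertools import combinations

def rand_index(a, b):
    """Share of point pairs on which two labelings agree (together vs. apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

run_1 = [0, 0, 0, 1, 1, 1, 2, 2]
run_2 = [2, 2, 2, 0, 0, 0, 1, 1]   # same grouping, cluster ids permuted
run_3 = [0, 0, 1, 1, 1, 2, 2, 2]   # two points changed cluster

same_grouping = rand_index(run_1, run_2)
drifted = rand_index(run_1, run_3)
```

If repeated runs on the same data disagree on which customers belong together, the tidy plot from any single run deserves less trust than it appears to earn.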

Third, separate the “cluster-relevant” and “everything else” information when possible. The paper highlights approaches such as ACe/DeC that split latent space into a clustered space for grouping-relevant information and a shared space for other variation. This is especially relevant for business datasets where many differences are real but not relevant to the current decision. A customer’s geography may matter for logistics segmentation but not for product-taste segmentation. A support ticket’s tone may matter for escalation but not for issue taxonomy.

Fourth, prefer local abstraction over one global compression rule. Some clusters need more internal diversity preserved; others can be summarized tightly. Some data points are typical; others are boundary cases or outliers. Hybrid methods that adapt the abstraction-representation balance by dataset, cluster, or object are still a research outlook in the paper, but the design principle is already usable.

Boundaries Before Procurement Gets Excited

The paper is a tutorial, not a new empirical benchmark. Its numeric examples are valuable because they clarify mechanisms, not because they settle algorithm choice for every dataset. The traffic-sign case is image-based and label-evaluable; many business clustering tasks do not have labels, and their “correct” grouping may depend on the decision being made.

The paper also does not remove the practical headaches of clustering. Choosing the number of clusters, defining meaningful augmentations, setting loss weights, selecting priors, handling outliers, and explaining clusters to nontechnical users remain hard. Deep clustering adds representational power, but it also adds parameterization difficulty, compute cost, and interpretability loss. The authors explicitly point to the carbon footprint and tuning burden of deep methods. That matters in ordinary companies, where “just train another model” often means “just spend another month and another budget line.”

Finally, business value depends on what happens after clustering. A beautiful taxonomy that does not change pricing, routing, recommendation, risk review, product design, or knowledge retrieval is just office wallpaper. Clustering should be evaluated by whether it improves a decision process, not merely by whether it produces names for groups.

The Real Lesson: Clustering Needs Selective Memory

The phrase “clustering without amnesia” sounds contradictory because clustering always requires forgetting. The trick is to forget the right things.

K-means forgets aggressively but may do so in the wrong representation. Autoencoders remember richly but may not group. Centroid-based deep clustering can make the latent space look wonderfully clean while erasing distinctions that matter. Hierarchical and density-based methods suggest a more careful route: abstract locally, preserve relationships, and expose structure at multiple levels. Contrastive and generative methods show the same conflict through different machinery: augmentations and priors are just ways of deciding what the model should treat as stable.

For practitioners, the paper’s best contribution is not a method recommendation. It is a diagnostic vocabulary. When a clustering system fails, ask whether it failed because it remembered too much, forgot too much, or remembered the wrong things. That question is more useful than staring at a 2D plot and hoping the colors have a strategy department.

The future direction the authors sketch is hybrid clustering: methods that combine the efficiency and interpretability of subspace approaches with the expressive power of deep learning, while adapting the abstraction-representation balance at runtime. Not globally, not once, and not by pretending every data point has the same needs. The trade-off may need to be local to the dataset, the cluster, or even the individual object.

That is the right ambition. Because in business, as in machine learning, intelligence is rarely about remembering everything. It is about knowing what can be safely forgotten.

Cognaptus: Automate the Present, Incubate the Future.

¹ Claudia Plant, Lena G. M. Bauer, and Christian Böhm, “Clustering High-dimensional Data: Balancing Abstraction and Representation Tutorial at AAAI 2026,” arXiv:2601.11160, submitted January 16, 2026. https://arxiv.org/abs/2601.11160
