A pricing team gives an LLM several hundred property listings and asks a sensible question: Which characteristics help predict the selling price?
The model returns an equally sensible list. Swimming pools. Granite countertops. Scenic views. Green lawns. Kitchen islands.
Everything sounds plausible. That is the problem.
The list describes what generally makes a house attractive. It does not necessarily describe what separated expensive from inexpensive houses in this particular collection, sold in particular locations, during a particular year. The LLM has supplied real-estate conventional wisdom when the business needed dataset-specific evidence.
GenZ, a hybrid modeling framework proposed by Marko Jojic and Nebojsa Jojic, begins from this mismatch.[1] Instead of asking a foundation model to invent useful features from its domain knowledge, GenZ first lets a statistical model determine which items need to be separated. Only then does it ask the foundation model to explain what those groups have in common.
The reversal is small enough to fit in one sentence:
Statistics discovers the split; the foundation model gives the split a name.
That division of labor is the paper’s central contribution. It also explains why GenZ performs better than zero-shot LLM-generated features in the paper’s house-pricing and movie-recommendation experiments.
The usual workflow asks the LLM to guess the right questions
Many interpretable machine-learning systems place human-readable concepts between raw inputs and final predictions.
A house might be represented by features such as:
- located in a coastal area;
- has a swimming pool;
- includes a three-car garage;
- uses modern interior finishes.
A movie might be represented by:
- is science fiction;
- has a female lead;
- belongs to a franchise;
- has a tragic ending.
Once those features exist, a conventional statistical model can estimate how they relate to price, demand, user preference, or another target.
Foundation models make feature creation easier because they can inspect text, images, or retrieved information and answer questions about an item. An LLM can classify whether a movie belongs to a franchise. A multimodal model can inspect a property photograph and identify plantation shutters.
The difficult part is choosing which questions deserve to be asked.
A knowledgeable LLM can propose a polished feature list from the task description. Yet that list will usually reflect patterns emphasized in its training data: common genres, familiar property amenities, standard marketing categories, and other generally accepted distinctions.
Proprietary datasets often care about stranger things.
Customers from one period may cluster around particular actors, composers, release windows, or prestige signals. Property prices in a small sample may depend heavily on exact ZIP codes, construction type, or dated bathroom fixtures. These relationships may be meaningful only for a specific market, cohort, or moment.
Prompt engineering does not give an LLM access to statistics it has never learned.
GenZ lets prediction errors decide what distinction is missing
GenZ represents each item using latent binary features.
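For an item with semantic description $s$, the system maintains a binary vector:

$$\mathbf{z} = (z_1, z_2, \ldots, z_K), \qquad z_i \in \{0, 1\},$$

where $K$ is the current number of features and each $z_i$ indicates whether a human-readable feature applies.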
A foundation model receives a feature description—such as “belongs to a major film franchise”—and judges whether it applies to the item. Crucially, GenZ does not treat that judgment as unquestionable truth. It treats the answer as a noisy observation of the latent feature.
The statistical portion then predicts a target $\mathbf{y}$ from those latent features through a conditional model:

$$p(\mathbf{y} \mid \mathbf{z}).$$
The target can be a single real value, such as the logarithm of a house price. It can also be a multidimensional vector, such as a movie embedding derived from collaborative-filtering behavior.
This matters because multidimensional errors are difficult to explain to an LLM directly. Telling a model that one movie’s predicted 32-dimensional embedding is wrong is not particularly informative. A vector does not arrive with a convenient sentence explaining which cultural preference was missed.
GenZ instead uses the statistical posterior to ask a simpler question: Which items should receive different values for one binary feature so that the target becomes easier to predict?
The statistical model can answer that question without understanding movies, houses, or customers.
The foundation model can then inspect representative items from the two resulting groups and propose a semantic description that separates them.
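One way to picture this step, as a minimal sketch and not the paper's inference procedure: treat the prediction residuals as vectors and split the items along the direction in which those residuals vary most. The function name `propose_split`, the synthetic data, and the use of a principal direction are all illustrative assumptions; GenZ itself derives the assignment from its posterior.

```python
import numpy as np

def propose_split(residuals):
    """Split items by the sign of their projection on the top residual direction.

    Illustrative heuristic only (an assumption, not GenZ's posterior update).
    """
    centered = residuals - residuals.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0] > 0

# Synthetic targets: a hidden binary property shifts a 4-dimensional target.
rng = np.random.default_rng(0)
hidden = np.arange(200) % 2 == 0
shift = np.array([2.0, -1.0, 0.5, 0.0])
targets = hidden[:, None] * shift + 0.1 * rng.standard_normal((200, 4))

# Residuals of an intercept-only model are the centered targets.
residuals = targets - targets.mean(axis=0)
split = propose_split(residuals)

# The sign of a principal direction is arbitrary, so score up to a label flip.
agreement = float(np.mean(split == hidden))
agreement = max(agreement, 1.0 - agreement)
```

The split is found without any knowledge of what the items are. The semantic interpretation is a separate, later step.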
How a feature is born: split first, describe second
When GenZ adds a new feature, it initially disconnects that feature from the foundation model’s opinion. Its classification-error probability is set to maximum uncertainty, effectively allowing the statistical model to assign the feature however it best reduces prediction error.
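The effect of that setting can be seen in a small Bayes-rule sketch. It assumes a symmetric error probability `eps` (the foundation model's answer is wrong with probability `eps` whatever the true value is) and takes the likelihood of the observed target under each feature value as given. Both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def posterior_z(fm_says_yes, eps, lik_z1, lik_z0, prior=0.5):
    """P(z = 1 | foundation-model answer, observed target).

    eps     : assumed symmetric probability that the answer is wrong
    lik_z1  : likelihood of the observed target if z = 1
    lik_z0  : likelihood of the observed target if z = 0
    """
    p_answer_given_z1 = 1.0 - eps if fm_says_yes else eps
    p_answer_given_z0 = eps if fm_says_yes else 1.0 - eps
    numerator = prior * p_answer_given_z1 * lik_z1
    denominator = numerator + (1.0 - prior) * p_answer_given_z0 * lik_z0
    return numerator / denominator

# With eps = 0.5 the answer carries no information: only the target matters.
disconnected_yes = posterior_z(True, 0.5, lik_z1=0.8, lik_z0=0.2)
disconnected_no = posterior_z(False, 0.5, lik_z1=0.8, lik_z0=0.2)

# With a small eps and an uninformative target, the answer dominates.
trusted_yes = posterior_z(True, 0.05, lik_z1=1.0, lik_z0=1.0)
```

Under these assumptions, `eps = 0.5` makes the posterior identical whether the foundation model says yes or no, which is what "disconnecting" the feature amounts to.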
The feature-discovery cycle then proceeds as follows:
- Fit the current statistical model. Existing features are used to predict the observed targets.
- Find a useful binary split. The new latent feature is assigned across items according to which division improves the statistical fit.
- Select contrasting examples. Representative positive and negative items are sampled from the two sides of the split.
- Ask the foundation model to name the distinction. A mining prompt asks what characteristic separates the positive examples from the negative examples.
- Apply the new description to every item. A separate extraction prompt classifies whether the discovered feature applies to each item.
- Refit the model and its uncertainty. GenZ updates the statistical mapping, prediction variances, posterior feature assignments, and the estimated reliability of the semantic feature.
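The steps above can be sketched as a single pass over a toy dataset. This is a deliberately simplified illustration: `mine_description` and `extract` are hypothetical keyword-matching stand-ins for the two foundation-model prompts, the split is just the sign of the residual, and the "reliability" is a plain agreement rate. None of these choices is taken from the paper.

```python
from itertools import product
import numpy as np

VOCAB = ["garage", "lawn", "pool", "shutters"]

def fit_residuals(X, y):
    """Least-squares fit of y on X; returns the residuals."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ w

def mine_description(positives, negatives):
    """Hypothetical stand-in for the mining prompt: the word most common
    among positive examples relative to negative ones."""
    def score(word):
        return (sum(word in p.split() for p in positives)
                - sum(word in n.split() for n in negatives))
    return max(VOCAB, key=score)

def extract(description, item):
    """Hypothetical stand-in for the extraction prompt."""
    return description in item.split()

def discovery_step(items, y, X, k=5):
    resid = fit_residuals(X, y)                       # 1. fit the current model
    split = resid > 0                                 # 2. crude split: under-predicted items
    order = np.argsort(resid)
    negatives = [items[i] for i in order[:k]]         # 3. contrasting examples
    positives = [items[i] for i in order[-k:]]
    description = mine_description(positives, negatives)  # 4. name the distinction
    column = np.array([extract(description, it) for it in items])  # 5. apply everywhere
    X_new = np.column_stack([X, column.astype(float)])
    new_resid = fit_residuals(X_new, y)               # 6. refit
    agreement = float(np.mean(column == split))       # rough proxy for reliability
    return (description, X_new,
            float(np.mean(resid ** 2)), float(np.mean(new_resid ** 2)), agreement)

# Toy listings: every combination of four words; price depends mostly on "pool".
items = [" ".join(w for w, on in zip(VOCAB, bits) if on)
         for bits in product([0, 1], repeat=4)]
y = np.array([1.0 + 2.0 * ("pool" in it.split()) + 0.1 * ("garage" in it.split())
              for it in items])
X0 = np.ones((len(items), 1))                         # intercept only
description, X1, mse_before, mse_after, agreement = discovery_step(items, y, X0)
```

In this toy run the stand-in names the split "pool" because that word, not any prior belief about houses, is what separates the under-predicted listings from the over-predicted ones.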
The paper formulates this as a generalized expectation-maximization process. Familiar numerical parameters can be optimized through statistical updates. The less tidy semantic parameter—the text describing a feature—is updated through foundation-model proposals guided by the posterior.
The process does not require changing the foundation model’s weights. In the reported experiments, GPT-5 performs semantic mining, while GPT-4 applies the discovered descriptions across items.
“Frozen model,” however, should not be confused with “free model.” The workflow still requires repeated inference calls across multiple discovery and extraction cycles. The paper does not report the resulting API cost.
Targeted discovery stops one feature from explaining several unrelated mistakes
Simple residual splitting can become confused when existing features interact.
Suppose a property model has already learned whether a house is located in an Arizona desert community. When discovering the next feature, the remaining errors among Arizona houses might be explained by swimming pools, while errors among non-Arizona houses might be explained by construction quality.
A single unrestricted split could mix both patterns and ask the LLM to invent one strained description covering them. Foundation models are rather talented at producing strained descriptions when invited.
GenZ addresses this through conditional feature addition. It identifies the existing feature combination responsible for the largest total error, then chooses positive and negative examples from within that subgroup. The resulting semantic question is narrower: among otherwise similar items, what characteristic explains the remaining difference?
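A minimal sketch of the subgroup selection, assuming squared error as the error measure and an exact match on the existing binary features as the grouping rule. The function name and the toy numbers are hypothetical.

```python
from collections import defaultdict
import numpy as np

def worst_subgroup(Z, sq_err):
    """Return the existing-feature combination with the largest total error,
    together with the indices of the items that share it."""
    totals = defaultdict(float)
    members = defaultdict(list)
    for i, row in enumerate(Z):
        key = tuple(int(v) for v in row)
        totals[key] += float(sq_err[i])
        members[key].append(i)
    key = max(totals, key=totals.get)
    return key, members[key]

# One existing feature ("Arizona desert community"), five houses.
Z = np.array([[1], [1], [1], [0], [0]])
sq_err = np.array([0.9, 0.8, 0.1, 0.2, 0.1])
key, members = worst_subgroup(Z, sq_err)
```

Positive and negative examples for the mining prompt would then be drawn only from `members`, so the foundation model is asked to explain one pattern at a time.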
For the house and movie experiments, the system repeatedly adds five features and removes two, running between 10 and 20 expansion-contraction cycles. Features applying to fewer than 2% or more than 98% of items are discarded.
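The prevalence rule is simple enough to state in a few lines. The sketch below assumes features are stored as a 0/1 matrix with one column per feature; the function name is hypothetical, and the thresholds are the 2% and 98% figures quoted above.

```python
import numpy as np

def keep_by_prevalence(F, low=0.02, high=0.98):
    """Boolean mask over feature columns: True where the share of items
    with the feature lies between `low` and `high` inclusive."""
    rate = F.mean(axis=0)
    return (rate >= low) & (rate <= high)

# 100 items, three candidate features applying to 1, 50, and 99 items.
F = np.zeros((100, 3))
F[:1, 0] = 1
F[:50, 1] = 1
F[:99, 2] = 1
mask = keep_by_prevalence(F)
```

Features that are almost always or almost never present carry little contrast for the statistical model, which is presumably why they are dropped.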
This expand-contract process serves two purposes. It allows the feature vocabulary to evolve rather than merely accumulate, and it exposes overtraining when additional features continue improving the training fit while held-out performance deteriorates.
The binary-number experiment explains the mechanism; it does not validate the business case
The paper’s first experiment uses numbers from 0 to 511 represented as strings, with each number’s value serving as its target.
The first discovered feature divides the numbers into a lower and upper half. The next feature divides those groups again. Repeated discovery eventually recovers the nine binary digits needed to represent all 512 values.
The reconstruction is perfect.
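The statistical half of this experiment can be reproduced without any foundation model. The sketch below is an illustration under simplifying assumptions: it uses an ordinary least-squares fit and assigns each new feature by the sign of the residual, with no naming or extraction step. It is not the paper's code.

```python
import numpy as np

y = np.arange(512, dtype=float)        # targets: the numbers 0..511
X = np.ones((512, 1))                  # start from an intercept-only model

for _ in range(9):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    z = (resid > 0).astype(float)      # new feature: items the model under-predicts
    X = np.column_stack([X, z])

w, *_ = np.linalg.lstsq(X, y, rcond=None)
max_error = float(np.max(np.abs(y - X @ w)))

# Compare the discovered columns with the binary digits, most significant first.
bits = np.array([[(n >> (8 - k)) & 1 for k in range(9)] for n in range(512)],
                dtype=float)
recovered_bits = bool(np.array_equal(X[:, 1:], bits))
```

Each residual split halves the remaining range, so the first feature is "at least 256" and every later one is the next binary digit.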
This experiment is best understood as a mechanism demonstration. It shows why residual-guided splitting naturally proceeds from coarse distinctions to finer ones. The statistical model identifies a useful partition, and the foundation model recognizes the semantic rule behind it: greater than 255, a particular bit being set, even numbers, and so forth.
It does not show that GenZ will reliably discover meaningful concepts in messy enterprise data. The numeric pattern is exact, easily recognizable, and unusually well matched to repeated binary splitting.
Its value is explanatory. It makes the subsequent experiments easier to interpret.
House prices show the difference between plausible features and useful features
The house-pricing experiment uses a dataset of 535 properties listed in 2016. Each property includes metadata and four listing images.
To work around image-processing and prompt-length constraints, the authors first use GPT-5 to convert the images into detailed text descriptions. Each resulting semantic item combines those descriptions with metadata such as area, bedroom count, bathroom count, and ZIP code. Price is withheld from the semantic item, and the prediction target is log price.
The system receives several numerical inputs directly—intercept, area, bedrooms, and bathrooms—rather than wasting discovery cycles rediscovering basic quantities. GenZ then mines additional binary semantic features.
The data is divided into an 80% training set and a 20% test set. The authors evaluate both a linear mapping and a small neural network.
The strongest GenZ configuration reaches a log-price median absolute error of roughly 0.10–0.11, corresponding to about 12% median relative error on raw house prices. The zero-shot feature baseline remains around 38%.
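The link between a log-price error and a relative price error is a one-line conversion. The sketch below assumes natural logarithms and shows both directions of error, since an over-prediction and an under-prediction of the same log size give slightly different percentages. The function names are hypothetical.

```python
import math

def relative_error_if_over(log_err):
    """Relative error when the prediction is too high by `log_err` in log space."""
    return math.exp(log_err) - 1.0

def relative_error_if_under(log_err):
    """Relative error when the prediction is too low by `log_err` in log space."""
    return 1.0 - math.exp(-log_err)

over_low, over_high = relative_error_if_over(0.10), relative_error_if_over(0.11)
under_low, under_high = relative_error_if_under(0.10), relative_error_if_under(0.11)
```

For log errors of 0.10 to 0.11 the conversion gives roughly 10% to 12%, which is consistent with the figure quoted above.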
That gap is too large to explain through a slightly better phrasing of the same feature list.
The baseline describes houses; GenZ describes this housing dataset
For the zero-shot baseline, GPT-5 is asked to propose approximately 50 yes-or-no characteristics useful for price prediction. It produces 52.
The resulting list contains reasonable housing attributes:
- private swimming pool;
- scenic views;
- granite countertops;
- hardwood flooring;
- decorative chandeliers;
- green lawns;
- kitchen islands;
- gated entries.
These features are not foolish. Many are likely relevant in some market.
They are also highly conventional.
GenZ’s discovered features include exact ZIP-code groups, site-built rather than manufactured construction, plantation shutters, carports, three-car garages, retro bathroom elements, Hollywood-style bathroom lighting, wood paneling, and the absence of a kitchen island.
Some of these features may be direct pricing factors. Others may operate as proxies for location, construction era, renovation quality, or market segment. Their importance comes from fitting the observed pricing structure, not from sounding universally important.
The neural network performs better than the linear model in this experiment, plausibly because house-price effects interact. A quality feature may have a different value in different locations or at different property sizes. The nonlinear model can capture those combinations.
Its learning curve also shows more severe overtraining after reaching its best test performance. More discovered features eventually improve training error while worsening generalization. Interpretability has not repealed the usual rules of model selection.
The house baseline is generous, but the evaluation remains illustrative
The baseline receives unusually favorable treatment. GPT-5 is given the complete collection of items, including prices, when proposing its features. Those features are then repeatedly pruned, and the best observed test performance is reported.
This makes the baseline difficult to dismiss as an intentionally weak prompt. Even with access to extensive information, generic feature proposal performs poorly.
It also means the comparison is not a clean simulation of deployment on an untouched test set. Combined with the small dataset and single reported split, the experiment should be read as strong illustrative evidence for the mechanism—not a settled estimate of production performance.
Netflix shows why collective preference is not the same as content similarity
The second main experiment concerns cold-start recommendation.
The Netflix Prize dataset contains ratings from 480,189 users across 17,770 movies. The authors use singular value decomposition to obtain a 32-dimensional collaborative-filtering embedding for each movie.
These embeddings summarize how users collectively relate movies to one another. A new movie, however, has no interaction history from which to estimate its position in that preference space.
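A minimal sketch of how such embeddings can be obtained with a truncated SVD. It uses a small dense matrix of random ratings and centers each movie's column, both of which are assumptions made for illustration; the real Netflix matrix is sparse, and the preprocessing used in the paper is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 200, 60, 32

# Toy dense ratings on a 1-5 scale (hypothetical data).
ratings = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)
centered = ratings - ratings.mean(axis=0, keepdims=True)

# Truncated SVD: keep the top `dim` singular directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
movie_embeddings = Vt[:dim].T * S[:dim]      # one row per movie
```

Each row of `movie_embeddings` is a 32-dimensional vector of the kind GenZ is asked to predict from semantic features.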
GenZ attempts to predict the embedding directly from semantic information.
For the experiment, the authors focus on the 512 most-watched movies and randomly split them into training and test sets. Because these are popular films, the semantic description contains only each movie’s title and release year; the foundation model is expected to know enough about them already.
The linear GenZ model reaches a test cosine similarity of approximately 0.59 between predicted and true embeddings. The zero-shot feature baseline reaches about 0.48, while the neural GenZ model reaches roughly 0.52 and overfits more sharply.
Here, greater model capacity is not rewarded. Within this dataset and sample size, the simpler mapping generalizes better.
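For reference, the metric can be computed as below. The sketch assumes the score is the cosine similarity of each movie's predicted and true embedding, averaged over movies; the exact aggregation used in the paper is not stated here, and the function name is hypothetical.

```python
import numpy as np

def mean_cosine(pred, true):
    """Average per-row cosine similarity between two embedding matrices."""
    dot = np.sum(pred * true, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(true, axis=1)
    return float(np.mean(dot / norms))

true = np.array([[1.0, 0.0], [0.0, 2.0]])
same = mean_cosine(true, true)                                     # identical directions
opposite = mean_cosine(-true, true)                                # reversed directions
orthogonal = mean_cosine(np.array([[0.0, 3.0], [5.0, 0.0]]), true) # unrelated directions
```

Because cosine similarity ignores vector length, a score of 0.59 describes how well the predicted direction in preference space matches the true one, not how well magnitudes match.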
A score of 0.59 is roughly equivalent to thousands of interactions
The paper separately simulates the conventional cold-start process.
For 100 randomly selected movies, it removes existing ratings, gradually restores user observations, and measures how closely the partially reconstructed embedding approaches the full-data embedding.
After approximately 1,000 user observations, cosine similarity reaches only about 0.38. At roughly 4,000 observations, it reaches around 0.57.
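The shape of that curve can be illustrated with a synthetic fold-in experiment. This is a sketch under strong assumptions and not the paper's simulation: user factors are treated as already known, ratings are generated from a noisy linear model, and the new movie's embedding is re-estimated by least squares from its first `n_obs` ratings. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, dim = 2000, 8

user_factors = rng.standard_normal((n_users, dim))   # assumed known from other movies
true_embedding = rng.standard_normal(dim)            # stands in for the full-data embedding
ratings = user_factors @ true_embedding + 2.0 * rng.standard_normal(n_users)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fold_in(n_obs):
    """Least-squares estimate of the movie embedding from its first n_obs ratings."""
    estimate, *_ = np.linalg.lstsq(user_factors[:n_obs], ratings[:n_obs], rcond=None)
    return estimate

similarities = {n: cosine(fold_in(n), true_embedding) for n in (10, 100, 2000)}
```

With few observations the estimate is dominated by noise, and it approaches the full-data embedding only as observations accumulate. The real data is noisier and higher-dimensional than this toy, which is consistent with the thousands of observations reported above.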
GenZ’s predicted embedding reaches approximately 0.59 without using ratings for the new item.
Two related comparisons should be kept distinct:
- GenZ’s absolute performance is roughly comparable to the embedding quality obtained after about 4,000 user observations.
- Its improvement over the zero-shot baseline, from 0.48 to 0.59, corresponds to approximately 2,000 additional observations in the simulation.
These numbers do not mean GenZ eliminates the value of subsequent behavior. Real interactions can still refine or overturn the predicted embedding. The result suggests a stronger starting position before sufficient behavior arrives.
The discovered features reveal the audience, not merely the movies
The zero-shot baseline proposes familiar content categories: genre, pacing, plot structure, setting, violence, romance, narrative style, and tone.
GenZ discovers a different mixture:
- franchise membership;
- Best Picture recognition;
- acting and screenplay awards;
- films scored by John Williams;
- movies starring particular actors;
- narrow release-period distinctions;
- standard theatrical releases rather than special editions;
- combinations of mainstream positioning, age, prestige, and audience orientation.
The “John Williams” feature is especially informative. It groups movies across genres through a creative signature that appears to align with collective viewing patterns. A conventional content taxonomy would have little reason to elevate it.
The result illustrates a useful distinction. Semantic similarity asks whether two movies resemble each other in content. Collaborative preference asks whether the same people tend to value them in related ways. Those structures overlap, but they are not identical.
The discovered descriptions remain interpretations of predictive splits. They are not causal findings about why audiences watched the films. “John Williams” may represent music preference, franchise affinity, generational familiarity, studio scale, or several correlated factors at once.
Each experiment answers a different question
The paper contains several tests and appendices, but they should not be treated as equal pieces of evidence.
| Component | Likely purpose | What it supports | What it does not establish |
|---|---|---|---|
| Binary-number reconstruction | Mechanism illustration | Residual-guided discovery can produce coarse-to-fine semantic features | Effectiveness on noisy or unfamiliar business data |
| House-price experiment | Main evidence for scalar regression | Dataset-guided semantic features can outperform generic LLM-proposed features | Production accuracy across markets, periods, or larger property datasets |
| Netflix cold-start experiment | Main evidence for multidimensional targets | Semantic features can predict collaborative-filtering embeddings without item interactions | Performance for obscure items, new catalogs, or changing audiences |
| Linear-versus-neural comparisons | Model-form comparison and overfitting diagnostic | The appropriate statistical mapping depends on the target structure | A general rule that linear or neural models are superior |
| Expand-contract learning curves | Implementation behavior and model-selection signal | Feature refinement can improve performance before overtraining begins | Automatic protection against overfitting |
| Appendix B linear derivation | Theoretical and implementation detail | Exact expectations are tractable for a linear observation model | Additional empirical robustness |
| Appendix C two-sided interaction model | Exploratory extension | The framework could model semantic features for two interacting item types | Demonstrated performance for buyers, users, regions, or other paired entities |
This classification matters because the paper’s broadest ambitions extend beyond its experiments.
The authors suggest applications in areas where a foundation model can recognize semantic patterns in items but does not understand the target generated by new experiments. Scientific discovery is an obvious possibility. Appendix C also sketches models involving two semantically described item sets, such as products and buyers or plants and geographic regions.
Those extensions follow logically from the framework. They have not yet been demonstrated.
The practical product is a dataset-specific semantic layer
The most useful business interpretation of GenZ is not “use an LLM for prediction.”
It is: use proprietary target data to build a semantic feature layer that an LLM could not have designed from general knowledge alone.
A plausible operational workflow would be:
- Build a statistical target from proprietary outcomes or interactions.
- Represent each item using text, images, metadata, or retrieval-supported descriptions.
- Run feature discovery offline, allowing residual structure to guide semantic mining.
- Validate discovered features across held-out samples, time periods, and business segments.
- Deploy the stable features as an interpretable input layer for downstream prediction.
- Continue monitoring whether the feature meanings and predictive relationships drift.
The framework appears most relevant when four conditions hold:
- Items can be meaningfully described or inspected by a foundation model.
- The prediction target contains proprietary or newly observed statistical structure.
- Cold-start performance or limited labels create operational costs.
- Human-readable intermediate features are valuable for diagnosis, governance, or decision support.
Potential applications include catalog recommendation, property or asset pricing, product segmentation, underwriting support, demand modeling, and analysis of experimental outcomes.
The paper directly demonstrates only house pricing and movie embeddings. Applying the mechanism elsewhere is a business inference that requires new validation.
Frozen models reduce training requirements, not operational requirements
GenZ does not fine-tune GPT-4 or GPT-5. This may simplify model maintenance and allow organizations to replace the semantic model without rebuilding the entire statistical framework.
The architecture still introduces operational burdens.
Feature mining repeatedly sends contrasting examples to a foundation model. Extraction then applies candidate features across the item collection. Organizations using external APIs would need to consider data-governance boundaries, caching, prompt reproducibility, latency, and inference costs.
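Caching is the easiest of these to illustrate. The sketch below is one possible design and not something described in the paper: extraction answers are stored under a hash of the model version, the feature description, and the item text, so a repeated question costs no additional call. `ExtractionCache` and the `classify` callable are hypothetical names.

```python
import hashlib
import json

class ExtractionCache:
    """Memoize extraction answers per (model version, description, item)."""

    def __init__(self, classify, model_version):
        self._classify = classify            # callable(description, item_text) -> bool
        self._model_version = model_version
        self._store = {}
        self.calls = 0                       # number of real classifier calls made

    def _key(self, description, item_text):
        payload = json.dumps([self._model_version, description, item_text])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, description, item_text):
        key = self._key(description, item_text)
        if key not in self._store:
            self.calls += 1
            self._store[key] = bool(self._classify(description, item_text))
        return self._store[key]

# Stand-in classifier for illustration; a real one would call a foundation model.
cache = ExtractionCache(lambda d, t: d in t, model_version="v1")
first = cache.get("carport", "single-story home with carport")
second = cache.get("carport", "single-story home with carport")
other = cache.get("pool", "single-story home with carport")
```

Including the model version in the key means that swapping the semantic model invalidates old answers instead of silently mixing them with new ones.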
Discovered feature descriptions also become model assets. They require versioning and review. A feature such as “located in Arizona desert communities” may remain understandable while its statistical effect changes. A feature involving an actor, award, or release period may become irrelevant as the user base shifts.
The system can learn that a semantic feature is unreliable through its error parameter. It cannot determine whether the feature is legally acceptable, ethically appropriate, causally meaningful, or stable enough for a decision process.
Statistics can identify a predictive proxy with impressive efficiency. Compliance departments have met this genre before.
The important boundaries are scale, stability, and meaning
The experiments establish that the mechanism can work. They leave several practical questions open.
First, the evaluated semantic collections are small: 535 houses and 512 popular movies. The underlying Netflix ratings matrix is large, but feature discovery is performed only on a small, familiar subset of films. Performance and inference cost at catalog scale remain uncertain.
Second, the paper reports random train-test splits rather than repeated evaluations across multiple seeds, markets, or time periods. The discovered concepts may change substantially under another sample.
Third, the most successful feature set may be statistically useful without being semantically stable. An LLM can produce a convincing description for a sampled contrast even when several correlated explanations are possible. Held-out prediction tests usefulness; it does not prove that the chosen description captures the true underlying mechanism.
Fourth, the feature space is binary. Binary questions make mining and interpretation manageable, but some business concepts are continuous, relational, hierarchical, or context-dependent.
Finally, GenZ still overfits. The house neural network and movie neural network both show that continued feature discovery can eventually improve training performance at the expense of test performance. A production system would need disciplined stopping rules and validation procedures rather than faith in an endlessly expanding semantic vocabulary.
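One simple form such a stopping rule could take is patience-based selection on a validation set kept separate from the final test set. The sketch below is a generic illustration and not a procedure from the paper; the function name, the patience value, and the error sequence are hypothetical.

```python
def best_cycle(val_errors, patience=3):
    """Index of the best validation error, stopping once `patience`
    consecutive cycles have failed to improve on it."""
    best_index, best_error = 0, float("inf")
    for i, err in enumerate(val_errors):
        if err < best_error:
            best_index, best_error = i, err
        elif i - best_index >= patience:
            break
    return best_index

# Validation error after each expand-contract cycle (made-up numbers).
val_errors = [0.40, 0.30, 0.22, 0.20, 0.21, 0.23, 0.26, 0.19]
chosen = best_cycle(val_errors, patience=3)
chosen_patient = best_cycle(val_errors, patience=10)
```

The two calls show the trade-off: a short patience stops at cycle 3 and never sees the late improvement at cycle 7, while a long patience finds it at the cost of more foundation-model calls.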
Teach the model to listen before asking it to explain
Return to the pricing team and its list of pools, lawns, and granite countertops.
The problem was not that the LLM knew too little about houses. It knew enough to produce answers that sounded correct before examining what this dataset was trying to say.
GenZ changes the order of operations.
The statistical model first identifies a distinction that matters for prediction. The foundation model then uses its semantic knowledge to make that distinction legible. The resulting feature can be inspected, questioned, applied to new items, or discarded when it stops generalizing.
That is a more disciplined role for generative AI: not the authority deciding which patterns matter, but the interpreter helping humans understand patterns found in evidence.
The paper’s results are early and deliberately illustrative. Yet its architectural lesson is already useful. When proprietary data and foundation-model knowledge disagree, the system should not automatically assume the model is wiser.
Sometimes the LLM needs to stop talking and listen to the residuals.
Cognaptus: Automate the Present, Incubate the Future.
[1] Marko Jojic and Nebojsa Jojic, “GenZ: Foundational Models as Latent Variable Generators within Traditional Statistical Models,” arXiv:2512.24834, 2025. https://arxiv.org/abs/2512.24834