Planning Before Picking: When Slate Recommendation Learns to Think

A list of individually excellent items can still be a terrible list. Ask anyone who has attended a conference with five brilliant speakers, no agenda, and three consecutive sessions on the same topic.

Recommendation systems have the same problem.

A conventional recommender can assign highly accurate scores to individual videos, products, or articles, then still assemble a repetitive, badly ordered, or strangely balanced feed. Each item wins its private competition. The user receives the collective consequences.

Generative recommendation was supposed to fix this by producing the whole recommendation slate as a sequence. Yet the obvious implementation creates another problem: instead of operating a fast ranking model, the platform now waits for an autoregressive model to generate multiple tokens for every item. The system becomes more aware of the list, but also considerably slower. A familiar industrial bargain: better theory, worse latency.

Tencent researchers propose a more structured alternative in HiGR, a hierarchical generative slate recommendation framework that separates list-level planning from item-level selection and then aligns the completed slate with several preference objectives.¹

The important point is not simply that HiGR generates recommendations. Plenty of models now do that. Its contribution is that it redesigns three connected layers of the recommendation process:

Representation: item identifiers must carry usable hierarchical meaning.
Generation: the model should plan the slate at a coarse level before decoding specific items.
Alignment: post-training should reward good lists, not merely good individual predictions.

That coordination is what makes the paper more interesting than another contest between recommendation models. HiGR is less a clever decoder than a proposed production architecture.

A Slate Is a Product Surface, Not a Bag of High Scores

Most recommendation pipelines divide the task into two broad stages.

First, a model scores candidate items independently or in pairs. Then, a reranking layer applies rules for diversity, business priorities, freshness, or other constraints. This design is efficient and operationally familiar. It also treats slate quality as something assembled after the important predictions have already been made.

The weakness becomes visible whenever the value of an item depends on its neighbors.

A short-video user may enjoy football commentary, political analysis, cooking clips, and comedy. Recommending the individually highest-scoring football video five times is not necessarily an excellent response to that preference profile. The system has predicted several items correctly while constructing the experience poorly.

Generative slate recommendation changes the unit of production. Instead of asking, “Which item should rank highest?”, it asks, “Which ordered list should the system generate?”

That shift allows the model to capture inter-item dependencies, but it introduces two practical difficulties.

First, recommendation catalogs are too large to represent every item as one ordinary vocabulary token. Generative recommenders therefore often encode an item as a short sequence of semantic ID tokens. A ten-item slate represented by three tokens per item may require thirty sequential decoding steps.

Second, ordinary left-to-right generation does not automatically provide genuine global planning. An autoregressive model can condition on previous outputs, but each decision still arrives after earlier decisions have narrowed the remaining possibilities. It may be globally informed in principle while behaving locally in practice.

HiGR addresses these problems as a linked system rather than treating them as separate engineering inconveniences.

Layer	Failure being addressed	HiGR component	Operational consequence
Representation	Semantic ID prefixes are inconsistent or entangled	Contrastive RQ-VAE	Coarse item meanings become more usable during generation
Generation	Every item token passes through an expensive sequential decoder	Hierarchical Slate Decoder	Expensive planning is separated from lighter item decoding
Alignment	Item-level objectives do not capture list quality	Listwise ORPO post-training	Completed slates are trained against ranking, interest, and diversity preferences

The order matters. A planner cannot reliably plan with meaningless abstractions. An efficient decoder is not especially useful if it efficiently generates the wrong slate. Preference alignment cannot repair every structural weakness after the fact.

Representation: A Planner Cannot Use Semantic IDs Whose Prefixes Lie

Semantic IDs are intended to give recommendation models a compact language for describing items.

Instead of assigning every item one enormous categorical identifier, a model encodes each item as several codewords. Earlier codewords are expected to capture broad meaning, while later codewords distinguish increasingly specific items.

In the ideal case, items with a shared prefix belong to a meaningful neighborhood. A planner can then operate on the prefix as a coarse preference signal before choosing the final item identity.
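The coarse-to-fine encoding can be sketched in a few lines. This is a minimal illustration of plain residual quantization, not the paper's tokenizer: the two tiny codebooks, the item names, and the two-dimensional embeddings are all invented for the example.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks):
    """Encode vec as one codeword index per layer.

    Each layer quantizes whatever the previous layers failed to capture.
    """
    ids, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(ids)

# Hypothetical codebooks: layer 1 holds broad directions, layer 2 small corrections.
codebooks = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)],
]

# Hypothetical item embeddings.
items = {
    "football_a": (1.1, 0.0),
    "football_b": (0.9, 0.0),
    "cooking_a": (0.0, 1.1),
}

semantic_ids = {name: residual_quantize(vec, codebooks) for name, vec in items.items()}
```

In this toy setup the two football items share their first codeword and differ only in the second, which is the behavior a planner would like to rely on.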

The trouble is that ordinary residual quantization does not guarantee this behavior. Two items with similar meanings may receive unrelated prefixes. Two items sharing a prefix may represent quite different content. The hierarchy exists syntactically but not reliably semantically.

HiGR introduces a Contrastive Residual Quantized Variational Auto-Encoder, or CRQ-VAE, to make those prefixes more dependable.

The method adds two important constraints.

First, it introduces a global quantization loss intended to prevent residual vanishing. In a residual quantization pipeline, each codebook layer encodes what previous layers failed to capture. If the residual collapses too quickly, later layers receive little meaningful information and become decorative complexity.

Second, CRQ-VAE applies contrastive learning to the earlier codebook layers. Semantically related or frequently co-occurring items are encouraged to share similar prefix representations, while unrelated items are pushed apart.

The final codebook layer is deliberately excluded from the contrastive constraint. That exception is important. Early prefixes should organize items into useful semantic neighborhoods; the last layer must still distinguish the individual houses.
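One way to picture that exception is a loss that loops over every codebook layer except the last. The sketch below uses InfoNCE with cosine similarity as a stand-in contrastive objective. That choice, the temperature, and the toy vectors are assumptions made for illustration; the paper's exact loss is not reproduced here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive pair among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

def prefix_contrastive_loss(anchor, positive, negatives):
    """Sum InfoNCE over every codebook layer except the final one.

    anchor and positive hold one quantized vector per layer;
    negatives is a list of such per-layer lists.
    """
    prefix_layers = range(len(anchor) - 1)  # the last layer is deliberately skipped
    return sum(
        info_nce(anchor[l], positive[l], [neg[l] for neg in negatives])
        for l in prefix_layers
    )

# Hypothetical three-layer quantized vectors for three items.
anchor = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
related = [(0.9, 0.1), (0.1, 0.9), (-1.0, 0.5)]
unrelated = [(-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]

aligned_loss = prefix_contrastive_loss(anchor, related, [unrelated])
misaligned_loss = prefix_contrastive_loss(anchor, unrelated, [related])
```

Because the final layer never enters the sum, two related items remain free to diverge there, which is what lets the last codeword identify an individual item.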

What the Tokenization Evidence Supports

The paper compares CRQ-VAE with RQ-Kmeans and conventional RQ-VAE using three internal tokenization metrics.

Tokenization method	Collision	Concentration	Consistency
RQ-Kmeans	0.2126	0.83	0.4533
RQ-VAE	0.0298	0.75	0.5577
CRQ-VAE	0.0237	0.93	0.6647

According to the paper’s evaluation, CRQ-VAE produces the lowest collision rate, the highest concentration, and the strongest consistency. The consistency score rises from 0.5577 for RQ-VAE to 0.6647 for CRQ-VAE.

This experiment is an ablation-style validation of the representation layer. It supports the claim that contrastive constraints produce a more structured semantic-ID space. It does not, by itself, prove that users prefer the resulting slates. That evidence must come later, after the representation is used by the generator and evaluated in the complete system.

The business interpretation is straightforward: semantic IDs are no longer merely a compression device. They become part of the control surface.

A platform that can trust high-level prefixes can more easily plan topical balance, impose category-level constraints, or prevent several slate positions from collapsing into the same semantic neighborhood. The catalog representation begins to carry operational meaning.

That usefulness remains domain-dependent. A semantic structure derived from content embeddings and interaction signals may organize a media catalog well while failing to capture attributes that matter in commerce, such as price sensitivity, inventory availability, seller obligations, or delivery time. A meaningful representation is meaningful only relative to the decisions the business needs to make.

Generation: Move the Expensive Work to the Slate Level

The central misconception around generative recommendation is that it merely replaces a ranking model with a slower autoregressive model.

HiGR does not eliminate autoregressive generation. It reorganizes where the expensive autoregression occurs.

Its Hierarchical Slate Decoder contains two stages:

A coarse-grained slate planner generates a preference embedding for each slate position.
A fine-grained item generator converts each preference embedding into the semantic-ID sequence of a specific item.

The planner handles the difficult list-level task: deciding the evolving structure and intent of the slate. The item generator handles a narrower conditional task: given the intended preference representation for this position, identify an appropriate item.

The word “planning” should not be mistaken for symbolic reasoning or an agent privately debating the virtues of documentaries versus cooking videos. The planner generates latent preference embeddings autoregressively. Its intelligence lies in choosing the right abstraction level, not in producing a written chain of thought.
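The two-stage control flow can be sketched as a simple loop. Everything inside the stubs is a toy assumption: the real planner is a deep autoregressive network and the real item generator runs beam search over semantic-ID tokens, whereas here `plan_next` is a subtraction and `decode_item` is an argmax over a three-item hypothetical catalog. Only the shape of the loop reflects the architecture as described.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plan_next(user_context, history):
    """Toy planner: the user's interest vector minus what the slate already covers."""
    if history:
        covered = [sum(dim) for dim in zip(*history)]
    else:
        covered = [0.0] * len(user_context)
    return [u - c for u, c in zip(user_context, covered)]

def decode_item(preference, user_context, catalog):
    """Toy item generator: best catalog match, lightly biased by user context."""
    def score(vec):
        return dot(preference, vec) + 0.1 * dot(user_context, vec)
    item_id = max(catalog, key=lambda sid: score(catalog[sid]))
    return item_id, catalog[item_id]

def generate_slate(user_context, catalog, slate_length):
    slate, history = [], []
    for _ in range(slate_length):
        preference = plan_next(user_context, history)  # coarse, slate-level step
        item_id, item_vec = decode_item(preference, user_context, catalog)  # fine-grained step
        slate.append(item_id)
        history.append(item_vec)  # the chosen item feeds the next planning step
    return slate

# Hypothetical catalog keyed by two-token semantic IDs.
catalog = {(0, 0): (1.0, 0.0), (0, 1): (0.9, 0.1), (1, 0): (0.0, 1.0)}
slate = generate_slate(user_context=(1.0, 0.8), catalog=catalog, slate_length=2)
```

In this toy run the second position lands in a different prefix neighborhood from the first, because the planner's preference has already shifted away from what the slate covers.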

Why the Separation Can Reduce Latency

In a conventional semantic-ID generator, the same deep autoregressive model may process every token of every item. If a slate contains $M$ items and each item requires $D$ semantic-ID tokens, the deep model must work through a combined sequence of $M \times D$ tokens.

HiGR assigns the heavier network to preference planning, which takes one step per slate position, and uses a much shallower shared generator for semantic-ID decoding.

During inference, the fine-grained generator performs beam search for an item conditioned on the planner’s preference embedding. The planner then uses the selected item representation when producing the next slate preference. The process remains sequential at the slate level, but the expensive model no longer performs every fine-grained token decision.

This is a more precise description than saying HiGR simply “plans once and fills everything in parallel.” It does not. The efficiency comes from moving fine-grained decoding into a lighter, parameter-shared module and reducing the amount of deep sequential computation.
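A back-of-the-envelope count makes the argument concrete. The sketch below tallies layer passes only, assuming a hypothetical 12-layer deep model and a 2-layer shared item generator. Those layer counts are invented, and the model ignores beam width, attention cost, and caching, so it illustrates the direction of the saving rather than the paper's measured speedup.

```python
def flat_layer_passes(num_items, tokens_per_item, deep_layers):
    """Every semantic-ID token of every item goes through the deep model."""
    return num_items * tokens_per_item * deep_layers

def hierarchical_layer_passes(num_items, tokens_per_item, planner_layers, generator_layers):
    """One deep planning step per item, plus shallow decoding for each token."""
    planner = num_items * planner_layers
    generator = num_items * tokens_per_item * generator_layers
    return planner + generator

M, D = 10, 3           # ten items, three tokens each, as in the earlier example
DEEP, SHALLOW = 12, 2  # hypothetical layer counts

flat = flat_layer_passes(M, D, DEEP)                           # 360
hierarchical = hierarchical_layer_passes(M, D, DEEP, SHALLOW)  # 180
```

Under these assumed sizes the hierarchical layout halves the layer passes, and the gap widens as $D$ grows or as the item generator gets shallower relative to the planner.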

Hierarchy pays rent only when the expensive layers are assigned the expensive decisions.

The Decoder Ablations Separate Core Evidence from Configuration Tests

The paper examines several variants of the Hierarchical Slate Decoder.

Removing the user-context embedding from the item generator produces a substantial decline. NDCG@5 falls from 0.0753 to 0.0664, while the impression and effective-view metrics also deteriorate. This is a meaningful architectural ablation: the fine-grained decoder still needs direct access to user context rather than relying entirely on the planner’s preference embedding.

By contrast, changing how semantic-ID embeddings are pooled—using sum, mean, or maximum pooling—produces relatively small differences. This is primarily a robustness and configuration test, not a second major contribution.

The paper also compares shared and non-shared item generators. A separate generator for each slate position provides no consistent performance advantage. That result matters operationally because parameter sharing reduces model complexity and allows variable-length generation without requiring a dedicated decoder for every possible position.

Experiment	Likely purpose	What it supports	What it does not prove
Remove context embedding	Core ablation	Fine-grained item selection still requires user context	That every context architecture will work equally well
Compare sum, mean, and max pooling	Sensitivity test	Performance is not highly dependent on pooling choice	That pooling design is irrelevant in other settings
Shared versus non-shared item generators	Efficiency-oriented ablation	Sharing preserves overall quality while simplifying the model	That all position-specific effects can always be ignored
Vary slate length from 1 to 10	Robustness test	HiGR retains its advantage across tested output lengths	Performance on substantially longer or structurally different slates

Efficiency Is Main Evidence, Not a Decorative Benchmark

The paper supports its efficiency claim with both complexity analysis and a matched empirical comparison against OneRec.

Under identical model settings and hardware, HiGR without key-value caching is reported to achieve more than five times the inference speed of OneRec with beam search while also producing more than a 5% performance improvement. With key-value caching enabled, HiGR remains both faster and more accurate in the reported comparison.

That result is central because it addresses the practical objection to generative slate recommendation. The architecture is not merely more list-aware; it attempts to make list-aware generation compatible with industrial serving constraints.

The appendix’s decoding-length experiment serves a different purpose. It shows that HiGR continues to outperform OneRec as slate length changes across the tested range. This supports robustness to different output lengths. It should not be interpreted as a separate proof of deployment efficiency.

Similarly, the paper’s scaling experiment—from 0.05 billion to 2 billion parameters—shows improving NDCG@5 and declining convergence loss along roughly linear trajectories on logarithmic axes. This is useful evidence that the architecture can benefit from greater capacity. It is not evidence that the largest model offers the best economic return. Scaling curves describe performance potential; finance departments remain stubbornly interested in cost.

Alignment: Users Evaluate the List, So Train on List-Level Failures

Even a well-represented and efficiently generated slate can optimize the wrong objective.

Recommendation models often learn from item-level outcomes: clicks, views, purchases, or next-item predictions. Yet users experience an ordered collection. They respond not only to whether an item is relevant, but also to whether the list is repetitive, well ordered, and genuinely connected to their interests.

HiGR therefore adds listwise preference optimization after pretraining.

The method constructs positive and negative slate pairs from implicit user feedback.

A positive slate is formed from a user’s engaged watch sequence, ordered according to observed feedback. Negative slates are constructed to represent different failure modes:

a reordered slate that damages ranking fidelity;
a slate containing negatively received items, intended to distinguish exposure from genuine interest;
a diversity-oriented negative construction intended to teach the model about undesirable slate composition.
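The first two constructions can be sketched as follows. The function names, the number of swapped positions, and the sample item IDs are hypothetical, and the paper's actual sampling rules may differ. The diversity-oriented negative is left out because the source description does not pin down its sampling rule.

```python
import random

def reordered_negative(positive_slate, rng):
    """Same items in a different order: damages ranking fidelity only."""
    if len(set(positive_slate)) < 2:
        raise ValueError("need at least two distinct items to reorder")
    shuffled = list(positive_slate)
    while shuffled == list(positive_slate):
        rng.shuffle(shuffled)
    return shuffled

def disliked_negative(positive_slate, disliked_items, num_swaps, rng):
    """Replace some positions with items the user saw but responded to negatively."""
    negative = list(positive_slate)
    positions = rng.sample(range(len(negative)), num_swaps)
    replacements = rng.sample(list(disliked_items), num_swaps)
    for position, item in zip(positions, replacements):
        negative[position] = item
    return negative

rng = random.Random(0)  # fixed seed so the example is repeatable
positive = ["v1", "v2", "v3", "v4", "v5"]
pairs = [
    (positive, reordered_negative(positive, rng)),
    (positive, disliked_negative(positive, ["n1", "n2", "n3"], 2, rng)),
]
```

Each pair isolates one failure mode: the first negative keeps the right items in the wrong order, while the second keeps the order but mixes in items that earned exposure without interest.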

These pairs are used with Odds Ratio Preference Optimization, or ORPO.

ORPO is reference-model-free: it does not require a separate frozen reference model during preference optimization. That can reduce forward-pass computation and memory requirements compared with methods that must evaluate both a policy and reference model.
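For readers who want the objective spelled out, here is a minimal sketch of the standard ORPO loss for one preference pair: a likelihood term on the chosen sequence plus a weighted odds-ratio penalty. The inputs are length-averaged log-probabilities from the policy alone, which is why no reference model appears. The weight `lam=0.1` is an arbitrary illustrative value, and HiGR's listwise adaptation may differ in detail.

```python
import math

def log_odds(avg_log_prob):
    """log(p / (1 - p)) for p = exp(avg_log_prob); requires avg_log_prob < 0."""
    return avg_log_prob - math.log1p(-math.exp(avg_log_prob))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """Likelihood loss on the chosen slate plus lam times the odds-ratio penalty."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    penalty = softplus(-log_odds_ratio)  # equals -log(sigmoid(log_odds_ratio))
    return nll + lam * penalty

# Hypothetical average log-probabilities for a preferred and a rejected slate.
well_separated = orpo_loss(chosen_avg_logp=-0.5, rejected_avg_logp=-2.0)
inverted = orpo_loss(chosen_avg_logp=-2.0, rejected_avg_logp=-0.5)
```

The loss falls as the policy assigns higher odds to the preferred slate than to the rejected one, and both terms are computed from a single model's forward passes.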

Reference-free does not mean feedback-free. The system still depends on carefully constructed preference pairs from logged user behavior. One less model is in the room; the data problem remains seated at the table.

ORPO Outperforms the Compared Post-Training Alternatives

The paper compares HiGR without preference optimization against versions post-trained with DPO, SimPO, and ORPO.

Post-training method	Impression Hit@5	Effective-view Hit@5	NDCG@5
Without preference optimization	0.2994	0.2012	0.0753
DPO	0.3055	0.2069	0.0780
SimPO	0.3090	0.2085	0.0766
ORPO	0.3163	0.2145	0.0831

All three preference-optimization methods improve the unaligned model across the reported industrial offline metrics. ORPO provides the strongest overall result in this comparison.
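As a reminder of what these columns measure, the sketch below implements the textbook binary-relevance versions of Hit@k and NDCG@k. How the paper defines relevance for its impression and effective-view variants is not specified here, so the relevance sets in this example are assumptions.

```python
import math

def hit_at_k(recommended, relevant, k):
    """1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical slate and engagement set.
slate = ["v7", "v2", "v9", "v4", "v1"]
engaged = {"v2", "v4"}

hit = hit_at_k(slate, engaged, k=5)
ndcg = ndcg_at_k(slate, engaged, k=5)
```

Hit@k only asks whether something relevant appeared, while NDCG@k also rewards placing it early, which is why the two can move by different amounts under the same intervention.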

This table is best interpreted as a comparison with alternative post-training methods. It supports ORPO as the strongest tested alignment approach for HiGR’s setting. It does not establish ORPO as universally superior across recommender systems, datasets, or preference-pair construction strategies.

The appendix then tests the three alignment objectives separately:

ranking fidelity;
genuine interest;
slate diversity.

Each individual objective improves the unaligned model. Ranking fidelity produces the strongest single-objective NDCG@5 result at 0.0812, while combining all three objectives reaches 0.0831 and performs best across every reported metric.

That is a useful ablation because it shows the full result is not driven entirely by one objective wearing two decorative companions. The objectives appear complementary within the paper’s offline evaluation.

There is, however, an ambiguity worth preserving rather than politely editing away. The paper describes the diversity-oriented negative slate as anchoring the first item and appending semantically dissimilar items, while also stating that this construction is intended to discourage repetition and promote diversity. Treating a highly dissimilar slate as a negative appears difficult to reconcile with that stated goal. The paper does not provide enough detail to determine whether this is a wording error, a more subtle sampling design, or an unresolved methodological issue.

The objective ablation shows that the component labelled as diversity improves the reported metrics. It does not fully explain what slate behavior the component rewards.

The Main Results Show an Architecture Effect, an Alignment Effect, and a Scale Effect

HiGR’s offline result table contains several variants, which makes it possible to separate different sources of improvement rather than attributing every gain to one grand architectural gesture.

On the industrial dataset, the 25-million-parameter HiGR model without preference optimization already outperforms the similarly sized OneRec baseline across all reported measures. Effective-view Hit@5 rises from 0.1603 to 0.1810, while NDCG@5 rises from 0.0589 to 0.0631.

That comparison supports the representation-and-generation architecture before post-training enters the picture.

Adding preference alignment to the 25-million-parameter model raises Effective-view Hit@5 further to 0.1945 and NDCG@5 to 0.0714.

Scaling HiGR to 100 million parameters produces another increase, reaching Effective-view Hit@5 of 0.2145 and NDCG@5 of 0.0831.

The public KuaiRec results follow the same ordering, although the margins are generally smaller. This matters because it reduces—without eliminating—the possibility that HiGR’s performance is entirely specific to Tencent’s internal data environment.

The correct interpretation is therefore layered:

Result	What changes	What the comparison suggests
OneRec-25M → HiGR-25M without alignment	Representation and generation architecture	Hierarchical design improves offline recommendation quality
HiGR-25M without alignment → HiGR-25M with alignment	Listwise preference post-training	Slate-level preference pairs add further gains
HiGR-25M → HiGR-100M	Model capacity	HiGR benefits from increased scale
Industrial data → KuaiRec	Data environment	The advantage persists on the tested public dataset
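The size of each step is easier to compare as a relative gain. The short calculation below uses only the industrial-dataset figures quoted above; the stage labels are shorthand introduced for this example.

```python
def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before

# (before, after) pairs taken from the reported industrial results.
stages = {
    "architecture: OneRec-25M to HiGR-25M, no alignment": {
        "Effective-view Hit@5": (0.1603, 0.1810),
        "NDCG@5": (0.0589, 0.0631),
    },
    "alignment: HiGR-25M, without to with": {
        "Effective-view Hit@5": (0.1810, 0.1945),
        "NDCG@5": (0.0631, 0.0714),
    },
    "scale: HiGR-25M to HiGR-100M": {
        "Effective-view Hit@5": (0.1945, 0.2145),
        "NDCG@5": (0.0714, 0.0831),
    },
}

gains = {
    stage: {metric: relative_gain(*pair) for metric, pair in metrics.items()}
    for stage, metrics in stages.items()
}

for stage, metrics in gains.items():
    for metric, gain in metrics.items():
        print(f"{stage} | {metric}: {gain:+.1%}")
```

Seen this way, the architecture change moves Effective-view Hit@5 more than NDCG@5, while alignment and scale do the opposite. These are offline ratios from a single table, so they describe the reported comparison rather than a general decomposition of where value comes from.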

This distinction is important for business decisions. A platform unable to support preference post-training may still benefit from the hierarchical architecture. A platform unable to justify a larger model may still obtain gains at the smaller model size. The paper does not present one indivisible package whose value disappears unless every component is deployed at maximum scale.

The Online Test Is the Business Result—But Not Yet the ROI Calculation

Offline recommendation metrics are useful because they allow controlled model comparisons. They remain proxies.

The paper’s strongest practical evidence is a live A/B test on Tencent’s commercial media platform. HiGR received 2% of live traffic and was compared with the incumbent multi-stage recommendation system.

The reported relative improvements were:

Online metric	Relative improvement
Average Stay Time	+1.03%
Average Watch Time	+1.22%
Average Video Views	+1.73%
Average Request Count	+1.57%

For a large content platform, percentage changes of this size can be commercially meaningful. More importantly, the gains appear across both time-based engagement and interaction-frequency measures. The model is not reported as merely encouraging users to open more items while spending less time with them.

What the test directly shows is limited but valuable: under the tested production conditions, allocating traffic to HiGR coincided with improvements in four core engagement metrics relative to the incumbent system.

What Cognaptus infers is broader. HiGR suggests that a recommendation organization can obtain value by aligning the unit of representation, the unit of generation, and the unit of optimization with the product surface users actually experience.

What remains uncertain is the economic conversion.

The paper does not report the duration of the A/B test, confidence intervals, statistical-significance calculations, segment-level effects, or whether gains persisted after novelty effects faded. It also does not provide a serving-cost comparison, latency percentiles, engineering migration costs, or the incremental revenue associated with the engagement lift.

A 1.22% increase in watch time may be a business event. Whether it is a profitable business event depends on what the system costs to build and operate.

What Changes for a Recommendation Organization

HiGR is most relevant to organizations already operating recommendation at substantial scale. Its practical requirements extend well beyond selecting a new model architecture.

1. The Catalog Must Become a Learnable Semantic System

Traditional pipelines can treat item identifiers as arbitrary keys. HiGR requires semantic IDs whose prefix structure supports planning.

That means item representation becomes a maintained product capability. Teams must monitor collisions, consistency, codebook usage, catalog drift, and the behavior of newly introduced items. A broken token hierarchy can quietly weaken every downstream component.
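A minimal monitoring sketch, assuming semantic IDs are stored as tuples of codeword indices. The two definitions used here (share of items whose full ID is not unique, and share of a layer's codewords actually in use) are illustrative choices, not necessarily the paper's metric definitions, and the item IDs are made up.

```python
from collections import Counter

def collision_rate(semantic_ids):
    """Fraction of items whose full semantic ID is shared with another item."""
    counts = Counter(semantic_ids.values())
    colliding = sum(1 for sid in semantic_ids.values() if counts[sid] > 1)
    return colliding / len(semantic_ids)

def codebook_usage(semantic_ids, layer, codebook_size):
    """Fraction of the codewords at one layer that appear in any item's ID."""
    used = {sid[layer] for sid in semantic_ids.values()}
    return len(used) / codebook_size

# Hypothetical snapshot of a tiny catalog with three-token IDs.
snapshot = {
    "item_a": (0, 1, 2),
    "item_b": (0, 1, 2),
    "item_c": (0, 3, 1),
    "item_d": (1, 0, 0),
}

collisions = collision_rate(snapshot)
first_layer_usage = codebook_usage(snapshot, layer=0, codebook_size=4)
```

Tracking numbers like these per catalog refresh gives an early warning when new items start piling into the same IDs or when part of a codebook goes unused.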

2. Slate-Level Data Becomes a First-Class Asset

Listwise alignment depends on feedback about complete slates and on credible methods for constructing preferred and rejected examples.

Platforms must decide what makes a slate genuinely better. Watch time, purchase value, satisfaction, diversity, creator exposure, inventory obligations, and long-term retention may point in different directions. Preference-pair construction is therefore not merely data preparation. It encodes policy.

3. Serving Architecture Must Be Evaluated End to End

The paper’s efficiency gains make hierarchical generation more plausible, but a deployment decision still requires measurement within the existing stack.

A useful evaluation should include:

tail latency rather than average throughput alone;
GPU and memory costs under realistic traffic;
candidate-catalog update frequency;
interaction with safety and policy filters;
fallback behavior when generation fails;
monitoring for semantic-ID or preference drift.
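The first item on that list is easy to underestimate. The sketch below uses a nearest-rank percentile and invented latency samples to show how a comfortable average can hide a slow tail. The numbers are hypothetical and not drawn from the paper.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [20] * 98 + [400, 900]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With these made-up samples the mean stays close to the median while the 99th percentile is an order of magnitude higher, which is the gap a sequential slate generator needs to be tested against under realistic traffic.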

The relevant comparison is not “HiGR versus another model in isolation.” It is “the new generative slate system versus the current production pipeline after both have been fully engineered.”

4. Business Objectives Must Be Made Explicit

HiGR’s alignment layer is attractive because it can optimize several slate-level objectives together. It also makes disagreements harder to hide.

A reranking rule can quietly enforce diversity after the ranking model has completed its work. A listwise preference model must learn what diversity means, how much of it is desirable, and what trade-offs are acceptable.

That is technically cleaner and organizationally less comfortable. Progress occasionally has poor manners.

Where the Evidence Stops

HiGR presents an unusually complete industrial story: representation design, decoding architecture, offline comparisons, ablations, efficiency experiments, scale testing, and a live deployment result.

Several boundaries still matter.

First, the industrial training data is drawn from one month of activity on a Tencent media platform. The authors use one billion samples for pretraining, select 3% of users for post-training, and reserve the final three days—about 100 million samples—for testing. The training slates are filtered for high-quality engagement, requiring thresholds for both individual-item viewing duration and cumulative slate duration.

That filtering may improve training quality, but it also narrows the observed behavior. The model is trained primarily on slates already associated with meaningful engagement. Its performance on sparse users, weak sessions, unusual intents, or low-engagement catalog regions is less clear.

Second, the deployment is not lightweight. The appendix describes clusters using tens of NVIDIA H20 and L20 GPUs for training and serving. The paper demonstrates industrial feasibility for a large platform; it does not demonstrate affordability for a smaller one.

Third, the online experiment reports engagement lift without the details needed for a complete causal or financial assessment. Long-term satisfaction, creator or seller distribution, content quality, fairness, and revenue effects are not reported.

Fourth, the public-dataset result provides useful external evidence, but both the public and industrial settings remain close to media-consumption recommendation. Commerce, advertising, employment, lending, or regulated recommendations introduce objectives and constraints that cannot be reduced to watch behavior.

Finally, the source code is described as pending internal approval. Independent reproduction will therefore depend on whether it is released and whether the missing implementation details are sufficient to reconstruct the system faithfully.

Planning Before Picking Is Really About Choosing the Right Unit of Intelligence

HiGR’s most useful lesson is not that every recommender should become generative.

It is that systems often fail because they optimize at a smaller unit than the one customers experience.

Users experience a slate, while conventional models optimize items. Businesses evaluate engagement across a session, while training losses often reward isolated predictions. Generative methods promise to close that gap, but ordinary token-by-token generation introduces enough latency and structural confusion to make the promise difficult to deploy.

HiGR responds by aligning three levels of the system.

Its semantic IDs give the planner usable abstractions. Its hierarchical decoder separates global slate intent from specific item selection. Its preference post-training evaluates lists against multiple failure modes. The resulting architecture improves offline metrics, exceeds the compared generative baseline in matched efficiency tests, and produces positive engagement changes in a live Tencent deployment.

The model does not literally think before it picks. It does something more operationally credible: it assigns different decisions to different levels of representation and computation.

For industrial AI, that is often what planning actually looks like.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yunsheng Pang et al., “HiGR: Efficient Generative Slate Recommendation via Hierarchical Planning and Multi-Objective Preference Alignment,” arXiv:2512.24787, https://arxiv.org/html/2512.24787
