Choosing Topics Without Counting: When LDA Meets Black-Box Intelligence

Topic modeling has a small, annoying question hiding inside a very large workflow:

How many topics should the model use?

Not what the topics mean. Not whether the dashboard looks elegant. Not whether management will discover a “strategic insight” after renaming a cluster from “miscellaneous complaints” to “emerging customer sentiment.” Just the integer: 10 topics, 30 topics, 80 topics, 200 topics?

In Latent Dirichlet Allocation, that number is not decoration. It changes the model’s fit, the granularity of discovered themes, the stability of outputs, and the amount of analyst time spent explaining why “billing issue” and “payment problem” became two separate topics. The usual answer is painfully familiar: try many values, train many LDA models, compare validation perplexity or coherence, then pick the one that looks least wrong.

A recent paper, “Topic Modelling Black Box Optimization,” treats this not as a topic-modeling ritual, but as a black-box optimization problem: each candidate topic count $T$ is a costly query, because evaluating it requires training an LDA model and measuring validation perplexity.¹ That reframing is the paper’s useful move. The interesting question is not “which optimizer won?” but why some optimizers can stop wasting queries earlier.

The real problem is not LDA; it is the price of asking one more question

LDA represents documents as mixtures of topics, and topics as probability distributions over words. The familiar matrix intuition is:

$$ [\text{Documents} \times \text{Words}] \approx [\text{Documents} \times \text{Topics}] \times [\text{Topics} \times \text{Words}] $$
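
Read entry by entry, the same intuition says that the probability of word $w$ appearing in document $d$ is a topic-weighted average. The symbols $\theta$ (document-topic weights) and $\phi$ (topic-word probabilities) are notation assumed here for illustration, and the left-hand matrix is understood as roughly row-normalized word frequencies:

$$ p(w \mid d) = \sum_{t=1}^{T} \theta_{d,t}\,\phi_{t,w}, \qquad \sum_{t=1}^{T} \theta_{d,t} = 1, \qquad \sum_{w} \phi_{t,w} = 1 $$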

The number of topics $T$ controls the width of the hidden middle layer. Too few topics, and the model compresses different themes into vague buckets. Too many, and it may split meaningful themes into fragile fragments. In this paper, the authors fix the LDA priors as $\alpha = \beta = 1/T$, then optimize $T$ using validation perplexity as the objective.

That matters because perplexity is not a free number. To know whether $T=40$ is better than $T=80$, the system must train and validate an LDA model. If the candidate space is large, exhaustive search becomes unattractive. If the corpus is large, even a modest grid becomes expensive. If the organization repeats the workflow across product reviews, support tickets, filings, news archives, internal memos, and multilingual datasets, the “just try a few values” habit becomes a recurring compute tax.

The paper’s formulation is therefore:

$$ \text{Perplexity}(D) \rightarrow \min_{T \in \mathcal{T}_{sanity}} $$

where $\mathcal{T}_{sanity}$ is the finite set of admissible topic counts. The function has no convenient analytic form, no gradient, and no cheap preview. You query it by training the model.

That is the mechanism-first point: the paper is not mainly about making LDA fashionable again. LDA is the stage. The actual plot is about reducing the number of expensive configuration trials needed before a usable model setting is found.
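
To make “each query costs a training run” concrete, here is a minimal sketch of a budgeted objective wrapper. Everything in it is an assumption for illustration: the class name `BudgetedObjective`, the toy quadratic `toy_perplexity`, and the simplification that one optimization step equals one distinct query. A real `evaluate` would train and validate an LDA model.

```python
from typing import Callable, Dict, List, Tuple


class BudgetedObjective:
    """Wrap an expensive objective so every distinct query is counted and cached."""

    def __init__(self, evaluate: Callable[[int], float], budget: int = 20) -> None:
        self.evaluate = evaluate      # e.g. "train LDA with T topics, return perplexity"
        self.budget = budget          # number of distinct queries allowed
        self.cache: Dict[int, float] = {}
        self.history: List[Tuple[int, float]] = []

    @property
    def remaining(self) -> int:
        return self.budget - len(self.history)

    def __call__(self, topic_count: int) -> float:
        if topic_count in self.cache:         # repeated T: no retraining, no cost
            return self.cache[topic_count]
        if self.remaining <= 0:
            raise RuntimeError("evaluation budget exhausted")
        value = self.evaluate(topic_count)
        self.cache[topic_count] = value
        self.history.append((topic_count, value))
        return value

    def best(self) -> Tuple[int, float]:
        """Return the (topic_count, value) pair with the lowest value seen so far."""
        return min(self.history, key=lambda item: item[1])


def toy_perplexity(topic_count: int) -> float:
    """Hypothetical stand-in for a real train-and-validate run."""
    return 1500.0 + 0.4 * (topic_count - 60) ** 2


objective = BudgetedObjective(toy_perplexity, budget=3)
for t in (20, 60, 20, 100):     # the second 20 is served from the cache
    objective(t)
print(objective.best(), objective.remaining)
```

Any optimizer that only sees `objective(T)` is forced to live inside the budget, which is the constraint the rest of the comparison is about.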

Four optimizers, four different ways of spending the query budget

The authors compare four methods under a fixed evaluation budget of 20 optimization steps per run:

Method	How it searches	Operational personality
Genetic Algorithm (GA)	Maintains a population of topic counts, uses tournament selection, binary crossover, mutation, and elitism	Methodical, but slow; it earns its result by spending most of the budget
Evolution Strategy (ES)	Mutates parent candidates and keeps the best parents/offspring under a $(\mu+\lambda)$ scheme	Simple local improvement; not especially well suited here
PABBO	Uses a pre-trained Transformer policy trained on synthetic functions, converts histories into preference-style feedback, and selects candidates through acquisition scores	Sometimes jumps quickly to a good region; faster but more variable
SABBO	Uses sharpness-aware black-box optimization with a Gaussian search distribution and local robustness logic	More expensive early, but the first move is unusually informative

This comparison is more useful than a conventional “optimizer leaderboard” because the methods do not merely differ in final score. They differ in how they consume uncertainty.

GA and ES behave like classical search workers: generate candidates, evaluate them, keep the better ones, repeat. PABBO behaves more like a learned selector: it has seen optimization-like tasks before, so it can make stronger early guesses. SABBO adds a different kind of discipline: it does not simply chase the lowest observed value; it updates a search distribution while considering local sharpness, trying to avoid brittle solutions that look good only at a narrow point.

In practical terms, GA and ES ask many small questions. PABBO asks more informed questions. SABBO asks a more expensive first question, but one that seems to land close to the useful region.

That difference is the economic center of the paper.
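
For readers who want to see what “generate candidates, evaluate them, keep the better ones” looks like over integer topic counts, here is a minimal $(\mu+\lambda)$ sketch in plain Python. It is not the paper’s DEAP configuration: the values of `mu`, `lam`, `step`, and `generations`, and the toy objective, are assumptions chosen so that at most $2 + 6 \times 3 = 20$ distinct queries are spent.

```python
import random
from typing import Callable, Dict, Tuple


def mu_plus_lambda_es(
    evaluate: Callable[[int], float],
    low: int,
    high: int,
    mu: int = 2,
    lam: int = 3,
    generations: int = 6,
    step: float = 15.0,
    seed: int = 0,
) -> Tuple[int, float, int]:
    """Minimise `evaluate` over integers in [low, high].

    Returns (best_T, best_value, number_of_distinct_queries).
    """
    rng = random.Random(seed)
    cache: Dict[int, float] = {}

    def query(t: int) -> float:
        if t not in cache:            # each new T would be one LDA training run
            cache[t] = evaluate(t)
        return cache[t]

    # Initial parents, sorted best-first.
    parents = sorted({rng.randint(low, high) for _ in range(mu)}, key=query)
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(parents)
            child = round(parent + rng.gauss(0.0, step))    # Gaussian mutation on an integer
            offspring.append(min(high, max(low, child)))    # clip to the admissible range
        # (mu + lambda): parents compete with offspring, so the best value never gets worse.
        pool = set(parents) | set(offspring)
        parents = sorted(pool, key=query)[:mu]
    best = parents[0]
    return best, query(best), len(cache)


def toy_perplexity(topic_count: int) -> float:
    """Hypothetical stand-in for a real train-and-validate run."""
    return 1500.0 + 0.4 * (topic_count - 60) ** 2


print(mu_plus_lambda_es(toy_perplexity, low=5, high=200))
```

The sketch makes the cost profile visible: progress comes from many small mutations, and each mutation that lands on a new topic count consumes one unit of budget.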

The experiments test search efficiency, not topic interpretability

The paper uses four corpora: 20 Newsgroups, AG News, Yelp Reviews, and a custom mixed validation corpus called Val_out, formed from the three source datasets. Text is preprocessed with lowercase normalization, stop-word removal, token filtering, and CountVectorizer, with the vocabulary capped at 10,000 terms. LDA is trained with online variational inference. Each dataset is tested across 10 independent trials.
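
A rough sketch of what one query looks like with scikit-learn, assuming the settings described above (10,000-term vocabulary cap, online variational inference, priors $\alpha = \beta = 1/T$). The function name, the English stop-word list, and the seed handling are assumptions; the paper’s exact token filtering and training hyperparameters are not reproduced here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def validation_perplexity(train_texts, val_texts, topic_count, seed=0):
    """Train LDA with `topic_count` topics and return held-out perplexity."""
    vectorizer = CountVectorizer(
        lowercase=True,
        stop_words="english",   # assumed stop-word list
        max_features=10_000,    # vocabulary cap
    )
    x_train = vectorizer.fit_transform(train_texts)
    x_val = vectorizer.transform(val_texts)   # reuse the training vocabulary

    lda = LatentDirichletAllocation(
        n_components=topic_count,
        doc_topic_prior=1.0 / topic_count,    # alpha = 1/T
        topic_word_prior=1.0 / topic_count,   # beta = 1/T
        learning_method="online",             # online variational inference
        random_state=seed,
    )
    lda.fit(x_train)
    return lda.perplexity(x_val)
```

Binding the training and validation texts (for example with `functools.partial`) leaves a function of `topic_count` alone, which is the black box the optimizers query.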

The evidence has three layers:

Evidence component	Likely purpose	What it supports	What it does not prove
Figure 2: best validation perplexity by number of evaluations	Main evidence on sample efficiency	SABBO and PABBO reach good perplexity with fewer LDA evaluations than GA and ES	That the resulting topics are more human-interpretable
Figure 3: best validation perplexity by wall-clock time	Main evidence on runtime efficiency	GA is slower overall; PABBO benefits from inexpensive iterations; SABBO’s first evaluation is costly but informative	That one method is always cheaper under every implementation or hardware setup
Table 1: final best perplexity after 20 evaluations	Final performance comparison	SABBO has the best final perplexity on all four datasets; PABBO and GA split second place (PABBO on Val_out and Yelp, GA on 20 Newsgroups and AG News); ES is weakest	That the advantage generalizes beyond LDA/perplexity/topic-count selection
Val_out mixed corpus	Robustness-style check across heterogeneous text sources	The ranking is not limited to one clean benchmark corpus	That production domain shift is fully solved
Appendix implementation details	Reproducibility and configuration context	Shows scikit-learn LDA, DEAP for GA/ES, PyTorch for PABBO, and SABBO implementation choices	A separate methodological thesis

This distinction matters because the tempting misread is obvious: “SABBO produces better topics.” That is not what the paper directly shows.

The paper measures validation perplexity. Lower perplexity means the model assigns higher likelihood to held-out text under the model’s assumptions. That is useful, but it is not the same as topic coherence, analyst trust, or business usability. The paper itself discusses prior work showing that perplexity and interpretability can diverge. So the responsible conclusion is narrower and more useful: SABBO and PABBO are better at finding low-perplexity topic counts under the paper’s experimental setup.

That is already valuable. It just is not magic dust.
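
For reference, the standard textbook definition of held-out perplexity, written with assumed notation ($\mathbf{w}_d$ for the words of document $d$, $N_d$ for its token count), is:

$$ \text{Perplexity}(D_{\text{val}}) = \exp\left( - \frac{\sum_{d \in D_{\text{val}}} \log p(\mathbf{w}_d)}{\sum_{d \in D_{\text{val}}} N_d} \right) $$

Lower is better because the exponent is the negative average log-likelihood per token. For LDA the exact log-likelihood is intractable and is approximated, typically by a variational bound, so reported values depend on the inference procedure as well as on $T$.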

The main result is about early movement, not just final ranking

The final table is clean enough:

Dataset	GA	ES	PABBO	SABBO
20NEWS	1776.35 ± 185.94	2056.91 ± 173.29	1809.55 ± 103.79	1679.96 ± 25.09
AGNEWS	2154.98 ± 23.82	3800.07 ± 359.69	2184.55 ± 40.17	2150.56 ± 22.05
VAL_OUT	1653.49 ± 196.91	2448.87 ± 201.28	1565.99 ± 27.31	1557.91 ± 30.50
YELP	1378.81 ± 104.88	1822.72 ± 54.55	1356.98 ± 32.48	1351.21 ± 24.32

After 20 evaluations, SABBO achieves the best final validation perplexity on all four datasets. PABBO is the runner-up on Val_out and Yelp. GA is respectable: it is the runner-up by final mean on 20 Newsgroups and AG News, but it is slower, and its spread is much wider than PABBO’s on every corpus except AG News. ES performs worst.

If we stopped there, the article would be dull but accurate: “SABBO wins; PABBO and GA trade second place; ES sad.” Thank you, leaderboard, very enlightening.
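
The leaderboard reading can be checked mechanically. The sketch below copies the final mean perplexities from the table and ranks the methods per dataset; only the helper name `rank_methods` is invented.

```python
# Final mean validation perplexity after 20 evaluations (means only, from the table).
FINAL_PERPLEXITY = {
    "20NEWS":  {"GA": 1776.35, "ES": 2056.91, "PABBO": 1809.55, "SABBO": 1679.96},
    "AGNEWS":  {"GA": 2154.98, "ES": 3800.07, "PABBO": 2184.55, "SABBO": 2150.56},
    "VAL_OUT": {"GA": 1653.49, "ES": 2448.87, "PABBO": 1565.99, "SABBO": 1557.91},
    "YELP":    {"GA": 1378.81, "ES": 1822.72, "PABBO": 1356.98, "SABBO": 1351.21},
}


def rank_methods(scores):
    """Return method names ordered from lowest (best) to highest perplexity."""
    return sorted(scores, key=scores.get)


for dataset, scores in FINAL_PERPLEXITY.items():
    ranking = rank_methods(scores)
    gap = scores[ranking[1]] - scores[ranking[0]]
    print(f"{dataset}: {' < '.join(ranking)} (gap to runner-up: {gap:.2f})")
```

SABBO is first and ES last on every corpus. The runner-up is GA on 20NEWS and AGNEWS and PABBO on VAL_OUT and YELP. On AGNEWS, VAL_OUT, and YELP the gap between first and second is smaller than the reported standard deviations, so “best mean” is the safe reading rather than “decisively better.”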

The more important result appears in the convergence behavior. The paper reports that GA improves steadily but needs nearly the full 20-evaluation budget to reach the final perplexity band. ES improves but remains clearly above the others. PABBO often makes large jumps: sometimes it proposes a strong topic count early, sometimes it needs a few more evaluations, and this creates wider confidence bands. SABBO is the sharpest contrast: across corpora, its curve drops close to its final level after essentially one evaluation, with later iterations mostly refining an already competitive solution.

That changes the interpretation of the table. The table tells us who ended with the best validation perplexity. The convergence curves tell us who found the useful neighborhood before the budget was mostly gone.

For business use, the second question is usually more important. A production analytics team rarely cares whether an optimizer is slightly prettier after 20 controlled trials. It cares whether the model-selection process can stop after a few expensive runs without embarrassing itself.
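
The difference between “who ended best” and “who got there early” is easy to compute from a run’s trace. A minimal sketch, assuming a trace is just the list of perplexities in query order; the two traces and the 2% tolerance are invented for illustration and are not the paper’s data.

```python
from itertools import accumulate
from typing import List, Optional


def best_so_far(values: List[float]) -> List[float]:
    """Running minimum: the curve plotted as 'best perplexity vs. evaluations'."""
    return list(accumulate(values, min))


def evaluations_to_reach(values: List[float], tolerance: float = 0.02) -> Optional[int]:
    """1-based index of the first evaluation whose best-so-far value is
    within `tolerance` (relative) of the final best value."""
    if not values:
        return None
    curve = best_so_far(values)
    target = curve[-1] * (1.0 + tolerance)
    for step, value in enumerate(curve, start=1):
        if value <= target:
            return step
    return None


# Hypothetical traces: one improves steadily, one lands near its final level at once.
steady = [2400.0, 2250.0, 2100.0, 1980.0, 1900.0, 1820.0, 1790.0, 1780.0]
early = [1700.0, 1750.0, 1690.0, 1720.0, 1685.0, 1700.0, 1684.0, 1690.0]

print(evaluations_to_reach(steady), evaluations_to_reach(early))
```

A stopping rule built on this number is what lets a team halt after a few runs instead of spending the full budget by habit.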

Why SABBO’s first move matters

SABBO’s behavior is the paper’s most interesting mechanism.

A naive optimizer treats each candidate value as a point to test. Try $T=20$. Try $T=35$. Try $T=70$. Keep the best. This is not irrational, but it is sample-hungry. It is especially clumsy when each test is expensive.

SABBO instead works through a search distribution. It samples candidate points, estimates how the objective behaves around the current region, and updates the distribution while considering local sharpness. The aim is not merely to find a point that looks good, but to move toward a region that remains good under nearby perturbations.

For LDA topic-count selection, this is plausible. Topic-count performance is usually not a perfectly smooth curve, but it also is not pure chaos. There may be bands where the model fit is stable and bands where performance is sensitive. A sharpness-aware optimizer can benefit if low-perplexity regions are not isolated needles but locally meaningful neighborhoods.

This is why the paper’s “one evaluation” result should not be read as clairvoyance. SABBO is not mystically guessing the exact number of topics. It is using a search procedure that makes its early query more informative. The first point lands near the final performance band; subsequent evaluations mostly confirm or refine.

In a business setting, that distinction is crucial. The value is not that SABBO knows the “true” number of topics. The value is that it reduces the cost of reaching a good enough configuration under a constrained evaluation budget.

And yes, “good enough” is doing real work here. In business analytics, the best model after unlimited tuning is often less valuable than the useful model delivered before the meeting becomes a postmortem.
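
The idea of preferring a region that “remains good under nearby perturbations” can be illustrated without any of SABBO’s machinery. The sketch below is not SABBO’s update rule: it simply scores each evaluated topic count by the worst perplexity in its neighbourhood and compares that choice with the plain minimum. The observed values and the radius are hypothetical.

```python
from typing import Dict


def robust_score(perplexity_by_t: Dict[int, float], t: int, radius: int) -> float:
    """Worst perplexity among evaluated topic counts within `radius` of `t`."""
    neighbours = [v for k, v in perplexity_by_t.items() if abs(k - t) <= radius]
    return max(neighbours)


def pick_pointwise(perplexity_by_t: Dict[int, float]) -> int:
    """Topic count with the lowest observed perplexity."""
    return min(perplexity_by_t, key=perplexity_by_t.get)


def pick_robust(perplexity_by_t: Dict[int, float], radius: int = 10) -> int:
    """Topic count whose whole neighbourhood has the lowest worst-case perplexity."""
    return min(perplexity_by_t, key=lambda t: robust_score(perplexity_by_t, t, radius))


# Hypothetical landscape: a sharp dip at T=30 and a flatter basin around T=60.
observed = {20: 1900.0, 30: 1650.0, 40: 1880.0, 50: 1700.0,
            60: 1690.0, 70: 1695.0, 80: 1760.0}

print(pick_pointwise(observed), pick_robust(observed))
```

On this toy landscape the pointwise choice is the isolated dip at 30, while the neighbourhood-aware choice is 60, whose neighbours are nearly as good. That is the kind of “locally meaningful neighbourhood” a sharpness-aware search is designed to favour.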

PABBO is the reminder that learned search can be useful even when it is uneven

PABBO is less clean than SABBO but still important.

The method uses a pre-trained Transformer policy, trained on synthetic functions such as GP1D and Rastrigin, then transferred to the LDA topic-count problem. It uses optimization history and candidate sets to assign acquisition scores, with an exploration component that sometimes samples randomly and sometimes exploits the policy’s candidate distribution.

The result is jumpy. In some runs, PABBO quickly proposes a topic count close to the final optimum. In others, early guesses are less successful. The paper explicitly notes wide confidence bands around the mean curve. That variability is not a footnote; it is part of the operational profile.

Still, PABBO clearly beats ES everywhere and reaches good regions in fewer evaluations than GA, although GA’s final mean is slightly better on 20 Newsgroups and AG News. It ends in the same broad final band as SABBO. Its time-based performance is also attractive because its iterations are comparatively inexpensive. The paper notes that within the time required for a single ES or SABBO evaluation, PABBO can often perform multiple evaluations and enter a good region quickly.

This makes PABBO a useful kind of imperfect tool: not always graceful, but often efficient. For organizations tuning models across many related text corpora, a learned optimizer that can make decent early guesses may be valuable even if it occasionally needs extra checks.

The lesson is not “replace all search with a Transformer.” Please do not make that slide. The better lesson is that learned acquisition policies can compress repeated tuning experience into a reusable search habit.
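
The data flow described in this section (history turned into preference-style feedback, candidates scored, then a choice between exploring and exploiting) can be sketched in a few lines. This is emphatically not PABBO: the pre-trained Transformer is replaced by a hand-written placeholder score, and every name and constant below (`win_count_score`, `bandwidth`, `epsilon`) is an assumption.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def history_to_preferences(history: Sequence[Tuple[int, float]]) -> List[Tuple[int, int]]:
    """Turn (T, perplexity) pairs into (winner, loser) pairs; lower perplexity wins."""
    prefs = []
    for i, (t_a, p_a) in enumerate(history):
        for t_b, p_b in history[i + 1:]:
            if p_a < p_b:
                prefs.append((t_a, t_b))
            elif p_b < p_a:
                prefs.append((t_b, t_a))
    return prefs


def win_count_score(prefs: List[Tuple[int, int]], bandwidth: float = 15.0) -> Callable[[int], float]:
    """Placeholder acquisition: reward closeness to winners, penalise closeness to losers."""
    def score(t: int) -> float:
        total = 0.0
        for winner, loser in prefs:
            total += math.exp(-abs(t - winner) / bandwidth)
            total -= math.exp(-abs(t - loser) / bandwidth)
        return total
    return score


def select_candidate(candidates: Sequence[int], score: Callable[[int], float],
                     epsilon: float, rng: random.Random) -> int:
    """With probability epsilon explore at random, otherwise take the top-scoring candidate."""
    if rng.random() < epsilon:
        return rng.choice(list(candidates))
    return max(candidates, key=score)


history = [(20, 1900.0), (60, 1600.0), (100, 1750.0)]   # hypothetical past queries
prefs = history_to_preferences(history)
print(prefs)
print(select_candidate([10, 55, 95], win_count_score(prefs), epsilon=0.1, rng=random.Random(0)))
```

The random branch is one plausible source of the run-to-run variability noted above: an exploratory draw early in a 20-query budget can cost a visible share of it.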

The business value is cheaper configuration, not automatic understanding

The direct business relevance is easiest to see in recurring text analytics workflows.

A company may use topic models to organize:

Business corpus	Typical use	Why topic-count tuning matters
Customer reviews	Discover recurring complaints, product strengths, and market segments	Too few topics hide product-specific issues; too many create noisy micro-themes
Support tickets	Route problems, identify operational bottlenecks, summarize incident patterns	Topic granularity affects triage and escalation
Internal documents	Map knowledge areas across teams or projects	Poor topic counts create misleading “knowledge maps”
News and filings	Track industry themes, risks, and competitor narratives	Over-fragmentation makes monitoring harder
Survey responses	Summarize open-ended feedback	Topic stability affects whether managers trust the output

In all of these settings, topic modeling is often a middle step, not the final product. The user wants a dashboard, taxonomy, report, alerting system, or search layer. Tuning $T$ is necessary plumbing. Expensive plumbing, but still plumbing.

The paper suggests a practical workflow shift:

Treat topic-count selection as a black-box optimization task.
Stop using exhaustive or naive grid search as the default.
Use efficient optimizers to minimize costly LDA evaluation runs.
Add human-facing validation afterward, especially coherence and interpretability checks.
Reuse optimization experience across related corpora where possible.

That last point is where the paper’s future discussion becomes business-relevant. The authors suggest that once multiple corpora and empirically determined optimal topic counts are collected, estimating $T^\ast$ could become a supervised learning problem. They also discuss reinforcement learning, where corpus features become the state, the topic count becomes the action, and model quality becomes the reward.

For enterprise MLOps, this points toward a model-selection service rather than a one-off tuning script. The system would not just train LDA; it would learn how to configure LDA faster over time.

That is a modest claim, but it is a useful one.
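
As a sketch of the supervised framing, past tuning runs can be stored as (corpus features, selected topic count) records and used to propose a starting point for a new corpus. The nearest-neighbour predictor, the choice of features, and all numbers below are assumptions for illustration; the paper only outlines the idea.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TuningRecord:
    features: Tuple[float, ...]   # hypothetical corpus descriptors, e.g. log10 sizes
    best_topic_count: int         # T selected by an earlier optimisation run


def predict_topic_count(records: Sequence[TuningRecord],
                        features: Sequence[float], k: int = 3) -> int:
    """Average the selected T of the k most similar past corpora (Euclidean distance)."""
    if not records:
        raise ValueError("no tuning history available")
    nearest = sorted(records, key=lambda r: math.dist(r.features, features))[:k]
    return round(sum(r.best_topic_count for r in nearest) / len(nearest))


# Hypothetical history: (log10 documents, log10 vocabulary) -> selected topic count.
history = [
    TuningRecord((4.0, 3.5), 40),
    TuningRecord((4.2, 3.6), 50),
    TuningRecord((5.5, 4.0), 120),
    TuningRecord((5.6, 4.1), 130),
]

print(predict_topic_count(history, (4.1, 3.5), k=2))
```

A prediction like this would serve as a warm start for the optimizer, not as a replacement for evaluating the candidate.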

What the paper directly shows, and what Cognaptus would infer

A clean boundary is necessary here, because this topic is unusually easy to oversell.

Layer	Statement
What the paper directly shows	Under a 20-query budget, on four text corpora, using validation perplexity for LDA topic-count selection, SABBO reaches the best final perplexity and usually gets near its final result after roughly one evaluation. PABBO is more sample-efficient than GA and ES and is second on Val_out and Yelp, while GA’s final mean is slightly better on 20 Newsgroups and AG News. ES is weakest.
What Cognaptus infers for business use	If an organization repeatedly tunes LDA topic counts across corpora, efficient black-box optimization can reduce compute waste and analyst waiting time. The most valuable benefit is early convergence, not merely a slightly better final score.
What remains uncertain	Whether the same ranking holds for other topic-model objectives, coherence-driven selection, embedding-based clustering, BERTopic-style workflows, multilingual corpora, domain-specific production data, or human-reviewed taxonomy quality.

This is not a “topic models are solved” paper. It is closer to a systems paper about configuration economics. The optimizer is not replacing interpretation. It is making the path to a candidate configuration less wasteful.

The human still has to ask whether the topics make sense.

Terrible, I know. Reality remains employed.

The boundary: perplexity is not the same as usefulness

The paper is refreshingly explicit about the broader evaluation issue. Perplexity is a standard measure, but topic modeling quality has several dimensions: statistical fit, coherence, stability, redundancy, and human interpretability. Prior research discussed in the paper shows that no single metric is universally optimal.

That creates a practical boundary. A low-perplexity topic model can still produce topics that analysts find vague, redundant, or operationally useless. Conversely, a slightly worse perplexity score may produce topics that are easier to name, monitor, and explain.

So the right deployment pattern is not:

Optimize perplexity, publish topics, declare insight.

The better pattern is:

Use black-box optimization to cheaply identify promising topic-count candidates, then evaluate them with coherence, stability, redundancy, and human review.

In other words, SABBO or PABBO can reduce the cost of reaching candidate models. They do not eliminate the need to judge those models.

This is especially important for enterprise search and RAG-adjacent workflows. Many teams now use topic modeling or clustering as a preprocessing step for document organization, content routing, or knowledge-base structuring. The business failure mode is not only poor statistical fit. It is building a taxonomy that users do not understand, cannot navigate, and quietly stop trusting.

Perplexity can help. It cannot carry the whole governance burden.
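
The “shortlist, then judge” pattern can be written down directly. A minimal sketch, assuming the optimizer has already produced perplexities for a handful of topic counts and that a second metric (here called coherence) is computed separately; all values, names, and the 2% tolerance are hypothetical.

```python
from typing import Dict, List


def shortlist_by_perplexity(perplexity_by_t: Dict[int, float],
                            tolerance: float = 0.02) -> List[int]:
    """Keep every topic count whose perplexity is within `tolerance` of the best one."""
    best = min(perplexity_by_t.values())
    return sorted(t for t, p in perplexity_by_t.items() if p <= best * (1.0 + tolerance))


def choose_with_secondary(shortlist: List[int], secondary_score: Dict[int, float]) -> int:
    """Among shortlisted topic counts, pick the one with the highest secondary score."""
    return max(shortlist, key=lambda t: secondary_score[t])


# Hypothetical optimizer output and follow-up coherence scores for the shortlist.
perplexity_by_t = {30: 1720.0, 50: 1660.0, 60: 1650.0, 80: 1675.0, 120: 1800.0}
coherence_by_t = {50: 0.52, 60: 0.47, 80: 0.49}

candidates = shortlist_by_perplexity(perplexity_by_t)
print(candidates, choose_with_secondary(candidates, coherence_by_t))
```

Here the lowest-perplexity setting (60) loses to a near-tie (50) that scores better on the second metric. The final call still belongs to a human reviewer; the code only narrows what they have to look at.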

The quiet strategic lesson: optimization experience becomes an asset

The paper’s future direction is worth taking seriously. If an organization accumulates many corpora and their empirically selected topic counts, it can begin to treat topic-count prediction as a reusable learning problem. That changes the economics.

The first project pays the old price: train, test, tune. The second project can reuse some optimizer structure. The tenth project may benefit from a learned selection policy. The hundredth project may have enough internal history to make configuration itself a semi-automated capability.

This is the deeper business pathway:

Technical step	Operational consequence	ROI relevance
Reframe $T$ selection as black-box optimization	Topic tuning becomes a budgeted search process	Fewer wasted LDA training runs
Use sample-efficient optimizers	Good candidate topic counts appear earlier	Faster analyst cycles
Track tuning histories across corpora	Configuration experience becomes data	Reusable internal model-selection knowledge
Add coherence and human validation	Statistical fit is checked against usability	Lower risk of beautiful but useless topics
Build a model-selection layer	Topic modeling becomes repeatable infrastructure	Better scaling across teams and datasets

This is where the paper is more useful than its narrow experimental setting might suggest. Even if a company does not deploy SABBO tomorrow, it should notice the design pattern: configuration choices are not clerical. They are optimization problems with memory.

For Cognaptus-style automation work, that distinction matters. A fragile automation pipeline hard-codes the analyst’s current habit. A better one observes the tuning process, learns where search is wasteful, and gradually turns repeated judgment into reusable infrastructure.

Conclusion: stop treating topic count as a guessing ritual

The paper’s contribution is not that LDA suddenly became new. It did not. LDA remains LDA: useful, limited, interpretable in the old probabilistic way, and still capable of producing topics that require a human to squint politely.

The contribution is sharper: choosing the number of topics can be treated as a discrete black-box optimization problem, and the optimizer’s efficiency matters because every query means another model training-and-validation run. Under the paper’s setup, SABBO is strongest, PABBO is a credible runner-up on search efficiency, GA is serviceable but slow, and ES is mostly there to remind us that “simple baseline” is not a synonym for “good enough.”

The practical lesson is not to worship SABBO. The practical lesson is to stop pretending that repeated manual tuning is free.

When organizations build text analytics systems, they often focus on the visible layer: dashboards, summaries, labels, and search interfaces. But the hidden configuration layer quietly consumes compute, time, and analyst patience. If that layer becomes more intelligent, the whole workflow becomes less brittle.

Choosing topics without counting every possibility is not glamorous. It is just the sort of small automation improvement that compounds.

Which, annoyingly, is how useful systems are usually built.

Cognaptus: Automate the Present, Incubate the Future.

¹ Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, and Roman Ischenko, “Topic Modelling Black Box Optimization,” arXiv:2512.16445, 2025.