Outrun the Herd, Not the Lion: A Smarter AI Strategy for Business Games

TL;DR for operators

Search-contempt is not “AI plays worse so it learns more”. That would be the lazy interpretation, and business strategy already has enough lazy interpretations wearing expensive shoes.

The paper introduces a hybrid MCTS method for AlphaZero-like self-play systems. It behaves like standard PUCT search for the player to move, but at opponent nodes it freezes the visit distribution once a visit threshold, $N_{scl}$, is reached, and from then on samples from that frozen distribution rather than continuing to update it toward stronger play.¹ The effect is subtle but important: the system stops assuming the opponent will always improve its response with more search.

That changes the training data. Instead of repeatedly drifting toward clean, high-quality, drawish positions, search-contempt steers games toward positions where the network or search procedure misunderstands danger, tactics, or opponent resources. In chess language, these are “interesting” positions. In business language, they are where forecasts break, rivals hesitate, governance committees stall, and models quietly overfit to polite assumptions.

The paper’s strongest direct evidence is in chess, especially odds chess and regular chess self-play. In queen odds chess, search-contempt beats PUCT at the same node counts: at 800 nodes, the reported win rate is 57.81% versus 35.25%; at 20,000 nodes, 81.73% versus 25.40%. The author also reports an additional roughly 150 Elo gain in Leela odds bots after adding search-contempt. In regular chess self-play, tuning $N_{scl}$ appears to produce useful win-draw-loss distributions while preserving more playing strength than simply increasing temperature; in one 100-game comparison, search-contempt scores a reported 70.43 Elo above tuned PUCT under comparable 1,000-visit conditions.

For operators, the lesson is not that every company needs a chess engine. Please, no. The lesson is that training efficiency may improve when systems actively search for their own blind spots instead of sampling more of the same competent-but-uninformative experience. The practical question becomes: can your AI workflow identify the situations where its own opponent model, customer model, market model, or risk model is most likely to be wrong?

The trick is to stop modelling the opponent as a god

Many AI decision systems inherit a quiet fantasy: the opponent is optimal.

In board games, this is a useful simplification. In markets, procurement, pricing, logistics, hiring, product launches, regulation, and corporate politics, it is often adorable nonsense. Competitors delay. Customers misunderstand. Incumbents protect old margins. Executives wait for consensus. Risk committees ask for another quarter of data, because apparently the future becomes safer after a spreadsheet has aged.

Search-contempt starts from this same tension, but inside chess search. AlphaZero-like systems use Monte Carlo Tree Search guided by neural-network priors and value estimates. Standard PUCT search keeps revisiting and updating candidate moves as more evidence accumulates. This makes sense when the goal is strong play: more search should refine the estimate.

But self-play training has a different requirement. It needs variety and information, not merely elegance. If both sides always play extremely cleanly, chess games drift toward draws and repeated high-quality patterns. That may be strong play, but it is not necessarily rich training data. A model that only studies smooth equilibrium positions is like a business analyst who has only read board-approved strategy decks. It may sound sophisticated, until something inconvenient happens.

The paper’s proposal is to introduce asymmetry into the search. At the root and at nodes belonging to the player to move, search-contempt uses PUCT. At opponent nodes, it uses PUCT only until the subtree reaches $N_{scl}$. Once that threshold is crossed, it freezes the child visit distribution and samples from it thereafter.

That means the opponent’s search no longer keeps improving indefinitely. The acting player can still search more deeply, but the imagined opponent response is held closer to an earlier, more fallible estimate.

This is the mechanism. Everything else in the paper follows from it.
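
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the `Node` class, the `c_puct` constant, the exact form of the exploration term, and the choice to measure the threshold by summed child visits are all assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    prior: float                       # policy prior for the move leading to this node
    is_opponent: bool = False          # True when the root player's opponent moves here
    visits: int = 0                    # visit count for this node
    value_sum: float = 0.0             # summed value, from the parent mover's viewpoint
    children: Dict[str, "Node"] = field(default_factory=dict)
    frozen: Optional[Dict[str, float]] = None   # visit distribution fixed at the threshold


def puct_child(node, c_puct=1.5):
    """Standard PUCT: pick the child maximising Q + U (constants are assumptions)."""
    total = sum(child.visits for child in node.children.values())

    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return q + u

    return max(node.children.items(), key=lambda item: score(item[1]))


def select_child(node, n_scl, rng=random):
    """PUCT everywhere, except opponent nodes that have reached the N_scl threshold."""
    child_visits = sum(child.visits for child in node.children.values())
    if not node.is_opponent or child_visits < max(n_scl, 1):
        return puct_child(node)
    if node.frozen is None:
        # Freeze once: later visits no longer sharpen the opponent's choice.
        node.frozen = {move: child.visits / child_visits
                       for move, child in node.children.items() if child.visits > 0}
    moves = list(node.frozen)
    move = rng.choices(moves, weights=[node.frozen[m] for m in moves])[0]
    return move, node.children[move]
```

The design point to notice is that `frozen` is written exactly once. After that, extra search budget flows into the acting player's lines, while the imagined opponent keeps choosing according to its earlier, shallower estimate.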

Freezing the opponent changes the distribution, not just the strength

The important part of search-contempt is not that it adds randomness. Temperature already does that.

In AlphaZero-style self-play, temperature controls how moves are selected after search. Higher temperature creates more variety by allowing less-favoured moves to be played. That is useful for exploration, but it weakens the game. The system gets variety by lowering the quality of action selection.
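
For contrast, here is a minimal sketch of temperature-based move selection as it is commonly described for AlphaZero-style systems, where the probability of playing a move is proportional to its root visit count raised to $1/T$. The function name and the example visit counts are hypothetical, and the sketch assumes at least one move has been visited.

```python
def temperature_policy(visit_counts, temperature):
    """Convert root visit counts into move probabilities, p(a) proportional to N(a)^(1/T)."""
    if temperature <= 0:
        # Limit T -> 0: always play the most-visited move.
        best = max(visit_counts, key=visit_counts.get)
        return {move: 1.0 if move == best else 0.0 for move in visit_counts}
    peak = max(visit_counts.values())  # rescale first so small T cannot overflow
    weights = {move: (n / peak) ** (1.0 / temperature)
               for move, n in visit_counts.items()}
    total = sum(weights.values())
    return {move: w / total for move, w in weights.items()}


# Hypothetical visit counts after a 1,000-visit search.
visits = {"e4": 600, "d4": 300, "a4": 100}
for t in (0.0, 1.0, 2.0):
    print(t, temperature_policy(visits, t))
```

Raising `temperature` flattens the distribution, so the rarely visited `a4` gets played more often. Nothing about the search itself changes; only the final pick gets noisier.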

Search-contempt changes something earlier: the internal search trajectory. It asks the system to explore lines where the opponent may fail to respond correctly. That is a different kind of exploration. It is not merely “pick a worse move sometimes”; it is “look harder at positions where the model’s opponent assumptions are fragile”.

The distinction matters because training games are not valuable just because they are diverse. They are valuable when they expose errors that future training can correct. Random weirdness is cheap. Diagnostic weirdness is expensive.

The author argues that low $N_{scl}$ values make self-play games more tactically rich: piece imbalances, trapped pieces, perpetual checks, sacrifices, dynamic stalemates, and short tactical motifs become more common. The paper describes this as search-contempt seeking positions that the network evaluates badly at lower node counts but more accurately at higher node counts. That is why the algorithm resembles a self-adversarial attack on the model’s own policy and value estimates.
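
One way to make "evaluated badly at low node counts, more accurately at high node counts" concrete is to rank positions by the gap between a shallow and a deep value estimate. The sketch below is an illustration of that idea only: the paper does not describe this procedure, the function name is hypothetical, and the numbers are invented.

```python
def rank_by_eval_gap(records):
    """Sort positions by |shallow - deep| value estimate, largest gap first."""
    return sorted(records, key=lambda r: abs(r["shallow"] - r["deep"]), reverse=True)


# Hypothetical values in [-1, 1] from the side to move; not taken from the paper.
records = [
    {"position": "quiet endgame", "shallow": 0.05, "deep": 0.02},
    {"position": "trapped piece", "shallow": 0.40, "deep": -0.60},
    {"position": "sound sacrifice", "shallow": -0.20, "deep": 0.35},
]

for record in rank_by_eval_gap(records):
    gap = abs(record["shallow"] - record["deep"])
    print(f'{record["position"]}: gap {gap:.2f}')
```

Positions at the top of such a ranking are the diagnostic ones: the cheap estimate and the expensive estimate disagree, so there is something for training to correct.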

So the reader misconception to avoid is this: search-contempt is not simply “playing weaker”. It is controlled weakness in the opponent model, used to generate positions that reveal where the system is brittle.

That is a much more useful idea.

Odds chess is the cleanest demonstration of the mechanism

The odds chess results are the easiest to understand because the premise is asymmetric from the start.

In odds chess, one side gives material odds: a knight, rook, or queen is missing. The stronger player is handicapped; the weaker player has a material advantage. Standard engine search has a problem here. As node counts rise, PUCT-based play initially improves, but eventually begins to resemble strong non-odds chess behaviour. That can be counterproductive, because odds chess is not about playing the theoretically cleanest engine chess. It is about exploiting a weaker opponent while starting from an objectively inferior position.

The paper reports that older Leela odds engines therefore had practical node limits: around 800 nodes for queen odds, 10,000 for rook odds, and 20,000 for knight odds. More search was not automatically better. That should make every enterprise AI team mildly uncomfortable. “More compute” is a strategy only until the objective changes.

In queen odds chess, the reported comparison is stark:

| Nodes searched | PUCT win rate | Search-contempt win rate | Interpretation |
| --- | --- | --- | --- |
| 100 | 23.43% | 36.76% | Search-contempt already performs better at low compute. |
| 800 | 35.25% | 57.81% | At the old practical sweet spot for PUCT, search-contempt is far ahead. |
| 20,000 | 25.40% | 81.73% | PUCT degrades, while search-contempt scales with search. |

This experiment most likely serves as the paper's main evidence for playing-strength improvement under asymmetric conditions. It is not an ablation of every design choice, and it does not prove general reinforcement-learning efficiency across domains. It does, however, show that the mechanism works exactly where it should: when the stronger side benefits from modelling the weaker side as imperfect.

That makes odds chess more than a cute demonstration. It is the cleanest version of the business analogy. In many competitive settings, the goal is not to beat an ideal rival. It is to beat the rival that actually exists: capital-constrained, politically slow, distracted, risk-averse, mispriced, or trapped by its own prior commitments.

Search-contempt formalises that distinction inside search.

Regular chess self-play turns the idea into a training-data argument

The second part of the paper is more important for AI training economics.

Regular chess is highly drawish at strong levels. If self-play produces too many draws, the training distribution may be high-quality but narrow. If the system uses high temperature to force variety, it gets more wins and losses, but at the cost of weaker play. The training data becomes broader, but noisier.

The paper uses the win-draw-loss distribution as a practical proxy for whether self-play games are useful for training. In chess, the author focuses on the ratio $(w + l)/d$, where $w$, $l$, and $d$ are wins, losses, and draws from White’s perspective. The ideal is not necessarily maximum strength; it is a distribution with enough decisive games to train from, without turning self-play into nonsense.
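
The ratio is simple enough to compute by hand, but a small helper makes the edge case explicit. The function name and the game counts below are hypothetical, chosen only to show a ratio of exactly 1.

```python
def decisive_ratio(wins, losses, draws):
    """Return (w + l) / d, the share of decisive games relative to draws."""
    if draws == 0:
        # No draws at all: the ratio is unbounded, so report infinity.
        return float("inf")
    return (wins + losses) / draws


# Hypothetical self-play batch: 30 wins, 20 losses, 50 draws from White's perspective.
print(decisive_ratio(wins=30, losses=20, draws=50))  # 1.0
```

A ratio near 1 means decisive games and draws are roughly balanced. A ratio near 0 means the batch is almost entirely draws.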

Search-contempt gives the trainer another lever. By lowering $N_{scl}$, the system increases decisive outcomes because each player assumes the opponent is weaker at search. The paper reports that when $N_{scl}$ is very high, the method reduces to normal PUCT behaviour, producing strong but draw-heavy games. As $N_{scl}$ falls below 50, wins and losses increase. Around $N_{scl}=5$ with 1,000 visits per move, the paper identifies a training-suitable ratio close to 1.

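Temperature can also move the distribution. The difference is the cost in playing strength. The paper compares two configurations, one search-contempt and one PUCT, each tuned to produce a similar win-draw-loss ratio of about 1.38. The table below places that comparison alongside the paper's other tests: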

| Test | Likely purpose | Reported result | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| Vary $N_{scl}$ in search-contempt self-play | Sensitivity test for the new parameter | Lower $N_{scl}$ increases decisive games and changes game character | $N_{scl}$ can tune training distribution | That every lower value improves final trained networks |
| Vary temperature in PUCT self-play | Baseline comparison | Higher temperature improves decisive-game ratio but weakens play | Temperature creates variety, but crudely | That temperature is unnecessary |
| Match tuned search-contempt vs tuned PUCT | Main comparative evidence for training-game quality | Search-contempt scores 37 wins, 46 draws, 17 losses; reported 70.43 Elo advantage | Similar distribution can be achieved with stronger games | That final training from scratch has already been demonstrated |
| Repeated-game analysis | Robustness/sanity check | Lower $N_{scl}$ does not materially worsen repetition; games are nearly all unique by around move 20 | Search-contempt is not merely collapsing into repeated tricks | That diversity is sufficient for all learning tasks |
| Qualitative tactical example | Mechanism illustration | Search-contempt selects a line mis-evaluated by the network at low search but losing under deeper evaluation | The method targets blind spots | Statistical frequency beyond the paper’s reported scan |
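
The Elo figure in the match row can be sanity-checked from the win, draw, and loss counts using the standard logistic Elo relation. This is a sketch of that arithmetic, assuming the conventional 400-point scale; the function name is hypothetical, and whether the paper computed its number in exactly this way is not stated here.

```python
import math


def elo_difference(wins, draws, losses):
    """Elo gap implied by a match score, using the standard logistic model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games   # draws count as half a point
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo gap is undefined for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)


# Match result reported for tuned search-contempt against tuned PUCT.
print(round(elo_difference(wins=37, draws=46, losses=17), 1))  # 70.4
```

A 60% score over 100 games works out to roughly 70 Elo, in line with the reported 70.43. With only 100 games, the uncertainty around that point estimate is wide.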

This is the heart of the contribution. Search-contempt does not remove the need for temperature. The paper is explicit that temperature remains useful to avoid repeated games, especially early in play. The new point is that $N_{scl}$ and temperature can be separated. Temperature can be used for variety; $N_{scl}$ can be used to tune decisiveness and difficulty with less damage to playing strength.

That is a cleaner control surface.

The business value is cheaper diagnosis, not just cheaper training

The obvious business interpretation is “this could reduce compute cost”. That is true, but too shallow.

The deeper interpretation is that training cost depends on the information density of experience. A million training episodes are expensive if they mostly revisit comfortable regions of the state space. A smaller number of episodes can be more valuable if they repeatedly expose the system to situations where its current model fails.

This is familiar outside chess. In business, the most useful simulations are rarely the ones where every actor behaves according to the median forecast. The useful ones expose the weird-but-plausible failure: the distributor who delays payment, the regulator who interprets a rule differently, the incumbent that ignores a low-end market until it is too late, the customer segment that does not respond to the campaign everyone expected to work.

Search-contempt suggests a design principle for AI decision systems:

Do not only simulate optimal responses. Simulate where your response model is confidently wrong about imperfect actors.

For enterprise AI, that points to several possible applications, with strict boundaries:

| Business system | Search-contempt-inspired idea | Practical value | Boundary |
| --- | --- | --- | --- |
| Competitive strategy simulation | Freeze or degrade competitor response models after limited reasoning depth | Finds opportunities created by rival inertia or misjudgement | Requires evidence-based competitor behavioural models |
| Pricing and promotion agents | Explore paths where rivals do not instantly match discounts | Tests exploitability of rigid pricing policies | Can become reckless if competitor reaction data is poor |
| Supply-chain planning | Simulate counterparties with bounded responsiveness rather than perfect coordination | Reveals bottlenecks hidden by idealised optimisation | Needs operational data on delays and failure modes |
| Sales and GTM planning | Stress-test assumptions about customer understanding and adoption | Identifies segments where messaging is misread or underused | Human behaviour is noisier than board games |
| AI agent evaluation | Generate cases where the agent’s world model mis-evaluates another actor’s constraints | Produces harder evaluation data | Requires metrics for “interesting” failure, not just pass/fail |

The ROI relevance is not “use this exact algorithm in your CRM”. That would be a heroic misunderstanding, even by software procurement standards.

The ROI relevance is methodological: better training and evaluation may come from adversarially selecting situations where the model’s assumptions about other agents are most fragile. That can reduce wasted simulation, improve robustness, and reveal strategic options that perfect-opponent modelling suppresses.

The paper is strongest as a mechanism, weaker as a universal claim

The paper’s strongest claims are chess-specific and mechanism-grounded.

It directly shows that search-contempt can improve odds chess performance under the reported conditions. It directly compares tuned search-contempt and tuned PUCT in regular chess self-play under similar compute and a similar win-draw-loss ratio. It directly illustrates that the method changes the character of generated games. It also proposes a plausible training schedule suggesting that search-contempt could allow strong self-play training with far fewer games than AlphaZero-scale runs.

But there is a line between demonstrated and plausible.

The paper does not present a full from-scratch training run that proves consumer-GPU parity with large-scale AlphaZero-like training. The proposed schedule is a candidate schedule, not a completed industrial benchmark. The 100-game comparisons are informative but not enormous. The qualitative claim that “interesting” positions are 20–30 times more frequent is based on a quick scan, not a formal taxonomy of tactical motifs. The chess metric $(w+l)/d$ is useful because chess has draws; it would not transfer cleanly to Go, markets, or enterprise workflows without a domain-specific replacement.

This does not make the paper weak. It makes it precise.

The right reading is: search-contempt provides credible evidence that modifying search to target opponent-model fallibility can produce stronger asymmetric play and more useful self-play distributions in chess. The broader claim — cheaper, more robust RL training across domains — is a promising extrapolation, not a settled result.

That boundary matters because business readers love importing metaphors and forgetting the import duty.

Outrunning the herd means modelling actual weakness

The title metaphor still works, but it needs sharpening.

You do not need to outrun the lion. You need to outrun the herd. But the useful AI lesson is not “assume everyone else is stupid”. That is how startups write pitch decks in Q1 and apology emails in Q4.

The useful lesson is to model bounded competence. Search-contempt does not remove the opponent. It does not ignore responses. It simply stops granting the opponent unlimited search improvement at every node. That small mechanical change produces a different world of games: more asymmetric, more tactical, more diagnostic, and in the reported settings, more useful.

For business AI, this is the difference between optimisation theatre and strategic realism. Perfect-opponent models often make every bold move look dangerous. Random exploration makes every strange move look equally worth testing. The more valuable middle ground is targeted exploration of plausible misjudgement.

That is where training data becomes richer. That is where simulations become more useful. And that is where AI systems may learn faster without merely asking for a bigger compute budget, which remains the industry’s favourite substitute for thinking.

Search-contempt is a chess paper. Its lesson is wider: the most valuable search is not always the search for the best move against perfection. Sometimes it is the search for the move that exposes where the system, the opponent, or the market is wrong.

In business games, that is often where the money is hiding.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ameya Joshi, “Search-contempt: a hybrid MCTS algorithm for training AlphaZero-like engines with better computational efficiency,” arXiv:2504.07757, 2025, https://arxiv.org/abs/2504.07757.
