Think Wide, Then Think Hard: Forcing LLMs to Be Creative (On Purpose)

Imagine a brainstorming meeting in which every new idea must immediately pass legal review, fit the quarterly budget, use the existing technology stack, satisfy six executives, and arrive formatted as a PowerPoint slide.

The meeting will probably produce something feasible.

It will also produce the same three ideas everyone proposed last quarter.

Large language models often work like that meeting. Ask one to create something “creative” while simultaneously obeying a long list of requirements, and it begins satisfying the requirements before meaningful exploration has occurred. The model generates a safe candidate near the centre of its learned distribution, decorates it with a few unusual nouns, and presents the result as imagination.

A paper by Manh Hung Nguyen and Adish Singla examines this problem through creative programming-task generation and proposes a simple intervention called CreativeDC: make the model explore first, then make it satisfy constraints.¹

The method matters less because it produces superhero exercises involving villain socks—although it does—and more because it identifies a practical source of generative sameness: constraints are being applied at the wrong stage of the reasoning process.

Premature constraint satisfaction narrows the idea space before generation begins

A conventional prompt might ask an LLM to create a programming exercise that:

follows a specified theme;
requires only a particular programming concept;
includes a complete problem description;
provides at least five valid tests;
contains a working reference solution;
avoids several prohibited techniques;
and follows an exact output schema.

Every requirement is reasonable. Together, however, they encourage the model to search for an idea that is obviously compatible with all requirements from the beginning.

That search strategy favours familiar patterns. A problem about lists becomes inventory management. A problem about loops becomes repeated counting. A superhero-themed exercise becomes a roster of heroes and powers. Different outputs may use different names and sentences while remaining conceptually interchangeable.

The paper describes this broader tendency as part of the “Artificial Hivemind” effect: repeated outputs within a model and homogeneous outputs across models. In operational terms, an organization may request 1,000 generated items and receive something closer to 150 ideas wearing 1,000 different shirts.

Several popular creativity interventions do not directly solve this sequencing problem.

Increasing temperature changes sampling behaviour, but randomness can produce unusual wording without moving far from the underlying concept. Adding “think step by step” gives the model more reasoning space, yet does not tell it which kind of reasoning should happen first. Persona simulation may introduce alternative perspectives, but the simulated accountant, astronaut, or biologist can still begin by optimizing against the final constraints.

CreativeDC changes the order of operations.

CreativeDC turns constraints from an idea generator into an idea filter

CreativeDC divides generation into two explicit phases.

During the divergent phase, the model focuses only on the given theme. It is instructed to list wildly different, underexplored, surprising, and unconventional objects, scenarios, or situations. Crucially, it must temporarily ignore the required programming concept.

During the convergent phase, the model selects one of those ideas and connects it to the required programming concept. It then checks whether the resulting problem satisfies the full specification. When an idea cannot be made compliant, the model is instructed to return to the candidate pool and choose another.

The resulting process is straightforward:

Theme
  ↓
Explore many unusual thematic directions
  ↓
Select a promising direction
  ↓
Apply technical and formatting constraints
  ↓
Reject or refine
  ↓
Produce the final problem
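The flow above can be sketched as a single two-phase prompt. This is only an illustration: the function name `build_two_phase_prompt`, the `n_ideas` parameter, and the instruction wording are all assumptions made for this example, not the prompt used in the paper.

```python
def build_two_phase_prompt(theme: str, concept: str, n_ideas: int = 8) -> str:
    """Return one prompt that asks for exploration first, constraints second.

    The wording is illustrative only; it is NOT the paper's actual prompt.
    """
    # Phase 1 mentions the theme but deliberately withholds the concept.
    divergent = (
        "Phase 1 (divergent): Considering only the theme "
        f"'{theme}', name {n_ideas} wildly different, underexplored, and "
        "surprising objects, scenarios, or situations. "
        "Do not think about any technical requirement yet."
    )
    # Phase 2 introduces the constraint and an explicit fallback rule.
    convergent = (
        "Phase 2 (convergent): Pick one idea from Phase 1 and build a "
        f"programming problem around it that requires only {concept}. "
        "Write a description, a test suite, and a reference solution. "
        "If the idea cannot satisfy every requirement, go back to the "
        "Phase 1 ideas and choose a different one."
    )
    return divergent + "\n\n" + convergent


prompt = build_two_phase_prompt("Superheroes", "Lists")
phase_1, phase_2 = prompt.split("\n\n")
print(phase_1)
print(phase_2)
```

The point of the sketch is the ordering: the programming concept does not appear anywhere in the first phase, so it cannot steer which ideas are surfaced.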

The separation changes the role of constraints.

Under ordinary prompting, constraints guide the initial search and therefore determine which ideas are considered. Under CreativeDC, constraints evaluate ideas that have already been surfaced. The model receives permission to visit less convenient regions of the idea space before the feasibility committee arrives with its clipboard.

The paper’s superhero-and-lists example illustrates the mechanism. In its divergent phase, the model proposes ideas including capes trapped in time loops, sidekick support groups, emotional-support vending machines, and a retired superhero cataloguing socks taken from defeated villains. It then selects the sock collection and converts it into a list-based inventory exercise.

The final task remains constrained and solvable. Its creative seed, however, was generated before “must use lists” could pull the model toward a standard roster-management problem.

This is the paper’s most transferable contribution. Creativity becomes an inference-time workflow rather than an adjective inserted into a prompt.

The experiment tests creativity under unusually strict output requirements

The authors evaluate CreativeDC on programming-problem generation, a useful test environment because creativity cannot quietly replace correctness.

Each generated problem must contain:

a natural-language description;
a test suite;
a reference solution.

The solution must use only the requested programming concept and pass the accompanying tests. The problem must also remain relevant to its assigned theme and understandable to a learner.

The evaluation combines four themes—Cooking, Science Fiction, Superheroes, and Board Games—with five programming concepts—Variables, Selection Statements, Loops, Lists, and Strings. This produces 20 distinct generation contexts.
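The 20 contexts are simply the cross product of the two lists. A minimal sketch of that enumeration, using only the theme and concept names given above (the variable names are chosen for this example):

```python
from itertools import product

themes = ["Cooking", "Science Fiction", "Superheroes", "Board Games"]
concepts = ["Variables", "Selection Statements", "Loops", "Lists", "Strings"]

# Every (theme, concept) pair is one generation context.
contexts = list(product(themes, concepts))

print(len(contexts))  # 20
print(contexts[0])    # ('Cooking', 'Variables')
```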

Using Qwen3-235B-A22B-Instruct-2507, the study compares three prompting methods:

Method	What changes
Base	Directly requests a constrained programming problem
Chain of thought	Adds “Think step by step” to the base prompt
CreativeDC	Explicitly separates divergent exploration from convergent refinement

The authors repeat the comparison both without personas and with persona simulation applied to every method. Applying personas uniformly is important: it tests whether CreativeDC still contributes after each method receives the same perspective-diversification aid.

The evaluation then measures three dimensions of creativity.

Diversity asks whether outputs differ from one another

Lexical diversity measures variation in the language used across generated problems.

Semantic diversity measures the average distance between problem embeddings, capturing whether the problems differ in meaning rather than merely wording.

A set of tasks about “counting hero powers,” “tallying superhero abilities,” and “calculating the number of powers” may appear lexically varied while remaining semantically repetitive. Semantic diversity is intended to detect that distinction.
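As a rough sketch of the idea, semantic diversity can be computed as the mean pairwise distance between embeddings. The use of cosine distance and the toy two-dimensional vectors below are assumptions for illustration; the paper's embedding model and exact distance function may differ.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_diversity(embeddings):
    """Average distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings", invented for illustration only.
near_duplicates = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(semantic_diversity(near_duplicates) < semantic_diversity(spread_out))  # True
```

Three rewordings of the same task would resemble `near_duplicates`: their vectors sit almost on top of one another, so the average distance stays close to zero however different the wording looks.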

Novelty asks whether outputs differ from competing methods

Novelty is measured against problems produced by the other generation methods under the same context.

This is a demanding reference set. A cooking-and-loops problem generated by CreativeDC is compared with other cooking-and-loops problems, rather than with a broad web corpus containing mostly unrelated material. High novelty therefore indicates distance from close alternatives, not merely distance from arbitrary text.

Lexical novelty measures unfamiliar phrasing. Semantic novelty measures the distance between each generated problem and its nearest semantic neighbour in the reference pool.
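The nearest-neighbour idea behind semantic novelty can be sketched in a few lines. Cosine distance and the toy vectors are assumptions for illustration, and the paper may aggregate per-item scores differently.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_novelty(item, reference_pool):
    """Distance from `item` to its closest neighbour in the reference pool."""
    return min(cosine_distance(item, ref) for ref in reference_pool)


# Toy embeddings, invented for illustration: the pool holds problems
# produced by the competing methods for the same context.
reference_pool = [[1.0, 0.0], [0.9, 0.1]]
familiar = [0.95, 0.05]   # sits between two existing problems
unusual = [0.0, 1.0]      # far from everything in the pool

print(semantic_novelty(familiar, reference_pool) < semantic_novelty(unusual, reference_pool))  # True
```

Because the score uses the minimum rather than the average, a single close neighbour is enough to mark an item as unoriginal.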

Utility asks whether creativity survived contact with the task

A problem receives utility credit only when it satisfies all three criteria:

validity;
context relevance;
comprehensibility.

These criteria are binary, and utility is their product, so failure on any one criterion sets the problem's utility to zero.

This matters because a model can achieve impressive novelty by generating nonsense. The experiment is designed to prevent that particularly easy victory.
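The all-or-nothing scoring can be written down directly. In this sketch the function names and the four example judgements are invented for illustration; only the rule itself (utility as the product of three binary checks) comes from the description above.

```python
def utility(valid: bool, relevant: bool, comprehensible: bool) -> int:
    """Product of three binary criteria: 1 only if all three hold."""
    return int(valid) * int(relevant) * int(comprehensible)


def utility_rate(judgements) -> float:
    """Share of problems that pass every criterion."""
    return sum(utility(*j) for j in judgements) / len(judgements)


# Hypothetical judgements as (valid, relevant, comprehensible) tuples.
judgements = [
    (True, True, True),
    (True, False, True),   # off-theme, so utility is 0
    (True, True, True),
    (False, True, True),   # reference solution fails, so utility is 0
]

print(utility_rate(judgements))  # 0.5
```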

CreativeDC moves farther semantically without a statistically significant utility loss

At the 100-problem evaluation setting, CreativeDC outperforms the Base and chain-of-thought prompts across every reported lexical and semantic diversity and novelty metric.

Without persona simulation, the results are:

Method	Lexical diversity	Semantic diversity	Lexical novelty	Semantic novelty	Utility
Base	0.74	0.46	0.62	0.20	92.95%
Chain of thought	0.75	0.46	0.66	0.18	91.35%
CreativeDC	0.81	0.54	0.73	0.30	90.85%

CreativeDC’s semantic diversity is 16.7% higher than chain-of-thought prompting, while its semantic novelty is 63.5% higher. The improvements in diversity and novelty are statistically significant across the evaluated contexts.
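For readers who want to check the arithmetic, relative gains can be recomputed from the rounded values in the table. The results land close to, but not exactly on, the quoted percentages. One plausible explanation, which is an assumption here rather than something the article states, is that the quoted figures were computed from unrounded scores or aggregated per context.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


# Inputs are the rounded table values for CreativeDC vs. chain of thought.
print(round(relative_gain(0.54, 0.46), 1))  # 17.4 (semantic diversity)
print(round(relative_gain(0.30, 0.18), 1))  # 66.7 (semantic novelty)
```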

The utility scores move in the opposite direction numerically: Base achieves 92.95%, chain of thought 91.35%, and CreativeDC 90.85%. The paper reports that these differences are not statistically significant.

That result deserves careful interpretation.

The evidence does not establish that CreativeDC makes individual outputs more useful. Its contribution is that the model can travel substantially farther across the semantic space while preserving utility at a statistically comparable level.

For teams generating a single item, this may be mildly interesting. A user asking for one acceptable programming exercise may not care whether the unused alternatives were diverse.

For teams building a library, dataset, or catalogue, the result is much more consequential. Repeated generation is valuable only when each additional output contributes something meaningfully different.

The scaling result exposes the difference between output count and effective variety

A generator can produce more items without producing proportionally more ideas.

To measure this distinction, the authors use the Vendi Score, which treats diversity as an effective number of distinct items. When all outputs are nearly identical, the effective count remains low even if the raw file contains hundreds of entries. When outputs occupy more distinct semantic regions, the effective count rises.
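A minimal sketch of the Vendi Score follows, using its commonly cited definition: the exponential of the Shannon entropy of the eigenvalues of the similarity matrix divided by the number of items. The small Jacobi eigenvalue routine is included only to keep the example dependency-free; a real pipeline would more likely call a numerical library, and the paper's exact configuration (similarity kernel, embedding model) is not reproduced here.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(similarity):
    """exp(Shannon entropy) of the eigenvalues of similarity / n."""
    n = len(similarity)
    scaled = [[value / n for value in row] for row in similarity]
    eigenvalues = symmetric_eigenvalues(scaled)
    # Near-zero eigenvalues contribute nothing to the entropy.
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


# Toy similarity matrices (1.0 on the diagonal), invented for illustration.
identical = [[1.0] * 3 for _ in range(3)]
distinct = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
two_alike = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(round(vendi_score(identical), 3))  # 1.0
print(round(vendi_score(distinct), 3))   # 3.0
print(round(vendi_score(two_alike), 2))  # 1.89
```

Three copies of one item count as one effective item, three unrelated items count as three, and a collection with one duplicate pair lands in between.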

All methods become more diverse as the number of sampled problems increases from 10 to 100. CreativeDC, however, accumulates effective variety faster.

Without personas, its Vendi Score is:

24.0% higher than chain-of-thought prompting at $K=10$;
72.0% higher at $K=100$.

The widening gap is more informative than either percentage alone.

If CreativeDC merely added a fixed dose of eccentricity, its advantage would remain roughly constant as more items were generated. Instead, the advantage grows with the size of the collection. The two-phase process appears to reduce repeated convergence on the same familiar concepts.

This is the strongest business-relevant result in the paper.

Many generative workflows are evaluated one output at a time: Is this advertisement acceptable? Is this scenario valid? Does this synthetic record look realistic? Yet the operational value of bulk generation depends on the collection. One hundred individually acceptable outputs may still form a poor dataset when they repeatedly represent the same few cases.

The more useful question is:

After generating another 100 items, how many genuinely new items did the system add?

CreativeDC is designed around that question, even though the prompt itself remains simple.

Persona simulation helps exploration, but it does not replace the reasoning schedule

The persona experiment serves as an augmentation and robustness test.

Every method receives a sampled persona from the Persona Hub dataset. A model might, for example, simulate a scientist interested in ageing research and then explore superhero ideas involving power decay, inherited mutations, genetic screening, or rejuvenation.

Personas improve diversity and novelty across the methods. This is unsurprising: a supplied perspective gives the model an alternative thematic starting point.

CreativeDC nevertheless retains its advantage.

With persona simulation, it achieves 8.5% greater semantic diversity and 32.9% greater semantic novelty than chain-of-thought prompting. Its absolute semantic novelty reaches 0.31, compared with 0.23 for chain of thought and 0.22 for Base.

Method with personas	Semantic diversity	Semantic novelty	Utility
Base	0.49	0.22	91.80%
Chain of thought	0.52	0.23	89.70%
CreativeDC	0.56	0.31	89.65%

The comparison clarifies the division of labour between the techniques.

A persona influences where the model begins exploring. Divergent-convergent scaffolding influences how exploration and selection are sequenced. The first supplies a viewpoint; the second prevents that viewpoint from being immediately compressed into the nearest compliant answer.

Personas can therefore be useful inputs to CreativeDC, but they are not substitutes for it.

Adding personas also lowers utility slightly for every method, by roughly 1.2 to 1.7 percentage points in the tables above (for example, from 90.85% to 89.65% for CreativeDC). The paper does not establish why. A plausible operational interpretation is that stronger perspective steering can make strict constraint satisfaction slightly harder, but that explanation remains an inference rather than a demonstrated mechanism.

Theme familiarity and constraint complexity determine how much room creativity receives

The paper also examines CreativeDC’s performance across its 20 contexts.

This analysis is best treated as exploratory evidence about where the method works more or less easily, rather than a separate thesis.

The Cooking theme produces the highest utility but the lowest semantic diversity and novelty. Science Fiction and Superheroes allow more varied and novel outputs. Familiar everyday settings appear to make valid generation easier while pulling the model toward conventional scenarios.

Programming concepts show a similar pattern. Simpler concepts such as Variables and Selection Statements permit greater novelty than more restrictive concepts such as Loops and Lists.

The result is intuitive, but operationally useful: the creative range of a system is partly determined before the model begins generating.

A workflow can create additional exploration space by delaying constraints. It cannot make all constraint sets equally permissive. When a task requires a narrow technical concept, rigid schema, regulated language, and a familiar scenario, the remaining creative degrees of freedom may be limited.

This suggests that creative-generation systems should distinguish two problems:

Problem	Likely intervention
The model ignores available creative space	Improve the reasoning schedule
The specification leaves little creative space	Redesign or relax the specification

Prompt engineering can address the first. It cannot politely negotiate the second away.

A practical creative-generation pipeline needs portfolio evaluation

The paper directly demonstrates a two-phase prompting method for programming tasks. Translating that result into a broader production workflow requires one additional layer: evaluate the generated collection, not merely each final item.

A practical pipeline could use three stages.

Stage 1: Generate candidates without operational constraints

The divergent prompt should define the thematic territory while temporarily withholding implementation requirements.

For a campaign-idea system, the model might explore customer situations, emotional tensions, rituals, unexpected product uses, or cultural settings without yet considering channel formats and brand rules.

For synthetic incident generation, it might explore unusual sequences of events before mapping them into an approved data schema.

For educational content, it might brainstorm contexts and narrative situations before enforcing learning objectives and answerability.

The purpose is not to accept these ideas. It is to prevent feasibility from determining which ideas become visible.

Stage 2: Apply constraints through selection and refinement

The convergent stage selects promising candidates and converts them into usable outputs.

Requirements such as legality, technical feasibility, formatting, difficulty, brand consistency, or data-schema validity belong here. When a candidate cannot survive the constraints, the system should select another candidate rather than forcing the first one into an unnatural shape.

This fallback mechanism is important. Convergence should be allowed to reject ideas. Otherwise, the system merely postpones the same failure.
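One way to express that fallback is a loop that walks the candidate pool until something survives validation. This is a sketch of a possible production design, not the paper's procedure: `converge`, `realize`, and `is_compliant` are hypothetical names, and the toy stand-ins below replace what would be model calls and real validators.

```python
from typing import Callable, Iterable, Optional


def converge(
    candidates: Iterable[str],
    realize: Callable[[str], str],
    is_compliant: Callable[[str], bool],
) -> Optional[str]:
    """Return the first realized candidate that passes validation, else None."""
    for idea in candidates:
        draft = realize(idea)      # turn the raw idea into a full output
        if is_compliant(draft):    # apply the constraints as a filter
            return draft
        # Otherwise drop this idea and move on instead of forcing it to fit.
    return None


# Toy stand-ins, invented for illustration.
def realize(idea: str) -> str:
    return f"Exercise about a {idea}"


def is_compliant(draft: str) -> bool:
    return "sock" in draft


ideas = ["time-looped cape", "villain sock archive"]
print(converge(ideas, realize, is_compliant))
# Exercise about a villain sock archive
```

Returning `None` when nothing passes is a deliberate choice in this sketch: it lets the caller request a fresh divergent batch rather than accept a distorted idea.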

Stage 3: Evaluate both item quality and collection quality

Traditional validators can check each output for compliance, correctness, and relevance.

A second evaluation layer should measure the portfolio:

How semantically similar are the outputs?
Which ideas form dense clusters?
How many effective distinct items were produced?
Does another generation batch add new coverage?
Are unusual outputs still useful?

This stage turns diversity from a vague stylistic preference into an operational metric.
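The question about new coverage can be made concrete with a simple check: count how many items in a new batch sit far enough from everything already collected. The cosine distance, the `min_distance` threshold of 0.2, and the toy vectors are all assumptions for illustration; a real system would tune the threshold against its own embedding model.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def count_new_coverage(existing, batch, min_distance=0.2):
    """Count batch items at least `min_distance` from everything kept so far."""
    kept = list(existing)
    added = 0
    for item in batch:
        if all(cosine_distance(item, other) >= min_distance for other in kept):
            kept.append(item)  # later batch items are compared to this one too
            added += 1
    return added


# Toy embeddings, invented for illustration.
existing = [[1.0, 0.0]]
batch = [[0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]

print(count_new_coverage(existing, batch))  # 1
```

Here three new files arrive but only one adds coverage: the first repeats an existing item and the third repeats the second.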

A team generating 5,000 synthetic customer complaints, for example, should care whether the collection covers varied failure modes—not whether each complaint contains slightly different wording.

The immediate business value is avoiding counterfeit scale

CreativeDC is most relevant to workflows where output quantity is supposed to create coverage.

That includes:

educational exercise libraries;
synthetic training and testing scenarios;
product-concept exploration;
advertising and content ideation;
simulation cases for agents;
risk and incident catalogues.

In these settings, duplicated concepts create counterfeit scale. Storage volume increases. Review workloads increase. The apparent dataset grows. The useful information content grows much more slowly.

CreativeDC offers a relatively lightweight intervention because it operates at inference time. It does not require fine-tuning a model on creativity preferences or coordinating a debate among multiple agents. The tested method adds structured reasoning instructions to the generation prompt.

Still, “prompt-only” does not mean “free.”

The CreativeDC prompt requests additional divergent and convergent reasoning. This likely increases token usage and latency. More importantly, the study discards inconsistent problems and continues generating until it obtains the required number of usable problems. It does not report how many attempts each method required, their rejection rates, or their total inference costs.

A system could therefore produce a more diverse final collection while also requiring more computation to assemble it. The paper measures the quality of the resulting set, not the economics of producing that set.

For deployment, the relevant return calculation is closer to:

$$ \text{Value per generation budget} = \frac{\text{effective useful variety}}{\text{generation, validation, and review cost}} $$

CreativeDC improves the numerator in the tested setting. The paper does not establish what happens to the denominator.
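A toy calculation shows why the denominator matters. Every number below is an invented placeholder, not a measurement from the paper; the sketch only illustrates that a larger numerator does not guarantee a better ratio.

```python
def value_per_budget(effective_useful_variety: float, total_cost: float) -> float:
    """Effective useful variety obtained per unit of total cost."""
    return effective_useful_variety / total_cost


# Hypothetical figures: a two-phase method yields more variety (60 vs. 40)
# but, in this made-up scenario, costs more to run and validate (180 vs. 100).
baseline = value_per_budget(40, 100)
two_phase = value_per_budget(60, 180)

print(round(baseline, 3))    # 0.4
print(round(two_phase, 3))   # 0.333
print(baseline > two_phase)  # True
```

With different cost assumptions the comparison flips, which is why rejection rates and token usage would need to be measured before drawing a deployment conclusion.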

What the evidence supports—and what remains an open question

The paper supports several fairly strong conclusions within its experimental setting:

Claim	Evidence	Boundary
Explicit divergent-convergent prompting increases output diversity and novelty	CreativeDC leads all baselines across lexical and semantic metrics	Tested on programming-problem generation with one model
The gains extend beyond wording differences	Semantic diversity and semantic novelty improve substantially	Semantic metrics rely on one embedding model
Utility can remain comparable while exploration expands	Utility differences from baselines are not statistically significant	Utility is automatically evaluated rather than judged by educators
CreativeDC produces greater effective variety at scale	Its Vendi Score advantage grows from 24.0% at $K=10$ to 72.0% at $K=100$ over chain of thought	Generation cost and discard rates are unreported
Personas and CreativeDC can be combined	CreativeDC retains its advantage when all methods receive personas	The experiment does not compare many alternative persona strategies

Several broader interpretations remain plausible but unproven.

The method may transfer well to marketing concepts, product design, stories, simulations, or business scenarios. Those domains also suffer when feasibility constraints arrive before exploration. Yet the paper evaluates only programming exercises.

The outputs may also feel more creative to human readers. Semantic distance is a useful signal, but humans do not experience creativity as embedding distance. The study includes no human evaluation of surprise, usefulness, elegance, educational value, or taste.

Generalization across models is another open issue. The experiment uses one large Qwen3 mixture-of-experts model. Smaller models, proprietary models, and systems with different reasoning behaviours may respond differently to the scaffold.

Production systems may also implement the method differently. Rather than asking one model response to expose both phases, a team could run exploration and convergence as separate calls, store the candidate pool, apply independent validators, and measure portfolio diversity over time. That architecture is a reasonable extension of the paper’s mechanism, but it was not tested in the study.

Put the approval committee after the brainstorm

CreativeDC’s contribution is almost suspiciously simple: when creativity and constraint satisfaction compete, stop asking the model to perform both at the same moment.

The paper does not show that LLMs possess imagination in a human sense. It shows that their output distribution can be widened by changing the sequence in which they handle a task.

First, let the model explore the theme without carrying every operational burden.

Then, force it to select, test, reject, and refine.

Finally, evaluate whether the resulting collection contains genuinely different useful items, rather than counting files and congratulating the generator.

Randomness may add variation. Personas may add perspective. Generic reasoning may add deliberation. None of them directly guarantees that exploration occurs before convergence.

Sometimes the model is not short of possible ideas. The prompt simply invited the approval committee into the room too early.

Cognaptus: Automate the Present, Incubate the Future.

¹ Manh Hung Nguyen and Adish Singla, “Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation,” arXiv:2512.23601, 2025, https://arxiv.org/abs/2512.23601.
