Benchmarked Brilliance: How CreBench Rewrites the Rules of Machine Creativity

Design review is where creativity usually goes to become awkward.

One person likes the concept because it feels original. Another dislikes it because it looks impractical. A third praises the visual polish while quietly ignoring whether the idea solves the actual problem. Then someone asks whether the AI can “evaluate creativity”, and everyone pretends the word creativity has a stable meaning. Excellent. Very efficient.

The paper behind CreBench takes a more useful route. It does not claim that multimodal models are now creative in the grand philosophical sense. That would be a TED Talk, not a benchmark. Instead, it asks a narrower and more operational question: can multimodal large language models evaluate creative work in ways that align with trained human judges across the idea, the process, and the final product?¹

That distinction matters. CreBench is not a trophy case for machine imagination. It is an attempt to turn creativity assessment into an auditable pipeline.

The main move is to stop treating creativity as a single score

Most AI creativity evaluation is still haunted by output fetishism. Show the model an image. Ask whether it is novel, beautiful, or aligned with a prompt. Compress the answer into a score. Then pretend the score captures creativity because numbers are soothing.

CreBench attacks that shortcut directly. The authors define creativity across three linked layers:

Layer	What it evaluates	Indicators
Creative idea	Whether the concept is novel and appropriate	Originality, appropriateness
Creative process	Whether the creator explored, structured, revised, and refined	Immersion/preparation, divergence, structuring, evaluation, elaboration
Creative product	Whether the final drawing works as an expressive and feasible solution	Effectiveness, aesthetic, novelty, manufacturability, systemic complexity

This is the paper’s central contribution. The benchmark is not simply asking, “Is the final artefact good?” It asks whether the concept made sense, whether the creator explored alternatives, and whether the final visual solution communicated something coherent and buildable.
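For readers who think in data structures, the three-layer rubric can be written down directly. This is a minimal sketch: the indicator names come from the table above, while the dictionary layout and the `empty_score_sheet` helper are illustrative assumptions, not anything the paper ships.

```python
# Rubric structure taken from the three-layer table in the article.
# The dictionary layout and helper name are illustrative assumptions.
RUBRIC = {
    "creative_idea": ["originality", "appropriateness"],
    "creative_process": [
        "immersion/preparation", "divergence", "structuring",
        "evaluation", "elaboration",
    ],
    "creative_product": [
        "effectiveness", "aesthetic", "novelty",
        "manufacturability", "systemic complexity",
    ],
}


def empty_score_sheet(rubric=RUBRIC):
    """Return one unscored slot per (dimension, indicator) pair."""
    return {(dim, ind): None for dim, inds in rubric.items() for ind in inds}


sheet = empty_score_sheet()
print(len(RUBRIC), "dimensions,", len(sheet), "indicators")
# prints: 3 dimensions, 12 indicators
```

Keeping the score sheet keyed by dimension and indicator makes it harder to quietly collapse everything back into a single number.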

That is a much harder target for AI evaluation. It also looks more like how organisations actually judge creative work when they are being honest. A product concept is not valuable merely because the mock-up is attractive. A marketing idea is not strategic merely because the image is fresh. A student design is not creative merely because it is weird. Weirdness is cheap. Coherent, novel, feasible weirdness is where the invoice starts.

CreBench turns creativity into a data collection problem

The benchmark’s second mechanism is multimodal evidence. CreBench is built from more than final images. It includes textual ideas, creation process data, and visual outputs from open-ended design tasks. The paper reports 2.2K creative instances across four tasks: cargo river crossing, parking, reach, and fence-style problem settings.

The human side is also important. The study recruited 512 secondary students from five schools, collected their creative responses, and included AI-generated attempts as part of the broader data construction. Expert annotators then evaluated the work across the three major dimensions and twelve sub-indicators.

This is where the benchmark becomes more than a label set. The paper reports 79.2K expert feedback entries, produced by three experts trained and calibrated under a structured annotation protocol. The authors also report an average Fleiss’ $\kappa = 0.71$ and $\mathrm{ICC}(2,1) = 0.78$, which they interpret as substantial agreement.
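To make the agreement statistic less abstract, here is a minimal sketch of the standard Fleiss' kappa formula in plain Python. The toy ratings are made up for illustration and have nothing to do with the paper's annotations; the function also assumes at least two raters and more than one category in use, otherwise it divides by zero.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.

    Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Made-up toy data: 4 items, 3 raters, 3 rating categories.
toy = [[3, 0, 0], [0, 3, 0], [0, 2, 1], [1, 0, 2]]
print(round(fleiss_kappa(toy), 3))
```

Kappa corrects raw agreement for what raters would agree on by chance, which is why it is a stricter number than "the experts mostly matched".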

That does not make creativity objective. It makes the assessment procedure more disciplined. The difference is not academic hair-splitting; it is the entire business case. For practical deployment, the question is rarely “What is true creativity?” The question is “Can we make this evaluative judgement consistent enough to support feedback, ranking, triage, or training?”

CreBench says: maybe, if you specify the rubric and capture enough of the creative trace.

The instruction dataset is the quiet engine

CreBench is the benchmark. CreMIT is the training fuel.

The authors convert expert feedback into 4.7M instruction-following samples across six question styles: reasoning, what, how, why, yes/no, and multiple-choice. The operational idea is straightforward: do not just give the model a score and hope enlightenment occurs. Teach it to answer the kinds of questions a human evaluator would ask.

That design choice matters because creativity evaluation is not one task. A useful evaluator might need to explain why an idea is original, identify how the solution works, judge whether it satisfies constraints, compare possible ratings, or justify a score against a rubric. These are different cognitive moves. A single scalar label will not teach them all.

CreMIT therefore functions less like a dataset of answers and more like a rehearsal environment for evaluative reasoning. It exposes the model to the language of judgement: why an idea is appropriate, how a process shows divergence, whether a product is manufacturable, and what makes a final artefact more than decorative noise wearing a lab coat.
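A rough sketch of what "one feedback entry, six question styles" could look like in code. The six style names come from the paper as described above; the template wording, the field names (`instance_id`, `indicator`, `comment`), and the 1 to 5 rating scale are all hypothetical, invented here for illustration.

```python
# Hypothetical templates: the six style names are from the article, but the
# wording below is invented for illustration and is not CreMIT's.
TEMPLATES = {
    "reasoning": "Assess the {indicator} of this work step by step, then rate it.",
    "what": "What rating best describes the {indicator} of this work?",
    "how": "How does this work demonstrate {indicator}?",
    "why": "Why does this work deserve its {indicator} rating?",
    "yes/no": "Does this work show strong {indicator}?",
    "multiple-choice": "Which rating fits the {indicator} of this work? {options}",
}


def expand_feedback(entry, scale=(1, 2, 3, 4, 5)):
    """Turn one expert feedback entry into one sample per question style."""
    # The rating scale is an assumption; options render as "(A) 1 (B) 2 ...".
    options = " ".join(f"({chr(65 + i)}) {s}" for i, s in enumerate(scale))
    samples = []
    for style, template in TEMPLATES.items():
        samples.append({
            "instance_id": entry["instance_id"],
            "style": style,
            "question": template.format(
                indicator=entry["indicator"], options=options
            ),
            # A real pipeline would write a style-specific answer here.
            "answer": entry["comment"],
        })
    return samples


entry = {
    "instance_id": "demo-001",
    "indicator": "originality",
    "comment": "Unusual mechanism, rarely seen in other responses.",
}
for sample in expand_feedback(entry):
    print(sample["style"], "->", sample["question"])
```

The point of the sketch is the shape, not the wording: one judgement becomes several differently framed questions, which is how a modest pile of expert reports can grow into millions of training samples.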

For business readers, this is the first practical lesson: if you want AI to evaluate subjective work, the expensive asset is not merely the model. It is the structured judgement corpus. The rubric, examples, process traces, and expert explanations are the moat. The model is the delivery mechanism.

CreExpert is a specialised evaluator, not a universal genius

The paper then fine-tunes CreExpert from LLaVA-1.5-7B. Its architecture follows the LLaVA family: a CLIP ViT-L/14 vision encoder, a two-layer projection module that connects visual and language representations, and a Vicuna-based language decoder. During tuning, the visual encoder remains frozen, while the projection module and language model are trained using LoRA through LLaMA-Factory.
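For a sense of why LoRA keeps this kind of tuning cheap, the sketch below counts the parameters in a low-rank adapter against the dense layer it sits beside. The frozen and trained components follow the description above; the layer size and rank are assumptions chosen for illustration, since the article does not state the rank or target modules used for CreExpert.

```python
# Which components are updated, as described in the article.
TRAINABLE = {
    "vision_encoder": False,   # stays frozen
    "projector": True,
    "language_model": True,    # adapted with LoRA
}


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank pair B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the frozen dense weight W (d_out x d_in)."""
    return d_out * d_in


# Assumed sizes: a 4096 x 4096 projection and rank 8 are illustrative only.
d, r = 4096, 8
share = lora_trainable_params(d, d, r) / full_params(d, d)
print(f"{share:.2%} of the dense layer's parameters are trained")
```

Under these assumed sizes the adapter is a small fraction of the layer it modifies, which is consistent with the article's point that the gains come from the data rather than from architectural heroics.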

This is not a maximalist architecture story. The point is not that CreExpert wins because it is bigger or more exotic. The point is that a general multimodal model becomes much stronger at a narrow evaluative task when it is tuned on carefully structured, expert-derived creativity instructions.

That is exactly the part many AI procurement conversations still miss. Businesses keep asking whether a general model “can do creativity”. The better question is whether the organisation can define the evaluative standard clearly enough for a model to learn it.

CreExpert’s results suggest that specialised judgement can beat general brilliance when the judgement task is well-scaffolded. Annoying for hype decks, useful for operations.

The main result is strong, but the product layer keeps everyone honest

The headline result is large. CreExpert achieves an overall score of 65.50, compared with 29.27 for GPT-4V, 27.78 for Gemini-Pro-Vision, and 20.57 for the LLaVA-1.5-7B baseline. The paper measures alignment as the Pearson correlation between model predictions and human expert feedback, reported as percentage-style scores, so a higher score means closer agreement with the experts.

Model	Creative idea	Creative process	Creative product	Overall
CreExpert	84.14	72.19	40.18	65.50
GPT-4V	15.16	45.01	27.64	29.27
Gemini-Pro-Vision	11.47	54.39	17.50	27.78
LLaVA-1.5-7B	13.06	28.78	19.87	20.57
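A quick arithmetic check on the table: the Overall column matches the unweighted mean of the three dimension scores to within 0.01 for every row. That the paper computes it this way is an inference from the numbers, not something stated above.

```python
# Scores copied from the table: (idea, process, product, reported overall).
SCORES = {
    "CreExpert": (84.14, 72.19, 40.18, 65.50),
    "GPT-4V": (15.16, 45.01, 27.64, 29.27),
    "Gemini-Pro-Vision": (11.47, 54.39, 17.50, 27.78),
    "LLaVA-1.5-7B": (13.06, 28.78, 19.87, 20.57),
}

for model, (idea, process, product, overall) in SCORES.items():
    mean = (idea + process + product) / 3
    # Small gaps are expected from rounding in the published figures.
    print(f"{model:18s} mean={mean:.2f} reported={overall:.2f}")
```

If the mean reading is right, CreExpert's headline 65.50 is being pulled up by the idea layer and down by the product layer, which is exactly the asymmetry the next paragraphs discuss.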

The most interesting part is not simply that CreExpert wins. It is where it wins.

Creative idea evaluation improves dramatically. The model becomes much better aligned with human ratings of originality and appropriateness. Creative process evaluation also improves strongly, especially on immersion/preparation and divergence. That makes sense: process evidence gives the model observable signals of exploration, revision, and structure. It can see more of the thinking, or at least more of the behavioural residue of thinking.

Creative product evaluation is weaker. CreExpert still leads, but its product score is 40.18, far below its idea score of 84.14 and process score of 72.19. This is not a footnote; it is the paper behaving like a useful benchmark. Final products are harder because they require judging visual coherence, feasibility, aesthetic quality, novelty, and system integration all at once. That is a lot to ask from a model whose visual encoder is frozen and whose training objective is mostly evaluative alignment.

In plain English: CreExpert learns to judge the story behind the creative work more easily than it learns to judge the finished artefact as a design object. Very human, actually.

The ablations explain where the benchmark is doing work

The paper’s later experiment tables are best read as ablations and dimension-by-task diagnostics, not as a second thesis. Their purpose is to show how CreMIT changes evaluation ability across the three creativity layers and across the four tasks.

Test	Likely purpose	What it supports	What it does not prove
Creative idea by task	Ablation by dimension and task	Fine-tuning sharply improves alignment on originality and appropriateness	That the model can generate better original ideas in open-world business settings
Creative process by task	Ablation by process indicator	Process-aware data helps the model evaluate preparation, divergence, structuring, evaluation, and elaboration	That the model truly understands human cognition
Creative product by task	Ablation by final-output indicator	Product-level judgement improves, but unevenly	That visual creativity assessment is solved

The idea-level gains are the cleanest. Across Transport, Parking, Reach, and Fence tasks, the overall creative idea improvements range from +52.16 to +68.91. The largest gains appear in tasks where originality and appropriateness are strongly expressed through textual or conceptual framing.

The process-level gains are also substantial. Overall improvements range from +38.49 to +45.56 across tasks, with strong movement in immersion/preparation and divergence. This is where the benchmark’s design pays off: if you collect process logs, the model can learn to evaluate more than the polished endpoint.

The product-level results are more modest. Overall gains range from +3.21 to +19.62. The Reach task shows only a small product-level improvement, including a slight negative movement in manufacturability. That does not invalidate the paper. It clarifies the boundary. Judging whether a final drawing is original, aesthetically effective, and physically plausible remains harder than judging whether a written idea sounds novel or whether a process log shows exploration.

This is exactly the kind of uneven result a serious evaluator should preserve. If every dimension improved equally, one would suspect the benchmark was measuring the training distribution’s echo, not creativity alignment.

The business value is evaluation infrastructure, not “creative AI”

The most obvious business interpretation is also the least useful: “AI can now evaluate creativity.” Too broad. Too convenient. Please put it back.

The better interpretation is that CreBench points toward rubric-based AI evaluators for subjective work. That is narrower, and much more actionable.

In education, a system like this could support formative feedback on open-ended design tasks. Not by replacing teachers, but by making first-pass diagnosis cheaper: this idea is original but weakly appropriate; this process shows little divergence; this final product is visually polished but mechanically implausible.

In marketing, a similar approach could help teams triage campaign concepts before human review. The model would not decide brand strategy. It could, however, flag whether a concept is merely aesthetic, whether the idea has task relevance, and whether the execution carries the intended message.

In product design, the same logic applies to early-stage concept screening. Process-aware evaluation could reward exploration and refinement rather than just final mock-up polish. That is useful because many organisations accidentally select the most presentation-ready concept, not the best-developed one. Design theatre remains undefeated, but at least now it has competition.
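The first-pass diagnosis described in these examples could be sketched roughly as below. Everything here is hypothetical: the function name, the thresholds, and the assumption that an evaluator returns per-indicator scores on a 1 to 5 scale.

```python
def first_pass_notes(scores, low=2.0, high=4.0):
    """Summarise per-indicator scores into notes for a human reviewer.

    Indicators at or below `low` are flagged; those at or above `high`
    are listed as strengths. Thresholds are illustrative assumptions.
    """
    weak = sorted(k for k, v in scores.items() if v <= low)
    strong = sorted(k for k, v in scores.items() if v >= high)
    notes = []
    if strong:
        notes.append("Strengths: " + ", ".join(strong))
    if weak:
        notes.append("Needs attention: " + ", ".join(weak))
    # The model only triages; the decision stays with a person.
    notes.append("Route to human review" if weak else "No flags raised")
    return notes


example = {"originality": 4.5, "appropriateness": 1.5, "divergence": 2.0}
for line in first_pass_notes(example):
    print(line)
```

The design choice worth noticing is that the output is a set of notes, not a verdict. That keeps the tool in the "cheaper first pass" role rather than the "replacement reviewer" role.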

The broader lesson is that subjective evaluation can be operationalised when three assets exist:

Asset	Operational role	Business consequence
Rubric	Defines what good judgement means	Reduces arbitrary review criteria
Multimodal evidence	Captures idea, process, and output	Evaluates creative work beyond surface polish
Expert feedback transformed into instructions	Teaches the model how to reason about judgement	Enables repeatable first-pass critique

This is the route from research result to business use. Not “let the AI be creative”. Rather: define your evaluative standard, collect examples, annotate judgement, and tune a model to reproduce that judgement consistently enough to assist humans.

Less magical. More deployable. Usually a good trade.

The real competitive implication is domain-specific judgement

CreExpert beating GPT-4V and Gemini-Pro-Vision is tempting to overread. Resist that temptation; it is sticky and bad for the furniture.

This is not evidence that a 7B open-source derivative is generally more capable than frontier proprietary models. It is evidence that a specialised model trained on a dedicated rubric-aligned dataset can outperform general models on that specific benchmark.

That distinction is commercially important. The next wave of AI value may not come from asking one giant model to be universally wise. It may come from building narrow evaluators around proprietary judgement data: legal risk review, product concept screening, technical design critique, customer support quality scoring, safety incident classification, procurement assessment, and yes, creative evaluation.

For firms, the defensible asset is not necessarily the model checkpoint. It is the institutional judgement encoded into examples and feedback. The company that can articulate how its experts make decisions can train systems to assist those decisions. The company that cannot will continue shouting “innovation” into a chatbot and receiving slightly shinier confusion.

What the paper directly shows

CreBench directly shows that, under the paper’s experimental setup, instruction-tuning a multimodal model on expert-derived creativity feedback substantially improves alignment with human creativity evaluations.

More specifically:

The benchmark covers three creativity dimensions and twelve indicators.
CreMIT provides 2.2K multimodal creative instances, 79.2K expert feedback entries, and 4.7M instruction samples.
CreExpert, fine-tuned from LLaVA-1.5-7B, substantially outperforms GPT-4V, Gemini-Pro-Vision, and nine open-source MLLMs on the reported alignment metric.
Gains are strongest for creative idea and creative process evaluation.
Product-level evaluation improves less consistently and remains the most difficult layer.
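The reported totals happen to line up with simple multiplication, which is a useful sanity check on the scale of the dataset. The figures are the rounded ones quoted in this article, so the match is suggestive rather than a statement of how the paper counted.

```python
instances = 2_200    # "2.2K creative instances" (rounded)
experts = 3
indicators = 12

# One rating per expert, per indicator, per instance.
feedback_entries = instances * experts * indicators
print(feedback_entries)  # 79200, in line with the reported 79.2K

instruction_samples = 4_700_000  # "4.7M" (rounded)
per_entry = round(instruction_samples / feedback_entries, 1)
print(per_entry)  # 59.3 instruction samples per feedback entry, on average
```

Roughly sixty instruction samples per expert judgement is the amplification factor to keep in mind when the article later raises quality control.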

That is already meaningful. It says human-aligned creativity assessment is not hopelessly mystical. It can be decomposed, annotated, trained, and measured.

What Cognaptus infers for practice

The practical inference is that many creative workflows should stop treating AI as only a generator. The more interesting role may be evaluator, coach, or diagnostic reviewer.

A generation-first AI tool produces more options. An evaluation-aware AI tool helps distinguish options. Those are different forms of productivity. The second may be more valuable in organisations already drowning in drafts, concepts, mock-ups, and “quick ideas” from every direction.

CreBench also implies that process data is underused. Most workplace AI tools evaluate final artefacts because final artefacts are easy to store. But process traces—revision history, rejected alternatives, planning notes, design logs, critique rounds—may be precisely where useful judgement lives. The messy middle is not waste. It is signal.

The business opportunity is therefore not just better scoring. It is workflow redesign: capture the creative process in machine-readable form, attach expert rubrics, and use AI to surface patterns humans can review faster.
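As a sketch of what "machine-readable process" might mean in practice, the record below stores the idea, a list of process events, and a pointer to the final artefact. The class names, field names, and event kinds are hypothetical and are not taken from CreBench's data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessEvent:
    step: int
    kind: str       # e.g. "draft", "revision", "rejected_alternative"
    note: str = ""


@dataclass
class CreativeRecord:
    task: str
    idea_text: str
    events: List[ProcessEvent] = field(default_factory=list)
    final_artifact_path: str = ""

    def count(self, kind: str) -> int:
        """Number of logged events of the given kind."""
        return sum(1 for e in self.events if e.kind == kind)


record = CreativeRecord(task="parking", idea_text="Stack cars vertically.")
record.events.append(ProcessEvent(1, "draft", "first layout"))
record.events.append(ProcessEvent(2, "rejected_alternative", "ramp too steep"))
record.events.append(ProcessEvent(3, "revision", "added lift"))
print(record.count("revision"), record.count("rejected_alternative"))
```

Even a structure this plain makes the messy middle queryable: how many alternatives were rejected, how late the revisions came, whether anything was explored at all.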

Boundaries that matter

The paper’s evidence comes from constrained visual problem-solving tasks, not the full universe of creativity. The participants include secondary students, and the tasks are structured enough to support rubrics. That is appropriate for research, but it limits direct transfer to brand strategy, industrial design, entertainment, architecture, or scientific discovery.

The model is aligned to expert ratings, not to some universal truth about creativity. The reported inter-rater reliability is useful, but it reflects agreement among trained evaluators within this framework. A different creative domain may require different rubrics, different experts, and different forms of process evidence.

The metric is Pearson correlation with human feedback. Correlation captures consistency with ratings; it does not prove the model understands creativity, nor that its feedback improves human outcomes. A model can become very good at matching a rubric without being a good collaborator.
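A small illustration of what correlation does and does not capture. In this made-up example, a rater who is always two points too generous still correlates almost perfectly with the human scores, because Pearson correlation ignores constant offsets and rescaling.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


human = [1, 2, 3, 4, 5]                 # made-up expert ratings
inflated = [h + 2 for h in human]       # always two points too generous
print(pearson(human, inflated))         # approximately 1
```

So a high alignment score says the model ranks and spaces work the way experts do. It says nothing about whether its absolute ratings, or its written feedback, would be useful to the person receiving them.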

The comparison with proprietary models should also be read carefully. CreExpert is specialised for this benchmark. GPT-4V and Gemini-Pro-Vision are general systems evaluated without being tuned on CreMIT. The result supports the value of domain-specific tuning; it does not settle the pecking order of multimodal intelligence.

Finally, the instruction expansion process relies on GPT-mediated transformation of expert reports into millions of samples. That scale is useful, but it also makes quality control important. When synthetic instruction data amplifies expert feedback, it can amplify clarity; it can also amplify phrasing patterns and rubric assumptions. Future deployments would need audits for leakage, overfitting, and domain brittleness.
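One of the simpler audits is checking that samples derived from the same creative instance do not sit on both sides of a train/test split. The sketch below assumes each sample carries a hypothetical `instance_id` field; it is one possible leakage check, not a description of how the paper split its data.

```python
def leaked_instances(train_samples, test_samples):
    """Return instance IDs that appear on both sides of a split."""
    train_ids = {s["instance_id"] for s in train_samples}
    test_ids = {s["instance_id"] for s in test_samples}
    return sorted(train_ids & test_ids)


# Made-up samples: "b" shows up in both sets and should be flagged.
train = [{"instance_id": "a"}, {"instance_id": "b"}, {"instance_id": "b"}]
test = [{"instance_id": "b"}, {"instance_id": "c"}]
print(leaked_instances(train, test))  # ['b']
```

When one expert judgement fans out into dozens of reworded questions, splitting by sample rather than by instance is an easy way to test on paraphrases of the training set without noticing.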

Creativity becomes manageable when it becomes inspectable

CreBench is valuable because it refuses the lazy binary: either creativity is mystical and cannot be measured, or it is just novelty with better typography. The paper chooses a third path. Creativity is treated as a structured judgement over ideas, behaviours, and products.

That is the practical breakthrough. Not that AI has become creative. Not that benchmarks have finally captured the soul of invention, because please, everyone take a breath. The useful claim is that models can be trained to evaluate creative work more like human experts when the assessment framework is explicit, multimodal, and process-aware.

For businesses, that changes the question. The issue is no longer whether AI can “do creativity”. The better question is whether your organisation can define what creative quality means in its own workflows, collect the evidence that reveals it, and build evaluators that make the review process less arbitrary.

CreBench rewrites the rules of machine creativity by making the rules visible. In creative work, that may be the least glamorous advance and the most commercially useful one.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kaiwen Xue et al., “CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product,” arXiv:2511.13626, 2025. https://arxiv.org/abs/2511.13626
