TL;DR for operators
Image-editing demos are easy. Ask a model to remove one object, recolour a jacket, or add a tasteful lamp, and most modern systems can produce something impressive enough for a product page and a LinkedIn post. Ask it to perform eight connected edits while keeping the original subject, layout, texture, lighting, and realism intact, and the polite showroom smile begins to crack.
The Complex-Edit paper introduces a benchmark for exactly that harder regime: instruction-based image editing where complexity is controlled by combining multiple atomic edit operations into one coherent instruction [1]. Its main contribution is not merely another leaderboard. It gives operators a cleaner way to ask: did the model make the requested change, did it preserve what should not change, and does the final image still look real?
The most operationally useful result is also the least flattering to current tooling. As instruction complexity rises, models do not simply “miss a few details.” They increasingly damage identity preservation and perceptual quality. In other words, the object may be edited, but the image forgets what it was. Wonderful. The AI followed the brief and quietly replaced the product.
The paper also punctures a tempting analogy from language models. Chain-of-thought-style decomposition works well in many text reasoning tasks, so one might expect complex visual edits to improve when split into sequential atomic steps. Complex-Edit finds the opposite: sequential editing tends to accumulate artifacts and distortions, degrading performance across Instruction Following, Identity Preservation, and Perceptual Quality. Pixels, unlike paragraphs, are not very forgiving of intermediate mistakes.
For businesses, the practical lesson is direct: do not evaluate image-editing AI with cute one-step demos. Test multi-edit production requests, score preservation separately from compliance, and treat “looks synthetic” as a measurable operational defect, not a vague aesthetic complaint.
The real task is not editing an image; it is not ruining the rest of it
Instruction-based image editing sounds simple enough: provide an input image, give a text command, receive an edited image. The trouble is that most business uses are not asking for unconstrained generation. They are asking for controlled modification.
That distinction matters.
A retailer does not just want “change the handbag colour to red.” It wants the same handbag, same stitching, same proportions, same model pose, same background, same lighting logic, and a red colour that does not make the product look like it was dipped in jam. A hotel marketing team does not just want “add warm lighting and remove clutter.” It wants the room to remain recognisably the same room. A brand team does not just want “make the packaging more premium.” It wants the actual packaging identity to survive.
Complex-Edit is useful because it treats image editing as a three-way constraint rather than a one-dimensional success story.
| Evaluation dimension | What it asks | Business translation |
|---|---|---|
| Instruction Following | Did the requested edits appear? | Did the model do what the operator asked? |
| Identity Preservation | Did unchanged elements stay unchanged? | Did the product, person, place, or brand survive the edit? |
| Perceptual Quality | Does the final image look coherent and artifact-free? | Can this output plausibly enter a workflow without embarrassment? |
That separation is the paper’s most practical evaluation move. A model can score well on Instruction Following while quietly damaging Identity Preservation. Another can preserve the image but fail the requested edit. A third can satisfy both and still produce a strange synthetic sheen, the visual equivalent of a corporate stock photo after a spiritual crisis.
For operators, a single “quality score” hides these differences. Complex-Edit makes them visible.
Complex-Edit builds complexity instead of merely naming it
The benchmark’s data-generation pipeline is deliberately structured. It starts by defining 24 atomic editing operations across nine broad categories, including object manipulation, colour and tone adjustment, texture and material changes, background edits, lighting, text, pose, composition, and special effects.
The pipeline then runs through three stages.
First, GPT-4o generates a sequence of atomic instructions for each input image. These are small, basic edit tasks such as adding an object, changing a colour, removing text, adjusting lighting, or reframing a composition.
Second, the generated instructions are simplified. This matters because LLM-generated instructions often include helpful-sounding but unnecessary commentary. A benchmark does not need “to create a more cheerful and inviting visual atmosphere”; it needs “change the wall colour to yellow.” Less poetry, fewer loopholes.
Third, the simplified atomic instructions are compounded into a single coherent complex instruction. The paper’s example is nicely revealing: instead of separately saying “add a ball of yarn” and then “change the colour of the yarn to red,” the compounded instruction becomes “add a red ball of yarn.” The benchmark can then control complexity by adjusting how many atomic instructions are merged, from the simplest level to a hardest level that combines eight atomic operations.
This is stronger than collecting a bag of vaguely “hard” prompts. Complexity becomes a parameter, not a vibe.
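To make "complexity as a parameter" concrete, here is a minimal sketch of the data shape involved. It assumes that complexity level k uses the first k atomic instructions of a sequence, and it replaces the LLM compounding step with a plain string join. The class name, function names, and category labels are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicEdit:
    category: str      # hypothetical category label
    instruction: str   # simplified atomic instruction


def atomic_subset(sequence: list[AtomicEdit], level: int) -> list[AtomicEdit]:
    """Return the first `level` atomic edits; `level` is the complexity knob."""
    if not 1 <= level <= len(sequence):
        raise ValueError(f"level must be in 1..{len(sequence)}")
    return sequence[:level]


def naive_compound(edits: list[AtomicEdit]) -> str:
    """Placeholder for the LLM compounding step: it only joins the instructions."""
    return "; ".join(e.instruction for e in edits)


sequence = [
    AtomicEdit("object manipulation", "add a ball of yarn"),
    AtomicEdit("colour and tone", "change the colour of the yarn to red"),
    AtomicEdit("background", "change the wall colour to yellow"),
]

# One instruction per complexity level, built from the same atomic sequence.
for level in range(1, len(sequence) + 1):
    print(level, "->", naive_compound(atomic_subset(sequence, level)))
```

A real compounding step would merge the first two edits into "add a red ball of yarn" rather than concatenating them. The join is only there to keep the sketch runnable.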
The dataset contains both realistic and synthetic input images: 531 deduplicated real images from the EMU-Edit test set and 531 synthetic images generated with FLUX.1 from corresponding captions. The authors evaluate three open-source editing models — UltraEdit, OmniGen, and AnyEdit — and two proprietary models, Imagen3 and SeedEdit. They also run a preliminary GPT-4o evaluation, limited by web-interface access and restricted to roughly 30% of real images at the hardest complexity level.
That design is not perfect, but it is coherent. It lets the paper compare model behaviour as editing instructions become more demanding, rather than treating complexity as an anecdote.
The metric work is not decoration; it is load-bearing
The paper spends meaningful effort on meta-evaluating its evaluation method. Good. Vision benchmarks often collapse into a familiar ritual: invent a metric, report numbers, pretend the metric has become reality by being formatted in a table.
Complex-Edit tries to avoid that trap by checking whether its VLM-based autograder correlates with human comparisons. Instead of asking human raters to produce absolute numeric scores — a task humans are famously inconsistent at, because humans are not spreadsheet accessories — the authors ask raters to compare pairs of outputs for the same input image and instruction. They then examine whether the metric score differences align with those human preferences.
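The pairwise check can be sketched in a few lines. The example below is an assumption-laden illustration, not the paper's exact procedure: it encodes each human comparison as +1, -1, or 0, then computes a Pearson correlation and a simple sign-agreement rate against the metric's score difference. All numbers are made up.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; assumes both lists vary."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# (metric score for output A, metric score for output B, human preference)
# preference: +1 rater preferred A, -1 preferred B, 0 tie. Illustrative values only.
pairs = [
    (8.0, 5.5, +1),
    (6.0, 7.5, -1),
    (7.0, 6.8, 0),
    (4.0, 6.0, -1),
    (9.0, 8.0, +1),
    (5.0, 6.5, +1),   # the metric disagrees with the rater here
]

diffs = [a - b for a, b, _ in pairs]
prefs = [float(p) for _, _, p in pairs]

r = pearson(diffs, prefs)

# Sign agreement over pairs where the rater expressed a preference.
decided = [(d, p) for d, p in zip(diffs, prefs) if p != 0]
agreement = sum((d > 0) == (p > 0) for d, p in decided) / len(decided)

print(f"correlation={r:.3f} agreement={agreement:.2f}")
```

The design point is that raters never produce a number. The metric has to earn its credibility by ranking pairs the way people do.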
Several design choices matter.
First, direct numeric scoring outperforms token-probability scoring. The token-probability approach reframes evaluation as a yes/no question and uses the model’s probability of “Yes” as the score. That sounds elegant, especially if one has spent too much time around verifier models. But in this benchmark, numeric scoring aligns better with human evaluation.
Second, detailed rubrics help. The appendix defines 0–10 scoring descriptions for Instruction Following, Identity Preservation, and Perceptual Quality. This is not bureaucratic garnish. It reduces ambiguity in the evaluator’s job, especially when the same output can be good in one dimension and poor in another.
Third, chain-of-thought prompting is not universally helpful even for evaluation. The paper finds that CoT improves some correlations for Instruction Following and Identity Preservation when paired with rubrics, but it hurts Perceptual Quality under numeric scoring. The final evaluation setup therefore uses CoT for Instruction Following and Identity Preservation, but disables it for Perceptual Quality.
Fourth, Perceptual Quality should be judged without the instruction. This is a subtle but important point. One might assume the evaluator should know the edit request so it can distinguish intentional weirdness from visual failure. The paper finds the opposite: including the instruction when judging Perceptual Quality drops correlation with human evaluation from 0.234 to 0.046. Apparently, once told what the model was trying to do, the evaluator becomes too forgiving of visual oddities. A familiar managerial disease.
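The four design choices amount to a per-metric evaluator configuration. The sketch below is one way to write that down. The prompt wording is hypothetical, and showing the instruction to the Identity Preservation judge is an assumption, since the text above only states that Perceptual Quality omits it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    use_rubric: bool
    use_cot: bool
    show_instruction: bool


JUDGES = {
    "instruction_following": JudgeConfig(use_rubric=True, use_cot=True, show_instruction=True),
    # show_instruction=True is an assumption for this metric.
    "identity_preservation": JudgeConfig(use_rubric=True, use_cot=True, show_instruction=True),
    # No CoT and no instruction for Perceptual Quality, as described above.
    "perceptual_quality": JudgeConfig(use_rubric=True, use_cot=False, show_instruction=False),
}


def build_prompt(metric: str, instruction: str) -> str:
    """Assemble a hypothetical judge prompt from the metric's configuration."""
    cfg = JUDGES[metric]
    parts = [f"Score the edited image for {metric.replace('_', ' ')} from 0 to 10."]
    if cfg.use_rubric:
        parts.append("Use the detailed 0-10 rubric for this dimension.")
    if cfg.show_instruction:
        parts.append(f"Editing instruction: {instruction}")
    if cfg.use_cot:
        parts.append("Explain your reasoning step by step, then give the score.")
    else:
        parts.append("Reply with the score only.")
    return "\n".join(parts)


print(build_prompt("perceptual_quality", "add a red ball of yarn"))
```

Keeping the configuration per metric, rather than global, is what lets Perceptual Quality be judged blind to the brief.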
Finally, the paper handles per-sample variance by averaging 20 independent VLM evaluations per sample. That increases cost, but the authors show that the variance stabilises around that number. This is especially relevant for test-time scaling, where a model may select the “best” candidate output based on per-sample scores. If those scores wobble, the selection procedure becomes expensive dice-rolling.
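Why averaging helps can be shown with a toy simulation. The judge below is a stand-in that returns a fixed "true" score plus Gaussian noise. The noise level and the true score are assumptions chosen for illustration, not measurements from the paper.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def noisy_judge(true_score: float = 7.0, noise: float = 1.0) -> float:
    """Stand-in for one VLM evaluation: true score plus noise, clipped to 0-10."""
    return min(10.0, max(0.0, rng.gauss(true_score, noise)))


def averaged_score(n_evals: int) -> float:
    """Per-sample score obtained by averaging n independent evaluations."""
    return mean(noisy_judge() for _ in range(n_evals))


def spread(n_evals: int, trials: int = 500) -> float:
    """How much the averaged score wobbles across repeated scoring runs."""
    return pstdev(averaged_score(n_evals) for _ in range(trials))


spreads = {n: spread(n) for n in (1, 5, 20)}
for n, s in spreads.items():
    print(f"{n:2d} evaluations per sample -> spread {s:.3f}")
```

With independent noise, the spread shrinks roughly with the square root of the number of evaluations, which is why the cost of repetition buys stability rather than accuracy.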
| Component | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| Human-comparison meta-evaluation | Main validation for the metrics | The VLM scores are not arbitrary decorations | That the metrics perfectly capture all human preferences |
| Numeric vs token-probability scoring | Ablation on scoring method | Numeric scoring is more reliable here | That token-probability scoring is generally bad |
| CoT in the evaluator | Ablation on prompting strategy | CoT is selectively useful, not magic seasoning | That CoT never helps visual evaluation |
| Perceptual Quality without instruction | Sensitivity test | Visual naturalness is better judged from the output alone | That business users never care about instruction context |
| 20 repeated evaluations | Robustness/stability choice | Per-sample scores become more reproducible | That the evaluation pipeline is cheap |
For business readers, the message is simple: an evaluation framework is itself a product decision. Bad metrics will buy the wrong model very confidently.
Simple instructions flatter models; complex instructions expose them
The central experimental contrast is between low and high instruction complexity.
On real images, all five main models lose overall score when moving from the simplest to the hardest complexity level. The declines are not uniform, which is exactly why the three-metric decomposition matters.
For real-image direct editing, Identity Preservation is hit especially hard. UltraEdit drops from 7.76 to 5.93 in Identity Preservation. OmniGen drops from 8.69 to 6.42. Imagen3 drops from 8.93 to 6.55. SeedEdit drops from 9.01 to 6.91. These are not tiny measurement wrinkles. They suggest that as instructions compound, models increasingly alter things they were supposed to leave alone.
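For scale, the drops can be computed directly from the four pairs of scores quoted above. The snippet uses only those figures; the relative-drop column is a convenience added here, not a statistic reported in the paper.

```python
# Identity Preservation on real images: (simplest level, hardest level)
identity_preservation = {
    "UltraEdit": (7.76, 5.93),
    "OmniGen": (8.69, 6.42),
    "Imagen3": (8.93, 6.55),
    "SeedEdit": (9.01, 6.91),
}

drops = {
    model: round(simplest - hardest, 2)
    for model, (simplest, hardest) in identity_preservation.items()
}

for model, (simplest, _) in identity_preservation.items():
    print(f"{model:10s} -{drops[model]:.2f} points ({drops[model] / simplest:.0%} relative)")
```

Every one of the four loses close to two points or more on a 0-10 scale.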
Instruction Following behaves less consistently. AnyEdit, for example, suffers a severe Instruction Following drop on real images, from 5.94 to 1.61, while its Perceptual Quality actually rises slightly from 6.78 to 7.25. That is not a contradiction. It means the model can produce a visually acceptable image that fails the requested transformation. In production, that is still a failure. A beautiful wrong answer is a wrong answer with better lighting.
On synthetic images, the pattern broadly repeats. Overall scores fall for all five main models as complexity rises. Open-source models show drops larger than one point in Overall score, while the proprietary systems decline less sharply. The paper’s broader trend is therefore not just “hard prompts are hard.” It is that complexity increases the separation between stronger and weaker systems.
The preliminary GPT-4o result complicates the picture in a useful way. At the hardest complexity level on real images, GPT-4o scores 9.29 on Instruction Following, 7.51 on Identity Preservation, 9.47 on Perceptual Quality, and 8.76 Overall, outperforming the other evaluated systems in that limited setting. But the paper is careful about the boundary: there was no API access, outputs came through the web interface, images were centre-cropped to match aspect ratio, and the evaluation covered only roughly 30% of real images at the hardest level.
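A quick sanity check on those figures: the reported Overall value is consistent with an unweighted mean of the three dimension scores. The aggregation rule is inferred here from the numbers alone, so treat it as an assumption rather than the paper's stated formula.

```python
from statistics import mean

# GPT-4o, hardest complexity level, real images (figures quoted above)
gpt4o = {
    "instruction_following": 9.29,
    "identity_preservation": 7.51,
    "perceptual_quality": 9.47,
}

# Assumption: Overall is the arithmetic mean of the three dimensions.
overall = round(mean(gpt4o.values()), 2)
print(overall)
```

Note what the mean hides: Identity Preservation sits almost two points below the other two dimensions, which is exactly the kind of gap a single score smooths over.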
So GPT-4o appears strong. It does not make the benchmark obsolete. If anything, it gives the benchmark a more interesting top-end stress case.
Direct editing beats step-by-step editing because images accumulate scars
The most tempting misconception is that complex image editing should benefit from the same decomposition logic that helps language reasoning. Break the task into smaller steps. Solve each step. Accumulate progress. What could possibly go wrong?
The answer is: the image.
Complex-Edit compares direct editing against CoT-like sequential editing. In direct editing, the model receives the compounded instruction and edits the original image once. In sequential editing, the model applies each atomic instruction step by step, using the previous output as the next input.
In language, intermediate reasoning can be revised, ignored, or abstracted away. In image editing, every intermediate output becomes the substrate for the next operation. A slightly distorted face, a softened edge, an inconsistent shadow, or a hallucinated background detail does not remain a private scratchpad note. It becomes pixels. The next edit inherits the damage.
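The structural difference between the two protocols fits in a few lines. This is a sketch with a toy editor: the `Image` stand-in, `toy_editor`, and the pass counter are all hypothetical, and the counter only illustrates how many times the image is re-generated. It does not model real artifact behaviour.

```python
from typing import Callable

Image = dict  # toy stand-in for pixels: {"applied": [...], "passes": int}
EditFn = Callable[[Image, str], Image]


def toy_editor(image: Image, instruction: str) -> Image:
    """Hypothetical editor: records the instruction and counts one more generation pass."""
    return {"applied": image["applied"] + [instruction], "passes": image["passes"] + 1}


def direct_edit(edit: EditFn, original: Image, compound: str) -> Image:
    """One call on the original image with the compounded instruction."""
    return edit(original, compound)


def sequential_edit(edit: EditFn, original: Image, atomic: list[str]) -> Image:
    """One call per atomic instruction; each output becomes the next input."""
    current = original
    for instruction in atomic:
        current = edit(current, instruction)
    return current


original = {"applied": [], "passes": 0}
atomic = [
    "add a ball of yarn",
    "change the colour of the yarn to red",
    "change the wall colour to yellow",
]

direct = direct_edit(toy_editor, original, "add a red ball of yarn and change the wall colour to yellow")
sequential = sequential_edit(toy_editor, original, atomic)
print(direct["passes"], sequential["passes"])
```

In the sequential path the original image is only ever seen by the first call. Every later step works from a copy of a copy.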
The paper finds that sequential editing yields a steady decline across all three metrics, with visual artifacts and distortions accumulating as the number of intermediate steps grows. This holds even for stronger proprietary systems such as Imagen3 and SeedEdit. AnyEdit shows one partial exception: sequential editing can improve its Instruction Following, but that gain is offset by worse Identity Preservation.
That trade-off is exactly what operators should watch. A workflow that decomposes image edits into “manageable” steps may improve compliance with individual instructions while degrading the asset’s core identity. For brand, product, and person-centred workflows, that is usually a bad bargain.
The lesson is not that decomposition is useless. It is that visual decomposition needs a preservation mechanism. Without it, step-by-step editing becomes step-by-step erosion.
Best-of-N helps, but it is selection rather than understanding
The paper also tests Best-of-N, a simple test-time scaling strategy. Generate multiple candidate outputs, evaluate them, and select the best one. For direct editing, candidates are scored by the Overall metric. For sequential editing, candidates are generated and selected at each intermediate step, with the evaluator judging whether the output reflects the cumulative instruction up to that point.
The results are sensible. Best-of-N improves direct editing across metrics. It also helps sequential editing, especially for Identity Preservation and Perceptual Quality. But there is a catch: sequential editing with Best-of-N still struggles to surpass direct editing without Best-of-N.
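Best-of-N itself is a small loop. The sketch below assumes a `generate` callable that returns one candidate per call and a `score` callable that returns an Overall-style number. Both are hypothetical stand-ins for the editing model and the evaluator, and the scores are made up.

```python
from typing import Callable


def best_of_n(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n: int,
) -> tuple[str, float]:
    """Generate n candidates, score each once, and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy usage: candidate ids with made-up judge scores.
pool = iter(["a", "b", "c", "d"])
judge_scores = {"a": 6.1, "b": 7.4, "c": 5.0, "d": 7.0}

best, best_score = best_of_n(lambda: next(pool), judge_scores.__getitem__, n=4)
print(best, best_score)
```

For the sequential variant described above, the same function would run once per intermediate step, with the chosen candidate feeding the next step.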
That matters commercially because Best-of-N is not free. It multiplies generation and evaluation cost. It can be a useful mitigation strategy when quality matters more than latency or compute spend, but it should not be mistaken for a deeper solution. Selection can choose the least damaged candidate among several attempts. It does not remove the underlying tendency to damage the image through repeated transformations.
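A rough call budget makes the multiplication visible. The defaults below are assumptions for illustration: three metrics per candidate and 20 repeated evaluations per metric, applied at every step. The paper's actual Best-of-N bookkeeping may differ.

```python
def call_budget(n_candidates: int, steps: int = 1, metrics: int = 3, repeats: int = 20) -> dict:
    """Count model calls for Best-of-N under the assumptions stated above."""
    generations = n_candidates * steps
    evaluations = generations * metrics * repeats
    return {"generations": generations, "evaluations": evaluations}


direct = call_budget(n_candidates=4)               # one compounded edit
sequential = call_budget(n_candidates=4, steps=8)  # eight atomic steps

print("direct:", direct)
print("sequential:", sequential)
```

Under these assumptions the evaluator, not the generator, dominates the bill, so the evaluation budget deserves its own line in the estimate.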
For production teams, Best-of-N belongs in the “quality escalation” layer, not the default answer to every workflow. Use it where the asset value justifies the extra cost: campaign hero images, product catalogue masters, executive portraits, regulated visual claims, and other places where silent distortion is expensive.
The synthetic-data curse is a style leak, not merely a score drop
One of the paper’s more interesting observations is qualitative: under very complex instructions on real images, some model outputs lose their realistic appearance and drift toward a synthetic aesthetic. The authors describe outputs resembling oil paintings or animations.
UltraEdit appears especially susceptible, and the authors connect this to its training data composition, which contains a higher proportion of synthetic images than OmniGen or AnyEdit. The phenomenon also appears in proprietary models, including SeedEdit, Imagen3, and GPT-4o, although their training data sources are undisclosed. The paper cautiously suggests that synthetic data may contribute both to strong generative capability and to the tendency for complex edits to look synthetic.
This point should be handled carefully. The benchmark does not prove the full training-data causal chain for closed models. It cannot, because those datasets are not public. But it does identify a practical failure mode: as edit complexity rises, some systems may preserve semantic success while leaking synthetic style into realistic assets.
That is commercially important. Many visual workflows depend on realism as part of trust. Real estate photos, hospitality imagery, product listings, insurance documentation, medical-adjacent visual communication, identity-sensitive portraits, and marketplace images all suffer when the output starts looking “AI-ish.” The issue is not aesthetic snobbery. It is asset credibility.
A model that produces attractive synthetic-looking images may be excellent for concept art and poor for controlled commercial editing. The output is the same; the job is different. Procurement teams should not confuse the two use cases. Naturally, they will, unless forced not to.
What operators should change in their evaluation scorecards
Complex-Edit is a research benchmark, not a plug-and-play enterprise procurement template. Still, its structure translates well into business testing.
A serious image-editing AI evaluation should include at least four changes.
First, test complexity explicitly. Do not rely on one-step prompts. Build evaluation sets with two, four, six, and eight connected modifications. Complexity should be measured by the number and type of edit operations, not by how dramatic the prompt sounds.
Second, score preservation separately. If the product changes shape while changing colour, that is not a small defect. If a person’s face shifts while the model changes the background, that is not a “creative variation.” It is identity loss.
Third, separate realism from instruction compliance. A model can satisfy the prompt and still produce an image that looks fake. For many business settings, that output is unusable.
Fourth, test direct and sequential workflows separately. Many teams will naturally build editing agents that decompose a request into tool calls. That may be sensible for planning, logging, and user interaction. But the final visual operation may still need to happen as a single controlled edit, or with preservation-aware mechanisms, rather than as a naive chain of image transformations.
A practical vendor scorecard could look like this:
| Procurement question | What to test | Failure pattern to catch |
|---|---|---|
| Can it handle real production briefs? | Multi-operation prompts at increasing complexity | Strong demo performance but weak complex editing |
| Does it preserve core assets? | Before/after checks on product, person, logo, layout, and background | The edit succeeds while the asset mutates |
| Does it still look real? | Perceptual review without showing the instruction | Synthetic sheen, inconsistent shadows, pasted objects |
| Does decomposition help or hurt? | Direct edit vs sequential edit on the same request | Step-by-step artifact accumulation |
| Is quality scaling worth the cost? | Best-of-N at different candidate counts | Expensive selection with marginal gains |
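One way to encode "score each dimension separately" is a pass/fail gate per dimension rather than a single averaged threshold. The thresholds, field names, and scores below are hypothetical. Each team would set its own per asset class.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical minimum acceptable scores on a 0-10 scale.
THRESHOLDS = {
    "instruction_following": 7.0,
    "identity_preservation": 8.0,
    "perceptual_quality": 7.0,
}


@dataclass(frozen=True)
class TrialResult:
    model: str
    workflow: str      # "direct" or "sequential"
    complexity: int    # number of connected edit operations
    scores: dict[str, float]


def failed_dimensions(result: TrialResult) -> list[str]:
    """List every dimension that falls below its own threshold."""
    return [dim for dim, minimum in THRESHOLDS.items() if result.scores[dim] < minimum]


trial = TrialResult(
    model="vendor_a",  # made-up vendor and scores
    workflow="direct",
    complexity=8,
    scores={
        "instruction_following": 8.5,
        "identity_preservation": 6.0,
        "perceptual_quality": 7.5,
    },
)

print("mean score:", round(mean(trial.scores.values()), 2))
print("failed:", failed_dimensions(trial))
```

In this made-up trial the mean clears 7.0 while Identity Preservation fails its gate. That is the case a single quality score would wave through.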
The broader implication is that “image editing AI” should not be purchased as a monolithic capability. It should be matched to asset class, risk tolerance, latency needs, and the cost of visual drift.
The benchmark’s boundaries matter for practical use
The paper’s limitations are not fatal, but they are important.
The benchmark uses 531 real images from EMU-Edit and 531 FLUX-generated synthetic images. That is enough to expose meaningful patterns, but it does not cover every commercial domain. Product photography, hotel interiors, medical imagery, fashion catalogues, legal evidence, industrial inspection, and food delivery photos each have their own preservation constraints.
The model set is also limited. The main comparison covers three open-source systems and two proprietary systems. Proprietary model results are further constrained by usage policies: Imagen3 is evaluated on approximately 60% of images and SeedEdit on 95%. GPT-4o is evaluated only preliminarily, through the web interface, on roughly 30% of real images at the hardest complexity level. Those numbers are useful signals, not a universal ranking of the market.
The evaluation pipeline depends on GPT-4o as a VLM autograder. The authors meta-evaluate that pipeline against human comparisons and make several sensible design choices, but VLM evaluation is still an approximation. In real procurement, human review remains necessary for high-risk asset categories. The point is not to replace human judgment; it is to make human judgment less random and less expensive.
Finally, the “curse of synthetic data” observation should be read as a practical warning, not a definitive causal proof for every model. The paper can connect visible synthetic drift to known or suspected synthetic data exposure in some cases. For closed models, the causal explanation remains partly inferential.
The useful lesson is diagnostic, not dramatic
Complex-Edit does not tell us that image-editing AI is bad. That would be lazy, and worse, unhelpful. The paper shows something more precise: current image-editing systems can be strong at visible instruction compliance while remaining fragile under compounded constraints, especially when preservation and realism matter.
That is exactly the kind of fragility businesses need to understand before they wire these tools into content pipelines.
The next stage of image-editing AI will not be judged only by whether it can perform an impressive transformation. It will be judged by whether it can make the transformation while leaving everything else alone. This sounds modest until one remembers that “everything else” is usually the product, the brand, the person, the room, the evidence, or the thing customers were supposed to trust.
Step-by-step reasoning made language models look smarter. Step-by-step editing, at least in this benchmark, often makes images look worse. The difference is not philosophical. It is operational. Text can hide its scratchpad. Images have to live with theirs.
Cognaptus: Automate the Present, Incubate the Future.
[1] Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie, “Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark,” arXiv:2504.13143, 2025.