Image generation has become good enough to be useful and unreliable enough to remain annoying. That is the normal condition of enterprise AI: impressive demos, awkward edge cases, and someone in operations quietly asking whether the model actually understood the instruction or merely produced something that looked plausible from a distance.

A user asks for “a red ceramic mug on a wooden desk, next to an open notebook, in morning light.” The model produces a beautiful desk, credible sunlight, maybe even the notebook. The mug is blue. Or metallic. Or missing. If a separate vision model can look at the image and say, “That is not a red ceramic mug,” the failure feels almost rude. The system can see the problem after creating it. Very efficient, in the same way that a committee can discover a typo after approving the brochure.

That awkward asymmetry is the point of the paper behind this article. Han et al. study what they call the internal gap between generation and understanding in unified multimodal large language models: models that are supposed to both generate and understand multimodal content inside one architecture.[1] Their core claim is not simply that multimodal models make mistakes. That would be an expensive way to rediscover Tuesday. The more interesting claim is that some models already contain a stronger understanding branch that can diagnose weaknesses in their weaker generation branch.

That changes the practical question. Instead of asking only “How do we get more labeled data?” or “Which external judge should supervise the model?”, the paper asks whether a model’s own self-contradictions can become training material.

The answer is cautiously promising. The business interpretation is more specific: self-contradiction is not magic self-awareness. It is a potentially cheap error-discovery mechanism. Cheap diagnosis is not the same as autonomous quality assurance. The difference matters.

The useful failure is internal disagreement, not generic hallucination

The common reader mistake is to treat every multimodal failure as hallucination. The model was asked for one thing; it produced another; therefore it hallucinated. Fine, as a complaint. Not fine as a diagnosis.

The paper’s distinction is sharper. In a unified multimodal model, generation and understanding are expected to converge. The model should generate an image aligned with a prompt, and it should also be able to judge whether an image matches that prompt. If generation says, in effect, “Here is the requested image,” while understanding later says, “This image does not match the request,” the model is not merely failing externally. It is internally non-unified.

That matters because the remedy depends on the source of the failure. If the model misunderstands the prompt, then using its own understanding as a judge is dangerous. It will confidently supervise the wrong behavior. If the model understands the prompt but generates poorly, then its understanding can be used as a filter, scorer, or preference signal. Same visible error, different engineering strategy.

The paper formalizes this with a non-unification score: the proportion of cases where the model’s understanding branch judges its own generated output as misaligned with the prompt. In their evaluation across six unified multimodal models and tasks of different difficulty, non-unification can approach 60%. More importantly, their analysis attributes most of the internal gap (roughly 60% to 100% of it in the reported settings) to weak generation rather than weak understanding.
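As a rough illustration of the score described above, the sketch below counts how often a judge rejects the generator's own output. The function name `non_unification_score`, the two callables, and the toy stand-ins are assumptions made for this example; they are not the paper's code, and a real setup would call the model's generation and understanding branches instead.

```python
from typing import Callable, Iterable


def non_unification_score(
    prompts: Iterable[str],
    generate: Callable[[str], object],
    judge_aligned: Callable[[str, object], bool],
) -> float:
    """Fraction of prompts whose own generated output is judged misaligned."""
    total = 0
    misaligned = 0
    for prompt in prompts:
        output = generate(prompt)  # generation branch (hypothetical callable)
        if not judge_aligned(prompt, output):  # understanding branch (hypothetical callable)
            misaligned += 1
        total += 1
    if total == 0:
        raise ValueError("need at least one prompt")
    return misaligned / total


# Toy stand-ins: the "generator" silently drops the colour red,
# and the "judge" accepts only an exact match with the prompt.
def toy_generate(prompt: str) -> str:
    return prompt.replace("red ", "")


def toy_judge(prompt: str, output: object) -> bool:
    return prompt == output


prompts = ["a red mug", "a wooden desk", "a red notebook", "an open window", "morning light"]
score = non_unification_score(prompts, toy_generate, toy_judge)
print(f"non-unification: {score:.2f}")
```

The number says nothing about why the mismatch occurs. Attributing it to weak generation or weak understanding needs the separate analysis the paper performs.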

That is the first useful result. The contradiction is not just embarrassment. It is a map.

The paper measures the model against itself, not against human taste

A subtle strength of the paper is that it does not begin by importing a separate reward model and pretending that external judgment is neutral. External evaluators are useful, but they introduce their own biases, coverage limits, and costs. The authors instead measure whether the model’s own understanding branch agrees with its generation branch.

This choice narrows the claim. The metric does not prove that the model is aligned with human preference in the broad sense. It does not tell us whether the image is aesthetically pleasing, legally safe, brand-compliant, culturally appropriate, or commercially effective. It tells us whether the model’s internal understanding can detect that its generation failed the prompt.

That narrower claim is more useful than a grander one. In operations, many AI quality failures are not philosophical. They are boring and expensive: the generated product image misses a required attribute; the visual report omits a labeled component; a synthetic training example contradicts its caption; a multimodal assistant describes something not present in the image. For those cases, internal consistency is not sufficient, but it is a sensible first checkpoint.

The nearby literature is moving in the same direction. GapEval, for example, separately studies whether unified multimodal models truly align generation and understanding across bidirectional tasks, concluding that many systems still show surface-level rather than deep unification.[2] LLaVA-Critic takes another route by training an open multimodal evaluator to judge multimodal outputs and support preference learning.[3] Self-Grounded Verification studies a related evaluator problem: multimodal models can over-validate agent behavior, an “agreement bias” that makes naïve self-judgment risky.[4]

Put together, the landscape is not saying, “Models can judge themselves, problem solved.” It is saying something less comforting and more actionable: judging is becoming a capability, but judging the judge is now part of the job. Naturally, the bureaucracy has become recursive.

The main evidence says generation is the weaker branch

The paper’s evidence is important because it changes the operational interpretation of failure.

If understanding is stronger than generation, then post-training can use the understanding branch to score generated outputs and select better examples. The authors test this idea by constructing post-training data from internally scored generations. They apply standard post-training methods such as supervised fine-tuning (SFT) and preference training in the style of direct preference optimization (DPO). Their reported results show generation gains of up to about 20% on T2I-CompBench++ and reductions in non-unification of up to about 16%.

The magnitude should be read carefully. “Up to” is not a business case. It is a ceiling observed in specific experimental settings. The more reliable takeaway is the direction and mechanism: models and subtasks with larger internal gaps tend to benefit more because the gap exposes more training signal. In other words, the failure rate is not merely a liability. Under the right conditions, it becomes a source of selection pressure.

| Paper result | Interpretation | Business meaning | Boundary |
| --- | --- | --- | --- |
| Non-unification can approach 60% across evaluated unified MLLMs | Generation and understanding are not automatically integrated just because they live in one model | “Unified model” should not be treated as “unified capability” in procurement or deployment | The metric measures internal prompt-image consistency, not broad human satisfaction |
| Most measured misalignment comes from weak generation | The model often knows what went wrong after producing the wrong thing | Internal scoring can help identify usable and unusable generated samples | This assumes the understanding branch is actually stronger for the task |
| Self-improvement improves generation and reduces the internal gap | Internal contradiction can become post-training signal | Useful for synthetic data pipelines, brand asset generation, visual QA, and multimodal workflow testing | Gains depend on model architecture, task type, and evaluation design |
| Understanding also improves after generation-targeted training | The two branches may share learning dynamics | Better generation can sharpen future diagnosis, not just final output quality | Co-improvement is empirical and theoretical, not a guarantee for every model |

This is where the article should resist the easy slogan. The paper does not show that a model can simply “reflect” and become wise. It shows that, when the understanding branch is strong enough, internally detected contradiction can be converted into higher-quality post-training data. That is not consciousness. That is data curation with fewer interns.

Self-improvement works by turning the stronger branch into a temporary critic

The method is conceptually simple.

First, the model generates outputs from prompts. Second, its understanding branch scores whether those outputs match the prompts. Third, the system uses those scores to construct post-training data. For supervised fine-tuning, better aligned samples can be selected as training targets. For preference learning, better and worse generations can form preference pairs.
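The three steps can be sketched in a few lines. This is a minimal illustration, assuming several candidates per prompt and a numeric alignment score in [0, 1] from the understanding branch. The `ScoredSample` type, the thresholds, and the `chosen`/`rejected` key names are assumptions made for the example, not details taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSample:
    prompt: str
    output_id: str  # hypothetical handle to a generated image
    score: float    # hypothetical alignment score in [0, 1] from the understanding branch


def group_by_prompt(samples):
    """Collect candidates per prompt, sorted from best to worst score."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample.prompt].append(sample)
    return {p: sorted(c, key=lambda s: s.score, reverse=True) for p, c in groups.items()}


def build_sft_set(samples, min_score=0.8):
    """Keep the top candidate per prompt, but only if it clears min_score."""
    return [
        (prompt, ranked[0].output_id)
        for prompt, ranked in group_by_prompt(samples).items()
        if ranked[0].score >= min_score
    ]


def build_preference_pairs(samples, min_margin=0.3):
    """Pair the best and worst candidate per prompt when their scores differ enough."""
    pairs = []
    for prompt, ranked in group_by_prompt(samples).items():
        best, worst = ranked[0], ranked[-1]
        if best.score - worst.score >= min_margin:
            pairs.append({"prompt": prompt, "chosen": best.output_id, "rejected": worst.output_id})
    return pairs


# Made-up scores, purely for illustration.
samples = [
    ScoredSample("red ceramic mug", "img_a", 0.9),
    ScoredSample("red ceramic mug", "img_b", 0.2),
    ScoredSample("red ceramic mug", "img_c", 0.6),
    ScoredSample("blue wooden door", "img_d", 0.5),
    ScoredSample("blue wooden door", "img_e", 0.4),
]
sft_set = build_sft_set(samples)
pairs = build_preference_pairs(samples)
print(sft_set)
print(pairs)
```

The margin filter is a design choice for this sketch: two candidates the critic scores almost identically carry little preference signal, so the second prompt yields no pair.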

This is useful because many enterprise multimodal workflows face a supervision bottleneck. Human review is accurate but slow. External reward models may be expensive, unavailable, or misaligned with the company’s specific content rules. Internal scoring offers a middle layer: not trusted enough to replace human review, but useful enough to reduce the volume of obvious failures that humans must inspect.

The paper also reports that the self-improvement process can outperform baselines that rely on multiple external reward models in some evaluated settings. That does not mean external reward models are obsolete. It means they are not always the only path to improvement. For businesses, that distinction matters because the cost structure changes. A workflow that can extract useful training signal from its own failed generations may reduce dependence on large-scale labeling or third-party evaluators.

But there is a trap. The internal critic can only help when it is right often enough. If the critic has agreement bias, rubber-stamps weak outputs, or shares the same blind spot as the generator, self-improvement becomes self-congratulation. We already have enough of that in corporate strategy decks.
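One way to check whether the critic is right often enough is to audit it against a small human-labeled sample before trusting it at scale. The sketch below is an assumption about how such an audit could look, not a procedure from the paper. The row format and the metric names are made up for the example. A high false-accept rate is the rubber-stamping pattern described above.

```python
def audit_critic(rows):
    """Compare critic verdicts with human verdicts.

    Each row is (critic_says_aligned, human_says_aligned).
    """
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one audited sample")
    tp = sum(1 for critic, human in rows if critic and human)
    fp = sum(1 for critic, human in rows if critic and not human)
    tn = sum(1 for critic, human in rows if not critic and not human)
    fn = sum(1 for critic, human in rows if not critic and human)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": (tp + tn) / len(rows),
        # Share of truly misaligned outputs the critic waved through.
        "false_accept_rate": ratio(fp, fp + tn),
        # Share of truly aligned outputs the critic rejected.
        "false_reject_rate": ratio(fn, fn + tp),
    }


# Made-up audit sample of ten reviewed generations.
audit_rows = (
    [(True, True)] * 4
    + [(False, True)] * 1
    + [(True, False)] * 2
    + [(False, False)] * 3
)
report = audit_critic(audit_rows)
print(report)
```

What counts as an acceptable false-accept rate depends on the workflow. A threshold is a policy decision, not something this sketch can supply.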

Co-improvement is the expensive part of the argument

The paper’s most interesting claim is not simply that generation improves. That part is plausible: select better outputs, fine-tune on them, get better outputs. The more costly claim is co-improvement: training targeted at generation also improves understanding.

The authors argue that generation and understanding share representations and learning dynamics. In their theoretical analysis, a shared empirical neural tangent kernel helps align updates across both branches. Less formally, when the model learns from prompt-aligned generated samples, the same underlying representations that support generation also help the understanding branch become better at detecting mismatch. The model becomes not only a better producer, but a better inspector.

This is the part that deserves more attention because it separates one-off cleanup from compounding improvement. If generation improves but the critic remains static, each round of self-improvement eventually hits the critic’s ceiling. If the critic also improves, later rounds can recover samples that were previously unusable, detect subtler failures, and expand the useful training set.

The paper supports this with evidence that self-improved models become better at detecting false positives: cases previously judged as prompt-aligned but actually misaligned. It also reports improvements on understanding benchmarks, though these are much smaller than the generation-side gains: up to about 3% in the cited understanding evaluations.

That difference in magnitude is important. The commercial value does not lie in understanding magically improving a lot. The more disciplined reading is: generation-targeted self-improvement may also sharpen the verifier enough to make the next training round cleaner.

For AI operations, that is a familiar pattern. A better error detector improves the training data. Better training data improves the system. A better system creates harder errors. Harder errors improve the detector, if the process is designed well. Quality control becomes iterative rather than episodic.

Curriculum replay turns rejected prompts into later training material

The paper extends the method with curriculum learning. The basic idea is that some prompts are not useful early in training because the model cannot generate good outputs or cannot judge them reliably. After some improvement, those previously discarded prompts can be revisited.

This is operationally elegant. Many AI teams already throw away difficult examples because they are messy. The paper suggests that “discarded” should sometimes mean “not yet.” Once generation and understanding improve together, the model can regenerate and rescore examples that were previously beyond its useful range. The authors report that curriculum learning expands post-training data, with some settings increasing sample size by up to about 50%.

The point is not merely more data. More bad data is an old hobby of machine learning. The point is staged data admission. The model first trains on examples it can use reliably, then returns to harder prompts when its own generator and critic have improved.
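Staged admission can be sketched as a backlog that is rescored after each training round. Everything here is illustrative: the helper `split_by_usability`, the threshold, and the per-round scores are assumptions for the example, and the fine-tuning step between rounds is omitted.

```python
def split_by_usability(prompts, best_score, threshold=0.7):
    """Split prompts into those usable now and those deferred for a later round.

    best_score maps a prompt to the best internal alignment score among its
    current generations (a hypothetical callable for this sketch).
    """
    usable, deferred = [], []
    for prompt in prompts:
        if best_score(prompt) >= threshold:
            usable.append(prompt)
        else:
            deferred.append(prompt)
    return usable, deferred


# Made-up scores. Round 2 stands for rescoring after the model has improved.
round1_scores = {
    "red mug on a desk": 0.9,
    "three cats left of a dog": 0.4,
    "glass cube inside a metal ring": 0.3,
}
round2_scores = {
    "three cats left of a dog": 0.8,
    "glass cube inside a metal ring": 0.5,
}

usable_1, backlog = split_by_usability(round1_scores, round1_scores.get)
# ... fine-tune on usable_1 here (not shown) ...
usable_2, backlog = split_by_usability(backlog, round2_scores.get)

print("round 1 trains on:", usable_1)
print("round 2 recovers:", usable_2)
print("still deferred:", backlog)
```

The backlog is never discarded in this sketch. A prompt leaves it only when its regenerated output clears the threshold.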

This has a direct business analogy. In process automation, mature teams do not automate the hardest exception cases on day one. They automate stable cases, collect failures, improve the classifier or workflow, and then absorb more complex cases. The paper’s curriculum version applies a similar logic to multimodal post-training.

The lesson is not “let the model learn forever.” The lesson is “do not treat early rejection as permanent waste.” In synthetic data systems, visual QA workflows, marketing asset generation, and product-content automation, rejected examples can become a structured backlog for later rounds of improvement.

The business value is cheaper diagnosis, not autonomous quality assurance

For companies, the paper is useful because it reframes self-improvement as an operations problem rather than a research fantasy.

The direct finding is that internal model disagreement can identify useful training signal. The business inference is that multimodal systems should log and exploit contradiction events: cases where generated content fails the model’s own understanding check. The uncertain part is whether this signal remains reliable under domain-specific constraints, especially when brand rules, legal requirements, or user safety are involved.

| Layer | What the paper directly shows | Cognaptus inference for business use | What remains uncertain |
| --- | --- | --- | --- |
| Diagnosis | Internal non-unification can be measured | Log contradiction cases as a quality-control asset, not just error noise | Whether the metric maps cleanly to business-specific quality standards |
| Training data | Internally scored generations can support post-training | Use self-scored samples to reduce human labeling load | How much human audit is needed to prevent drift |
| Workflow design | Curriculum replay improves use of difficult prompts | Maintain a “revisit later” pool for rejected multimodal tasks | When replay begins to reinforce model blind spots |
| Evaluation | Understanding can improve alongside generation | Treat verifier quality as a tracked KPI, not a fixed utility | Whether co-improvement holds outside tested architectures |
| ROI | Improvement can occur without external reward signals | Lower marginal cost for iterative multimodal improvement | Full cost depends on compute, review policy, and failure tolerance |

This is where the commercial story should stay sober. The paper does not eliminate human oversight. It gives teams a way to spend human oversight more intelligently.

A practical deployment pattern would look like this:

  1. Generate multimodal outputs from operational prompts.
  2. Run internal understanding checks against explicit prompt requirements.
  3. Separate outputs into accepted, rejected, and uncertain pools.
  4. Human-review only a sampled or risk-weighted subset, especially from uncertain and high-impact categories.
  5. Use accepted and rejected examples for fine-tuning or preference learning.
  6. Track whether the internal gap shrinks without external quality metrics deteriorating.

The last step is essential. A shrinking internal gap is not automatically good. A model can become more internally consistent and still consistently wrong. The nightmare is not contradiction; it is confident agreement around a bad standard.
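Steps 2 to 4 of the pattern can be sketched as a triage function plus a review queue. This is one possible shape, not a prescribed design. The `Checked` record, the two thresholds, and the spot-check rate are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Checked:
    item_id: str
    score: float       # hypothetical internal alignment score in [0, 1]
    high_impact: bool  # e.g. customer-facing or regulated content


def triage(items, accept_at=0.8, reject_at=0.3):
    """Sort checked outputs into accepted, rejected, and uncertain pools."""
    pools = {"accepted": [], "rejected": [], "uncertain": []}
    for item in items:
        if item.score >= accept_at:
            pools["accepted"].append(item)
        elif item.score <= reject_at:
            pools["rejected"].append(item)
        else:
            pools["uncertain"].append(item)
    return pools


def review_queue(pools, rng, spot_check_rate=0.1):
    """Send humans every uncertain item, every high-impact item,
    and a random spot check of the remaining confident verdicts."""
    queue = list(pools["uncertain"])
    for item in pools["accepted"] + pools["rejected"]:
        if item.high_impact or rng.random() < spot_check_rate:
            queue.append(item)
    return queue


# Made-up items for illustration.
items = [
    Checked("a", 0.95, False),
    Checked("b", 0.85, True),
    Checked("c", 0.55, False),
    Checked("d", 0.10, False),
    Checked("e", 0.20, True),
]
pools = triage(items)
queue = review_queue(pools, random.Random(0))
print({name: [i.item_id for i in pool] for name, pool in pools.items()})
print("for human review:", [i.item_id for i in queue])
```

Spot-checking confident verdicts is what lets the team notice when the critic itself drifts. Reviewing only the uncertain pool would hide that.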

The method applies best where requirements are visible and checkable

The paper’s setting is strongest when the desired output can be checked against prompt-visible attributes: color, shape, spatial relationship, object presence, texture, and compositional constraints. That is already commercially relevant. Product images, ad creatives, visual instructions, packaging mockups, architectural sketches, UI screenshots, and synthetic training images often contain requirements that can be partially verified.

The method is weaker when the success criterion is subjective or external. “Make it premium,” “fit our brand,” “avoid cultural awkwardness,” “look trustworthy,” and “appeal to mid-market procurement managers” are not impossible to evaluate, but they require standards outside the model’s internal prompt-image alignment. Internal contradiction can catch the blue mug. It cannot fully certify brand taste. Civilization remains difficult.

There is also a governance boundary. If self-improvement loops are run without external audits, the model may overfit to its own scoring habits. The broader MLLM self-improvement literature explicitly treats data collection, data organization, and optimization as separate design problems, not as one magical loop.[5] That separation is useful for enterprise practice. A self-improvement pipeline needs data policy, review thresholds, logging, rollback, and independent evaluation, not just a training script with a motivational name.

The paper’s appendix-level details also matter for cost interpretation. The authors report experiments using four 80GB NVIDIA A800 GPUs and training runs of several hours for their self-improvement setup. That is not outrageous by frontier-model standards, but it is not free. For most companies, the immediate opportunity is not training foundation-scale multimodal systems from scratch. It is building smaller domain loops around existing models: collecting contradiction cases, building evaluator dashboards, and fine-tuning where the economics justify it.

Robustness checks are support beams, not a second thesis

The paper includes additional evaluations, ablations, DPO-based results, curriculum timing analysis, and component-update experiments. These should be read as robustness support for the main thesis, not as separate claims to be inflated into new narratives.

One practical example is the component ablation. The authors report that updating only the LLM component can be sufficient to improve both performance and unification in their setting, while updating other components such as the vision tower and projectors yields no significant additional gain. This is commercially interesting because it suggests cheaper adaptation paths may exist. But the boundary is obvious: “sufficient here” is not “always sufficient.” Model architecture and modality interface design still matter.

The curriculum timing ablation is similar. Introducing curriculum replay later, after the model has improved, works better than introducing it too early in their experiments. The operational lesson is intuitive: revisit harder cases when the system is ready to extract value from them. Do not feed a weak model hard examples and call the resulting mess “ambitious.”

These details reinforce the main contribution: the paper is not merely proposing self-training. It is proposing a disciplined way to identify when self-training data is useful, why the useful data emerges from internal contradiction, and how staged replay can expand the data pool.

What Cognaptus would watch before using this in production

The business opportunity is real, but it is not evenly distributed. Before applying this method to production workflows, three questions matter.

First, is the model’s understanding branch stronger than its generation branch for the relevant task? The paper’s mechanism depends on that asymmetry. If the internal evaluator is weak, biased, or too agreeable, self-improvement may amplify errors. Agreement bias in multimodal evaluation is already a documented problem, especially when models are asked to validate agent behavior rather than inspect concrete visual attributes.[4]

Second, are the requirements checkable? Internal contradiction is most useful when success can be decomposed into verifiable attributes. This makes the method attractive for structured creative production and synthetic data generation. It is less reliable for open-ended taste, strategy, persuasion, or compliance judgments.

Third, is there an external audit loop? Internal consistency should be tracked alongside independent evaluation. A company should measure not only whether non-unification falls, but whether downstream human acceptance, task success, safety review, or customer-facing quality improves. Otherwise, the system may become beautifully consistent inside its own little aquarium.
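The third question lends itself to a simple monitoring check: flag any training round where the internal gap shrinks while an independent metric gets worse. The sketch below assumes one external metric, a human acceptance rate. The `RoundMetrics` record, the tolerance, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundMetrics:
    round_id: int
    non_unification: float   # internal: share of own outputs the model rejects
    human_acceptance: float  # external: share of sampled outputs humans accept


def flag_suspicious_rounds(history, tolerance=0.01):
    """Return round ids where internal consistency improved
    but the external metric dropped by more than the tolerance."""
    flagged = []
    for previous, current in zip(history, history[1:]):
        gap_shrank = current.non_unification < previous.non_unification
        external_dropped = current.human_acceptance < previous.human_acceptance - tolerance
        if gap_shrank and external_dropped:
            flagged.append(current.round_id)
    return flagged


# Made-up history: round 2 looks better internally but worse to human reviewers.
history = [
    RoundMetrics(0, 0.55, 0.60),
    RoundMetrics(1, 0.45, 0.66),
    RoundMetrics(2, 0.35, 0.61),
    RoundMetrics(3, 0.30, 0.605),
]
flagged = flag_suspicious_rounds(history)
print("rounds to investigate:", flagged)
```

A flagged round is a prompt for investigation and possibly rollback, not proof of failure. The external metric has noise of its own, which is why the check uses a tolerance.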

This is the difference between a research insight and a deployment discipline. The paper gives a promising mechanism. The enterprise version needs measurement infrastructure.

Self-contradiction is a signal, not a soul

The most valuable part of this paper is its refusal to treat model failure as a single category. A multimodal model that generates the wrong thing and then recognizes the mismatch is not simply broken. It is uneven. Uneven systems can sometimes be improved by letting the stronger part supervise the weaker part.

That is a useful idea for the next phase of multimodal AI. As models become more unified in architecture, the operational question will shift from “Can one model do many things?” to “Are its capabilities actually synchronized?” A model that can generate, inspect, explain, and revise is only as useful as the alignment among those roles.

For businesses, the message is practical. Do not buy the word “unified” too cheaply. Test whether generation and understanding agree. Log the disagreements. Use them to build training data. Add human review where the internal judge is likely to be weak. Track external quality, not just internal harmony.

Self-contradiction will not make AI wise. But it may make AI systems easier to debug, cheaper to improve, and less dependent on external supervision for every incremental correction.

That is a decent upgrade path. Not enlightenment. Just better plumbing. In enterprise AI, better plumbing is usually where the money is.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, and Difan Zou, “Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs,” arXiv:2507.16663, 2025, https://arxiv.org/abs/2507.16663

  2. Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, and Tianyi Zhou, “Quantifying the Gap between Understanding and Generation within Unified Multimodal Models,” arXiv:2602.02140, 2026, https://arxiv.org/abs/2602.02140

  3. Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li, “LLaVA-Critic: Learning to Evaluate Multimodal Models,” arXiv:2410.02712, 2024, https://arxiv.org/abs/2410.02712

  4. Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, and Zsolt Kira, “Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification,” arXiv:2507.11662, 2025, https://arxiv.org/abs/2507.11662

  5. Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, and Yapeng Tian, “Self-Improvement in Multimodal Large Language Models: A Survey,” arXiv:2510.02665, 2025, https://arxiv.org/abs/2510.02665