When Agents Start Thinking Twice: Teaching Multimodal AI to Doubt Itself

A model that fails its own eye test

Mirror.

That is where the problem becomes easy to see. Ask a multimodal model to generate an image of a plush lion toy in front of a mirror. The model may produce something plausible at first glance: lion, mirror, warm lighting, adorable synthetic confidence. Then ask the same model, through its understanding branch, whether the image makes physical sense. Suddenly it notices the issue: if the toy faces the camera, the mirror should mostly show its back, not another front-facing lion.

This is not a small visual glitch. It is a governance problem wearing a cute costume.

The paper behind this article, “Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs,” studies a failure pattern in unified multimodal large language models: the generation side can produce images that the understanding side later judges as misaligned with the original prompt.¹ In plain terms, the model can create something it cannot honestly defend.

That is the useful part.

The tempting misconception is that self-contradiction is merely a bug to be hidden by better prompting, stronger guardrails, or a stern system message written in the voice of a compliance officer. The paper points in a more interesting direction: contradiction can be measured, sorted, and recycled into training data. The model’s stronger faculty—visual understanding—can supervise its weaker faculty—image generation—without immediately requiring an external reward model or a fresh human-labeling campaign.

This is not “AI becomes self-aware.” Calm down. It is closer to a factory inspection loop: one internal component produces; another internal component checks; the disagreement becomes a signal for process improvement. Less science fiction, more quality control. Usually how serious things begin.

Generation and understanding are not automatically unified

Unified multimodal models promise a tidy story. One model handles visual understanding and visual generation. It can answer questions about images, generate images from prompts, and move between text and pixels as if these were merely different dialects of the same intelligence.

The story is attractive. The architecture diagrams are usually attractive too. Reality, being less cooperative, adds a gap.

In many current MLLMs, understanding is stronger than generation. The model may be able to inspect an image and judge whether it matches a prompt, but still fail to generate such an image reliably. This asymmetry is already visible in systems such as Show-o and Janus-Pro, which are designed to unify multimodal understanding and generation under a shared or coordinated model framework.²³ The ambition is real. The imbalance is also real.

The paper formalizes this imbalance using a metric called the non-unification score. It asks a simple question:

When the model generates an image from a prompt, how often does its own understanding branch reject that image as not matching the prompt?

That score matters because it avoids a common measurement trap. If an external judge says the generated image is wrong, we learn something about output quality. If the model’s own understanding branch says the generated image is wrong, we learn something about internal inconsistency.

The difference is subtle but operationally important. External evaluation tells us whether the model disappoints outsiders. Internal non-unification tells us whether the model disappoints itself. For an AI agent expected to plan, act, inspect, and revise, the second failure is especially expensive.
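
The question above reduces to a rejection rate. Here is a minimal sketch, assuming two hypothetical callables, `generate` and `understands_match`, that stand in for the model's generation and understanding branches. Neither name comes from the paper, and the paper's exact scoring protocol may differ.

```python
from typing import Callable, Iterable


def non_unification_score(
    prompts: Iterable[str],
    generate: Callable[[str], object],
    understands_match: Callable[[str, object], bool],
) -> float:
    """Fraction of generations that the model's own understanding branch rejects."""
    total = 0
    rejected = 0
    for prompt in prompts:
        image = generate(prompt)  # generation branch (hypothetical stand-in)
        total += 1
        if not understands_match(prompt, image):  # understanding branch verdict
            rejected += 1
    if total == 0:
        raise ValueError("need at least one prompt")
    return rejected / total


# Toy stand-ins: the "image" is just a string so the sketch runs without a model.
def toy_generate(prompt: str) -> str:
    return prompt.replace("mirror", "window")  # fails on mirror prompts


def toy_understands_match(prompt: str, image: str) -> bool:
    return image == prompt


score = non_unification_score(
    ["a cat", "a lion by a mirror"], toy_generate, toy_understands_match
)
print(score)
```

In a real setup, the two callables would wrap the same unified model, which is what makes the score a measure of internal inconsistency, not of output quality as seen from outside.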

The evidence says the gap is widespread, not cosmetic

The paper evaluates six unified MLLMs across tasks with different difficulty levels. The authors do not treat “generate a cat” and “generate a physically plausible mirror reflection” as equivalent tests, which is sensible. Easy prompts can hide the gap because both generation and understanding succeed. Very hard prompts can exaggerate the gap because both branches may struggle. So the paper stratifies tasks by difficulty.

The result: non-unification is pervasive. In one reported case, VILA-U reaches a non-unification score of 58.47%, meaning that nearly six out of ten generations are rejected by the model’s own understanding branch.

That number should not be read as a universal failure rate for all multimodal systems. It is benchmark- and model-specific. But it is large enough to make the main point difficult to ignore: the gap is not a rounding error.

The next question is more important. When the model’s understanding branch rejects its own generation, is the understanding branch being too harsh, or is the generation branch actually weak?

The paper checks this using a stronger external judge, Qwen2.5-VL-72B-Instruct, and reports that most misalignments are attributable to weak generation rather than misunderstanding. Across the analyzed settings, weak generation accounts for a large share of the rejected cases, with the paper reporting shares in the 60–100% range.
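
One way to picture that check is a simple cross-tabulation of internally rejected cases against the external judge's verdict. The sketch below is illustrative only: the `RejectedCase` record and the two labels are assumptions made for this example, not the paper's exact attribution protocol.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectedCase:
    """A generation that the model's own understanding branch rejected."""
    prompt: str
    external_says_aligned: bool  # verdict from the stronger external judge


def attribute_rejections(cases):
    """Share of internal rejections attributed to each possible culprit."""
    counts = Counter()
    for case in cases:
        if case.external_says_aligned:
            # The external judge accepts the image: the internal rejection looks too harsh.
            counts["misunderstanding"] += 1
        else:
            # The external judge also rejects the image: the generation itself is weak.
            counts["weak_generation"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("weak_generation", "misunderstanding")}


# Invented toy cases, not data from the paper.
toy_cases = [
    RejectedCase("a lion by a mirror", external_says_aligned=False),
    RejectedCase("two red cubes", external_says_aligned=False),
    RejectedCase("a dog left of a cat", external_says_aligned=False),
    RejectedCase("a blue bowl", external_says_aligned=True),
]
shares = attribute_rejections(toy_cases)
print(shares)
```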

That interpretation is the hinge of the whole article.

If understanding were unreliable, self-critique would merely amplify confusion. One confused part of the model would lecture another confused part. A familiar workplace pattern, but not a training strategy. If understanding is stronger, however, then internal critique becomes useful supervision.

Self-contradiction becomes data when it is structured

The mechanism is straightforward enough to be dangerous, which is where many good machine learning ideas live.

For a given prompt, the model generates multiple candidate images. Its understanding branch then scores whether each image matches the prompt. Images judged aligned become positive examples for supervised fine-tuning. Aligned and misaligned pairs can also become preference data for Direct Preference Optimization, a training method that optimizes preferences without requiring the full reinforcement-learning machinery of classic RLHF.⁴

The loop can be summarized as follows:

| Stage | What happens | Why it matters |
| --- | --- | --- |
| Generate | The model produces candidate images from a text prompt | Creates multiple attempts, not a single brittle output |
| Inspect | The model’s understanding branch checks prompt-image alignment | Uses the stronger branch as an internal evaluator |
| Select | Aligned images become training examples; misaligned images become rejected samples | Converts contradiction into structured post-training data |
| Improve | SFT or DPO updates the model | Strengthens generation while also reducing internal inconsistency |
| Revisit | Hard prompts can be retried later through curriculum learning | Expands useful training data as both branches improve |
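
The Generate, Inspect, and Select stages can be sketched as a small data-construction routine. This is a hedged illustration, not the authors' pipeline: `generate` and `judge` are hypothetical stand-ins for the two branches, `n_candidates` is an arbitrary choice, and pairing every accepted image with every rejected one is an assumption about how preference pairs might be formed.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class SelfImprovementData:
    sft: list = field(default_factory=list)      # (prompt, accepted_image)
    dpo: list = field(default_factory=list)      # (prompt, chosen_image, rejected_image)
    discard: list = field(default_factory=list)  # prompts with no accepted candidate


def build_data(prompts, generate, judge, n_candidates=4):
    """Turn internal accept/reject verdicts into SFT and preference data."""
    data = SelfImprovementData()
    for prompt in prompts:
        candidates = [generate(prompt, seed) for seed in range(n_candidates)]
        accepted, rejected = [], []
        for image in candidates:
            (accepted if judge(prompt, image) else rejected).append(image)
        if not accepted:
            # Nothing usable yet: park the prompt for a later curriculum pass.
            data.discard.append(prompt)
            continue
        data.sft.extend((prompt, image) for image in accepted)
        data.dpo.extend((prompt, good, bad) for good, bad in product(accepted, rejected))
    return data


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt, seed):
    return (prompt, seed)


def toy_judge(prompt, image):
    # Accept even seeds, but never accept anything for mirror prompts.
    return "mirror" not in prompt and image[1] % 2 == 0


data = build_data(["a red cube", "a lion by a mirror"], toy_generate, toy_judge)
print(len(data.sft), len(data.dpo), data.discard)
```

The `discard` list matters later: it is the pool that a curriculum pass can revisit once the model improves.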

This is not just “ask the model to reflect.” The distinction matters.

Many self-correction systems operate at inference time: generate an answer, ask the model to reconsider, hope the second answer is better, bill the user for both. The evidence on vision-language self-correction is mixed; prior work has found that VLMs often struggle to self-correct reliably without fine-tuning or external feedback, even though self-generated correction data can become useful once organized into preference training.⁵

This paper’s contribution is closer to post-training infrastructure. It does not merely ask the model to think twice at runtime. It uses the model’s internal disagreement to build training data that changes future behavior.

That difference is the business-relevant part. Runtime reflection is a cost. Structured self-improvement can become an asset.

The main result is generation improvement, with unification as the audit trail

The paper validates the method mainly on Janus-Pro-7B and Show-o. It uses standard post-training approaches such as SFT and DPO, with T2I-CompBench++ as a key benchmark. T2I-CompBench is designed around compositional text-to-image generation, with prompts covering attribute binding, object relationships, and complex compositions—the kinds of prompts where “looks vaguely correct” is not enough.⁶

The results are not framed as magic. Good. Magic is usually a lack of instrumentation.

The paper reports generation gains of up to about 20 percentage points on T2I-CompBench++ and reductions in the internal gap by as much as 16 points. It also reports that improvements in generation correlate with improvements in unification, with a correlation coefficient of $\rho = 0.53$ between generation gains and the internal gap.

That correlation is not a proof that every model with a gap will improve automatically. It does suggest a useful diagnostic principle: the larger the measurable internal gap, the more room there may be for internal self-improvement.
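
Teams that want to run the same diagnostic on their own models can compute a correlation between per-task gap measurements and per-task generation gains. The sketch below uses the Pearson coefficient as an assumption, since this article does not state which correlation measure the paper uses, and the numbers are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-task values, NOT results from the paper.
internal_gap = [1.0, 4.0, 2.5, 6.0, 3.0]
generation_gain = [2.0, 5.0, 6.0, 7.5, 1.5]
rho = pearson(internal_gap, generation_gain)
print(round(rho, 2))
```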

In business terms, this changes how teams should read failure logs. A failed generation is not only a bad output. If the model can detect the failure internally, it is also a candidate training signal.

The paper also reports a co-improvement effect: generation-targeted self-improvement can improve understanding. For Janus-Pro, the self-improved model achieves win rates above 50% on five of six subtasks when judging prompt-image alignment against the pretrained version. On standard understanding benchmarks such as MMB, SEED, GQA, and MMMU, the self-improved Janus-Pro variants show modest but consistent improvements, with gains reaching roughly two to three points depending on the benchmark and training variant.

This is the part that deserves careful interpretation.

The result does not mean “train generation and understanding magically improves everywhere.” It means that in unified architectures, where the two capabilities share representations or interact through related parameters, improving one branch can reshape the other. The paper explains this through shared empirical neural tangent kernel (eNTK) dynamics: generation and understanding do not evolve as completely isolated modules. When their learning dynamics align, correcting generation can also sharpen understanding.

That is a mechanism claim, not just an empirical decoration. It gives the result a plausible technical spine.

The curriculum result is about recovering discarded value

One of the paper’s cleaner ideas appears after the main improvement loop.

At the start of self-improvement, some prompts are unusable. The model cannot generate any candidate image that its understanding branch accepts. These prompts enter a discard pool. A naive pipeline would leave them there, like unread compliance PDFs in a shared drive.

But if generation and understanding improve together, some previously unusable prompts may become useful later. The model can regenerate candidates, rescore them, and add newly acceptable samples into the training set. This becomes a curriculum: start with prompts the model can handle, improve, then revisit harder prompts.

The paper reports that co-improvement adds 1,091 samples from the discard pool when both generation and understanding are self-improved. Improving only one branch adds roughly 600 samples: 603 when generation is self-improved against the original understanding branch, and 649 when the original generation branch is judged by self-improved understanding.
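
The replay step can be sketched as a second pass over the discard pool using the improved branches. As before, `generate` and `judge` are hypothetical stand-ins, and the toy behavior below is invented so the example runs on its own.

```python
def replay_discard_pool(discard_pool, generate, judge, n_candidates=4):
    """Retry previously unusable prompts with the current (improved) branches."""
    recovered = []        # (prompt, accepted_image) pairs that are now usable
    still_discarded = []  # prompts that remain too hard for now
    for prompt in discard_pool:
        candidates = [generate(prompt, seed) for seed in range(n_candidates)]
        accepted = [image for image in candidates if judge(prompt, image)]
        if accepted:
            recovered.extend((prompt, image) for image in accepted)
        else:
            still_discarded.append(prompt)
    return recovered, still_discarded


# Toy stand-ins: pretend the improved generator now succeeds on mirror prompts
# for even seeds, but still fails on the clock prompt.
def improved_generate(prompt, seed):
    return {"prompt": prompt, "ok": "mirror" in prompt and seed % 2 == 0}


def improved_judge(prompt, image):
    return image["ok"]


recovered, remaining = replay_discard_pool(
    ["a lion by a mirror", "five clocks showing different times"],
    improved_generate,
    improved_judge,
)
print(len(recovered), remaining)
```

Swapping in the original generator or the original judge for one of the two callables is how one would compare single-branch recovery against co-improvement on a given model.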

That is not just a larger number. It tells us what curriculum learning is doing here. It is not a decorative training schedule. It is a recovery system for data that was initially too hard to use.

For operational AI teams, this is a useful pattern. Many automation projects discard difficult cases early: unclear invoices, messy product photos, ambiguous floor plans, low-quality medical scans, inconsistent design briefs. Some of those cases are genuinely unusable. Others are only unusable at the current capability level. A curriculum loop gives the system a reason to revisit them later instead of treating the first failure as final judgment.

What the paper shows, what Cognaptus infers, and what remains open

The practical reading should stay disciplined. The paper directly shows a measurable internal generation-understanding gap in tested unified MLLMs, proposes a self-improvement method based on internal scoring, and demonstrates improvements on selected generation, unification, and understanding benchmarks.

Cognaptus infers a broader product lesson: internal disagreement can become part of model operations. Not every failure needs to be shipped to a human labeler first. Some failures can be triaged internally, ranked by confidence, and turned into post-training candidates. That is not a replacement for external evaluation. It is a way to reduce the amount of external evaluation wasted on cases the system can already diagnose.

The remaining uncertainties are also clear.

| Question | What the paper supports | What it does not yet prove |
| --- | --- | --- |
| Can internal contradiction measure multimodal model weakness? | Yes, for the tested unified MLLMs and benchmarks | Not a universal metric for every architecture or modality |
| Can understanding guide generation without external signals? | Yes, through SFT/DPO data constructed from internal scoring | Not a guarantee that internal scoring is reliable in high-stakes settings |
| Does generation improvement also improve understanding? | Evidence supports co-improvement in the tested models | The deeper reason for shared eNTK behavior remains open |
| Does curriculum replay matter? | It recovers more previously discarded samples than single-branch improvement | It may depend on model architecture, prompt distribution, and scoring quality |
| Is this ready for enterprise autonomy? | It offers a useful training and diagnostic pattern | It is not a complete safety, audit, or compliance framework |

This distinction is worth preserving because “self-improving AI” is an easy phrase to overinflate. The paper is not saying models can supervise themselves indefinitely. It is saying that when one internal capability is measurably stronger than another, the stronger capability can help produce training signals for the weaker one.

That is narrower. Also more useful.

The business value is cheaper diagnosis, not mystical autonomy

For businesses, the relevant use case is not a chatbot having philosophical doubts about its pixel soul. The relevant use case is an AI system that can inspect its own outputs before those outputs enter a workflow.

Consider four categories.

First, creative production systems. A marketing team generating product images needs prompt compliance: correct colors, correct objects, correct spatial relationships, correct brand constraints. A model that can internally flag misaligned generations can reduce wasted human review time.

Second, document and image automation. Many workflows involve visual-text alignment: invoices, shipping documents, screenshots, insurance photos, ID verification, warehouse images. If a model extracts or generates a visual interpretation, its own understanding branch can help detect mismatches before downstream automation acts on them.

Third, agentic interfaces. Multimodal agents increasingly operate across screens, websites, dashboards, and generated artifacts. A model that can act but cannot verify what it has done is not autonomous. It is merely enthusiastic. Internal unification metrics can become part of agent evaluation: did the agent’s perception agree with its generated plan or artifact?

Fourth, model operations and retraining. Internal contradiction can be logged as a training-data source. Instead of collecting random failures, teams can collect structured disagreements: prompt, output, internal score, external audit result when available, and eventual correction. That creates a learning flywheel with less noise.
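
A structured disagreement log can be as simple as one record per flagged output. The schema below is a hypothetical sketch: the field names mirror the items listed above but are not drawn from the paper or from any particular logging tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DisagreementRecord:
    prompt: str
    output_ref: str                        # path or ID of the generated artifact
    internal_score: float                  # understanding branch's alignment score
    external_audit: Optional[bool] = None  # filled in only when an audit exists
    correction_ref: Optional[str] = None   # path or ID of the eventual fix, if any

    def to_json_line(self) -> str:
        """Serialize as one JSON line, suitable for an append-only log file."""
        return json.dumps(asdict(self), sort_keys=True)


record = DisagreementRecord(
    prompt="a plush lion toy in front of a mirror",
    output_ref="outputs/0001.png",  # hypothetical path
    internal_score=0.12,
)
line = record.to_json_line()
print(line)
```

Keeping the external audit and correction fields optional lets the same record be written at detection time and completed later, which is what makes it usable as post-training data.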

The ROI pathway is therefore not “replace human reviewers.” That line is overused and usually under-specified. The better pathway is:

- reduce the volume of obvious bad outputs reaching human review;
- prioritize ambiguous cases where internal and external judgments disagree;
- convert repeated internal contradictions into post-training data;
- track whether non-unification falls after retraining;
- use the metric as an operational health signal for multimodal agents.

That is less dramatic than “autonomous AI.” It is also how systems become dependable enough to matter.

The boundary: internal judges are still judges, not truth machines

The paper’s limits matter because they affect deployment.

The self-improvement experiments focus mainly on Janus-Pro and Show-o. That is enough to make the mechanism credible, but not enough to generalize casually across all unified MLLMs, let alone proprietary multimodal agent stacks with tool use, memory, retrieval, and UI control layered on top.

The external-checking setup also has boundaries. The paper uses Qwen2.5-VL-72B-Instruct as a stronger judge for some evaluations and reports that human evaluation is broadly aligned with Qwen-based evaluation. But the discrepancy becomes larger on harder tasks. That matters because hard tasks are exactly where businesses care most about robust judgment. Easy cases are cheap. Edge cases send invoices to legal.

Finally, the theory explains co-improvement through shared learning dynamics, but the paper itself notes that why such NTK sharing arises in unified MLLMs remains an open question. For practitioners, that means the co-improvement effect should be treated as an empirical property to test per model, not an entitlement.

In deployment terms: internal self-critique should be logged, calibrated, and periodically checked against external evaluation. The internal judge is useful. It is not a court of final appeal.
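
A periodic calibration check can start with two numbers: raw agreement between the internal judge and an external evaluator, and Cohen's kappa, which discounts agreement expected by chance. This is a minimal sketch on boolean accept/reject verdicts. The choice of kappa is an illustration made for this article, not something the paper prescribes.

```python
def agreement_stats(internal, external):
    """Return (observed agreement, Cohen's kappa) for paired boolean verdicts."""
    if len(internal) != len(external) or not internal:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(internal)
    observed = sum(a == b for a, b in zip(internal, external)) / n
    p_internal = sum(internal) / n  # share of "aligned" verdicts from the internal judge
    p_external = sum(external) / n  # share of "aligned" verdicts from the external judge
    expected = p_internal * p_external + (1 - p_internal) * (1 - p_external)
    if expected == 1.0:
        # Both judges gave one identical constant verdict: kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Invented verdicts for illustration only.
internal_verdicts = [True, True, False, False]
external_verdicts = [True, False, False, False]
observed, kappa = agreement_stats(internal_verdicts, external_verdicts)
print(observed, kappa)
```

Tracking these two values per task difficulty tier is one way to notice when the internal judge drifts on exactly the hard cases where external checks diverge most.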

Teaching models to doubt is really teaching teams to measure disagreement

The best part of this paper is not that it makes multimodal models more “thoughtful.” Models do not need a personality arc. They need measurable failure modes.

The paper turns a vague complaint—generation and understanding do not align—into a practical loop: measure internal contradiction, identify whether the generation branch is usually at fault, use understanding to select training data, improve generation, check whether unification improves, and revisit harder samples through curriculum learning.

That is a clean idea. Not easy, but clean.

For AI agents, especially multimodal agents that must perceive, generate, and act, confidence is cheap. Internal consistency is more expensive. The next stage of useful automation will not come from models that merely produce more fluent outputs. It will come from systems that can inspect their own work, expose their own disagreement, and turn that disagreement into improvement.

In other words, the model does not become trustworthy because it never contradicts itself. It becomes more useful when contradiction is no longer swept under the interface.

Doubt, properly instrumented, is not weakness. It is maintenance.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, and Difan Zou, “Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs,” arXiv:2507.16663, 2025. https://arxiv.org/abs/2507.16663
² Jinheng Xie et al., “Show-o: One Single Transformer to Unify Multimodal Understanding and Generation,” arXiv:2408.12528, 2024. https://arxiv.org/abs/2408.12528
³ Xiaokang Chen et al., “Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling,” arXiv:2501.17811, 2025. https://arxiv.org/abs/2501.17811
⁴ Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” arXiv:2305.18290, 2023. https://arxiv.org/abs/2305.18290
⁵ Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, and Heng Ji, “Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks,” arXiv:2410.04055, 2024. https://arxiv.org/abs/2410.04055
⁶ Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, “T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation,” arXiv:2307.06350, 2023. https://arxiv.org/abs/2307.06350
