Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

TL;DR for operators

Image generators fail in a familiar way: the output looks polished, but the prompt was quietly ignored. A product photo misses the specified texture. A campaign image reverses a spatial relation. A science illustration draws the visually plausible version, not the physically correct one. Everyone then discovers, with appropriate corporate surprise, that “high quality” and “correct” are not synonyms.

The paper behind this article studies a sharper version of that problem in unified multimodal large language models: models designed to both generate images and understand images. Its core finding is simple and slightly awkward. A model can share architecture across generation and understanding, yet still fail its own mirror test. It can generate an image from a prompt, then have its own understanding branch judge that image as not matching the prompt.¹

The authors turn that embarrassment into a training signal. They introduce a non-unification score: the share of generated images that the model’s own understanding branch rejects as prompt-misaligned. Across six MLLMs and nine subtasks, the gap is widespread, reaching as high as 58.47% in the reported evaluations. The follow-up diagnosis matters: the disagreement usually does not arise because the understanding branch is confused. External checks with Qwen2.5-VL-72B-Instruct and human validation suggest that most rejected images are genuinely weak generations.

The mechanism is the article. The model generates multiple candidate images, its own understanding branch scores them, and those internal judgments become post-training data. The chosen images feed supervised fine-tuning (SFT); chosen-versus-rejected pairs feed direct preference optimisation (DPO). On Janus-Pro-7B and Show-o, this internal loop improves generation and reduces non-unification without external reward models. Reported gains are up to 20% on generation and up to 16% on unification.

The interesting twist is co-improvement. Training is aimed at the generator, but understanding also improves. The paper argues that this happens because generation and understanding share learning dynamics through the common model backbone. In practical terms: when the generator learns not to produce a certain class of wrong image, the verifier can also become better at rejecting that wrong image. The model becomes less gullible about its own mistakes. Charming, in a low-bar-for-machine-self-awareness sort of way.

For operators, the business lesson is not “models can train themselves now, please delete the labelling budget.” The lesson is narrower and more useful: internal disagreement can be used as a cheap diagnostic and a candidate-data filter before buying external reward models, human annotation, or bespoke evaluation pipelines. The boundary is equally important. The experiments focus mainly on Janus-Pro-7B and Show-o, image-generation benchmarks, and internal/external checks. This is promising alignment plumbing, not a universal law of self-correction.

A unified model can still fail its own mirror test

The reader misconception to remove first is architectural. “Unified” does not mean “unified in behaviour.” A model may use shared components for text, images, generation, and understanding, yet still behave like two colleagues who attended the same meeting and left with different action items.

In this paper, the authors examine unified MLLMs that are meant to perform both image generation and image understanding. A truly unified model should pass a basic internal consistency test. Given a prompt, it should generate an image; when asked whether the image matches the prompt, its understanding branch should agree that it does.

The paper shows that this often does not happen. This is not merely a benchmark leaderboard issue. It is an operational issue. If a model’s own visual verifier rejects its own output, then downstream users have a problem that is more subtle than low image quality. They have a reliability gap between production and inspection.

The authors call this gap non-unification. Their metric is intentionally internal:

$$ \text{Non-unification} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{U(p_i, G(p_i)) = 0\} $$

Here, $G(p_i)$ is the image generated from prompt $p_i$, and $U(p_i, G(p_i))$ is the understanding branch’s binary judgment of whether the image matches the prompt. A judgment of 0 means the branch rejects the image as not fully aligned, so the score is the fraction of the $N$ prompts whose generated image is rejected.

This is not the same as an external accuracy metric. It asks a narrower question: does the model agree with itself? That matters because an internal score can be collected cheaply at scale. It also matters because disagreement reveals where a unified system is only unified on the diagram. Product architecture slides love diagrams. Reality, unhelpfully, ships behaviours.
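The metric is small enough to sketch in a few lines. What follows is a minimal illustration, not the paper's evaluation code: `toy_generate` and `toy_understand` are hypothetical stand-ins for the model's generation and understanding branches, and the "images" are just strings.

```python
from typing import Callable, Sequence


def non_unification(
    prompts: Sequence[str],
    generate: Callable[[str], object],
    understand: Callable[[str, object], int],
) -> float:
    """Fraction of prompts whose generated image the understanding branch rejects (verdict 0)."""
    if not prompts:
        raise ValueError("need at least one prompt")
    rejected = sum(1 for p in prompts if understand(p, generate(p)) == 0)
    return rejected / len(prompts)


# Hypothetical stand-ins: a real setup would call the MLLM's two branches here.
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_understand(prompt: str, image: object) -> int:
    # Pretend the model's spatial prompts are the ones it rejects.
    return 0 if "left of" in prompt else 1


prompts = ["a red cube", "a cat left of a dog", "a glass bottle", "a cup left of a plate"]
score = non_unification(prompts, toy_generate, toy_understand)
print(f"{score:.2%}")  # 50.00%
```

Because both callables belong to the same model, the number measures self-agreement only. It says nothing yet about which branch is at fault.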

The gap is real, but task difficulty can distort it

The paper’s first main evidence block verifies the phenomenon. The authors evaluate six unified MLLMs across nine subtasks drawn from GenEval, T2I-CompBench++, and Science-T2I. They deliberately stratify the subtasks by difficulty.

That design choice is important. Easy prompts can hide the gap because both generation and understanding are easy. “A photo of a cat” is not exactly a PhD defence. Very hard prompts can exaggerate the gap because both branches are stressed by implicit reasoning. The authors therefore divide tasks into easy, medium, and hard categories.

The easy group includes object-level prompts such as single objects, two objects, and colour attributes. The medium group includes texture, spatial, and complex compositional prompts. The hard group includes physics, chemistry, and biology prompts where correct generation may require unstated world knowledge, such as what happens to ice at high temperature or a tree in winter.

The result: non-unification appears across models and tends to increase with task difficulty. In one reported case, VILA-U reaches a non-unification score of 58.47%. In plain language, nearly six out of ten generated images in that setting were rejected by the model’s own understanding branch.

That number is not a universal defect rate for all MLLMs. It is a measured peak in this evaluation setting. But it is large enough to change how we should think about unified multimodal systems. The risk is not only that the generator is weak. The risk is that the product team assumes the understanding branch and generation branch are already aligned because the model is marketed, trained, or architected as unified.

They are not automatically aligned. Apparently “unified” still requires actual unification. Who could have guessed.

Most disagreement comes from weak generation, not weak understanding

Internal disagreement has two possible explanations.

The first is weak generation. The model’s image is genuinely wrong, and the understanding branch correctly rejects it.

The second is weak understanding. The image is actually fine, but the understanding branch misjudges it.

These explanations imply different interventions. If understanding is bad, using it as a training signal is dangerous. If generation is bad and understanding is stronger, the model contains a useful internal critic.

The paper tests this by using Qwen2.5-VL-72B-Instruct as a stronger external judge. When the MLLM’s understanding branch rejects an image, the authors ask whether Qwen agrees. They define a weak-generation score as the probability that the internal rejection matches the stronger judge.
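As a rough sketch of that definition, assuming verdicts are stored as 0 (reject) and 1 (accept) in parallel lists: the function name and the toy verdicts below are illustrative, not taken from the paper.

```python
from typing import Sequence


def weak_generation_score(internal: Sequence[int], external: Sequence[int]) -> float:
    """Among images the internal branch rejects, the share the external judge also rejects."""
    if len(internal) != len(external):
        raise ValueError("verdict lists must have the same length")
    # Keep only the external verdicts for images the internal branch rejected.
    external_on_rejected = [e for i, e in zip(internal, external) if i == 0]
    if not external_on_rejected:
        raise ValueError("the internal branch rejected nothing")
    agreed = sum(1 for e in external_on_rejected if e == 0)
    return agreed / len(external_on_rejected)


# Illustrative verdicts only: 0 = reject, 1 = accept.
internal_verdicts = [0, 0, 1, 0, 1, 0]
external_verdicts = [0, 1, 1, 0, 0, 0]
score = weak_generation_score(internal_verdicts, external_verdicts)
print(score)  # 0.75
```

A value well above 0.5 points to weak generation; a value near or below it would suggest the internal critic is the problem.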

Across task difficulties, this weak-generation score is above 50% and reaches 100% in some cases. The appendix adds human checks that are consistent with the Qwen-based result. The conclusion is not that the understanding branch is perfect. The conclusion is more practical: in many rejected cases, the model’s own understanding branch is good enough to expose real generation failures.

That is the hinge of the paper. If internal disagreement were mostly noisy self-criticism, the method would collapse into self-reinforced confusion. Instead, the stronger branch can guide the weaker one.

This is also the business hinge. Many organisations already use external review layers for generated content: human QA, separate VLM judges, CLIP-like scoring, brand-safety filters, or domain-specific validators. This paper suggests that, for some unified MLLMs, an internal verifier can be used earlier in the pipeline to filter and shape training data. Not as a replacement for all external assurance. As a cheaper first pass. Less glamorous. More useful.

The self-improvement loop turns disagreement into training data

The proposed method is almost offensively simple, which is usually a good sign.

For each prompt, the MLLM generates multiple candidate images. The understanding branch then evaluates each candidate against the original prompt. Images judged more aligned become chosen samples. Images judged misaligned become rejected samples.

From there, the method plugs into standard post-training:

| Training route | Data constructed from internal judgment | What it optimises |
| --- | --- | --- |
| SFT | Prompt plus internally chosen aligned image | Make the generator imitate better internal samples |
| DPO | Chosen image versus rejected image for the same prompt | Make the generator prefer internally aligned outputs over misaligned ones |
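The data construction above can be sketched as follows. This is an assumption-laden illustration: `Candidate` and `build_post_training_data` are hypothetical names, verdicts are reduced to a boolean, and taking the first chosen and first rejected candidate per prompt is a simplification rather than the paper's selection rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    image_id: str
    aligned: bool  # verdict from the model's own understanding branch


def build_post_training_data(
    candidates_by_prompt: Dict[str, List[Candidate]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]], List[str]]:
    """Split internally judged candidates into SFT samples, DPO pairs, and a discard pool."""
    sft, dpo, discard = [], [], []
    for prompt, candidates in candidates_by_prompt.items():
        chosen = [c for c in candidates if c.aligned]
        rejected = [c for c in candidates if not c.aligned]
        if not chosen:
            # No usable positive sample: keep the prompt for possible later replay.
            discard.append(prompt)
            continue
        sft.append((prompt, chosen[0].image_id))
        if rejected:
            # DPO needs a chosen and a rejected image for the same prompt.
            dpo.append((prompt, chosen[0].image_id, rejected[0].image_id))
    return sft, dpo, discard


candidates = {
    "a metallic sphere": [Candidate("m1", False), Candidate("m2", True), Candidate("m3", True)],
    "a cat left of a dog": [Candidate("c1", False), Candidate("c2", False)],
    "a wooden chair": [Candidate("w1", True)],
}
sft, dpo, discard = build_post_training_data(candidates)
print(sft, dpo, discard)
```

Note that a prompt with only aligned candidates still yields an SFT sample but no DPO pair, which is one plausible reason the two routes end up with different dataset sizes.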

The paper applies this loop to Janus-Pro-7B and Show-o. The training is generation-focused, but the key is where the supervision comes from: not an external reward model, not a new human-labelled dataset, but the model’s own understanding branch.

The setup still has real cost. The reported experiments use four 80GB NVIDIA A800 GPUs, with self-improvement taking roughly seven to eight hours. That is not “free,” unless one uses the word the way cloud invoices do, with a straight face. But it is materially different from building or licensing a separate reward stack and labelling a large new dataset.

The method also updates only the shared LLM component in the main configuration. An ablation later tests wider component updates and finds that adding the image aligner, generation head, or vision tower does not produce significant extra gains and can slightly hurt understanding or unification. Operationally, that matters. The most valuable intervention may be narrower than expected: touch the shared reasoning substrate, not every visual module with a pulse.

The main result is not just better images; it is better agreement

The headline result is that internal gap-based self-improvement improves both generation and unification.

On T2I-CompBench++, Janus-Pro-7B under SFT improves its overall generation score from 35.21 to 43.29, while non-unification drops from 26.22 to 16.98. With curriculum SFT, generation rises further to 44.18, while non-unification remains lower at 16.92. Show-o starts with a much smaller internal gap, so the absolute non-unification reduction is smaller, but it still improves: overall generation rises from 49.66 to 52.67 under SFT and 52.82 with curriculum SFT, while non-unification drops from 0.95 to 0.11 and then 0.06.
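The rows above are stated in score points, so it helps to separate the absolute change from the relative one. The snippet below only does arithmetic on the numbers quoted in this paragraph; the relative percentages are derived here and are not figures reported by the paper, and `change` is a hypothetical helper.

```python
from typing import Tuple


def change(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    delta = after - before
    return round(delta, 2), round(100 * delta / before, 1)


# SFT rows on T2I-CompBench++ as quoted above.
print("Janus-Pro-7B generation:     ", change(35.21, 43.29))  # (8.08, 22.9)
print("Janus-Pro-7B non-unification:", change(26.22, 16.98))  # (-9.24, -35.2)
print("Show-o generation:           ", change(49.66, 52.67))  # (3.01, 6.1)
print("Show-o non-unification:      ", change(0.95, 0.11))    # (-0.84, -88.4)
```

Show-o's non-unification falls by a large relative margin but less than one point in absolute terms, which is why a small starting gap leaves little room to help.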

The authors also report broader headline gains: up to 20% improvement in generation and up to 16% improvement in unification. They find that models and subtasks with larger gaps tend to benefit more. This makes intuitive sense. If the internal verifier has more correct failures to detect, it has more useful signal to convert into training data.

The result is not evenly magical across every benchmark and subtask. DPO is generally consistent but less dramatic in some reported rows. GenEval, being easier, shows smaller or mixed changes in non-unification because the baseline gap is already low. Science-T2I remains difficult, with modest absolute generation scores. The useful reading is not “self-improvement dominates everything.” The useful reading is that internal judging can concentrate training signal where the model’s own disagreement is most informative.

That distinction matters in deployment. If a model already has near-zero non-unification on a narrow use case, the loop may have little room to help. If the model frequently produces plausible but prompt-misaligned images, the loop becomes much more interesting.

The verifier improves because it learns what false confidence looks like

The paper’s most interesting result is not the generation gain. It is the co-improvement effect.

The method trains the generator. Yet the understanding branch also improves. The authors measure this with a win-rate metric: when the pre-trained and self-improved models disagree in judging a prompt-image pair, how often does the self-improved model match the stronger external judge? A neutral result would be around 50%. For Janus-Pro, the self-improved model exceeds 50% on five of six subtasks in the reported T2I-CompBench++ analysis.

The key qualitative change is false positive correction. Before self-improvement, the model sometimes judges a misaligned image as aligned. After self-improvement, it is more likely to reject that image correctly. On Janus-Pro with SFT, the authors report that roughly 80% of understanding improvement comes from this false-positive-correction category. Additional understanding benchmarks show smaller but supportive gains, including up to 3% improvement in the appendix results.

This is a subtle point. The model is not merely becoming more generous toward its improved generator. It becomes better at detecting bad outputs produced by the pre-trained model. That is more valuable for operations because production systems need rejection competence, not just prettier samples.

A verifier that says yes to everything is cheap, fast, and useless. A verifier that learns the shape of its own previous mistakes is infrastructure.
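The win-rate idea and the false-positive-correction share described in this section can be sketched together. This is one plausible reading, not the paper's code: verdicts are 0 (reject) and 1 (accept), the external judge is treated as the reference, and "share of improvement" is assumed to mean the fraction of wins where the pre-trained model wrongly accepted an image the self-improved model now rejects.

```python
from typing import Sequence, Tuple


def win_rate_and_fp_share(
    base: Sequence[int], improved: Sequence[int], external: Sequence[int]
) -> Tuple[float, float]:
    """Win rate over disagreements, and the share of wins that are false-positive corrections."""
    disagreements = wins = fp_corrections = 0
    for b, i, e in zip(base, improved, external):
        if b == i:
            continue  # only cases where the two models disagree count
        disagreements += 1
        if i == e:
            wins += 1
            if b == 1 and e == 0:
                # The pre-trained model accepted a misaligned image; the improved one rejects it.
                fp_corrections += 1
    if disagreements == 0 or wins == 0:
        raise ValueError("need at least one disagreement and one win")
    return wins / disagreements, fp_corrections / wins


# Illustrative verdicts only.
base_verdicts = [1, 1, 0, 1, 0, 1, 0]
improved_verdicts = [0, 1, 1, 0, 0, 0, 1]
external_verdicts = [0, 1, 0, 0, 0, 0, 1]
rate, fp_share = win_rate_and_fp_share(base_verdicts, improved_verdicts, external_verdicts)
print(rate, fp_share)  # 0.8 0.75
```

Restricting the count to disagreements is what makes 50% the neutral baseline: when the two models agree, neither can win.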

Shared learning dynamics explain why one branch can pull the other

The theoretical section uses learning dynamics and empirical neural tangent kernel analysis to explain co-improvement. Operators do not need to live inside the equations, which is fortunate because there are better hobbies. But the mechanism is worth translating.

The model has generation and understanding pathways that share parts of the underlying network. When self-improvement updates the model on selected generation data, those updates can also affect the understanding pathway. The paper argues that this happens through shared eNTK terms: if a validation case is similar to post-training samples, the update direction for generation and understanding can become aligned.
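For readers who want one equation, here is a generic first-order sketch of the learning-dynamics idea. It is a textbook approximation under stated assumptions, not the paper's derivation: a single gradient step with learning rate $\eta$ on a post-training sample $x_u$, a loss $\mathcal{L}$ that depends on the parameters only through a scalar output $f_\theta(x_u)$, and a validation case $x_o$.

$$ \theta' = \theta - \eta \, \nabla_\theta \mathcal{L}(x_u) $$

$$ f_{\theta'}(x_o) - f_\theta(x_o) \approx -\eta \, K_\theta(x_o, x_u) \, \frac{\partial \mathcal{L}}{\partial f_\theta(x_u)}, \qquad K_\theta(x_o, x_u) = \nabla_\theta f_\theta(x_o)^\top \nabla_\theta f_\theta(x_u) $$

Here $K_\theta$ is the empirical neural tangent kernel. When the gradients at $x_o$ and $x_u$ point in similar directions, the kernel term is large, so an update computed on $x_u$ also moves the output on $x_o$. The paper's argument, as summarised above, is that generation and understanding share such kernel terms through the common backbone.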

The authors test this explanation empirically. They look at false positive correction samples and ask whether those samples have similar counterparts in the post-training data. They use prompt similarity and image similarity as proxies. For Janus-Pro, false positive correction samples show higher similarity to post-training samples than random references. For Show-o, appendix evidence reports average cosine similarity around 0.8 in the comparable analysis. The authors also show that the relevant shared term dominates the understanding update and that the probability of the previously misaligned generation decreases.

The practical reading is this:

1. The generator produces a class of wrong images.
2. The understanding branch sometimes mistakenly accepts those wrong images.
3. Internal scoring finds better examples for similar prompts.
4. Training on those examples reduces the generator’s probability of producing the wrong class.
5. Because the same shared backbone participates in understanding, the verifier also becomes less likely to accept that wrong class.

That is the mechanism-first story. Internal disagreement becomes training data. Training data shifts generation. Shared representation updates sharpen rejection. Sharper rejection enables more useful future data selection.

This is not quite self-awareness. It is more like a machine learning version of “having receipts.”

Curriculum replay turns discarded prompts into delayed fuel

The curriculum extension is the paper’s second mechanism layer.

In the initial self-improvement loop, some prompts cannot be used well. The generator may fail to produce a good candidate, or the understanding branch may be too uncertain or inaccurate to judge one; either way, the resulting sample is discarded. But after several epochs of self-improvement, both generation and understanding may be stronger. The authors therefore revisit the discard pool later in training.

This is curriculum replay. At epoch 10 in the main setup, the model regenerates and rescores previously unused prompts. Some of those discarded prompts now yield usable training samples. The process dynamically expands the post-training data.
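A minimal sketch of the replay step, under loud assumptions: `replay_discard_pool`, `generate_candidates`, and `judge` are hypothetical names, the judge returns a plain boolean, and the first accepted candidate is kept. Only the epoch-10 trigger comes from the description above.

```python
from typing import Callable, List, Sequence, Tuple


def replay_discard_pool(
    discard_pool: Sequence[str],
    generate_candidates: Callable[[str], List[str]],
    judge: Callable[[str, str], bool],
    epoch: int,
    replay_epoch: int = 10,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """At the replay epoch, regenerate and rescore discarded prompts with the current model."""
    if epoch != replay_epoch:
        return [], list(discard_pool)
    recovered, still_discarded = [], []
    for prompt in discard_pool:
        aligned = [img for img in generate_candidates(prompt) if judge(prompt, img)]
        if aligned:
            recovered.append((prompt, aligned[0]))  # becomes new post-training data
        else:
            still_discarded.append(prompt)  # stays in the pool
    return recovered, still_discarded


# Toy stand-ins for the improved model's two branches.
def toy_candidates(prompt: str) -> List[str]:
    return [f"{prompt}#0", f"{prompt}#1"]


def toy_judge(prompt: str, image: str) -> bool:
    return image.endswith("#1") and prompt != "tree in winter"


pool = ["melting ice", "tree in winter", "rusting iron"]
recovered, remaining = replay_discard_pool(pool, toy_candidates, toy_judge, epoch=10)
print(recovered, remaining)
```

The key design point is that the same prompts are re-run through the current model, so the pool only pays off if the model has actually improved since the prompts were discarded.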

The reported expansion is meaningful. For Janus-Pro-7B with SFT, curriculum replay adds 1,091 samples to an original post-training set of 2,265, an increase of roughly 48%. For Janus-Pro with DPO, it adds 359. For Show-o, where the usable initial data is much smaller, the additions are 64 for SFT and 59 for DPO. The authors also compare branches and find that jointly improving generation and understanding adds 1,091 samples from the discard pool, while single-branch enhancement adds roughly 600.

The ablation on curriculum timing is useful. Replay at both epoch 4 and epoch 10 helps, but epoch 10 performs better. The likely reason is straightforward: by epoch 10, the model has improved enough to use previously wasted prompts more reliably. Try too early, and the model is still rummaging through the discard bin with the same old hands.

This is the article’s strongest operational analogy. Do not treat failed prompts only as failure logs. Treat them as a delayed training reserve. The model may not be ready to learn from them now, but after the first pass of improvement, they may become usable.

Which evidence is doing which job

The paper contains several experiments, appendices, and ablations. They do not all carry the same argumentative weight. A disciplined reading separates main evidence from support.

| Test or result | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Non-unification score across six MLLMs and nine subtasks | Main evidence | Internal generation-understanding gaps are widespread and vary with task difficulty | That every unified MLLM will show the same magnitude of gap in production |
| Weak-generation check using Qwen and human validation | Main evidence / diagnostic validation | Most rejected generations are genuinely weak, so the understanding branch can be useful as an internal critic | That the internal critic is always reliable or domain-safe |
| SFT and DPO self-improvement on Janus-Pro-7B and Show-o | Main method evidence | Internal judgments can construct useful post-training data | That external reward models or human labels are unnecessary |
| Win-rate analysis and false positive correction | Mechanism evidence | Generation-targeted tuning can also improve understanding, especially rejection of misaligned images | That all understanding capabilities improve equally |
| eNTK learning-dynamics analysis | Mechanistic explanation | Shared updates can align generation and understanding changes | A complete causal account of why all unified architectures share such dynamics |
| Curriculum replay | Exploratory extension with practical value | Improved models can reuse previously discarded hard samples and expand data | That curriculum replay timing is universally optimal |
| Component-update ablation | Implementation detail / ablation | Updating only the LLM component is sufficient in this setup | That vision modules should never be tuned |
| External Qwen-assisted SFT comparison | Boundary test | External judges can still outperform internal self-improvement, but internal signals are close | That self-improvement is better than external rewards |

This table is also the difference between reading the paper and summarising it. The main claim is not “all experiments improved.” The main claim is that internal inconsistency can become an alignment resource when the verifier is stronger than the generator.

The business value is cheaper diagnosis before expensive alignment

For multimodal product teams, the immediate use case is not autonomous self-training in production. Please do not let the model silently rewrite its own generator from live customer traffic and call it “continuous improvement.” That is how one earns a very exciting incident review.

The practical workflow is more controlled:

| Operational step | What the paper directly shows | Cognaptus inference for business use | Remaining uncertainty |
| --- | --- | --- | --- |
| Internal audit | A non-unification score can quantify when a model rejects its own generated output | Track prompt-alignment risk without immediately hiring a separate judge for every sample | Internal scores inherit the understanding branch’s blind spots |
| Candidate filtering | The understanding branch can select chosen and rejected images for SFT/DPO | Use internal scoring to reduce annotation volume and prioritise review | Works best when understanding is materially stronger than generation |
| Gap-targeted training | Larger-gap tasks benefit more because they contribute more useful data | Allocate post-training budget to prompt families where self-disagreement is high | Correlation may shift by domain, model, and benchmark |
| Co-improvement | Generator-focused tuning can improve false-positive rejection | Treat generation tuning as a possible verifier-improvement path, not only output-quality work | Gains may be narrow and task-specific |
| Curriculum replay | Previously discarded prompts can become usable after initial improvement | Maintain a structured discard pool instead of deleting hard cases | Replay timing and sample quality need validation per deployment |
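The internal-audit and gap-targeted rows above boil down to grouping self-rejections by prompt family. A minimal sketch, assuming each audit record is a `(family, verdict)` pair with 0 for reject and 1 for accept; the family names and data are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def gap_by_family(records: Iterable[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Per-family non-unification rate, sorted from largest gap to smallest."""
    totals = defaultdict(int)
    rejects = defaultdict(int)
    for family, verdict in records:
        totals[family] += 1
        if verdict == 0:
            rejects[family] += 1
    rates = {family: rejects[family] / totals[family] for family in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Illustrative audit log: (prompt family, internal verdict).
audit_log = [
    ("texture", 0), ("texture", 1), ("texture", 1), ("texture", 1),
    ("spatial", 0), ("spatial", 0), ("spatial", 1), ("spatial", 0),
    ("colour", 1), ("colour", 1),
]
ranked = gap_by_family(audit_log)
print(ranked)  # [('spatial', 0.75), ('texture', 0.25), ('colour', 0.0)]
```

The ranking is only a prioritisation aid. Families at the top still need the external check described earlier before their rejections are trusted as training signal.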

The strongest commercial relevance is in domains where image correctness matters more than visual gloss: advertising asset generation, product imagery, e-commerce catalogue enrichment, brand-compliant creative tooling, educational diagrams, technical illustrations, and internal design workflows.

In these settings, the cost centre is not only generation. It is verification. Every generated image creates a QA burden. If a model can identify a useful share of its own failures, teams get a cheaper triage layer. The internal verifier can route obvious failures out, identify high-gap prompt categories, and produce candidate data for later supervised improvement.

This does not remove human review. It changes where human review should be spent. Instead of inspecting every image equally, reviewers can focus on disagreement clusters, high-value prompts, borderline verifier cases, and domains where internal judgment is known to be weak.

The external judge still wins sometimes, and that is the point

One appendix result is especially important for sober interpretation. The authors construct post-training data using Qwen2.5-VL-72B-Instruct as an external judge and compare it with internal self-improvement. Qwen-assisted SFT performs slightly better than pure self-improvement for Janus-Pro-7B in generation and unification. The paper attributes this to Qwen’s stronger understanding capability.

This is not a flaw in the paper. It is the boundary condition.

Internal self-improvement works only as well as the internal critic allows. If the model’s understanding branch is weak, biased, overconfident, or systematically blind to a domain-specific constraint, then self-improvement may select the wrong samples. In regulated, brand-sensitive, medical, engineering, financial, or legal-adjacent visual workflows, internal self-scoring should be treated as a filter, not a final authority.

The correct business takeaway is layered evaluation:

1. Use internal non-unification to find cheap signal.
2. Use external judges where internal judgment is weak or high stakes.
3. Use human review for domain-specific risk, brand fit, and final acceptance.
4. Feed verified disagreement clusters back into controlled post-training.
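One way to picture the layering is as a routing rule per prompt family. The sketch below is a deliberately simple assumption, not a recommendation from the paper: `review_stages` is a hypothetical helper, and reserving human review for high-stakes work is a simplification of the third item above.

```python
from typing import List


def review_stages(internal_trusted: bool, high_stakes: bool) -> List[str]:
    """Ordered review stages for a prompt family (illustrative policy only)."""
    stages = ["internal_check"]  # cheap signal first
    if not internal_trusted or high_stakes:
        stages.append("external_judge")
    if high_stakes:
        stages.append("human_review")
    return stages


print(review_stages(internal_trusted=True, high_stakes=False))
print(review_stages(internal_trusted=False, high_stakes=False))
print(review_stages(internal_trusted=True, high_stakes=True))
```

Whether a family counts as trusted should come from measured agreement with an external judge or human reviewers, not from intuition.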

The paper makes internal scoring more attractive. It does not make external assurance obsolete. Anyone claiming otherwise is not reading the appendix; always a risky lifestyle.

Where the result should not be overextended

The study is careful enough to make its own limitations visible.

First, the self-improvement experiments concentrate on Janus-Pro-7B and Show-o. The initial non-unification verification covers more models, but the training loop is not validated across every major unified MLLM architecture. BAGEL, for example, is mentioned as future work.

Second, the benchmarks are image-generation and prompt-alignment benchmarks. They are useful, but they do not capture every enterprise constraint: brand rules, legal restrictions, factual grounding, localisation, accessibility, safety, or domain-specific visual standards.

Third, the mechanism explanation uses shared eNTK learning dynamics. The theory helps explain how generation and understanding can move together, and the empirical evidence supports it. But the authors explicitly leave open a deeper question: why this NTK sharing arises in unified MLLMs in the first place. For model researchers, that is not a footnote. It is the next research problem.

Fourth, internal self-improvement is not automatically safe. If the model’s understanding branch consistently rewards a flawed pattern, post-training can reinforce that pattern. The paper’s method is strongest where understanding is demonstrably stronger than generation. That condition should be tested, not assumed.

Finally, cost is lower than external reward-heavy pipelines, but not zero. The experiments involve multi-GPU post-training, candidate generation, scoring, and replay. The return on investment depends on failure volume, annotation cost, model size, and whether prompt-alignment defects are actually expensive in the target workflow.

The operator’s checklist: when this method is worth trying

A multimodal team should consider this approach when five conditions hold:

| Condition | Practical test |
| --- | --- |
| The product depends on prompt-faithful image generation | Misalignment causes rework, rejection, customer complaints, or brand risk |
| The model can judge its own outputs better than it generates them | Internal rejection agrees with an external judge or human reviewers more often than chance |
| Failures cluster by prompt family | Texture, spatial relation, object binding, physical reasoning, or domain-specific constraints show repeated gaps |
| The team can run controlled post-training | SFT or DPO can be tested offline with evaluation gates |
| The discard pool is preserved | Failed prompts and weak candidates are stored with metadata for later replay |

The method is less attractive when the model’s internal verifier is weak, the use case is already low-risk, the prompt space is narrow and solved, or the organisation cannot safely evaluate post-training changes before deployment.

In short: do not adopt the technique because it sounds self-improving. Adopt it because internal disagreement is measurable, externally validated, and concentrated in economically meaningful failure modes.

Conclusion: the mirror is useful because it disagrees

The paper’s contribution is not that multimodal models can generate better images after fine-tuning. We already had an entire industry trying that, with varying degrees of GPU smoke.

The contribution is sharper: a unified MLLM’s internal gap can be measured, diagnosed, and used. The understanding branch can act as a low-cost critic for the weaker generator. Its judgments can build SFT and DPO data. The resulting post-training can improve generation, reduce non-unification, and sometimes sharpen understanding itself by correcting false positives. Curriculum replay then turns previously discarded hard prompts into later-stage training fuel.

The larger business lesson is about alignment infrastructure. The next generation of multimodal systems will not be improved only by more data, bigger models, or prettier outputs. They will need internal audit loops that expose where a model disagrees with itself, external checks that calibrate those loops, and training workflows that convert disagreement into useful correction.

A model that can see its own mistakes is not automatically wise. But it is less useless than one that smiles confidently at every bad image it produces.

That, in enterprise AI terms, is progress. Not enlightenment. Progress.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yujin Han et al., “Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs,” arXiv:2507.16663, 2025.
