When Models Disagree With Themselves: Turning Multimodal Conflict into Signal

Screenshots lie differently from HTML.

That sounds like a small engineering nuisance until the model is not merely answering a demo question, but reading a supplier invoice, comparing products on a procurement portal, interpreting a dashboard, or deciding which button an autonomous web agent should click next. The same underlying object may appear as a rendered page, raw DOM, OCR text, chart pixels, table JSON, or a caption. Humans usually treat these as different windows onto the same thing. Multimodal models often treat them as different worlds.

That is the uncomfortable starting point of R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning.¹ The paper studies a familiar failure in multimodal large language models: give the model the same content as an image and as text, and it may produce different answers. The easy fix is to ask several times and vote. The paper’s argument is that this fix is sometimes worse than lazy. It can be actively misleading.

The contribution is not simply “another RL method improves another benchmark.” We have enough of those to wallpaper a small office, if anyone still had offices with walls. The useful idea is more structural: when a model disagrees with itself across modalities, that disagreement can become training signal. Not by pretending the majority answer is correct, but by asking whether an answer can survive a round trip across representations.

In business terms, this is a shift from output popularity to representational stability. It matters because many enterprise AI systems are now multimodal by accident, not by strategy. A document pipeline may have PDF pages, OCR text, extracted tables, and screenshots. A web agent may see both DOM and rendered UI. A dashboard agent may see chart images and underlying data. When those views disagree, the system should not merely average the confusion and call it confidence.

The real failure is not disagreement; it is rewarding the wrong agreement

The paper begins with a failure mode that is easy to understand and surprisingly easy to overlook. Current self-improvement methods often use majority voting: generate several candidate answers, choose the most frequent one, and train the model to prefer that answer. In domains such as math or code, this can sometimes be paired with an external verifier. In open multimodal reasoning, correctness is harder to check, so consensus becomes tempting.

Tempting, not reliable.

Voting assumes that repeated answers are likely to be right. But if the model has a systematic bias, repeated answers merely repeat the bias. This is the “majority-is-wrong” problem. In multimodal settings, the problem gets an extra layer: the image view and text view may each have their own failure pattern. Pooling them into one vote does not resolve the underlying conflict. It just produces a winner from a badly designed election.

Failure mode	What happens	Why it matters operationally
Single-modal majority is wrong	The model repeats the same biased answer often enough to become its own pseudo-label	RL reinforces the mistake instead of correcting it
Cross-modal conflict	Text-based and image-based answers disagree on the same content	The system has no stable learning signal unless it models the disagreement
Spurious agreement	Two modalities agree on the wrong answer	Consistency looks comforting but does not prove truth
Dominant-modality collapse	One representation overwhelms the other in pooled voting	The weaker or less frequent view may contain the correct clue
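
The voting failures in the table can be reproduced with a few lines of counting. The sketch below is illustrative only: the rollouts are made-up data for a single question whose true answer is assumed to be "B", and `majority_vote` is a hypothetical helper, not the paper's code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer and its share of the votes."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical rollouts for one question; the true answer is assumed to be "B".
text_rollouts = ["A", "A", "A", "A", "A", "B"]  # text view repeats a biased answer
image_rollouts = ["B", "B", "B", "A"]           # image view is mostly right

text_winner, text_share = majority_vote(text_rollouts)
image_winner, image_share = majority_vote(image_rollouts)

# Pooling both views into one election lets the more frequent view dominate.
pooled_winner, pooled_share = majority_vote(text_rollouts + image_rollouts)

print("text:", text_winner, "image:", image_winner, "pooled:", pooled_winner)
```

In this toy case the pooled vote picks the text view's biased answer, and the image view's correct clue is simply outvoted. Training on that pseudo-label would reinforce the mistake.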

The subtle point is that disagreement itself is not the enemy. Disagreement is information. The enemy is treating unresolved disagreement as if it were resolved by arithmetic.

This is why R-C2 is more interesting than a simple “better voting” method. It does not ask which answer appears most often. It asks whether a candidate answer remains semantically stable when the model reconstructs the question and verifies it through different modalities.

R-C2 turns an answer into a cross-modal round trip

The mechanism starts from a candidate answer, not from a query. That inversion matters.

Suppose the model has a candidate answer: BHPC Blaze by Beverly Hills Polo Club. Instead of immediately asking whether most rollouts agree with that answer, R-C2 asks a backward question: what query, grounded in the current observation, would have produced this answer?

Then it switches representation. A query inferred from the text view can be answered using the image view; a query inferred from the image view can be answered using the text view. If the reconstructed answer matches the original candidate answer, that path earns a positive reward. If not, it does not.

The paper uses a binary cycle-consistency reward. In simplified form:

$$ r_{\text{path}} = \mathbf{1}\left[ a_{\text{reconstructed}} \equiv a_{\text{original}} \right] $$

That looks almost too simple. The intelligence is not in the reward value; it is in the path required to earn it. The model has to connect answer, inferred query, and modality-switched evidence. It is being trained not merely to repeat an answer, but to preserve meaning under transformation.
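
As a minimal sketch, the binary reward can be written as an indicator over an equivalence check. The normalization rule here (lowercasing, dropping punctuation, collapsing whitespace) is an assumption made for illustration; the article does not specify how the paper decides that two answers are equivalent.

```python
import re

def normalize(answer: str) -> str:
    """Hypothetical equivalence rule: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", answer.lower())
    return " ".join(cleaned.split())

def cycle_reward(original: str, reconstructed: str) -> int:
    """Return 1 if the reconstructed answer matches the original, else 0."""
    return int(normalize(original) == normalize(reconstructed))

same = cycle_reward(
    "BHPC Blaze by Beverly Hills Polo Club",
    "bhpc blaze, by Beverly Hills Polo Club.",
)
different = cycle_reward("BHPC Blaze by Beverly Hills Polo Club", "BHPC Blaze")
print(same, different)
```

How strict `normalize` is matters: a rule that is too forgiving rewards answers that only look alike, while one that is too strict punishes harmless formatting differences.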

A compact version of the mechanism looks like this:

Step	Action	Purpose
1	Start with a candidate answer	Avoid dependence on labeled query-answer pairs
2	Infer a likely query from the text view	Test whether the answer is grounded in textual evidence
3	Infer a likely query from the image view	Test whether the answer is grounded in visual evidence
4	Answer each inferred query using both the same modality and the alternate one	Check intra-modal stability and cross-modal alignment
5	Reward paths where the reconstructed answer matches the original	Convert consistency into label-free RL signal

The full cycle has four paths:

Cycle path	What it checks	Why it is useful
Text → Text	Whether text-based reasoning is internally stable	Catches unstable textual inference
Image → Image	Whether visual reasoning is internally stable	Catches unstable visual inference
Text → Image	Whether text-derived meaning survives visual verification	Forces semantic alignment across views
Image → Text	Whether image-derived meaning survives textual verification	Prevents the visual path from becoming an isolated shortcut
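
The four paths can be enumerated mechanically, as in the sketch below. It is a toy under stated assumptions: `run_cycles`, `toy_infer_query`, and `toy_answer_query` are hypothetical names, the "views" are plain dictionaries standing in for a real text and image observation, and the question string is invented. In a real system the two callables would be model calls.

```python
from itertools import product
from typing import Callable, Dict, Tuple

def run_cycles(
    candidate: str,
    views: Dict[str, dict],
    infer_query: Callable[[str, dict], str],
    answer_query: Callable[[str, dict], str],
) -> Dict[Tuple[str, str], int]:
    """Score every (query source, answer source) pair with a binary reward."""
    rewards = {}
    for src, dst in product(views, repeat=2):
        query = infer_query(candidate, views[src])      # backward step
        reconstructed = answer_query(query, views[dst])  # forward step, maybe switched
        rewards[(src, dst)] = int(reconstructed.strip().lower() == candidate.strip().lower())
    return rewards

# Toy stand-ins for model calls: each view maps questions to what it "reads".
def toy_infer_query(candidate: str, view: dict) -> str:
    for question, answer in view.items():
        if answer == candidate:
            return question
    return ""  # this view cannot ground the candidate

def toy_answer_query(query: str, view: dict) -> str:
    return view.get(query, "")

candidate = "BHPC Blaze by Beverly Hills Polo Club"
views = {
    "text": {"Which product is listed first?": candidate},
    "image": {"Which product is listed first?": "BHPC Blaze"},  # truncated visual read
}

rewards = run_cycles(candidate, views, toy_infer_query, toy_answer_query)
for path, reward in rewards.items():
    print(path, reward)
```

In this toy setup only the Text → Text path earns a reward. The three paths that touch the image view fail, which is the kind of signal a pooled vote would hide.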

The paper’s mechanism-first contribution is here: accuracy is not treated as the direct reward. Cycle survival is. Accuracy is expected to improve because answers that cannot survive modality switching are less likely to be semantically grounded.

That expectation is not a theorem. It is an empirical bet. The rest of the paper tests whether the bet pays off.

The main results show accuracy gains, but the consistency numbers are the real clue

The experiments use Qwen2.5-VL-3B-Instruct and Qwen3-VL-8B-Instruct, comparing base models, text-only voting, image-plus-text voting, and R-C2 reinforcement learning. The evaluation covers a broad set of multimodal reasoning tasks, including ScienceQA, ChartQA, MathVista, Visual Web Arena, A-OKVQA, DocVQA, and InfoVQA.

The headline is straightforward: R-C2 improves accuracy over the base model and voting baselines in most reported settings. For the 3B model, average text accuracy rises from 65.5 to 70.3, and average vision accuracy rises from 76.7 to 79.5. For the 8B model, average text accuracy rises from 72.7 to 74.9, and average vision accuracy rises from 84.6 to 85.7.

Model	Metric	Base	R-C2	Absolute gain
Qwen2.5-VL-3B	Average text accuracy	65.5	70.3	+4.8
Qwen2.5-VL-3B	Average vision accuracy	76.7	79.5	+2.8
Qwen3-VL-8B	Average text accuracy	72.7	74.9	+2.2
Qwen3-VL-8B	Average vision accuracy	84.6	85.7	+1.1

The pattern is sensible. The smaller model has more room to improve; the larger model still benefits, but less dramatically. That is not disappointing. It suggests the method is not just patching a weak model, but also complementing scale when baseline performance is already stronger.

Some individual gains are larger. On Qwen2.5-VL-3B, ScienceQA improves by +7.8 points in text accuracy and +7.3 in vision accuracy. ChartQA gains +6.1 and +2.0, MathVista +6.0 and +2.8, and Visual Web Arena +5.5 and +4.2 (text and vision accuracy, respectively). These are meaningful improvements, especially because the method does not use ordinary labeled query-answer supervision as its core reward.

Still, the more revealing result is not accuracy alone. It is the cross-modal consistency ratio: the proportion of cases where image-based and text-based predictions agree.

For Qwen2.5-VL-3B, average consistency rises from 67.0 to 72.6. For Qwen3-VL-8B, it rises from 72.9 to 76.8. On A-OKVQA, the smaller model shows a +12.5 consistency gain; on ScienceQA, +10.0; on ChartQA, +6.1.

Model	Average consistency, base	Average consistency, R-C2	Gain
Qwen2.5-VL-3B	67.0	72.6	+5.6
Qwen3-VL-8B	72.9	76.8	+3.9
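
The consistency ratio described above is simple to compute once predictions from both views are aligned per example. The following sketch uses invented predictions and gold labels; exact string equality is an assumption, since the article does not state which matching rule the paper applies.

```python
def consistency_ratio(text_preds, image_preds):
    """Percentage of examples where text-based and image-based answers agree."""
    if len(text_preds) != len(image_preds):
        raise ValueError("predictions must be aligned one-to-one")
    if not text_preds:
        return 0.0
    agree = sum(t == i for t, i in zip(text_preds, image_preds))
    return 100.0 * agree / len(text_preds)

def accuracy(preds, gold):
    """Percentage of predictions that equal the gold answer."""
    return 100.0 * sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Hypothetical predictions for four examples.
gold = ["A", "B", "C", "D"]
text_preds = ["A", "B", "D", "D"]
image_preds = ["A", "C", "D", "D"]

consistency = consistency_ratio(text_preds, image_preds)
text_acc = accuracy(text_preds, gold)
image_acc = accuracy(image_preds, gold)
print(consistency, text_acc, image_acc)
```

On the third example both views agree on "D" while the gold answer is "C". The ratio counts that as consistent, which is why it has to be read alongside accuracy rather than instead of it.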

This is where the paper’s argument becomes stronger. If R-C2 only improved accuracy while leaving cross-modal agreement unchanged, it might just be another fine-tuning trick. Instead, the method improves the exact property it is designed to enforce. That does not prove the model has achieved deep world understanding, but it does show the reward is steering behavior in the intended direction.

A business reader should read this as follows: the paper is not merely claiming “higher scores.” It is proposing a measurable reliability dimension for multimodal systems. If the same answer cannot survive text-image conversion, it probably deserves lower confidence, escalation, or retraining attention.

The ablations explain why mixed cycles beat single-path cleverness

The paper includes several experiments that should not be read as separate theses. They are mostly ablations and sensitivity checks: remove or vary parts of the mechanism and see whether the story still holds.

That distinction matters. A surprisingly common way to misread AI papers is to treat every table as a new grand claim. Sometimes a table is just the authors checking whether their own machine has wheels.

Evidence item	Likely purpose	What it supports	What it does not prove
Table 1: accuracy across tasks	Main evidence	R-C2 improves reported task accuracy over base and voting baselines	Universal improvement across all models and enterprise settings
Table 2: consistency ratio	Main evidence / mechanism validation	The method improves text-image agreement, not only raw accuracy	Agreement always means truth
Table 3: single vs cross vs mixed cycles	Ablation	Same-modality and cross-modality cycles are complementary	A single preferred path works everywhere
Figure 6: cycle path heatmaps	Ablation / sensitivity test	Broader path diversity gives stronger accuracy and consistency	The exact path weighting is solved for production use
Figure 7: inconsistency ratio	Controlled sensitivity test	Conflict-heavy training examples can be useful hard cases	Noisy or low-quality data is automatically valuable
Table 4: candidate answer source	Implementation / scalability check	Self-generated answers can bootstrap training, especially on web tasks	Ground-truth answers are never useful
Appendix VWA construction	Dataset implementation detail	The web task was made easier to evaluate via multiple choice	Full autonomous web navigation is solved

The cycle-path ablation is especially important. On ScienceQA and ChartQA, the paper compares three configurations: single-modality cycles, cross-modality cycles, and the mixed configuration that uses all four paths. Mixed performs best.

For ScienceQA, the mixed setup reaches 76.7 text accuracy, 83.3 vision accuracy, and 84.9 consistency. Single cycles are lower: 74.0, 81.7, and 81.2. Cross-only cycles improve consistency relative to single cycles but do not dominate everything: 75.8 text accuracy, 80.1 vision accuracy, and 83.1 consistency. ChartQA shows the same broad pattern, with mixed cycles again highest across text accuracy, vision accuracy, and consistency.

The interpretation is clean. Same-modality cycles help the model avoid unstable reasoning within a representation. Cross-modality cycles force alignment between representations. Use only one family, and the model can still find shortcuts. Use both, and the reward becomes harder to game.

This is the most useful engineering lesson in the paper: reliability is not created by one clever check. It comes from overlapping constraints that make different failure modes visible.

Conflict-heavy data helps because it is hard, not because mess is magic

Figure 7 tests a counterintuitive question: should training examples with cross-modal inconsistency be avoided as dirty data, or used as hard examples?

The authors fine-tune Qwen2.5-VL-3B-Instruct on ScienceQA subsets with the same size but different proportions of cross-modal inconsistent samples, from 0% to 50%. As the inconsistency ratio increases, both answer accuracy and cross-modal consistency generally improve.

This is easy to overstate, so let’s be precise. The result does not mean that any messy dataset is good. It means that, under this reward design, samples where modalities disagree can provide useful pressure for alignment. The model is not being rewarded for absorbing noise; it is being rewarded for resolving a specific kind of structured conflict.
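
One way to set up that kind of controlled comparison is to hold the subset size fixed and vary only the share of inconsistent samples. The sketch below is an assumption about how such subsets could be drawn, not the authors' procedure; `build_subset`, the pool contents, and the specific ratio steps are all hypothetical.

```python
import random

def build_subset(consistent, inconsistent, size, ratio, seed=0):
    """Draw `size` samples with a fixed share of cross-modal inconsistent ones."""
    rng = random.Random(seed)
    n_inconsistent = round(size * ratio)
    n_consistent = size - n_inconsistent
    if n_inconsistent > len(inconsistent) or n_consistent > len(consistent):
        raise ValueError("not enough samples in one of the pools")
    subset = rng.sample(inconsistent, n_inconsistent) + rng.sample(consistent, n_consistent)
    rng.shuffle(subset)
    return subset

# Hypothetical pools: each sample is tagged by whether its two views disagree.
consistent_pool = [("consistent", i) for i in range(100)]
inconsistent_pool = [("inconsistent", i) for i in range(100)]

# Same size, different inconsistency ratios between 0% and 50%.
subsets = {
    ratio: build_subset(consistent_pool, inconsistent_pool, size=40, ratio=ratio)
    for ratio in (0.0, 0.25, 0.5)
}
for ratio, subset in subsets.items():
    n_bad = sum(1 for tag, _ in subset if tag == "inconsistent")
    print(ratio, len(subset), n_bad)
```

Keeping the size constant is the point of the design: any change in downstream accuracy or consistency can then be attributed to the mix of examples rather than to the amount of data.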

That distinction matters for businesses. A corpus of broken OCR, mislabeled product photos, and stale database exports is not automatically a gold mine. It becomes useful only if the contradictions can be organized into paired representations and tested through a consistency mechanism. Otherwise, “learning from inconsistency” becomes the kind of phrase that sounds strategic in a slide deck and operationally means “we forgot data cleaning.”

Used properly, however, the idea is valuable. Many companies already own the kind of material R-C2 wants: documents with images and OCR, web pages with screenshots and HTML, dashboards with visual plots and source tables, emails with attachments and extracted metadata. These are not just data assets. They are disagreement surfaces.

Self-generated answers make the method economically interesting

Table 4 studies where the initial candidate answer comes from. R-C2 can start from the model’s own generated answer or from a training-set answer. This matters because if the method requires reference answers everywhere, the “label-free” advantage becomes much less exciting.

The results are mixed in a useful way. On A-OKVQA, training-set answers perform better than self-generated answers: vision accuracy reaches 88.8 versus 87.3, text accuracy 75.2 versus 71.6, and consistency 80.3 versus 77.6. So no, labels have not suddenly become worthless. Anyone claiming otherwise is probably selling a tool that invoices monthly.

On Visual Web Arena, however, self-generated answers are competitive: they reach 74.9 vision accuracy, 65.3 text accuracy, and 68.2 consistency, while training-set answers reach 74.5, 67.1, and 67.1. The direction differs by metric, but the larger point stands: the self-generated setup can provide useful supervision without ordinary human-written query-answer labels.

For enterprise deployment, this is the cost-structure point. The most expensive part of many AI reliability programs is not running another model. It is producing enough high-quality labels to cover the real distribution of documents, screens, charts, and workflows. R-C2 suggests a way to mine internal structure before reaching for manual annotation.

That does not remove the need for human evaluation. It changes where human effort may be most useful: not labeling everything, but auditing unresolved cycles, validating equivalence rules, and inspecting high-impact disagreement clusters.

What Cognaptus would infer for business systems

The paper directly shows that R-C2 improves reported accuracy and cross-modal consistency on the authors’ benchmark suite and model choices. Cognaptus would infer a broader design principle: when an AI system has multiple representations of the same business object, agreement across transformations should become a first-class reliability signal.

That inference is practical, but it is still an inference. The paper is not a turnkey recipe for every enterprise system. It is a useful pattern.

Business setting	Paired representations	Cycle-consistency use	Practical payoff
Document AI	PDF image, OCR text, extracted tables, captions	Verify that answers survive movement between page image and text extraction	Catch OCR/layout errors before they become decisions
Web agents	Screenshot, DOM, accessibility tree, raw HTML	Check whether a proposed action or answer is stable across rendered and structural views	Reduce brittle UI navigation mistakes
Dashboard QA	Chart image, data table, JSON export	Ask whether chart-derived answers match data-derived answers	Detect chart-reading hallucinations and stale data mismatches
Compliance review	Form scan, extracted fields, policy text	Test whether a compliance conclusion can be reconstructed from multiple evidence views	Improve auditability and escalation routing
Procurement and e-commerce	Product page screenshot, structured listing data, reviews, price fields	Validate product comparisons across visual and structured sources	Avoid confident selection errors from partial page parsing

The operational version does not always require full RL training on day one. A lighter version can be used as a diagnostic layer:

1. Build paired views of the same object.
2. Ask the model for an answer or action.
3. Generate one or more backward queries that should recover that answer.
4. Verify those queries across alternate representations.
5. Treat failed cycles as low-confidence cases, audit candidates, or future training examples.
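
The last step of that list, turning cycle outcomes into routing decisions, can be sketched as a small triage function. Everything here is an assumption for illustration: the three queue names, the rule that any failed path blocks automatic acceptance, and the example outputs are hypothetical, and a real deployment would tune them per workflow.

```python
from typing import Dict, Tuple

def triage(path_rewards: Dict[Tuple[str, str], int]) -> str:
    """Route one output based on which cycle paths reproduced its answer."""
    cross = [r for (src, dst), r in path_rewards.items() if src != dst]
    same = [r for (src, dst), r in path_rewards.items() if src == dst]
    if cross and all(cross) and all(same):
        return "accept"    # every path survived the round trip
    if any(cross) or any(same):
        return "recheck"   # partial survival: run another pass or log for training
    return "escalate"      # nothing survived: send to a human reviewer

# Hypothetical cycle results for three outputs (1 = answer reconstructed).
outputs = {
    "invoice_total": {("text", "text"): 1, ("text", "image"): 1,
                      ("image", "text"): 1, ("image", "image"): 1},
    "supplier_name": {("text", "text"): 1, ("text", "image"): 0,
                      ("image", "text"): 0, ("image", "image"): 1},
    "due_date":      {("text", "text"): 0, ("text", "image"): 0,
                      ("image", "text"): 0, ("image", "image"): 0},
}

decisions = {name: triage(rewards) for name, rewards in outputs.items()}
print(decisions)
```

The `supplier_name` case is the interesting one: each view is stable on its own, but the views disagree with each other, so it goes to a second pass rather than being trusted or discarded.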

This is less glamorous than “autonomous agent learns from itself.” It is also more deployable. In production systems, the first economic win often comes from triage: knowing which outputs deserve trust, which need another pass, and which should be routed to a human before they quietly damage a workflow.

Where the method stops being magic

R-C2 is clever, but it is not a certificate of truth.

First, consistency is not correctness. A model can be consistently wrong if both modalities omit the same key fact, if the generated textual description carries the same bias as the image interpretation, or if the answer-equivalence check is too forgiving. The paper’s method reduces a specific failure mode: unresolved cross-modal instability. It does not solve epistemology. Nobody has, despite many LinkedIn posts trying.

Second, the method depends on meaningful paired representations. Some datasets naturally provide text and image views, such as web pages and HTML. For image-only datasets, the authors generate textual views using the model. That makes the pipeline scalable, but also introduces another source of bias. If the generated description loses information, the cycle may reward stability around an incomplete representation.

Third, the reward is binary. A reconstructed answer either matches the original or it does not. That is clean for RL, but real enterprise answers often require partial credit, semantic equivalence, units, dates, entity normalization, and tolerance thresholds. “$278 million” and “278 m” are easy enough. Contract clauses, medical fields, and regulatory findings are less polite.
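
A tolerance-aware equivalence check for monetary amounts might look like the sketch below. It is a hypothetical helper, not part of the paper: the unit table, the regular expression, and the reading of "m" as "million" are assumptions, and that last one is exactly the kind of rule a human should validate, since "m" could also mean metres.

```python
import re
from typing import Optional

# Hypothetical unit table; "m" is ASSUMED to mean million in this context.
_SCALE = {"": 1.0, "k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
          "b": 1e9, "billion": 1e9}
_AMOUNT = re.compile(r"^\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)\s*([a-z]*)\s*$",
                     re.IGNORECASE)

def parse_amount(text: str) -> Optional[float]:
    """Parse strings like '$278 million' or '278 m'; return None if unparseable."""
    match = _AMOUNT.match(text)
    if match is None:
        return None
    number, unit = match.group(1), match.group(2).lower()
    if unit not in _SCALE:
        return None
    return float(number.replace(",", "")) * _SCALE[unit]

def amounts_match(a: str, b: str, rel_tol: float = 0.0) -> bool:
    """Compare two amounts numerically, falling back to string equality."""
    x, y = parse_amount(a), parse_amount(b)
    if x is None or y is None:
        return a.strip().lower() == b.strip().lower()
    return abs(x - y) <= rel_tol * max(abs(x), abs(y))

print(amounts_match("$278 million", "278 m"))
print(amounts_match("$278 million", "$2.78 million"))
print(amounts_match("$278 million", "$280 million", rel_tol=0.01))
```

The tolerance parameter is where partial credit sneaks in: with `rel_tol=0.01` the check accepts a one percent difference, which may be fine for a dashboard summary and unacceptable for an invoice.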

Fourth, the paper generates its cycle data offline. Offline cycles are easier to batch and train on, but they do not co-evolve with the model during training. The authors acknowledge that online generation is possible and choose the offline route for efficiency. That is reasonable research engineering. It also means production teams should think carefully about refresh cadence when business data shifts.

Finally, the experiments are conducted on Qwen-VL models and benchmark-style tasks. The appendix also shows that the Visual Web Arena setup is repurposed into a multiple-choice QA task using automatically generated questions followed by human verification. That is useful for controlled evaluation, but it is not the same as unconstrained web-agent execution in a messy browser session with popups, permissions, slow loading, and a user asking “can you just handle it?”, which is the most dangerous phrase in automation.

The better metric is not confidence; it is survivability

The lasting idea in R-C2 is not that every company should immediately run GRPO (Group Relative Policy Optimization, a reinforcement-learning algorithm) over all multimodal logs. Some will not have the data. Some will not have the training infrastructure. Some should first fix their extraction pipeline, a sentence that is boring because it is true.

The stronger takeaway is that multimodal reliability should be tested under transformation. If an answer is derived from a screenshot, can it be reconstructed from the DOM? If it comes from OCR text, does the page image support it? If it comes from a chart, does the data table agree? If the answer fails that round trip, the system has learned something important even before it knows the final truth.

This changes how teams should think about AI evaluation. Instead of asking only “was the final answer correct?”, they should also ask:

Evaluation question	What it reveals
Does the answer survive a modality switch?	Cross-representation robustness
Does the inferred query remain grounded in the observation?	Whether the model is inventing a convenient question
Do same-modality cycles and cross-modality cycles fail differently?	Whether errors come from internal instability or modality gap
Are failures clustered by document type, UI component, or chart family?	Where workflow redesign or targeted labeling will pay off
Does consistency rise while accuracy stagnates?	Alignment may be improving, but truth anchoring is still weak

That last row is important. Consistency is a leading diagnostic, not the final business outcome. A model that is internally coherent but wrong is not reliable. It is just easier to debug.
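
The clustering question in the table can be answered with a plain group-by over logged cycle outcomes. This is a sketch under assumptions: the record fields `group` and `cycle_passed`, the group names, and the data are all hypothetical placeholders for whatever a team's own logging captures.

```python
from collections import defaultdict

def failure_rates(records):
    """Return (group, failure rate) pairs, worst group first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record["group"]] += 1
        if not record["cycle_passed"]:
            failures[record["group"]] += 1
    rates = {group: failures[group] / totals[group] for group in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical log of cycle checks, tagged by document type or UI component.
records = [
    {"group": "invoice", "cycle_passed": True},
    {"group": "invoice", "cycle_passed": False},
    {"group": "bar_chart", "cycle_passed": False},
    {"group": "bar_chart", "cycle_passed": False},
    {"group": "login_form", "cycle_passed": True},
]

ranked = failure_rates(records)
for group, rate in ranked:
    print(f"{group}: {rate:.0%} of cycles failed")
```

A ranking like this says where to look first, not why the cycles fail. A group at the top of the list is a candidate for targeted labeling or pipeline fixes.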

And that may be the most realistic value of the paper. R-C2 does not promise that disagreement disappears. It shows how disagreement can be made productive. For businesses building AI systems over documents, screens, charts, and messy operational records, that is a healthier promise than “trust the model.” Trust should be earned through checks that survive representation changes.

The model should not win because its answer is popular. It should win because the answer can make the round trip.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zirui Zhang, Haoyu Dong, Kexin Pei, and Chengzhi Mao, “R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning,” arXiv:2603.25720, 2026. https://arxiv.org/abs/2603.25720
