Charts. Tables. Diagrams. Scanned forms. Product screenshots. Floor plans. Receipts with half-faded numbers and three suspiciously similar line items.

This is where enterprise multimodal AI is supposed to become useful. Not in the demo where the model politely describes a golden retriever on a lawn, but in the operationally annoying question: which number, label, relation, or region in this visual object actually matters for the task?

Miao et al.’s paper, “Seeing with You: Perception–Reasoning Coevolution for Multimodal Reasoning,” is about that annoying question.[1] More specifically, it challenges a convenient assumption behind multimodal reinforcement learning with verifiable rewards: if we reward the final answer, the model will automatically learn both to reason better and to see better.

That assumption is tidy. It is also, according to this paper, too tidy.

The authors introduce PRCO, short for Perception–Reasoning Coevolution, a training framework that splits one shared multimodal policy into two prompted roles. The Observer extracts question-relevant visual evidence into a caption. The Solver answers the question using that caption, with image access restored after an initial warmup. The Solver is rewarded for verifiable answer correctness. The Observer is rewarded for whether its caption helps the Solver succeed, while a leakage checker discourages the Observer from simply hiding the final answer in the caption. In other words, the model is not merely asked to think harder. It is asked to produce better evidence for its own thinking.

That distinction matters because a wrong answer in visual reasoning often has two possible parents. The model may have reasoned badly from the right evidence. Or it may have reasoned competently from a badly read image. A final-answer reward sees only the child and then tries to discipline the entire household. Naturally, the wrong family member sometimes gets blamed.

Final-answer rewards can teach reasoning while leaving perception undertrained

The paper begins with a diagnostic observation rather than a grand architecture reveal, which is the right order. Before proposing a new training scheme, the authors ask what standard multimodal RLVR is failing to improve.

Using GRPO as a representative RLVR baseline on WeMath, they find that outcome-only reinforcement learning reduces reasoning errors far more than perception errors. In their diagnostic figure, GRPO reduces reasoning errors by 24.3% but perception errors by only 7.6%. The model learns to arrange its reasoning more effectively, yet still misreads the visual evidence often enough to fail.

This is the paper’s central mechanism claim: a final-answer reward creates blurred credit assignment between seeing and reasoning. The verifier can say whether the answer matches the gold answer. It cannot directly say whether the image was read correctly, whether the right table column was selected, whether the geometry relation was localized, or whether a chart axis was transcribed faithfully.

That creates a subtle training failure. If the model answers incorrectly, the policy update does not know whether to punish the visual extraction tokens, the reasoning tokens, the formatting tokens, or the whole trajectory indiscriminately. If the model answers correctly, it may also reinforce a lucky or shortcut-driven process. The reward is verifiable, yes. The causal diagnosis is not.
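To make the blurred-credit point concrete, here is a minimal sketch of group-relative advantages in the style of GRPO: each rollout's single scalar reward is normalized against its sampling group and then applied to every token of that rollout. The token labels and the `broadcast_to_tokens` helper are hypothetical illustrations, not the paper's code, and real implementations differ in normalization details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(tokens, advantage):
    """Outcome-only credit: every token inherits the same trajectory-level value."""
    return [(token, advantage) for token in tokens]

# Hypothetical failed rollout: the model misread the bar, then reasoned correctly.
failed_rollout = ["read:axis_max=10", "read:bar=7 (misread)", "compute:10-7", "answer:3"]
group_rewards = [1.0, 0.0, 0.0, 1.0]  # verifier outcomes for four sampled rollouts

advantages = group_advantages(group_rewards)
credited = broadcast_to_tokens(failed_rollout, advantages[1])
for token, adv in credited:
    print(f"{adv:+.2f}  {token}")
```

Every token in the failed rollout receives the same negative value, so the correct subtraction is penalized exactly as much as the misread bar.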

For business readers, this is the useful shift: do not treat “vision-language reasoning” as one undifferentiated capability. In production, it is usually a pipeline disguised as a model call.

| Stage | What can go wrong | Why final-answer reward is insufficient |
| --- | --- | --- |
| Evidence extraction | The model reads the wrong row, column, point, label, object, or region | The final answer only says success/failure, not which visual evidence was missing |
| Reasoning | The model has the right evidence but applies the wrong comparison, count, or formula | The same final failure signal looks similar to a perception failure |
| Formatting | The model solves correctly but violates the expected output format | A verifier may penalize presentation rather than capability |
| Shortcutting | The model leaks or memorizes the answer instead of preserving evidence | The answer can be right while the intermediate evidence is operationally useless |

The paper’s contribution is not simply “add captions.” Captions are cheap. Bad captions are cheaper. The contribution is to make the caption useful to the downstream task while preventing it from becoming an answer-smuggling device. Very elegant. Slightly suspicious. Therefore, worth checking with ablations.

PRCO turns one model into two jobs: Observer and Solver

PRCO uses one shared multimodal policy but assigns it two roles through prompting.

The Observer receives the image and question. Its job is not to answer. Its job is to produce a question-conditioned evidence caption: the objects, values, labels, spatial relations, and visual facts likely needed for the answer. The Solver then receives the question and the Observer caption, and later in training can also consult the image.

The training loop is roughly:

1. Input: image + question.
2. Observer: write a task-relevant evidence caption.
3. Solver: answer from the caption, with image access when enabled.
4. Verifier: check the final answer.
5. Reward: the Solver is rewarded directly; the Observer is rewarded by downstream Solver success, unless the caption leaks the answer.

The clever part is the Observer reward. The authors do not try to grade caption “quality” in the abstract. That would invite a familiar enterprise nightmare: beautifully written captions that are irrelevant to the business question. Instead, the Observer receives a utility reward based on whether Solver rollouts conditioned on its caption reach the correct verified answer. A caption is good if it helps the Solver solve.

Formally, the Observer reward is the Solver’s verifier-validated success, multiplied by a leakage control term. If the caption leaks the final answer, the reward is suppressed. This is important because without leakage suppression, the Observer has an obvious loophole: instead of extracting evidence, it can simply write something equivalent to “the answer is B.” That would make the Solver look good and the evidence layer useless. AI systems, like interns and senior executives, discover badly designed incentives quickly.
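The Observer reward described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: averaging success over several Solver rollouts and using a hard 0/1 leakage gate are simplifications chosen here, and `observer_reward` is a hypothetical name rather than the authors' implementation.

```python
def observer_reward(solver_successes, leaked):
    """Utility reward for a single caption.

    solver_successes: booleans, one per Solver rollout conditioned on the caption,
                      each True when the verifier accepted the final answer.
    leaked:           True when a leakage checker flags the caption as revealing
                      the final answer.
    """
    if not solver_successes:
        return 0.0
    utility = sum(solver_successes) / len(solver_successes)
    leakage_gate = 0.0 if leaked else 1.0  # assumption: hard gate, not a soft penalty
    return utility * leakage_gate

# A caption that helps half of the Solver rollouts succeed.
print(observer_reward([True, True, False, False], leaked=False))
# The same success rate earns nothing if the caption smuggles in the answer.
print(observer_reward([True, True, False, False], leaked=True))
```

The multiplication is the point: a caption that makes the Solver look brilliant by handing over the answer is worth exactly as much as a caption that helps nobody.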

The Solver receives a more standard reward: mostly answer correctness, plus a smaller format-compliance component. The paper uses a default weighting of $\lambda = 0.9$, so correctness dominates while formatting still matters. This is not conceptually flashy, but it is practical. Verifiable reward systems often fail on boring formatting mismatches before they fail on intelligence. Boring failures still count.
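One plausible reading of that weighting is a convex combination of correctness and format compliance. The sketch below assumes that form; the text above only states that correctness dominates with $\lambda = 0.9$, so the exact formula and the name `solver_reward` are assumptions.

```python
def solver_reward(correct, well_formatted, lam=0.9):
    """Weighted sum of answer correctness and format compliance.

    Assumption: reward = lam * correctness + (1 - lam) * format, both in {0, 1}.
    """
    return lam * float(correct) + (1.0 - lam) * float(well_formatted)

cases = {
    "correct and well formatted": solver_reward(True, True),
    "correct, bad format": solver_reward(True, False),
    "wrong, good format": solver_reward(False, True),
    "wrong, bad format": solver_reward(False, False),
}
for name, reward in cases.items():
    print(f"{name}: {reward:.2f}")
```

Under this reading, a well-formatted wrong answer earns a small consolation reward, while a correct answer in the wrong format loses only a tenth of its credit.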

Caption-first warmup prevents the Solver from ignoring the evidence layer

One design choice deserves more attention than a summary table would give it: the caption-first warmup.

Early in training, PRCO removes image access from the Solver. The Solver must answer from the question and the Observer caption. After 40 warmup steps, image access is restored. This sounds like a small curriculum trick, but the mechanism is central.
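As a rough sketch, the warmup amounts to a step-dependent switch on what the Solver is allowed to see. The function name, the dictionary layout, and the zero-based step counting below are assumptions made for illustration; only the 40-step default comes from the setup described above.

```python
def build_solver_inputs(step, question, caption, image, warmup_steps=40):
    """Assemble Solver inputs; the image is withheld during caption-first warmup.

    Assumption: steps are counted from 0, so steps 0..warmup_steps-1 are caption-only.
    """
    inputs = {"question": question, "caption": caption}
    if step >= warmup_steps:
        inputs["image"] = image
    return inputs

early = build_solver_inputs(5, "Which bar is tallest?", "Bars: A=3, B=7, C=5", "chart.png")
late = build_solver_inputs(40, "Which bar is tallest?", "Bars: A=3, B=7, C=5", "chart.png")
print(sorted(early))  # caption and question only
print(sorted(late))   # image is back
```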

If the Solver sees both image and caption too early, it can learn to bypass the caption. Once that happens, the Observer’s captions stop mattering to the Solver’s outcome. And if captions do not change outcomes, the Observer receives a weak, low-contrast utility signal. The evidence layer becomes decorative. Very enterprise SaaS, unfortunately.

The authors support this interpretation with an ablation. On Qwen2.5-VL-7B, full PRCO reaches an overall average score of 49.63 across the eight benchmarks. Removing the warmup reduces the average to 47.16, a drop of 2.47 points. On Qwen2.5-VL-3B, full PRCO reaches 42.53, while removing warmup gives 41.86, a drop of 0.67 points. The larger drop for the 7B model is consistent with the idea that a stronger Solver is more capable of ignoring the caption and solving directly from the image.

The appendix adds a diagnostic: without warmup, the standard deviation of caption rewards decreases rapidly during training. Different captions become similarly useful—or similarly ignored. That is not a healthy learning signal. It is the training equivalent of asking three consultants for different analyses and then paying them all the same because the CEO never read the memo.
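The diagnostic is easy to reproduce in spirit. The sketch below computes the spread of utility rewards across a group of captions for the same question; the reward values are made up for illustration, and `caption_reward_spread` is a hypothetical helper, not the authors' logging code.

```python
from statistics import pstdev

def caption_reward_spread(caption_rewards):
    """Standard deviation of utility rewards across captions for one question."""
    return pstdev(caption_rewards)

# Hypothetical reward groups for four captions of the same image and question.
healthy_group = [1.0, 0.25, 0.75, 0.0]      # captions differ in usefulness
collapsed_group = [0.75, 0.75, 0.75, 0.75]  # Solver succeeds regardless of caption

print(f"healthy spread:   {caption_reward_spread(healthy_group):.3f}")
print(f"collapsed spread: {caption_reward_spread(collapsed_group):.3f}")
```

When the spread is zero, any group-normalized advantage for the Observer is zero too, so there is nothing left to learn from no matter how many captions are sampled.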

The benchmark gains are real, but the ablations explain why they matter

The headline results are strong. PRCO improves Qwen2.5-VL-3B from 34.88 to 42.53 average accuracy across eight benchmarks, a gain of 7.65 points. On Qwen2.5-VL-7B, it improves from 42.45 to 49.63, a gain of 7.18 points. It also beats GRPO and DAPO under the authors’ matched training setup, and at 7B it surpasses the strongest listed open-source baseline, VPPO, whose average is 47.60.

The eight benchmarks cover math-heavy visual reasoning and broader multimodal reasoning: MathVerse, MathVision, MathVista, WeMath, DynaMath, LogicVista, MMMU-Pro, and MMStar. The point is not that these benchmarks perfectly represent enterprise workflows. They do not. The point is that they stress a pattern common to enterprise visual tasks: the answer depends on extracting the right visual evidence before applying reasoning.

The ablations are more revealing than the leaderboard.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main benchmark comparison | Main evidence | PRCO improves average accuracy across model scales and benchmark types | It does not prove open-ended multimodal generation quality |
| Remove Solver updates | Ablation | Outcome-optimized reasoning remains necessary; Observer-only learning is not enough | It does not imply perception is unimportant |
| Remove Observer updates | Ablation | Utility-driven evidence extraction adds value beyond answer-only reasoning | It does not isolate every possible captioning method |
| Remove caption-first warmup | Ablation / mechanism test | The Solver needs forced early reliance on captions for the Observer signal to remain useful | It does not identify the optimal warmup schedule for all models |
| Pass@k analysis | Robustness / sampling-budget sensitivity | PRCO scales better when more sampled attempts are allowed | It does not replace single-shot deployment evaluation |
| Solver rollout group size | Sensitivity / compute trade-off | PRCO remains competitive even with smaller Solver rollout groups | It does not fully price training cost in commercial terms |
| Qwen3-VL-8B-Instruct test | Extension / stronger-backbone comparison | The method still helps on a stronger backbone | It does not prove the effect will hold for proprietary frontier models |
| Qualitative cases and attention overlays | Exploratory illustration | The Observer often focuses on question-relevant regions and evidence | Attention overlays are not causal proof of perception quality |

The role ablation is especially useful. On Qwen2.5-VL-7B, removing Solver updates (so that only the Observer learns) lowers the average to 44.52. Removing Observer updates (so that only the Solver learns) gives 48.19. Full PRCO reaches 49.63. The Solver update is more essential for final accuracy, which is unsurprising: the model still has to reason. But the Observer update adds a clear complementary gain of 1.44 points. More importantly, the Observer-only variant still improves over the base model, reaching 44.52 versus 42.45. That suggests the perception side is not decorative. It can move end-task accuracy even without Solver learning.

The appendix adds three more pressure tests on Qwen2.5-VL-7B. Keeping the Solver image input empty throughout training drops the average from 49.63 to 47.74. Replacing the coevolving Solver with a fixed utility estimator drops it to 48.22. Removing the leakage checker drops it to 48.68. These are not catastrophic failures, but they point in the same direction: PRCO works best when evidence extraction, reasoning, leakage control, and coevolving feedback are present together.
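Putting the Qwen2.5-VL-7B ablation averages quoted in this section side by side makes the ranking easier to see. The snippet below only does the subtraction; the variant labels are shorthand chosen here, not the paper's names.

```python
FULL_PRCO = 49.63  # Qwen2.5-VL-7B average across the eight benchmarks

# Average scores for each ablated variant, as reported in the text above.
ablations = {
    "no Solver updates": 44.52,
    "no caption-first warmup": 47.16,
    "Solver image always empty": 47.74,
    "no Observer updates": 48.19,
    "fixed utility estimator": 48.22,
    "no leakage checker": 48.68,
}

# Points lost relative to full PRCO, largest drop first.
drops = {name: round(FULL_PRCO - score, 2) for name, score in ablations.items()}
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} -{drop:.2f}")
```

Read this way, the Solver update is the largest single contributor, the warmup comes second, and the remaining components each cost between roughly one and two points when removed.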

The stronger-backbone extension is also worth reading carefully. On Qwen3-VL-8B-Instruct, PRCO reaches 63.05 average, compared with 59.75 for GRPO and 61.79 for DAPO. That is a comparison with trained baselines on the same backbone, so it is more informative than merely saying “it beats the base model.” Still, it remains one open model family under this experimental setup. Useful evidence, not a universal law.

The error analysis says PRCO improves seeing, not only answering

If the paper only reported final accuracy, PRCO could be dismissed as another reinforcement-learning wrapper around a benchmark suite. The error analysis is what makes the mechanism credible.

On WeMath, PRCO reduces perception errors by 39.2% and reasoning errors by 23.8% relative to Qwen2.5-VL-7B. On MathVista, the diagnostic figure shows reductions in both categories as well, with perception errors dropping by 31.4% and reasoning errors by 19.7%. The exact categorization is performed by GPT-5.1 using image, question, model response, and gold answer, so it should be treated as assisted diagnostic analysis rather than ground truth delivered from Mount Sinai. But it is aligned with the paper’s main claim and with the GRPO diagnostic at the beginning.

The paper also reports pass@k behavior. As the sampling budget increases, PRCO’s margin over baselines grows. On WeMath, its gap over DAPO increases from 3.53 at pass@1 to 7.33 at pass@32. On MMStar, PRCO leads VPPO by only 0.47 at pass@1, but the margin grows to 6.27 at pass@32. This suggests the model has a better distribution of candidate solutions, not merely a luckier greedy answer.
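For readers unfamiliar with the metric, pass@k is commonly estimated from n sampled attempts of which c are correct, using the unbiased estimator popularized in code-generation evaluation. The sketch below implements that standard estimator; whether the paper computes pass@k in exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n attempts containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical question: 32 sampled answers, 4 of them correct.
for k in (1, 8, 32):
    print(f"pass@{k} = {pass_at_k(32, 4, k):.3f}")
```

A model with a few correct answers scattered among its samples looks weak at pass@1 and strong at pass@32, which is why a margin that grows with k points to a better candidate distribution.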

For business use, pass@k has a practical interpretation. Many enterprise systems do not rely on a single answer. They sample, cross-check, vote, route uncertain cases, or ask a second model to verify. If a method improves the quality of the solution distribution under sampling, it can be useful in workflows that allow internal deliberation before surfacing a final answer. Of course, this costs compute. “Think more” is not a free strategy; finance departments remain tragically unimpressed by epistemic ambition.
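A minimal version of "sample, vote, route" looks like the sketch below. It is an illustration of the workflow pattern rather than anything from the paper; `vote_or_escalate` and the 0.6 agreement threshold are hypothetical choices.

```python
from collections import Counter

def vote_or_escalate(answers, min_agreement=0.6):
    """Majority-vote over sampled answers; return None when agreement is too low.

    Returns (answer_or_None, agreement_ratio). A None answer signals that the
    case should be routed to a human reviewer or a second verifier.
    """
    if not answers:
        return None, 0.0
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement < min_agreement:
        return None, agreement
    return top_answer, agreement

print(vote_or_escalate(["B", "B", "B", "A", "C"]))  # enough agreement to answer
print(vote_or_escalate(["A", "B", "C", "D"]))       # no consensus, escalate
```

Each extra sample is another model call, so the threshold and the number of samples are budget decisions as much as accuracy decisions.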

The enterprise lesson is evidence architecture, not caption decoration

The most practical reading of PRCO is not “use captions.” It is: separate evidence extraction from answer generation, and optimize the evidence layer by downstream utility.

This maps cleanly onto enterprise multimodal systems.

| Enterprise task | Evidence layer should extract | Solver should decide | PRCO-style lesson |
| --- | --- | --- | --- |
| Invoice or receipt checking | Vendor, dates, line items, totals, tax fields, currency, anomalies | Whether the document matches policy or accounting rules | Reward extraction by downstream reconciliation success, not by pretty OCR summaries |
| Chart-based reporting | Axes, units, plotted values, labels, trend breaks, legends | What the chart implies for performance or risk | Force the model to preserve task-relevant visual facts before narrative analysis |
| Compliance form review | Checked boxes, missing signatures, field contradictions, dates | Whether the submission passes a rule set | Separate visual field extraction from legal or procedural reasoning |
| Engineering diagram QA | Components, connections, labels, dimensions, spatial relations | Whether a design satisfies a constraint | Do not let the final answer hide a misread diagram |
| Dashboard monitoring | Relevant widgets, thresholds, unusual values, time windows | Whether intervention is needed | Optimize what gets read, not just what gets concluded |

The paper’s Observer is not a production extractor. It is a research role inside a training framework. But the design pattern is exportable. In production, an enterprise system can expose an evidence object before the final answer: selected rows, fields, coordinates, values, OCR spans, chart points, diagram relations, or page regions. That evidence object can be tested, audited, corrected, cached, and reused.
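What such an evidence object might look like is sketched below for the invoice case. Every name here (`EvidenceItem`, `EvidenceObject`, `totals_match`, the field labels) is hypothetical and chosen for illustration; the paper itself works with free-text captions, not structured records.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class EvidenceItem:
    kind: str                      # e.g. "ocr_span", "table_cell", "chart_point"
    label: str                     # what the value represents
    value: str                     # value as read from the document
    page: Optional[int] = None
    bbox: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)

@dataclass
class EvidenceObject:
    question: str
    items: List[EvidenceItem] = field(default_factory=list)

    def get(self, label: str) -> Optional[str]:
        """Return the first value recorded under a label, or None."""
        for item in self.items:
            if item.label == label:
                return item.value
        return None

def totals_match(evidence: EvidenceObject) -> bool:
    """Downstream check that uses only the evidence object, never the raw image."""
    subtotal, tax, total = (evidence.get(k) for k in ("subtotal", "tax", "total"))
    if None in (subtotal, tax, total):
        return False
    return abs(float(subtotal) + float(tax) - float(total)) < 0.005

evidence = EvidenceObject(
    question="Does the invoice total equal subtotal plus tax?",
    items=[
        EvidenceItem("ocr_span", "subtotal", "120.00", page=1),
        EvidenceItem("ocr_span", "tax", "9.60", page=1),
        EvidenceItem("ocr_span", "total", "129.60", page=1),
    ],
)
print(totals_match(evidence))
```

Because the check reads only the evidence object, a wrong verdict can be traced to a specific extracted field rather than to an opaque model call.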

That is where ROI may appear. Not in the mystical claim that the model “understands images better,” but in cheaper diagnosis. When the final answer is wrong, a team can inspect whether the evidence layer failed or the reasoning layer failed. If the evidence is wrong, improve OCR, visual parsing, retrieval, document preprocessing, or targeted fine-tuning. If the evidence is right, improve rules, reasoning prompts, validators, or domain-specific solvers. Blame becomes harder to misallocate. Good.
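The triage logic in that paragraph fits in a small decision function. The category names are hypothetical labels picked for this sketch, and it assumes someone (a reviewer or an automated check) can mark the extracted evidence as correct or not.

```python
def triage(evidence_correct, answer_correct):
    """Decide which layer to investigate for a single reviewed case."""
    if evidence_correct and answer_correct:
        return "ok"
    if answer_correct:
        # Right answer from wrong evidence: likely a shortcut or a lucky guess.
        return "shortcut_suspected"
    if evidence_correct:
        return "reasoning_layer"   # fix rules, prompts, validators, solvers
    return "evidence_layer"        # fix OCR, parsing, preprocessing, fine-tuning

reviewed_cases = [(True, True), (False, True), (True, False), (False, False)]
for evidence_ok, answer_ok in reviewed_cases:
    print(evidence_ok, answer_ok, "->", triage(evidence_ok, answer_ok))
```

The second branch is the one a final-answer metric never surfaces: the answer was right, so nobody looks at how it got there.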

There is also a product-design implication. Many multimodal AI interfaces still return a final answer with a vague explanation. PRCO suggests a better contract: show the evidence the model used before asking users to trust the answer. In regulated or high-value workflows, this is not cosmetic. It is the difference between an assistant and a liability with a chat box.

The boundaries are clear: verifiable answers, caption bottlenecks, and extra supervision

The paper is careful about its own limits, and those limits matter for business interpretation.

First, PRCO is evaluated on benchmarks with concise, verifiable answers. That fits many operational tasks—classification, extraction, matching, compliance checks, numeric answers—but not all multimodal work. Open-ended visual reporting, design critique, medical image interpretation, and creative generation have fuzzier reward signals. PRCO’s verifier-driven setup becomes harder when there is no short answer to verify.

Second, captions are lossy. The authors explicitly note that short textual captions may fail to preserve layout, texture, fine spatial relations, and geometric details. This is not a small caveat. For enterprise documents, layout is often meaning. A signature placed under the wrong clause matters. A table header spanning two columns matters. A chart axis scale matters. A caption can preserve some of this, but not always faithfully.

The practical extension is to treat “caption” as one possible intermediate representation, not the only one. A production evidence layer may need structured JSON, bounding boxes, OCR spans, table coordinates, chart data extraction, or graph representations. The PRCO principle survives: optimize the intermediate evidence by downstream task success. The literal caption format may not.

Third, PRCO uses auxiliary supervision for leakage detection and answer verification. In the experiments, Qwen3-VL-8B-Instruct is used as the auxiliary leakage checker. That adds noise, cost, and another model dependency. In a business setting, the equivalent leakage or shortcut problem may look different: a model might copy an answer from visible multiple-choice options, overfit a document template, or infer from file metadata rather than visual content. The anti-shortcut mechanism has to match the workflow.
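For a sense of what a workflow-specific shortcut check could look like at the cheap end, here is a crude rule-based stand-in. It is explicitly not the paper's approach, which uses a model as the checker; the patterns and the name `looks_like_leak` are hypothetical, and a rule list like this will miss paraphrased leaks.

```python
import re

# Hypothetical phrases that signal a caption is stating a verdict, not evidence.
LEAK_PATTERNS = [
    re.compile(r"\bthe\s+(?:final\s+|correct\s+)?answer\s+is\b", re.IGNORECASE),
    re.compile(r"\bcorrect\s+(?:option|choice)\s+is\b", re.IGNORECASE),
]

def looks_like_leak(caption):
    """Return True when the caption contains an explicit answer statement."""
    return any(pattern.search(caption) for pattern in LEAK_PATTERNS)

captions = [
    "The bar for 2021 reaches 7 on the y-axis; the bar for 2022 reaches 10.",
    "The answer is B.",
    "Based on the table, the correct option is (C).",
]
for caption in captions:
    print(looks_like_leak(caption), "|", caption)
```

A numeric value appearing in a caption is deliberately not treated as a leak here, since reading a number off the image is often exactly the evidence the Solver needs.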

Finally, the paper trains on 8 NVIDIA H200 GPUs with 200 optimization steps, rollout batch size 384, Observer group size 4, Solver group size 8, and a 40-step warmup. This is not a casual weekend fine-tune on a laptop. The enterprise value is therefore less likely to come from every company reproducing PRCO training from scratch. It is more likely to come from adapting the architecture pattern: build inspectable evidence stages, create verifiers where possible, and reward downstream utility instead of generic description quality.

The useful question is not whether the model thinks, but what it paid attention to

PRCO is a good paper because it refuses a comfortable simplification. Multimodal reasoning is not just reasoning with pixels attached. It is a chain of perception, evidence selection, representation, reasoning, and answer verification. A final-answer reward compresses that chain into one scalar and hopes optimization will sort out the rest. Sometimes it does. Often it merely teaches the model to sound more deliberate while still looking at the wrong thing.

The paper’s answer is to split the job without splitting the model: one shared policy, two roles, two reward signals, and a training sequence that makes the evidence layer matter before letting the Solver look back at the image. The result is not only higher average scores. The more interesting result is a reduction in perception errors and a set of ablations showing why the reduction is plausible.

For Cognaptus readers, the takeaway is simple enough to be operational: when building multimodal AI for charts, tables, forms, diagrams, and visual reports, do not ask only whether the final answer is right. Ask what evidence the system extracted, whether that evidence was useful, whether it leaked shortcuts, and whether downstream success actually depends on it.

Synthetic sense appears when a model’s self-generated intermediate evidence becomes testable, useful, and constrained by the task. Synthetic nonsense appears when the model writes fluent visual notes that nobody verifies, nobody audits, and everybody pretends are “reasoning.”

The difference is not poetry. It is credit assignment.

Cognaptus: Automate the Present, Incubate the Future.


1. Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, and Jing Shao, “Seeing with You: Perception–Reasoning Coevolution for Multimodal Reasoning,” arXiv:2603.28618v2, 9 Apr 2026, https://arxiv.org/abs/2603.28618