Rewarding Bad Physics Habits: What VLMs Learn When You Pay Them to Reason

A factory camera sees a pressure gauge. The AI reads the image, explains the mechanism, applies the formula, and recommends an action. Everyone in the meeting relaxes, because the model has produced a neat chain of reasoning.

That is usually the moment to become nervous.

The dangerous part is not that a vision-language model can be wrong. We know that. The more interesting problem is that a model can become wrong in a very specific way because we trained it to chase the wrong reward. Pay it for clean formatting, and it learns to look organized. Pay it for final answers, and it may sacrifice the reasoning path. Pay it to stare at the image, and it may do better on spatial problems while forgetting that physics also contains formulas. Apparently, “look harder” is not a complete theory of mechanics.

The paper Reward Design for Physical Reasoning in Vision-Language Models studies exactly this incentive problem.¹ The authors train IBM Granite Vision 3.3 2B on the PhyX benchmark using Group Relative Policy Optimization, or GRPO, and compare reward designs across multiple-choice and open-ended physics questions. The useful contribution is not a new universal recipe. It is more uncomfortable: reward design changes the behavior being optimized, and different rewards improve different pieces of physical reasoning.

For business readers, this is the part worth keeping. The paper is not saying “GRPO beats SFT, therefore use reinforcement learning.” It is saying that reward functions are operational policy. They define what the AI system is allowed to become good at.

The study compares incentives, not just models

The experimental setup is clean enough to be commercially interesting. The authors use PhyX, a multimodal physics benchmark with problems spanning Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. The tasks also vary by reasoning type: physical model grounding, spatial relation, multi-formula reasoning, implicit condition reasoning, numerical reasoning, and predictive reasoning.

The model is IBM Granite Vision 3.3 2B. That size matters. A 2B VLM is not a frontier-scale system; it is closer to the class of models enterprises might actually fine-tune, host, audit, and pay for without turning the GPU budget into a hostage situation. The authors train on 3,000 multiple-choice problems and 3,000 open-ended problems, then evaluate on the 1,000-problem PhyX testmini set.
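For readers who have not met GRPO before, the core idea is that each sampled answer is scored relative to the other answers drawn for the same question. The sketch below shows that group-relative normalization only. It is a minimal illustration, not the authors' training code: the function name, the use of population standard deviation, and the `eps` constant are assumptions made here, and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled completion relative to its own group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. Population std and `eps` are choices made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one physics question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Whatever the reward function pays for is what ends up above the group mean, which is why the choice of reward matters so much in the comparisons that follow.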

They compare seven conditions:

Condition	What changes	Likely behavioral pressure
Baseline	No additional fine-tuning	Whatever the instruction-tuned model already knows
SFT	Supervised fine-tuning	Imitate reference outputs
GRPO (Fmt)	Reward output tags only	Produce extractable, well-structured answers
GRPO (Fmt + Acc)	Reward format plus answer correctness	Maximize final answer hit-rate
GRPO (Rubric)	Reward correctness, principle, unit, and reasoning quality	Produce more disciplined physics explanations
GRPO (ASM)	Reward attention on foreground image regions	Ground generation more tightly in visual content
GRPO (Fmt + Acc + ASM)	Combine answer accuracy with visual grounding	Balance final answer and image attention

This is why the paper reads better as a comparison than as a leaderboard. It is not organized around one champion method. It is a set of incentive experiments. Each reward answers a different product question:

Can the model produce outputs we can parse?
Can it select the right answer?
Can it show the right physics principle and units?
Can it visually attend to the important image region?

Those questions sound related. They are not interchangeable.

Format rewards make the model behave, not think

The simplest reward is format compliance. The model is asked to generate four tagged fields: <think>, <answer>, <unit>, and <principle>. The reward checks whether the required structure appears.

This is not a trivial detail. In production systems, structured output is often the first thing teams need. A model that gives the right answer in an unparseable format is still a nuisance. A model that gives a beautiful paragraph when the workflow needs a JSON field is not being poetic; it is breaking the pipeline.
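A format check of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: it pays 1.0 only when each of the four tags appears exactly once as a well-formed pair, whereas the paper's reward may grant partial credit or apply different rules. The function and constant names are hypothetical.

```python
import re

# The four tagged fields named in the article.
REQUIRED_TAGS = ("think", "answer", "unit", "principle")


def format_reward(completion: str) -> float:
    """Return 1.0 if every required tag appears exactly once, else 0.0.

    All-or-nothing scoring is an assumption made for this sketch.
    """
    for tag in REQUIRED_TAGS:
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", completion, flags=re.DOTALL)
        if len(matches) != 1:
            return 0.0
    return 1.0


good = (
    "<think>F = ma</think><answer>B</answer>"
    "<unit>N</unit><principle>Newton's second law</principle>"
)
missing_unit = "<think>F = ma</think><answer>B</answer><principle>Newton</principle>"
```

Note what the function never inspects: whether the content inside the tags is true.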

But format compliance is scaffolding, not reasoning. In the paper’s best single-run results, GRPO (Fmt) improves over the raw baseline on multiple-choice overall accuracy, from 0.217 to 0.304, but it remains well behind SFT at 0.433 and the stronger GRPO reward variants. In open-ended tasks, the same format-only condition reaches 0.017 overall, slightly above the baseline and SFT, but still economically tiny.

The interpretation is simple: forcing the answer into a clean box helps evaluation and extraction, but the box is not intelligence. Enterprises repeatedly forget this, because dashboards reward the presence of fields. The model has filled the field, therefore the system must be working. A charming illusion, and not even an original one.

Accuracy rewards win the headline, but not the audit

The accuracy reward is the blunt instrument: reward the model when the answer is correct. For multiple-choice questions, correctness is evaluated by exact match against the option letter. For open-ended questions, the authors use an LLM judge that can assign partial credit.
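For the multiple-choice case, an exact-match reward might look like the sketch below. It assumes the answer lives inside an `<answer>` tag and that comparison is case-insensitive after trimming whitespace; both are choices made here for illustration, and the helper names are hypothetical. The LLM-judged open-ended reward is not shown.

```python
import re


def extract_answer(completion: str):
    """Pull the text inside the first <answer>...</answer> pair, if any."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def mcq_accuracy_reward(completion: str, gold_letter: str) -> float:
    """Return 1.0 only when the extracted answer equals the gold option letter."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.upper() == gold_letter.strip().upper() else 0.0
```

Exact match is strict by design: an answer such as `B) 3 m/s` earns nothing here, which is one reason format and accuracy rewards tend to be paired.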

On the headline metric, accuracy-based reward works. In best single-run MCQ results, GRPO (Fmt + Acc) scores 0.460 overall, compared with 0.433 for SFT. Adding the attention reward gives GRPO (Fmt + Acc + ASM) a slightly higher 0.462, the best single-run MCQ overall score in the main table.

Method	MCQ overall, best single run	OE overall, best single run
Baseline	0.217	0.012
SFT	0.433	0.011
GRPO (Fmt)	0.304	0.017
GRPO (Fmt + Acc)	0.460	0.027
GRPO (Rubric)	0.440	0.018
GRPO (ASM)	0.352	0.014
GRPO (Fmt + Acc + ASM)	0.462	0.022

The open-ended numbers should stop anyone from over-celebrating. The best OE score is only 0.027. Yes, that is higher than SFT’s 0.011. No, that does not mean a small VLM is ready to freely reason through mission-critical engineering physics from images. There is a difference between “relative improvement” and “commercially reliable.” Procurement decks often blur that line. Reality is less cooperative.

The more important nuance appears in the appendix. The authors report top-5 mean performance and note GRPO’s run-to-run variability. On MCQ, SFT shows the least variance across runs. The best single GRPO run can beat SFT, but the mean of several GRPO runs does not automatically dominate it. For example, the top-5 mean MCQ overall scores are 0.433 for SFT, 0.426 for GRPO (Fmt + Acc + ASM), 0.417 for GRPO (Rubric), and 0.390 for GRPO (Fmt + Acc).

That distinction changes the business reading. Accuracy reward can produce the best run, but a production training pipeline needs repeatability. If surpassing SFT depends on selecting among multiple GRPO runs, then the real cost includes experiment management, evaluation discipline, and failed-run disposal. Reinforcement learning did not eliminate governance work. It merely made the spreadsheet longer.
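The gap between the two views is easy to make explicit. The sketch below uses only the MCQ overall figures quoted in this article (best single run and top-5 mean) and computes how much each method's headline depends on picking its best run. The function names are illustrative, not from the paper.

```python
# (best single run, top-5 mean) MCQ overall scores as quoted in this article.
MCQ_SCORES = {
    "SFT": (0.433, 0.433),
    "GRPO (Fmt + Acc)": (0.460, 0.390),
    "GRPO (Rubric)": (0.440, 0.417),
    "GRPO (Fmt + Acc + ASM)": (0.462, 0.426),
}


def selection_gap(scores):
    """Best-run score minus top-5 mean: how much the headline relies on run selection."""
    return {name: round(best - top5, 3) for name, (best, top5) in scores.items()}


def rank_by(scores, index):
    """Rank method names by the chosen column (0 = best run, 1 = top-5 mean)."""
    return sorted(scores, key=lambda name: scores[name][index], reverse=True)


gaps = selection_gap(MCQ_SCORES)
best_ranking = rank_by(MCQ_SCORES, 0)
mean_ranking = rank_by(MCQ_SCORES, 1)
```

On these figures the ranking flips: GRPO (Fmt + Acc + ASM) leads on best run, while SFT leads on top-5 mean and GRPO (Fmt + Acc) drops to last.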

Rubric rewards buy reasoning discipline by spending some accuracy

The rubric reward is the most attractive to managers who want “explainable AI.” It scores not just correctness, but also whether the model identifies the relevant physics principle, reports a consistent unit, and provides valid reasoning. For open-ended answers, these dimensions are judged with an LLM-based rubric.

This sounds obviously better. It is also exactly where the paper becomes useful.
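To make the rubric concrete, here is one way the four dimensions could be folded into a single scalar. This is a sketch under stated assumptions: the article does not give the paper's weighting, so the equal weights below are hypothetical, as are the class and function names. Each component score is assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    correctness: float  # is the final answer right?
    principle: float    # is the relevant physics principle identified?
    unit: float         # is the unit consistent?
    reasoning: float    # is the reasoning valid?


# Hypothetical equal weights; the paper's actual weighting is not stated here.
WEIGHTS = {"correctness": 0.25, "principle": 0.25, "unit": 0.25, "reasoning": 0.25}


def rubric_reward(scores: RubricScores, weights=WEIGHTS) -> float:
    """Weighted average of the rubric dimensions."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * getattr(scores, name) for name in weights)
    return weighted / total_weight
```

Under equal weights, a wrong answer with sound principle, unit, and reasoning still earns 0.75. That is the design trade in miniature: the model is paid for discipline even when the final answer misses.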

Rubric reward does not consistently beat simpler accuracy reward on top-line accuracy. In best single-run MCQ, GRPO (Rubric) reaches 0.440 overall: better than SFT’s 0.433, but below GRPO (Fmt + Acc) at 0.460 and GRPO (Fmt + Acc + ASM) at 0.462. In open-ended best single-run results, Rubric reaches 0.018, below Fmt + Acc at 0.027 and below Fmt + Acc + ASM at 0.022.

A lazy reading would say the rubric failed. The paper’s evidence suggests something more interesting: the rubric changes what the model optimizes. In the discussion, the authors compare reward dynamics between rubric and format-plus-accuracy training. The rubric model’s secondary dimensions—principle, unit, and reasoning components—increase over training, even while overall accuracy does not rise in the same smooth way. They also report that rubric training improves reasoning quality over training, with the reasoning-judge trend reported as $R^2 = 0.617$.

By contrast, the format-plus-accuracy model can become better at answer selection while producing reasoning that supports the answer less reliably. That is an enterprise failure mode, not merely an academic curiosity. In audit-heavy settings, a model that gives a correct answer with a weak or inconsistent explanation may be worse than a less accurate model that exposes its uncertainty and reasoning path. The first one hides the failure until the meeting is already expensive.

A better framing is this:

Deployment priority	Reward bias that fits	What may be sacrificed
Maximize answer hit-rate	Accuracy reward	Explanation integrity
Require traceable physics steps	Rubric reward	Some top-line accuracy and training stability
Ensure parseable workflow output	Format reward	Substantive reasoning
Improve perception-heavy tasks	Attention reward	Symbolic reasoning in some domains

This is not a moral hierarchy. It is operational triage. A field-service assistant, a safety-review system, a homework tutor, and a scientific workflow agent do not need the same failure profile.

Attention rewards help the model look at the right thing—until looking is not enough

The paper’s most distinctive contribution is the attention-based reward: ASM, or Attention Score Mask. Instead of requiring human-labeled bounding boxes, the method uses the model’s own attention weights. It reconstructs attention over image tokens, maps it back to image space, constructs a foreground mask by separating non-white content from white background, and rewards attention mass that falls on the foreground.

That is technically clever because it supervises visual grounding without extra spatial annotations. In commercial language: it reduces the labeling burden. The method is not free—implementation requires forward hooks, attention reconstruction, and extra forward-pass work—but it avoids asking humans to draw boxes around every relevant diagram component.
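The masking idea can be illustrated without any model at all. The sketch below assumes the attention map has already been reconstructed and resized to the image grid, treats the image as a grayscale grid, and scores the fraction of attention mass that lands on non-white cells. The threshold, the normalization, and the function names are assumptions made here; the paper's score is evidently on a different scale, so this shows the idea rather than reproducing the metric.

```python
def foreground_mask(image, white_threshold=250):
    """Mark cells darker than near-white as foreground.

    `image` is a 2D list of grayscale values in 0..255. The threshold
    is a hypothetical choice for this sketch.
    """
    return [[pixel < white_threshold for pixel in row] for row in image]


def attention_mask_score(attention, mask):
    """Fraction of total attention mass that falls on foreground cells."""
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    on_foreground = sum(
        weight
        for attn_row, mask_row in zip(attention, mask)
        for weight, is_foreground in zip(attn_row, mask_row)
        if is_foreground
    )
    return on_foreground / total


# Toy 2x2 example: one dark cell (the "diagram") on a white background.
image = [[255, 0], [255, 255]]
attention = [[0.1, 0.6], [0.2, 0.1]]
score = attention_mask_score(attention, foreground_mask(image))
```

The only supervision signal is the image itself, which is what removes the need for human-drawn boxes.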

The result is sharply domain-specific. The attention reward improves spatial relation reasoning from 0.27 to 0.50 when added on top of format reward. The authors describe this as the largest reward gain across reasoning types in the study. They also validate that the ASM-trained model concentrates attention more strongly on meaningful foreground regions, reporting the highest attention mask score, 0.0531, and the lowest entropy among non-SFT models, 0.8912.
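Entropy is the other half of that validation: lower entropy means attention is concentrated on fewer image tokens. The sketch below computes Shannon entropy normalized by its maximum, so a uniform spread scores 1 and a single spike scores 0. Whether the paper normalizes in this way is an assumption made here, and the function name is hypothetical.

```python
import math


def normalized_attention_entropy(weights):
    """Shannon entropy of an attention distribution, scaled to [0, 1].

    `weights` are non-negative attention values over image tokens.
    Dividing by log(N) is a normalization chosen for this sketch.
    """
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))
```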

So yes, ASM appears to make the model visually attend in a more focused way.

The catch is that physics is not only vision. GRPO (ASM) alone performs poorly in several symbolic domains in the main MCQ table: Thermodynamics at 0.139 and Wave/Acoustics at 0.164, both far below SFT and below several other GRPO conditions. This matters because many enterprise multimodal tasks are hybrids. A maintenance diagram, a circuit image, or an engineering chart may require both visual grounding and symbolic inference. If reward pressure over-privileges looking at the image, the model may underinvest in the formula chain.

The paper’s interpretation is cautious: at 2B scale, visual grounding supervision and symbolic reasoning capacity may compete for representational resources. That hypothesis should not be overgeneralized to every VLM size and architecture. But it is exactly the kind of hypothesis product teams should test before deploying “AI vision reasoning” into inspection, safety, or engineering workflows.

The appendix is not decoration; it changes how confident we should be

The main table gives the best single-run results. The appendix gives top-5 mean and standard deviation. That second view matters because GRPO can be volatile.

The top-5 mean table shows that SFT remains a stable baseline under MCQ. GRPO (Fmt + Acc + ASM) is competitive and close, but its mean overall MCQ performance is slightly below SFT. Rubric is also close. Accuracy-only GRPO is less impressive under the top-5 mean view than under the best-run view.

This does not invalidate the paper’s contribution. It clarifies it. The contribution is not “GRPO always wins.” The contribution is that reward selection changes the kind of competence produced, and that competence can be unstable across runs.

For enterprise AI teams, the lesson is painfully practical:

Evidence in the paper	Likely purpose	What it supports	What it does not prove
Best single-run accuracy table	Main evidence	Some GRPO reward designs can beat SFT on selected headline metrics	That GRPO is always superior or stable
Top-5 mean and standard deviation	Robustness / sensitivity view	GRPO has meaningful variability; multiple runs may be needed	That the best run is irrelevant
Spatial reasoning gain from 0.27 to 0.50	Ablation-style reward evidence	ASM helps perception-heavy spatial reasoning	That ASM improves all physics reasoning
Rubric reward dynamics	Mechanism evidence	Rubric components improve reasoning discipline	That richer reward always improves accuracy
Attention entropy and mask score	Implementation validation	ASM changes visual attention behavior	That focused attention always means correct reasoning

This is the difference between reading a benchmark and reading an experiment. Benchmarks invite ranking. Experiments invite diagnosis. The second is more useful, and less convenient for marketing.

What Cognaptus infers for business use

The paper directly shows that, for Granite Vision 3.3 2B on PhyX, GRPO reward design affects both accuracy and reasoning behavior. Accuracy-heavy rewards are strongest on headline answer accuracy. Rubric rewards improve structured reasoning signals without reliably dominating accuracy. Attention-based rewards improve spatial reasoning and visual grounding metrics, while hurting some symbolic domains. Open-ended performance remains very low in absolute terms.

From this, Cognaptus would infer three practical design rules.

First, choose the reward after choosing the failure mode. If the application is diagram-based troubleshooting where the main risk is missing a visible component, attention-style grounding may be valuable. If the application is regulatory reporting or engineering review, rubric-style rewards may be more important because the reasoning trail matters. If the application is a bounded multiple-choice diagnostic assistant, accuracy reward may be appropriate. The reward should match the operational failure that hurts most.

Second, separate evaluation dashboards by behavior type. Do not collapse answer accuracy, reasoning quality, unit correctness, and visual grounding into one decorative “AI quality” score. That kind of metric is an executive lullaby. It sounds soothing and prevents thought. The paper shows these behaviors can move in different directions.

Third, budget for reward iteration, not just model selection. A company that says “we will use VLM X” has not made the hard decision. The hard decision is what to reward, how to validate it, how many runs to tolerate, and which trade-offs to accept. In this study, even a small shift from accuracy reward to rubric reward changes what the model learns to prioritize.
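The second rule is the easiest to turn into tooling. One possible shape, sketched below, keeps the behaviors as separate fields and reports which ones regressed between two checkpoints instead of averaging them away. The field names follow the behaviors listed above. The class, the function, and the example values used with them are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BehaviorReport:
    answer_accuracy: float
    reasoning_quality: float
    unit_correctness: float
    visual_grounding: float


def regressions(before: BehaviorReport, after: BehaviorReport, tolerance=0.0):
    """Names of behaviors that got worse by more than `tolerance`."""
    old, new = asdict(before), asdict(after)
    return [name for name in old if new[name] < old[name] - tolerance]


# Hypothetical checkpoints: accuracy rises while reasoning quality slips.
before = BehaviorReport(0.40, 0.50, 0.60, 0.30)
after = BehaviorReport(0.46, 0.45, 0.60, 0.30)
flagged = regressions(before, after)
```

A single blended score would show this pair of checkpoints as an improvement. The per-behavior view flags `reasoning_quality` instead.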

A useful internal checklist would look like this:

Business question	Technical question	Evidence to demand before deployment
Do we need the right final answer above all else?	Does accuracy reward outperform SFT across repeated runs?	Mean and variance, not only best run
Do explanations need to survive audit?	Does rubric reward improve reasoning consistency?	Component-level reasoning, unit, and principle scores
Is the image essential to the task?	Does attention reward improve task-specific visual grounding?	Spatial-reasoning results and attention validation
Can the workflow parse outputs reliably?	Does format reward produce stable tagged fields?	Extraction success rate and downstream error rate
What failure is unacceptable?	Which reward reduces that failure without creating a worse one?	Scenario tests, not generic benchmark averages

None of this is glamorous. Good. Glamour is how teams end up with a model that explains a wrong answer in flawless prose.

Boundaries that should not be hand-waved

The study’s boundaries are not footnotes; they shape how the result should be used.

The model is a 2B Granite Vision model. Larger VLMs may handle multi-objective rewards differently. Reward complexity that destabilizes a small model may become manageable at larger scale—or may fail in more subtle ways. The paper does not settle that.

The domain is physical reasoning on PhyX. That is valuable because physics combines visual interpretation with symbolic reasoning, but it is not the same as medical imaging, financial chart interpretation, construction-site inspection, or legal document review. The reward trade-offs may transfer as a pattern, not as exact numbers.

The open-ended evaluation uses an LLM judge. That is reasonable for variable scientific answers, but it remains a mediated evaluation method. When the measured scores are around 0.01 to 0.03, even relative gains need careful handling. A 100% improvement from a very small base is not automatically deployment readiness. It is sometimes just arithmetic wearing a nicer jacket.
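The arithmetic is worth doing once with the article's own numbers. The sketch below takes SFT's best open-ended score (0.011) and the best open-ended score overall (0.027) and reports both the absolute and the relative gain. The function name is illustrative.

```python
def gains(baseline: float, improved: float):
    """Return (absolute gain, relative gain) of `improved` over `baseline`."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


# Open-ended overall scores quoted in this article: SFT vs GRPO (Fmt + Acc).
absolute, relative = gains(0.011, 0.027)
```

The relative gain is roughly 145%. The absolute gain is 0.016 on a scale where 1.0 is a perfect score. Both numbers are true, but only one of them belongs in a deployment decision.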

The attention reward uses foreground masks based on non-white image regions. That is suitable for many diagram-like physics images, but it may be less straightforward for complex natural scenes where the meaningful region is not simply “non-white foreground.” The annotation-free design is attractive, but the assumption behind the mask must match the image domain.

Finally, GRPO variability matters. If the best result appears in one run while mean performance remains close to or below SFT, then training governance becomes part of the method. Multiple runs, selection criteria, and honest reporting are not optional extras.

The real lesson: rewards are product requirements in disguise

The paper’s most useful message is not that accuracy rewards are good, rubric rewards are noble, or attention rewards are clever. All three statements are too simple.

The better lesson is that reward functions encode product priorities. A model trained to maximize final answers may learn to treat reasoning as theater. A model trained to satisfy a rubric may become more disciplined but less competitive on raw hit-rate. A model trained to attend to the image may improve spatial reasoning while losing ground where formulas dominate.

That is not a defect in reward design. That is reward design doing its job. The problem begins when teams pretend they asked for “better reasoning” as if reasoning were a single measurable substance, like battery capacity or warehouse rent.

For Cognaptus readers building AI systems, the operational takeaway is direct: do not ask which reward is best. Ask which behavior you are buying, which behavior you are neglecting, and which failure will become invisible after optimization.

Because once the model learns your incentive system, it will optimize it more faithfully than your employees ever did. Whether that is comforting depends entirely on what you rewarded.

Cognaptus: Automate the Present, Incubate the Future.

¹ Derek Lilienthal, Manisha Mukherjee, and Sameera Horawalavithana, “Reward Design for Physical Reasoning in Vision-Language Models,” arXiv:2604.13993, 2026, https://arxiv.org/abs/2604.13993.
