Sight Unseen: How LVLM Alignment Can Teach Models to Ignore Images

Image inspection has one rude requirement: the model should look at the image.

That sounds too obvious to be an article thesis, which is usually a warning sign. In real deployments, a large vision-language model may describe a damaged package, summarize a product photo, inspect a dashboard screenshot, answer a question about an invoice, or guide a visual agent through a web interface. When it gets something wrong, the default diagnosis is familiar: the vision encoder missed the object, the dataset was noisy, the benchmark was weak, or the model simply hallucinated because models hallucinate. Very tidy. Also incomplete.

The paper “Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation” argues that one important failure can be produced by the alignment process itself.¹ The model may improve during visual instruction tuning or preference optimization while gaining much of that improvement from text-only conditioning. In plain language: the training objective can reward answers that sound better even when the image is doing less work. A multimodal model can become better at sounding like it saw the picture. This is not exactly the dream brochure.

The useful part of the paper is not merely that the authors propose two losses and report better benchmark numbers. The useful part is the causal chain: conditional-probability training and preference optimization can increase text-only reliance; that reliance becomes measurable as language bias; and simple training-time penalties can reduce it without adding new data or auxiliary models. For businesses, the message is uncomfortable but practical: hallucination in LVLMs is not only an output-quality problem. It is an alignment-infrastructure problem.

The misconception: better alignment does not automatically mean better grounding

A common business reading of multimodal alignment is pleasantly simple. First, pre-train a model on image-text data. Then perform visual instruction tuning so it can follow user instructions. Then apply preference optimization so it gives answers humans like. Each stage should move the model closer to being useful. The staircase goes up. Everyone applauds. Procurement asks for a demo.

The paper complicates that staircase.

Visual instruction tuning optimizes the likelihood of the target response given both text and image, usually written as $\pi_\theta(y \mid x, v)$, where $x$ is the instruction, $v$ is the visual input, and $y$ is the response. That objective looks multimodal because the image is present. But the presence of the image in the input does not prove that the model’s improvement comes from using the image. If the same response also becomes much more likely under text-only conditioning, $\pi_\theta(y \mid x)$, the model may have learned the linguistic pattern of the answer more than the visual grounding needed to justify it.

That is the paper’s central diagnostic move. Instead of treating hallucination as a vague behavioral defect, the authors compare two training gains:

$$ R_{\mathrm{VIT}} = \log \frac{\pi_\theta(y \mid x, v)}{\pi_{\mathrm{ref}}(y \mid x, v)}, \qquad B_{\mathrm{VIT}} = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}. $$

The first term asks: how much did the model improve when it could see both the prompt and the image? The second asks: how much did it improve when it had only the prompt? If both rise together, the model is not necessarily becoming visually smarter. It may be becoming a more confident language model wearing multimodal clothing.

The authors then generalize the text-only gain as a language-bias measure:

$$ B = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}. $$

This is not a perfect philosophical definition of “bias.” It is more useful than that. It is an operational measurement: relative to a reference model, how much has the current model improved at producing the response without seeing the image?
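To make the two gains concrete, here is a minimal sketch that computes them from per-token log-probabilities. The numbers are invented for illustration and the helper names are hypothetical; in practice the log-probabilities would come from scoring the same response under the current and reference models, once with the image and once without.

```python
def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a sequence-level log-likelihood."""
    return sum(token_logprobs)


def log_ratio_gain(policy_token_logprobs, ref_token_logprobs):
    """Return log pi_theta(y | context) - log pi_ref(y | context) for one response."""
    return sequence_logprob(policy_token_logprobs) - sequence_logprob(ref_token_logprobs)


# Hypothetical per-token log-probs for the same three-token response y.
policy_with_image = [-0.2, -0.3, -0.1]   # log pi_theta(y | x, v)
ref_with_image = [-0.9, -1.1, -0.8]      # log pi_ref(y | x, v)
policy_text_only = [-0.5, -0.6, -0.4]    # log pi_theta(y | x)
ref_text_only = [-1.0, -1.2, -0.9]       # log pi_ref(y | x)

r_vit = log_ratio_gain(policy_with_image, ref_with_image)  # multimodal gain
b = log_ratio_gain(policy_text_only, ref_text_only)        # text-only gain (language bias)

print(f"multimodal gain R = {r_vit:.2f}, text-only gain B = {b:.2f}")
```

In this toy case the response became more likely with the image, but most of that gain also shows up without it, which is the pattern the diagnostic is designed to expose.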

That measurement changes the business question. The issue is no longer only, “Did the model answer correctly?” It becomes, “Did the training process make the model more dependent on visual evidence, or merely more fluent at plausible completion?” In visual operations, plausible completion is a charming way to ship defects downstream.

VIT creates drift; DPO can polish the drift

The mechanism-first structure matters because the paper’s main evidence begins before the mitigation methods appear.

During visual instruction tuning on LLaVA-v1.5-7B, the authors track the multimodal gain and the text-only gain. The reported trajectories in Figure 3(a) are close enough to carry the argument: much of the measured training gain appears mirrored in the text-only setting. The model is improving, but the improvement is not cleanly attributable to stronger use of visual information.

Direct Preference Optimization makes the story more pointed. DPO trains a model to prefer chosen responses over rejected responses using preference data. In the multimodal version, the preference pair includes a prompt, an image, a preferred answer, and a rejected answer. The obvious assumption is that if the preferred answer is better, DPO should strengthen visual grounding. Sometimes the obvious assumption is just a shortcut in formalwear.

The paper tracks gain for chosen and rejected responses under both multimodal and text-only conditions. In Figure 3(b), the text-only gain for chosen responses can outpace the multimodal gain. That is the dangerous pattern. Preference learning can make the preferred answer more likely through language-only cues, not necessarily through image-grounded reasoning.
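The kind of tracking described above can be sketched as a small logging check. Everything here is illustrative: the logged values are made up, the field names are hypothetical, and `beta=0.1` is an assumed DPO temperature rather than a setting taken from the paper. The margin function uses the standard DPO implicit reward, which is the scaled log-ratio between the current and reference models.

```python
def dpo_margin(chosen_gain, rejected_gain, beta=0.1):
    """DPO-style implicit reward margin: beta * (chosen log-ratio - rejected log-ratio)."""
    return beta * (chosen_gain - rejected_gain)


def steps_where_text_only_leads(log):
    """Return training steps where the chosen response's text-only gain exceeds its multimodal gain."""
    return [row["step"] for row in log if row["chosen_text"] > row["chosen_mm"]]


# Hypothetical logged gains (log pi_theta - log pi_ref) at three training steps.
training_log = [
    {"step": 100, "chosen_mm": 0.4, "rejected_mm": -0.2, "chosen_text": 0.3, "rejected_text": -0.1},
    {"step": 200, "chosen_mm": 0.9, "rejected_mm": -0.6, "chosen_text": 1.1, "rejected_text": -0.4},
    {"step": 300, "chosen_mm": 1.3, "rejected_mm": -1.0, "chosen_text": 1.8, "rejected_text": -0.7},
]

flagged = steps_where_text_only_leads(training_log)

for row in training_log:
    mm = dpo_margin(row["chosen_mm"], row["rejected_mm"])
    text = dpo_margin(row["chosen_text"], row["rejected_text"])
    print(f"step {row['step']}: multimodal margin {mm:.3f}, text-only margin {text:.3f}")

print("text-only gain leads at steps:", flagged)
```

A check like this does not explain why the text-only gain leads, but it turns the pattern into something a training dashboard can surface.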

This explains why a model can become more aligned to the preference dataset and still produce visually unfaithful long-form descriptions. The reward signal says “this answer is preferred.” The model may learn “this style of answer is preferred.” The image, meanwhile, is sitting there like an unpaid consultant.

LBR and LBP target two different moments in the training pipeline

The paper proposes two interventions, and their separation matters.

Language Bias Regularization (LBR) is used during visual instruction tuning, where the authors argue that language bias is still emerging. It penalizes the magnitude of the text-only bias term, effectively discouraging the model from drifting too far from the reference model’s text-only distribution. The updated VIT objective adds a weighted regularization term:

$$ L'_{\mathrm{VIT}} = L_{\mathrm{VIT}} + \alpha \cdot L_{\mathrm{LBR}}. $$
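As a rough sketch of that objective, the snippet below treats the LBR term as the absolute value of the sequence-level text-only log-ratio, following the article's description of a magnitude penalty and of sequence-level L1 regularization. That exact form, the function names, and the value `alpha=0.1` are assumptions for illustration, not the authors' implementation.

```python
def lbr_penalty(policy_text_only_logprob, ref_text_only_logprob):
    """Assumed LBR term: |log pi_theta(y | x) - log pi_ref(y | x)| at the sequence level."""
    return abs(policy_text_only_logprob - ref_text_only_logprob)


def vit_loss_with_lbr(vit_loss, policy_text_only_logprob, ref_text_only_logprob, alpha=0.1):
    """Regularized objective: original VIT loss plus alpha times the LBR penalty."""
    return vit_loss + alpha * lbr_penalty(policy_text_only_logprob, ref_text_only_logprob)


# Hypothetical values: a VIT loss of 2.0 and a text-only log-ratio of 1.6.
total = vit_loss_with_lbr(vit_loss=2.0, policy_text_only_logprob=-1.5, ref_text_only_logprob=-3.1)
print(f"regularized loss = {total:.2f}")
```

Because the penalty is symmetric, it discourages drift in either direction from the reference model's text-only distribution, which matches the "do not wander" framing rather than a one-sided push.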

Language Bias Penalty (LBP) is used during DPO, where bias already exists after instruction tuning. Instead of merely restraining the growth of bias, LBP pushes against text-only reliance inside the preference-optimization stage. The paper’s implementation uses the reference-to-current text-only likelihood ratio inside a sigmoid-based penalty, with a DPO objective of the form:

$$ L'_{\mathrm{DPO}} = L_{\mathrm{DPOM}} + \gamma \cdot L_{\mathrm{LBP}}. $$
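The article does not spell out the exact LBP formula, so the following is only one plausible reading of "reference-to-current text-only likelihood ratio inside a sigmoid-based penalty": the negative log-sigmoid of the reference-minus-current text-only log-likelihood. Which responses the penalty applies to, any scaling factor, and the value of `gamma` are all assumptions here.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def lbp_penalty(policy_text_only_logprob, ref_text_only_logprob):
    """Assumed LBP term: -log sigmoid(log pi_ref(y | x) - log pi_theta(y | x))."""
    return -log_sigmoid(ref_text_only_logprob - policy_text_only_logprob)


def dpo_loss_with_lbp(dpo_loss, policy_text_only_logprob, ref_text_only_logprob, gamma=1.0):
    """Penalized objective: DPO loss plus gamma times the LBP penalty."""
    return dpo_loss + gamma * lbp_penalty(policy_text_only_logprob, ref_text_only_logprob)


# The penalty grows as the current model's text-only likelihood rises above the reference.
more_biased = lbp_penalty(policy_text_only_logprob=-1.5, ref_text_only_logprob=-3.1)
unchanged = lbp_penalty(policy_text_only_logprob=-3.1, ref_text_only_logprob=-3.1)
less_biased = lbp_penalty(policy_text_only_logprob=-4.0, ref_text_only_logprob=-3.1)
print(f"{more_biased:.3f} > {unchanged:.3f} > {less_biased:.3f}")
```

Unlike a symmetric magnitude penalty, this form is one-sided: lowering the text-only likelihood below the reference reduces the loss, which fits the corrective role described for the DPO stage.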

The distinction is operationally important. LBR is a preventive regularizer. LBP is a corrective penalty. One is for the stage where the model starts learning to talk beyond the image; the other is for the stage where preference learning may reward that habit.

Intervention	Training stage	What it tries to stop	Practical interpretation
LBR	Visual instruction tuning	Early growth of text-only drift	Keep supervised multimodal tuning from becoming language-only imitation with an image attached
LBP	DPO / preference optimization	Preference learning that rewards text-only plausibility	Make alignment less willing to accept fluent answers that do not depend enough on the image

Both methods are deliberately simple. They do not require additional data or a separate judge model. The paper also reports that reference-model outputs can be pre-computed and cached, producing nearly identical VRAM use and only minor training-time overhead. That does not make them free. It makes them deployable in a real fine-tuning workflow, which is a more useful word than “free.”
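The caching idea can be sketched in a few lines. Because the reference model is frozen, its log-probabilities for each training example can be computed once and stored, so the reference model does not need to sit in memory during fine-tuning. The scoring function below is a stand-in for a real forward pass, and the example IDs, file name, and JSON format are hypothetical choices.

```python
import json
import os
import tempfile


def build_reference_cache(examples, ref_logprob_fn):
    """Score every example once with the frozen reference model, with and without the image."""
    return {
        ex["id"]: {
            "with_image": ref_logprob_fn(ex, use_image=True),
            "text_only": ref_logprob_fn(ex, use_image=False),
        }
        for ex in examples
    }


def fake_ref_logprob(example, use_image):
    """Stand-in for a reference-model forward pass (not a real scoring function)."""
    base = -float(len(example["response"].split()))
    return base if use_image else base - 1.0


examples = [
    {"id": "ex-1", "response": "a dented box on a porch"},
    {"id": "ex-2", "response": "three charts on a dashboard"},
]

cache = build_reference_cache(examples, fake_ref_logprob)

# Persist once before training, then reload inside the training loop.
cache_path = os.path.join(tempfile.mkdtemp(), "ref_logprobs.json")
with open(cache_path, "w", encoding="utf-8") as f:
    json.dump(cache, f)
with open(cache_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded["ex-1"])
```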

The main evidence supports the mechanism, with some normal benchmark messiness

The paper evaluates LBR mainly on general LVLM capability benchmarks and LBP mainly on hallucination-focused benchmarks. That split is sensible: LBR modifies visual instruction tuning, so the first concern is whether general multimodal capability improves or breaks; LBP modifies preference alignment, so the first concern is whether hallucination declines without gutting general performance.

LBR improves most reported metrics across LLaVA-v1.5-7B, LLaVA-v1.5-13B, and LLaVA-NEXT-3B. The gains are often modest, but broad. For example, LLaVA-v1.5-7B with LBR moves MME from 1490 to 1525, SQA from 66.8 to 69.4, VisWiz from 50.1 to 54.0, COCO captioning from 110.6 to 112.1, and TextCap from 98.4 to 99.1. LLaVA-NEXT-3B improves TextVQA from 56.1 to 57.9 and MMStar from 42.7 to 44.7.

Not every cell improves. RWQA for LLaVA-v1.5-7B slips from 55.4 to 54.9; VisWiz for the 13B model dips from 55.6 to 55.1; MMMU for LLaVA-NEXT-3B falls from 39.6 to 38.8. This is exactly why “consistent improvement” should be read as a broad pattern, not as a magic spell. The business interpretation is not that a tiny regularizer guarantees every downstream metric rises. It is that suppressing text-only drift appears compatible with, and often helpful to, general multimodal performance.

LBP’s main results are more directly tied to hallucination. Among directly comparable LLaVA-v1.5-7B models trained on RLHF-V, LBP achieves an MMHalBench score of 2.91 with a hallucination rate of 0.43, compared with 2.16 / 0.56 for V-DPO and 2.69 / 0.49 for MFPO (a higher score and a lower rate are better). On the AMBER generative task, LBP’s hallucination rate is 18.5 versus 22.5 for MFPO. On Object HalBench, LBP scores CHAIRs / CHAIRi of 12.3 / 6.3 versus 13.4 / 6.6 for MFPO (lower is better).

For LLaVA-v1.5-13B, the picture is still positive but less cartoonish. LBP scores 3.01 on MMHalBench with a 0.42 hallucination rate. MFPO matches the 0.42 hallucination rate but with a lower score of 2.94. LBP has lower AMBER generative hallucination than MFPO, 16.6 versus 19.4, but also lower coverage, 51.5 versus 56.1. Again, this is not a fairy tale where every number salutes. It is a stronger pattern: targeted bias penalties reduce hallucination-oriented errors while preserving competitive general capability.

Result area	Likely purpose in the paper	What it supports	What it does not prove
Figure 3 training dynamics	Main mechanism evidence	VIT and DPO can increase text-only gain alongside or beyond multimodal gain	That every hallucination in every LVLM is caused by language bias
LBR benchmark tables	Main effectiveness evidence for VIT intervention	Regularizing text-only drift can improve broad LVLM capability	Guaranteed gains on every benchmark or domain
LBP hallucination tables	Main effectiveness evidence for DPO intervention	Penalizing text-only reliance improves hallucination metrics under comparable settings	Complete visual factuality or production safety
Alternative regularization table	Ablation	Sequence-level L1 regularization is a reasonable LBR choice	That no better regularizer exists
Hyperparameter plots	Sensitivity / implementation detail	LBR is more sensitive to $\alpha$; LBP is relatively stable across $\gamma$	That no tuning is needed in other architectures
VLFeedback scale tests	Robustness / scalability	LBP retains advantage across 1K, 10K, and 30K preference subsets	That more preference data always helps more
Qwen2.5-VL extension	Architecture generalization test	LBP beats DPO under identical Qwen training conditions	Full validation of LBR on Qwen, which was not tested
PixMo augmentation	Data-bias check	LBR still helps when the SFT mixture is strengthened with human-curated captions	That data quality is irrelevant

The evidence is strongest when read as a chain: the diagnostic shows text-only drift; LBR and LBP directly target that drift; the resulting models improve on many capability and hallucination measurements; human evaluation supports the same direction. That is more convincing than a benchmark leaderboard, because the method and measurement are aimed at the same failure mechanism.

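Human evaluation is the right sanity check because automated hallucination metrics have blind spots

The authors are unusually explicit about a problem that benchmark users often pretend not to see: automated hallucination metrics are not neutral instruments from Mount Olympus.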

Object-matching metrics such as CHAIR can flag correct objects as hallucinations when ground-truth object annotations are incomplete. They also struggle with relational, state, and attribute errors because they mostly count object-word overlap. LLM-as-judge methods such as MMHalBench can evaluate semantic coherence more flexibly, but the judge often compares generated text against captions or annotations rather than directly inspecting the image. A language model judging visual faithfulness without seeing the visual evidence is, let us say, a governance arrangement with comic potential.
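The annotation blind spot is easy to reproduce with a simplified CHAIR-style calculation. In the sketch below, the instance-level score is the share of mentioned objects that are absent from the annotations, and the sentence-level score is the share of captions containing at least one such object. The real metric also maps synonyms onto a fixed object vocabulary, which is omitted here, and the example objects are invented.

```python
def chair_scores(captions_objects, ground_truth_objects):
    """Return (sentence-level, instance-level) CHAIR-style scores.

    captions_objects: one list of mentioned objects per caption.
    ground_truth_objects: one set of annotated objects per image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(captions_objects, ground_truth_objects):
        unmatched = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(unmatched)
        captions_with_hallucination += 1 if unmatched else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(captions_objects) if captions_objects else 0.0
    return chair_s, chair_i


# The caption is visually correct, but the annotations forgot the bench.
mentions = [["dog", "frisbee", "bench"]]
incomplete_annotations = [{"dog", "frisbee"}]
complete_annotations = [{"dog", "frisbee", "bench"}]

print(chair_scores(mentions, incomplete_annotations))  # bench is counted as a hallucination
print(chair_scores(mentions, complete_annotations))    # no hallucination once annotations are complete
```

The same caption moves from fully penalized to clean without a single word changing, which is why object-overlap scores need a human cross-check before they drive decisions.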

So the paper adds a targeted human evaluation. The setup uses 100 COCO validation images and the prompt “Please help me describe the image in detail.” Three trained annotators classify hallucinations into six categories: existence, attribute, state, number, action, and relation. Majority vote determines the final error count. This is not huge, but it is well aligned with the specific claim: long-form visual descriptions are where language bias should become more visible.
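A majority-vote tally over the six categories might look like the sketch below. The article does not describe the exact aggregation protocol, so the unit of annotation (one candidate error per row, with `None` meaning "not a hallucination") and the strict-majority rule are assumptions, and the votes are invented.

```python
from collections import Counter

CATEGORIES = {"existence", "attribute", "state", "number", "action", "relation"}


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None


def count_errors(candidate_votes):
    """Tally final hallucination counts per category after majority voting."""
    counts = Counter()
    for votes in candidate_votes:
        label = majority_label(votes)
        if label in CATEGORIES:
            counts[label] += 1
    return counts


# Hypothetical votes from three annotators on four candidate errors.
candidate_votes = [
    ["existence", "existence", None],      # majority: existence
    ["attribute", "state", "relation"],    # no majority, not counted
    [None, None, "number"],                # majority says it is not a hallucination
    ["relation", "relation", "relation"],  # unanimous: relation
]

error_counts = count_errors(candidate_votes)
print(dict(error_counts), "total:", sum(error_counts.values()))
```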

The human evaluation strengthens the paper’s story. Total hallucination counts fall from 155 for the LLaVA-v1.5-7B baseline to 121 with LBR. DPO reduces the count to 137, but the paper notes a trade-off: vanilla DPO reduces existence hallucinations while increasing errors in several other categories. LBP cuts the total count further to 83, while average response length remains comparable or slightly longer than the baselines.
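For scale, the counts above can be converted into relative reductions. The percentages below are derived arithmetic from the counts quoted in this article, not figures reported by the paper.

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


baseline_errors = 155  # LLaVA-v1.5-7B baseline, as quoted above
reported = {"LBR": 121, "DPO": 137, "LBP": 83}

vs_baseline = {
    name: round(relative_reduction(baseline_errors, count), 1)
    for name, count in reported.items()
}
lbp_vs_dpo = round(relative_reduction(reported["DPO"], reported["LBP"]), 1)

print(vs_baseline)
print("LBP relative to vanilla DPO:", lbp_vs_dpo)
```

On these counts, LBP removes a little under half of the baseline's hallucinations and roughly two in five of those that remain after vanilla DPO.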

That last point matters. A trivial way to reduce hallucinations is to say less. The model can become very safe by becoming very useless, a strategy already mastered by many corporate compliance memos. The paper’s supplementary analysis reports that LBP improves the balance among MMHalBench score, hallucination rate, and informativeness. In the human evaluation, LBP produces fewer hallucinations without shrinking into terse avoidance.

Long-form generation is where language bias stops pretending to be harmless

The paper’s long-form analysis deserves attention because many business workflows do not ask LVLMs for one-word answers. They ask for descriptions, explanations, summaries, reports, screen-reading assistance, product comparisons, or step-by-step guidance. The longer the answer, the more room the language model has to fill gaps with plausible continuity.

The appendix stratifies AMBER generative results by output length. At 16 and 32 tokens, DPO and LBP are close. As outputs grow, the gap widens. At length 128, DPO’s hallucination rate reaches 24.6, while LBP reports 19.4. At length 256, DPO reaches 26.2, while LBP reports 20.7.

This is not just a benchmark curiosity. It tells us where the mechanism bites hardest. Short answers constrain the model. Long answers invite the model to narrate. Narration is useful, but it also gives language priors more surface area. If the system is used for inspection reports, customer-facing product descriptions, insurance photo summaries, or document-image analysis, the risky output is often not the short answer. It is the polished paragraph.

The visual encoder is part of the story, but not the whole villain

The paper also tests whether language bias is mainly caused by the visual encoder’s trainability. Freezing the vision encoder during LLaVA-NEXT training hurts general performance, which is unsurprising. A model that cannot adapt its visual representations has fewer ways to improve. But the paper reports that freezing has negligible impact on language-bias training dynamics.

This test is useful because it narrows the diagnosis. The vision encoder matters for capability. Encoder architecture also matters: the authors observe that Qwen2.5-VL-3B shows a lower tendency toward pure language priors than LLaVA-v1.5, likely due to a stronger vision transformer design. But the persistence of language bias under DPO, and LBP’s improvement over DPO in the Qwen extension, suggest that training objectives still matter even in stronger architectures.

In business terms, buying a better base model is not the same as auditing the alignment pipeline. Better eyes help. They do not guarantee the model will use them under pressure from a preference objective.

What businesses can actually do with this finding

The paper directly shows a training-time phenomenon in specific open LVLM settings. Cognaptus’ inference is broader but bounded: teams deploying vision-language systems should evaluate whether their models are using visual evidence, not only whether their answers sound right on a benchmark.

Business decision	What the paper directly shows	Cognaptus inference	What remains uncertain
Vendor evaluation	Text-only gain can track or exceed multimodal gain during VIT/DPO	Ask vendors for grounding diagnostics, not only hallucination leaderboard scores	Proprietary vendors may not expose training-stage measurements
Internal fine-tuning	LBR and LBP work without extra data or auxiliary models in tested settings	Add modality-reliance diagnostics to fine-tuning reviews before approving model updates	Hyperparameters may need retuning for other architectures and domains
Visual QA deployment	Long-form outputs are more exposed to language bias	Test both short-answer accuracy and long-form report faithfulness	Domain-specific error types may differ from COCO/MMHalBench/AMBER
Runtime monitoring	Automated metrics have known blind spots	Use human review or targeted audits for high-risk visual narratives	Human evaluation is expensive and cannot cover every edge case
Model governance	DPO can improve preference fit while shifting error categories	Track error taxonomy, not only total hallucination rate	The right taxonomy depends on the application

A practical audit can include text-only or image-ablated counterfactuals: give the model the prompt without the image, with a blank image, or with a mismatched image, and measure how much the answer changes. This is not a complete substitute for training-time bias measurement, but it follows the same logic. If the answer barely changes when the image disappears, the system is not visually grounded in any operationally meaningful sense. It is producing confident visual prose through muscle memory.
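The counterfactual audit described above can be sketched as a small harness. The `generate` callable is a hypothetical wrapper around whatever model or API is being audited, the two stub models exist only to make the example runnable, and images are represented as strings for the demo. Word-level similarity from `difflib` is a crude stand-in for a proper semantic comparison.

```python
from difflib import SequenceMatcher


def answer_similarity(a, b):
    """Word-level similarity in [0, 1] between two answers."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def image_ablation_audit(generate, prompt, image, blank_image, mismatched_image):
    """Compare the normal answer with answers under three image counterfactuals."""
    reference = generate(prompt, image)
    variants = {
        "no_image": None,
        "blank_image": blank_image,
        "mismatched_image": mismatched_image,
    }
    return {
        name: answer_similarity(reference, generate(prompt, variant))
        for name, variant in variants.items()
    }


def prior_driven_model(prompt, image):
    """Stub that ignores the image entirely."""
    return "A delivery box sits on a porch with minor scuffing."


def image_sensitive_model(prompt, image):
    """Stub whose answer depends on the image."""
    if image is None:
        return "No image was provided, so I cannot describe it."
    return f"The image shows {image}."


prompt = "Describe the package condition."
args = (prompt, "a dented box on a porch", "a blank white frame", "a dashboard with three charts")

biased_report = image_ablation_audit(prior_driven_model, *args)
grounded_report = image_ablation_audit(image_sensitive_model, *args)
print("prior-driven:", biased_report)
print("image-sensitive:", grounded_report)
```

Similarity scores near 1.0 across all three counterfactuals are the warning sign: the answer survived the loss of its evidence. Where to set an alert threshold is a domain decision this sketch deliberately leaves open.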

For teams fine-tuning open LVLMs, the paper suggests a more direct path: measure language bias during VIT or preference optimization, then penalize it. The attraction is not only accuracy. It is diagnosis. Instead of discovering language bias after deployment through user complaints and screenshots in Slack, teams can observe it during training and constrain it before release. Radical, I know.

The boundary conditions should shape deployment expectations

The paper is strong enough to be useful and bounded enough to be read carefully.

First, most experiments are on LLaVA-family models. The Qwen2.5-VL-3B extension is valuable because it suggests LBP generalizes beyond LLaVA, but only LBP is tested there. The authors could not test LBR on Qwen because LBR requires access to visual instruction tuning data and checkpoints that are not publicly available for the Qwen series.

Second, LBR is not a black-box fix. It is a training-time method for teams that can modify VIT. If a business only consumes a closed API, it cannot simply “turn on LBR.” It can still demand grounding diagnostics, run counterfactual evaluations, or apply downstream safeguards, but that is not the same intervention.

Third, automated hallucination metrics are imperfect, and the paper says so itself. Object-level metrics can punish correct objects missing from annotations; LLM-judge metrics may lack direct visual grounding. Human evaluation helps, but the reported human study uses 100 COCO images. That is a focused validation, not a universal safety certificate.

Fourth, the tested domains are not the full business universe. Industrial defect detection, medical imaging, customs inspection, satellite analysis, invoice processing, and UI automation each have their own error taxonomy. The mechanism may transfer; the numbers should not be imported like furniture from a showroom.

Finally, reducing language bias does not solve all multimodal hallucination. A model can still fail because the image is ambiguous, the visual encoder misses fine details, OCR is weak, object grounding is poor, or the user asks for information not present in the image. LBR and LBP address a particular training-induced tendency: reliance on text-only priors. That is important because it is specific. Specific fixes beat atmospheric concern.

The real lesson is to test whether the model looked before praising what it said

The paper’s contribution is not that hallucination exists. We have had enough demonstrations of models inventing umbrellas, traffic signs, and imaginary furniture. The contribution is a cleaner explanation for how a multimodal training pipeline can accidentally reward non-visual competence while appearing to improve visual behavior.

That explanation matters for deployment. If a vision-language model answers correctly because it used the image, the system has learned something operationally valuable. If it answers plausibly because the prompt strongly suggests the answer pattern, the system has learned a shortcut. In easy demos, both can look similar. In production, the shortcut waits patiently for the one image that does not match the usual script.

For businesses, the practical question is therefore not “Which LVLM has the best hallucination score?” It is sharper: “Does our training and evaluation process punish answers that can be generated without looking?” If the answer is no, the model may still be useful. It is just less grounded than the slide deck implies. And the slide deck, tragically, will be very fluent.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yangneng Chen and Jing Li, “Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation,” arXiv:2605.25036v1, 24 May 2026, https://arxiv.org/abs/2605.25036.
