Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

A photo arrives in a product-support workflow. The model sees the image, answers confidently, and explains the object’s features. The prose is smooth. The reasoning sounds plausible. The problem is smaller and more brutal: it named the wrong thing.

That is the failure mode at the center of “Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies,” a paper that introduces the Fine-grained Recognition Open World benchmark, or FROW.¹ The paper is not asking whether large vision-language models can talk about images. They can. We have all been sufficiently dazzled by captioning demos; please clap responsibly.

The sharper question is whether they can distinguish the exact subordinate category before answering a question that depends on that distinction. Is this aircraft a Gulfstream IV or another Gulfstream model? Is this dog a Chihuahua or just “a small dog”? Is this food cannoli or something vaguely pastry-adjacent? In many business workflows, that difference is not aesthetic. It decides whether the answer is useful, misleading, or quietly expensive.

The failure is not bad reasoning; it is reasoning after a bad label

The paper’s running example is useful because it is painfully ordinary. An image shows a Gulfstream IV. The benchmark asks a question whose answer requires recognizing that exact aircraft type, such as when it made its maiden flight. A model may identify it as a different aircraft and then provide facts that match the wrong label. The answer can look structured, fluent, and informed while being anchored to the wrong object.

That distinction matters. A model can fail in at least two ways:

Failure layer	What goes wrong	Why it matters operationally
Recognition	The model misidentifies the fine-grained category.	Every downstream fact may become irrelevant even if well written.
Content	The model recognizes the object but gives incorrect or incomplete information.	The object anchor is right, but domain knowledge or retrieval remains weak.
Evaluation	The test format hides the recognition problem.	Teams overestimate reliability before deployment.

FROW is built around this separation. It evaluates both recognition accuracy and content accuracy. Recognition accuracy asks whether the model has identified the fine-grained category. Content accuracy asks whether the factual answer is correct, using a reference answer constructed from the category and Wikipedia-sourced information.

This is a small design choice with large consequences. In many multimodal AI evaluations, the model is rewarded for producing a generally reasonable answer. FROW makes the hidden dependency explicit: first know what you are looking at, then talk.

Multiple choice gives the model a map; open-world recognition removes it

The paper’s first contribution is the benchmark itself. FROW uses images from six established fine-grained datasets: FGVC-Aircraft, CUB-200-2011 birds, Food-101, Stanford Dogs, Oxford Flowers-102, and VegFru. Together, these provide 859 fine-grained categories. For each category, the authors select an image from the test set, retrieve relevant category information from Wikipedia, and use GPT-4o to generate open-ended questions that do not reveal the category name.

That last detail is doing real work. In a multiple-choice setting, the model receives a constrained answer space. It may not fully recognize the object; it may only compare the image against a few candidate labels. That is still a useful capability, but it is not the same as open-world recognition.

FROW removes the menu.

The paper’s Figure 1 contrasts multiple-choice performance with FROW performance. GPT-4o is near-perfect on multiple-choice fine-grained recognition, but its FROW scores are much lower: 64.20 on CUB-200-2011, 62.83 on Stanford Dog, 70.98 on Flowers102, 75.00 on Food101, 60.17 on VegFru, and 63.80 on Aircraft. Those are strong relative to open-source models, but not “the model sees everything” numbers.

The gap is harsher for open-source models. Qwen-VL-chat-78B scores 28.80 on CUB, 48.67 on Stanford Dog, 53.73 on Flowers102, 57.20 on Food101, 42.87 on VegFru, and 35.80 on Aircraft. InternVL-2.5-8B is lower still on several categories, with 16.80 on Aircraft and 21.52 on VegFru. LLaVA-1.5-7B drops to 15.80 on CUB and 16.40 on Aircraft.

The interpretation is not that these models are useless. The interpretation is more specific: a model can be broadly competent at visual dialogue and still be weak at identifying the exact class that business decisions depend on.

FROW tests a practical sequence: identify, then answer

The benchmark construction mirrors a real workflow better than a simple classification test.

First, the model receives an image. Then it receives a question whose answer depends on the object’s fine-grained identity. The category is not leaked in the question. The model must infer the object and provide the answer. Evaluation then separates whether the model named the object sufficiently well from whether the content was factually correct.

The recognition metric uses a three-level scoring system from 0 to 2. A partial match can receive partial credit. For example, if the true category is “Boeing 737-600,” the answer “Boeing 737” may be partially correct, while “Boeing 737-200” is wrong because it identifies a different subtype. The content metric uses a four-level score from 0 to 3, based on correctness against a reference answer.
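The two scales are easy to keep apart in code. The sketch below is a minimal illustration, not the paper's evaluator: in FROW the per-answer scores come from an LLM judge, whereas here they are simply passed in as integers. The names `JudgedAnswer`, `to_percent`, and `failure_layer` are hypothetical, and normalising by the maximum score is an assumption about how the scales might be turned into percentages.

```python
from dataclasses import dataclass

REC_MAX = 2  # recognition scale described above: 0, 1, 2
CON_MAX = 3  # content scale described above: 0, 1, 2, 3


@dataclass(frozen=True)
class JudgedAnswer:
    """One model answer with scores already assigned by a judge (hypothetical container)."""
    recognition: int
    content: int

    def __post_init__(self):
        if not 0 <= self.recognition <= REC_MAX:
            raise ValueError("recognition score must be in 0..2")
        if not 0 <= self.content <= CON_MAX:
            raise ValueError("content score must be in 0..3")


def to_percent(scores, max_score):
    """Average score as a percentage of the maximum (an assumed normalisation)."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return 100.0 * sum(scores) / (max_score * len(scores))


def failure_layer(answer):
    """Report the first layer that fell short, checking recognition before content."""
    if answer.recognition < REC_MAX:
        return "recognition"
    if answer.content < CON_MAX:
        return "content"
    return "none"


answers = [
    JudgedAnswer(recognition=2, content=3),  # right object, right facts
    JudgedAnswer(recognition=1, content=1),  # e.g. "Boeing 737" for a 737-600
    JudgedAnswer(recognition=0, content=0),  # wrong subtype, facts about the wrong object
    JudgedAnswer(recognition=2, content=1),  # right object, weak facts
]

recognition_pct = to_percent([a.recognition for a in answers], REC_MAX)
content_pct = to_percent([a.content for a in answers], CON_MAX)
layers = [failure_layer(a) for a in answers]
```

Checking recognition first in `failure_layer` is deliberate: a content error that follows a recognition error is attributed to the recognition layer, which matches the dependency the benchmark is trying to expose.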

The authors use GPT-4o-mini as the main evaluator after comparing it with human annotators, GPT-4o, and Claude 3.5 Sonnet on 300 sampled responses. They report discrepancies below or around one percentage point across these comparisons. That does not remove all evaluator risk, but it gives the benchmark a reasonable cost-quality argument. Manual evaluation of open-ended fine-grained answers is not exactly a hobby one recommends to friends.
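A team adopting an LLM judge can run the same kind of sanity check on its own sample. The sketch below assumes that "discrepancy" means the absolute difference between two graders' mean normalised scores, in percentage points; the paper's exact definition is not restated here, so treat that as an assumption. The score lists are made-up illustrations, not data from the paper.

```python
def mean_percent(scores, max_score):
    """Mean score as a percentage of the maximum possible score."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return 100.0 * sum(scores) / (max_score * len(scores))


def discrepancy_points(judge_scores, human_scores, max_score):
    """Absolute gap between the two graders' aggregate scores, in percentage points."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("both graders must score the same responses")
    return abs(mean_percent(judge_scores, max_score) - mean_percent(human_scores, max_score))


def exact_agreement(judge_scores, human_scores):
    """Fraction of responses where both graders gave the identical score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("both graders must score the same responses")
    same = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    return same / len(judge_scores)


# Illustrative recognition scores (0..2) for eight responses; not from the paper.
human = [2, 2, 1, 0, 2, 1, 0, 2]
judge = [2, 2, 1, 0, 2, 2, 0, 1]

gap = discrepancy_points(judge, human, max_score=2)
agreement = exact_agreement(judge, human)
```

In this toy sample the aggregate gap is zero while the graders disagree on two of eight responses. That is why it can be worth tracking per-item agreement alongside the headline discrepancy.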

The core insight is the dependency chain:

image → fine-grained category recognition → factual answer → business decision

Break the second step, and the third step becomes theater.

The strongest result is the benchmark gap, not the augmentation trick

The paper also proposes optimization strategies. These are useful, but they should not distract from the main evidence.

The cleanest finding is that open-ended fine-grained recognition exposes weaknesses that multiple-choice testing can hide. That finding matters even if one never adopts the paper’s training recipe. For AI product teams, FROW is mainly a diagnostic warning: do not evaluate a multimodal system only with tests that give it the candidate labels.

The optimization section then asks a second question: if fine-grained recognition is weak, what kind of training data and training stage help?

The authors test several ideas using InternVL-2.5-8B, with general data from LLaVA: 558K alignment samples and 665K supervised fine-tuning samples. Fine-grained data is generated from the six fine-grained datasets.

The experiments play different evidentiary roles:

Test or result	Likely purpose	What it supports	What it does not prove
FROW benchmark results across proprietary and open-source LVLMs	Main evidence	Open-ended fine-grained recognition remains weak, especially in open-source models.	It does not prove all multimodal systems fail in every domain.
GPT evaluator comparison with humans and other proprietary models	Validation check	GPT-4o-mini is a plausible evaluator for this benchmark.	It does not eliminate all grading bias or dataset-specific effects.
Mosaic data experiments	Exploratory/ablation-like data construction test	Mosaic images can improve convergence and modestly improve recognition.	It is not yet a general recipe for all domains.
Open-world data experiments	Main optimization evidence	Knowledge-rich open-ended and introduction-style data improves recognition and content scores.	It does not prove the gains come from durable understanding rather than dataset alignment.
Training-stage allocation experiments	Mechanism and implementation evidence	Putting fine-grained data into alignment helps recognition while reducing damage to general capability.	It does not fully settle optimal training for other architectures or preference-alignment methods.
Appendix figures on extra fine-grained/general tasks	Supporting extension	The training-stage claim is not limited to one small plot.	It remains within the tested model/data setup.

This ranking matters because the business implication depends on it. The paper is strongest as an evaluation and training-design study, not as a universal law of multimodal learning.

Mosaic data helps recognition, but it is not the whole answer

The mosaic strategy is intuitive. Instead of training on one object image at a time, the authors combine multiple images into a tiled collage and construct question-answer pairs for the mosaic. In the paper’s example, a 3 × 3 bird collage asks which bird categories are represented in the image, and the answer describes each tile’s category.
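The tiling step can be sketched without any imaging library. The code below is an illustration under stated assumptions, not the authors' pipeline: images are represented as plain 2D lists of pixel values, all tiles are assumed to have the same size, and the question wording and the `make_mosaic` and `make_qa` helpers are hypothetical.

```python
def make_mosaic(tiles, grid=3):
    """Tile grid*grid equally sized images (2D lists) into one image, row-major."""
    if len(tiles) != grid * grid:
        raise ValueError("need exactly grid*grid tiles")
    height = len(tiles[0][1])
    width = len(tiles[0][1][0])
    for _, img in tiles:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all tiles must share the same size")

    mosaic = []
    for r in range(grid):
        row_tiles = [img for _, img in tiles[r * grid:(r + 1) * grid]]
        for y in range(height):
            line = []
            for img in row_tiles:
                line.extend(img[y])  # concatenate pixel row y of each tile in this grid row
            mosaic.append(line)
    return mosaic


def make_qa(tiles, grid=3):
    """Build one question-answer pair that names the category in every tile."""
    parts = [
        f"row {i // grid + 1}, column {i % grid + 1}: {label}"
        for i, (label, _) in enumerate(tiles)
    ]
    return {
        "question": "Which categories are shown in each tile of this image?",
        "answer": "; ".join(parts),
    }


def solid(value, size=2):
    """A tiny placeholder image filled with one pixel value."""
    return [[value] * size for _ in range(size)]


# A 2 x 2 example keeps the output readable; the labels are placeholders.
tiles = [
    ("Indigo Bunting", solid(0)),
    ("Painted Bunting", solid(1)),
    ("Lazuli Bunting", solid(2)),
    ("Blue Grosbeak", solid(3)),
]
mosaic = make_mosaic(tiles, grid=2)
qa = make_qa(tiles, grid=2)
```

A real pipeline would resize the source images to a common tile size before tiling; that step is omitted here.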

The motivation is practical. The authors observe that fine-grained recognition training may require repeated exposure, nearly ten repetitions, to converge. Mosaic data allows multiple category examples to be packed into one image-question pair. Under comparable data volume, the model trained with mosaic images converges faster and achieves higher recognition accuracy. The abstract summarizes the average recognition gain from mosaic data as about 1%.

That is useful, but modest. The business reading should not be “use collages and fine-grained recognition is solved.” The better reading is: dense exposure to category distinctions helps, especially when the model must learn many visually similar subtypes. It is a data-efficiency tactic, not a magic spell. Sadly, the models still refuse to become taxonomists out of politeness.

The authors also compare mosaic augmentation with traditional AutoAugment-style data augmentation. Their figures suggest the two are not mutually exclusive; with enough repetitions, combining traditional augmentation and mosaic data can outperform either alone. This is best treated as an exploratory extension, because the paper itself states that mosaic exploration remains preliminary.

Open-world data improves both naming and knowing

The more consequential optimization result comes from open-world data.

Short-answer recognition data can teach the model to name categories. But naming alone does not guarantee that it can answer questions about those categories. The paper shows that relying only on short-answer and mosaic data may lead to stagnation or even decline in content accuracy. That is exactly the failure pattern businesses should worry about: a system gets better at labels but not necessarily at useful knowledge.

To address this, the authors add two types of knowledge-rich data:

introduction-style QA pairs, where questions elicit background information about the depicted object;
open-ended QA pairs, generated in the same style as the benchmark’s reference answers.

For each category, the dataset includes three introduction-type QA pairs and two open-ended QA pairs. The introduction content is generated from Wikipedia summaries using GPT-4, with each summary produced three times to reduce generation errors. The authors also remove training questions identical to benchmark questions to avoid direct leakage.
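The leakage check is simple to reproduce. The sketch below assumes training items are dicts with a `question` key, and it matches after lowercasing and stripping punctuation. That is slightly stricter than the "identical" criterion described above, and it is an assumption of this example. The helper names are hypothetical, and the answers are placeholders.

```python
import re


def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    no_punct = re.sub(r"[^\w\s]", "", question.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def drop_leaked(training_pairs, benchmark_questions):
    """Keep only training pairs whose question does not match a benchmark question."""
    seen = {normalize(q) for q in benchmark_questions}
    return [p for p in training_pairs if normalize(p["question"]) not in seen]


benchmark_questions = ["When did this aircraft make its maiden flight?"]
training_pairs = [
    {"question": "when did this aircraft make its  maiden flight", "answer": "..."},
    {"question": "What engines power this aircraft?", "answer": "..."},
]

kept = drop_leaked(training_pairs, benchmark_questions)
```

Exact or near-exact matching only catches direct overlap. Paraphrased questions about the same fact pass through this filter, so it guards against copy-level leakage rather than topical overlap.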

Figure 8 reports substantial improvements when open-world data is included. For Flowers102, recognition with open-world data rises from 23.04 to 54.41 under general+fg1, from 26.47 to 66.67 under general+fg5, and from 25.49 to 70.10 under general+fg10. Content accuracy also improves. For Food101, recognition rises from 61.50 to 72.00 under general+fg1, from 55.50 to 73.50 under general+fg5, and from 57.50 to 71.00 under general+fg10.

There is an important mechanism hiding in these numbers. Fine-grained recognition is not only a visual classification problem. It is also a visual-language binding problem. The model must connect the visual subtype to the language knowledge that makes the answer meaningful.

For business use, this suggests a simple rule: do not train or evaluate domain vision systems only on object labels. Pair labels with the questions users will actually ask after recognition. A retail system should not merely learn “Oxford shoe”; it should learn size, material, care, compatibility, and return-risk questions. An agricultural system should not merely learn “specific crop disease”; it should learn treatment relevance and confidence boundaries. A parts-inspection system should not merely learn “component subtype”; it should learn failure modes, replacement rules, and operational consequences.

That is a Cognaptus inference, not the paper’s direct claim. The paper shows gains on six fine-grained visual datasets using Wikipedia/GPT-generated data. The broader workflow implication is that domain-specific multimodal AI needs recognition-linked knowledge, not labels floating alone in a training spreadsheet.

Fine-tuning alone creates the usual trade-off: skill gained, generality lost

The training-process section is where the paper becomes more useful for builders.

A tempting response to poor fine-grained recognition is to fine-tune the model on fine-grained data after the main model is trained. The paper tests this “obvious” solution and finds the usual bill arrives: better fine-grained performance, weaker general capabilities.

In Table 2, the baseline setting uses 558K alignment data and 665K general fine-tuning data. When the model is later fine-tuned on fine-grained data, it performs well on fine-grained short-answer tasks but general benchmark scores drop sharply. AI2D falls from 65.31 to 48.67. ChartQA falls from 27.36 to 13.45. DocVQA falls from 42.43 to 20.36. InfographicsVQA falls from 30.27 to 18.89. MathVista falls from 22.6 to 16.7. POPE declines less dramatically, from 87.6 to 83.39.
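Absolute drops understate how uneven the damage is, because the baselines differ so much. The short calculation below uses only the before and after scores quoted above and adds a relative-drop column; the `drops` helper and the rounding choices are this example's own.

```python
# (baseline score, score after fine-grained-only fine-tuning), as quoted above
TABLE2 = {
    "AI2D": (65.31, 48.67),
    "ChartQA": (27.36, 13.45),
    "DocVQA": (42.43, 20.36),
    "InfographicsVQA": (30.27, 18.89),
    "MathVista": (22.6, 16.7),
    "POPE": (87.6, 83.39),
}


def drops(before, after):
    """Return (absolute drop in points, relative drop as a percent of the baseline)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


results = {name: drops(before, after) for name, (before, after) in TABLE2.items()}

for name, (absolute, relative) in results.items():
    print(f"{name:16s} -{absolute:5.2f} points  ({relative:4.1f}% of baseline)")
```

On these numbers, ChartQA and DocVQA each lose roughly half of their baseline score, while POPE loses under 5 percent. The capabilities most at risk are the document- and chart-reading ones, not basic object grounding.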

This is not surprising, but it is operationally important. A product team cannot simply take a general multimodal assistant, fine-tune it aggressively on a niche taxonomy, and assume the rest of its capabilities remain intact. Some will. The dashboard will look neat. The support tickets will become educational.

The authors then test mixing general and fine-grained data during the broader fine-tuning stage. This reduces the damage but does not eliminate the trade-off. Mixed fine-tuning preserves general performance much better than the final-stage fine-grained-only approach, but it remains slightly weaker than training on general data alone, and its fine-grained performance can be weaker than fine-grained-only training.

The interpretation is not “fine-tuning is bad.” It is that training-stage placement changes what the model can absorb without forgetting other skills.

Alignment-stage data is the underrated lever

The paper’s most interesting training claim is that fine-grained data should appear earlier, during the alignment stage.

In LVLM training, the alignment stage teaches the model to connect visual representations to language representations. It is often treated as a bridge-building phase, not as the place where task capability is learned. The authors challenge that assumption. When fine-grained data is incorporated into alignment, the alignment module itself demonstrates considerable fine-grained recognition ability, especially as fine-grained data volume increases.

Figure 9 shows recognition accuracy improving as fine-grained data is repeated in alignment. Aircraft rises from 15.93 with fg1 to 53.46 with fg20. CUB rises from 34.06 to 75.31. Flowers102 rises from 39.27 to 87.91. Food101 is already high and rises from 86.27 to 89.86. Stanford Dog rises from 63.05 to 77.84. VegFru rises from 62.64 to 83.82.

Then Figure 10 compares models that include fine-grained data during alignment against those that do not, while varying how much fine-grained data is used later in fine-tuning. The pattern is straightforward: models exposed to fine-grained data during alignment achieve stronger fine-grained performance with less later fine-grained fine-tuning, while maintaining comparable general-task performance on examples such as InfographicsVQA and DocVQA.

That is the training-design lesson. Fine-grained recognition is not just another downstream task to paste onto a finished model. It may need to be taught where visual and linguistic representations are being connected.
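One way to make stage placement explicit is to write the data plan down as a small structure and compare alternatives. The sketch below is a planning aid, not the authors' training configuration. The general-data sizes come from the setup described earlier; `FG_SAMPLES` is a hypothetical placeholder because the fine-grained set size is not given here; the repeat counts are illustrative, reading fgN as N repetitions of the fine-grained data.

```python
from dataclasses import dataclass

GENERAL_ALIGNMENT = 558_000  # general alignment samples, from the setup described above
GENERAL_SFT = 665_000        # general supervised fine-tuning samples
FG_SAMPLES = 10_000          # hypothetical placeholder; the real size is not stated here


@dataclass(frozen=True)
class Stage:
    """One training stage: how much general data and how many fine-grained repeats."""
    name: str
    general: int
    fg_repeats: int = 0

    def size(self, fg_samples=FG_SAMPLES):
        return self.general + self.fg_repeats * fg_samples

    def fg_share(self, fg_samples=FG_SAMPLES):
        return self.fg_repeats * fg_samples / self.size(fg_samples)


# Plan A: bolt a fine-grained-only stage onto a finished model.
late_only = [
    Stage("alignment", GENERAL_ALIGNMENT),
    Stage("sft", GENERAL_SFT),
    Stage("fg_finetune", 0, fg_repeats=10),
]

# Plan B: introduce fine-grained data during alignment, with a light touch in SFT.
alignment_first = [
    Stage("alignment", GENERAL_ALIGNMENT, fg_repeats=10),
    Stage("sft", GENERAL_SFT, fg_repeats=1),
]


def describe(plan):
    """Summarise each stage as (name, total samples, fine-grained share)."""
    return [(s.name, s.size(), round(s.fg_share(), 3)) for s in plan]


plan_a = describe(late_only)
plan_b = describe(alignment_first)
```

In plan A the last stage is 100 percent fine-grained data, which is the setting associated with the general-capability drops above. In plan B the fine-grained data is always diluted by general data. The exact ratios would need to be tuned per model and per domain.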

For business teams building specialized multimodal systems, this changes the project question. Instead of asking, “How much domain data should we fine-tune on at the end?” the better question is, “At which stage should domain distinctions enter the model pipeline?” If the object taxonomy is central to the product, late-stage fine-tuning may be too late, too costly, or too destructive.

What this means for business AI products

The paper directly shows that FROW exposes fine-grained recognition weaknesses and that specific data/training strategies improve performance on the tested setup. From that, we can infer several practical design rules for multimodal AI workflows.

Business workflow	Risk if fine-grained recognition is weak	Practical implication
Retail cataloging and search	Similar products are grouped under the wrong subtype, hurting retrieval and recommendations.	Test exact product-category recognition, not only caption quality.
Agricultural diagnosis	A crop, pest, or disease is recognized too broadly, leading to wrong advice.	Pair visual labels with treatment-relevant knowledge and uncertainty checks.
Insurance and claims inspection	Damage type, vehicle model, or component subtype is misread.	Separate object recognition audit from claim-reasoning audit.
Industrial maintenance	The wrong part or machine variant is identified before troubleshooting.	Require subtype verification before procedural recommendations.
Food, plant, or animal identification apps	The model produces confident general descriptions while missing the exact species or dish.	Evaluate open-ended recognition rather than multiple-choice convenience tests.
Domain support assistants	The assistant answers about the wrong item with fluent domain language.	Treat recognition failure as a blocking error, not a minor confidence issue.

The most practical takeaway is evaluation design. If your workflow depends on exact object identity, do not test the model by giving it a short list of candidate labels. That tests discrimination among provided options. Real users rarely provide such a polite answer key.

The second takeaway is architectural. In a production system, object recognition should be a first-class stage with its own confidence threshold, fallback path, and audit metric. It should not be buried inside a single end-to-end answer. The answer may be beautiful, but if the object anchor is wrong, beauty is mostly a user-interface crime.
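A recognition gate of this kind can be sketched in a few lines. This is a minimal illustration of the pattern, not a design taken from the paper: the `recognize` and `answer` callables stand in for whatever model calls a real system makes, the `0.8` threshold is an arbitrary placeholder, and how a confidence value is obtained from a given model is left open.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Recognition:
    label: str
    confidence: float  # assumed to lie in [0, 1]


@dataclass(frozen=True)
class Outcome:
    status: str            # "answered" or "needs_review"
    label: Optional[str]   # kept in both cases so recognition can be audited separately
    answer: Optional[str]


def answer_with_gate(
    image,
    question: str,
    recognize: Callable[[object], Recognition],
    answer: Callable[[str, str], str],
    threshold: float = 0.8,  # placeholder; tune per domain
) -> Outcome:
    """Recognise first; only generate an answer if recognition clears the threshold."""
    rec = recognize(image)
    if rec.confidence < threshold:
        # Fallback path: surface the guess for review instead of reasoning on it.
        return Outcome("needs_review", rec.label, None)
    return Outcome("answered", rec.label, answer(rec.label, question))


# Stub components for demonstration only.
def confident_recognizer(image):
    return Recognition("Gulfstream IV", 0.93)


def unsure_recognizer(image):
    return Recognition("Gulfstream IV", 0.41)


def stub_answerer(label, question):
    return f"[answer about {label}]"


ok = answer_with_gate(b"...", "When was the maiden flight?", confident_recognizer, stub_answerer)
held = answer_with_gate(b"...", "When was the maiden flight?", unsure_recognizer, stub_answerer)
```

Returning the label in both outcomes is the point of the design: recognition accuracy can then be logged and scored on its own, independently of whether an answer was produced.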

The third takeaway is data design. Label-only fine-tuning may improve category naming, but the paper suggests that open-world QA data improves both recognition and content accuracy. Domain systems should therefore collect or generate examples where each object subtype is linked to the questions users actually ask.

Where the evidence stops

The paper’s boundaries are clear enough to matter.

First, the benchmark uses six fine-grained datasets and 859 categories. That is diverse across aircraft, birds, food, dogs, flowers, vegetables, and fruits, but it is not every business domain. Performance in medical imaging, industrial parts, fashion SKUs, semiconductor defects, or real estate defects may differ.

Second, much of the data construction relies on Wikipedia and GPT-generated questions or answers, with manual verification and filtering. This is reasonable for a research benchmark, but business deployments often require source-controlled domain knowledge, compliance review, and liability-aware answer policies. “Wikipedia plus GPT” is not a governance plan. It is a dataset construction method.

Third, the optimization experiments mainly validate strategies on InternVL-2.5-8B, with LLaVA also reported in the benchmark improvement table. The training-stage lesson is persuasive, but architecture-specific effects remain possible.

Fourth, the paper uses GPT-4o-mini as the primary evaluator after reporting close agreement with human and proprietary-model judgments on a sampled set. That supports cost-effective evaluation, but it does not make evaluator bias disappear. Open-ended grading is always a little less clean than we want and a little more useful than pretending multiple choice is reality.

Fifth, the authors explicitly note that preference alignment methods such as RLHF or DPO are underexplored in this work, and that the mosaic-data exploration remains preliminary. So the paper should not be read as the final recipe for fine-grained multimodal training.

These limitations do not weaken the central message. They shape where to use it.

The real benchmark is whether the model knows what it is talking about

FROW is valuable because it attacks a comfortable illusion: that a model capable of fluent visual dialogue must also be reliable at recognizing exact objects. The paper shows that this is not safe. A model may describe the image, answer a question, and still be wrong at the first step.

For business use, that first step is often the cheapest place to catch failure. Ask the model what it sees. Score that separately. Require subtype-level confidence when subtype matters. Then allow reasoning to continue.

The paper’s optimization results add a second lesson: better recognition is not obtained merely by bolting a narrow fine-tune onto a general assistant. The data must teach both category identity and category-linked knowledge, and the training stage matters. Fine-grained distinctions belong close to the visual-language alignment process, not only at the final polish layer.

Seeing is useful. Knowing what is seen is better. Knowing what is seen before explaining it is where multimodal AI starts becoming operational software instead of a charming intern with excellent grammar.

Cognaptus: Automate the Present, Incubate the Future.

¹ Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, and Xin Lou, “Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies,” arXiv:2512.10384, 2025.
