Seeing Is Misleading: When Climate Images Need Receipts

A picture lies differently from a sentence.

A sentence can be checked against a source. A picture can be old, cropped, staged, reused, mislabeled, emotionally loaded, or paired with a claim it never supported. This is why climate disinformation is annoying in the precise technical sense: it often does not need to fabricate a new fact. It can simply attach a real-looking image to a slippery claim and let the audience do the rest. Very efficient. Very human. Very platform-native.

The paper behind today’s article, Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources, tests whether GPT-4o can classify climate-related image–claim pairs more reliably when it receives external evidence from reverse image search, Google search, GPT-based web previews, and fact-checking sites.¹ The headline finding is simple enough: combining evidence sources performs best overall. The useful lesson is less simple: external knowledge is not magic dust. Some single sources underperform the model’s internal reasoning. The operational advantage comes from routing evidence, not merely adding more text to the prompt and hoping the model becomes a fact-checker by osmosis.

That distinction matters for anyone building media monitoring, ESG intelligence, brand-risk scanning, public-sector communication tools, or internal misinformation triage. The business question is not, “Can we connect a vision-language model to the web?” That is a procurement sentence, not a system design. The better question is: “Which evidence source should the model trust, when, and at what cost?”

This paper gives us a useful small-scale map of that problem.

The paper tests evidence, not just vision

The study starts from the CliME dataset, a collection of climate-related social media posts from Twitter and Reddit. Each post includes a textual claim, an image, and a manually validated description. The original dataset did not include factuality labels, so the authors created labels using GPT-4o, multiple role-based prompts, and majority voting.
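To make the labeling step concrete, here is a minimal sketch of majority voting over several role-prompt outputs. The function name, the tie rule, and the fallback label are assumptions for illustration; the article does not say how the authors resolved ties.

```python
from collections import Counter

LABELS = ("Accurate", "Misleading", "False", "Unverifiable")


def majority_label(votes, fallback="Unverifiable"):
    """Return the most common valid label among role-prompt votes.

    Assumption: ties and empty vote lists fall back to `fallback`.
    The paper's actual tie handling is not described in the article.
    """
    counts = Counter(v for v in votes if v in LABELS)
    if not counts:
        return fallback
    ranked = counts.most_common()
    # A tie for first place means there is no majority to report.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]


if __name__ == "__main__":
    # Three hypothetical role prompts voting on one image-claim pair.
    print(majority_label(["Misleading", "Misleading", "False"]))
```

The point of the sketch is the dependency it exposes: whatever the voters agree on becomes the benchmark's ground truth, including their shared blind spots.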

That labeling step is important because it shapes the whole interpretation of the results. The benchmark is not a fully human-labeled gold standard. The labels are model-assisted, derived from the claim, image, and manually validated description. During evaluation, however, GPT-4o sees only the image and claim, with or without retrieved evidence. In other words, the paper asks whether external evidence helps GPT-4o recover factual judgments that were previously assigned through a richer labeling process.

The authors evaluate two label schemes:

Setup	Labels	Evaluation size	Practical meaning
4-class	Accurate, Misleading, False, Unverifiable	500 samples	Tests finer factual diagnosis
2-class	Accurate vs. Disinformation	500 samples	Tests operational triage

The 4-class setting is the harder and more useful version. It asks the model to distinguish between “wrong,” “partly true but distorted,” and “not checkable enough.” That is close to the real world, where misinformation rarely arrives wearing a name tag.

The 2-class setting is easier and more deployable. Many business workflows first need a triage flag: safe enough, or needs review. But collapsing Misleading, False, and Unverifiable into one bucket also hides the difference between malicious distortion, factual error, and evidentiary ambiguity. For compliance and public communication, those distinctions are not decorative. They change the response.
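The collapse from four labels to two can be sketched as a simple mapping. The dictionary follows the grouping described above; the `triage_view` helper and its field names are hypothetical, added to show one way of keeping the finer label alongside the binary flag.

```python
# Grouping as described in the article: everything except Accurate
# lands in the Disinformation bucket.
FOUR_TO_TWO = {
    "Accurate": "Accurate",
    "Misleading": "Disinformation",
    "False": "Disinformation",
    "Unverifiable": "Disinformation",
}


def triage_view(fine_label):
    """Return both the fine-grained label and the binary flag derived from it.

    Hypothetical helper: storing both keeps the distinction between
    distortion, error, and ambiguity available to downstream reviewers.
    """
    return {"fine_label": fine_label, "binary_label": FOUR_TO_TWO[fine_label]}


if __name__ == "__main__":
    for label in FOUR_TO_TWO:
        print(triage_view(label))
```

Three different problems produce the same binary flag, which is exactly the information loss the paragraph above warns about.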

The retrieval success chart is an implementation diagnostic, not the main result

Before looking at classification performance, the paper reports retrieval success rates for the external evidence sources. This is not the main evidence for model quality. It is a system feasibility check.

The difference between sources is sharp. GPT retrieval succeeds for all 500 samples in both the 4-class and 2-class setups. Reverse image retrieval also has very high coverage, above 98% in both setups. Google search covers less than half of the samples. Fact-checking sites cover only about a fifth to just over a quarter of the samples: 22.2% in the 4-class setup and 27.6% in the 2-class setup.
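As a quick piece of arithmetic, the reported rates can be turned back into sample counts. This sketch uses only the two sources with exact figures above; the dictionary keys are illustrative names, not identifiers from the paper.

```python
SAMPLES = 500  # evaluation size per setup, as stated in the article

# Only exactly reported rates are listed; reverse image and Google search
# are given in the article as bounds, so they are left out here.
COVERAGE = {
    "4-class": {"gpt_search": 1.000, "fact_check": 0.222},
    "2-class": {"gpt_search": 1.000, "fact_check": 0.276},
}


def covered_count(rate, n=SAMPLES):
    """Number of samples for which a source returned some evidence."""
    return round(rate * n)


if __name__ == "__main__":
    for setup, sources in COVERAGE.items():
        for source, rate in sources.items():
            hit = covered_count(rate)
            print(f"{setup} {source}: covered={hit}, uncovered={SAMPLES - hit}")
```

Fact-checking evidence therefore exists for 111 of 500 samples in the 4-class setup and 138 of 500 in the 2-class setup, which leaves the large majority of items without it.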

This tells us something a leaderboard table alone would miss. Fact-checking sites may be high-quality evidence when they exist, but they do not exist for most samples. Reverse image search is widely available, but it can be expensive in tokens and may return provenance rather than direct claim verification. GPT search gives broad coverage and concise summaries, but the model’s own confidence in GPT search is not the same as independent factual certainty. Google search sits awkwardly in the middle: useful in principle, noisy and incomplete in practice.

So the retrieval chart is best read as an operational supply map. It answers: “Can the system usually find something?” It does not answer: “Does the something make the model right?”

That second question belongs to the classification results.

Combined evidence wins, but the win is not evenly distributed

The strongest overall pattern is that combining all four external sources produces the best accuracy and macro F1 in the 4-class setting.

In the 4-class Chain-of-Draft setup, the combined evidence condition reaches 70.40% accuracy and 71.89 F1, with zero rejection. In the Chain-of-Thought setup, combined evidence reaches 69.60% accuracy and 71.01 F1, also with zero rejection.

Those numbers are not miraculous. They are not “solved climate misinformation,” because apparently we still live on Earth. But they are meaningful because the 4-class task is genuinely harder than binary triage. The model must decide whether a claim is accurate, misleading, false, or unverifiable while interpreting both image and text.

Here is the compact version of the 4-class result:

Evidence setting	CoT accuracy (%)	CoT F1	CoD accuracy (%)	CoD F1	Interpretation
Internal only	63.80	68.35	68.20	70.98	Surprisingly competitive baseline
Fact-check sites	66.60	68.18	62.80	62.55	High-quality source, low coverage
Google search	60.60	62.29	58.40	59.20	Weakest single source
Reverse image	62.00	62.32	60.80	60.49	Useful provenance, not enough alone
GPT search	65.40	61.88	69.00	66.86	Strong coverage, high model confidence
Combined	69.60	71.01	70.40	71.89	Best overall, zero rejection

The obvious but wrong conclusion would be: “External evidence improves performance.” The more accurate conclusion is: “A coordinated evidence bundle improves performance; individual evidence sources can distract or underperform.”

That is the paper’s most useful finding.

The internal-only model is not weak. In the 4-class CoD setup, internal-only GPT-4o reaches 68.20% accuracy and 70.98 F1. That is better than fact-check-only, Google-only, reverse-image-only, and GPT-search-only on F1. In other words, giving the model one external source does not automatically ground it. Sometimes it may simply add partial context, retrieval noise, or irrelevant evidence that encourages a worse judgment.
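The comparison against the internal-only baseline is easy to check by hand. The sketch below copies the 4-class figures from the table above and computes each setting's F1 gap to internal-only; the function name and column indices are illustrative choices.

```python
# Columns: (CoT accuracy, CoT F1, CoD accuracy, CoD F1), from the 4-class table.
RESULTS = {
    "Internal only": (63.80, 68.35, 68.20, 70.98),
    "Fact-check sites": (66.60, 68.18, 62.80, 62.55),
    "Google search": (60.60, 62.29, 58.40, 59.20),
    "Reverse image": (62.00, 62.32, 60.80, 60.49),
    "GPT search": (65.40, 61.88, 69.00, 66.86),
    "Combined": (69.60, 71.01, 70.40, 71.89),
}
COT_F1, COD_F1 = 1, 3


def delta_vs_internal(results, column):
    """Difference between each evidence setting and the internal-only baseline."""
    base = results["Internal only"][column]
    return {
        name: round(values[column] - base, 2)
        for name, values in results.items()
        if name != "Internal only"
    }


if __name__ == "__main__":
    for title, column in (("CoT F1", COT_F1), ("CoD F1", COD_F1)):
        print(title, delta_vs_internal(RESULTS, column))
```

On these table values, every single source trails the internal-only F1 under both prompting setups, and only the combined bundle comes out ahead (by 2.66 points under CoT and 0.91 under CoD).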

This is the core correction to the common “just add RAG” instinct. Retrieval-augmented generation is not a virtue by itself. Retrieval can also be a very expensive way to confuse a model with receipts for the wrong purchase.

The 2-class setting shows the business appeal — and the measurement trap

In the 2-class setup, the combined evidence condition again performs best on accuracy: 86.45% with CoT and 86.20% with CoD, both with zero rejection. F1 is also around 86.

For a business user, this looks attractive. A binary classifier can support a practical review queue. Let the system flag climate-related posts as either broadly accurate or needing review; send the risky items to analysts; preserve human attention for high-stakes cases.

But the 2-class results need careful reading. In the CoT 2-class setup, the internal-only model has 85.40% accuracy and 88.16 F1, with a 6.2% rejection rate. The combined setup has slightly higher accuracy and zero rejection, but its F1 (around 86) is lower than the internal-only 88.16. In the CoD 2-class setup, internal-only performance collapses to 66.80% accuracy with a 28.8% rejection rate, while combined evidence reaches 86.20% accuracy and 86.02 F1.

So the combined setup is operationally attractive not only because it improves classification. It also removes rejection. In workflows where every item must be triaged, “I cannot decide” becomes its own cost. It creates manual backlog, delays response, and makes dashboards look precise while quietly leaking work to humans.

Still, zero rejection is not the same as zero uncertainty. It means the model returned a valid label for every sample. It does not mean every label was correct, nor that the underlying evidence was complete.

A useful business reading is therefore:

Paper result	Direct meaning	Business meaning	Boundary
Combined evidence gets the best 4-class performance	Multiple sources help GPT-4o classify image–claim pairs	Evidence orchestration beats single-source prompting	Tested on 500 model-labeled samples
Some single sources underperform internal-only reasoning	Retrieval can introduce weak or partial evidence	Source quality and routing matter	Does not identify every failure mechanism
Combined evidence has zero rejection	The model always returns a valid label	Fewer unresolved cases in review queues	A valid answer can still be wrong
2-class performance reaches about 86% accuracy/F1	Binary triage is easier than detailed diagnosis	Suitable for first-pass filtering	Not enough for final public claims or legal decisions

That last boundary is not a minor footnote. A binary system can tell an analyst where to look. It should not become the final arbiter of public truth unless the organization enjoys reputational risk as a service.

The confusion matrices show where the model still struggles

The combined-source confusion matrices are not just decorative heatmaps. They are error analysis.

In the 4-class setting, the model handles Accurate and Misleading cases relatively well, and False cases look manageable because there are fewer of them. The difficult category is Unverifiable. Under the combined CoD setup, 53 of 119 Unverifiable samples are correctly classified, while 62 are predicted as Accurate. Under CoT, only 40 of 119 Unverifiable samples are correctly classified, and 70 are predicted as Accurate.
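Converting those counts into rates makes the pattern harder to wave away. This is a small sketch using only the Unverifiable-row figures quoted above; the dictionary layout is an illustrative choice.

```python
UNVERIFIABLE_TOTAL = 119  # Unverifiable samples in the 4-class evaluation set

# Counts quoted in the article for the combined-evidence condition.
COMBINED = {
    "CoD": {"correct": 53, "predicted_accurate": 62},
    "CoT": {"correct": 40, "predicted_accurate": 70},
}


def as_percent(counts, total=UNVERIFIABLE_TOTAL):
    """Express each count as a percentage of all Unverifiable samples."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for prompt_style, counts in COMBINED.items():
        print(prompt_style, as_percent(counts))
```

Under CoD, 44.5% of Unverifiable samples are recovered and 52.1% are labeled Accurate; under CoT, the split is 33.6% against 58.8%. In both cases the most likely fate of an Unverifiable item is to be called Accurate.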

That is a serious pattern. It means the model often converts insufficient evidence into apparent accuracy. For climate disinformation, this is exactly the dangerous failure mode. The absence of contradiction is not proof of truth. A vague, sarcastic, or context-poor post may be impossible to verify, but a model under pressure to decide can still give it a clean label.

This matters more than the aggregate score. In real moderation, public communication, or brand-risk analysis, “Unverifiable predicted as Accurate” is not just another confusion-matrix cell. It is the system blessing ambiguity.

The 2-class confusion matrices tell a related story. Under combined CoT, 260 of 273 Accurate samples are correctly classified, but 55 of 227 Disinformation samples are incorrectly labeled Accurate. Under combined CoD, 257 of 273 Accurate samples are correct, while 53 of 227 Disinformation samples are mislabeled Accurate.

In other words, the model errs toward leniency: it rarely flags accurate content as disinformation, but it lets a meaningful share of problematic content through. Depending on the workflow, that may or may not be acceptable. A platform safety team may prefer to reduce false negatives. A public institution may prefer to avoid false accusations. A brand-risk dashboard may need adjustable thresholds and human review at the boundary.
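The trade-off can be quantified from the 2-class counts quoted above. The sketch assumes that, with zero rejection, all 500 samples appear in the confusion matrix; the function and metric names are illustrative.

```python
def binary_report(accurate_correct, accurate_total, disinfo_missed, disinfo_total):
    """Summarize a 2x2 confusion matrix, treating Disinformation as the positive class."""
    disinfo_caught = disinfo_total - disinfo_missed
    accurate_flagged = accurate_total - accurate_correct
    total = accurate_total + disinfo_total
    return {
        "accuracy": round(100 * (accurate_correct + disinfo_caught) / total, 2),
        "disinfo_recall": round(100 * disinfo_caught / disinfo_total, 1),
        "disinfo_precision": round(
            100 * disinfo_caught / (disinfo_caught + accurate_flagged), 1
        ),
        "false_alarm_rate": round(100 * accurate_flagged / accurate_total, 1),
    }


if __name__ == "__main__":
    print("CoT", binary_report(260, 273, 55, 227))
    print("CoD", binary_report(257, 273, 53, 227))
```

From these counts, roughly three-quarters of Disinformation samples are caught (75.8% under CoT, 76.7% under CoD), while only about 5 to 6% of Accurate samples are wrongly flagged. The recomputed accuracies (86.40 and 86.20) sit close to the reported figures; the small CoT gap against 86.45% is left unexplained here.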

The paper does not solve that policy choice. It gives evidence that the choice exists.

Chain-of-Draft is mostly an efficiency variant here

The paper compares Chain-of-Thought and Chain-of-Draft prompting. The authors describe CoD as generating multiple reasoning drafts, evaluating them, and selecting the most coherent explanation before assigning a label. CoT, by contrast, uses explicit step-by-step reasoning.

The result is not a dramatic reasoning-method victory. In the 4-class setting, CoD with combined evidence is slightly better than CoT: 70.40% accuracy and 71.89 F1 versus 69.60% accuracy and 71.01 F1. In the 2-class setting, CoT is slightly ahead on accuracy: 86.45% versus 86.20%.

The more consistent difference is cost. Across sources, CoT generally uses more tokens than CoD. In the combined setup, CoD uses about 2,025,000 tokens for 500 prompts, with an average prompt length of 3964.5 and average time of 5.10 seconds. CoT uses about 2,048,050 tokens, average prompt length of 4007.5, and average time of 5.74 seconds.

This makes the CoT/CoD comparison best interpreted as an efficiency and prompting-variant test, not as a second thesis about reasoning. The main thesis remains evidence orchestration. CoD’s advantage is modest but useful: similar or slightly better performance in the hard 4-class setting, with somewhat lower token and time cost.

The uncomfortable number is the combined-source cost. More than two million tokens for 500 samples is not shocking in a research experiment, but production systems do not run on vibes and conference applause. If a company wants to monitor thousands or millions of posts, the combined-evidence pipeline cannot be applied blindly to everything. It needs triage before triage: cheap filters first, expensive evidence bundles only when the item is high-risk, high-reach, or decision-relevant.
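A back-of-the-envelope projection shows why. The token totals below are the ones reported for the combined setup; the monitoring volume and the `escalation_rate` parameter are hypothetical planning inputs, not figures from the paper.

```python
SAMPLES = 500
TOTAL_TOKENS = {"CoD": 2_025_000, "CoT": 2_048_050}  # combined-evidence totals


def tokens_per_item(total_tokens, n=SAMPLES):
    """Average tokens consumed per evaluated item."""
    return total_tokens / n


def projected_tokens(total_tokens, volume, escalation_rate=1.0):
    """Tokens needed if `escalation_rate` of `volume` items get the full bundle.

    Assumption: per-item cost in production matches the research average.
    """
    return tokens_per_item(total_tokens) * volume * escalation_rate


if __name__ == "__main__":
    volume = 1_000_000  # hypothetical monthly post volume
    for style, total in TOTAL_TOKENS.items():
        per_item = tokens_per_item(total)
        everything = projected_tokens(total, volume)
        escalated = projected_tokens(total, volume, escalation_rate=0.05)
        print(f"{style}: {per_item:.1f} tokens/item, "
              f"all={everything:,.0f}, 5% escalated={escalated:,.0f}")
```

At about 4,050 tokens per item under CoD, a million posts would need roughly 4.05 billion tokens if every one received the combined bundle, against about 202.5 million if only 5% were escalated. The escalation rate, not the prompt style, is the lever that matters.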

The real architecture is an evidence-routing layer

The paper’s method uses four types of external knowledge:

fact-checking sites;
GPT-based web previews;
reverse image search;
Google search.

The authors’ combined method conditionally includes evidence in a prioritized order: fact-checking sources first, then GPT search, then reverse image search, then Google search. That design choice is more important than it may appear. It is a move away from “retrieve everything” and toward evidence routing.
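One way to picture that prioritized, conditional inclusion is the sketch below. It assumes "conditionally" means "include a source only when its retrieval returned something"; the source keys, the bracketed tag format, and the function name are illustrative, not the authors' implementation.

```python
# Priority order as described in the article.
PRIORITY = ("fact_check", "gpt_search", "reverse_image", "google_search")


def build_context(evidence, priority=PRIORITY):
    """Join available evidence snippets in priority order, skipping empty sources.

    `evidence` maps a source key to retrieved text, or to None/"" when
    retrieval failed for that source.
    """
    parts = []
    for source in priority:
        text = (evidence.get(source) or "").strip()
        if text:
            parts.append(f"[{source}] {text}")
    return "\n".join(parts)


if __name__ == "__main__":
    retrieved = {
        "google_search": "Snippet from a general web result.",
        "fact_check": None,  # no fact-check article found for this claim
        "gpt_search": "Short summary of related reporting.",
    }
    print(build_context(retrieved))
```

Because fact-checks are rare, most contexts built this way will open with the GPT summary, so the ordering matters most for the minority of claims a fact-checker has already covered.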

For business systems, each source has a different operational role:

Evidence source	Best operational use	Failure mode	Cost implication
Fact-checking sites	High-confidence claim verification	Low coverage	Cheap when available
GPT web preview	Broad claim context and quick summaries	May create overconfidence in summarized evidence	Medium coverage-cost balance
Reverse image search	Provenance, reuse, out-of-context detection	Expensive and not always claim-specific	High token load
Google search	General external context	Noisy or weak retrieval	Moderate cost, variable value
Internal model knowledge	Fast baseline judgment	Outdated or unsupported certainty	Cheapest starting point

This is where the paper becomes useful for Cognaptus-style automation work. The product idea is not “GPT-4o plus search.” It is a layered system:

Image + claim
   ↓
Cheap internal screening
   ↓
Risk scoring: topic, reach, novelty, claim specificity
   ↓
Evidence routing:
   - known claim? fact-check first
   - suspicious image reuse? reverse image search
   - novel claim? GPT/web preview and targeted search
   - ambiguous item? send to human review
   ↓
Model verdict with source-aware explanation
   ↓
Audit log and review queue

The routing layer is where ROI lives. It reduces unnecessary retrieval, preserves expensive evidence gathering for the cases that need it, and makes the model’s answer more auditable.
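The diagram above can be sketched as a small routing function. Everything here is an assumption for illustration: the `Item` fields, the 0.5 risk threshold, and the step names are hypothetical, and the paper itself does not implement a router like this.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Signals produced by cheap screening (all fields are hypothetical)."""
    risk_score: float            # 0.0 to 1.0, from topic, reach, novelty, specificity
    known_claim: bool = False
    image_reuse_suspected: bool = False
    novel_claim: bool = False
    ambiguous: bool = False


def route(item, risk_threshold=0.5):
    """Pick evidence-gathering steps for one image-claim pair."""
    # Low-risk items keep the cheap internal judgment and skip retrieval.
    if item.risk_score < risk_threshold:
        return ["internal_only"]
    # Ambiguous high-risk items go to a person rather than to more retrieval.
    if item.ambiguous:
        return ["human_review"]
    steps = []
    if item.known_claim:
        steps.append("fact_check")
    if item.image_reuse_suspected:
        steps.append("reverse_image")
    if item.novel_claim:
        steps.extend(["gpt_search", "targeted_search"])
    return steps or ["internal_only"]


if __name__ == "__main__":
    print(route(Item(risk_score=0.2, novel_claim=True)))
    print(route(Item(risk_score=0.9, known_claim=True, image_reuse_suspected=True)))
    print(route(Item(risk_score=0.9, ambiguous=True)))
```

The design choice worth noticing is that the expensive branches are reachable only after the risk check, so retrieval spend scales with the number of escalated items rather than with total volume.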

A human analyst does not check every claim the same way. They ask: Is this a reused image? Is the text making a scientific claim? Is the claim recent? Has a fact-checker already covered it? Is the item too vague to verify? A production AI system should do the same, preferably without the analyst having to watch it learn common sense one failed ticket at a time.

What the paper directly shows, and what Cognaptus infers

The paper directly shows that, on its constructed 500-sample evaluation sets, GPT-4o performs best overall when it receives a combined context from multiple external sources. It also shows that single evidence sources vary substantially, and that retrieval coverage differs sharply across source types. Finally, it shows that prompt strategy affects token use and running time, while performance differences between CoT and CoD are modest.

Cognaptus infers three business lessons.

First, evidence quality is a workflow problem, not only a model problem. A larger vision-language model may recognize a chart, meme, or image more effectively, but factuality still depends on whether the system can find the right external context. For climate communication, provenance and claim verification are separate jobs. A picture of a flood may be real and still be attached to the wrong event.

Second, binary triage is the near-term product shape. The 4-class taxonomy is analytically richer, but many organizations will start with Accurate versus Needs Review. The important design choice is to keep the detailed labels internally, even if the dashboard displays a simpler flag. Otherwise, Unverifiable gets flattened into Disinformation, and the organization loses the ability to distinguish uncertainty from falsehood.

Third, cost control must be part of the architecture from the beginning. Combined evidence is best, but it is also the heaviest. A sensible deployment would not run full reverse-image plus multi-source evidence retrieval on every low-reach post. It would escalate only when the item’s potential harm, audience size, novelty, or business relevance justifies the extra cost.

That is not glamorous. It is also how useful systems are usually built.

Boundaries that affect practical use

The paper is valuable, but its boundaries matter.

The most important boundary is labeling. The factuality labels were automatically extracted using GPT-4o with multiple prompts and majority voting. The process is thoughtful, but it is not the same as independent expert labeling for every evaluation sample. This means the reported scores should be interpreted as performance against a model-assisted benchmark, not as final evidence of real-world fact-checking accuracy.

The second boundary is scale. Each setup uses 500 labeled samples. That is enough to reveal patterns, but not enough to claim broad coverage across all climate narratives, languages, geographies, platforms, or adversarial formats. Climate disinformation is not one genre. It includes policy distortion, cherry-picked charts, old disaster images, manipulated screenshots, sarcastic memes, and synthetic media. A neat benchmark will always be cleaner than the internet, which is the internet’s main crime.

The third boundary is source governance. The paper compares source types, but a production system must decide which fact-checking organizations, search results, expert sources, and web summaries are trusted. “External evidence” is not a neutral category. It is a supply chain. Bad sourcing turns retrieval into institutionalized rumor laundering.

The fourth boundary is the Unverifiable class. The model’s difficulty with Unverifiable samples is a practical warning. Systems should allow uncertainty to remain uncertainty. Forcing a crisp label may make dashboards look cleaner while making decisions worse.

The practical takeaway: make the receipts legible

The paper’s best result is not simply that combined evidence improves GPT-4o’s performance. The deeper point is that multimodal misinformation detection needs a receipt system.

A claim needs factual support. An image needs provenance. A model needs source-aware reasoning. A business workflow needs cost-aware escalation. A reviewer needs an audit trail. These are different needs. One prompt cannot magically satisfy all of them just because it contains a URL and a confident paragraph.

For organizations building climate-risk intelligence, ESG monitoring, media verification, or public communication tools, the winning design is likely to look less like a chatbot and more like an evidence router with a vision-language model inside it. The model reads the image and claim. The router decides which receipts matter. The system returns not just a label, but a reasoned, source-linked, uncertainty-aware explanation.
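What "a label plus receipts" might look like as a data structure is sketched below. The `Verdict` fields, the review rule, and the JSON audit format are hypothetical choices, meant only to show that the label, the sources, and the uncertainty can travel together.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Verdict:
    """One model decision with the evidence it relied on (hypothetical schema)."""
    label: str
    rationale: str
    sources: list = field(default_factory=list)
    needs_human_review: bool = False


def finalize(label, rationale, sources):
    """Build a verdict, flagging it for review when the receipts are thin.

    Assumption: Unverifiable labels and labels with no cited source
    should not be auto-published.
    """
    review = label == "Unverifiable" or not sources
    return Verdict(label, rationale, list(sources), review)


def audit_line(verdict):
    """Serialize a verdict as one JSON line for an append-only audit log."""
    return json.dumps(asdict(verdict), sort_keys=True)


if __name__ == "__main__":
    v = finalize("Accurate", "Image matches the event named in the claim.", [])
    print(audit_line(v))
```

In this sketch an Accurate label with no cited source still lands in the review queue, which is one simple way to stop the system from blessing ambiguity.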

Seeing is misleading. Receipts help. But only if the system knows which receipts to ask for.

Cognaptus: Automate the Present, Incubate the Future.

¹ Marzieh Adeli Shamsabad and Hamed Ghodrati, “Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources,” arXiv:2601.16108, 2026.
