Green Is the New Gray: When ESG Claims Meet Evidence

Greenwashing usually begins with a sentence that sounds harmless enough.

“We reduced emissions.” “Our operations are greener.” “This product supports a sustainable future.”

Very nice. Also very convenient. The problem is that none of these claims can be judged by grammatical confidence, public relations polish, or the warm glow of the word sustainable. A serious reviewer has to ask uglier questions: reduced compared with what year? Which scope of emissions? Which facility? Which product line? Is the claim about a target, an initiative, or actual measured performance?

That is where many AI systems become dangerously smooth. A generic large language model can explain greenwashing fluently, define it politely, and produce a plausible verdict. What it often lacks is the thing auditors, regulators, investors, and journalists actually need: a traceable evidence trail.

The paper behind EmeraldMind starts from that practical irritation. It proposes a knowledge-graph-augmented RAG framework for greenwashing detection, designed not merely to classify sustainability claims, but to classify them with evidence-backed justifications — or abstain when the evidence is insufficient.¹ That last option matters. In ESG review, “I cannot verify this claim” is not a weakness. It is often the only responsible answer.

The claim is not the hard part; the evidence trail is

The tempting misconception is to treat greenwashing detection as sentiment analysis with a regulatory haircut. Positive environmental language? Suspicious. Concrete KPI? Probably safer. Vague promise? Maybe greenwashing. Add a few examples, prompt the model carefully, and enjoy the dashboard.

That is not the problem EmeraldMind is solving.

Greenwashing is not just a textual style. It is a mismatch between a claim and the evidence that should support it. A company may make a claim that is literally true but misleadingly scoped. Another may report progress toward a target while quietly changing the baseline. A third may describe a policy as if it were a measured outcome. The difference between those cases is not visible from the claim alone.

The paper defines the task in a stricter way: given a textual sustainability claim, determine whether it constitutes greenwashing, provide a fact-based justification, and abstain if evidence is insufficient. That formulation is more operational than glamorous. Good. Glamour is how we got half of the ESG communications industry in the first place.

The mechanism-first lesson is this: before the model can reason well, the organization has to build an evidence environment where the right facts can be found, compared, and cited.

EmeraldMind starts before the LLM is allowed to speak

EmeraldMind separates the work into two phases.

First, it constructs evidence stores from ESG reports, KPI definitions, and greenwashing claim examples. Second, it uses those stores to ground and assess new sustainability claims. This separation is not architectural decoration. It is the whole point.

Most weak RAG systems behave like nervous interns: retrieve some chunks, paste them into a prompt, and hope the model connects the dots. EmeraldMind instead builds two complementary evidence layers:

Evidence layer	What it stores	Why it matters
EmeraldDB	Vectorized ESG report chunks with metadata such as company, year, report, and page	Preserves textual context and provenance for document retrieval
EmeraldGraph	A company-centered property graph of organizations, facilities, KPI observations, goals, initiatives, claims, and relations	Converts ESG disclosure into structured reasoning paths

EmeraldDB is the familiar RAG component: ESG reports are segmented into passages, embedded, and retrieved by semantic similarity. The paper uses 250-token chunks with 50-token overlap. That is the document-memory side of the system.
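
The chunking step is easy to picture in code. What follows is a minimal sketch, not the authors' implementation: it assumes the text is already tokenized, and the function name `chunk_tokens` and the dummy token list are invented for illustration. Only the 250/50 window sizes come from the paper as described above.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 250, overlap: int = 50) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each chunk starts 200 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Dummy tokens stand in for a real tokenizer's output (assumption).
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap exists so that a sentence straddling a boundary still appears whole in at least one chunk. In a real store, each chunk would also carry the company, year, report, and page metadata listed in the table above.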

EmeraldGraph is the more interesting piece. It is a structured graph that distinguishes entities and relations that ESG prose often blends together: companies, facilities, goals, KPI observations, sustainability claims, locations, materials, and initiatives. A claim about emissions reduction should not be mixed up with a target, a policy statement, or a facility-level observation. If that sounds obvious, congratulations, you have already surpassed many corporate sustainability pages.

The graph is constrained by a schema synthesized from three sources: patterns observed in parsed ESG reports, a regulatory KPI sub-schema, and claim patterns abstracted from known greenwashing examples. Candidate triples extracted by an LLM are admitted only if their types are valid under the schema. This is an important design choice. The system does not simply ask an LLM to “understand ESG.” It forces extracted facts into a controlled structure before those facts become evidence.
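
The admission rule can be sketched as a type check on each candidate triple. This is an illustration under stated assumptions: `Organization`, `KPIObservation`, and `reportsKPI` are names that appear in the article, while `Goal`, `Facility`, `setsGoal`, `operatesFacility`, and the `Triple` and `admit` helpers are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    subject_type: str
    relation: str
    object_type: str
    subject: str
    obj: str


# Hypothetical schema fragment: allowed (subject type, relation, object type) patterns.
ALLOWED: Set[Tuple[str, str, str]] = {
    ("Organization", "reportsKPI", "KPIObservation"),
    ("Organization", "setsGoal", "Goal"),
    ("Organization", "operatesFacility", "Facility"),
}


def admit(candidates: List[Triple]) -> List[Triple]:
    """Keep only triples whose type signature is valid under the schema."""
    return [t for t in candidates
            if (t.subject_type, t.relation, t.object_type) in ALLOWED]


candidates = [
    Triple("Organization", "reportsKPI", "KPIObservation",
           "Company X", "Scope 1 emissions 2023"),
    # A goal attached through a KPI relation is rejected, so an ambition
    # cannot enter the graph dressed as a measurement.
    Triple("Organization", "reportsKPI", "Goal",
           "Company X", "Net zero by 2040"),
]
admitted = admit(candidates)
print(admitted)
```

The check is deliberately dumb. It does not ask whether the extracted fact is true, only whether it is the kind of fact the schema permits in that position.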

That is the difference between AI as a narrator and AI as a clerk with filing rules. The clerk is less charming. The clerk is also less likely to confuse an ambition with an achievement.

The graph is company-centered because ESG evidence is local

One of the useful empirical details in the paper is that EmeraldGraph is not a dense web of universal sustainability knowledge. It is a sparse, company-centered graph built from 37 publicly available ESG reports.

The resulting graph contains 53,748 entities and 59,344 relationships. Its average total degree is 2.21. Organization nodes act as hubs, while many other nodes sit in local neighborhoods around specific companies. KPIObservation is the dominant entity type, with 24,809 nodes, or roughly 46% of all entities. The reportsKPI relationship appears 24,832 times, about 42% of all relationships.
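
These figures can be checked with a few lines of arithmetic. The sketch below assumes that "average total degree" means in-degree plus out-degree averaged over all nodes, which is twice the edge count divided by the node count. That reading is an assumption, though it does reproduce the reported 2.21.

```python
entities = 53_748
relationships = 59_344
kpi_observation_nodes = 24_809
reports_kpi_edges = 24_832

# Every relationship adds one to the degree of each of its two endpoints.
avg_total_degree = 2 * relationships / entities
kpi_node_share = kpi_observation_nodes / entities
reports_kpi_share = reports_kpi_edges / relationships

print(f"average total degree: {avg_total_degree:.2f}")
print(f"KPIObservation share of entities: {kpi_node_share:.1%}")
print(f"reportsKPI share of relationships: {reports_kpi_share:.1%}")
```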

That structure tells us something important about the domain. ESG verification is usually not a global reasoning problem. It is a localized audit problem. The relevant evidence for a claim about Company X is normally not “all climate knowledge.” It is the slice of evidence attached to Company X: reported KPIs, targets, facilities, initiatives, dates, and supporting report passages.

This explains why EmeraldMind’s graph retrieval does not simply expand a large neighborhood around a company node. Naive expansion would either flood the context with loosely related ESG material or miss multi-hop evidence that happens to be deeper in the graph. Instead, the system grounds the claim, identifies relevant schema types, retrieves nodes of those types within a bounded neighborhood, ranks them by embedding similarity, and then adds shortest paths back to the company node.

The resulting evidence is not a pile of facts. It is a set of reasoning paths.
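
A toy version of that retrieval recipe is sketched below. Everything in it is illustrative: the five-node graph, the node names, and the word-overlap `similarity` function, which is a crude stand-in for the embedding similarity the paper describes. The part that mirrors the text is the sequence: bounded neighborhood, type filter, ranking, then paths back to the company node.

```python
from collections import deque
from typing import Dict, List, Set

# Toy undirected graph; node names, types, and texts are invented.
EDGES: Dict[str, List[str]] = {
    "CompanyX": ["FacilityA", "Goal2030", "KPI_scope1_2023"],
    "FacilityA": ["CompanyX", "KPI_flaring_2023"],
    "Goal2030": ["CompanyX"],
    "KPI_scope1_2023": ["CompanyX"],
    "KPI_flaring_2023": ["FacilityA"],
}
NODE_TYPE = {
    "CompanyX": "Organization", "FacilityA": "Facility", "Goal2030": "Goal",
    "KPI_scope1_2023": "KPIObservation", "KPI_flaring_2023": "KPIObservation",
}
NODE_TEXT = {
    "KPI_scope1_2023": "scope 1 co2 emissions 2023",
    "KPI_flaring_2023": "gas flaring volume 2023",
    "Goal2030": "reduce co2 emissions 50% by 2030",
}


def bfs_parents(source: str, max_hops: int) -> Dict[str, str]:
    """Breadth-first search bounded to `max_hops`; returns a parent link per reached node."""
    parents = {source: ""}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond the neighborhood bound
        for nxt in EDGES.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return parents


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, standing in for embedding similarity (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve_paths(company: str, claim: str, wanted_types: Set[str],
                   max_hops: int = 2, top_k: int = 2) -> List[List[str]]:
    parents = bfs_parents(company, max_hops)
    candidates = [n for n in parents if NODE_TYPE.get(n) in wanted_types]
    ranked = sorted(candidates,
                    key=lambda n: similarity(claim, NODE_TEXT.get(n, "")),
                    reverse=True)[:top_k]
    paths = []
    for node in ranked:
        path = [node]
        while path[-1] != company:  # BFS parent links trace a shortest path to the hub
            path.append(parents[path[-1]])
        paths.append(list(reversed(path)))
    return paths


paths = retrieve_paths("CompanyX", "reduced co2 emissions in 2023", {"KPIObservation"})
print(paths)
```

Note that the flaring observation only becomes reachable at two hops, through the facility. Tightening the bound to one hop drops it, which is the trade-off between flooding the context and missing deeper evidence.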

That distinction matters for business use. A compliance officer does not only need to know that an emission value appears somewhere in a report. She needs to know how that value connects to the claim, the company, the KPI, the year, and possibly a target. A graph path makes that connection inspectable.

Reasoning begins by grounding the claim

Once a claim arrives, EmeraldMind first performs claim grounding. It identifies the target company and extracts elements such as KPIs, numeric values, policy mentions, goals, and years. These elements are mapped into schema types and graph nodes.

For a claim like “Company X reduced its CO₂ emissions by 30% in 2023,” the system must recognize the company, the KPI, the claimed reduction, and the year. It must also understand that phrases like “flaring” may relate to CO₂ emissions rather than a generic safety incident. This is where ESG language becomes annoying in the professional sense: terminology is inconsistent, context-dependent, and often conveniently vague.
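
To make the grounding step concrete, here is a rule-based sketch of what gets pulled out of that example claim. It is a stand-in, not the paper's method, which is not specified here at this level of detail. The alias table, the `GroundedClaim` fields, and the `CO2Emissions` label are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical alias table mapping surface terms to one KPI label.
KPI_ALIASES = {
    "co2 emissions": "CO2Emissions",
    "co₂ emissions": "CO2Emissions",
    "flaring": "CO2Emissions",
}


@dataclass
class GroundedClaim:
    company: Optional[str]
    kpis: List[str]
    percentages: List[float]
    years: List[int]


def ground_claim(claim: str, known_companies: List[str]) -> GroundedClaim:
    """Extract the company, KPI labels, percentages, and years mentioned in a claim."""
    lowered = claim.lower()
    company = next((c for c in known_companies if c.lower() in lowered), None)
    kpis = sorted({label for term, label in KPI_ALIASES.items() if term in lowered})
    percentages = [float(p) for p in re.findall(r"(\d+(?:\.\d+)?)\s*%", claim)]
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", claim)]
    return GroundedClaim(company, kpis, percentages, years)


grounded = ground_claim("Company X reduced its CO₂ emissions by 30% in 2023.",
                        ["Company X"])
print(grounded)
```

A lookup table like this is exactly where inconsistent terminology bites: every synonym a report uses for the same KPI has to be mapped by someone or something before retrieval can work.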

After grounding, EmeraldMind retrieves two kinds of context.

The graph side, EM-KGRAG, extracts a compact claim-specific subgraph from EmeraldGraph. The document side, EM-RAG, retrieves relevant report chunks from EmeraldDB, restricted by the target company. A hybrid variant, EM-HYBRID, then uses the justifications produced by the graph and document pipelines as inputs to a judge model, selecting the better justification and corresponding label.

The three variants have different operational personalities:

Variant	Evidence used	Likely operational strength	Likely weakness
EM-RAG	ESG report text chunks	Broad textual coverage and strong explanation quality	May retrieve relevant prose without enforcing structured entity relations
EM-KGRAG	Graph paths from EmeraldGraph	Structured, objective links among companies, KPIs, goals, and observations	May miss useful context if extraction or graph retrieval is incomplete
EM-HYBRID	Justifications from EM-RAG and EM-KGRAG	Best overall classification performance in the reported experiments	Depends on judge quality and inherits errors from both upstream variants
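
The hybrid step can be pictured as a selection between two candidate verdicts. The sketch below is speculative in its details. The `Verdict` type, the fallback rule when one pipeline abstains, and `toy_judge` are all assumptions made for illustration. In the paper the judge is a model, not a keyword counter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    pipeline: str
    label: Optional[str]   # None means the pipeline abstained
    justification: str


def select_verdict(rag: Verdict, kg: Verdict,
                   judge: Callable[[str, str], float], claim: str) -> Verdict:
    """Return the candidate whose justification the judge scores higher.

    Assumption: if only one pipeline answers, its verdict is used;
    if both abstain, the hybrid abstains as well.
    """
    answered = [v for v in (rag, kg) if v.label is not None]
    if not answered:
        return Verdict("EM-HYBRID", None, "Insufficient evidence in both pipelines.")
    best = max(answered, key=lambda v: judge(claim, v.justification))
    return Verdict("EM-HYBRID", best.label, best.justification)


def toy_judge(claim: str, justification: str) -> float:
    """Placeholder judge: counts whether a page reference or KPI is cited."""
    return float(sum(marker in justification for marker in ("p.", "KPI")))


claim = "Company X reduced its CO2 emissions by 30% in 2023."
rag = Verdict("EM-RAG", "not_greenwashing",
              "The 2023 report, p. 41, states a 30% cut against the 2019 baseline.")
kg = Verdict("EM-KGRAG", "not_greenwashing", "Emissions went down.")
hybrid = select_verdict(rag, kg, toy_judge, claim)
print(hybrid.label, "|", hybrid.justification)
```

Whatever the judge is, the weakness in the table's last column shows up here plainly: the hybrid can only choose among what the two upstream pipelines hand it.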

This is the paper’s central mechanism. Build evidence. Ground the claim. Retrieve structured and textual context. Classify only after the model sees the evidence. Abstain when it cannot support a verdict.

The order is not negotiable. If the LLM speaks first and the evidence arrives later, you are not doing verification. You are doing post-production.

EmeraldData is a useful benchmark, not a production universe

The paper also introduces EmeraldData, a semi-synthetic benchmark of 620 sustainability claims. This addresses a real bottleneck: verified greenwashing datasets are scarce, legally sensitive, and expensive to build. The existing GreenClaims dataset has 91 claim samples, but only 51 were usable in the paper’s evaluation because corresponding ESG reports were available.

EmeraldData is constructed in four stages. The authors begin with 37 unique company-year pairs from GreenClaims, collect relevant news articles around those pairs, filter them for contextual relevance, then use an LLM to generate both truthful and refuting claims. The same model assigns labels and produces brief article-grounded justifications. The final dataset contains 225 greenwashing claims and 395 non-greenwashing claims.

This is useful, but it should be read correctly.

EmeraldData helps test whether a system can connect generated claims with evidence in ESG reports and related source material. It does not prove that the system is ready to adjudicate live corporate misconduct. Semi-synthetic data can scale evaluation, but it can also inherit the assumptions of the generation and labeling process. The authors are not pretending otherwise. The benchmark is a bridge over a data scarcity problem, not a substitute for human-verified regulatory case history.

This is where the result should be interpreted with discipline: strong benchmark performance supports the architecture’s promise, especially relative to generic baselines, but it does not eliminate the need for domain experts in real ESG review.

The main result is not accuracy; it is usable coverage

The classification results are easy to misread if one looks only at accuracy.

The baseline LLM often achieves very high conditional accuracy — that is, accuracy on the cases where it does not abstain. But it abstains so frequently that its practical usefulness collapses. The paper therefore reports accuracy, coverage, overall accuracy, and abstentions. Overall accuracy is effectively accuracy multiplied by coverage, treating abstentions as failures to deliver a usable decision.

That metric choice is not cosmetic. In high-stakes review systems, a model that answers only a tiny fraction of easy cases can look wonderful in a leaderboard and useless in a workflow. A greenwashing detector that refuses most claims may be safe, but it is not very helpful. A detector that answers everything by guessing is worse. The hard target is coverage with evidence.

The headline pattern is clear:

Dataset and prompt	Pipeline	Accuracy	Coverage	Overall accuracy	Abstentions
GreenClaims, zero-shot	Baseline	93.33%	29.41%	27.45%	36
GreenClaims, zero-shot	EM-HYBRID	89.47%	74.51%	66.67%	13
GreenClaims, few-shot	Baseline	100.00%	31.37%	31.37%	35
GreenClaims, few-shot	EM-HYBRID	92.31%	76.47%	70.59%	12
EmeraldData, zero-shot	Baseline	94.41%	25.97%	24.52%	459
EmeraldData, zero-shot	EM-HYBRID	85.78%	68.06%	58.39%	198
EmeraldData, few-shot	Baseline	94.21%	19.52%	18.39%	499
EmeraldData, few-shot	EM-HYBRID	83.80%	74.68%	62.58%	157

The baseline looks impressive until one notices how often it declines to decide. On EmeraldData with few-shot prompting, the baseline abstains on 499 out of 620 claims. That is not a greenwashing detector. That is a very articulate receptionist.
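
The relationship between the four columns can be verified on the GreenClaims zero-shot rows. In this sketch the totals and abstention counts come from the table and the 51 usable claims mentioned earlier. The correct-answer counts (14 and 34) are not reported in the article and are inferred here from the percentages. The function name `review_metrics` is illustrative.

```python
def review_metrics(total: int, abstained: int, correct: int) -> dict:
    """Compute conditional accuracy, coverage, and overall accuracy from raw counts."""
    answered = total - abstained
    accuracy = correct / answered if answered else 0.0  # only over answered claims
    coverage = answered / total
    return {
        "accuracy": accuracy,
        "coverage": coverage,
        "overall_accuracy": accuracy * coverage,  # same as correct / total
    }


# GreenClaims, zero-shot, 51 usable claims. Correct counts are inferred (assumption).
baseline = review_metrics(total=51, abstained=36, correct=14)
hybrid = review_metrics(total=51, abstained=13, correct=34)

for name, m in (("Baseline", baseline), ("EM-HYBRID", hybrid)):
    print(name, {k: f"{v:.2%}" for k, v in m.items()})
```

Seen as counts, the baseline is right on 14 of 51 claims and silent on 36. The hybrid is right on 34 of 51. The accuracy column alone hides that gap.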

EmeraldMind changes the trade-off. Its variants produce decisions for far more claims while maintaining competitive accuracy. The paper summarizes the improvement as 2–4 times higher coverage than the baseline, with the EmeraldMind variants staying in the 49–77% coverage range versus 19–31% for the baseline. EM-HYBRID performs best on overall accuracy, reaching 70.59% on GreenClaims under few-shot prompting and 62.58% on EmeraldData under few-shot prompting.

There is a cost: conditional accuracy can decline as coverage increases. That is expected. Once the system stops hiding behind abstention, it faces harder cases. The business question is not whether accuracy remains cosmetically perfect. It is whether the system delivers more evidence-backed decisions without degenerating into confident nonsense.

On that question, the paper’s evidence favors the architecture.

The justification tests are not decoration; they are the audit layer

The authors do not stop at classification metrics. They also evaluate justification quality, which is exactly the right instinct. ESG review is not only about whether a label is correct. It is about whether the reasoning can be inspected.

The paper uses ILORA, an explanation-quality evaluation method that scores each justification on a five-point scale across five criteria: informativeness, logicality, objectivity, readability, and accuracy. This is a single-answer grading setup with an LLM judge. The paper also runs a relative ranking setup in which a judge ranks justifications from the baseline, EM-RAG, and EM-KGRAG. EM-HYBRID is excluded from that relative evaluation because it is built by selecting between EM-RAG and EM-KGRAG justifications.

Here is the important interpretation:

Evaluation component	Likely purpose	What it supports	What it does not prove
Classification accuracy, coverage, overall accuracy	Main evidence	Evidence-grounded retrieval improves usable decision-making versus generic LLM prompting	Production readiness for live legal or regulatory decisions
EmeraldGraph statistics	Implementation and design evidence	The graph is company-centered and KPI-heavy, matching the retrieval strategy	That every extracted entity or relation is correct
ILORA single-answer grading	Explanation-quality evidence	EmeraldMind justifications are stronger than baseline justifications across evaluated criteria	That LLM judges equal expert human auditors
Relative Borda ranking	Comparative justification evidence	EM-RAG justifications rank above EM-KGRAG and baseline in the reported judge setup	That document retrieval is always superior to graph retrieval
Friedman and Nemenyi tests	Significance check for ranking differences	The observed justification ranking differences are unlikely to be random under the test setup	That the benchmark covers all real-world ESG claim types

The relative justification results are especially interesting. Across dataset-prompt combinations, the Borda scores are 3,856 for EM-RAG, 2,538 for EM-KGRAG, and 1,658 for the baseline. In simple terms: document retrieval produces the strongest justifications in the relative ranking, graph retrieval still beats the baseline, and the baseline is mostly there to remind everyone why unsupported fluency is not governance.
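
For readers who have not met Borda counting, a minimal sketch follows. It assumes the common scoring in which, among three candidates, first place earns 3 points and last place earns 1. The three rankings are made up for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Dict, List


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Sum Borda points over rankings that list systems best-first.

    Assumption: with n candidates, first place earns n points and last earns 1.
    """
    scores: Counter = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, system in enumerate(ranking):
            scores[system] += n - position
    return dict(scores)


# Three invented judge rankings.
rankings = [
    ["EM-RAG", "EM-KGRAG", "Baseline"],
    ["EM-RAG", "Baseline", "EM-KGRAG"],
    ["EM-KGRAG", "EM-RAG", "Baseline"],
]
scores = borda_scores(rankings)
print(scores)
```

As a sanity check on that assumption: the three reported scores sum to 8,052, which is what 6 points per ranking would give over 1,342 rankings, and 1,342 equals the 671 evaluated claims (51 plus 620) under two prompt settings. That is consistent with the 3-2-1 scheme, though it is an inference rather than something stated above.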

This does not contradict the value of the graph. It clarifies its role.

The graph helps impose structure, retrieve company-centered paths, and distinguish categories like targets, KPI observations, and claims. But when a judge evaluates the natural-language quality of the final explanation, rich report passages may provide more complete context than graph paths alone. The hybrid model’s advantage comes from treating those sources as complementary rather than forcing one to be the hero. Very un-Hollywood. Very sensible.

For business use, this is evidence operations, not ESG magic

The practical relevance of EmeraldMind is not that it “detects greenwashing with AI.” That phrase is too broad to be useful and too tempting for bad slide decks.

The practical relevance is narrower and stronger: it shows how an organization might build an evidence-grounded review layer for sustainability claims.

A communications team could use such a system before publishing a sustainability report or campaign. Claims that lack supporting evidence could be flagged before they become reputational liabilities. An investor or analyst could use a similar pipeline to screen ESG claims across portfolio companies, especially where report language is dense and KPI tables are scattered. An auditor or compliance team could use it to triage claims that require human review.

The business pathway looks like this:

Business workflow	What the paper directly shows	Cognaptus interpretation	Boundary
ESG claim review	Claims can be grounded against ESG report evidence using document and graph retrieval	Pre-publication claim checking can become a repeatable evidence workflow	Requires high-quality internal reports, source coverage, and human sign-off
Sustainability auditing	Verdicts can be paired with retrieved justifications and abstentions	AI can triage which claims need deeper audit attention	The paper does not replace audit standards or legal judgment
Investor due diligence	Company-centered evidence stores can support claim-by-claim review	Portfolio ESG screening can move beyond broad scores and into specific claim evidence	Public disclosures may be incomplete or strategically framed
Compliance monitoring	Abstention is built into the decision process	“Insufficient evidence” can be treated as a governance signal	Abstention policy must be calibrated to regulatory and business risk
AI governance	Explanation quality is evaluated separately from labels	Responsible AI systems should assess reasoning quality, not only output accuracy	LLM-as-judge evaluation still needs validation against expert review

The most useful business insight is that ESG automation should be designed around evidence provenance. That means storing where each passage came from, which year it refers to, which company it belongs to, which KPI it describes, and how it connects to the claim being evaluated.
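
In data-structure terms, that could look like the record below. It is a sketch of one possible shape, with hypothetical field names and an invented example passage. The point is only that every retrieved passage carries enough metadata to be cited and traced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    company: str
    report: str
    year: int
    page: int
    passage: str
    kpi: Optional[str] = None       # KPI the passage describes, if any
    claim_id: Optional[str] = None  # claim this evidence was retrieved for


def cite(record: EvidenceRecord) -> str:
    """Render a citation a human reviewer can follow back to the source page."""
    kpi = f", KPI: {record.kpi}" if record.kpi else ""
    return f"{record.company}, {record.report} ({record.year}), p. {record.page}{kpi}"


# Invented example values, for illustration only.
record = EvidenceRecord(
    company="Company X", report="Sustainability Report", year=2023, page=41,
    passage="Scope 1 emissions fell 30% against the 2019 baseline.",
    kpi="Scope 1 emissions", claim_id="claim-001",
)
citation = cite(record)
print(citation)
```

Making the record immutable is a small governance choice: evidence that can be silently edited after retrieval is not much of a trail.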

Without that, the system may still produce labels. It may even produce labels that look right. But it will not produce an audit trail. In sustainability review, the audit trail is not a feature. It is the product.

The limits are real, and they matter near deployment

The paper’s limitations do not weaken the mechanism. They define where the mechanism can responsibly travel.

First, the evidence base is limited. EmeraldGraph is built from 37 ESG reports. That is enough to demonstrate the architecture, but not enough to cover the full mess of real-world sustainability communication. Companies publish reports in different formats, use different KPI definitions, and sometimes bury the useful numbers in charts that seem designed by someone with a vendetta against parsers.

Second, the benchmark is semi-synthetic. EmeraldData is valuable because real greenwashing data is scarce, but LLM-generated claims and labels are not equivalent to regulator-verified cases. In production, the system would need stronger validation against expert annotations, legal records, regulatory actions, and historical cases.

Third, the pipeline itself depends on LLM-based extraction and LLM-based judging. That is not automatically disqualifying, but it creates a governance requirement. If the system extracts wrong triples, links entities poorly, or ranks justifications with a biased judge, downstream verdicts can inherit those errors. A neat graph does not guarantee true evidence. It guarantees structured evidence, which is not the same thing.

Fourth, ESG reports are not neutral ground truth. They are company-produced documents. An evidence system that relies heavily on those reports may verify whether a claim aligns with disclosed materials, but it may miss contradictions from external inspections, litigation, NGO investigations, satellite data, or regulatory filings. The authors point toward future incorporation of regulatory data and sustainability taxonomies, and that extension is not optional for serious deployment. It is the obvious next bill.

So the correct reading is not: EmeraldMind solves greenwashing.

The correct reading is: EmeraldMind shows a plausible architecture for turning greenwashing detection into evidence-grounded, abstention-aware claim review. That is already a meaningful step, because most AI systems in this area still confuse explanation-shaped text with explanation.

The better AI system is the one that knows when the file is empty

EmeraldMind’s strongest contribution is not a single metric. It is the workflow discipline.

It refuses to treat greenwashing as a vibe. It refuses to let the model answer before evidence is retrieved. It refuses to evaluate only the label while ignoring the justification. And, most importantly, it gives abstention a legitimate role. In ESG review, missing evidence is not an inconvenience to be smoothed over by a better prompt. It is often the finding.

For businesses, the lesson is direct. If an AI system is going to evaluate sustainability claims, the first investment should not be a larger model or a louder dashboard. It should be evidence infrastructure: clean report ingestion, KPI schemas, entity resolution, provenance tracking, graph paths, document retrieval, and human-review thresholds.

That is less exciting than saying “AI will end greenwashing.” It is also more useful.

Green is no longer a simple color in corporate reporting. It is a claim, a KPI, a baseline, a scope, a report passage, a graph edge, and sometimes a legal problem waiting patiently in the footnotes.

Gray, apparently, is where the work begins.

Cognaptus: Automate the Present, Incubate the Future.

¹ Georgios Kaoukis et al., “EmeraldMind: A Knowledge Graph–Augmented Framework for Greenwashing Detection,” arXiv:2512.11506, 2025, https://arxiv.org/abs/2512.11506.
