When ESG Meets LLM: Decoding Corporate Green Talk on Social Media

A corporate sustainability post rarely says, “Please admire our reputational risk management.”

It says something friendlier. A tree-planting day. A Pride Month banner. A smiling volunteer team. A solar panel photographed at just the right angle. A line about communities, innovation, opportunity, resilience, or the future. The usual words, freshly laundered.

The analytical problem is that these posts are not random fluff. They are corporate communication at scale, and they are increasingly multimodal: text, hashtags, brand imagery, infographics, event photos, symbolic gestures, and occasionally something resembling an operational fact. Reading them one by one is theatre. Ignoring them is also a choice, just not a very intelligent one.

A 2025 University of Amsterdam paper, Analyzing Sustainability Messaging in Large-Scale Corporate Social Media, offers a more useful middle path: use large language models and vision-language models to turn this messy stream of corporate green talk into a structured object of analysis.¹ Not a lie detector. Not a magical greenwashing tribunal. A measurement pipeline.

That distinction matters. The paper does not prove that companies are insincere. It does not establish causal manipulation. It does not look inside the executive soul, which is probably merciful for everyone involved. What it does show is that foundation models can identify sustainability-related themes in corporate social media at scale, connect those patterns to ESG-risk scores and engagement metrics, and surface visual motifs that would be difficult to predefine manually.

The contribution is not “AI finds greenwashing.” The contribution is subtler and more operational: AI makes corporate sustainability messaging auditable as a dataset.

The real mechanism is annotation without a bespoke annotation army

Traditional content analysis has an unglamorous bottleneck: labels. If researchers want to know whether corporate tweets align with the UN Sustainable Development Goals, someone has to classify them. If the corpus contains a few thousand posts, that is annoying. If it contains more than a million posts, it becomes an organisational lifestyle choice.

The paper starts from this problem. Corporate sustainability language is broad, contextual, and constantly shifting. A company can reference gender equality without saying “SDG 5.” It can talk about infrastructure, public health, clean energy, or community resilience without using the official vocabulary of the Sustainable Development Goals. A keyword system would miss much of the point. A supervised classifier would require a labelled dataset that may age badly as corporate language evolves.

So the authors use a zero-shot LLM ensemble as an ad hoc classifier.

Each tweet is assigned to one of 18 categories: the 17 SDGs or “None.” The ensemble consists of Qwen2.5, Mistral NeMo, and Phi-4. Each model receives the same classification prompt. The outputs are constrained to fixed labels, and if the generated response contains unwanted reasoning or disclaimers, the system conservatively treats it as “None.” The final label comes from majority voting, with Qwen2.5 used as the tie-breaker because it performs best in evaluation.
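
The voting rule described above is simple enough to sketch. The following is a minimal illustration, not the authors' code: the label strings, the model keys, and the exact-match rule for rejecting malformed responses are all assumptions made for the example.

```python
from collections import Counter

# Label set: the 17 SDGs plus "None" (label strings are illustrative).
LABELS = {f"SDG {i}" for i in range(1, 18)} | {"None"}


def normalize(raw_output: str) -> str:
    """Map a raw model response to a valid label, defaulting to "None"."""
    candidate = raw_output.strip()
    # Anything that is not exactly a known label (extra reasoning,
    # disclaimers, empty output) is conservatively treated as "None".
    return candidate if candidate in LABELS else "None"


def ensemble_label(outputs: dict, tie_breaker: str = "qwen2.5") -> str:
    """Majority vote over per-model outputs; the tie-breaker decides ties."""
    votes = {model: normalize(raw) for model, raw in outputs.items()}
    counts = Counter(votes.values())
    top_count = max(counts.values())
    winners = [label for label, n in counts.items() if n == top_count]
    if len(winners) == 1:
        return winners[0]
    # With three voters a tie means all three disagree, so the
    # tie-breaking model's own vote is used.
    return votes[tie_breaker]


print(ensemble_label({"qwen2.5": "SDG 5", "mistral-nemo": "SDG 5", "phi-4": "SDG 8"}))
```

Note the side effect of the conservative fallback: a model that explains itself instead of answering effectively votes "None", which biases the ensemble toward under-labelling rather than over-labelling.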

This is not an exotic architecture. That is partly why it matters. The system uses foundation models as flexible annotators rather than training a bespoke classifier from scratch. The point is not that LLMs know corporate virtue when they see it. The point is that they can map ambiguous short-form communication to a stable taxonomy with enough consistency to support large-scale analysis.

The authors evaluate this text classifier using 6,310 tweets containing hashtags that map clearly to specific SDGs. Hashtags are not perfect ground truth; they are a proxy. Still, they provide a practical validation signal. The majority-vote ensemble reaches 82.09% agreement and a Cohen’s Kappa of 0.79 against the hashtag-derived labels, outperforming each individual model. Qwen2.5 alone reaches 80.21% agreement and 0.77 Kappa; Mistral NeMo is close at 79.81% and 0.76; Phi-4 is lower at 73.80% and 0.70.
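
For readers who want to see why Kappa sits below raw agreement, here is a small sketch of both metrics. It uses the standard definition of Cohen's Kappa; the function name and the toy labels are illustrative, and nothing here reproduces the paper's evaluation data.

```python
from collections import Counter


def agreement_and_kappa(predicted, reference):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    observed = sum(p == r for p, r in zip(predicted, reference)) / n
    pred_freq = Counter(predicted)
    ref_freq = Counter(reference)
    # Agreement expected by chance, from the two marginal label distributions.
    expected = sum(pred_freq[label] * ref_freq[label] for label in pred_freq) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # degenerate case: both sides use a single label
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy example: ensemble labels vs hashtag-derived labels.
obs, kappa = agreement_and_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"])
print(obs, kappa)
```

Kappa discounts the agreement two annotators would reach by chance given how often each uses each label, which is why it is the more conservative of the two numbers.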

That result should be read correctly. It says the ensemble is reasonably reliable for broad thematic labelling under this proxy evaluation. It does not say the system understands corporate intent. It does not say hashtags are the Platonic form of SDG truth. It says the classifier is good enough to support the next layer of analysis, with the usual bruises and caveats attached.

Component	Likely purpose in the paper	What it supports	What it does not prove
Hashtag-based evaluation	Classifier validation	The LLM ensemble can map SDG-related tweets to SDG labels with moderate-to-strong agreement	That every implicit sustainability claim is correctly interpreted
Majority voting across LLMs	Reliability improvement	Ensembling reduces dependence on one model family	That LLM outputs are free from systematic bias
Spearman correlations with ESG risk	Main empirical evidence	Sector-specific associations between messaging emphasis and ESG-risk scores	Causal links between messaging and real-world ESG performance
Mann-Whitney filtering for visual clusters	Statistical selection of visual themes	Some visual motifs differ meaningfully from sector baselines in risk or engagement	That the image theme caused the risk or engagement difference
Engagement comparison for SDG vs non-SDG tweets	Negative evidence	SDG-relevant tweets do not clearly outperform non-SDG tweets in likes or retweets	That audiences dislike sustainability messaging

The table is not decoration. It is the difference between analysis and vibes wearing a lab coat.

The visual layer is where the paper becomes more interesting

The text classifier answers a familiar question: what are companies talking about?

The visual pipeline asks a stranger and more useful one: what kinds of images cluster around companies with different ESG-risk and engagement profiles?

Instead of applying the SDG classifier directly to images, the authors take a bottom-up route. They embed images using a vision-language model from the OpenCLIP ecosystem, remove duplicates using perceptual hashing, filter for SDG-relevant visual content, and cluster the resulting image embeddings. The clustering is graph-based: images are connected by cosine similarity, candidate clusters require a minimum number of close neighbours, and overlapping clusters are resolved in favour of larger ones. In the reported implementation, clusters require at least 50 images and a cosine similarity threshold of 0.75.
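
One plausible reading of that clustering step can be sketched in a few lines. This is an interpretation under stated assumptions, not the paper's implementation: it treats each image's candidate cluster as the image plus every neighbour above the similarity threshold, and it resolves overlaps greedily by letting larger candidates claim contested images first. The pairwise loop is quadratic and only suitable for toy inputs; a corpus of this size would need an approximate nearest-neighbour index.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_embeddings(embeddings, threshold=0.75, min_size=50):
    """Greedy neighbourhood clustering over a cosine-similarity graph."""
    n = len(embeddings)
    # Each image's candidate cluster is itself plus all close neighbours.
    neighbourhoods = [
        {j for j in range(n) if cosine(embeddings[i], embeddings[j]) >= threshold}
        for i in range(n)
    ]
    candidates = sorted(
        (nb for nb in neighbourhoods if len(nb) >= min_size),
        key=len,
        reverse=True,
    )
    clusters, assigned = [], set()
    for candidate in candidates:
        remaining = candidate - assigned  # larger clusters keep contested images
        if len(remaining) >= min_size:
            clusters.append(sorted(remaining))
            assigned |= remaining
    return clusters


# Toy 2-D "embeddings": three vectors near one axis, two near the other.
points = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9]]
print(cluster_embeddings(points, threshold=0.75, min_size=2))
```

The defaults mirror the reported settings (threshold 0.75, minimum of 50 images); the toy call lowers `min_size` only so that five points can form clusters at all.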

That choice is important. A top-down approach would start with predefined labels such as “solar panels,” “diversity event,” or “trees” and search for them. The paper instead lets the image space form its own clusters, then asks whether those clusters deviate from the sector baseline in ESG risk or engagement.

This is the mechanism-first heart of the study. The authors are not merely counting green words. They are building a way to discover shared corporate visual themes, then measuring whether those themes appear disproportionately among firms with higher or lower ESG risk, or among posts with higher or lower engagement.

The selected clusters must be interpretable, shared across multiple companies, and statistically separated from the background population. To avoid mistaking one company’s repetitive brand imagery for an industry-wide motif, the authors use normalized Shannon entropy and distinct-company counts to assess whether a cluster is broadly represented. They also compare each cluster’s median ESG risk and engagement against the sector population using Mann-Whitney U tests.
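
The two cluster-level checks can be illustrated with a short sketch. Two assumptions are worth flagging: the entropy here is normalized by the log of the number of companies present in the cluster (the paper may normalize differently), and "deviation" is taken to mean cluster median minus sector median. The function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def company_spread(image_owners):
    """Return (distinct companies, normalized Shannon entropy in [0, 1])."""
    counts = Counter(image_owners)
    k = len(counts)
    if k <= 1:
        return k, 0.0  # a single-company cluster has no spread
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Dividing by log(k) scales the result so 1.0 means a perfectly even split.
    return k, entropy / math.log(k)


def median_deviation(cluster_values, sector_values):
    """Cluster median minus sector median (positive = above the baseline)."""
    return statistics.median(cluster_values) - statistics.median(sector_values)


# One company dominating a cluster yields low normalized entropy.
print(company_spread(["acme"] * 9 + ["globex"]))
print(median_deviation([30, 40, 50], [20, 25, 30, 35]))
```

A cluster with many images but entropy near zero is mostly one brand talking to itself, which is exactly the case the filter is meant to exclude.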

Then, because nobody wants to manually inspect endless near-identical corporate event photos unless legally required, they sample representative images and use Qwen2.5-VL to generate concise summaries and concept lists.

The result is a pipeline that turns images into thematic signals. Again, not proof of motive. But a useful diagnostic surface.

The dataset is large enough to make the patterns worth noticing

The paper applies the pipeline to GreenScreen-derived corporate social media data: 1,374,049 tweets and 715,081 accompanying images from 537 active corporate Twitter accounts, covering January 1, 2017 to December 13, 2022. The companies are mapped to GICS sectors, and ESG-risk scores are retrieved from Sustainalytics using stock symbols.

This scale matters because corporate sustainability communication is noisy. One campaign can distort a small sample. One viral event can dominate a week. One brand’s posting habits can masquerade as a sector trend. A million-post corpus does not remove all bias, but it gives the analysis room to separate recurring patterns from isolated performance art.

Across the dataset, the authors find that SDG-relevant content has become more embedded in corporate communication over time. Total tweet volume and SDG-relevant tweet volume move together, but the proportion of SDG-related tweets rises steadily from 2017 onward. This is not a shocking finding. Anyone with a LinkedIn account and a tolerance for corporate PDFs could have guessed the direction. The value is that the paper quantifies it at scale.
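
The distinction between volume and proportion is easy to blur, so a tiny sketch may help. It assumes each tweet has already been reduced to a `(year, label)` pair, with the string "None" marking non-SDG posts; both the input shape and the function name are illustrative.

```python
from collections import defaultdict


def yearly_sdg_share(tweets):
    """tweets: iterable of (year, label) -> {year: share of non-"None" labels}."""
    totals, relevant = defaultdict(int), defaultdict(int)
    for year, label in tweets:
        totals[year] += 1
        if label != "None":
            relevant[year] += 1
    return {year: relevant[year] / totals[year] for year in sorted(totals)}


sample = [(2017, "None"), (2017, "SDG 8"), (2018, "SDG 8"), (2018, "SDG 9")]
print(yearly_sdg_share(sample))
```

Tracking the share rather than the raw count is what separates "companies tweet more" from "a larger fraction of what companies tweet is SDG-relevant".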

The distribution across SDGs is more revealing. Companies do not communicate evenly across the 17 goals. SDG 8, Decent Work and Economic Growth, and SDG 9, Industry, Innovation and Infrastructure, appear consistently among the most frequent themes. Climate Action, by contrast, does not show the same broad increase. During the early COVID-19 period, roughly from the end of 2019 through the third quarter of 2020, SDG 3, Good Health and Well-being, and SDG 8 spike, reflecting the obvious corporate preoccupation with health and economic uncertainty.

The picture is sector-specific. Information Technology, Financials, and Industrials lean heavily toward work, growth, innovation, and infrastructure. Utilities allocate more attention to clean energy. Healthcare unsurprisingly centres health and well-being, while also touching education and gender equality. Sectors facing environmental exposure do not necessarily flood the zone with direct environmental protection themes. Often, they favour economic and social positives.

That is where interpretation becomes more delicate. The paper suggests that companies may highlight SDGs that align with public narratives or operational strengths while downplaying harder topics. This is plausible. It is also not the same as proving deception. A firm may speak more about decent work because labour is its genuine priority, because climate topics are legally sensitive, because communications staff follow a campaign calendar, or because executives discovered that trees photograph better than supply-chain accounting. The dataset can reveal pattern; it cannot subpoena intention.

ESG-risk correlations show selective sustainability talk

The authors then correlate each company’s proportion of SDG-focused tweets with its ESG-risk score, using Spearman rank correlations within sectors. This is one of the paper’s main empirical moves. It links communication emphasis to an external proxy for sustainability-related risk.
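
As a rough sketch of that computation, the following implements Spearman's rho as the Pearson correlation of average ranks and applies it sector by sector. The row format `(sector, sdg_share, esg_risk)` and the function names are assumptions for illustration; in practice a statistics library would also supply p-values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else float("nan")


def sector_correlations(rows):
    """rows: (sector, sdg_share, esg_risk) per company -> rho per sector."""
    by_sector = {}
    for sector, share, risk in rows:
        xs, ys = by_sector.setdefault(sector, ([], []))
        xs.append(share)
        ys.append(risk)
    return {sector: spearman(xs, ys) for sector, (xs, ys) in by_sector.items()}


print(spearman([1, 2, 3], [1, 3, 2]))
```

Because the statistic works on ranks, it captures monotonic association without assuming that tweet share and risk score are linearly related, which suits bounded and skewed measures like these.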

Several sector-level results stand out.

In Energy, higher ESG risk correlates positively with more communication around Climate Action ($0.5789$) and Gender Equality ($0.4825$). The first association is intuitive: higher-risk energy companies face direct pressure around climate. The second is more interesting because gender equality is socially positive and reputationally safe. It allows an energy firm to speak the language of sustainability without necessarily dwelling on emissions, extraction, or transition risk. Convenient, that.

In Financials, higher ESG risk is negatively correlated with Industry, Innovation and Infrastructure ($-0.3432$) and Affordable and Clean Energy ($-0.2895$), but positively correlated with No Poverty ($0.3301$), Decent Work and Economic Growth ($0.3037$), and Quality Education ($0.2960$). This suggests that higher-risk financial firms may communicate more around social and economic inclusion themes while paying less attention to infrastructure and clean-energy framing.

Healthcare shows a mixed profile: higher-risk firms communicate more on Good Health and Well-being ($0.3579$), but less on Clean Water and Sanitation ($-0.4191$), Responsible Consumption and Production ($-0.3895$), and Industry, Innovation and Infrastructure ($-0.3128$). In Industrials, several associations are negative, but the one with Industry, Innovation and Infrastructure is positive ($0.2747$). In Materials, ESG risk correlates positively with Decent Work and Economic Growth ($0.3404$) and negatively with Responsible Consumption and Production ($-0.3336$). In Utilities, it correlates positively with Sustainable Cities and Communities ($0.4427$).

The important reading is not “riskier companies talk more about ESG.” That is too blunt. The better reading is that riskier companies appear to talk selectively. They emphasize some SDG themes and avoid others, and the selection differs by sector.

That is business-relevant because selective sustainability language is precisely where communication risk lives. A company’s ESG narrative may be technically active but strategically evasive. The risk is not silence. The risk is beautifully organised partial disclosure in social-media form.

The visual clusters expose the soft-focus strategy

The paper’s most memorable evidence comes from two illustrative sector studies: Materials and Financials.

In the Materials sector, clusters associated with higher ESG risk contain visual themes such as community gardening, environmental conservation, volunteerism, planting trees, outdoor work, and sustainable practices. One cluster has a risk deviation of $+11.62$, includes 15 distinct companies, and has a normalized entropy of $0.64$. Another, centred on farming and agricultural landscapes, shows a risk deviation of $+8.28$.

Then comes the contrast. A cluster showing oil refineries, petrochemical plants, power facilities, pipelines, manufacturing sites, and industrial landscapes has a negative risk deviation of $-2.15$ and a positive engagement deviation of $+9.00$.

That is an elegant little discomfort. Higher-risk firms in Materials are more associated with nature and community imagery. Lower-risk firms are more comfortable showing industrial operations. The interpretation practically writes itself, although it should still be handled with gloves: firms carrying higher ESG risk may lean on softer visual themes that are adjacent to sustainability but less connected to their core industrial activity.

In Financials, the pattern shifts but the mechanism rhymes. A cluster around food distribution, home construction, clothing donation, disaster relief, and community support shows a risk deviation of $+7.19$ and an engagement deviation of $+2.00$. A cluster featuring financial institutions, stock exchanges, small-business celebrations, corporate events, and professional gatherings shows risk deviation of $+5.66$ and engagement deviation of $+18.00$. A Pride Month and LGBTQ+ support cluster shows risk deviation of $+3.32$ and engagement deviation of $+15.00$.

The Financials case is not about environmental imagery. It is about socially resonant imagery: inclusion, community support, celebration, local benefit. These themes may be sincere. They may also be reputationally useful. The method cannot tell which. It can tell that these motifs are not randomly distributed.

This is where the article’s old, tempting phrase — “algorithmic greenwashing detection” — needs discipline. The pipeline does not detect greenwashing directly. It detects patterned gaps between communication motifs, ESG-risk proxies, and engagement outcomes. That is still valuable. In business, early warning systems usually begin as structured suspicion, not courtroom evidence.

Sustainability posts do not automatically win attention

One of the paper’s useful negative findings is that SDG-relevant tweets do not clearly outperform non-SDG tweets on engagement.

After removing outliers using Tukey’s fences, SDG-relevant tweets have a mean of 4.27 likes and median of 3.00, compared with 8.19 mean likes and 3.00 median likes for non-SDG-relevant tweets. For retweets, SDG-relevant posts have a mean of 1.93 and median of 1.00, compared with 3.06 mean and 2.00 median for non-SDG-relevant posts. The authors use Mann-Whitney U tests and do not find statistically significant evidence that SDG-relevant posts receive higher engagement.
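
The two statistical tools named here can be sketched in plain Python. Several choices below are assumptions rather than details from the paper: the conventional fence multiplier of 1.5, the quartile method (Python's default), a one-sided alternative, and a normal approximation without tie correction. The pairwise U computation is quadratic and meant only to make the definition visible.

```python
import math
import statistics


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a beats b, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def one_sided_p(a, b):
    """Approximate p-value for H1: values in a tend to exceed values in b."""
    n1, n2 = len(a), len(b)
    u = mann_whitney_u(a, b)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal


likes_sdg = tukey_filter([1, 2, 3, 4, 5, 100])  # the viral outlier is dropped
print(likes_sdg)
print(one_sided_p([1, 2, 3], [4, 5, 6]))
```

A rank-based test is a sensible fit for like and retweet counts, which are heavily skewed: it asks whether one group tends to outrank the other rather than comparing means that a handful of posts can dominate.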

This matters because it undercuts a lazy marketing interpretation. Companies are not necessarily posting sustainability content because it reliably generates more likes and retweets. At least in this dataset, sustainability content is not an obvious engagement machine.

So why post it? Because engagement is only one objective. Corporate social media is not merely a popularity contest; it is stakeholder theatre, investor signalling, recruitment branding, regulator-facing atmospherics, and reputational maintenance compressed into short-form content. Some posts are not designed to go viral. They are designed to exist, to be discoverable, and to complete the narrative record.

The paper does not test those motives directly. But the lack of engagement advantage makes the strategic-communication interpretation more plausible than a simple “green posts get clicks” story.

The business value is communications-risk intelligence, not ESG prophecy

For companies, the practical lesson is uncomfortable but manageable: sustainability communication is now machine-readable. The social-media archive can be audited not only for what it says explicitly, but for how its themes shift across time, sector, risk profile, and image choice.

That creates several business use cases.

First, companies can benchmark their own sustainability messaging against sector peers. Are they overusing generic community imagery while undercommunicating operational changes? Are they talking about the SDGs that align with their real exposures, or only the ones that photograph well? Are their climate commitments visible in content, or buried beneath innovation slogans and smiling volunteers?

Second, investors and analysts can use similar pipelines as weak-signal tools. A mismatch between high ESG risk and soft symbolic imagery does not prove misconduct. It can, however, flag where deeper review is warranted. The value is triage: deciding where human analysts should spend attention.

Third, regulators and watchdogs can monitor narrative drift. If a sector increasingly speaks about social contribution while reducing direct references to harder environmental goals, that may indicate changing stakeholder pressure, defensive communication, or simply campaign fashion. None of these explanations should be assumed. All of them are worth noticing.

Fourth, communications teams can use the method defensively. Not to optimise green aesthetics — please, no more factory-approved saplings — but to check whether their public messaging accurately reflects operational substance. The reputational danger is not only saying too much. It is saying the easy part repeatedly while leaving the material part strangely absent.

A useful operating framework would separate four layers:

Layer	What the paper directly enables	Business interpretation	Boundary
Text theme mapping	Classifies posts into SDG-related categories	Shows what sustainability topics dominate communication	Depends on LLM labels and SDG taxonomy fit
Sector comparison	Compares theme distribution across industries	Reveals communication norms and omissions	Sector context affects meaning
ESG-risk association	Links theme proportions to ESG-risk scores	Flags selective emphasis among higher-risk firms	Correlation, not causation
Visual motif analysis	Finds image clusters with risk or engagement deviations	Surfaces symbolic communication patterns	Motifs require human interpretation

The last column is not legal padding. It is the difference between useful intelligence and analytical overreach.

The boundaries are where the tool becomes usable

The paper’s boundaries are specific enough to be operationally useful.

First, ESG risk is a proxy. Sustainalytics scores represent assessed exposure and management of ESG-related risk, not a complete truth source for corporate sustainability performance. Different ESG providers can disagree. ESG risk also varies by sector, business model, disclosure practices, and methodology. Treating it as an oracle would be very on-brand for the ESG industry, but still wrong.

Second, the classifier evaluation uses hashtags as proxy labels. Hashtags are practical, but they are not perfect annotations. Some tweets use hashtags for visibility rather than meaning. Some meaningful SDG-related content lacks hashtags entirely. The evaluation establishes workable agreement, not universal semantic certainty.

Third, the dataset covers corporate Twitter accounts from 2017 to 2022. That period matters. Twitter/X changed substantially after 2022, and corporate posting strategies may have shifted with platform dynamics, regulation, stakeholder pressure, and the general exhaustion of everyone asked to write “purpose-driven” copy for the eighth time that week.

Fourth, the visual examples are illustrative sector analyses, not universal laws. Materials and Financials show compelling patterns, but the exact motifs and meanings will differ across industries. Tree planting in Materials is not the same signal as food distribution in Financials.

Fifth, the method surfaces associations. It does not infer intent. A high-risk company posting community-garden photos may be performing reputational displacement. It may also be documenting a real community programme. The pipeline tells you where to look. It does not replace the looking.

That final point is the practical boundary for deployment. Human review remains necessary, especially before using these signals for compliance, investment decisions, public accusations, or corporate governance action. The model is an analyst’s instrument, not a judge.

The sharper lesson: green talk has become structured data

The paper’s lasting contribution is not the finding that companies talk about sustainability selectively. We already suspected that. Some of us suspected it with considerable enthusiasm.

The stronger contribution is methodological. The authors show how foundation models can transform high-volume, ambiguous, multimodal corporate communication into structured evidence: SDG labels, sector distributions, ESG-risk correlations, visual clusters, engagement deviations, and interpretable themes.

That changes the status of sustainability messaging. It is no longer just branding collateral floating through the social feed. It is a measurable behavioural trace.

For businesses, this means ESG communication will be judged less by isolated campaigns and more by longitudinal pattern. For investors, it means public messaging can become one more weak signal in risk analysis. For regulators and civil society, it means the gap between operational reality and symbolic narrative may become easier to detect, even if not automatically provable.

The irony is tidy. Companies used social media to make sustainability communication more immediate, emotional, and visually persuasive. Foundation models may now make that same communication more searchable, comparable, and accountable.

The green post is still a green post. But now it has metadata.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ujjwal Sharma, Stevan Rudinac, Ana Mićković, Willemijn van Dolen, and Marcel Worring, “Analyzing Sustainability Messaging in Large-Scale Corporate Social Media,” arXiv:2511.01550, 2025, https://arxiv.org/abs/2511.01550.
