Opening — Why this matters now

ESG is no longer a soft-power marketing exercise. Mandatory disclosures are tightening, regulators expect traceability, and investors want evidence rather than adjectives. The problem? ESG reports—hundreds of pages of slide-like layouts, drifting hierarchies, and orphaned charts—remain designed for optics, not analysis. Even advanced document models buckle under their chaotic reading order.

Into this disorder walks Pharos-ESG, a framework that doesn’t merely extract text from ESG disclosures—it reconstructs reading order, infers implicit hierarchies, grounds charts in context, and produces structured, labeled datasets suitable for financial research.

If ESG reports are the new infrastructure of sustainable finance, then systems like Pharos-ESG are the missing operating system.

Background — The pre-AI paralysis of ESG documents

For years, financial analysts have relied on proxies: third‑party ESG ratings, selective text snippets, or small-scale case studies. Not out of laziness, but necessity. The reports themselves were effectively unreadable at scale.

Two persistent problems defined the landscape:

  1. Chaotic visual layouts — slide-like pages mixing text, tables, images, and decorative elements, often without predictable sequencing.
  2. Implicit, inconsistent hierarchies — heading styles vary; numbering is optional; structure is implied rather than stated.

Traditional document AI models—LayoutLM, DocFormer, or OCR-driven parsers—excel on regular documents (legal forms, academic papers). ESG reports, however, are the wild frontier.

Analysis — What Pharos-ESG actually does

Pharos-ESG introduces a unified, multimodal pipeline designed specifically for long, visually irregular documents.

1. Reading-order modeling

Rather than guessing sequences top-to-bottom, left-to-right, Pharos-ESG uses a successor classification framework—computing pairwise relations between blocks via semantic, spatial, and categorical features. These form a directed graph, later topologically sorted into consistent reading order. The result: a coherent flow even when pages are more “pitch deck” than “report”.
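
The graph-then-sort idea can be made concrete with a minimal Python sketch. Everything specific here is an assumption. The learned successor classifier is replaced by a hypothetical toy rule, `precedes_prob`, which scores whether one block should come before another on a two-column page. The `Block` fields and the 0.5 threshold are also illustrative. Only the overall shape follows the description above: pairwise scores become directed edges, and a topological sort (here the standard-library `graphlib`) yields the order.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

@dataclass(frozen=True)
class Block:
    id: str
    x: float        # left edge, as a fraction of page width
    y: float        # top edge, as a fraction of page height
    category: str   # e.g. "heading", "text", "image"; unused by the toy rule

def precedes_prob(a, b):
    """Hypothetical stand-in for the learned pairwise classifier.

    Assumes a two-column page read column by column, top to bottom."""
    col_a, col_b = int(a.x >= 0.5), int(b.x >= 0.5)
    if col_a != col_b:
        return 0.9 if col_a < col_b else 0.1
    return 0.9 if a.y < b.y else 0.1

def reading_order(blocks, threshold=0.5):
    # Map each block id to the set of ids predicted to come before it.
    graph = {blk.id: set() for blk in blocks}
    for a in blocks:
        for b in blocks:
            if a is not b and precedes_prob(a, b) > threshold:
                graph[b.id].add(a.id)
    # static_order() raises graphlib.CycleError if predictions contradict each other.
    return list(TopologicalSorter(graph).static_order())

page = [
    Block("r_text", 0.55, 0.20, "text"),
    Block("title", 0.05, 0.05, "heading"),
    Block("l_text", 0.05, 0.30, "text"),
    Block("r_chart", 0.55, 0.60, "image"),
]
order = reading_order(page)  # ['title', 'l_text', 'r_text', 'r_chart']
```

Real pairwise predictions can be inconsistent, and in that case `static_order()` raises `graphlib.CycleError`. A production system would need some cycle-resolution step at that point.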

2. Hierarchical structure reconstruction via ToC anchors

Most ESG reports include a Table of Contents (ToC), but it does not necessarily match the body’s layout or wording. Pharos-ESG’s RAP (Region-Aware Prompting) reconstructs the ToC hierarchy using:

  • color similarity,
  • spatial grouping,
  • cross-region visual cues,
  • multimodal LLM reasoning.
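
RAP’s multimodal LLM reasoning cannot be reduced to a few lines, but the first two cues can be illustrated. The sketch below is a hypothetical illustration, not RAP itself. It groups ToC entries into visual styles by font color and indentation, then ranks the styles so that less-indented ones sit higher in the hierarchy. The `TocEntry` fields and both tolerances are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class TocEntry:
    text: str
    color: tuple   # (R, G, B) font color, 0-255
    indent: float  # left offset of the entry, in points

def infer_levels(entries, color_tol=40.0, indent_tol=8.0):
    """Assign a hierarchy level to each ToC entry from color and indentation cues."""
    styles = []     # one (color, indent) prototype per visual style
    style_ids = []
    for e in entries:
        for sid, (color, indent) in enumerate(styles):
            # Same style if both color distance and indentation are within tolerance.
            if math.dist(color, e.color) <= color_tol and abs(indent - e.indent) <= indent_tol:
                style_ids.append(sid)
                break
        else:
            styles.append((e.color, e.indent))
            style_ids.append(len(styles) - 1)
    # Less-indented styles rank higher; ties go to the style seen first.
    ranked = sorted(range(len(styles)), key=lambda sid: (styles[sid][1], sid))
    level = {sid: rank + 1 for rank, sid in enumerate(ranked)}
    return [(e.text, level[sid]) for e, sid in zip(entries, style_ids)]

toc = [
    TocEntry("Environment", (0, 110, 70), 0),
    TocEntry("Climate Action", (60, 60, 60), 20),
    TocEntry("Water Stewardship", (65, 62, 58), 21),
    TocEntry("Social", (0, 112, 72), 2),
    TocEntry("Employee Wellbeing", (58, 60, 61), 19),
]
levels = infer_levels(toc)
# [('Environment', 1), ('Climate Action', 2), ('Water Stewardship', 2),
#  ('Social', 1), ('Employee Wellbeing', 2)]
```

A greedy clustering like this can fail on heavily decorated ToC pages. That failure mode is a reasonable motivation for adding cross-region cues and LLM reasoning on top.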

ALIGN then matches these ToC anchors to the actual body text using exact, fuzzy, and context-driven matching. It does more than locate headings. Where a ToC entry has no counterpart in the body, it inserts the missing heading to repair the structure.
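
The exact-then-fuzzy part of that cascade can be sketched as follows. Python’s `difflib.SequenceMatcher` stands in for whatever similarity measure ALIGN actually uses. The normalization rules and the 0.85 threshold are assumptions. Entries that fall through are the ones that would go on to context-driven matching or heading insertion, and those later stages are not modeled here.

```python
import re
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase, drop leading section numbers like '1.' or '2.3', strip punctuation."""
    s = re.sub(r"^\s*\d+(?:\.\d+)*\.?\s+", "", title.lower())
    s = re.sub(r"[^\w\s]", " ", s)
    return " ".join(s.split())

def align(toc_titles, body_headings, fuzzy_threshold=0.85):
    body = [normalize(h) for h in body_headings]
    result = {}
    for title in toc_titles:
        t = normalize(title)
        if t in body:  # 1) exact match after normalization
            result[title] = ("exact", body.index(t))
            continue
        # 2) fuzzy match: best character-level similarity ratio
        scores = [SequenceMatcher(None, t, b).ratio() for b in body]
        best = max(range(len(body)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= fuzzy_threshold:
            result[title] = ("fuzzy", best)
        else:
            # 3) left for context-driven matching or heading insertion (not modeled)
            result[title] = ("unmatched", None)
    return result

toc = ["1. Climate Action", "Water Stewardship", "Community Investment"]
body = ["Climate Action", "Water Stewardsip", "Employee Wellbeing"]
matches = align(toc, body)
```

In this example, "1. Climate Action" matches exactly once numbering is stripped. The misspelled "Water Stewardsip" is caught by the fuzzy stage. "Community Investment" has no counterpart, so it is the kind of entry that would trigger heading insertion.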

3. Contextual image-to-text transformation

Charts and images rarely stand alone. Pharos-ESG aggregates visual blocks with their nearest headings and text, feeding the combined cluster through a multimodal generator (Qwen2.5-VL). Instead of generic descriptions, it outputs contextualized narratives linked to section themes.

For example (see page 6 of the paper), charts on carbon-free energy distribution are interpreted together with the surrounding text. This lets the system correctly infer the time period and operational scope they describe.
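
The aggregation step can be sketched under several simplifying assumptions. Each block on a page carries only a kind and a vertical position. The governing heading is taken to be the closest one above the image. Context text is limited to that heading’s section. The prompt wording and the `chart_01.png` reference are hypothetical, and the actual call to Qwen2.5-VL is omitted.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str     # "heading", "text", or "image"
    y: float      # top edge, as a fraction of page height
    content: str  # text, or an image reference for "image" blocks

def gather_context(image, blocks, max_texts=2):
    """Find the heading governing an image and the nearest text blocks in its section."""
    headings = [b for b in blocks if b.kind == "heading"]
    above = [h for h in headings if h.y <= image.y]
    heading = max(above, key=lambda h: h.y) if above else None
    lower = heading.y if heading else float("-inf")
    upper = min((h.y for h in headings if h.y > image.y), default=float("inf"))
    section_text = [b for b in blocks if b.kind == "text" and lower < b.y < upper]
    nearest = sorted(section_text, key=lambda b: abs(b.y - image.y))[:max_texts]
    return heading, nearest

def build_prompt(image, blocks):
    heading, texts = gather_context(image, blocks)
    lines = [f"Section: {heading.content if heading else 'unknown'}"]
    lines += [f"Nearby text: {t.content}" for t in texts]
    lines.append(f"Describe {image.content} in the context of this section, "
                 "including its time period and operational scope.")
    return "\n".join(lines)

blocks = [
    Block("heading", 0.05, "Carbon-Free Energy"),
    Block("text", 0.12, "Intro paragraph on regional energy sourcing"),
    Block("image", 0.30, "chart_01.png"),
    Block("text", 0.55, "Footnote on the reporting period"),
    Block("heading", 0.70, "Water Stewardship"),
    Block("text", 0.78, "Paragraph on water use"),
]
prompt = build_prompt(blocks[2], blocks)
```

Restricting context to the image’s own section keeps the "Water Stewardship" paragraph out of the prompt, even though it is physically close to the chart.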

4. Multi-level financial labels

Using MLPDH (Multi-Level Prediction with Document Hierarchy), Pharos-ESG assigns:

  • ESG category (E/S/G/N),
  • GRI indicator, and
  • sentiment.

This turns ambiguous prose into structured analytical signals.
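
Hierarchy can constrain multi-level labels through joint decoding. Rather than taking the top category and the top GRI indicator independently, the decoder picks the best pair that the taxonomy allows. The sketch below illustrates that idea and is not MLPDH’s published mechanism. The category-to-indicator mapping is a hypothetical subset, and the reading of "N" as non-ESG content is an assumption. Sentiment is left out. `hierarchy_logic_accuracy` reflects one plausible reading of the Hierarchy Logic Accuracy metric reported in the findings: the share of predictions whose indicator belongs to the predicted category.

```python
# Hypothetical taxonomy subset: ESG category -> allowed GRI indicators.
HIERARCHY = {
    "E": {"GRI 302", "GRI 305"},  # Energy, Emissions
    "S": {"GRI 403", "GRI 405"},  # Occupational health and safety, Diversity
    "G": {"GRI 205"},             # Anti-corruption
    "N": {"none"},                # assumed here to mean non-ESG content
}

def consistent(category, indicator):
    return indicator in HIERARCHY.get(category, set())

def joint_decode(p_cat, p_ind):
    """Return the allowed (category, indicator) pair with the highest joint score."""
    pairs = [(c, i) for c, inds in HIERARCHY.items() for i in inds]
    return max(pairs, key=lambda pair: p_cat.get(pair[0], 0.0) * p_ind.get(pair[1], 0.0))

def hierarchy_logic_accuracy(predictions):
    """Share of (category, indicator) predictions that respect the taxonomy."""
    return sum(consistent(c, i) for c, i in predictions) / len(predictions)

p_cat = {"E": 0.55, "S": 0.45}
p_ind = {"GRI 403": 0.6, "GRI 305": 0.4}
# Independent argmax would give ("E", "GRI 403"), which the taxonomy forbids.
best = joint_decode(p_cat, p_ind)  # ("S", "GRI 403"), score 0.45 * 0.6 = 0.27
```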

Findings — How well does it perform?

Benchmarked against both document parsers and multimodal LLMs, Pharos-ESG consistently leads. Below is a distilled summary.

Table 1 — Parsing performance across systems

System class | Best parsing F1 | ROKT (reading order) | ToC-body alignment
Dedicated parsers (Docling, MinerU, Textin) | 82.55 | 0.80 | < 17%
Multimodal LLMs (GPT‑4o, Gemini 2.5, DeepSeek) | ~87.5 | 0.45–0.75 | < 65%
Pharos-ESG | 93.59 | 0.92 | 92.46%

Table 2 — Multi-level labeling accuracy

Model | Macro-F1 | Hierarchy logic accuracy
SVM / XGBoost | ~70 |
BERT / HAN / HMCN | 76–79 | 81–88%
MLPDH (Pharos-ESG) | 86.32 | 94.78%

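The pattern is consistent. Pharos-ESG leads on every reported metric, and the widest gap is structural: ToC-body alignment jumps from under 65% for the multimodal LLMs to 92.46%. That gap points to a different class of system rather than an incremental improvement.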

Cross-market robustness

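Pharos-ESG performs best on U.S. reports, which tend to be more standardized, on both parsing and labeling. Hong Kong reports have the lowest parsing F1 (89.05), and Chinese reports have the lowest label Macro-F1 (86.32). Even so, no market falls below 86 on either metric.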

Market | Parsing F1 | Macro-F1 (labels)
China | 92.04 | 86.32
Hong Kong | 89.05 | 87.20
United States | 94.30 | 87.60

Implications — What this means for business and regulators

1. ESG becomes machine-readable infrastructure

Once reports become structured data, ESG shifts from narrative-driven to signal-driven. Investors can analyze disclosure breadth, depth, tone, and consistency at scale.

2. Greenwashing detection becomes automatable

Pharos-ESG’s consistent GRI-level mapping enables:

  • year-over-year comparison,
  • cross-firm benchmarking,
  • identification of omissions,
  • tone–content discrepancy analysis.
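
Once every passage carries a (category, GRI indicator, sentiment) label, several of these checks reduce to set and count comparisons. The sketch below compares two years of one firm’s labels. It flags indicators that disappear (omissions), indicators whose coverage shrank, and shrinking indicators whose positive-tone share rose anyway, a crude form of tone-content discrepancy. The label format and the flagging rules are hypothetical.

```python
from collections import Counter

# Each label is (esg_category, gri_indicator, sentiment); the format is hypothetical.
def coverage(labels):
    """Count labeled passages per GRI indicator, ignoring non-ESG passages."""
    return Counter(ind for _, ind, _ in labels if ind != "none")

def positive_share(labels, indicator):
    tones = [s for _, ind, s in labels if ind == indicator]
    return sum(t == "positive" for t in tones) / len(tones) if tones else 0.0

def compare_years(prev, curr):
    prev_cov, curr_cov = coverage(prev), coverage(curr)
    shrunk = sorted(i for i in prev_cov if 0 < curr_cov[i] < prev_cov[i])
    return {
        "omitted": sorted(set(prev_cov) - set(curr_cov)),
        "added": sorted(set(curr_cov) - set(prev_cov)),
        "shrunk": shrunk,
        # Less disclosure but a more positive tone: worth a closer look.
        "tone_flags": [i for i in shrunk if positive_share(curr, i) > positive_share(prev, i)],
    }

prev = [("E", "GRI 305", "neutral"), ("E", "GRI 305", "negative"),
        ("S", "GRI 403", "neutral"), ("G", "GRI 205", "neutral")]
curr = [("E", "GRI 305", "positive"), ("E", "GRI 302", "positive"),
        ("G", "GRI 205", "neutral")]
report = compare_years(prev, curr)
```

In this toy pair of years, GRI 403 is dropped entirely. Emissions coverage (GRI 305) falls from two passages to one while its tone turns positive, so it lands in both `shrunk` and `tone_flags`.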

3. Regulators gain scalable auditability

Instead of manually reviewing hundreds of pages, an AI pipeline can:

  • summarize key deviations,
  • validate disclosure alignment,
  • flag missing mandatory sections,
  • generate audit-ready structured summaries.

4. Enterprises face new competitive pressure

Once peers’ disclosures become quantifiable, ESG performance—and ESG communication strategy—will face empirical comparison.

Conclusion — Toward an AI-native ESG world

Pharos-ESG demonstrates a simple truth: the future of ESG reporting is not better design or longer narratives. It’s machine-aligned structure. Once ESG disclosures become parseable, comparable, and contextualized, sustainable finance shifts from symbolism to substance.

The invisible part of ESG—the data infrastructure—finally gets its lighthouse.

Cognaptus: Automate the Present, Incubate the Future.