Project Description
The Greenhouse Gas Insights and Sustainability Tracking project (GIST) applies Natural Language Processing (NLP) and, in particular, Large Language Models (LLMs) to extract information from companies’ sustainability reports.
On the methodological side, the GIST project investigates, among other things, how design choices in the extraction pipeline, such as the choice of prompts or embeddings, affect model performance. Moreover, the GIST project has a strong focus on annotation quality and the interactions between machine and human annotators. In this context, we have, for instance, published a gold-standard Greenhouse Gas (GHG) emissions dataset that was validated through multiple annotation rounds.
On the substantive side, the GIST project develops open-source and transparent extraction pipelines that transform information from dispersed, unstructured and heterogeneous sources into structured data. This work has so far focused mainly on extracting comparable disclosures of companies' GHG emissions. In addition, other ongoing and planned workstreams will extract location-specific (i.e. physical-asset-level) disclosures and other climate- and nature-related indicators such as energy and water use.
Finally, the GIST project explores how NLP methods can be used to assess the quality of company-reported data and, relatedly, detect instances of misleading information that could constitute “greenwashing”.
Sub-project: Spatial Cues in Corporate ESG Disclosures
The Spatial Cues in Corporate ESG Disclosures sub-project develops methods and indicators to discover spatial information in companies' sustainability reporting, such as asset or factory locations, sites of environmental impact, or locations of best-practice examples. Drawing on research on “spatial finance” (Caldecott et al. 2022) and “planetary materiality” (Wassénius et al. 2024), we start from the observation that almost all sustainability-related risks and impacts occur at the level of the physical asset rather than at the legal-organizational level of the company.
To get an overview of the extent to which, and how, companies report on their assets, we develop a Named Entity Recognition (NER) pipeline that extracts location information from corporate reports. In addition, we qualitatively annotate location-specific disclosures in corporate reports. By combining NLP and qualitative methods, we seek to answer questions including:
- Are there sectoral and geographical differences regarding asset-level disclosures?
- How complete or selective are asset-level disclosures?
- In which formats (e.g. tables, boxes, main text, graphs) are asset-level data presented?
- What topics are described at the asset-level?
- Which indicators are disclosed at the asset level?
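The extraction step of the NER pipeline can be illustrated with a minimal sketch. This is not the project's actual pipeline: it uses a small hand-written gazetteer as a stand-in for a trained NER model, and the `LocationMention` type, the `GAZETTEER` contents, and the sample passage are all hypothetical.

```python
# Minimal sketch of location extraction from report text. A toy gazetteer
# stands in for a statistical NER model; names here are illustrative only.
from dataclasses import dataclass

@dataclass
class LocationMention:
    text: str   # surface form found in the report
    start: int  # character offset of the match
    end: int    # character offset one past the match

# Toy gazetteer; a real pipeline would use a trained NER model instead.
GAZETTEER = {"Hamburg", "Rotterdam", "Shenzhen"}

def extract_locations(passage: str) -> list[LocationMention]:
    """Return all gazetteer matches in the passage, with character offsets."""
    mentions = []
    for place in GAZETTEER:
        idx = passage.find(place)
        while idx != -1:
            mentions.append(LocationMention(place, idx, idx + len(place)))
            idx = passage.find(place, idx + 1)
    return sorted(mentions, key=lambda m: m.start)

report = "Our new battery plant in Hamburg complements the Rotterdam terminal."
for m in extract_locations(report):
    print(m.text, m.start, m.end)
```

Keeping character offsets alongside each mention makes it possible to link extracted locations back to their surrounding context, e.g. the format (table, box, main text) in which they appear.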
Apart from answering these questions, we create a database containing the extracted asset locations and their attributes (i.e. indicators related to the asset and its owner). This database will then enable geographical aggregation of spatially close assets with different owners, thus highlighting potential cumulative impacts and risks. In addition, we aim to combine the extracted data with other geographical data, including Earth-Observation-derived land-use indicators, socio-demographic and economic variables from official statistics, and local building policies and plans, in the GreenDIA platform (in progress).
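The idea of aggregating spatially close assets owned by different companies can be sketched as follows. This is an assumption-laden illustration, not the project's database logic: the asset tuples, the 10 km threshold, and the greedy single-link grouping are all hypothetical choices.

```python
# Sketch: group geocoded assets that lie within a distance threshold,
# regardless of owner. Coordinates and owners are hypothetical examples.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical extracted assets: (owner, latitude, longitude)
assets = [
    ("Company A", 53.55, 9.99),   # Hamburg area
    ("Company B", 53.57, 10.02),  # Hamburg area, different owner
    ("Company C", 48.14, 11.58),  # Munich area
]

def cluster_nearby(assets, max_km=10.0):
    """Greedy single-link grouping: an asset joins the first cluster
    that has any member within max_km; otherwise it starts a new one."""
    clusters = []
    for asset in assets:
        for cluster in clusters:
            if any(haversine_km(asset[1], asset[2], a[1], a[2]) <= max_km
                   for a in cluster):
                cluster.append(asset)
                break
        else:
            clusters.append([asset])
    return clusters
```

Here the two Hamburg-area assets end up in one cluster despite having different owners, which is exactly the kind of co-location that points to potential cumulative impacts.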
Sub-project: Typology of Data Quality Problems in ESG Reporting
The sub-project Typology of Data Quality Problems in ESG Reporting explores how the accuracy of company-reported GHG emission values can be categorized and assessed by human and LLM-based annotation pipelines. While most NLP research on data extraction from corporate reports is preoccupied with model accuracy, that is, extracting the values exactly as they are reported, in this sub-project we ask how trustworthy the reported values are as representations of a company's actual GHG emissions.
To achieve this, we propose a typology of 30 data quality problems. This typology integrates insights from academic literature on corporate disclosures, standards, calculation methodologies, and our own qualitative findings from annotation tasks. Derived from this typology, we provide a questionnaire at the report level and annotated examples at the text or screenshot level.
Although the typology is a work in progress, our goal is to integrate data quality problems as contextual attributes in datasets of extracted GHG values. Recording the presence and type of data quality problems in a report can offer users additional valuable information.
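Attaching data quality problems as contextual attributes could look like the following minimal sketch. The schema, field names, and the flag label are hypothetical illustrations, not the project's actual data model or typology categories.

```python
# Hypothetical schema for attaching data-quality flags to an extracted
# GHG value; field and flag names are illustrative, not the real typology.
from dataclasses import dataclass, field

@dataclass
class ExtractedGHGValue:
    company: str
    year: int
    scope: str                 # e.g. "Scope 1"
    value_tco2e: float         # value as reported, in tonnes CO2-equivalent
    quality_flags: list[str] = field(default_factory=list)

record = ExtractedGHGValue("Example AG", 2023, "Scope 1", 125_000.0)
# A downstream annotation step could append typology categories it detects:
record.quality_flags.append("restated-prior-year")
```

A user of the dataset could then filter or down-weight values whose reports carry certain flags, rather than treating all extracted numbers as equally trustworthy.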
To read the report: Click here (PDF, 1,919 KB)
Project Team
| Name | Email | Organization unit |
|---|---|---|
| Dimmelmeier, Andreas | a.dimmelmeier@stat.uni-muenchen.de | LMU |
| Beck, Jacob | jacob.beck@stat.uni-muenchen.de | LMU |
| Schierholz, Malte | malte.schierholz@stat.uni-muenchen.de | LMU |
| Kreuter, Frauke | soda@stat.uni-muenchen.de | LMU |
| Kormanyos, Emily | | Deutsche Bundesbank |
| Fraser, Alex | alexander.fraser@tum.de | TUM |
| Fehr, Maurice | | Deutsche Bundesbank |
| Domenech Burin, Laia | | LMU |
| Reichenbach, Lisa | | Deutsche Bundesbank |
| Oehler, Simon | | Deutsche Bundesbank |
| Sommer, Felicitas | felicitas.sommer@tum.de | TUM |
| Ahmed, Maaz | maaz.ahmed@campus.lmu.de | |
Publications
- Andreas Dimmelmeier, Hendrik Doll, Malte Schierholz, Emily Kormanyos, Maurice Fehr, Bolei Ma, Jacob Beck, Alexander Fraser, and Frauke Kreuter. 2024. Informing climate risk analysis using textual information - A research agenda. In Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024), pages 12–26, Bangkok, Thailand. Association for Computational Linguistics.