Gold Standard and Annotation Dataset for CO2 Emissions Annotation
Description
This repository contains the results of a research project which provides a benchmark dataset for extracting greenhouse gas emissions from corporate annual and sustainability reports.
The zipped datasets file contains two datasets, gold_standard and annotation_dataset (password is provided in the zip file).
Data collection
- A Large Language Model (LLM) based pipeline was used to extract the greenhouse gas emissions from the reports (see columns prefixed with llm_ in annotation_dataset). The extracted emissions follow the categories Scope 1, Scope 2 (market-based), Scope 2 (location-based) and Scope 3, as defined in the Greenhouse Gas Protocol (GHGP) (see variable scope).
- Annotation of the pipeline output was done in three phases: first by non-experts (see columns prefixed with non_expert_ in annotation_dataset), then by expert groups where the non-experts disagreed (columns prefixed with exp_group_ in annotation_dataset), and finally in a discussion among all experts where the expert groups disagreed (columns prefixed with exp__disc in annotation_dataset). The annotation guidelines for the non-experts and experts are also included in this repository.
- The annotation results from all three phases are combined to form the final benchmark dataset: gold_standard.
Codebooks detailing each variable of each of the two datasets are also provided. More details about the annotation template and the data wrangling scripts can be found in the GitHub repository.
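The three-phase escalation described above can be illustrated with a minimal Python sketch. It is not the project's own code (the data wrangling scripts live in the GitHub repository and are written in R). The function name resolve_annotation, the phase labels, and the rule that "agreement" means exactly equal values are all assumptions made for illustration.

```python
def resolve_annotation(non_expert_values, exp_group_values, exp_disc_value):
    """Pick the final value for one extracted emission figure.

    Hypothetical helper: escalates from non-experts to expert groups to the
    expert discussion, assuming agreement means all values are identical.
    Returns a (value, phase) tuple.
    """
    # Phase 1: accept the non-expert annotation if all non-experts agree.
    if len(set(non_expert_values)) == 1:
        return non_expert_values[0], "non_expert"
    # Phase 2: non-experts disagreed, so consult the expert groups.
    if len(set(exp_group_values)) == 1:
        return exp_group_values[0], "exp_group"
    # Phase 3: expert groups disagreed, so the expert discussion decides.
    return exp_disc_value, "exp_disc"


# Illustrative values only, not taken from the dataset.
agreed_early = resolve_annotation([100.0, 100.0], [], None)
agreed_by_groups = resolve_annotation([100.0, 120.0], [120.0, 120.0], None)
agreed_in_discussion = resolve_annotation([100.0, 120.0], [120.0, 100.0], 100.0)
```

Returning the phase alongside the value keeps track of how far each item had to be escalated, which can be handy when inspecting how the final benchmark values came about.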
Merging of datasets
Users can match the two datasets (gold_standard and annotation_dataset) using the variable combination of company_name, report_year and merge_id (index column). The merge_id already includes the company name and report year implicitly, but company_name and report_year should still be included as join variables to avoid duplicated columns in the join result. This is useful, for example, when comparing LLM extractions to gold standard data.
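A minimal sketch of such a join, using only the Python standard library, is shown here. The join keys company_name, report_year and merge_id and the variable scope come from the description above; the helper join_on_keys, the inner-join behaviour, the columns value and llm_value, the "_annotation" suffix and the sample rows are hypothetical and only serve the illustration.

```python
JOIN_KEYS = ("company_name", "report_year", "merge_id")


def join_on_keys(gold_rows, annotation_rows, keys=JOIN_KEYS):
    """Inner-join two lists of row dicts on all key columns (hypothetical helper)."""
    # Index annotation rows by their key tuple for fast lookup.
    index = {}
    for row in annotation_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for gold in gold_rows:
        for ann in index.get(tuple(gold[k] for k in keys), []):
            merged = dict(gold)  # key columns enter the result only once
            for col, val in ann.items():
                if col in keys:
                    continue
                # Suffix non-key columns that exist in both datasets.
                name = col if col not in merged else col + "_annotation"
                merged[name] = val
            joined.append(merged)
    return joined


# Made-up sample rows; real rows could be read with csv.DictReader.
gold_rows = [{"company_name": "ExampleCorp", "report_year": 2022,
              "merge_id": "ExampleCorp_2022", "scope": "1", "value": 1200.0}]
annotation_rows = [{"company_name": "ExampleCorp", "report_year": 2022,
                    "merge_id": "ExampleCorp_2022", "scope": "1",
                    "llm_value": 1250.0}]

joined = join_on_keys(gold_rows, annotation_rows)
print(joined[0]["value"], joined[0]["llm_value"])
```

Because all three variables are used as keys, company_name and report_year appear once in each joined row instead of being carried over from both datasets.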
Files
codebook_gold_standard.csv
Total size: 4.7 MB
Checksum | Size |
---|---|
md5:af291b895c9c8c02c1f29d5a4965c0af | 14.1 kB |
md5:64c001e5d854ebd3624571f10ceb130f | 4.3 kB |
md5:f8919ce1463e139d613e49476d452670 | 213.7 kB |
md5:153dc9fc6065561db62302b079746c1d | 1.3 MB |
md5:5033da6297152d20cf618b8fd9d13ec6 | 3.2 MB |
Additional details
Related works
- Is described by: https://github.com/soda-lmu/gist-data-descriptor (Workflow)
Dates
- Collected: 2024-12-10
Software
- Repository URL: https://github.com/soda-lmu/gist-data-descriptor
- Programming language: R
References
- GISTPROJ001