OSDG Community Dataset (OSDG-CD)

OSDG; UNDP IICPSD SDG AI Lab; PPMI

doi:10.5281/zenodo.11441197

Published April 1, 2024 | Version 2024.04

Dataset Open

The OSDG Community Dataset (OSDG-CD) is a public dataset of tens of thousands of text excerpts, which were validated with respect to the Sustainable Development Goals (SDGs) by over 1,400 OSDG Community Platform (OSDG-CP) citizen scientists from over 140 countries.

Dataset Information

In support of the global effort to achieve the Sustainable Development Goals (SDGs), OSDG is releasing a series of SDG-labelled text datasets. The OSDG Community Dataset (OSDG-CD) is the direct result of the work of more than 1,400 volunteers from over 130 countries who have contributed to our understanding of SDGs via the OSDG Community Platform (OSDG-CP). The dataset contains tens of thousands of text excerpts (henceforth: texts) which were validated by the Community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches.

📘 The file contains 43,0210 (+390) text excerpts and a total of 310,328 (+3,733) assigned labels.

To learn more about the project, please visit the OSDG website and the official GitHub page. Explore a detailed overview of the OSDG methodology in our recent paper "OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)".

Source Data

The dataset consists of paragraph-length text excerpts derived from publicly available documents, including reports, policy documents and publication abstracts. A significant number of documents (more than 3,000) originate from UN-related sources such as SDG-Pathfinder and SDG Library. These sources often contain documents that already have SDG labels associated with them. Each text consists of 3 to 6 sentences and is about 90 words long on average.

Methodology

All the texts are evaluated by volunteers on the OSDG-CP. The platform is an ambitious attempt to bring together researchers, subject-matter experts and SDG advocates from all around the world to create a large and accurate source of textual information on the SDGs. The Community volunteers use the platform to participate in labelling exercises where they validate each text's relevance to SDGs based on their background knowledge.

In each exercise, the volunteer is shown a text together with an SDG label associated with it – this usually comes from the source – and asked to either accept or reject the suggested label.

There are 3 types of exercises:

Introductory exercise - mandatory for all volunteers; it consists of 10 pre-selected texts. Each volunteer must complete this exercise before they can access the 2 other exercise types. Upon completion, the volunteer reviews the exercise by comparing their answers with those of the rest of the Community using aggregated statistics we provide, i.e., the share of volunteers who accepted and rejected the suggested SDG label for each of the 10 texts. This helps the volunteer to get a feel for the platform.
SDG-specific exercises - the volunteer validates texts with respect to a single SDG, e.g., SDG 1 No Poverty.
All SDGs exercise - the volunteer validates a random sequence of texts, where each text can have any SDG as its associated label.

After finishing the introductory exercise, the volunteer is free to select either SDG-specific or All SDGs exercises. Each exercise, regardless of its type, consists of 100 texts. Once the exercise is finished, the volunteer can either label more texts or exit the platform. The volunteer can also end an exercise early; all progress up to that point is still saved and recorded.

To ensure quality, each text is validated by up to 9 different volunteers and all texts included in the public release of the data have been validated by at least 3 different volunteers.

It is worth keeping in mind that all exercises present the volunteers with a binary decision problem, i.e., either accept or reject a suggested label. The volunteers are never asked to select one or more SDGs that a certain text might relate to. The rationale behind this set-up is that asking a volunteer to select from 17 SDGs is extremely inefficient. Currently, all texts are validated against only one associated SDG label.
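Because every vote is a binary accept/reject decision on one suggested SDG, users of the data usually need to turn the two vote counts (the labels_positive and labels_negative columns listed under Column Description) into a single decision. One possible approach, sketched below in Python, is a simple majority vote. The function name `majority_label` and the choice to treat ties as undecided are assumptions made for illustration; they are not a rule prescribed by OSDG.

```python
from typing import Optional


def majority_label(labels_positive: int, labels_negative: int) -> Optional[bool]:
    """Return True if most volunteers accepted the suggested SDG,
    False if most rejected it, and None when the vote is tied."""
    if labels_positive > labels_negative:
        return True
    if labels_negative > labels_positive:
        return False
    return None  # assumption: a tie is left undecided


# Invented (accepted, rejected) vote counts, for illustration only.
votes = [(7, 1), (2, 5), (3, 3)]
for pos, neg in votes:
    print(pos, neg, majority_label(pos, neg))
```

Note that a rejected label only indicates that the text was judged not to relate to the suggested SDG; it carries no information about which other SDG, if any, the text relates to.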

Column Description

doi - Digital Object Identifier of the original document
text_id - unique text identifier
text - text excerpt from the document
sdg - the SDG the text is validated against
labels_negative - the number of volunteers who rejected the suggested SDG label
labels_positive - the number of volunteers who accepted the suggested SDG label
agreement - agreement score based on the formula \(agreement = \frac{|labels_{positive} - labels_{negative}|}{labels_{positive} + labels_{negative}}\)
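
As a worked illustration of the agreement formula, here is a minimal Python sketch. The function name `agreement_score` and the handling of a zero denominator are assumptions made for illustration; they are not part of the dataset or of the OSDG code base.

```python
def agreement_score(labels_positive: int, labels_negative: int) -> float:
    """Return |positive - negative| / (positive + negative)."""
    total = labels_positive + labels_negative
    if total == 0:
        # Assumption: released texts have at least 3 votes, so this case
        # should not occur; raise instead of guessing a value.
        raise ValueError("at least one label is required")
    return abs(labels_positive - labels_negative) / total


# Unanimous votes give 1.0; an even split gives 0.0.
print(agreement_score(5, 0))  # 1.0
print(agreement_score(3, 3))  # 0.0
print(agreement_score(6, 3))  # 0.3333333333333333
```

The score measures only how one-sided the vote was, not its direction: 6 accepts against 3 rejects and 3 accepts against 6 rejects both yield about 0.33.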

Further Information

Do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. All queries can be directed to community@osdg.ai.

Notes

This CSV file uses UTF-8 character encoding. To open it in MS Excel, use Data → From Text/CSV and select TAB as the delimiter so that the data is split into separate columns.
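
Outside Excel, the same two settings (UTF-8 encoding and a TAB delimiter) apply. The sketch below uses only the Python standard library and parses a small in-memory sample instead of the real file. The sample rows, the helper name `read_rows`, the storage of sdg as an integer and the default quote handling are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-row sample in the file's layout (tab-separated, UTF-8).
# The values are invented and are not taken from the dataset.
SAMPLE = (
    "doi\ttext_id\ttext\tsdg\tlabels_negative\tlabels_positive\tagreement\n"
    "10.0000/example-1\tid-001\tAccess to clean water, in rural areas.\t6\t1\t8\t0.777778\n"
    "10.0000/example-2\tid-002\tPoverty rates fell over the decade.\t1\t2\t4\t0.333333\n"
)


def read_rows(handle):
    """Parse tab-delimited rows and convert the numeric columns."""
    rows = []
    # Assumption: the csv module's default quote handling suits the file.
    for row in csv.DictReader(handle, delimiter="\t"):
        row["sdg"] = int(row["sdg"])  # assumption: sdg is stored as an integer
        row["labels_negative"] = int(row["labels_negative"])
        row["labels_positive"] = int(row["labels_positive"])
        row["agreement"] = float(row["agreement"])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
print(len(rows), rows[0]["sdg"], rows[0]["labels_positive"])

# For the downloaded file, the call would look something like:
# with open("osdg-community-data-v2024-04-01.csv", encoding="utf-8", newline="") as f:
#     rows = read_rows(f)
```

Splitting on TAB rather than on commas keeps commas inside the text column from being treated as column boundaries.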

Files

Files (30.2 MB)

Name	Size	Checksum
osdg-community-data-v2024-04-01.csv	30.2 MB	md5:26627ce342bc1de474be21d9a9b80536
