Dataset Open Access

Text Analyses of Survey Data on "Mapping Research Output to the Sustainable Development Goals (SDGs)"

Vanderfeesten, Maurice; Spielberg, Eike; Hasse, Linda

Besselaar, Peter van den

This package contains data on five text analysis types (term extraction, contrast analysis, topic modelling, network mapping, contingency matrix), based on the survey data in which researchers selected research output that is related to the 17 Sustainable Development Goals (SDGs). This is used as input to improve the current SDG classification model from v4.0 to v5.0.

The Sustainable Development Goals are the 17 global challenges set by the United Nations. Within each goal, specific targets and indicators are defined to monitor progress towards reaching that goal by 2030. In an effort to capture how research contributes to moving the needle on those challenges, we previously built an initial classification model that makes it possible to quickly identify which research output is related to which SDG. (This Aurora SDG dashboard is the initial outcome as proof of practice.)

The initiative started from the Aurora Universities Network in 2017, in the working group "Societal Impact and Relevance of Research", to investigate and make visible 1. what research is done that is relevant to topics or challenges that live in society (for the proof of practice this has been scoped down to the SDGs), and 2. what the effect or impact is of applying those research outcomes to those societal challenges (this has also been scoped down, to research output being cited in policy documents from national and local governments and NGOs).

Context of this dataset | classification model improvement workflow

The classification model we have used consists of 17 different search queries on the Scopus database, one for each SDG.

Methods used to do the text analysis

  1. Term extraction: after text normalisation (stemming, etc.) we extracted 2 terms in bigrams and trigrams that co-occurred the most per document, in the title, abstract and keywords.
  2. Contrast analysis: this compares the co-occurring terms in publications (title, abstract, keywords) between the papers that respondents indicated relate to this SDG (y-axis: True) and the papers that were rejected (x-axis: False). In the top-left you'll see term co-occurrences that clearly relate to this SDG. The bottom-right holds terms that appear in papers that have been rejected for this SDG. The top-right terms appear frequently in both groups and cannot be used to discriminate between them.
  3. Network map: This diagram shows the cluster-network of terms co-occurring in the publications related to this SDG, selected by the respondents (accepted publications only).
  4. Topic model: This diagram shows the topics, and the related terms that make up each topic. The number of topics is set to the number of targets of this SDG.
  5. Contingency matrix: This diagram shows the top 10 of co-occurring terms that correlate the most.
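As a rough illustration of steps 1 and 2, the sketch below extracts bigrams and trigrams and compares how often each term appears in accepted versus rejected papers. It is not the CorTexT implementation: stemming is skipped, and the function names (`ngrams`, `document_frequency`, `contrast`) and the simple tokeniser are assumptions made for this example only.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the n-grams of a text as space-joined, lower-cased strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs):
    """Count in how many documents each bigram or trigram appears."""
    counts = Counter()
    for doc in docs:
        # A set ensures a term is counted once per document.
        counts.update(set(ngrams(doc, 2)) | set(ngrams(doc, 3)))
    return counts


def contrast(accepted_docs, rejected_docs):
    """Map each term to (share of accepted docs, share of rejected docs)."""
    acc = document_frequency(accepted_docs)
    rej = document_frequency(rejected_docs)
    n_acc = max(len(accepted_docs), 1)  # avoid division by zero
    n_rej = max(len(rejected_docs), 1)
    return {t: (acc[t] / n_acc, rej[t] / n_rej) for t in set(acc) | set(rej)}
```

In the resulting pairs, a high first value with a low second value corresponds to the top-left of the contrast diagram, and the reverse corresponds to the bottom-right.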

Software used to do the text analyses

CorTexT: The CorTexT Platform is the digital platform of LISIS Unit and a project launched and sustained by IFRIS and INRAE. This platform aims at empowering open research and studies in humanities about the dynamic of science, technology, innovation and knowledge production.

Resource with interactive visualisations

Based on the text analysis data we have created a website that brings all the interactive SDG diagrams together for you to scroll through.

Data set content

In the dataset root you'll find the following folders and files:

  • /sdg01-17/
    • This contains the text analysis for all the individual SDG surveys.
  • /methods/
    • This contains the step-by-step explanations of the text analysis methods using CorTexT.
  • /images/
    • images of the results used in this
    • terms and conditions for reusing this data.
    • description of the dataset; each subfolder contains a file to further describe the content of that subfolder.

Inside an /sdg01-17/-folder you'll find the following:

  • /sdg01-17/sdg04-sdg-survey-selected-publications-combined.db
    • This contains the title, abstract and keywords of the publications in the survey, including the acceptance or rejection status and the number of respondents.
  • /sdg01-17/sdg04-sdg-survey-selected-publications-combined-accepted-accepted-custom-filtered.db
    • same as above, but only the accepted papers
  • /sdg01-17/extracted-terms-list-top1000.csv
    • the aggregated list of co-occurring terms (bigrams and trigrams) extracted per paper.
  • /sdg01-17/contrast-analysis/
    • This contains the data and visualisation of the terms appearing in papers that have been accepted (true) and rejected (false) to be relating to this SDG.
  • /sdg01-17/topic-modelling/
    • This contains the data and visualisation of the terms clustered in the same number of topics as there are 'targets' within that SDG.
  • /sdg01-17/network-mapping/
    • This contains the data and visualisation of the terms clustered by how closely they co-occur in papers.
  • /sdg01-17/contingency-matrix/
    • This contains the data and visualisation of the top 10 co-occurring terms.
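To inspect one of the .db files, a minimal sketch follows. It assumes the files are SQLite databases, which this description does not state explicitly; the function name `list_tables` is made up for this example, and the table names inside the files are not documented here, so the sketch only lists whatever tables it finds.

```python
import sqlite3


def list_tables(db_path):
    """Return the table names in an SQLite file, sorted alphabetically.

    Assumption: the .db file is an SQLite database.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Note that `sqlite3.connect` creates an empty file when the path does not exist, so it is worth checking the path before calling this.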

note: the .csv files are actually tab-separated.
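Because of this, a default comma-based CSV reader will not split the columns correctly. A minimal sketch for loading such a file with the Python standard library is shown here; it assumes the file is UTF-8 encoded and starts with a header row, neither of which is confirmed by this description, and `read_terms` is a name chosen for this example.

```python
import csv


def read_terms(path):
    """Read a tab-separated file into a list of dicts keyed by the header row.

    Assumptions: UTF-8 encoding and a header row on the first line.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_terms("extracted-terms-list-top1000.csv")` would return one dictionary per row, with all values as strings.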

Contribute and improve the SDG Search Queries

We welcome you to join the GitHub community and to fork, branch, improve and open a pull request to add your improvements to the new version of the SDG queries.

Keywords: Sustainable Development Goals; SDG; Classification model; Search Queries; SCOPUS; Text indexing; Controlled vocabulary