AI for mapping multi-lingual academic papers to the United Nations' Sustainable Development Goals (SDGs)

doi:10.5281/zenodo.6487606

Published March 30, 2022 | Version 1.0

Report Open

1. Vrije Universiteit Amsterdam
2. Palacký University Olomouc
3. University Duisburg-Essen

PLEASE GO TO LATEST VERSION

In this report we demonstrate how we built a multi-lingual text classifier that matches research papers to the Sustainable Development Goals (SDGs) of the United Nations.

We trained the multilingual BERT (mBERT) model to classify papers against the 169 individual SDG Targets, using the English abstracts in a corpus of 1.4 million research papers. We gathered that data from Scopus with the Aurora SDG Query model v5, which has an evaluated average precision of 70% and recall of 14%.
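To make these two figures concrete, the sketch below computes precision and recall from confusion counts. The counts are hypothetical, chosen only so that the ratios reproduce the reported 70% and 14%; they are not taken from the evaluation of the query model.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of retrieved papers that are actually relevant."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of relevant papers that were actually retrieved."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts, for illustration only (not the real evaluation data).
tp, fp, fn = 70, 30, 430

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f}")
```

With these illustrative counts, 70 of the 100 retrieved papers are relevant, while only 70 of the 500 relevant papers are retrieved. A low recall of this kind means the keyword queries miss most relevant papers.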

This project is a follow-up to the query-based Aurora SDG classification model v5. Its purpose is to tackle several issues: 1. to also label research output written in a non-English language with an SDG, 2. to include papers that use terms other than the exact terms used in keyword searches, 3. to have a classification model that works independently of any database-specific query language.

In this report we explain our decision to use only the abstracts and the mBERT model to train the classifier. We also show why we trained 169 individual models instead of one multi-label model, including the evaluation of their predictions. We show how to prepare the data for training and how to run the code to train the models on multiple GPU cores. Next we show how to prepare the data for prediction and how to use the code to classify English and non-English texts. Finally, we evaluate the model by reviewing a sample of non-English research papers, and provide some tips to increase the reliability of the predicted outcomes.
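The idea of many independent binary models, rather than one multi-label model, can be sketched as follows. This is a minimal illustration and not the code in predict.py: the `BinaryModel` type, the `predict_targets` function, the 0.5 threshold, the target label strings, and the keyword-based stand-in scorers are all assumptions made for the example. In the real setup each scorer would be one of the trained mBERT models.

```python
from typing import Callable, Dict, List

# Hypothetical type: each per-target model maps a text to a probability in [0, 1].
BinaryModel = Callable[[str], float]


def predict_targets(
    text: str, models: Dict[str, BinaryModel], threshold: float = 0.5
) -> List[str]:
    """Run every binary model independently; keep targets scoring at or above the threshold."""
    return sorted(
        target for target, model in models.items() if model(text) >= threshold
    )


def keyword_scorer(keyword: str) -> BinaryModel:
    """Toy stand-in for a trained model (illustration only)."""
    return lambda text: 0.9 if keyword in text.lower() else 0.1


# Three of the 169 targets, with hypothetical label strings.
toy_models = {
    "SDG 6.1": keyword_scorer("drinking water"),
    "SDG 7.2": keyword_scorer("renewable energy"),
    "SDG 13.1": keyword_scorer("climate"),
}

labels = predict_targets(
    "Access to safe drinking water under climate stress", toy_models
)
print(labels)
```

Because each model decides on its own, a text can receive zero, one, or several targets, and a single target's model can be retrained or replaced without touching the others.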

This collection will contain:

Report / technical documentation describing the method and evaluating the models.
- AI_for_mapping_... SDGs_.pdf
Text classification models: a table containing the download URLs of the mBERT models for each SDG-Target and SDG-Goal, in .h5 format.
- SDGs_many_BERTs_models_download_urls.csv
- (.csv format, semicolon separated)
- for only SDG-Goals models: https://doi.org/10.5281/zenodo.5835849
Training data sample in .csv format, containing the abstract and one column per SDG-Target with the value 1 or 0.
- train_data_sample_aurora_sdg_v5_worldwide_set_doi_abstracts_sdg_targets_2009-2020-in-columns.csv
- (.csv format, comma separated)
- get the full data here (1.4 million DOIs labeled with SDG Targets): https://doi.org/10.5281/zenodo.5205672
- due to licensing agreements with Scopus we are only allowed to share a limited number of abstracts. Support https://i4oa.org to get more free-to-use abstracts in Crossref.
Training code in Python, explaining which parameters we used to train the models on GPU hardware.
- train.py
- get / fork code here: https://github.com/Aurora-Network-Global/sdgs_many_berts
Test statistics on the accuracy of each of the trained models.
- SDGs_many_BERTs_models_test_statistics.csv
- (.csv format, tab separated)
Prediction data sample. Text file (UTF-8) containing abstracts of papers in different languages, one per row.
- predict_data_sample_multiple_language_abstracts.csv
- (.csv format, list; languages: rows 2-31 EN, rows 32-152 NL)
Prediction code in Python, set up to run the models to classify a text fragment.
- predict.py
- get / fork code here: https://github.com/Aurora-Network-Global/sdgs_many_berts
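The files in the list above use three different delimiters, so a small helper can avoid parsing mistakes. The sketch below uses only the Python standard library. The delimiters and the row ranges come from the list above; the column names `target` and `url`, the example URL, and the helper names `read_table` and `language_of_row` are hypothetical, since the actual headers are not described here.

```python
import csv
import io
from typing import Dict, List, TextIO

# Delimiters as stated in the list above.
DELIMITERS = {
    "SDGs_many_BERTs_models_download_urls.csv": ";",
    "train_data_sample_aurora_sdg_v5_worldwide_set_doi_abstracts_sdg_targets_2009-2020-in-columns.csv": ",",
    "SDGs_many_BERTs_models_test_statistics.csv": "\t",
}


def read_table(handle: TextIO, delimiter: str) -> List[Dict[str, str]]:
    """Read a delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def language_of_row(row_number: int) -> str:
    """Map a 1-indexed row of the prediction sample to its language (ranges from the list above)."""
    if 2 <= row_number <= 31:
        return "EN"
    if 32 <= row_number <= 152:
        return "NL"
    return "unknown"


# Inline stand-in for the semicolon-separated URL table (hypothetical columns and URL).
sample = io.StringIO("target;url\nSDG 1.1;https://example.org/model_1_1.h5\n")
rows = read_table(sample, DELIMITERS["SDGs_many_BERTs_models_download_urls.csv"])
print(rows[0]["url"], language_of_row(40))
```

To read a downloaded file instead of the inline sample, pass `open(name, newline="", encoding="utf-8")` as the handle together with the matching delimiter.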

Notes

Acknowledgements

Many thanks to Maéva Vignes from the University of Southern Denmark for allowing us to use their UCloud HPC facilities and budget to train the mBERT models on their GPUs.

Funded by

Funded by European Commission, Project ID: 101004013, Call: EAC-A02-2019-1, Programme: EPLUS2020, DG/Agency: EACEA

Change log

2022-03-31 | v1.0 | final version of the report

2022-03-31 | v0.8 | added final draft of the report

2022-02-01 | v0.7.2 | changed download URLs for SDG 11 to latest version 2 (adding 3 missing target models) in SDGs_many_BERTs_models_download_urls.xlsx

2022-02-01 | v0.7.1 | added files SDGs_many_BERTs_models_download_urls.xlsx and .csv, which I forgot to upload in version 0.7

2022-02-01 | v0.7 | changed files SDGs_many_BERTs_models_download_urls.xlsx and .csv to match the URLs of the new version of the SDG-Goals-Only models dataset, which now provides the models as separate files instead of one big .zip file.

2021-12-06 | v0.6 | added URL to SDG-Goals-Only models in second line of file SDGs_many_BERTs_models_download_urls.csv.

2021-12-06 | v0.5 | added report and documentation. (NEED TO FINISH SECTIONS EVALUATION AND CONCLUSION)

2021-12-06 | v0.4 | added sample data and code for training and for predicting, so that you can reproduce the models and make use of the trained models yourself.

2021-12-01 | v0.3 | added .csv file with accuracy statistics of the models

2021-11-30 | v0.2 | added .csv file with download urls of the models

2021-10-29 | v0.1 | added initial .md file as placeholder for this dataset

Files

Files (3.3 MB)

Name	MD5	Size
AI_for_mapping_multi_lingual_research_papers_to_the_United_Nations__Sustainable_Development_Goals__SDGs.pdf	8140d4b822090d2566e5b7486e7acfbc	818.4 kB
predict.py	6df00b0332fcd370bd95f22e7ac946cb	288.7 kB
predict_data_sample_multiple_language_abstracts.csv	b0944a4a586a5a8922a329e68a1eb209	193.8 kB
SDGs_many_BERTs_models_download_urls.csv	08df9df63aab5d445501bdd4b03ed9ba	49.9 kB
SDGs_many_BERTs_models_download_urls.xlsx	fce25cc9db3652a79a237e6bb302bebc	23.7 kB
SDGs_many_BERTs_models_test_statistics.csv	a2dd37a643462aca4c30303b0762563a	8.9 kB
train.py	4ce501ff45cfb316f19187df2947c5c2	301.2 kB
train_data_sample_aurora_sdg_v5_worldwide_set_doi_abstracts_sdg_targets_2009-2020-in-columns.csv	a5bd4ac99ff7bad4ba20fc33f441882a	1.6 MB

Additional details

Documents: Software: 10.5281/zenodo.5835849 (DOI); Software: 10.5281/zenodo.5700941 (DOI); Software: 10.5281/zenodo.5700994 (DOI); Software: 10.5281/zenodo.5701153 (DOI); Software: 10.5281/zenodo.5701281 (DOI); Software: 10.5281/zenodo.5702201 (DOI); Software: 10.5281/zenodo.5702382 (DOI); Software: 10.5281/zenodo.5702494 (DOI); Software: 10.5281/zenodo.5702551 (DOI); Software: 10.5281/zenodo.5702713 (DOI); Software: 10.5281/zenodo.5702796 (DOI); Software: 10.5281/zenodo.5702944 (DOI); Software: 10.5281/zenodo.5733882 (DOI); Software: 10.5281/zenodo.5734383 (DOI); Software: 10.5281/zenodo.5734903 (DOI); Software: 10.5281/zenodo.5735390 (DOI); Software: 10.5281/zenodo.5735686 (DOI); Software: 10.5281/zenodo.5737987 (DOI)
Is supplemented by: Dataset: 10.5281/zenodo.5205672 (DOI)

	All versions	This version
Views	3,419	1,903
Downloads	2,577	1,553
Data volume	2.8 GB	2.0 GB
