There is a newer version of the record available.

Published November 30, 2021 | Version 0.2
Report Open

AI for mapping multi-lingual academic papers to the United Nations' Sustainable Development Goals (SDGs)

  • 1. Palacký University Olomouc
  • 2. Vrije Universiteit Amsterdam

Description

PLEASE GO TO LATEST VERSION

In this report we demonstrate how we built a multi-lingual text classifier that matches research papers to the Sustainable Development Goals (SDGs) of the United Nations.

We trained the multilingual mBERT model to classify the 169 individual SDG Targets, based on the English abstracts in a corpus of 1.4 million research papers that we gathered using the Aurora SDG Query model v5.
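Training one classifier per Target implies reshaping the corpus labels into 169 one-vs-rest binary columns. A minimal sketch of that reshaping, assuming each paper carries a list of Target codes in the UN "goal.target" notation (the field names here are illustrative, not the project's actual schema):

```python
# Illustrative input: each paper annotated with the SDG Target codes its
# abstract matched. The "abstract" and "targets" field names are
# hypothetical, not the project's actual schema.
papers = [
    {"abstract": "Reducing premature mortality ...", "targets": ["3.4"]},
    {"abstract": "Universal health coverage ...", "targets": ["3.4", "3.8"]},
]

# One-vs-rest reshaping: one binary (0/1) label per SDG Target, so each of
# the 169 Targets can train its own independent mBERT classifier.
all_targets = sorted({t for p in papers for t in p["targets"]})
rows = [
    {"abstract": p["abstract"], **{t: int(t in p["targets"]) for t in all_targets}}
    for p in papers
]
print(rows[0])  # {'abstract': 'Reducing premature mortality ...', '3.4': 1, '3.8': 0}
```

In this one-vs-rest framing, each Target's model sees the same abstracts but its own 0/1 column as the label.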

This is a follow-up to the query-based Aurora SDG classification model. The purpose of this project is to tackle several issues:

  1. to also label research output written in a non-English language with an SDG,
  2. to include papers that use terms other than the exact terms used in the keyword searches,
  3. to have a classification model that works independently of any database-specific query language.

In this report we show why we decided to use abstracts only and the mBERT model to train the classifier. We also explain why we trained 169 individual models instead of one multi-label model, including the evaluation of their predictions. We show how to prepare the data for training and how to run the code to train the models on multiple GPU cores. Next, we show how to prepare the data for prediction and how to use the code to predict SDG Targets for English and non-English texts. Finally, we evaluate the model by reviewing a sample of non-English research papers and provide some tips to increase the reliability of the predicted outcomes.
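Because the Target classifiers are independent binary models, prediction amounts to scoring a text against each model, keeping the Targets above a threshold, and rolling them up to their parent SDG. A minimal sketch of that aggregation; the probabilities and the 0.5 threshold below are dummy illustrations standing in for real model outputs, not the project's calibrated values:

```python
def predict_sdgs(target_scores, threshold=0.5):
    """Map per-Target probabilities to SDG-level labels.

    target_scores: {"3.4": 0.91, ...} -- one probability per SDG Target,
    as produced by the independent binary models.
    Returns {goal_number: [(target, score), ...]} for Targets >= threshold.
    """
    goals = {}
    for target, score in target_scores.items():
        if score >= threshold:
            goal = int(target.split(".")[0])  # "3.4" -> parent goal 3
            goals.setdefault(goal, []).append((target, score))
    return goals

# Dummy scores for one abstract, as if returned by three Target models.
scores = {"3.4": 0.91, "3.8": 0.62, "7.2": 0.12}
print(predict_sdgs(scores))  # {3: [('3.4', 0.91), ('3.8', 0.62)]}
```

Raising the threshold trades recall for precision, which is one of the reliability levers discussed in the evaluation.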

This collection will contain:

  1. Report / technical documentation describing the method and evaluating the models.
    • (to be uploaded) placeholder.md
  2. Text classification models: a table containing the download URLs for each of the mBERT models, one per SDG Target, in .h5 format.
    • SDGs_many_BERTs_models_download_urls.csv
    • (.csv format, semicolon-separated)
  3. Training data sample in .h5 format, containing an abstract column and one 0/1 column per SDG Target.
    • (to be uploaded)
  4. Training code in Python, explaining which parameters we used to train the models on GPU hardware.
    • (to be uploaded)
  5. Test statistics on the accuracy of each of the trained models.
    • (to be uploaded)
  6. Prediction data sample: a UTF-8 text file containing one paper abstract per row, in different languages.
    • (to be uploaded)
  7. Prediction code in Python, set up to run the models to classify a text fragment.
    • (to be uploaded)
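The model table in item 2 can be read with a plain semicolon-separated CSV parser. A minimal sketch; the header names ("target", "url") and the sample URL are assumptions for illustration, so check the actual file's header row:

```python
import csv
import io

# Hypothetical sketch of reading SDGs_many_BERTs_models_download_urls.csv.
# The file is semicolon-separated; the column names "target" and "url" and
# the example URL below are illustrative assumptions, not the real contents.
sample = "target;url\n3.4;https://example.org/models/sdg_3_4.h5\n"

with io.StringIO(sample) as fh:
    models = {row["target"]: row["url"] for row in csv.DictReader(fh, delimiter=";")}

print(models["3.4"])  # https://example.org/models/sdg_3_4.h5
```

For the real file, replace the `io.StringIO` sample with `open(...)` on the downloaded CSV.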


Notes

Acknowledgements

Many thanks to Maéva Vignes from the University of Southern Denmark for allowing us to use their UCloud HPC facilities and budget to train the mBERT models on their GPUs.


Funded by

Funded by European Commission, Project ID: 101004013, Call: EAC-A02-2019-1, Programme: EPLUS2020, DG/Agency: EACEA


Read more

[ Project website | Zenodo Community | Github ]


Change log

2021-11-30 | v0.2 | added .csv file with download urls of the models

2021-10-29 | v0.1 | added initial .md file as placeholder for this dataset

Files (19.1 kB)

  • placeholder.md · 1.7 kB · md5:f2fc3a517d46a6a0b630574a4fe1a5ac
  • SDGs_many_BERTs_models_download_urls.csv · 17.4 kB · md5:0113c00610865cd5a5f6235a3c1529fd
