Published June 21, 2024 | Version v1
Dataset Open

SyROCCo dataset

Description

The peer-reviewed publication for this dataset has been published in Data & Policy and can be accessed here: https://arxiv.org/abs/2406.16527. Please cite this publication when using the dataset.

 

This dataset has been produced as a result of the “Systematic Review of Outcomes Contracts using Machine Learning” (SyROCCo) project. The goal of the project was to apply machine learning techniques to a systematic review process of outcomes-based contracting (OBC). The purpose of the systematic review was to gather and curate, for the first time, all of the existing evidence on OBC. We aimed to map the current state of the evidence, synthesise key findings from across the published studies, and provide accessible insights to our policymaker and practitioner audiences.

OBC is a model for the provision of public services in which a service provider receives payment, in part or in full, only upon the achievement of pre-agreed outcomes.

 

The data used to conduct the review consist of 1,952 individual studies of OBC. They include peer-reviewed journal articles, book chapters, doctoral dissertations, and assorted ‘grey literature’, that is, reports and evaluations produced outside of traditional academic publications. These studies were manually filtered by experts on the topic from an initial search that returned over 11,000 results.

The full text of the articles was obtained from their PDF versions and preprocessed. Preprocessing involved normalising the text format and removing acknowledgements and bibliographic references.

The corpus was then connected to the INDIGO Impact Bond Dataset. Projects and organisations listed in the INDIGO dataset were searched for in the corpus of articles in order to link the two datasets.

 

Other types of information identified in the texts were: 1) financial mechanisms (types of outcomes-based instrument), using a list of terms related to those financial mechanisms based on prior discussions with a policy advisory group (Picker et al., 2021); 2) references to the 17 Sustainable Development Goals (SDGs) defined by the United Nations General Assembly in the 2030 Agenda; and 3) country names mentioned in each article and the income levels of those countries, according to the World Classification of Income Levels 2022 by the World Bank.
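The term-based identification described above can be illustrated with a minimal sketch. It assumes simple case-insensitive whole-word matching and a fixed character window for the context; the function name `find_mentions`, the window size, and the sample term list are hypothetical and are not taken from the project's code.

```python
import re
from collections import Counter


def find_mentions(text, terms, window=40):
    """Count whole-word, case-insensitive mentions of each term and keep their context."""
    counts = Counter()
    contexts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            counts[term] += 1
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            contexts.setdefault(term, []).append(text[start:end])
    # Most frequently mentioned terms first, mirroring the ordering of the top_* fields.
    return [(term, n, contexts[term]) for term, n in counts.most_common()]


# Hypothetical sample text and term list, for illustration only.
sample = "The social impact bond in Peru was followed by a second social impact bond in India."
terms = ["social impact bond", "Peru", "India", "Kenya"]
mentions = find_mentions(sample, terms)
```

Terms that never occur (here "Kenya") are simply absent from the result, and each returned tuple holds the term, its mention count, and the surrounding text snippets.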

 

Three machine learning techniques were applied to the corpus:

 

  • Policy areas identification. A query-driven topic model (QDTM) (Fang et al., 2021) was used to determine the probability of an article belonging to different policy areas (health, education, homelessness, criminal justice, employment and training, child and family welfare, and agriculture and environment), using the full text of the article as input. The QDTM is a semi-supervised machine learning algorithm that allows users to specify their prior knowledge in the form of simple queries in words or phrases and returns query-related topics.

  • Named Entity Recognition. Three named entity recognition models were applied: the “en_core_web_lg” and “en_core_web_trf” models from the Python package ‘spaCy’ and the “ner-ontonotes-large” English model from ‘Flair’. “en_core_web_trf” is based on the RoBERTa-base transformer model. ‘Flair’ is a bi-LSTM character-based model. All models were trained on the “OntoNotes 5” data source (Marcus et al., 2011) and are able to identify geographical locations, organisation names, and laws and regulations. An ensemble method was adopted: an entity was accepted as correct if it appeared in the results of at least two of the three models.

  • Semantic text similarity. We calculated a similarity score between articles. The 10,000 most frequently mentioned words were first extracted from the titles and abstracts of all the articles, and the text vectorisation technique TF-IDF was applied to convert each article’s abstract into a vector of importance scores over these words. The cosine similarity between articles was then calculated from these numerical vectors.
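The ensemble rule and the similarity computation described in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: entities are represented as hypothetical (text, label) pairs, tokenisation is a plain whitespace split, and the smoothed IDF formula is one common variant chosen here for simplicity.

```python
import math
from collections import Counter


def ensemble_entities(*model_outputs, min_votes=2):
    """Keep the entities returned by at least `min_votes` of the models."""
    votes = Counter()
    for entities in model_outputs:
        votes.update(set(entities))  # each model votes at most once per entity
    return {entity for entity, n in votes.items() if n >= min_votes}


def tfidf_vectors(docs, vocab_size=10_000):
    """Build TF-IDF vectors over the `vocab_size` most frequent words."""
    tokenised = [doc.lower().split() for doc in docs]
    totals = Counter(word for tokens in tokenised for word in tokens)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    n_docs = len(docs)
    doc_freq = {w: sum(1 for tokens in tokenised if w in tokens) for w in vocab}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            [tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in vocab]
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical outputs of the three NER models for one article.
spacy_lg = {("Oxford", "GPE"), ("World Bank", "ORG")}
spacy_trf = {("Oxford", "GPE"), ("UK", "GPE")}
flair = {("World Bank", "ORG"), ("UK", "GPE"), ("Peru", "GPE")}
agreed = ensemble_entities(spacy_lg, spacy_trf, flair)

# Hypothetical abstracts, for illustration only.
abstracts = [
    "social impact bonds fund health services",
    "social impact bonds fund education services",
    "rainfall patterns in tropical forests",
]
vecs = tfidf_vectors(abstracts)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
```

In this toy example the entity found by only one model is discarded, and the two abstracts that share most of their vocabulary score higher than the pair with no words in common.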

 

The SyROCCo Dataset includes references to the 1,952 studies of OBC mentioned above, together with the results of the processing steps and techniques described above. Each entry of the dataset contains the following information.

 

The basic information of each document is its title, abstract, authors, published years, DOI and Article ID:

  • Title: Title of the document.

  • Abstract: Text of the abstract.

  • Authors: Authors of a study.

  • Published Years: Year of publication of a study.

  • DOI: DOI link of a study.

  • Article ID: ID of the document selected during the screening process.

 

The probability of a study belonging to each policy area:

  • policy_sector_health: The probability that a study belongs to the policy sector “health”.

  • policy_sector_education: The probability that a study belongs to the policy sector “education”.

  • policy_sector_homelessness: The probability that a study belongs to the policy sector “homelessness”.

  • policy_sector_criminal: The probability that a study belongs to the policy sector “criminal justice”.

  • policy_sector_employment: The probability that a study belongs to the policy sector “employment and training”.

  • policy_sector_child: The probability that a study belongs to the policy sector “child and family welfare”.

  • policy_sector_environment: The probability that a study belongs to the policy sector “agriculture and environment”.
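As an illustration of how these probability fields might be used, the sketch below picks the most probable policy sector for a single entry. It assumes that each field holds one numeric value that can be parsed with `float`, as it would arrive from `csv.DictReader`; the helper name `most_likely_sector` and the sample values are hypothetical.

```python
POLICY_COLUMNS = [
    "policy_sector_health",
    "policy_sector_education",
    "policy_sector_homelessness",
    "policy_sector_criminal",
    "policy_sector_employment",
    "policy_sector_child",
    "policy_sector_environment",
]


def most_likely_sector(row):
    """Return (sector, probability) for the highest-scoring policy sector in one entry."""
    # Assumption: every policy_sector_* field contains a single number stored as text.
    probs = {col: float(row[col]) for col in POLICY_COLUMNS}
    best = max(probs, key=probs.get)
    return best[len("policy_sector_"):], probs[best]


# Hypothetical entry with made-up values, shaped like a csv.DictReader row.
example_row = {
    "Title": "An example study",
    "policy_sector_health": "0.61",
    "policy_sector_education": "0.12",
    "policy_sector_homelessness": "0.05",
    "policy_sector_criminal": "0.02",
    "policy_sector_employment": "0.15",
    "policy_sector_child": "0.03",
    "policy_sector_environment": "0.02",
}
sector, probability = most_likely_sector(example_row)
```

Because a study can plausibly span several policy areas, a threshold on each probability may be more appropriate than a single maximum, depending on the analysis.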

 

Other types of information such as financial mechanisms, Sustainable Development Goals, and different types of named entities:

  • financial_mechanisms: Financial mechanisms mentioned in a study.

  • top_financial_mechanisms: The financial mechanisms mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

  • top_sgds: Sustainable Development Goals mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

  • top_countries: Country names mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions. This entry is also used to determine the income level of the mentioned countries.

  • top_Project: INDIGO projects mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

  • top_GPE: Geographical locations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.  

  • top_LAW: Relevant laws and regulations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.  

  • top_ORG: Organisations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.  

Files

SyROCCo_dataset.csv (56.2 MB)
md5:de10306e7e4a0864e621f4f33cf81bac
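The md5 checksum listed for SyROCCo_dataset.csv can be used to check that a download is complete. A minimal sketch using Python's standard `hashlib` module follows; the helper names and the local file path are assumptions made for illustration.

```python
import hashlib

# Checksum published on this page for SyROCCo_dataset.csv.
EXPECTED_MD5 = "de10306e7e4a0864e621f4f33cf81bac"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `is_intact("SyROCCo_dataset.csv")` would return True for an uncorrupted copy saved under that (assumed) local file name.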

Additional details

Funding

  • Department for Digital, Culture, Media & Sport (grant ref: A2683)

  • Children's Investment Fund Foundation (grant ref: 2104-06351)

  • UK Research and Innovation (grant ref: MR/T040890/1)

  • Foreign and Commonwealth Office (grant ref: 300539)

  • UBS Optimus Foundation, Union Bank of Switzerland (grant ref: 51962)

  • John Fell Fund, University of Oxford (grant ref: 0012257)