Published September 5, 2020 | Version 1.0
Dataset Open

Dataset Reuse Indicators Datasets

  • 1. King's College London
  • 2. Huawei Technologies
  • 3. University of Amsterdam

Description

This dataset contains two files. 

1) A Python pickle file (github_dataset.zip) that contains GitHub repositories with datasets. Specifically, we used Google’s public dataset copy of GitHub and the BigQuery service to build a list of repositories that have a CSV, XLSX or XLS file. We then used the GitHub API to collect information about each repository in this list. The resulting dataset consists of 87,936 repositories that contain at least one CSV, XLSX or XLS file, along with information about their features (e.g. number of open and closed issues and license) from GitHub. This corpus had more than two million data files. We then excluded those files with fewer than ten rows, which was the case for 65,537 repositories with a total of 1,467,240 data files.

2) A Python pickle file (processed_dataset.zip) containing the feature information necessary to train a machine learning model to predict reuse on these GitHub datasets.
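Since both files are described as Python pickles distributed as zip archives, a minimal loading sketch might look like the following. The member names inside the archives and the type of the unpickled objects are not stated here, so the helper `load_pickles_from_zip` is hypothetical and simply assumes every archive member is a pickle. If the pickles hold objects from third-party libraries (for example pandas), those libraries would need to be installed first.

```python
import pickle
import zipfile


def load_pickles_from_zip(zip_path):
    """Return {member name: unpickled object} for every file in the archive.

    Assumption: every non-directory member of the archive is a pickle.
    Only unpickle archives you trust; pickle can execute arbitrary code.
    """
    objects = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as member:
                objects[name] = pickle.load(member)
    return objects
```

For example, `load_pickles_from_zip("github_dataset.zip")` would return a dictionary keyed by whatever member names the archive actually contains; inspecting its keys is a reasonable first step.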

Source code can be found at: https://github.com/laurakoesten/Dataset-Reuse-Indicators
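The repository above holds the authors' actual code. Purely as an illustration of the selection criteria described in item 1 (CSV, XLSX or XLS files, excluding those with fewer than ten rows), a rough sketch for the CSV case could look like this. The function names are hypothetical, and whether the header line or blank lines count towards the ten rows is an assumption made here, not something the description specifies.

```python
import csv
import io

DATA_EXTENSIONS = (".csv", ".xlsx", ".xls")
MIN_ROWS = 10  # files with fewer rows are excluded, per the description


def is_data_file(path):
    """True if the path ends with one of the data file extensions."""
    return path.lower().endswith(DATA_EXTENSIONS)


def csv_row_count(text):
    """Count non-empty rows in CSV text (assumption: the header counts as a row)."""
    return sum(1 for row in csv.reader(io.StringIO(text)) if row)


def keep_csv(text, min_rows=MIN_ROWS):
    """True if the CSV text has at least `min_rows` non-empty rows."""
    return csv_row_count(text) >= min_rows
```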

For a full description of the content see:

Koesten, Laura and Vougiouklis, Pavlos and Simperl, Elena and Groth, Paul, Dataset Reuse: Translating Principles to Practice. Available at SSRN: https://ssrn.com/abstract=3589836 or http://dx.doi.org/10.2139/ssrn.3589836

Files

github_dataset.zip

Files (396.9 MB)

Checksum Size
md5:ae372cef56d2ecc5ff86d1e7ab762957 161.6 MB
md5:d4250b3977c9a0d6d31b4b6c5514ac91 235.2 MB
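A downloaded archive can be checked against the MD5 checksums listed above. The sketch below assumes nothing about which checksum belongs to which file, since that mapping is not shown here; it only tests whether a file's digest is one of the two published values. The helper names are hypothetical.

```python
import hashlib

# MD5 digests as published in the file listing above.
PUBLISHED_MD5 = {
    "ae372cef56d2ecc5ff86d1e7ab762957",
    "d4250b3977c9a0d6d31b4b6c5514ac91",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file's MD5 digest equals one of the published digests."""
    return md5_of_file(path) in PUBLISHED_MD5
```

Reading in chunks keeps memory use low, which matters for archives of a few hundred megabytes.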

Additional details

Related works

Is derived from
Preprint: 10.2139/ssrn.3589836 (DOI)
Is supplement to
Software: https://github.com/laurakoesten/Dataset-Reuse-Indicators (URL)

Funding

UK Research and Innovation
Data Stories: Engaging citizens with data in a post-truth society EP/P025676/1