Introduction

This dataset, created by merging data from Crossref, CORE, and Mendeley, contains metadata for over 800 thousand publications published between 2013 and 2018. It was used to study how long it takes authors to deposit their articles in Open Access repositories relative to when the articles are published. The source code of our analysis and a link to a paper describing this dataset are available on GitHub. The purpose of this readme is to describe the data format and schema.

Format

The dataset is a standard CSV file in which every value (except those in the header row) is JSON-encoded. Using Python, you can load the dataset with:

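import csv
import json

# newline='' is what the csv module documentation recommends for files passed to a reader
with open('dataset.csv', newline='') as fp:
    reader = csv.DictReader(fp)
    for row in reader:
        # every cell holds a JSON document, so decode each value individually
        data_row = {k: json.loads(v) for k, v in row.items()}
        # do something with data_row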

Sample data row loaded using the script above:

{
    "core_country_code": [
        "es",
        "es"
    ],
    "core_deposited_date": [
        "2014-04-04",
        "2016-07-06"
    ],
    "core_doi": [
        "10.1002/2013PA002546",
        "10.1002/2013PA002546"
    ],
    "core_id_document": [
        "71863359",
        "78542979"
    ],
    "core_id_repository": [
        "2053",
        "522"
    ],
    "core_metadata_added": [
        "2016-11-29",
        "2017-02-28"
    ],
    "core_oai": [
        "oai:digibuo.uniovi.es:10651/25257",
        "oai:ddd.uab.cat:159776"
    ],
    "core_published_date": [
        "2013",
        "2013"
    ],
    "core_repository_name": [
        "Repositorio Institucional de la Universidad de Oviedo",
        "Diposit Digital de Documents de la UAB"
    ],
    "cr_accepted": null,
    "cr_created": "2013-11-26",
    "cr_doi": "10.1002/2013pa002546",
    "cr_issn": [
        "0883-8305"
    ],
    "cr_published": "2013-12-01",
    "panels": [
        "B"
    ],
    "subjects": [
        "Earth and Planetary Sciences"
    ]
}

Schema

The CSV consists of three sets of columns: columns whose header is prefixed with core_, columns whose header is prefixed with cr_, and two additional columns without a prefix. The core_ columns contain the part of the dataset obtained from CORE, the cr_ columns contain data obtained from Crossref, and the two remaining columns contain data obtained from Mendeley.

CORE data

Each CORE data column contains a list. This is because multiple publications from CORE can be matched to a single article from Crossref. The lists are ordered so that values at index zero belong to the first matched CORE publication, values at index one belong to the second matched CORE publication, etc.

In the example above, two CORE publications were matched to the Crossref record with DOI 10.1002/2013pa002546:

core_id_document    core_id_repository    core_deposited_date    ...
71863359            2053                  2014-04-04             ...
78542979            522                   2016-07-06             ...
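
The index-aligned CORE lists can be regrouped into one record per matched CORE publication. The following is a minimal sketch, assuming every core_ column of a row holds a list of the same length; the helper name core_records is hypothetical and is not part of our analysis code.

```python
def core_records(data_row):
    """Regroup the parallel core_* lists of one row into one dict per CORE publication."""
    # Assumption: all core_* columns are lists of equal length.
    core_columns = [k for k in data_row if k.startswith("core_")]
    return [
        dict(zip(core_columns, values))
        for values in zip(*(data_row[k] for k in core_columns))
    ]


# A cut-down version of the sample row shown earlier.
sample = {
    "core_id_document": ["71863359", "78542979"],
    "core_id_repository": ["2053", "522"],
    "core_deposited_date": ["2014-04-04", "2016-07-06"],
    "cr_doi": "10.1002/2013pa002546",
}
records = core_records(sample)
```

Each element of records then describes a single CORE publication, and the Crossref and Mendeley columns are left out.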

Crossref data

All Crossref columns are simple values containing information about a single Crossref publication (which is identified by the DOI stored in cr_doi). The exception is cr_issn, which is a list of all ISSNs that match the given article.

Mendeley data

The panels and subjects columns contain lists of values based on metadata obtained from Mendeley. These lists are empty if the publication was not found in Mendeley; otherwise they contain a single value (single-disciplinary publication) or multiple values (multi-disciplinary publication).
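
The three cases can be told apart by the length of the subjects list. This is only an illustrative sketch: the function name discipline_kind and the returned labels are hypothetical, and treating a null value like an empty list is an assumption.

```python
def discipline_kind(data_row):
    """Classify a row by the number of Mendeley subjects attached to it."""
    # Assumption: a missing or null column is handled like an empty list.
    subjects = data_row.get("subjects") or []
    if not subjects:
        return "not found in Mendeley"
    if len(subjects) == 1:
        return "single-disciplinary"
    return "multi-disciplinary"


kind = discipline_kind({"subjects": ["Earth and Planetary Sciences"]})
```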

Columns

  • core_country_code: Country codes representing the locations of the repositories listed in the core_id_repository and core_repository_name columns.
  • core_deposited_date: Dates on which each of the CORE publications in core_id_document was deposited in its respective repository. Format: %Y-%m-%d.
  • core_doi: Publication DOIs CORE obtained from the publications' repositories.
  • core_id_document: CORE publication IDs.
  • core_id_repository: IDs of repositories the publications in core_id_document were obtained from.
  • core_metadata_added: Dates showing when each of the publications was first added to CORE. Format: %Y-%m-%d.
  • core_oai: Publication OAI identifiers.
  • core_published_date: Publication dates that CORE obtained from the repositories. These are not normalised (i.e. they are in the same format in which they were received from the repository).
  • core_repository_name: Names of the repositories each of the publications was obtained from.
  • cr_accepted: Date when the article was accepted for publication, obtained from Crossref. Format: %Y-%m-%d.
  • cr_created: Date when the Crossref DOI was first registered.
  • cr_doi: Crossref DOI.
  • cr_issn: List of ISSN numbers obtained from Crossref.
  • cr_published: Date when the article was published according to Crossref. Format: %Y-%m-%d.
  • panels: A list of REF assessment panels the publication was tagged with based on the subjects below.
  • subjects: A list of subjects the publication was tagged with based on its Mendeley readers. For details of the tagging method please refer to our paper describing this dataset, which is linked from our GitHub repository.
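
Because core_deposited_date and cr_published share the %Y-%m-%d format, the gap between publication and deposit can be computed directly. The sketch below is not the method used in our analysis (see the GitHub repository for that). The function name deposit_delays is hypothetical, and returning None for missing dates is an assumption.

```python
from datetime import date


def deposit_delays(data_row):
    """Days from Crossref publication to each CORE deposit (negative if deposited earlier)."""
    published = data_row.get("cr_published")
    if published is None:
        # Assumption: without a publication date no delay can be computed.
        return []
    published_date = date.fromisoformat(published)
    delays = []
    for deposited in data_row.get("core_deposited_date") or []:
        if deposited is None:
            # Assumption: keep list positions aligned with the other core_ columns.
            delays.append(None)
        else:
            delays.append((date.fromisoformat(deposited) - published_date).days)
    return delays


# Dates taken from the sample row shown earlier.
delays = deposit_delays({
    "cr_published": "2013-12-01",
    "core_deposited_date": ["2014-04-04", "2016-07-06"],
})
```

The result stays index-aligned with core_id_document, so each delay can be attributed to a specific CORE publication.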