Published February 1, 2025 | Version v3.0
Dataset Open

Data Citation Corpus Data File

Description

Data file for the third release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

The data file includes 5,322,388 data citation records in JSON and CSV formats. The JSON file is the version of record.

For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, for example: 2025-02-01-data-citation-corpus-01-v3.0.json.
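As a rough illustration of the naming scheme, the sketch below splits a batch file name into its publication date, batch number, version, and extension. The regular expression is inferred from the single example name above; the assumption that CSV batches follow the same pattern, and the names `BatchFile` and `parse_batch_name`, are illustrative and not part of the release.

```python
import re
from datetime import date
from typing import NamedTuple


class BatchFile(NamedTuple):
    """Parts of a batch file name (hypothetical helper type)."""
    published: date
    batch: int
    version: str
    extension: str


# Pattern inferred from the one example name in the description (assumption).
_PATTERN = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})-data-citation-corpus-(\d+)-v(\d+(?:\.\d+)*)\.(json|csv)$"
)


def parse_batch_name(name: str) -> BatchFile:
    """Split a batch file name into date, batch number, version, and extension."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized batch file name: {name!r}")
    year, month, day, batch, version, extension = match.groups()
    return BatchFile(date(int(year), int(month), int(day)), int(batch), version, extension)


example = parse_batch_name("2025-02-01-data-citation-corpus-01-v3.0.json")
print(example.published, example.batch, example.version, example.extension)
```

Sorting a directory listing by the parsed `batch` value would give a stable order in which to process the batches.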

The data citations in the file originate from the following sources:

  • DataCite Event Data
  • A project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles
  • Data citations identified by Aligning Science Across Parkinson’s (ASAP)

Each data citation record comprises:

  • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited  

  • Metadata for the cited dataset and for the citing publication 

The data file includes the following fields:

Field           Description                                         Required?
id              Internal identifier for the citation                Yes
created         Date of item's incorporation into the corpus        Yes
updated         Date of item's most recent update in corpus         Yes
repository      Repository where cited data is stored               No
publisher       Publisher for the article citing the data           No
journal         Journal for the article citing the data             No
title           Title of cited data                                 No
publication     DOI of article where data is cited                  Yes
dataset         DOI or accession number of cited data               Yes
publishedDate   Date when citing article was published              No
source          Source where citation was harvested                 Yes
subjects        Subject information for cited data                  No
affiliations    Affiliation information for creator of cited data   No
funders         Funding information for cited data                  No
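The required and optional fields listed above can be checked with a small helper like the one below. This is a minimal sketch: it assumes each record has already been loaded as a Python dictionary keyed by the field names in the list, and the sample record values are invented placeholders, not real corpus data.

```python
from typing import Any, Dict, List

# Field names and their required status, taken from the field list above.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")
OPTIONAL_FIELDS = (
    "repository", "publisher", "journal", "title",
    "publishedDate", "subjects", "affiliations", "funders",
)


def missing_required(record: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent, None, or empty strings."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Invented placeholder record for illustration only.
sample = {
    "id": "example-id",
    "created": "2025-01-01",
    "updated": "2025-01-01",
    "publication": "https://doi.org/10.1234/example-article",
    "dataset": "EXAMPLE0001",
    "source": "asap",
}

print(missing_required(sample))
print(missing_required({"id": "example-id"}))
```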

 

Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

Notes on v3.0:

The third release of the Data Citation Corpus data file adds new citations, including those from a new data source (ASAP), updates and enhances citation metadata, and improves the overall usability of the file. The changes are as follows:

Add and update Event Data citations:

  • Add 65,524 new data citations created in DataCite Event Data between August 2024 and December 2024

Add ASAP citations:

  • Add 750 new data citations provided by Aligning Science Across Parkinson’s (ASAP), identified through processes to evaluate compliance with ASAP’s open science practices, which involve a partnership with DataSeer and internal curation (described here).

  • Citations with provenance from ASAP are identified as “asap” in the source field
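Because ASAP provenance is recorded in the source field, selecting those citations can be sketched as a simple filter. The example assumes records are dictionaries with a `source` key; only the value "asap" comes from the notes above, and "other-source" is a made-up placeholder standing in for any other source value.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def count_by_source(records: Iterable[Dict[str, Any]]) -> Counter:
    """Tally records by the value of their source field."""
    return Counter(record.get("source") for record in records)


def asap_only(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only records whose source field is exactly "asap"."""
    return [record for record in records if record.get("source") == "asap"]


# Placeholder records; "other-source" is not a real corpus value.
records = [
    {"id": "a", "source": "asap"},
    {"id": "b", "source": "other-source"},
    {"id": "c", "source": "asap"},
]

print(count_by_source(records))
print([record["id"] for record in asap_only(records)])
```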

Metadata enhancements:

  • Reconcile and normalize organization names for affiliations and funders in a subset of records with the Research Organization Registry (ROR)
    • Add ror_name and ror_id subfields for affiliations and funders in JSON files. Unreconciled affiliation and funder strings are identified with values of null
    • Add new columns affiliationsROR and fundersROR in CSV files. Unreconciled affiliation and funder strings are identified with values of NONE NONE (this is to ensure consistency in number and order of values in cases where some strings have been reconciled and others have not)
  • Normalize DOI formats for articles and papers as full URLs
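The two JSON-facing enhancements above can be illustrated with a short sketch. It rests on assumptions that the notes do not confirm: that `affiliations` and `funders` are lists of objects each carrying `ror_name` and `ror_id` keys, and that "full URLs" means the `https://doi.org/` form. The function names are hypothetical, and the scripts actually used for the release are the ones referenced on GitHub.

```python
from typing import Any, Dict, List, Tuple

# DOI spellings this sketch accepts (assumption, not an exhaustive list).
_DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def split_by_ror(entries: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate entries reconciled with ROR from those whose ror_id is null."""
    reconciled = [entry for entry in entries if entry.get("ror_id") is not None]
    unreconciled = [entry for entry in entries if entry.get("ror_id") is None]
    return reconciled, unreconciled


def doi_to_url(value: str) -> str:
    """Rewrite a DOI in any accepted spelling as an https://doi.org/ URL."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {value!r}")
    return "https://doi.org/" + doi


# Placeholder entries; names and identifiers are invented for illustration.
funders = [
    {"ror_name": "Example Funder", "ror_id": "https://ror.org/example"},
    {"ror_name": None, "ror_id": None},
]
reconciled, unreconciled = split_by_ror(funders)
print(len(reconciled), len(unreconciled))
print(doi_to_url("doi:10.5281/zenodo.13376773"))
```

Accession numbers in the dataset field are not DOIs, so a caller would apply `doi_to_url` only to values known to be DOIs.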

Additional details about the above changes, including the scripts used to perform these tasks, are available on GitHub.

Additional enhancements to the corpus are ongoing and will be addressed in the course of subsequent releases. Users are invited to submit feedback via GitHub. For general questions, email info@makedatacount.org.

Files

2025-02-01-data-citation-corpus-v3.0-csv.zip

Files (995.7 MB)

Checksum                               Size
md5:f744eb66df2fcd832b5f2a9e78103b26   467.3 MB
md5:31f8bec01ebdc44dc0de7129b0938e1d   528.4 MB
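A downloaded archive can be compared against the MD5 checksums listed above with a sketch like the following. The listing here does not say which checksum belongs to which file, so the caller has to supply the pairing; the helper names are illustrative.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as "md5:<hex digest>"."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("2025-02-01-data-citation-corpus-v3.0-csv.zip", "md5:...")` would return True only if the local file's digest equals the listed value.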

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.11196859 (DOI)
Dataset: 10.5281/zenodo.11216814 (DOI)
Dataset: 10.5281/zenodo.13376773 (DOI)

Funding

Wellcome Trust
Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z