Data Citation Corpus Data File
Description
Data file for the third release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,322,388 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2025-02-01-data-citation-corpus-01-v3.0.json.
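The naming convention above can be unpacked programmatically when working with several batches. The following is a minimal sketch: the pattern is inferred from the single example file name, and the `.csv` extension and the helper name `parse_batch_filename` are assumptions, not part of the release.

```python
import re
from datetime import date

# Pattern inferred from the one example file name in the description;
# the "csv" alternative is an assumption about how CSV batches are named.
FILENAME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d+)-v(?P<version>\d+\.\d+)\.(?P<ext>json|csv)$"
)


def parse_batch_filename(name):
    """Return (publication_date, batch_number, version, extension), or None."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return (
        date.fromisoformat(match.group("date")),
        int(match.group("batch")),
        match.group("version"),
        match.group("ext"),
    )


print(parse_batch_filename("2025-02-01-data-citation-corpus-01-v3.0.json"))
```

Sorting parsed results by batch number gives a stable processing order across the batches.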
The data citations in the file originate from the following sources:
- DataCite Event Data
- A project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles
- Data citations identified by Aligning Science Across Parkinson’s (ASAP)
Each data citation record comprises:
- A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
- Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
Field | Description | Required? |
---|---|---|
id | Internal identifier for the citation | Yes |
created | Date of item's incorporation into the corpus | Yes |
updated | Date of item's most recent update in corpus | Yes |
repository | Repository where cited data is stored | No |
publisher | Publisher for the article citing the data | No |
journal | Journal for the article citing the data | No |
title | Title of cited data | No |
publication | DOI of article where data is cited | Yes |
dataset | DOI or accession number of cited data | Yes |
publishedDate | Date when citing article was published | No |
source | Source where citation was harvested | Yes |
subjects | Subject information for cited data | No |
affiliations | Affiliation information for creator of cited data | No |
funders | Funding information for cited data | No |
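The required fields listed above can be checked when loading records. The following is a minimal sketch that assumes each record has been parsed into a Python dict keyed by the field names in the table; the helper name and all values in the example record are hypothetical placeholders.

```python
# Fields marked "Yes" in the Required? column of the field list.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")


def missing_required(record):
    """Return the required fields that are absent, null, or empty in a record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) in (None, "")]


# Hypothetical record with placeholder values (not taken from the corpus).
example = {
    "id": "example-id",
    "created": "2025-01-01",
    "updated": "2025-01-01",
    "publication": "https://doi.org/10.1234/example-article",
    "dataset": "EXAMPLE-ACCESSION",
    "source": "asap",
}

print(missing_required(example))                   # nothing missing
print(missing_required({**example, "dataset": ""}))  # dataset is flagged
```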
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
Notes on v3.0:
The third release of the Data Citation Corpus data file reflects a few changes made to add new citations, including those from a new data source (ASAP), update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
- Add 65,524 new data citations created in DataCite Event Data between August 2024 and December 2024
Add ASAP citations:
- Add 750 new data citations provided by Aligning Science Across Parkinson’s (ASAP), identified through processes to evaluate compliance with ASAP’s open science practices, which involve a partnership with DataSeer and internal curation (described here).
- Citations with provenance from ASAP are identified as “asap” in the source field
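Because provenance is recorded in the source field, ASAP citations can be separated from the rest with a simple filter. The following is a minimal sketch, assuming records are dicts with a `source` key; only the value "asap" is confirmed by this description, and "other-source" in the sample is a placeholder.

```python
from collections import Counter


def count_by_source(records):
    """Tally records by the value of their source field."""
    return Counter(record.get("source") for record in records)


def only_asap(records):
    """Keep records whose source field is exactly "asap"."""
    return [record for record in records if record.get("source") == "asap"]


# Hypothetical sample; "other-source" is a placeholder, not a real source value.
sample = [
    {"id": "a", "source": "asap"},
    {"id": "b", "source": "other-source"},
    {"id": "c", "source": "asap"},
]

print(count_by_source(sample))
print([record["id"] for record in only_asap(sample)])
```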
Metadata enhancements:
- Reconcile and normalize organization names for affiliations and funders in a subset of records with the Research Organization Registry (ROR)
- Add ror_name and ror_id subfields for affiliations and funders in JSON files. Unreconciled affiliation and funder strings are identified with values of null
- Add new columns affiliationsROR and fundersROR in CSV files. Unreconciled affiliation and funder strings are identified with values of NONE NONE (this is to ensure consistency in number and order of values in cases where some strings have been reconciled and others have not)
- Normalize DOI formats for articles and papers as full URLs
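When joining the corpus against other data, DOIs from external sources may need the same full-URL form. The following is a minimal sketch rather than the script used for this release: it assumes "full URLs" means the `https://doi.org/` resolver form, and the helper name `doi_as_url` is hypothetical.

```python
import re

# Assumed target form; the release notes only say "full URLs".
DOI_PREFIX = "https://doi.org/"

# Strips common leading forms: resolver URLs and the "doi:" scheme.
_LEADING_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def doi_as_url(value):
    """Return a DOI as a full URL, or None if the value does not look like a DOI."""
    bare = _LEADING_RE.sub("", value.strip())
    if not bare.startswith("10."):
        return None  # e.g. an accession number, which is left to the caller
    return DOI_PREFIX + bare


print(doi_as_url("10.5281/zenodo.11196859"))
print(doi_as_url("doi:10.5281/zenodo.11196859"))
```

Values that are not DOIs, such as accession numbers in the dataset field, return None so they can be handled separately.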
Additional details about the above changes, including the scripts used to perform these tasks, are available on GitHub.
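The null and NONE NONE markers for unreconciled strings need to be skipped when collecting ROR identifiers. The following is a minimal sketch under stated assumptions: that each affiliation or funder entry in the JSON is an object carrying `ror_name` and `ror_id` keys, and that a CSV value has already been split into individual items. The exact nesting and the CSV delimiter are not specified here, and the sample values are placeholders.

```python
def reconciled_ror_ids(entries):
    """Collect non-null ror_id values from affiliation or funder entries."""
    ids = []
    for entry in entries or []:
        # Assumption: each entry is a dict with "ror_id" and "ror_name" keys.
        if isinstance(entry, dict) and entry.get("ror_id") is not None:
            ids.append(entry["ror_id"])
    return ids


def is_unreconciled_csv_item(item):
    """True when a single CSV item carries the NONE NONE marker."""
    return item.strip() == "NONE NONE"


# Hypothetical entries; the ROR ID below is a placeholder, not a real identifier.
affiliations = [
    {"ror_name": "Example University", "ror_id": "https://ror.org/EXAMPLE"},
    {"ror_name": None, "ror_id": None},
]

print(reconciled_ror_ids(affiliations))
print(is_unreconciled_csv_item("NONE NONE"))
```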
Additional enhancements to the corpus are ongoing and will be addressed in the course of subsequent releases. Users are invited to submit feedback via GitHub. For general questions, email info@makedatacount.org.
Files
2025-02-01-data-citation-corpus-v3.0-csv.zip
Total size: 995.7 MB
Checksum | Size |
---|---|
md5:f744eb66df2fcd832b5f2a9e78103b26 | 467.3 MB |
md5:31f8bec01ebdc44dc0de7129b0938e1d | 528.4 MB |
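The md5 values listed for the files can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the caller supplies the local path together with the checksum listed for that file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or bare hex."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for archives of several hundred megabytes.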
Additional details
Related works
- Is new version of
- Dataset: 10.5281/zenodo.11196859 (DOI)
- Dataset: 10.5281/zenodo.11216814 (DOI)
- Dataset: 10.5281/zenodo.13376773 (DOI)
Funding
- Wellcome Trust
- Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z