Data Citation Corpus Data File
Description
Data file for the fourth release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 10,697,745 data citation records (of which 9,682,257 represent unique dataset-publication pairs) in JSON and CSV formats. The JSON file is the version of record.
Data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2025-08-15-data-citation-corpus-01-v4.1.json.
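The naming convention above can be unpacked programmatically. The following is a minimal sketch in Python; the regular expression is inferred from the single example file name, the helper name `parse_batch_name` is hypothetical, and the assumption that CSV batches follow the same pattern is not confirmed by this record.

```python
import re
from datetime import date

# Pattern inferred from the one example file name given in the description.
# Accepting a ".csv" extension is an assumption; only a ".json" name is shown.
BATCH_NAME = re.compile(
    r"^(?P<published>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d+)-v(?P<version>\d+(?:\.\d+)*)\.(?P<ext>json|csv)$"
)


def parse_batch_name(name):
    """Split a batch file name into publication date, batch number, version, and format."""
    match = BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized batch file name: {name!r}")
    return {
        "published": date.fromisoformat(match["published"]),
        "batch": int(match["batch"]),
        "version": match["version"],
        "format": match["ext"],
    }


info = parse_batch_name("2025-08-15-data-citation-corpus-01-v4.1.json")
print(info["published"], info["batch"], info["version"], info["format"])
```

Sorting parsed names by the `batch` value is one way to process the batches in order.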
The data citations in the file originate from the following sources:
- DataCite Event Data
- Chan Zuckerberg Initiative (CZI) Science Knowledge Graph
- Aligning Science Across Parkinson’s (ASAP)
- Europe PMC
Each data citation record comprises:
- A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
- Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
Field | Description | Required? |
---|---|---|
id | Internal identifier for the citation | Yes |
created | Date of item's incorporation into the corpus | Yes |
updated | Date of item's most recent update in corpus | Yes |
repository | Repository where cited data is stored | No |
publisher | Publisher for the article citing the data | No |
journal | Journal for the article citing the data | No |
title | Title of cited data | No |
publication | DOI of article where data is cited | Yes |
dataset | DOI or accession number of cited data | Yes |
publishedDate | Date when citing article was published | No |
source | Source where citation was harvested | Yes |
subjects | Subject information for cited data | No |
affiliations | Affiliation information for creator of cited data | No |
funders | Funding information for cited data | No |
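As an illustration of how the field list might be used, the sketch below checks a record for the six required fields and counts unique dataset-publication pairs. It assumes records have already been loaded as Python dictionaries keyed by the field names above. The sample values are hypothetical placeholders, and comparing identifiers as exact strings is an assumption; the method used to arrive at the published unique-pair count is not described here.

```python
# Fields marked "Yes" in the Required? column of the field list.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")


def missing_required(record):
    """Return the required fields that are absent, null, or empty in a record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) in (None, "")]


def count_unique_pairs(records):
    """Count distinct (dataset, publication) pairs, compared as exact strings."""
    return len({(r["dataset"], r["publication"]) for r in records})


# Hypothetical sample records; identifiers and dates are placeholders, not corpus data.
sample = [
    {"id": "a1", "created": "2025-01-01", "updated": "2025-01-02",
     "publication": "10.1234/example-article", "dataset": "10.1234/example-data",
     "source": "eupmc"},
    {"id": "a2", "created": "2025-01-01", "updated": "2025-01-02",
     "publication": "10.1234/example-article", "dataset": "10.1234/example-data",
     "source": "eupmc"},
    {"id": "a3", "created": "2025-01-01", "updated": "2025-01-02",
     "publication": "10.1234/other-article", "dataset": "ACC0001"},
]

print(count_unique_pairs(sample))                # 2
print([missing_required(r) for r in sample])     # [[], [], ['source']]
```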
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
Notes on v4.1:
Version 4.1 of the Data Citation Corpus is a minor update to v4.0 that corrects (1) an error in which a portion of DOI-DOI citations originating from Europe PMC were attributed to the wrong repository, and (2) a small number of DOI formatting errors in the "publication" field.
Notes on v4.0:
The fourth release of the Data Citation Corpus data file adds new citations from the following sources:
- 5.2 million data citations from Europe PMC, identified as "eupmc" in the source field. Ingest of these citations was performed 9 July 2025.
- 139,647 data citations from DataCite Event Data for the period 1 January 2025 through 30 June 2025.
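Because each record carries a source value, citations from a given source can be tallied or isolated. The sketch below assumes records are available as Python dictionaries; "eupmc" is the only source value named in this description, so the other value in the sample, "other-source", is a hypothetical placeholder, as are the helper names.

```python
from collections import Counter


def tally_sources(records):
    """Count records per value of the "source" field."""
    return Counter(r.get("source", "<missing>") for r in records)


def only_source(records, source):
    """Keep only the records harvested from the given source."""
    return [r for r in records if r.get("source") == source]


# Hypothetical sample; "other-source" stands in for any non-Europe PMC source value.
sample = [
    {"id": "a1", "source": "eupmc"},
    {"id": "a2", "source": "eupmc"},
    {"id": "a3", "source": "other-source"},
]

print(tally_sources(sample)["eupmc"])        # 2
print(len(only_source(sample, "eupmc")))     # 2
```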
This release also includes the following new metadata enhancements:
- Affiliation information for cited data from the Gene Expression Omnibus (GEO) repository, reconciled to Research Organization Registry (ROR) IDs where possible.
- Reconciliation of organization and funder names with the Research Organization Registry (ROR) for new citations from Event Data.
- Application of Field of Science subject terms to citation records originating from Europe PMC, based on the disciplinary area of the data repository.
Additional details about these changes, including the scripts used to perform them, are available on GitHub.
Additional enhancements to the corpus are ongoing and will be addressed in the course of subsequent releases. Users are invited to submit feedback via GitHub. For general questions, email info@makedatacount.org.
Files
2025-08-15-data-citation-corpus-v4.1-csv.zip
Total size of all files: 1.9 GB
Checksum | Size |
---|---|
md5:8d60ab7f08c7d4a09a5865c1d2da7654 | 887.5 MB |
md5:601293df895148b315fdf5395484e768 | 1.0 GB |
Additional details
Related works
- Is new version of
- Dataset: 10.5281/zenodo.11196859 (DOI)
- Dataset: 10.5281/zenodo.11216814 (DOI)
- Dataset: 10.5281/zenodo.13376773 (DOI)
- Dataset: 10.5281/zenodo.14897662 (DOI)
- Dataset: 10.5281/zenodo.16546069 (DOI)
Funding
- Wellcome Trust
- Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z