Published August 15, 2025 | Version v4.1
Dataset Open

Data Citation Corpus Data File

Description

Data file for the fourth release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

The data file includes 10,697,745 data citation records (of which 9,682,257 represent unique dataset-publication pairs) in JSON and CSV formats. The JSON file is the version of record.
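The gap between the total record count and the number of unique dataset-publication pairs can be explored with a short sketch like the one below. It assumes that each record exposes the `dataset` and `publication` fields described in this record, and that two records form the same pair when both values match exactly; how the corpus maintainers computed their figure is not stated here, so identifier normalization may change the result. The sample records are hypothetical.

```python
from typing import Iterable, Mapping, Set, Tuple


def count_unique_pairs(records: Iterable[Mapping[str, str]]) -> Tuple[int, int]:
    """Return (total records, unique dataset-publication pairs)."""
    total = 0
    seen: Set[Tuple[str, str]] = set()
    for record in records:
        total += 1
        # Assumption: exact string match defines a duplicate pair.
        seen.add((record["dataset"], record["publication"]))
    return total, len(seen)


# Hypothetical sample records, not taken from the corpus.
sample = [
    {"dataset": "10.1234/abc", "publication": "10.5555/pub.1"},
    {"dataset": "10.1234/abc", "publication": "10.5555/pub.1"},
    {"dataset": "GSE00001", "publication": "10.5555/pub.2"},
]
print(count_unique_pairs(sample))
```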

Data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2025-08-15-data-citation-corpus-01-v4.1.json.
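A batch file name can be split into its parts with a regular expression, as in the following sketch. The pattern is inferred from the single example name given above, so the exact format of other batch names (for instance the CSV extension or the width of the batch number) is an assumption, and `parse_batch_name` is a hypothetical helper rather than part of any published tooling.

```python
import re
from datetime import date

# Pattern inferred from the one example file name in the description;
# other batch names are assumed to follow the same layout.
BATCH_NAME = re.compile(
    r"^(?P<published>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d+)-v(?P<version>\d+\.\d+)\.(?P<ext>json|csv)$"
)


def parse_batch_name(name: str) -> dict:
    """Split a batch file name into publication date, batch, version, format."""
    match = BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized batch file name: {name!r}")
    return {
        "published": date.fromisoformat(match["published"]),
        "batch": int(match["batch"]),
        "version": match["version"],
        "format": match["ext"],
    }


info = parse_batch_name("2025-08-15-data-citation-corpus-01-v4.1.json")
print(info["published"], info["batch"], info["version"], info["format"])
```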

The data citations in the file originate from the following sources:

  • DataCite Event Data
  • Chan Zuckerberg Initiative (CZI) Science Knowledge Graph
  • Aligning Science Across Parkinson’s (ASAP)
  • Europe PMC

Each data citation record consists of:

  • A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

  • Metadata for the cited dataset and for the citing publication 

The data file includes the following fields:

| Field         | Description                                       | Required? |
|---------------|---------------------------------------------------|-----------|
| id            | Internal identifier for the citation              | Yes       |
| created       | Date of item's incorporation into the corpus      | Yes       |
| updated       | Date of item's most recent update in corpus       | Yes       |
| repository    | Repository where cited data is stored             | No        |
| publisher     | Publisher for the article citing the data         | No        |
| journal       | Journal for the article citing the data           | No        |
| title         | Title of cited data                               | No        |
| publication   | DOI of article where data is cited                | Yes       |
| dataset       | DOI or accession number of cited data             | Yes       |
| publishedDate | Date when citing article was published            | No        |
| source        | Source where citation was harvested               | Yes       |
| subjects      | Subject information for cited data                | No        |
| affiliations  | Affiliation information for creator of cited data | No        |
| funders       | Funding information for cited data                | No        |

Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
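The required and optional fields listed above can be checked per record with a small helper such as the sketch below. The field names and required flags are copied from the field list; treating an empty string the same as a missing value is an assumption, and the sample record contains hypothetical values only.

```python
from typing import Any, List, Mapping

# Field names and required flags as listed in the field description.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")
OPTIONAL_FIELDS = (
    "repository", "publisher", "journal", "title",
    "publishedDate", "subjects", "affiliations", "funders",
)


def missing_required(record: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    # Assumption: None and "" both count as missing.
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def unknown_fields(record: Mapping[str, Any]) -> List[str]:
    """Return keys that are not part of the documented field list."""
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    return sorted(set(record) - known)


# Hypothetical record for illustration; the values are placeholders.
record = {
    "id": "example-id",
    "created": "2025-08-15",
    "updated": "2025-08-15",
    "publication": "10.5555/pub.1",
    "dataset": "GSE00001",
    "source": "",
    "note": "not a documented field",
}
print(missing_required(record))
print(unknown_fields(record))
```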

Notes on v4.1:

Version 4.1 of the Data Citation Corpus is a minor update to v4.0 that corrects (1) an error that occurred when a portion of DOI-DOI citations originating from Europe PMC were attributed to the wrong repository, and (2) a small number of DOI formatting errors in the "publication" field.
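Users who want to check the "publication" field for DOI formatting problems in their own copy of the data could start from a sketch like the following. It is not the correction applied by the corpus maintainers, whose method is not described here. The prefix list, the lowercasing step, and the pattern (a commonly used heuristic for modern DOIs) are all assumptions, and `normalize_doi` is a hypothetical helper.

```python
import re
from typing import Optional

# Heuristic pattern for modern DOIs; an assumption, not an official rule.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that are commonly attached to DOIs (assumed list).
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value: str) -> Optional[str]:
    """Return a bare lowercase DOI, or None if the value does not look like one."""
    candidate = value.strip()
    lowered = candidate.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    candidate = candidate.lower()
    return candidate if DOI_PATTERN.match(candidate) else None


print(normalize_doi("https://doi.org/10.5281/zenodo.11196859"))
print(normalize_doi("10.5281 /zenodo"))
```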

Notes on v4.0:

The fourth release of the Data Citation Corpus data file adds new citations from the following sources:

  • 5.2 million data citations from Europe PMC, identified as "eupmc" in the source field. These citations were ingested on 9 July 2025.

  • 139,647 data citations from DataCite Event Data for the period 1 January 2025 through 30 June 2025.
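Because each record carries a `source` field, the contribution of each source can be tallied as in the sketch below. Only the "eupmc" value is documented in this record; the other value in the sample is a placeholder, and the sample records themselves are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_source(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count citation records per value of the source field."""
    return Counter(record["source"] for record in records)


# Hypothetical records; "other-source" is a placeholder, not a real value.
sample_records = [
    {"source": "eupmc"},
    {"source": "eupmc"},
    {"source": "other-source"},
]
counts = count_by_source(sample_records)
print(counts["eupmc"])
```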

This release also includes the following new metadata enhancements:

  • Affiliation information for cited data from the Gene Expression Omnibus (GEO) repository, reconciled to Research Organization Registry (ROR) IDs where possible.

  • Reconciliation of organization and funder names with the Research Organization Registry (ROR) for new citations from Event Data.

  • Application of Field of Science subject terms to citation records originating from Europe PMC, based on the disciplinary area of the data repository.

Additional details about the above changes, including the scripts used to perform these tasks, are available on GitHub.

Additional enhancements to the corpus are ongoing and will be addressed in the course of subsequent releases. Users are invited to submit feedback via GitHub. For general questions, email info@makedatacount.org.

Files

2025-08-15-data-citation-corpus-v4.1-csv.zip

Files (1.9 GB)

md5:8d60ab7f08c7d4a09a5865c1d2da7654 (887.5 MB)
md5:601293df895148b315fdf5395484e768 (1.0 GB)

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.11196859 (DOI)
Dataset: 10.5281/zenodo.11216814 (DOI)
Dataset: 10.5281/zenodo.13376773 (DOI)
Dataset: 10.5281/zenodo.14897662 (DOI)
Dataset: 10.5281/zenodo.16546069 (DOI)

Funding

Wellcome Trust
Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z