Data Citation Corpus Data File
Creators
Description
Data file for the first release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 10,006,058 data citation records in JSON and CSV formats. The JSON file is the version of record.
Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable. No citations have been added to or removed from the dataset in v1.1.
For convenience, the data file is provided in batches of approximately 1 million records each. The publication date and batch number are included in each component file name, e.g., 2024-05-10-data-citation-corpus-01-v1.1.json.
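The naming convention described above can be unpacked programmatically. The following is a minimal sketch, not part of the official release: the pattern is inferred from the single example file name, and the assumptions that CSV batches follow the same pattern and that the version always looks like `v1.1` are unverified. The names `BatchFile` and `parse_batch_name` are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple


class BatchFile(NamedTuple):
    """Parts of a component file name (hypothetical helper type)."""
    published: date
    batch: int
    version: str
    extension: str


# Pattern inferred from the one example in the description; the "csv"
# alternative is an assumption, since only a .json name is shown.
_BATCH_NAME = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})-data-citation-corpus-(\d+)-v(\d+(?:\.\d+)*)\.(json|csv)$"
)


def parse_batch_name(name: str) -> BatchFile:
    """Split a component file name into date, batch number, version, extension."""
    match = _BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized component file name: {name!r}")
    year, month, day, batch, version, extension = match.groups()
    return BatchFile(date(int(year), int(month), int(day)), int(batch), version, extension)


info = parse_batch_name("2024-05-10-data-citation-corpus-01-v1.1.json")
print(info.published.isoformat(), info.batch, info.version)  # 2024-05-10 1 1.1
```

Sorting parsed names by `batch` gives a stable order for processing the component files one at a time.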
The data citations in the file originate from DataCite Event Data and a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record comprises:
- A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited
- Metadata for the cited dataset and for the citing publication object
The data file includes the following fields:
Field | Description | Required? |
---|---|---|
id | Internal identifier for the citation | Yes |
created | Date of item's incorporation into the corpus | Yes |
updated | Date of item's most recent update in corpus | Yes |
repository | Repository where cited data is stored | No |
publisher | Publisher for the article citing the data | No |
journal | Journal for the article citing the data | No |
title | Title of cited data | No |
objId | DOI of article where data is cited | Yes |
subjId | DOI or accession number of cited data | Yes |
publishedDate | Date when citing article was published | No |
accessionNumber | Accession number of cited data | No |
doi | DOI of cited data | No |
relationTypeId | Relation type in metadata between citation object and subject | No |
source | Source where citation was harvested | Yes |
subjects | Subject information for cited data | No |
affiliations | Affiliation information for creator of cited data | No |
funders | Funding information for cited data | No |
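
The required fields listed above lend themselves to a simple completeness check. The sketch below assumes that a record has already been loaded as a Python dictionary keyed by the field names in the table; how records are laid out inside each JSON batch is not stated here, so loading is left out. The sample record uses invented placeholder values, and `missing_required_fields` is a hypothetical helper name.

```python
# Fields marked "Yes" in the Required? column of the field table.
REQUIRED_FIELDS = ("id", "created", "updated", "objId", "subjId", "source")


def missing_required_fields(record: dict) -> list:
    """Return the required field names that are absent, None, or empty strings."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Placeholder record for illustration only; these values are not from the corpus.
sample = {
    "id": "example-id",
    "created": "2024-01-30",
    "updated": "2024-05-10",
    "objId": "10.1234/example-article",
    "subjId": "10.1234/example-dataset",
    "source": "",  # deliberately left empty to trigger the check
    "title": "Example dataset title",
}

print(missing_required_fields(sample))  # ['source']
```

Optional fields such as `repository` or `funders` are ignored by this check, so a record that carries only the six required fields passes.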
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
Files
Name | Size | Checksum |
---|---|---|
2024-05-10-data-citation-corpus-v1.1.zip | 1.8 GB | md5:e2e191e9573a9ee729413df872edd96b |
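
The MD5 checksum published with the archive can be used to confirm that a download is intact. This is a minimal sketch using Python's standard `hashlib`; the local path is an assumption (the archive saved under its published name in the working directory), and `md5_of_file` and `verify_download` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# Checksum as listed for 2024-05-10-data-citation-corpus-v1.1.zip.
EXPECTED_MD5 = "e2e191e9573a9ee729413df872edd96b"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("2024-05-10-data-citation-corpus-v1.1.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum mismatch")
```

Reading in chunks matters here because the archive is about 1.8 GB, which is more than is comfortable to hold in memory at once.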
Additional details
Related works
- Is new version of: Dataset 10.5281/zenodo.11196859 (DOI)
Funding
- Wellcome Trust: Make Data Count: A Central Corpus for All Data Citations (226453/Z/22/Z)