Data Citation Corpus Data File
Description
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, for example: 2024-08-23-data-citation-corpus-01-v2.0.json.
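As a minimal sketch of how the batch file names could be parsed, the snippet below uses a pattern inferred from the single example name above. The regular expression, the `BatchFile` type, and the `parse_batch_name` helper are illustrative assumptions, not part of the corpus tooling; in particular, the assumption that CSV batches follow the same naming scheme is not confirmed by the description.

```python
import re
from datetime import date
from typing import NamedTuple

# Pattern inferred from the example "2024-08-23-data-citation-corpus-01-v2.0.json".
# Assumption: CSV batches use the same scheme with a ".csv" extension.
BATCH_NAME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d+)-v(?P<version>\d+\.\d+)\.(?P<ext>json|csv)$"
)


class BatchFile(NamedTuple):
    """Parts of a batch file name (hypothetical helper type)."""
    published: date
    batch: int
    version: str
    ext: str


def parse_batch_name(name: str) -> BatchFile:
    """Split a batch file name into publication date, batch number, version, and extension."""
    match = BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised batch file name: {name!r}")
    return BatchFile(
        published=date.fromisoformat(match["date"]),
        batch=int(match["batch"]),
        version=match["version"],
        ext=match["ext"],
    )
```

Sorting the parsed results by `batch` gives a stable order for processing the batches one at a time.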
The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record comprises:
- A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
- Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
Field | Description | Required? |
---|---|---|
id | Internal identifier for the citation | Yes |
created | Date of item's incorporation into the corpus | Yes |
updated | Date of item's most recent update in corpus | Yes |
repository | Repository where cited data is stored | No |
publisher | Publisher for the article citing the data | No |
journal | Journal for the article citing the data | No |
title | Title of cited data | No |
publication | DOI of article where data is cited | Yes |
dataset | DOI or accession number of cited data | Yes |
publishedDate | Date when citing article was published | No |
source | Source where citation was harvested | Yes |
subjects | Subject information for cited data | No |
affiliations | Affiliation information for creator of cited data | No |
funders | Funding information for cited data | No |
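The required and optional fields listed above can be checked with a few lines of code. The sketch below assumes each record has been loaded as a Python dictionary keyed by the field names in the table; the `missing_required` helper and the rule that an empty string counts as missing are illustrative assumptions, not a documented validation rule of the corpus.

```python
# Field names taken from the table above.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")
OPTIONAL_FIELDS = (
    "repository", "publisher", "journal", "title",
    "publishedDate", "subjects", "affiliations", "funders",
)


def missing_required(record: dict) -> list:
    """Return the required fields that are absent, None, or empty in a record.

    Assumption: an empty string is treated the same as a missing value.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            missing.append(field)
    return missing
```

A record that yields an empty list has every required field populated; optional fields are not inspected.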
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
- Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 and 30 June 2024
Remove citation records deemed out of scope for the corpus:
- 273,567 records from DataCite Event Data with non-citation relationship types
- 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
- 44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication
- 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
- 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
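The duplicate-removal rule in the last item (same values across seven fields, keep the most recently updated record) can be illustrated as follows. This is a sketch under stated assumptions, not the project's actual script: it assumes records are dictionaries using the v1.1 field names listed above, that key values are simple hashable scalars, and that `updated` is an ISO 8601 timestamp that `datetime.fromisoformat` can parse. The `deduplicate` helper name is hypothetical.

```python
from datetime import datetime

# Fields whose combined values identify a duplicate, as listed above.
DEDUP_KEYS = (
    "obj_id", "subj_id", "repository_id", "publisher_id",
    "journal_id", "accession_number", "source_id",
)


def deduplicate(records):
    """Keep one record per key combination, retaining the latest `updated` value.

    Assumption: `updated` is an ISO 8601 string such as "2024-01-05T10:00:00".
    """
    latest = {}
    for record in records:
        key = tuple(record.get(field) for field in DEDUP_KEYS)
        kept = latest.get(key)
        if kept is None or (
            datetime.fromisoformat(record["updated"])
            > datetime.fromisoformat(kept["updated"])
        ):
            latest[key] = record
    return list(latest.values())
```

Using a dictionary keyed by the seven fields keeps the pass linear in the number of records, which matters at the scale of several million rows.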
Metadata enhancements:
- Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
- Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization are in progress and will be included in future releases)
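To give a sense of what removing email addresses and social media handles from organization names might involve, here is a minimal sketch. It is not the cleanup script used for the release; the two regular expressions and the `clean_org_name` helper are illustrative assumptions and will miss many real-world cases.

```python
import re

# Simplified patterns (assumptions, not the project's rules):
# an email address, and an "@handle" not preceded by a word character.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HANDLE = re.compile(r"(?<!\w)@\w+")


def clean_org_name(name: str) -> str:
    """Strip email addresses and @handles, then tidy leftover whitespace."""
    name = EMAIL.sub(" ", name)   # remove emails first so their "@" is not seen as a handle
    name = HANDLE.sub(" ", name)
    return " ".join(name.split()).strip(" ,;")
```

Emails are removed before handles so that the domain part of an address is not left behind as stray text.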
Data structure updates to improve usability and eliminate redundancies:
- Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
- Remove accessionNumber and doi elements to eliminate redundancy with subj_id
- Remove relationTypeId fields as these are specific to Event Data only
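For readers with code written against the earlier record layout, the structural changes above could be mirrored with a small conversion helper. This is a sketch only: the mapping of subj_id to "dataset" and obj_id to "publication" follows the order in which the fields are named above and should be checked against the project's documentation, and `to_v2_shape` is a hypothetical helper, not part of any released tooling.

```python
# Assumption: the renames apply in the order the fields are listed above.
RENAMES = {"subj_id": "dataset", "obj_id": "publication"}
# Elements described above as removed in the second release.
REMOVED = {"accessionNumber", "doi", "relationTypeId"}


def to_v2_shape(record: dict) -> dict:
    """Return a copy of a record with removed elements dropped and fields renamed."""
    converted = {}
    for key, value in record.items():
        if key in REMOVED:
            continue
        converted[RENAMES.get(key, key)] = value
    return converted
```

The function returns a new dictionary and leaves the input untouched, so the original record remains available for comparison.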
Full details of these changes, including the scripts used to perform them, are available on GitHub.
While v2 addresses a number of cleanup and enhancement tasks, data issues may remain and further enhancements are being explored. These will be addressed in subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
Files
Name | Size | Checksum |
---|---|---|
2024-08-23-data-citation-corpus-v2.0.zip | 956.1 MB | md5:03c3fb45edbab61fa59e0acbd5275715 |
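The MD5 checksum listed for the archive can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the `md5_of_file` helper name and the local file path in the usage note are assumptions for illustration.

```python
import hashlib

# Checksum published alongside the archive on this page.
EXPECTED_MD5 = "03c3fb45edbab61fa59e0acbd5275715"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved under its original name in the current directory, `md5_of_file("2024-08-23-data-citation-corpus-v2.0.zip") == EXPECTED_MD5` should evaluate to `True` for an uncorrupted download.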
Additional details
Related works
- Is new version of
- Dataset: 10.5281/zenodo.11196859 (DOI)
- Dataset: 10.5281/zenodo.11216814 (DOI)
Funding
- Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z
- Wellcome Trust