Published August 23, 2024 | Version v2.0
Dataset Open

Data Citation Corpus Data File

Description

Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2024-08-23-data-citation-corpus-01-v2.0.json.
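The naming convention can be used to sort or select batches programmatically. The following is a minimal sketch, not part of the release: the regular expression is inferred from the single example name above, and it assumes that the CSV batches follow the same pattern and that batch numbers are always two digits. The names BATCH_NAME, BatchFile, and parse_batch_name are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple

# Pattern inferred from the one example file name in the description.
# Assumption: CSV batches use the same pattern with a .csv extension.
BATCH_NAME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d{2})-v(?P<version>\d+\.\d+)\.(?P<ext>json|csv)$"
)


class BatchFile(NamedTuple):
    published: date
    batch: int
    version: str
    extension: str


def parse_batch_name(name: str) -> BatchFile:
    """Split a batch file name into publication date, batch number, and version."""
    match = BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized batch file name: {name!r}")
    return BatchFile(
        published=date.fromisoformat(match["date"]),
        batch=int(match["batch"]),
        version=match["version"],
        extension=match["ext"],
    )


info = parse_batch_name("2024-08-23-data-citation-corpus-01-v2.0.json")
print(info.published, info.batch, info.version)
```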

The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

Each data citation record consists of:

  • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited  

  • Metadata for the cited dataset and for the citing publication 

The data file includes the following fields:

Field         | Description                                       | Required?
id            | Internal identifier for the citation              | Yes
created       | Date of item's incorporation into the corpus      | Yes
updated       | Date of item's most recent update in corpus       | Yes
repository    | Repository where cited data is stored             | No
publisher     | Publisher for the article citing the data         | No
journal       | Journal for the article citing the data           | No
title         | Title of cited data                               | No
publication   | DOI of article where data is cited                | Yes
dataset       | DOI or accession number of cited data             | Yes
publishedDate | Date when citing article was published            | No
source        | Source where citation was harvested               | Yes
subjects      | Subject information for cited data                | No
affiliations  | Affiliation information for creator of cited data | No
funders       | Funding information for cited data                | No
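The required and optional fields listed above can be checked per record before analysis. The sketch below is illustrative only: the helper name missing_required and the example record values are made up, and it assumes a record has already been loaded as a Python dictionary keyed by the field names in the table.

```python
from typing import Any, Mapping

# Field names and required/optional status as listed in the field table.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")
OPTIONAL_FIELDS = (
    "repository", "publisher", "journal", "title",
    "publishedDate", "subjects", "affiliations", "funders",
)


def missing_required(record: Mapping[str, Any]) -> list[str]:
    """Return the required fields that are absent, null, or empty strings."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Hypothetical record; every value here is a placeholder, not corpus data.
example = {
    "id": "00000000-0000-0000-0000-000000000000",
    "created": "2024-08-01",
    "updated": "2024-08-01",
    "publication": "10.1234/example-article",
    "dataset": "10.1234/example-dataset",
    "source": "datacite",
}

print(missing_required(example))
print(missing_required({"id": "only-an-id"}))
```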

 

Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

Add and update Event Data citations:

  • Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 and 30 June 2024

Remove citation records deemed out of scope for the corpus:

  • 273,567 records from DataCite Event Data with non-citation relationship types 

  • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

  • 44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication

  • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

  • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
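Two of the removal rules above, dropping records whose subj_id equals obj_id and collapsing duplicates while retaining the most recently updated record, can be sketched as follows. This is not the script used for the release (those are referenced separately in this description). It assumes v1.1-style records held as dictionaries, an "updated" value in ISO 8601 format, and made-up sample values; the function names are hypothetical.

```python
from datetime import datetime
from typing import Any, Iterable

# Fields whose combined values identify a duplicate, as listed above.
DEDUP_KEYS = (
    "obj_id", "subj_id", "repository_id", "publisher_id",
    "journal_id", "accession_number", "source_id",
)

Record = dict[str, Any]


def _updated(record: Record) -> datetime:
    # Assumption: "updated" is an ISO 8601 string that fromisoformat accepts.
    return datetime.fromisoformat(record["updated"])


def drop_self_citations(records: Iterable[Record]) -> list[Record]:
    """Discard records where subj_id and obj_id carry the same value."""
    return [r for r in records if r.get("subj_id") != r.get("obj_id")]


def deduplicate(records: Iterable[Record]) -> list[Record]:
    """Keep one record per key combination: the most recently updated one."""
    latest: dict[tuple, Record] = {}
    for record in records:
        key = tuple(record.get(name) for name in DEDUP_KEYS)
        kept = latest.get(key)
        if kept is None or _updated(record) > _updated(kept):
            latest[key] = record
    return list(latest.values())


# Made-up sample: "a" and "b" are duplicates, "c" cites itself.
sample = [
    {"id": "a", "subj_id": "10.1234/ds1", "obj_id": "10.1234/art1",
     "source_id": "czi", "updated": "2023-11-01T00:00:00"},
    {"id": "b", "subj_id": "10.1234/ds1", "obj_id": "10.1234/art1",
     "source_id": "czi", "updated": "2024-02-15T00:00:00"},
    {"id": "c", "subj_id": "10.1234/ds2", "obj_id": "10.1234/ds2",
     "source_id": "datacite", "updated": "2024-01-01T00:00:00"},
]

cleaned = deduplicate(drop_self_citations(sample))
print([r["id"] for r in cleaned])  # prints ['b']
```

Detecting inverted subj_id and obj_id pairs is left out of the sketch, since it requires knowing which identifiers refer to datasets and which to publications.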

Metadata enhancements:

  • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

  • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization is in progress and will be included in future releases)
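As an illustration of the kind of cleanup described for affiliation and funder names, the sketch below strips email addresses and @-style handles from an organization name. It is an assumption about how such a step might look, not the method applied to the corpus; the patterns are deliberately simple, and clean_org_name is a hypothetical helper.

```python
import re

# Simplified patterns for illustration; real-world names may need more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HANDLE = re.compile(r"(?<!\w)@\w{1,30}")


def clean_org_name(name: str) -> str:
    """Remove email addresses and @handles, then tidy leftover punctuation."""
    # Emails go first so their "@domain" part is not mistaken for a handle.
    name = EMAIL.sub("", name)
    name = HANDLE.sub("", name)
    name = re.sub(r"\s{2,}", " ", name)
    return name.strip(" ,;")


print(clean_org_name("University of Example, jane.doe@example.org"))
print(clean_org_name("Example Institute @example_lab"))
```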

Data structure updates to improve usability and eliminate redundancies:

  • Rename subj_id and obj_id fields to “dataset” and “publication”, respectively, for clarity

  • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

  • Remove relationTypeId fields as these are specific to Event Data only
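For code written against the v1.1 layout, the structural changes above amount to a key mapping. The sketch below shows one way to express it. The pairing of subj_id with "dataset" and obj_id with "publication" follows the order in which the fields are listed above, the exact spelling of the removed relationTypeId field(s) is an assumption, and to_v2_layout and the sample values are hypothetical.

```python
from typing import Any

# Renames and removals as described in the list above.
RENAMED = {"subj_id": "dataset", "obj_id": "publication"}
# Assumption: the removed keys are spelled exactly like this in v1.1 records.
REMOVED = {"accessionNumber", "doi", "relationTypeId"}


def to_v2_layout(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a v1.1-style record using the v2 field names."""
    return {
        RENAMED.get(key, key): value
        for key, value in record.items()
        if key not in REMOVED
    }


# Made-up v1.1-style record for illustration.
old = {
    "id": "x1",
    "subj_id": "10.1234/ds1",
    "obj_id": "10.1234/art1",
    "doi": "10.1234/ds1",
    "accessionNumber": None,
    "relationTypeId": "references",
}
new = to_v2_layout(old)
print(new)
```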

 

Full details of the above changes, including the scripts used to perform these tasks, are available on GitHub.

While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases. 


Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

Files

2024-08-23-data-citation-corpus-v2.0.zip (956.1 MB)
md5:03c3fb45edbab61fa59e0acbd5275715
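The MD5 checksum listed for the archive can be used to confirm that a download is complete. A minimal sketch using the Python standard library, assuming the archive was saved in the current working directory under the file name shown above; md5_of_file is a hypothetical helper:

```python
import hashlib
from pathlib import Path

# Checksum as listed for the archive in this record.
EXPECTED_MD5 = "03c3fb45edbab61fa59e0acbd5275715"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the archive sits in the current working directory.
archive = Path("2024-08-23-data-citation-corpus-v2.0.zip")
if archive.exists():
    matches = md5_of_file(archive) == EXPECTED_MD5
    print("checksum OK" if matches else "checksum mismatch")
```

Reading in chunks keeps memory use flat, which matters for an archive of this size.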

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.11196859 (DOI)
Dataset: 10.5281/zenodo.11216814 (DOI)

Funding

Make Data Count: A Central Corpus for All Data Citations 226453/Z/22/Z
Wellcome Trust