Easy ORCID

Hoyt, Charles Tapley

doi:10.5281/zenodo.12204667

Published June 21, 2024 | Version 2023.5

Dataset Open

The first-party ORCID data dump uses a data structure that is overly complex for most use cases. This Zenodo record contains a derived version that is much more straightforward, accessible, and smaller. So far, it includes employers, education, external identifiers, and publications linked to PubMed. Additional processing grounds employers and educational institutions using the Research Organization Registry (ROR). Minor string processing is also applied, such as standardization of education types (e.g., Bachelor of Science, Master of Science) and standardization of PubMed references.

Records

The records.jsonl.gz file is a JSON Lines file where each row represents a single ORCID record in a simple, well-defined schema (see schema.json). The records_hq.jsonl.gz file is a subset of the full records file that only contains records that have at least one ROR-grounded employer, at least one ROR-grounded education, or at least one publication indexed in PubMed. The point of this subset is to remove ORCID records that are generally not possible to match up to any external information.
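The records files can be streamed line by line rather than loaded into memory at once. The following is a minimal sketch, assuming a local copy of one of the records files. The helper name iter_records is hypothetical, and no field names are assumed; schema.json describes the keys each record actually has.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

For example, next(iter_records("records_hq.jsonl.gz")) would return the first record as a dictionary, which can then be compared against schema.json.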

Updates since last version:

country field is now countries and contains a list
locale field added
date field is now a dictionary instead of a plain integer for the year

This record also contains a SQLite database orcid.db that contains tables for researchers and for organizations. This is useful for quick lookup of data based on an ORCID local unique identifier.
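The exact table and column names in orcid.db are not listed in this description, so a reasonable first step is to inspect them. The following is a minimal sketch using Python's built-in sqlite3 module, assuming a local copy of the database; the helper name describe_tables is hypothetical.

```python
import sqlite3
from typing import Dict, List


def describe_tables(path: str) -> Dict[str, List[str]]:
    """Map each table name in a SQLite database to its column names."""
    connection = sqlite3.connect(path)
    try:
        cursor = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        tables = [row[0] for row in cursor.fetchall()]
        columns: Dict[str, List[str]] = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the column name
            info = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
            columns[table] = [row[1] for row in info]
        return columns
    finally:
        connection.close()
```

Calling describe_tables("orcid.db") would show which tables and columns are available before writing a lookup query by ORCID identifier.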

Nomenclature Authority Cross-References

Websites, social links, and other identifiers are parsed and standardized to comply with the Bioregistry then shared using the Simple Standard for Sharing Ontological Mappings (SSSOM) in the sssom.tsv.gz file. This allows for getting Scopus, Web of Science, GitHub, Google Scholar, and other profiles for records that include them. This information is also available through the main records file.
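As an illustration of pulling one kind of profile out of the mappings, the sketch below filters rows by CURIE prefix. It assumes the standard SSSOM columns subject_id and object_id, an optional metadata block of lines starting with "#", and that identifiers are written as prefix:identifier. Which side of a mapping holds the ORCID identifier is not stated here, so both sides are checked. The helper name iter_mappings and the example prefix are assumptions.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_mappings(path: str, prefix: str) -> Iterator[Dict[str, str]]:
    """Yield SSSOM rows whose subject or object CURIE uses the given prefix."""
    needle = prefix + ":"
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        # SSSOM files may start with a metadata block of lines beginning with "#"
        lines = (line for line in file if not line.startswith("#"))
        for row in csv.DictReader(lines, delimiter="\t"):
            subject = row.get("subject_id") or ""
            obj = row.get("object_id") or ""
            if subject.startswith(needle) or obj.startswith(needle):
                yield row
```

For example, iter_mappings("sssom.tsv.gz", "github") would yield the rows that mention a GitHub profile, assuming that prefix is the one used in the file.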

Authorship Links

Authorships are extracted and standardized in the pubmeds.tsv.gz file, which contains an ORCID column and PubMed column that has been pre-sanitized to only contain local unique identifiers. This information is also available through the main records file.
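The authorship file lends itself to a simple lookup from researcher to publications. The following is a minimal sketch, assuming the first column holds the ORCID identifier, the second holds the PubMed identifier, and the first row is a header; none of these layout details are confirmed here, so has_header can be switched off. The helper name load_authorships is hypothetical.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, Set


def load_authorships(path: str, has_header: bool = True) -> Dict[str, Set[str]]:
    """Group PubMed identifiers by ORCID identifier from a two-column gzipped TSV."""
    authorships: Dict[str, Set[str]] = defaultdict(set)
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        reader = csv.reader(file, delimiter="\t")
        if has_header:
            next(reader, None)  # assumption: the first row holds column names
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed rows
            orcid, pubmed = row[0], row[1]
            authorships[orcid].add(pubmed)
    return dict(authorships)
```

The resulting dictionary holds every pair in memory, which should be manageable for a file of this size but is worth keeping in mind on constrained machines.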

Lexical Indexes

This record includes two pre-built Gilda indexes for named entity recognition (NER) and named entity normalization (NEN). One contains all records, and the other is filtered to high-quality records. The following Python code snippet can be used for grounding:

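from gilda import Grounder

# Location of the pre-built index filtered to high-quality records
url = "https://zenodo.org/records/11474470/files/gilda_hq.tsv.gz?download=1"
# Build a grounder from the index at that URL
grounder = Grounder(url)
# Look up a name; the result is a list of scored matches
results = grounder.ground("Charles Tapley Hoyt")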
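The list returned by grounding can be condensed into identifiers and scores. The following is a minimal sketch, assuming each result exposes a score attribute and a term attribute with db, id, and entry_name fields, as Gilda's scored matches are understood to do. The helper name summarize_matches is hypothetical.

```python
from typing import Any, Iterable, List, Tuple


def summarize_matches(matches: Iterable[Any], top: int = 5) -> List[Tuple[str, str, float]]:
    """Return (CURIE, name, score) tuples for the best-scoring matches."""
    # Sort explicitly by score so the helper does not rely on input order
    ranked = sorted(matches, key=lambda match: match.score, reverse=True)
    return [
        (f"{match.term.db}:{match.term.id}", match.term.entry_name, match.score)
        for match in ranked[:top]
    ]
```

Passing the results list from a grounding call to summarize_matches would give a short list of candidate ORCID records for the name.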

Ontology Artifacts

The file orcid.ttl.gz is an OWL-ready RDF file that can be opened in Protégé or used with the Ontology Development Kit. It can also be converted into OWL XML, OWL Functional Notation, or other OWL formats using ROBOT. This artifact can serve as a replacement for the ones generated by https://github.com/cthoyt/orcidio, which was a smaller-scale way of turning ORCID records for contributors to OBO Foundry Ontologies into a small OWL file. Now, the export here contains all ORCID records with names.

Reproduction

This record is automatically generated with code in https://github.com/cthoyt/orcid_downloader.

Files

Files (6.0 GB)

Name	MD5	Size
gilda.tsv.gz	e2b525892676fe2b1dff296de68b64b9	1.9 GB
gilda_hq.tsv.gz	8bc4b77dd612125f68f202ae4cf5b2ee	522.0 MB
orcid.db	7952f2757e765347c29abf7eaf90c32d	1.7 GB
orcid.ttl.gz	a73f6f23b6e29fe577231eb354161dbb	299.4 MB
pubmeds.tsv.gz	5138a9d5a499f796d67a8bea96beeedd	19.2 MB
records.jsonl.gz	4b2cf187f70052011fde75eae8d1d9d0	910.6 MB
records_hq.jsonl.gz	bc71067b5ef87f7f9a18293dc8b75678	585.7 MB
schema.json	d40c017715931420d3c035476e207c3b	5.5 kB
sssom.tsv.gz	8acf744b2aeb38d717e713e761d03633	65.6 MB
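Given the size of these files, it can be worth checking a download against the MD5 checksums listed above. The following is a minimal sketch using Python's built-in hashlib module; the helper names md5_of and matches_checksum are hypothetical.

```python
import hashlib


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as file:
        # Read fixed-size chunks so multi-gigabyte files never sit fully in memory
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value, with or without an md5: prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of(path) == expected
```

For example, matches_checksum("pubmeds.tsv.gz", "md5:5138a9d5a499f796d67a8bea96beeedd") should return True for an intact download of this version of the file.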

Additional details

Is derived from: Dataset: 10.23640/07243.24204912.v1 (DOI)
Requires: Software: 10.5281/zenodo.11371784 (DOI)

