Published February 28, 2021 | Version 1.0 (ORCID+MAG)
Dataset Open

LAGOS-AND: A Large Gold Standard Dataset for MAG/OpenAlex Author Name Disambiguation

Creators

  • 1. Wuhan University

Description

We present LAGOS-AND, a large gold standard dataset for author name disambiguation (AND) research. It contains two sub-datasets, LAGOS-AND-BLOCK and LAGOS-AND-PAIRWISE. The datasets were built automatically from two authoritative sources, ORCID and DOI, drawing on the ORCID open database and an open literature database (MAG or OpenAlex).

The currently available versions of the LAGOS-AND datasets are:

  1. Version 1.0 (MAG+ORCID): This is the initial version of the LAGOS-AND dataset. The evaluation results and quality control measures for this version can be found in the paper at https://arxiv.org/abs/2104.01821.
  2. Version 2.0 (OpenAlex+ORCID): This version is built on OpenAlex instead of MAG because MAG was discontinued on 31 December 2021. OpenAlex positions itself as a drop-in replacement for MAG and also keeps evolving by aggregating academic resources from other repositories. In addition, the pairwise sub-dataset (LAGOS-AND-PAIRWISE v2.0) has more accurate authorship labels (class labels) than LAGOS-AND-PAIRWISE v1.0.

Note that there are other versions of the dataset, which we call "pre-release". We recommend that users use the normal versions of the dataset. The pre-release versions were originally intended to be released as the normal versions. However, while preparing the research paper, we found an issue with the dataset, and the reviewers also made some reasonable requests regarding it, which led us to update the dataset. Unfortunately, the Zenodo platform does not allow updates to an existing version of a dataset, so we had to create new versions. The pre-release datasets are as follows:

  1. Version 2.0-alpha: For a few samples in LAGOS-AND-PAIRWISE, the class labels are incorrect. The normal Version 2.0 dataset improves the accuracy of the class labels by using a better random sampling approach.
  2. Version 1.0-beta: The LAGOS-AND-PAIRWISE sub-dataset of this version contains only ~500K author pairs, whereas it should contain ~1M author pairs, as described in our paper. We fixed the problem in the Version 1.0 dataset.
  3. Version 1.0-alpha: The earliest dataset uploaded to Zenodo, corresponding to the original dataset before addressing the reviewers' comments and suggestions. In contrast, the dataset in Version 1.0 is the dataset after the reviewers' comments and suggestions have been addressed.

Files

Files (4.7 GB)

Size        Checksum
343.1 MB    md5:f12de97a3cde63a85bd5779449944619
3.4 GB      md5:1a50cdad306bb5c43cc403135917b8f0
962.5 MB    md5:263531d8d0d2fab0f428818443ffeaf1