OpenAlex Author Name Disambiguation V3 Data - Disambiguation Model

Published July 31, 2023 | Version v1

Dataset Open

Five separate files used to create the OpenAlex (https://openalex.org) V3 author name disambiguation model:

ORCID_hard_negative_pairs: Pairs of ORCIDs whose full name, family name, or given name match, which makes them harder to disambiguate.
Disambiguator_all_possible_training_data: The full dataset, containing all possible features for modeling and all possible samples. It was later split into train/val/test sets and processed further to better balance positive and negative samples for our purposes.
Disambiguator_final_train_data: Final data on which the disambiguator was trained.
Disambiguator_final_val_data: Data used to evaluate the model during training in order to choose features and tune hyperparameters.
Disambiguator_final_test_data: Final dataset used to estimate model performance after all hyperparameters were tuned and features were chosen.

More details can be found at https://github.com/ourresearch/openalex-name-disambiguation
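
As a starting point for working with the train/val/test files described above, the following is a minimal sketch of loading them in Python. It assumes the files have been downloaded into one local directory and that pandas plus a Parquet engine such as pyarrow is installed. The helper names (split_paths, load_splits) and the "data" directory are hypothetical, and no column names are assumed because this record does not list them.

```python
from pathlib import Path

# File names exactly as listed in this record.
SPLIT_FILES = {
    "train": "Disambiguator_final_train_data.parquet",
    "val": "Disambiguator_final_val_data.parquet",
    "test": "Disambiguator_final_test_data.parquet",
}


def split_paths(data_dir):
    """Map each split name to its expected path under data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return {name: root / fname for name, fname in SPLIT_FILES.items()}


def load_splits(data_dir):
    """Read each split into a pandas DataFrame (hypothetical helper).

    Assumes pandas and a Parquet engine (for example pyarrow) are installed.
    """
    import pandas as pd  # imported lazily so split_paths works without pandas

    frames = {}
    for name, path in split_paths(data_dir).items():
        if not path.is_file():
            raise FileNotFoundError(f"missing {name} split: {path}")
        frames[name] = pd.read_parquet(path)
    return frames


# Example usage, assuming the files sit in ./data:
# for name, df in load_splits("data").items():
#     print(name, df.shape)   # row and column counts per split
#     print(df.dtypes)        # inspect the schema rather than assuming it
```

Inspecting the shape and dtypes first is a cautious way to discover the schema; the linked repository is the authoritative description of the features.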

Files

Name	MD5	Size
Disambiguator_all_possible_training_data.parquet	4f7eebe468a7502532fe242a7bc4b426	2.6 GB
Disambiguator_final_test_data.parquet	9a502992f30163a38dea40cac61629bc	611.6 kB
Disambiguator_final_train_data.parquet	344fd0d16a4b2e6a22e37ef0274e215f	6.9 MB
Disambiguator_final_val_data.parquet	896c9a25e2a83e884ffeae89c9b6a79b	867.7 kB
ORCID_hard_negative_pairs.parquet	ede6376dd066b505a367b8a57f12d142	7.2 MB
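
The MD5 checksums listed above can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The function names (md5_of_file, verify_downloads) and the assumption that all files sit in a single local directory are illustrative and not part of this record.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "Disambiguator_all_possible_training_data.parquet": "4f7eebe468a7502532fe242a7bc4b426",
    "Disambiguator_final_test_data.parquet": "9a502992f30163a38dea40cac61629bc",
    "Disambiguator_final_train_data.parquet": "344fd0d16a4b2e6a22e37ef0274e215f",
    "Disambiguator_final_val_data.parquet": "896c9a25e2a83e884ffeae89c9b6a79b",
    "ORCID_hard_negative_pairs.parquet": "ede6376dd066b505a367b8a57f12d142",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(data_dir, expected=EXPECTED_MD5):
    """Map each file name to True if it exists and its MD5 matches, else False."""
    root = Path(data_dir)
    results = {}
    for fname, want in expected.items():
        path = root / fname
        results[fname] = path.is_file() and md5_of_file(path) == want
    return results


# Example usage, assuming the files were saved to ./data:
# for fname, ok in verify_downloads("data").items():
#     print("OK      " if ok else "MISMATCH", fname)
```

Reading in chunks matters mainly for the 2.6 GB file, which would otherwise be loaded into memory in one piece.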