Published July 31, 2023 | Version v1
Dataset Open

OpenAlex Author Name Disambiguation V3 Data - Disambiguation Model

Description

Five separate files used to create the OpenAlex (https://openalex.org) V3 Author Name Disambiguation Model:

  1. ORCID_hard_negative_pairs: Pairs of ORCIDs whose full name, family name, or given name matches, making the pair harder to disambiguate.
  2. Disambiguator_all_possible_training_data: The dataset containing all possible features for modeling and all possible samples of data. It was later split into train/val/test sets and processed further to better balance positive and negative samples.
  3. Disambiguator_final_train_data: The final data on which the disambiguator was trained.
  4. Disambiguator_final_val_data: Data used to evaluate the model during training in order to select features and tune hyperparameters.
  5. Disambiguator_final_test_data: The final dataset used to estimate model performance after all hyperparameters were tuned and features were chosen.
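
The hard-negative criterion in item 1 can be illustrated with a minimal sketch. This is not the code used to build the dataset: the `OrcidName` record, the `is_hard_negative` function, the case and whitespace normalization, and the placeholder ORCIDs are all assumptions made for illustration only.

```python
from typing import NamedTuple


class OrcidName(NamedTuple):
    """Hypothetical record: one ORCID with the name registered to it."""
    orcid: str
    given: str
    family: str


def _norm(text: str) -> str:
    # Assumed normalization: case-insensitive, whitespace collapsed.
    return " ".join(text.casefold().split())


def is_hard_negative(a: OrcidName, b: OrcidName) -> bool:
    """Return True when two different ORCIDs share a full, family, or given name."""
    if a.orcid == b.orcid:
        # Same ORCID means same author, so the pair is not a negative at all.
        return False
    given_match = _norm(a.given) == _norm(b.given)
    family_match = _norm(a.family) == _norm(b.family)
    full_match = _norm(f"{a.given} {a.family}") == _norm(f"{b.given} {b.family}")
    return full_match or family_match or given_match


# Placeholder ORCIDs, not real identifiers.
first = OrcidName("0000-0000-0000-0001", "Wei", "Zhang")
second = OrcidName("0000-0000-0000-0002", "Li", "Zhang")
third = OrcidName("0000-0000-0000-0003", "Ana", "Silva")

shares_family_name = is_hard_negative(first, second)
shares_nothing = is_hard_negative(first, third)
```

Here `first` and `second` share a family name, so the pair counts as a hard negative under this assumed definition, while `first` and `third` share no name component.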

More details can be found at https://github.com/ourresearch/openalex-name-disambiguation

Files

Files (2.6 GB)

MD5 checksum                            Size
md5:4f7eebe468a7502532fe242a7bc4b426    2.6 GB
md5:9a502992f30163a38dea40cac61629bc    611.6 kB
md5:344fd0d16a4b2e6a22e37ef0274e215f    6.9 MB
md5:896c9a25e2a83e884ffeae89c9b6a79b    867.7 kB
md5:ede6376dd066b505a367b8a57f12d142    7.2 MB
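
A downloaded file can be checked against its listed md5 value. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are hypothetical, and the local path passed in is whatever name the file was saved under, since file names are not shown in the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte file does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's digest with a listed value such as 'md5:4f7e...'."""
    # Drop an optional 'md5:' prefix and ignore case and surrounding spaces.
    expected_hex = expected.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("local_copy_of_file", "md5:4f7eebe468a7502532fe242a7bc4b426")` would return `True` only if the local file (a hypothetical name here) is byte-for-byte identical to the 2.6 GB file in the listing.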