Published July 31, 2023
| Version v1
Dataset
Open
OpenAlex Author Name Disambiguation V3 Data - Disambiguation Model
Authors/Creators
- 1. OurResearch
Description
Five separate files used to create the OpenAlex (https://openalex.org) V3 author name disambiguation model:
- ORCID_hard_negative_pairs: Pairs of distinct ORCIDs whose full name, family name, or given name matches, which makes them harder to disambiguate.
- Disambiguator_all_possible_training_data: Dataset containing all candidate modeling features and all available samples. It was later split into train/val/test sets and processed further to better balance positive and negative samples.
- Disambiguator_final_train_data: Final data on which the disambiguator was trained.
- Disambiguator_final_val_data: Validation data used to evaluate the model during training and to guide the choice of features and hyperparameters.
- Disambiguator_final_test_data: Held-out test data used to estimate final model performance after all hyperparameters were tuned and features were chosen.
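
As a rough illustration of two ideas in the file descriptions above (hard negative pairs and rebalancing positive and negative samples), here is a minimal Python sketch. It is not the OpenAlex pipeline: the record layout `(orcid, given_name, family_name)`, the sample layout `(features, label)`, and the function names `hard_negative_pairs` and `downsample_negatives` are all assumptions made for this example, and the actual files may use different columns and a different balancing method.

```python
from itertools import combinations
import random


def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


def hard_negative_pairs(records):
    """Return pairs of distinct ORCIDs that share a full, family, or given name.

    `records` is a list of (orcid, given_name, family_name) tuples.
    This schema is hypothetical, chosen only for illustration.
    """
    pairs = []
    # Brute-force comparison of every pair; fine for a sketch, too slow at scale.
    for (id_a, given_a, family_a), (id_b, given_b, family_b) in combinations(records, 2):
        if id_a == id_b:
            continue  # same ORCID means same person, so not a negative pair
        same_given = normalize(given_a) == normalize(given_b)
        same_family = normalize(family_a) == normalize(family_b)
        if same_given and same_family:
            pairs.append((id_a, id_b, "full_name"))
        elif same_family:
            pairs.append((id_a, id_b, "family_name"))
        elif same_given:
            pairs.append((id_a, id_b, "given_name"))
    return pairs


def downsample_negatives(samples, negatives_per_positive=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    `samples` is a list of (features, label) tuples where label 1 means
    "same author" and 0 means "different authors" (assumed encoding).
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    keep = min(len(negatives), int(len(positives) * negatives_per_positive))
    rng = random.Random(seed)  # seeded for reproducible sampling
    return positives + rng.sample(negatives, keep)
```

For example, calling `hard_negative_pairs` on records for two different ORCIDs that both belong to a "Wei Zhang" would yield one pair tagged `"full_name"`, while `downsample_negatives(samples, negatives_per_positive=1.0)` would return a set with as many negatives as positives when enough negatives are available.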
More details can be found at https://github.com/ourresearch/openalex-name-disambiguation