Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics
Description
This benchmark comprises 50 datasets of materials properties drawn from 16 previous publications. It includes both experimental and computational data, covers regression as well as classification tasks, ranges in size from 12 to 6354 samples, and spans materials systems from across materials research. The data were cleaned where necessary, and each dataset was divided into train, validation, and test splits.
For datasets with more than 100 samples, train-validation-test splits were created using either 5-fold or 10-fold cross-validation, following the protocol of the publication each dataset came from. For datasets with fewer than 100 samples, train-test splits were created using leave-one-out cross-validation.
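The splitting scheme described above can be illustrated with a minimal sketch in plain Python. It is not the code used to build the released splits: the function names, the `val_fraction` parameter, the random seed, and the way a validation subset is carved out of each training fold are all assumptions made for illustration. The description also does not say how a dataset of exactly 100 samples is handled, so the sketch arbitrarily treats it as a k-fold case.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) index lists for shuffled k-fold CV."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


def leave_one_out_indices(n_samples):
    """Yield (train_idx, test_idx) with exactly one held-out sample each."""
    for held_out in range(n_samples):
        train_idx = [i for i in range(n_samples) if i != held_out]
        yield train_idx, [held_out]


def make_splits(n_samples, n_splits=5, val_fraction=0.2, seed=0):
    """Return a list of split dictionaries holding sample indices.

    Assumption: datasets with fewer than 100 samples get leave-one-out
    train/test splits; all others get k-fold train/val/test splits, with
    a hypothetical val_fraction of each training fold set aside.
    """
    splits = []
    if n_samples < 100:
        for train_idx, test_idx in leave_one_out_indices(n_samples):
            splits.append({"train": train_idx, "test": test_idx})
    else:
        for train_idx, test_idx in k_fold_indices(n_samples, n_splits, seed):
            n_val = int(len(train_idx) * val_fraction)
            splits.append({
                "train": train_idx[n_val:],
                "val": train_idx[:n_val],
                "test": test_idx,
            })
    return splits
```

Calling `make_splits(12)` produces twelve leave-one-out train/test pairs, while `make_splits(6354, n_splits=10)` produces ten train/val/test folds. The actual split files and their layout are documented in the GitHub repository.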
For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
Files
Name | Size | Checksum |
---|---|---|
anhender/mse_ML_datasets-v1.0.zip | 25.3 MB | md5:be19c6a8cce6a2112f40e01a7dfab990 |
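The MD5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local path of the downloaded file is whatever the reader chose when saving it.

```python
import hashlib

# Checksum published in the file listing of this record.
EXPECTED_MD5 = "be19c6a8cce6a2112f40e01a7dfab990"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("mse_ML_datasets-v1.0.zip")` would return `True` for an intact copy, assuming the archive was saved under that name in the working directory.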
Additional details
Related works
- Is supplement to
- https://github.com/anhender/mse_ML_datasets/tree/v1.0 (URL)