Published April 25, 2023 | Version v1
Journal article Open

Code and data for "Exploiting redundancy in large materials datasets for efficient machine learning with less data"

  • 1. Department of Materials Science and Engineering, University of Toronto, 27 King's College Cir, Toronto, ON, Canada.
  • 2. Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA.
  • 3. Canmet MATERIALS, Natural Resources Canada, 183 Longwood Road South, Hamilton, ON, Canada.

Description

Code and data for "Exploiting redundancy in large materials datasets for efficient machine learning with less data", published in Nat Commun 14, 7283 (2023).

The code is provided in the codes_2023Jul31.zip file.

Six datasets are considered: the 2018 and 2022 snapshots of JARVIS, the 2018 and 2021 snapshots of Materials Project, and the 2014 and 2021 snapshots of OQMD. For each database, the older snapshot is a subset of the newer snapshot.
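Because the older snapshot is a subset of the newer one, the newer snapshot can be split into materials already present in the older snapshot and materials added since. The following is a minimal sketch, assuming that both snapshots are loaded as pandas.DataFrame objects indexed by material_id and that the ids are consistent across snapshots (neither is guaranteed here). The function name split_by_snapshot and the toy tables are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_by_snapshot(older: pd.DataFrame, newer: pd.DataFrame):
    """Return (shared, added): rows of `newer` whose id is / is not in `older`.

    Hypothetical helper; assumes both tables are indexed by material id.
    """
    # Sanity check of the subset relation stated in the description.
    missing = older.index.difference(newer.index)
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} ids of the older snapshot are absent from the newer one"
        )
    in_older = newer.index.isin(older.index)
    return newer[in_older], newer[~in_older]


# Toy stand-ins for two snapshots (made-up ids and values, not real data).
older = pd.DataFrame({"e_form": [-1.0, 0.2]}, index=["id-1", "id-2"])
newer = pd.DataFrame(
    {"e_form": [-1.0, 0.2, -0.5]}, index=["id-1", "id-2", "id-3"]
)

shared, added = split_by_snapshot(older, newer)
print(list(added.index))
```

With the real data, older and newer would be the featurized tables of the two snapshots of one database.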

For each dataset, there are:

  • a pickle file (dataset_featurized_matminer.pkl) containing a pandas.DataFrame indexed by material_id. Its columns hold the retrieved properties (e_form, bandgap, bulk_modulus), meta-information (formula, chemical system, etc.), and 273 Matminer features (the last 273 columns). Note that the atomic structures are not stored in this file.
  • a pickle file (dataset_pmg_structure.pkl) containing a pandas.Series of the corresponding pymatgen structure objects.
  • a zip file (dataset_cif.zip) containing the CIF files converted from those pymatgen structure objects.
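Since the Matminer features occupy the last 273 columns of the featurized table, the features can be separated from the properties and meta-information by position. The following is a minimal sketch of that split. The function name split_featurized and the toy table are hypothetical, and the toy table uses 2 feature columns instead of 273 so that the example runs without the data files.

```python
import pandas as pd

N_FEATURES = 273  # number of trailing Matminer feature columns, per the description


def split_featurized(df: pd.DataFrame, n_features: int = N_FEATURES):
    """Split a featurized table into (properties + meta-information, features).

    Hypothetical helper; relies only on the features being the last columns.
    """
    if not 1 <= n_features <= df.shape[1]:
        raise ValueError("n_features must be between 1 and the number of columns")
    info = df.iloc[:, :-n_features]
    features = df.iloc[:, -n_features:]
    return info, features


# With the real data one would load the table first, for example:
#   df = pd.read_pickle("dataset_featurized_matminer.pkl")
# The exact file name (with its dataset prefix) is an assumption here.
# Only unpickle files from a source you trust.

# Toy stand-in with made-up values and 2 "feature" columns.
toy = pd.DataFrame(
    {
        "formula": ["NaCl", "Si"],
        "e_form": [-2.1, 0.0],
        "feat_a": [1.0, 2.0],
        "feat_b": [3.0, 4.0],
    },
    index=pd.Index(["id-1", "id-2"], name="material_id"),
)

info, features = split_featurized(toy, n_features=2)
print(list(info.columns), list(features.columns))
```

The structure pickle can be read with pandas.read_pickle in the same way, but unpickling it requires pymatgen to be installed, since the stored objects are pymatgen structures.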

Please note that in the paper, materials with a formation energy above 5 eV/atom were dropped for all properties and tasks.
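The formation-energy filter can be reproduced on a featurized table along the following lines. This is a sketch rather than the code used in the paper: it assumes that e_form is given in eV/atom, that a value of exactly 5 eV/atom is kept, and that rows with a missing e_form are dropped as well. The function name and the toy values are hypothetical.

```python
import pandas as pd

E_FORM_MAX = 5.0  # eV/atom, threshold stated in the description


def drop_high_formation_energy(
    df: pd.DataFrame, column: str = "e_form", threshold: float = E_FORM_MAX
) -> pd.DataFrame:
    """Keep rows whose formation energy is at most `threshold`.

    Assumption: rows with a missing value are dropped too, because
    NaN <= threshold evaluates to False.
    """
    keep = df[column] <= threshold
    return df.loc[keep]


# Toy stand-in with made-up values (not real data).
toy = pd.DataFrame(
    {
        "e_form": [-0.8, 5.0, 7.3, float("nan")],
        "bandgap": [1.1, 0.0, 2.4, 0.5],
    },
    index=["id-1", "id-2", "id-3", "id-4"],
)

filtered = drop_high_formation_energy(toy)
print(list(filtered.index))
```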

Files


Files (6.5 GB)

md5:98394c71c38d8900ef3f1955c1a26bc4  127.8 MB
md5:b1233a211275bddffafacde791f7902c  44.2 MB
md5:0d16060701e26ea18ef22bdd8bee69bb  221.6 MB
md5:7ca1b3d2285744a8e44afb16395ab854  173.6 MB
md5:448f3164530c6471a15a5d3de6898dcb  182.1 MB
md5:11a8570de88e6ed2066e4ffae36e13d8  119.1 MB
md5:4254ceb6707575f7811e7e52053c941c  384.4 MB
md5:56461c7e47ccc3869a8f6a60919b2cf1  713.5 MB
md5:04172bc5b4a2949cd909778d4bd7f5f6  640.1 MB
md5:3982b475aa5b4689e33a962fffd030d1  497.4 MB
md5:733d271a4f6f3b8275581656af2badb0  2.5 GB
md5:197c6ca367b6375b7799f4eff9061d16  911.3 MB
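The MD5 checksums listed with the files can be used to check that a download is complete. The following is a minimal sketch using only the Python standard library. The helper names md5_of_file and matches_listing are hypothetical, and the demonstration runs on a small temporary file rather than on one of the archives.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = matches_listing(demo_path, "md5:" + hashlib.md5(b"hello").hexdigest())
demo_bad = matches_listing(demo_path, "md5:" + "0" * 32)
os.remove(demo_path)
print(demo_ok, demo_bad)
```

For a downloaded archive, matches_listing would be called with the local path and the checksum listed for that file.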