Published December 8, 2025 | Version v01.1
Computational notebook Open

Materials Dataset: A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline

Description

 

HF-CCD is a curated, high-fidelity carbide materials dataset with machine-learning-ready descriptors, derived from Materials Project entries. This repository provides the dataset together with a fully reproducible end-to-end processing pipeline, from raw data fetching through structural cleaning and descriptor generation to statistical quality control. The pipeline includes:

  • Automated bulk data fetching
  • Structure and metadata cleaning
  • Descriptor generation
  • Quality control (QC) with statistical anomaly detection
  • Ready-to-use machine-learning feature tables

 

⚠️ Important:
This repository does not redistribute raw CIF files from Materials Project due to redistribution restrictions.
The dataset provided here is an independently cleaned, derived dataset, curated through our pipeline and suitable for machine-learning and data-science research.

 

 Pipeline Overview

The HF-CCD pipeline consists of four main stages:

1. Data Fetching

  • Query MP API
  • Download metadata (formation energy, band gap, density, space group…)
  • Store JSON metadata only (no CIF redistribution)
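As a rough illustration of the "metadata only" rule, the sketch below strips structure payloads from fetched records before writing them to JSON. The key names in STRUCTURE_KEYS and the helper names are assumptions made for this example; the actual materials_fetcher.py may organize its records differently.

```python
import json

# Assumed names of fields that carry structure or CIF payloads.
# These are illustrative; the real fetcher may use different keys.
STRUCTURE_KEYS = {"structure", "cif", "initial_structures"}


def to_metadata(record):
    """Return a copy of one fetched record without structure/CIF payloads."""
    return {key: value for key, value in record.items() if key not in STRUCTURE_KEYS}


def save_metadata(records, path):
    """Write metadata-only records to a JSON file; return the number written."""
    metadata = [to_metadata(record) for record in records]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    return len(metadata)
```

Filtering at write time keeps the redistribution constraint in one place, so later stages never see raw structure data in the stored JSON.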

2. Structure & Materials Cleaning

  • Remove incomplete entries
  • Check for missing physical quantities
  • Normalize chemical formula notation
  • Remove duplicated structures
  • Enforce physical boundary checks
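The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's clean_carbon.py: the required columns follow the seven-column table in this description, the physical bounds (non-negative band gap, positive density) are assumed, duplicates are detected by id only as a stand-in for real structure matching, and the formula normalizer handles only simple formulas without parentheses or fractional counts.

```python
import math
import re
from functools import reduce

REQUIRED = ("id", "formula", "band_gap", "formation_energy", "density")


def normalize_formula(formula):
    """Reduce a simple formula to its smallest integer ratio, e.g. 'Ti2C2' -> 'TiC'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    if not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    divisor = reduce(math.gcd, counts.values())
    return "".join(
        element + (str(n // divisor) if n // divisor > 1 else "")
        for element, n in counts.items()
    )


def is_complete(record):
    """True when every required field is present and not empty or NaN."""
    for key in REQUIRED:
        value = record.get(key)
        if value is None or value == "":
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
    return True


def is_physical(record):
    """Assumed boundary checks: band gap >= 0 eV and density > 0 g/cm^3."""
    return record["band_gap"] >= 0 and record["density"] > 0


def clean(records):
    """Drop incomplete, unphysical, and duplicate-id rows; normalize formulas."""
    seen, cleaned = set(), []
    for record in records:
        if not is_complete(record) or not is_physical(record):
            continue
        if record["id"] in seen:
            continue
        seen.add(record["id"])
        cleaned.append(dict(record, formula=normalize_formula(record["formula"])))
    return cleaned
```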

3. Descriptor Generation

Using advanced_descriptors.py, the pipeline computes:

  • Atomic-level descriptors
  • Bonding descriptors
  • Geometric descriptors
  • Coordination and packing metrics
  • Density-based descriptors

This produces a machine-learning-ready table.

4. Quality Control (QC)

Using plot_data_quality.py:

  • Outlier detection via IQR
  • Outlier detection via Isolation Forest
  • Distribution analyses (boxplots)
  • Global dataset quality summary dashboard

All QC plots are saved to PNG/.
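For reference, the IQR rule can be written in a few lines. The sketch below is a generic version and not the code in plot_data_quality.py: the fence multiplier k = 1.5 and the linear-interpolation quantile are common conventions assumed here. The Isolation Forest step is not shown.

```python
def quantile(sorted_values, q):
    """Quantile of pre-sorted data using linear interpolation between ranks."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if not values:
        return []
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [index for index, value in enumerate(values) if value < low or value > high]
```

Applied column by column to the feature table, the returned indices identify rows worth inspecting; whether flagged rows are dropped or merely reported is a separate decision.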

 

HF-CCD Dataset — Data Processing Pipeline Explanation

1. Source Data: cleaned_materials.csv (7 columns)

This file contains the fundamental material attributes downloaded from the Materials Project database.
It includes only high-level metadata, without structural descriptors.

Columns (7):

| Column | Description |
| --- | --- |
| id | Materials Project ID |
| family | Chemical family (e.g., carbide, nitride) |
| formula | Reduced chemical formula |
| cif_file | Structure file name (CIF) |
| band_gap | Electronic band gap (eV) |
| formation_energy | Formation energy per atom (eV/atom) |
| density | Density (g/cm³) |

This is the "raw" dataset, before any structural features are computed.

2. Structure-Based Feature Generation

The script advanced_descriptors.py reads the CIF files and computes local, structural, and bonding descriptors.

A. Local Environment Features (VoronoiNN)

Extracted using pymatgen.analysis.local_env.VoronoiNN().

| Feature | Meaning |
| --- | --- |
| avg_CN | Average coordination number over all sites |
| std_CN | Standard deviation of the coordination number |
| min_CN | Minimum coordination number |
| max_CN | Maximum coordination number |
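A minimal sketch of how these four values could be derived is shown below. It assumes a pymatgen Structure as input and uses the population standard deviation; both the helper names and that choice of standard deviation are assumptions, and advanced_descriptors.py may differ.

```python
import statistics


def site_coordination_numbers(structure):
    """Voronoi coordination number of every site in a pymatgen Structure."""
    # Imported lazily so cn_features() below can be used without pymatgen installed.
    from pymatgen.analysis.local_env import VoronoiNN

    nn = VoronoiNN()
    return [nn.get_cn(structure, index) for index in range(len(structure))]


def cn_features(coordination_numbers):
    """Summarize per-site coordination numbers into the four table columns."""
    return {
        "avg_CN": statistics.fmean(coordination_numbers),
        # Population standard deviation (assumed convention).
        "std_CN": statistics.pstdev(coordination_numbers),
        "min_CN": min(coordination_numbers),
        "max_CN": max(coordination_numbers),
    }
```

For a structure object loaded from a CIF, the two helpers chain as cn_features(site_coordination_numbers(structure)).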

B. Bond-Length Features

Using neighbor search with a 4.0 Å cutoff.

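| Feature | Meaning |
| --- | --- |
| min_bond | Shortest neighbor distance (Å) |
| mean_bond | Average neighbor distance (Å) |
| std_bond | Standard deviation of neighbor distances (Å) |
| max_bond | Longest neighbor distance within the cutoff (Å) |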
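To make the cutoff-based neighbor search concrete, here is a dependency-free sketch that enumerates periodic images and collects all interatomic distances up to 4.0 Å. The original script presumably relies on pymatgen's own neighbor search; the function names, the row-vector lattice convention, and the population standard deviation used here are assumptions for illustration.

```python
import itertools
import math
import statistics


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neighbor_distances(lattice, frac_coords, cutoff=4.0):
    """All site-to-neighbor distances <= cutoff under periodic boundary conditions.

    lattice: three lattice vectors (rows) in angstroms.
    frac_coords: fractional coordinates of the sites in the cell.
    Each pair is counted once from each side, which leaves min/mean/std/max unchanged.
    """
    a, b, c = lattice
    volume = abs(_dot(a, _cross(b, c)))
    # Spacing between lattice planes along each axis bounds how many images are needed.
    normals = (_cross(b, c), _cross(c, a), _cross(a, b))
    heights = [volume / math.sqrt(_dot(n, n)) for n in normals]
    reach = [int(cutoff // h) + 1 for h in heights]
    distances = []
    for fi in frac_coords:
        for fj in frac_coords:
            for image in itertools.product(*(range(-r, r + 1) for r in reach)):
                d_frac = [fj[k] + image[k] - fi[k] for k in range(3)]
                cart = [sum(d_frac[k] * lattice[k][x] for k in range(3)) for x in range(3)]
                dist = math.sqrt(_dot(cart, cart))
                if 1e-8 < dist <= cutoff:  # skip the site itself
                    distances.append(dist)
    return distances


def bond_features(distances):
    """Summarize neighbor distances into the four table columns."""
    return {
        "min_bond": min(distances),
        "mean_bond": statistics.fmean(distances),
        "std_bond": statistics.pstdev(distances),
        "max_bond": max(distances),
    }
```

Because every distance up to the cutoff is included, max_bond is bounded by 4.0 Å and mean_bond mixes first and further neighbor shells; a different cutoff would change all four values.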

🧩 These features describe atomic packing and bonding rigidity.

C. Structure Geometry Features

Derived from CIF lattice & space group.

| Feature | Description |
| --- | --- |
| volume_per_atom | Volume normalized by number of atoms |
| n_atoms | Number of atoms in the primitive cell |
| n_elements | Number of unique element types |
| lattice_a, lattice_b, lattice_c | Lattice constants |
| lattice_anisotropy | Std / mean of (a, b, c) |
| spacegroup | International space group number |
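The geometry columns follow directly from the lattice vectors and the list of species. The sketch below shows one way to compute them; the function name and the use of the population standard deviation for lattice_anisotropy are assumptions. The spacegroup column is omitted because it requires a symmetry analysis of the full structure.

```python
import math
import statistics


def geometry_features(lattice, species):
    """Geometry descriptors from three lattice vectors (rows, in angstroms) and site species."""
    a, b, c = (math.sqrt(sum(x * x for x in vector)) for vector in lattice)
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = lattice
    # Cell volume is the absolute scalar triple product a . (b x c).
    volume = abs(ax * (by * cz - bz * cy) - ay * (bx * cz - bz * cx) + az * (bx * cy - by * cx))
    lengths = (a, b, c)
    return {
        "volume_per_atom": volume / len(species),
        "n_atoms": len(species),
        "n_elements": len(set(species)),
        "lattice_a": a,
        "lattice_b": b,
        "lattice_c": c,
        # Std / mean of (a, b, c); population std is an assumed convention.
        "lattice_anisotropy": statistics.pstdev(lengths) / statistics.fmean(lengths),
    }
```

A cubic cell gives lattice_anisotropy = 0, and the value grows as the three lattice constants diverge from one another.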

3. Final Output: advanced_features.csv (20+ columns → 58 features after expansion)

Example (generating a correlation heatmap from this file):

python plot_correlation_heatmap.py --input ..\data\advanced_features.csv --output ..\output\figures\correlation_heatmap.png --style all

  This is the file used for:

  • Correlation heatmap
  • Clustering analysis
  • Feature grouping
  • ML model training
  • QC statistics
  • Zenodo dataset

  Why does it have so many features?

Because each category expands raw structural information into vectorized descriptors, capturing:

  • Atomic coordination environments
  • Bond-length distributions
  • Lattice geometry
  • Symmetry
  • Stoichiometric richness

These features give ML models structural information that the seven metadata columns alone do not contain, and they are intended to improve the prediction of material properties.

   Usage

  1. Fetch Materials Project data:
     python scripts/materials_fetcher.py --output data/materials_metadata.json

  2. Clean the dataset:
     python scripts/clean_carbon.py --input data/materials_metadata.json --output data/hfccd_clean.csv

  3. Generate descriptors:
     python scripts/advanced_descriptors.py --input data/hfccd_clean.csv --output data/hfccd_features.csv

  4. Run QC visualization:
     python scripts/plot_data_quality.py --input data/hfccd_features.csv --output PNG/hfccd_qc.png --style all
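If it is convenient to run the four steps in one go, a small driver along the following lines could chain them. This is only a sketch: the script paths and flags are copied from the commands above, while build_commands and run_pipeline are hypothetical helper names that are not part of the repository.

```python
import subprocess
import sys


def build_commands(python=sys.executable):
    """Return the four pipeline commands in execution order."""
    return [
        [python, "scripts/materials_fetcher.py",
         "--output", "data/materials_metadata.json"],
        [python, "scripts/clean_carbon.py",
         "--input", "data/materials_metadata.json",
         "--output", "data/hfccd_clean.csv"],
        [python, "scripts/advanced_descriptors.py",
         "--input", "data/hfccd_clean.csv",
         "--output", "data/hfccd_features.csv"],
        [python, "scripts/plot_data_quality.py",
         "--input", "data/hfccd_features.csv",
         "--output", "PNG/hfccd_qc.png",
         "--style", "all"],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Usage, from the repository root: run_pipeline(build_commands())
```

Using check=True means a failed fetch or cleaning step raises immediately instead of letting later stages run on stale or missing files.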

Citation

If you use HF-CCD in academic work, please cite:

Wu, J.-H. (2025). A Curated High-Fidelity Carbide Materials Dataset (HF-CCD) and Pipeline.

https://orcid.org/0009-0001-3396-6835

https://doi.org/10.5281/zenodo.17851432

 Legal Notice

This repository does not include, redistribute, or republish raw CIF files or any protected content from Materials Project.

Only derived numerical datasets and descriptors are released.

Users must supply their own MP API key to fetch raw structures for personal research use.

 License

MIT License — free for academic and commercial use.

 

 

Files

jackman993/A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline-v01.1.zip
