Materials Dataset: A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline
Description
HF-CCD is a curated, high-fidelity carbide materials dataset derived from Materials Project entries, with machine-learning-ready descriptors. This repository provides the dataset together with a fully reproducible end-to-end processing pipeline, from raw data fetching through structural cleaning and descriptor generation to statistical quality control. The pipeline includes:
- Automated bulk data fetching
- Structure and metadata cleaning
- Descriptor generation
- Quality control (QC) with statistical anomaly detection
- Ready-to-use machine-learning feature tables
⚠️ Important:
This repository does not redistribute raw CIF files from Materials Project due to redistribution restrictions.
The dataset provided here is an independent, cleaned, derived dataset, curated through our pipeline and suitable for ML and data-science research.
Pipeline Overview
The HF-CCD pipeline consists of four main stages:
1. Data Fetching
- Query the Materials Project (MP) API
- Download metadata (formation energy, band gap, density, space group…)
- Store JSON metadata only (no CIF redistribution)
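The "metadata only" rule above can be sketched as a whitelist filter applied before anything is written to disk. This is a minimal illustration, not the repository's materials_fetcher.py: the key names in `METADATA_KEYS` and the helper names are hypothetical, and the actual API query is left out.

```python
import json

# Hypothetical whitelist: the real fetcher's field names may differ.
METADATA_KEYS = (
    "material_id",
    "formula",
    "formation_energy_per_atom",
    "band_gap",
    "density",
    "spacegroup_number",
)


def to_metadata_record(entry: dict) -> dict:
    """Keep only whitelisted metadata keys, dropping any structure or CIF payload."""
    return {key: entry.get(key) for key in METADATA_KEYS}


def dump_metadata(entries: list) -> str:
    """Serialize metadata-only records as a JSON string."""
    return json.dumps([to_metadata_record(e) for e in entries], indent=2)


if __name__ == "__main__":
    # Made-up entry for illustration only; the values are not real MP data.
    fetched = [{
        "material_id": "mp-0000",
        "formula": "SiC",
        "formation_energy_per_atom": -0.1,
        "band_gap": 2.0,
        "density": 3.2,
        "spacegroup_number": 216,
        "cif": "data_SiC ...",  # raw structure text that must not be stored
    }]
    print(dump_metadata(fetched))
```

Filtering by an explicit whitelist, rather than deleting known structure keys, means any unexpected field returned by the API is dropped by default.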
2. Structure & Materials Cleaning
- Remove incomplete entries
- Check for missing physical quantities
- Normalize chemical formula notation
- Remove duplicated structures
- Enforce physical boundary checks
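The cleaning steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's cleaning script: the required columns, the formula normalization rule (strip whitespace and explicit "1" subscripts), the bounds (non-negative band gap, positive density), and duplicate detection by repeated ID are all assumptions. Real duplicate removal would compare the structures themselves.

```python
import re

# Assumed set of columns that must be present for an entry to be kept.
REQUIRED = ("id", "formula", "band_gap", "formation_energy", "density")


def normalize_formula(formula: str) -> str:
    """Strip whitespace and drop explicit '1' subscripts, e.g. 'Si1 C1' -> 'SiC'."""
    compact = re.sub(r"\s+", "", formula)
    # Remove a '1' only when it directly follows a letter and is not part
    # of a longer number such as '12' or '1.5'.
    return re.sub(r"(?<=[A-Za-z])1(?![0-9.])", "", compact)


def within_bounds(row: dict) -> bool:
    """Assumed physical checks: band gap >= 0 eV and density > 0 g/cm^3."""
    return row["band_gap"] >= 0 and row["density"] > 0


def clean(rows: list) -> list:
    """Apply the checks in order and return the surviving rows."""
    seen_ids = set()
    kept = []
    for row in rows:
        if any(row.get(key) is None for key in REQUIRED):
            continue  # incomplete entry or missing physical quantity
        if not within_bounds(row):
            continue  # fails the physical boundary check
        if row["id"] in seen_ids:
            continue  # duplicate (simplified to a repeated ID)
        seen_ids.add(row["id"])
        kept.append({**row, "formula": normalize_formula(row["formula"])})
    return kept
```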
3. Descriptor Generation
Using advanced_descriptors.py, the pipeline computes:
- Atomic-level descriptors
- Bonding descriptors
- Geometric descriptors
- Coordination and packing metrics
- Density-based descriptors
This produces a machine-learning-ready table.
4. Quality Control (QC)
Using plot_data_quality.py:
- Outlier detection via IQR
- Outlier detection via Isolation Forest
- Distribution analyses (boxplots)
- Global dataset quality summary dashboard
All QC plots are saved to PNG/.
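As an illustration of the IQR step, the sketch below flags values outside the usual Tukey fences using only the standard library. The multiplier k = 1.5 and the "inclusive" quartile method are assumptions; plot_data_quality.py may use different settings. The Isolation Forest step is not shown here because it would need an external library such as scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    # Made-up feature column with one obviously extreme value at index 5.
    column = [1.0, 1.1, 0.9, 1.2, 1.0, 9.0]
    print(iqr_outliers(column))
```

Returning indices rather than values makes it straightforward to map flagged entries back to rows of the feature table.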
HF-CCD Dataset — Data Processing Pipeline Explanation
1. Source Data: cleaned_materials.csv (7 columns)
This file contains the fundamental material attributes downloaded from the Materials Project database.
It includes only high-level metadata, without structural descriptors.
Columns (7):
| Column | Description |
|---|---|
| id | Materials Project ID |
| family | Chemical family (e.g., carbide, nitride) |
| formula | Reduced chemical formula |
| cif_file | Structure file name (CIF) |
| band_gap | Electronic band gap (eV) |
| formation_energy | Formation energy per atom (eV) |
| density | Density (g/cm³) |
This is the "raw" dataset, before any structural features are computed.
2. Structure-Based Feature Generation
The script advanced_descriptors.py reads the CIF files and computes local, structural, and bonding descriptors.
A. Local Environment Features (VoronoiNN)
Extracted using pymatgen.analysis.local_env.VoronoiNN().
| Feature | Meaning |
|---|---|
| avg_CN | Average coordination number |
| std_CN | Standard deviation of coordination number across sites |
| min_CN | Minimum coordination |
| max_CN | Maximum coordination |
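
The four coordination features in the table above reduce a list of per-site coordination numbers to summary statistics. The sketch below assumes a population standard deviation and assumes the per-site values come from `VoronoiNN().get_cn`; whether advanced_descriptors.py uses that call, or weighted coordination numbers, is not stated here. The pymatgen-dependent helper is included for context only and needs pymatgen installed.

```python
import statistics


def site_coordination_numbers(cif_path):
    """Per-site Voronoi coordination numbers for one CIF file (needs pymatgen)."""
    # Imported lazily so the summary function below works without pymatgen.
    from pymatgen.analysis.local_env import VoronoiNN
    from pymatgen.core import Structure

    structure = Structure.from_file(cif_path)
    finder = VoronoiNN()
    return [finder.get_cn(structure, i) for i in range(len(structure))]


def coordination_features(cn_per_site):
    """Summarize per-site coordination numbers into the four table columns."""
    return {
        "avg_CN": statistics.fmean(cn_per_site),
        "std_CN": statistics.pstdev(cn_per_site),  # population std (assumption)
        "min_CN": min(cn_per_site),
        "max_CN": max(cn_per_site),
    }
```

For example, `coordination_features([4, 4, 6, 6])` gives an average of 5.0 and a standard deviation of 1.0.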
B. Bond-Length Features
Using neighbor search with a 4.0 Å cutoff.
| Feature | Meaning |
|---|---|
| min_bond | Shortest neighbor distance (Å) |
| mean_bond | Mean neighbor distance (Å) |
| std_bond | Standard deviation of neighbor distances (Å) |
| max_bond | Longest neighbor distance within the cutoff (Å) |
🧩 These features describe atomic packing and bonding rigidity.
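A cutoff-based neighbor search in a periodic cell can be sketched without external libraries as below. It is a brute-force illustration, not the code in advanced_descriptors.py: the function names are hypothetical, fractional coordinates are assumed to lie in [0, 1), each pair is counted once from each side, and the population standard deviation is an assumption.

```python
import itertools
import math
import statistics


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def neighbor_distances(lattice, frac_coords, cutoff=4.0):
    """Distances (Å) from every site to all neighbors within the cutoff,
    including periodic images. `lattice` holds the three row vectors a, b, c."""
    a, b, c = lattice
    volume = abs(_dot(a, _cross(b, c)))
    # Image cells needed along each axis: cutoff divided by the cell height
    # perpendicular to the plane of the other two lattice vectors.
    reps = []
    for u, v in ((b, c), (c, a), (a, b)):
        normal = _cross(u, v)
        height = volume / math.sqrt(_dot(normal, normal))
        reps.append(math.ceil(cutoff / height))

    def to_cart(f):
        return [f[0] * a[k] + f[1] * b[k] + f[2] * c[k] for k in range(3)]

    cart = [to_cart(f) for f in frac_coords]
    shifts = [to_cart(n) for n in
              itertools.product(*(range(-r, r + 1) for r in reps))]
    distances = []
    for ri in cart:
        for rj in cart:
            for shift in shifts:
                d = math.dist(ri, [rj[k] + shift[k] for k in range(3)])
                if 1e-8 < d <= cutoff:  # skip the site itself
                    distances.append(d)
    return distances


def bond_features(distances):
    """Summarize neighbor distances into the four table columns."""
    if not distances:
        return dict.fromkeys(("min_bond", "mean_bond", "std_bond", "max_bond"))
    return {
        "min_bond": min(distances),
        "mean_bond": statistics.fmean(distances),
        "std_bond": statistics.pstdev(distances),  # population std (assumption)
        "max_bond": max(distances),
    }
```

For a simple cubic cell with a = 3.0 Å and one atom, the search finds the six nearest neighbors at 3.0 Å; the next shell at about 4.24 Å falls outside the 4.0 Å cutoff.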
C. Structure Geometry Features
Derived from CIF lattice & space group.
| Feature | Description |
|---|---|
| volume_per_atom | Cell volume divided by number of atoms (Å³/atom) |
| n_atoms | Number of atoms in the primitive cell |
| n_elements | Number of unique element types |
| lattice_a, lattice_b, lattice_c | Lattice constants a, b, c (Å) |
| lattice_anisotropy | Standard deviation of (a, b, c) divided by their mean |
| spacegroup | International space group number |
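
Two of the geometry features above are simple ratios and can be worked through directly. The sketch assumes a population standard deviation for lattice_anisotropy (the table only says "Std / mean"), and the function name is illustrative.

```python
import statistics


def geometry_features(a, b, c, volume, n_atoms):
    """Compute volume_per_atom and lattice_anisotropy from cell parameters.

    a, b, c are lattice constants in Å, volume is the cell volume in Å^3.
    """
    lengths = (a, b, c)
    return {
        "volume_per_atom": volume / n_atoms,
        # Population std over the three lengths (assumption), divided by their mean.
        "lattice_anisotropy": statistics.pstdev(lengths) / statistics.fmean(lengths),
    }
```

A cubic cell (a = b = c) gives an anisotropy of exactly 0. For lengths 3, 4, and 5 Å the mean is 4 and the population standard deviation is sqrt(2/3) ≈ 0.816, so the anisotropy is about 0.204.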
3. Final Output: advanced_features.csv (20+ columns → 58 features after expansion)
This is the file used for:
- Correlation heatmap
- Clustering analysis
- Feature grouping
- ML model training
- QC statistics
- Zenodo dataset
Example (correlation heatmap, Windows-style paths):
python plot_correlation_heatmap.py --input ..\data\advanced_features.csv --output ..\output\figures\correlation_heatmap.png --style all
Why does it have so many features?
Because each category expands raw structural information into vectorized descriptors, capturing:
- Atomic coordination environments
- Bond-length distributions
- Lattice geometry
- Symmetry
- Stoichiometric richness
These features are intended to give ML models structural information that the seven metadata columns alone do not carry when predicting material properties.
Usage
- Fetch Materials Project data:
  python scripts/materials_fetcher.py --output data/materials_metadata.json
- Clean dataset:
  python scripts/clean_carbon.py --input data/materials_metadata.json --output data/hfccd_clean.csv
- Generate descriptors:
  python scripts/advanced_descriptors.py --input data/hfccd_clean.csv --output data/hfccd_features.csv
- Run QC visualization:
  python scripts/plot_data_quality.py --input data/hfccd_features.csv --output PNG/hfccd_qc.png --style all
Citation
If you use HF-CCD in academic work, please cite:
Wu, J.-H. (2025). A Curated High-Fidelity Carbide Materials Dataset (HF-CCD) and Pipeline.
https://orcid.org/0009-0001-3396-6835
https://doi.org/10.5281/zenodo.17851432
Legal Notice
This repository does not include, redistribute, or republish raw CIF files or any protected content from Materials Project.
Only derived numerical datasets and descriptors are released.
Users must supply their own MP API key to fetch raw structures for personal research use.
License
MIT License — free for academic and commercial use.
Files
jackman993/A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline-v01.1.zip (6.1 MB)
md5:c84764a0645a69e96db4310ae329aba0
Additional details
Related works
- Is supplement to (software): https://github.com/jackman993/A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline/tree/v01.1