Published February 13, 2026 | Version v1
Dataset Open

A Dataset of Inorganic Crystal Structures with Hybrid-DFT Derived Band Gaps: Integration of Crystallography Open Database and HSE-Band Gaps Database

  • 1. Teaching Assistant, Faculty of Computer Science, Ain Shams University
  • 2. School of Artificial Intelligence, Egyptian Russian University, Cairo, Egypt
  • 3. Solid-State Electronics Laboratory, Solid-State Physics Department, Physics Research Institute, National Research Centre, 33 El-Bohouth St., Dokki, Giza, 12622, Egypt
  • 4. Science and Engineering Department, Faculty of Postgraduate Studies for Advanced Science, Renewable Energy, Beni-Suef University, Beni-Suef 62511, Egypt
  • 5. X-ray Crystallography Laboratory, Solid-State Physics Department, Physics Research Institute, National Research Centre, 33 El-Bohouth St., Dokki, Giza, 12622, Egypt

Description

This dataset comprises 4,542 inorganic crystalline materials, provided as a unified repository of Crystallographic Information Files (CIF) paired with high-accuracy electronic band gap values. The structural data were curated from the Crystallography Open Database (COD), a comprehensive open-access collection of crystal structures [1, 2].

To ensure predictive reliability, the corresponding electronic band gaps were sourced from the validated HSE database developed by Kim et al. (2020) [3]. In this underlying work, electronic structures were characterized using hybrid density functional theory (DFT) with the Heyd–Scuseria–Ernzerhof (HSE06) screened hybrid functional. This approach significantly mitigates the well-known "band-gap problem" inherent in standard semilocal exchange-correlation approximations, such as the Local Density Approximation (LDA) or Generalized Gradient Approximation (GGA), thereby providing a more physically accurate representation of the semiconducting properties within the dataset.

 

Methods

Data Acquisition and Workflow

The dataset was constructed through a multi-stage integration of the HSE band-gap database and the COD. Chemical formulas were systematically extracted from the HSE database and used as primary keys for programmatic queries against the COD, facilitated by the aiida-cod database importer.

To ensure high data fidelity, a strict string-matching protocol was implemented: CIF entries were retrieved only when the query formula corresponded exactly to the _chemical_formula_sum field within the COD metadata. Following verification, the corresponding crystallographic files were archived locally using a standardized naming convention based on stoichiometric identifiers.
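The exact-match rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the record layout with a `formula_sum` key, the placeholder COD identifiers, and the choice to normalize only whitespace and optional surrounding dashes are all hypothetical.

```python
def normalize_formula(formula: str) -> str:
    """Collapse whitespace and drop optional '-' delimiters (assumed layout)."""
    tokens = formula.strip().strip("-").split()
    return " ".join(tokens)


def select_exact_matches(query_formula, cod_records):
    """Return records whose formula equals the query exactly after normalization."""
    target = normalize_formula(query_formula)
    return [
        rec for rec in cod_records
        if normalize_formula(rec["formula_sum"]) == target
    ]


# Placeholder records; the identifiers below are not real COD entries.
records = [
    {"cod_id": "0000001", "formula_sum": "- Ga N -"},
    {"cod_id": "0000002", "formula_sum": "Ga2 O3"},
    {"cod_id": "0000003", "formula_sum": "Ga  N"},
]
matches = select_exact_matches("Ga N", records)
```

Under this rule a partial overlap such as "Ga N" against "Ga2 O3" is rejected, which is the behavior a strict match is meant to guarantee.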

The final curation stage involved filtering for completeness, retaining only those entries where structural coordinates and hybrid-DFT electronic data were concurrently present. This pipeline yielded a validated ensemble of 4,542 inorganic compounds, providing a robust basis for structure-property relationship analysis.
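The completeness filter can be pictured as a join between the two sources. The sketch below is hypothetical: the function name, the dictionary layout, the formula-based file names, and the numeric values are illustrative placeholders and are not taken from the dataset.

```python
def filter_complete_entries(band_gaps, cif_files):
    """Keep formulas that have both a band gap and a CIF file.

    band_gaps: formula -> band gap in eV, or None when missing
    cif_files: formula -> path of the archived CIF file
    """
    complete = {}
    for formula, gap_ev in band_gaps.items():
        cif_path = cif_files.get(formula)
        if gap_ev is not None and cif_path is not None:
            complete[formula] = {"cif": cif_path, "band_gap_eV": gap_ev}
    return complete


# Placeholder inputs: the values are arbitrary, not HSE06 results.
band_gaps = {"GaN": 1.0, "ZnO": None, "Si": 2.0}
cif_files = {"GaN": "GaN.cif", "ZnO": "ZnO.cif"}
complete = filter_complete_entries(band_gaps, cif_files)
```

Here "ZnO" is dropped because its band gap is missing and "Si" is dropped because no CIF file was retrieved, so only "GaN" survives.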

Data Sources

- Crystal Structures: Crystallography Open Database (COD) [1,2]
- Band Gap Values: HSE band-gap database (hybrid DFT calculations using the Heyd–Scuseria–Ernzerhof (HSE06) functional) [3]

Files

dataset[1].zip (10.2 MB)
md5:d484d0a68b4a3c83ed7c9616282fbab9
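A downloaded copy of the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names and the local file path passed to them are assumptions.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "d484d0a68b4a3c83ed7c9616282fbab9"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected
```

For example, `verify_archive("dataset[1].zip")` would return True for an intact download, assuming the archive was saved under that name in the working directory.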

Additional details

Related works

Is supplement to
Dataset: Crystallography Open Database (COD) (Other)

References

  • [1] Gražulis, S., et al. (2009). Crystallography Open Database – an open-access collection of crystal structures. Journal of Applied Crystallography, 42(4), 726-729. https://doi.org/10.1107/S0021889809016690
  • [2] Gražulis, S., et al. (2012). Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Research, 40(D1), D420-D427. https://doi.org/10.1093/nar/gkr900
  • [3] Kim, C., et al. (2020). A band-gap database for semiconducting inorganic materials calculated with hybrid functional. Scientific Data, 7, 387. https://doi.org/10.1038/s41597-020-00723-8