Dataset for ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs
Description
# Where does this data come from
This is the data used in the *ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs* paper, available at https://arxiv.org/abs/2509.24115. It is a set of DFT calculations for point defects in silicon. The majority of the set consists of complex defects featuring two defect atoms, although vacancy and simple substitutional defects are also present. Section A of the supplementary materials in the original paper describes how the data was generated and gives further details.
# What is in this data
In total, this dataset features 6,082 trajectory computations. We flatten the trajectory dimension to create a list of 252,240 total DFT steps. Each step, with n atoms, is represented by a tuple of shape ([n, 12], [n, 3], 1), where the first [n, 12] tensor gives the atomic coordinates followed by the features (see paper for feature list), the second [n, 3] tensor gives the DFT-computed forces for each atom, and the last value is the energy of the structure. We separate the three components into different files: the [n, 12] tensor is the "X" data, the [n, 3] tensor is the force data (denoted "Y"), and the energy values are labeled "nrg".
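The layout described above can be illustrated with a minimal sketch on synthetic data. The arrays below are random NumPy placeholders rather than real DFT values, and the helper `make_fake_trajectory` is hypothetical; only the shapes and the column order (3 coordinates followed by 9 features) follow the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fake_trajectory(n_steps, n_atoms):
    """Hypothetical helper: one (X, Y, nrg) tuple per DFT step, random values."""
    return [
        (
            rng.normal(size=(n_atoms, 12)),  # X: 3 coordinates + 9 features
            rng.normal(size=(n_atoms, 3)),   # Y: per-atom forces
            float(rng.normal()),             # nrg: one energy per structure
        )
        for _ in range(n_steps)
    ]

# Two synthetic trajectories with different lengths and atom counts.
trajectories = [make_fake_trajectory(4, 5), make_fake_trajectory(2, 7)]

# Flatten the trajectory dimension into a single list of steps.
steps = [step for traj in trajectories for step in traj]

# Separate the three components, mirroring the "X", "Y", and "nrg" files.
X = [x for x, _, _ in steps]
Y = [y for _, y, _ in steps]
nrg = np.array([e for _, _, e in steps])

coords = X[0][:, :3]    # first three columns: atomic coordinates
features = X[0][:, 3:]  # remaining nine columns: features
```

Because n varies between structures, the X and Y components are kept as lists of arrays here rather than stacked into one tensor.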
Please note that all structures are placed in the first octant, in a box with a length of 16.406184 along each side. Periodic boundary conditions are assumed for all examples.
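As an illustration of what the periodic boundary condition implies for distance calculations, here is a minimal sketch assuming an axis-aligned cubic box with the side length quoted above. The function names are hypothetical and are not taken from the ADAPT code.

```python
import numpy as np

BOX_LENGTH = 16.406184  # side length of the box, as stated in the description

def wrap_into_box(coords, box=BOX_LENGTH):
    """Map coordinates back into [0, box) along every axis."""
    return np.mod(np.asarray(coords, dtype=float), box)

def minimum_image_distance(a, b, box=BOX_LENGTH):
    """Distance between two points using the nearest periodic image."""
    delta = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Shift each component by a whole number of box lengths toward zero.
    delta = delta - box * np.round(delta / box)
    return float(np.linalg.norm(delta))

wrapped = wrap_into_box([-1.0, 17.0, 5.0])
dist = minimum_image_distance([0.1, 0.0, 0.0], [16.3, 0.0, 0.0])
```

Two atoms near opposite faces of the box, as in the last line, are close neighbors through the boundary even though their raw coordinates differ by almost a full box length.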
# How is this data organized
The original dataset was split into training and test sets: all trajectories are in the training set except for 100 selected complex defect trajectories, which make up the test set. The reasoning for keeping each trajectory entirely within either the training or the test set can be found here: https://evandramko.github.io/files/interpolation.pdf.
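The idea of keeping whole trajectories on one side of the split can be sketched as follows. This is an illustration only: the function `split_by_trajectory` is hypothetical, and it picks test trajectories at random, which is an assumption and not necessarily how the 100 test trajectories were selected.

```python
import random

def split_by_trajectory(step_traj_ids, n_test, seed=0):
    """Split step indices so that no trajectory spans both sets.

    step_traj_ids[i] is the trajectory ID of flattened step i.
    Random selection of test trajectories is an assumption of this sketch.
    """
    traj_ids = sorted(set(step_traj_ids))
    rng = random.Random(seed)
    test_ids = set(rng.sample(traj_ids, n_test))
    train_idx = [i for i, t in enumerate(step_traj_ids) if t not in test_ids]
    test_idx = [i for i, t in enumerate(step_traj_ids) if t in test_ids]
    return train_idx, test_idx, test_ids

# Ten flattened steps belonging to four trajectories.
example_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
train_idx, test_idx, test_ids = split_by_trajectory(example_ids, n_test=2)
```

Splitting at the step level instead would place nearly identical consecutive steps from one trajectory in both sets.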
The files necessary to retrain the ADAPT model using the code found at https://github.com/EvanDramko/ADAPT_Released are ```~/adapt_data/raggedTrain_weighted.py``` and ```~/adapt_data/raggedTest_weighted.py``` for forces; ```~/adapt_data/nrg<Train/Test>.pt``` can be substituted for the labels when training on energies. Note that while the original atom representation has only 12 descriptors (3 coordinates, 9 features), these files contain 13 descriptors per atom. The last descriptor of each atom is an "importance weighting" described in Section 2.1.2 of the paper.
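For readers who want the 12-descriptor representation back, the following sketch separates a 13-column array into coordinates, features, and the importance weight. It operates on a plain NumPy array with the column order described above; the function name is hypothetical, and loading the actual files is left out because their on-disk layout is not specified here.

```python
import numpy as np

def split_weighted_descriptors(x13):
    """Split an [n, 13] array into coordinates, features, and weights."""
    x13 = np.asarray(x13)
    if x13.ndim != 2 or x13.shape[1] != 13:
        raise ValueError(f"expected shape [n, 13], got {x13.shape}")
    coords = x13[:, :3]      # columns 0-2: atomic coordinates
    features = x13[:, 3:12]  # columns 3-11: the nine features
    weights = x13[:, 12]     # column 12: importance weighting
    return coords, features, weights

# Toy input with two atoms; values are placeholders, not real data.
toy = np.arange(26, dtype=float).reshape(2, 13)
coords, features, weights = split_weighted_descriptors(toy)
```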
The training set has been recreated as a .extxyz file to allow for the retraining of MACE. For convenience, this has been included as well in the file: ```train_xyz_format.extxyz```.
The original pickle files containing data are included as well in the folder ```pickleFiles```. We recommend that users consider using this as the source for their data to ensure that they are not picking up any left-over artifacts from other usage.
Files
| Name | Size | MD5 |
|---|---|---|
| data_upload.zip | 5.9 GB | 2d9d852b3a8c5ec4156337efc060f577 |
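The MD5 checksum listed for the archive can be used to confirm that a download completed intact. A minimal sketch using only the Python standard library follows; the local path passed to `md5_of_file` is an assumption about where the archive was saved.

```python
import hashlib

# Checksum listed for data_upload.zip in the file table.
EXPECTED_MD5 = "2d9d852b3a8c5ec4156337efc060f577"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a multi-gigabyte file never sits in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example usage (assumes the archive is in the current directory):
# assert md5_of_file("data_upload.zip") == EXPECTED_MD5
```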
Additional details
Funding
- United States Department of Energy: DE-SC0022289
- U.S. National Science Foundation: CCF-2212558
- U.S. National Science Foundation: CCF-2212557
- U.S. National Science Foundation: CCF-1918651
- National Energy Research Scientific Computing Center: BES-ERCAP0020966