Dataset for ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs

Dramko, Evan; Xiong, Yihuang; Zhu, Yizhi; Hautier, Geoffroy; Reps, Thomas; Jermaine, Christopher; Kyrillidis, Anastasios

doi:10.5281/zenodo.17411327

Published October 22, 2025 | Version v3

Dataset Open

1. Rice University
2. Dartmouth College
3. University of Wisconsin-Madison

# Where does this data come from
This is the data used in the paper *ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs*, available at https://arxiv.org/abs/2509.24115. It is a set of DFT calculations for point defects in silicon. Most of the set consists of complex defects featuring two defect atoms, although vacancies and simple substitutional defects are also present. Section A of the paper's supplementary materials describes how the data was generated and gives further details.

# What is in this data
In total, this dataset features 6,082 trajectory computations. We flatten the trajectory dimension to create a list of 252,240 total DFT steps. Each step, with n atoms, is represented by a tuple of shape ([n, 12], [n, 3], 1): the first [n, 12] tensor gives the atomic coordinates followed by the features (see the paper for the feature list), the second [n, 3] tensor gives the DFT-computed forces for each atom, and the last value is the energy of the structure. We separate the three components into different files: the [n, 12] tensor is the "X" data, the [n, 3] tensor is the force data (denoted "Y"), and the energy values are labeled "nrg".
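As an illustration of the per-step layout described above, here is a minimal sketch that splits one step into coordinates, features, forces, and energy. It uses synthetic NumPy arrays rather than the real files, and the function name `split_step` is hypothetical; the only facts taken from the description are the shapes and the column order (3 coordinates first, then 9 features).

```python
import numpy as np

def split_step(x, y, nrg):
    """Split one DFT step into (coords, features, forces, energy).

    x   : [n, 12] array, 3 coordinates followed by 9 features per atom
    y   : [n, 3] array of DFT forces per atom
    nrg : scalar energy of the structure
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape[1] != 12:
        raise ValueError(f"expected X of shape [n, 12], got {x.shape}")
    if y.shape != (x.shape[0], 3):
        raise ValueError(f"expected Y of shape [{x.shape[0]}, 3], got {y.shape}")
    coords = x[:, :3]      # atomic coordinates
    features = x[:, 3:]    # the 9 per-atom features
    return coords, features, y, float(nrg)

# Synthetic stand-in for a single step with 5 atoms (not real data).
rng = np.random.default_rng(0)
x_demo = rng.random((5, 12))
y_demo = rng.random((5, 3))
coords, features, forces, energy = split_step(x_demo, y_demo, -1.5)
```

Because the number of atoms n varies between structures, a collection of steps is naturally a list of such arrays rather than one rectangular tensor.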

Please note that all structures are placed in the first octant, in a cubic box with a side length of 16.406184. Periodic boundary conditions are assumed for all examples.
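For readers computing interatomic distances from the raw coordinates, the periodic boundary condition matters for atoms near opposite faces of the box. The following is a minimal sketch of coordinate wrapping and the minimum-image distance for a cubic box; the function names are hypothetical and this is not taken from the ADAPT code. The box length is the value stated above, in the same units as the stored coordinates.

```python
import numpy as np

BOX_LENGTH = 16.406184  # cubic box side length from the dataset description

def wrap_positions(coords, box=BOX_LENGTH):
    """Map coordinates back into [0, box) along every axis."""
    return np.mod(np.asarray(coords, dtype=float), box)

def minimum_image_distance(a, b, box=BOX_LENGTH):
    """Distance between two points using the nearest periodic image."""
    delta = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    # Shift each component by a whole number of box lengths so that it
    # falls within half a box length of zero.
    delta = delta - box * np.round(delta / box)
    return float(np.linalg.norm(delta))

# Two atoms near opposite faces are close neighbours through the boundary.
d = minimum_image_distance([0.5, 0.5, 0.5], [16.0, 0.5, 0.5])
```

In this example the naive distance is 15.5, while the minimum-image distance is 16.406184 - 15.5 = 0.906184.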

# How is this data organized
The original dataset was split into training and test sets: all trajectories are in the training set except for 100 selected complex-defect trajectories, which make up the test set. The reasoning for why each full trajectory must be kept entirely in either the training set or the test set can be found here: https://evandramko.github.io/files/interpolation.pdf.
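A trajectory-level split of the kind described above can be sketched as follows. This is an illustration only: how the 100 test trajectories were actually selected is not stated here, so the random choice, the function name `split_by_trajectory`, and the toy trajectory IDs are all assumptions. The point is that steps are assigned by trajectory, never individually.

```python
import random

def split_by_trajectory(step_trajectory_ids, test_candidates, n_test, seed=0):
    """Assign flattened steps to train/test so no trajectory is divided.

    step_trajectory_ids : trajectory ID for each flattened DFT step
    test_candidates     : IDs eligible for the test set (e.g. complex defects)
    n_test              : number of whole trajectories to hold out
    """
    rng = random.Random(seed)
    held_out = set(rng.sample(sorted(test_candidates), n_test))
    train_idx, test_idx = [], []
    for i, traj in enumerate(step_trajectory_ids):
        (test_idx if traj in held_out else train_idx).append(i)
    return train_idx, test_idx, held_out

# Toy example: 9 steps from 4 trajectories; trajectory 0 is never eligible.
step_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3]
train_idx, test_idx, held_out = split_by_trajectory(step_ids, {1, 2, 3}, n_test=1)
```

Splitting the flattened list of steps at random instead would place consecutive steps of one trajectory on both sides of the split.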

The files necessary to retrain the ADAPT model using the code at https://github.com/EvanDramko/ADAPT_Released are ```~/adapt_data/raggedTrain_weighted.py``` and ```~/adapt_data/raggedTest_weighted.py``` for forces; ```~/adapt_data/nrg<Train/Test>.pt``` can be substituted for the labels when training on energies. Note that while the original atom representation should have only 12 descriptors (3 coordinates, 9 features), the entries in these files have 13 descriptors per atom. The last descriptor for each atom is an "importance weighting" described in Section 2.1.2 of the paper.
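To recover the original 12-descriptor representation from the weighted files, the last column can be separated out. The sketch below assumes each structure is available as an [n, 13] array inside a ragged Python list; the on-disk format and loading step are not shown, the arrays are synthetic, and the function name is hypothetical. How the weight enters training is described in the paper and is not reproduced here.

```python
import numpy as np

def split_weighted_descriptors(x13):
    """Split an [n, 13] array into coords [n, 3], features [n, 9], weights [n]."""
    x13 = np.asarray(x13, dtype=float)
    if x13.ndim != 2 or x13.shape[1] != 13:
        raise ValueError(f"expected shape [n, 13], got {x13.shape}")
    coords = x13[:, :3]      # 3 coordinates
    features = x13[:, 3:12]  # 9 features
    weights = x13[:, 12]     # importance weighting (last descriptor)
    return coords, features, weights

# Synthetic ragged list: two structures with different atom counts.
structures = [
    np.arange(4 * 13, dtype=float).reshape(4, 13),
    np.arange(6 * 13, dtype=float).reshape(6, 13),
]
parts = [split_weighted_descriptors(s) for s in structures]
```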

The training set has been recreated as a .extxyz file to allow for the retraining of MACE. For convenience, this has been included as well in the file: ```train_xyz_format.extxyz```.

The original pickle files containing data are included as well in the folder ```pickleFiles```. We recommend that users consider using this as the source for their data to ensure that they are not picking up any left-over artifacts from other usage.

Files

Name	Size
data_upload.zip (md5:2d9d852b3a8c5ec4156337efc060f577)	5.9 GB
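Given the size of the archive, it may be worth checking the download against the MD5 checksum listed for data_upload.zip. The following is a minimal sketch using only the Python standard library; the helper names and the local path passed to `verify_download` are hypothetical.

```python
import hashlib

# Checksum listed for data_upload.zip on this record.
EXPECTED_MD5 = "2d9d852b3a8c5ec4156337efc060f577"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("data_upload.zip")` would return `True` for an intact copy saved under that name in the working directory. Reading in chunks avoids loading the whole 5.9 GB file into memory.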

Additional details

Funding

United States Department of Energy: DE-SC0022289
U.S. National Science Foundation: CCF-2212558
U.S. National Science Foundation: CCF-2212557
U.S. National Science Foundation: CCF-1918651
National Energy Research Scientific Computing Center: BES-ERCAP0020966
