Published January 10, 2024 | Version v4
Dataset Open

DPA-2: A Large Atomic Model As a Multi-task Learner

  • 1. AI for Science Institute, Beijing 100080, P. R. China
  • 2. DP Technology, Beijing 100080, P. R. China
  • 3. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, P. R. China
  • 4. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 5. University of Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 6. HEDPS, CAPT, College of Engineering, Peking University, Beijing 100871, P. R. China
  • 7. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 8. CAS Key Laboratory of Magnetic Materials and Devices and Zhejiang Province Key Laboratory of Magnetic Materials and Application Technology, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 9. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, P. R. China
  • 10. Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, P. R. China
  • 11. Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey 08854, USA
  • 12. Department of Chemistry, Princeton University, Princeton, New Jersey 08540, USA
  • 13. College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, P. R. China
  • 14. Yuanpei College, Peking University, Beijing 100871, P. R. China
  • 15. School of Electrical Engineering and Electronic Information, Xihua University, Chengdu 610039, P. R. China
  • 16. State Key Laboratory of Superhard Materials, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 17. Key Laboratory of Material Simulation Methods & Software of Ministry of Education, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 18. International Center of Future Science, Jilin University, Changchun 130012, P. R. China
  • 19. Key Laboratory for Quantum Materials of Zhejiang Province, Department of Physics, School of Science, Westlake University, Hangzhou, Zhejiang 310030, P. R. China
  • 20. Atomistic Simulations, Italian Institute of Technology, 16156 Genova, Italy
  • 21. State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  • 22. Institute of Natural Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310030, P. R. China
  • 23. NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, P. R. China
  • 24. Institute for Advanced Algorithms Research, Shanghai 201306, P. R. China
  • 25. Laboratory of AI for Electrochemistry (AI4EC), IKKEM, Xiamen 361005, P. R. China
  • 26. Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
  • 27. Center for Machine Learning Research, Peking University, Beijing 100871, P. R. China
  • 28. School of Mathematical Sciences, Peking University, Beijing 100871, P. R. China
  • 29. Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, P. R. China

Description

Data:

  • The complete collection of datasets used in this work is packaged in the archive file data-v1.3.tgz. It contains both the upstream datasets for pre-training and the downstream datasets for fine-tuning, all in DeePMD format (a minimal loading sketch follows this list). We recommend creating a new directory and extracting the data files there with 'tar -xzvf data-v1.3.tgz'.
  • Each dataset is stored in its own subdirectory (e.g., Domains, Metals, H2O, and Others) and contains:
    • A README file
    • A 'train' directory (present if the dataset is used in upstream pre-training)
      • train.json -- A list of file paths for training systems
      • test.json -- A list of file paths for testing systems
    • A 'downstream' directory (present if the dataset is used in downstream fine-tuning)
      • train.json -- A list of file paths for training systems
      • test.json -- A list of file paths for testing systems
    • Main data files comprising various structures
    • Additional processing scripts
  • The root directory contains train.json and downstream.json files that aggregate the upstream and downstream splits described above.
  • The datasets used in this study are described in detail in Section S1 of the Supplementary Materials and are also available on AIS Square.
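
For readers who want to work with the extracted data programmatically, the sketch below shows one possible way to list a training split and open the first system it references. It is a minimal sketch, not part of the archive: it assumes train.json is a plain JSON list of system paths (as described above), uses the third-party dpdata package to parse the DeePMD npy format, and the 'H2O' path is only an example.

    # Minimal sketch (not included in data-v1.3.tgz): list a training split and
    # load one system. Assumes the third-party 'dpdata' package and that
    # train.json is a plain JSON list of DeePMD-format system paths.
    import json
    import dpdata

    # Illustrative path; adjust to wherever data-v1.3.tgz was extracted.
    with open("H2O/train/train.json") as f:
        train_systems = json.load(f)

    print(f"{len(train_systems)} training systems listed")

    # Load the first system in DeePMD npy format and report its size.
    first = dpdata.LabeledSystem(train_systems[0], fmt="deepmd/npy")
    print(first.get_nframes(), "frames;", first.get_natoms(), "atoms per frame")

Depending on the dataset, the paths stored in train.json may be relative to the dataset's own subdirectory rather than the current working directory, so they may need to be prefixed accordingly.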

 

Code:

  • The 'code' directory, extractable from the archive Code_model_script.tgz, includes the source code of the PyTorch (2.0) based version of DeePMD-kit. Installation and usage instructions can be found in the README file inside deepmd-pytorch-devel.zip.
  • UPDATE: deepmd-pytorch-devel-0110.zip supports unsupervised learning through denoising; see its README for more details.

 

Model:

  • The 'model' directory, also found in the extracted Code_model_script.tgz, contains the multi-task pre-trained DPA-2 model used in this work, together with its configuration file input.json, which specifies the simultaneous pre-training of the model on the 18 upstream datasets with shared descriptor parameters for 1 million steps (a minimal inspection sketch follows).
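
As a quick, schema-agnostic way to inspect the pre-training configuration, the snippet below simply loads input.json and summarizes its top-level sections. It makes no assumptions about DeePMD-kit's exact configuration keys beyond the file being valid JSON, and the path shown is illustrative.

    # Illustrative only: summarize the top-level structure of the multi-task
    # pre-training configuration without assuming DeePMD-kit's exact schema.
    import json

    # Path inside the extracted Code_model_script.tgz; adjust as needed.
    with open("model/input.json") as f:
        config = json.load(f)

    for key, value in config.items():
        detail = f"({len(value)} entries)" if isinstance(value, (dict, list)) else f"= {value!r}"
        print(f"{key:20s} {type(value).__name__:6s} {detail}")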

 

Scripts:

  • The 'scripts' directory, part of the extracted Code_model_script.tgz, contains all the scripts used for training, fine-tuning (learning-curve analysis), and distillation in this work:
    • 1. Upstream_single_task_training: Contains individual training scripts for DPA-2, Gemnet-OC, Equiformer-V2, Nequip, and Allegro for each of the 18 upstream datasets. The training script for MACE is also included and applies to all datasets.
    • 2. Downstream_lcurve_workflow: Contains code and input files for evaluating learning curves, including the DPA-2 fine-tuning transferability tests across the 15 downstream datasets shown in Figure 3 of the manuscript (a simple post-processing sketch follows this list).
    • 3. Distillation_workflow: Provides input files for distilling the fine-tuned DPA-2 models on datasets such as H2O-PBE0TS-MD, SSE-PBE-D, and FerroEle-D, as illustrated in Figure 4 of the manuscript.
  • Note that the scripts in 'Upstream_single_task_training' require installing deepmd-pytorch as well as the other models from their respective repositories (Gemnet-OC and Equiformer-V2: here [commit hash: 9bc9373]; Nequip: here [commit hash: dceaf49, tag: v0.5.6]; Allegro: here [commit hash: 22f673c]; MACE: here [commit hash: b76a2a9]).
  • The scripts in 'Downstream_lcurve_workflow' and 'Distillation_workflow' leverage dflow, a Python framework for constructing scientific computing workflows, and dpgen2, the second-generation Deep Potential GENerator; both are repositories of the DeepModeling community.
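
As an illustration of the kind of post-processing a learning-curve study involves, the sketch below fits a power law, error ~ a * N**(-b), to a set of (training-set size, test RMSE) points via a log-log least-squares fit. The numbers are placeholders, not results from the paper, and the fit is a generic analysis rather than the exact procedure used in 'Downstream_lcurve_workflow'.

    # Hypothetical learning-curve post-processing sketch. The sizes and RMSE
    # values below are placeholders, not results from this work.
    import numpy as np

    n_train = np.array([100, 300, 1000, 3000, 10000])   # training-set sizes (placeholder)
    rmse = np.array([52.0, 31.0, 18.0, 11.0, 6.5])       # test energy RMSE, meV/atom (placeholder)

    # Fit rmse ~ a * n_train**(-b) by linear regression in log-log space.
    slope, intercept = np.polyfit(np.log(n_train), np.log(rmse), 1)
    print(f"learning-curve exponent b = {-slope:.2f}, prefactor a = {np.exp(intercept):.1f}")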

Files (17.4 GB)

deepmd-pytorch-devel-0110.zip

md5:ee909d42f904a29091dd740237afcfcc (268.2 MB)
md5:789bedf203d673bdc95a09b582d83823 (17.1 GB)
md5:3ccc4146bfeff121b65ac5f82a44b5bf (94.1 MB)