Published January 10, 2024
| Version v4
Dataset
Open
DPA-2: A Large Atomic Model As a Multi-task Learner
Creators
- Zhang, Duo (1, 2, 3)
- Liu, Xinzijian (1, 2)
- Zhang, Xiangyu (4, 5)
- Zhang, Chengqian (2, 6)
- Cai, Chun (1, 2)
- Bi, Hangrui (1, 2)
- Du, Yiming (4, 5)
- Qin, Xuejian (7, 8)
- Huang, Jiameng (2, 9)
- Li, Bowen (10)
- Shan, Yifan (7, 8)
- Zeng, Jinzhe (11)
- Zhang, Yuzhi (2)
- Liu, Siyuan (2)
- Li, Yifan (12)
- Chang, Junhan (2, 13)
- Wang, Xinyan (2)
- Zhou, Shuo (2, 14)
- Liu, Jianchuan (15)
- Luo, Xiaoshan (16, 17)
- Wang, Zhenyu (17, 18)
- Jiang, Wanrun (1)
- Wu, Jing (19)
- Yang, Yudi (19)
- Yang, Jiyuan (19)
- Yang, Manyi (20)
- Gong, Fu-Qiang (21)
- Zhang, Linshuang (2)
- Shi, Mengchao (2)
- Dai, Fu-Zhi (1)
- York, Darrin M. (11)
- Liu, Shi (19, 22)
- Zhu, Tong (10, 23, 24)
- Zhong, Zhicheng (7, 8)
- Lv, Jian (17)
- Cheng, Jun (21, 25, 26)
- Jia, Weile (4)
- Chen, Mohan (1, 6)
- Ke, Guolin (2)
- E, Weinan (1, 27, 28)
- Zhang, Linfeng (1, 2)
- Wang, Han (6, 29)
- 1. AI for Science Institute, Beijing 100080, P. R. China
- 2. DP Technology, Beijing 100080, P. R. China
- 3. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, P. R. China
- 4. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100871, P. R. China
- 5. University of Chinese Academy of Sciences, Beijing 100871, P. R. China
- 6. HEDPS, CAPT, College of Engineering, Peking University, Beijing 100871, P. R. China
- 7. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, P. R. China
- 8. CAS Key Laboratory of Magnetic Materials and Devices and Zhejiang Province Key Laboratory of Magnetic Materials and Application Technology, Chinese Academy of Sciences, Ningbo 315201, P. R. China
- 9. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, P. R. China
- 10. Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, P. R. China
- 11. Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey 08854, USA
- 12. Department of Chemistry, Princeton University, Princeton, New Jersey 08540, USA
- 13. College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, P. R. China
- 14. Yuanpei College, Peking University, Beijing 100871, P. R. China
- 15. School of Electrical Engineering and Electronic Information, Xihua University, Chengdu 610039, P. R. China
- 16. State Key Laboratory of Superhard Materials, College of Physics, Jilin University, Changchun 130012, P. R. China
- 17. Key Laboratory of Material Simulation Methods & Software of Ministry of Education, College of Physics, Jilin University, Changchun 130012, P. R. China
- 18. International Center of Future Science, Jilin University, Changchun 130012, P. R. China
- 19. Key Laboratory for Quantum Materials of Zhejiang Province, Department of Physics, School of Science, Westlake University, Hangzhou, Zhejiang 310030, P. R. China
- 20. Atomistic Simulations, Italian Institute of Technology, 16156 Genova, Italy
- 21. State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- 22. Institute of Natural Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310030, P. R. China
- 23. NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, P. R. China
- 24. Institute for Advanced Algorithms Research, Shanghai 201306, P. R. China
- 25. Laboratory of AI for Electrochemistry (AI4EC), IKKEM, Xiamen 361005, P. R. China
- 26. Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
- 27. Center for Machine Learning Research, Peking University, Beijing 100871, P. R. China
- 28. School of Mathematical Sciences, Peking University, Beijing 100871, P. R. China
- 29. Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, P. R. China
Description
Data:
- The complete collection of datasets used in this work is packaged in the archive file data-v1.3.tgz. It contains both the upstream datasets for pre-training and the downstream datasets for fine-tuning, all in DeePMD format. We recommend creating a new directory and extracting the data files there with 'tar -xzvf data-v1.3.tgz'.
- Each dataset subdirectory (e.g., Domains, Metals, H2O, and Others) contains:
  - A README file
  - A 'train' directory (present if the dataset is used in upstream pre-training)
    - train.json -- a list of file paths for training systems
    - test.json -- a list of file paths for testing systems
  - A 'downstream' directory (present if the dataset is used in downstream fine-tuning)
    - train.json -- a list of file paths for training systems
    - test.json -- a list of file paths for testing systems
  - Main data files comprising various structures
  - Additional processing scripts
- The root directory contains train.json and downstream.json files that merge the respective upstream and downstream splits listed above.
- The datasets used in this study are described in Section S1 of the Supplementary Materials; they are also freely accessible, with extensive details, on AIS Square.
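If the train.json/test.json split files are plain JSON lists of system paths (an assumption based on the description above; the bundled README files are authoritative), a minimal loader might look like:

```python
import json
import tempfile
from pathlib import Path

def load_split(split_file):
    """Read a split file, assumed to hold a JSON list of system paths."""
    with open(split_file) as f:
        return [Path(p) for p in json.load(f)]

# Demonstrate on a toy split file (the real ones ship inside data-v1.3.tgz).
with tempfile.TemporaryDirectory() as tmp:
    split = Path(tmp) / "train.json"
    split.write_text(json.dumps(["H2O/train/sys-000", "Metals/train/sys-001"]))
    systems = load_split(split)
```

Each returned path would then point at one DeePMD-format system directory, ready to be passed to the training configuration.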
Code:
- The 'code' directory, extracted from the archive Code_model_script.tgz, includes the DeePMD-kit source code, based on PyTorch 2.0. Installation and usage instructions can be found in the README file inside deepmd-pytorch-devel.zip.
- UPDATE: deepmd-pytorch-devel-0110.zip adds support for unsupervised learning through denoising; see its README for details.
Model:
- The 'model' directory, also part of the extracted Code_model_script.tgz, holds the multi-task pre-trained DPA-2 model used in this work. It is accompanied by its configuration file, input.json, which specifies how the model was pre-trained simultaneously on the 18 upstream datasets, with shared descriptor parameters, for 1 million steps.
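The multi-task setup (one shared descriptor whose parameters are reused by a separate fitting head per upstream dataset) can be sketched as a hypothetical configuration fragment. The key names and dataset labels below are illustrative only; the bundled input.json is the authoritative reference.

```json
{
  "model": {
    "shared_dict": {
      "dpa2_descriptor": { "type": "dpa2" }
    },
    "model_dict": {
      "dataset_A": {
        "descriptor": "dpa2_descriptor",
        "fitting_net": { "type": "ener" }
      },
      "dataset_B": {
        "descriptor": "dpa2_descriptor",
        "fitting_net": { "type": "ener" }
      }
    }
  },
  "training": { "numb_steps": 1000000 }
}
```

Sharing the descriptor across all heads is what lets gradients from every upstream dataset shape a single transferable representation.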
Scripts:
- The 'scripts' directory, part of the uncompressed Code_model_script.tgz, comprises all the scripts used for training, fine-tuning (learning curve analysis), and distillation in this work:
- 1. Upstream_single_task_training: Contains individual training scripts for DPA-2, GemNet-OC, Equiformer-V2, NequIP, and Allegro, corresponding to the 18 upstream datasets. The training script for MACE is also included and is applicable across all datasets.
- 2. Downstream_lcurve_workflow: Includes code and input files for the learning-curve evaluation, including tests of DPA-2 fine-tuning transferability across 15 downstream datasets, as depicted in Figure 3 of the manuscript.
- 3. Distillation_workflow: Provides input files for distilling the fine-tuned DPA-2 models on datasets such as H2O-PBE0TS-MD, SSE-PBE-D, and FerroEle-D, as illustrated in Figure 4 of the manuscript.
- Note that the scripts in 'Upstream_single_task_training' require installing deepmd-pytorch as well as the other models from their respective repositories (GemNet-OC and Equiformer-V2: commit 9bc9373; NequIP: commit dceaf49, tag v0.5.6; Allegro: commit 22f673c; MACE: commit b76a2a9).
- The scripts in 'Downstream_lcurve_workflow' and 'Distillation_workflow' leverage Dflow, a Python framework for constructing scientific computing workflows, and dpgen2, the second-generation Deep Potential GENerator; both are repositories in the DeepModeling community.
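Conceptually, the distillation workflow labels new configurations with the fine-tuned (teacher) model and fits a cheaper student model to those labels. The following is a minimal sketch of that idea with stand-in models; it does not use the real DPA-2 or DeePMD-kit APIs.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_energy(x):
    # Stand-in for an expensive fine-tuned teacher model (hypothetical;
    # in the actual workflow this would be the fine-tuned DPA-2 model).
    return np.sin(x) + 0.1 * x**2

# 1. Sample configurations and label them with the teacher.
X = rng.uniform(-3.0, 3.0, size=200)
y = teacher_energy(X)

# 2. Fit a cheap student model (here: a degree-7 polynomial) to the labels.
coeffs = np.polyfit(X, y, deg=7)
student = np.poly1d(coeffs)

# 3. Check the student reproduces the teacher on held-out configurations.
X_test = rng.uniform(-3.0, 3.0, size=50)
rmse = np.sqrt(np.mean((student(X_test) - teacher_energy(X_test)) ** 2))
```

In the real workflow the student is a compact DeePMD-style potential, chosen so that production molecular-dynamics runs are far cheaper than evaluating the full pre-trained model.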