Published January 10, 2024 | Version v4
Dataset Open

DPA-2: A Large Atomic Model As a Multi-task Learner

  • 1. AI for Science Institute, Beijing 100080, P. R. China
  • 2. DP Technology, Beijing 100080, P. R. China
  • 3. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, P. R. China
  • 4. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 5. University of Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 6. HEDPS, CAPT, College of Engineering, Peking University, Beijing 100871, P. R. China
  • 7. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 8. CAS Key Laboratory of Magnetic Materials and Devices and Zhejiang Province Key Laboratory of Magnetic Materials and Application Technology, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 9. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, P. R. China
  • 10. Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, P. R. China
  • 11. Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey 08854, USA
  • 12. Department of Chemistry, Princeton University, Princeton, New Jersey 08540, USA
  • 13. College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, P. R. China
  • 14. Yuanpei College, Peking University, Beijing 100871, P. R. China
  • 15. School of Electrical Engineering and Electronic Information, Xihua University, Chengdu 610039, P. R. China
  • 16. State Key Laboratory of Superhard Materials, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 17. Key Laboratory of Material Simulation Methods & Software of Ministry of Education, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 18. International Center of Future Science, Jilin University, Changchun 130012, P. R. China
  • 19. Key Laboratory for Quantum Materials of Zhejiang Province, Department of Physics, School of Science, Westlake University, Hangzhou, Zhejiang 310030, P. R. China
  • 20. Atomistic Simulations, Italian Institute of Technology, 16156 Genova, Italy
  • 21. State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  • 22. Institute of Natural Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310030, P. R. China
  • 23. NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, P. R. China
  • 24. Institute for Advanced Algorithms Research, Shanghai 201306, P. R. China
  • 25. Laboratory of AI for Electrochemistry (AI4EC), IKKEM, Xiamen 361005, P. R. China
  • 26. Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
  • 27. Center for Machine Learning Research, Peking University, Beijing 100871, P. R. China
  • 28. School of Mathematical Sciences, Peking University, Beijing 100871, P. R. China
  • 29. Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, P. R. China

Description

Data:

  • The complete collection of datasets used in this work is packaged in the archive file data-v1.3.tgz. It contains both the upstream datasets for pre-training and the downstream datasets for fine-tuning, all in DeePMD format (a minimal loading sketch follows this list). We recommend creating a new directory and extracting the data files there with 'tar -xzvf data-v1.3.tgz'.
  • Each dataset is stored in its own subdirectory (e.g., Domains, Metals, H2O, and Others) and contains:
    • A README file
    • A 'train' directory (present if the dataset is used in upstream pre-training)
      • train.json -- A list of file paths for training systems
      • test.json -- A list of file paths for testing systems
    • A 'downstream' directory (present if the dataset is used in downstream fine-tuning)
      • train.json -- A list of file paths for training systems
      • test.json -- A list of file paths for testing systems
    • Main data files comprising various structures
    • Additional processing scripts
  • The root directory contains train.json and downstream.json files that aggregate the upstream and downstream splits described above.
  • The datasets used in this study are described in detail in Section S1 of the Supplementary Materials and are also available on AIS Square.
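
For readers who want to work with the extracted data programmatically, the sketch below shows one possible way to list a training split and open the first system it references. It is a minimal sketch, not part of the archive: it assumes train.json is a plain JSON list of system paths (as described above), uses the third-party dpdata package to parse the DeePMD npy format, and the 'H2O' path is only an example.

    # Minimal sketch (not included in data-v1.3.tgz): list a training split and
    # load one system. Assumes the third-party 'dpdata' package and that
    # train.json is a plain JSON list of DeePMD-format system paths.
    import json
    import dpdata

    # Illustrative path; adjust to wherever data-v1.3.tgz was extracted.
    with open("H2O/train/train.json") as f:
        train_systems = json.load(f)

    print(f"{len(train_systems)} training systems listed")

    # Load the first system in DeePMD npy format and report its size.
    first = dpdata.LabeledSystem(train_systems[0], fmt="deepmd/npy")
    print(first.get_nframes(), "frames;", first.get_natoms(), "atoms per frame")

Depending on the dataset, the paths stored in train.json may be relative to the dataset's own subdirectory rather than the current working directory, so they may need to be prefixed accordingly.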

 

Code:

  • The 'code' directory, extractable from the archive Code_model_script.tgz, includes the source code of the PyTorch (2.0) based version of DeePMD-kit. Installation and usage instructions can be found in the README file inside deepmd-pytorch-devel.zip.
  • UPDATE: deepmd-pytorch-devel-0110.zip supports unsupervised learning through denoising; see its README for more details.

 

Model:

  • The 'model' directory, also found in the extracted Code_model_script.tgz, contains the multi-task pre-trained DPA-2 model used in this work, together with its configuration file input.json, which specifies the simultaneous pre-training of the model on the 18 upstream datasets with shared descriptor parameters for 1 million steps (a minimal inspection sketch follows).
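
As a quick, schema-agnostic way to inspect the pre-training configuration, the snippet below simply loads input.json and summarizes its top-level sections. It makes no assumptions about DeePMD-kit's exact configuration keys beyond the file being valid JSON, and the path shown is illustrative.

    # Illustrative only: summarize the top-level structure of the multi-task
    # pre-training configuration without assuming DeePMD-kit's exact schema.
    import json

    # Path inside the extracted Code_model_script.tgz; adjust as needed.
    with open("model/input.json") as f:
        config = json.load(f)

    for key, value in config.items():
        detail = f"({len(value)} entries)" if isinstance(value, (dict, list)) else f"= {value!r}"
        print(f"{key:20s} {type(value).__name__:6s} {detail}")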

 

Scripts:

  • The 'scripts' directory, part of the extracted Code_model_script.tgz, contains all the scripts used for training, fine-tuning (learning-curve analysis), and distillation in this work:
    • 1. Upstream_single_task_training: Contains individual training scripts for DPA-2, Gemnet-OC, Equiformer-V2, Nequip, and Allegro for each of the 18 upstream datasets. The training script for MACE is also included and applies to all datasets.
    • 2. Downstream_lcurve_workflow: Contains code and input files for evaluating learning curves, including the DPA-2 fine-tuning transferability tests across the 15 downstream datasets shown in Figure 3 of the manuscript (a simple post-processing sketch follows this list).
    • 3. Distillation_workflow: Provides input files for distilling the fine-tuned DPA-2 models on datasets such as H2O-PBE0TS-MD, SSE-PBE-D, and FerroEle-D, as illustrated in Figure 4 of the manuscript.
  • Note that the scripts in 'Upstream_single_task_training' require installing deepmd-pytorch as well as the other models from their respective repositories (Gemnet-OC and Equiformer-V2: here [commit hash: 9bc9373]; Nequip: here [commit hash: dceaf49, tag: v0.5.6]; Allegro: here [commit hash: 22f673c]; MACE: here [commit hash: b76a2a9]).
  • The scripts in 'Downstream_lcurve_workflow' and 'Distillation_workflow' leverage dflow, a Python framework for constructing scientific computing workflows, and dpgen2, the second-generation Deep Potential GENerator; both are repositories of the DeepModeling community.
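
As an illustration of the kind of post-processing a learning-curve study involves, the sketch below fits a power law, error ~ a * N**(-b), to a set of (training-set size, test RMSE) points via a log-log least-squares fit. The numbers are placeholders, not results from the paper, and the fit is a generic analysis rather than the exact procedure used in 'Downstream_lcurve_workflow'.

    # Hypothetical learning-curve post-processing sketch. The sizes and RMSE
    # values below are placeholders, not results from this work.
    import numpy as np

    n_train = np.array([100, 300, 1000, 3000, 10000])   # training-set sizes (placeholder)
    rmse = np.array([52.0, 31.0, 18.0, 11.0, 6.5])       # test energy RMSE, meV/atom (placeholder)

    # Fit rmse ~ a * n_train**(-b) by linear regression in log-log space.
    slope, intercept = np.polyfit(np.log(n_train), np.log(rmse), 1)
    print(f"learning-curve exponent b = {-slope:.2f}, prefactor a = {np.exp(intercept):.1f}")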

Files (17.4 GB)

deepmd-pytorch-devel-0110.zip

md5:ee909d42f904a29091dd740237afcfcc (268.2 MB)
md5:789bedf203d673bdc95a09b582d83823 (17.1 GB)
md5:3ccc4146bfeff121b65ac5f82a44b5bf (94.1 MB)