Published December 24, 2023 | Version v1
Dataset | Open Access

DPA-2: Towards a universal large atomic model for molecular and material simulation

  • 1. AI for Science Institute, Beijing 100080, P. R. China
  • 2. DP Technology, Beijing 100080, P. R. China
  • 3. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, P. R. China
  • 4. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 5. University of Chinese Academy of Sciences, Beijing 100871, P. R. China
  • 6. HEDPS, CAPT, College of Engineering, Peking University, Beijing 100871, P. R. China
  • 7. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 8. CAS Key Laboratory of Magnetic Materials and Devices and Zhejiang Province Key Laboratory of Magnetic Materials and Application Technology, Chinese Academy of Sciences, Ningbo 315201, P. R. China
  • 9. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, P. R. China
  • 10. Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, P. R. China
  • 11. Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey 08854, USA
  • 12. Department of Chemistry, Princeton University, Princeton, New Jersey 08540, USA
  • 13. College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, P. R. China
  • 14. Yuanpei College, Peking University, Beijing 100871, P. R. China
  • 15. School of Electrical Engineering and Electronic Information, Xihua University, Chengdu 610039, P. R. China
  • 16. State Key Laboratory of Superhard Materials, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 17. Key Laboratory of Material Simulation Methods & Software of Ministry of Education, College of Physics, Jilin University, Changchun 130012, P. R. China
  • 18. International Center of Future Science, Jilin University, Changchun 130012, P. R. China
  • 19. Key Laboratory for Quantum Materials of Zhejiang Province, Department of Physics, School of Science, Westlake University, Hangzhou, Zhejiang 310030, P. R. China
  • 20. Atomistic Simulations, Italian Institute of Technology, 16156 Genova, Italy
  • 21. State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  • 22. Institute of Natural Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310030, P. R. China
  • 23. NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, P. R. China
  • 24. Institute for Advanced Algorithms Research, Shanghai 201306, P. R. China
  • 25. Laboratory of AI for Electrochemistry (AI4EC), IKKEM, Xiamen 361005, P. R. China
  • 26. Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
  • 27. Center for Machine Learning Research, Peking University, Beijing 100871, P. R. China
  • 28. School of Mathematical Sciences, Peking University, Beijing 100871, P. R. China
  • 29. Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, P. R. China

Description

Data:

  • The complete collection of datasets used in this research is packaged in the archive file data-v1.3.tgz. It covers both the upstream datasets for pre-training and the downstream datasets for fine-tuning, all in DeePMD format. We recommend creating a new directory and extracting the archive there with the command 'tar -xzvf data-v1.3.tgz' (see the extraction sketch after this list).
  • Each dataset, organized into subdirectories (e.g., Domains, Metals, H2O, and Others), contains:
    • A README file
    • A 'train' directory (present if the dataset is used in upstream pre-training)
      • train.json -- a list of file paths for the training systems
      • test.json -- a list of file paths for the testing systems
    • A 'downstream' directory (present if the dataset is used in downstream fine-tuning)
      • train.json -- a list of file paths for the training systems
      • test.json -- a list of file paths for the testing systems
    • The main data files, comprising the various structures
    • Additional processing scripts
  • The root directory contains train.json and downstream.json files that aggregate the respective upstream and downstream splits listed above.
  • The datasets used in this study are described in detail in Section S1 of the Supplementary Materials and are also readily accessible on AIS Square.
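
The following minimal sketch shows these two steps in Python: extracting the archive into a fresh directory and loading the root-level split files. It assumes data-v1.3.tgz sits in the current directory and that train.json and downstream.json are plain JSON lists of system paths; treat it as a sketch, not as part of the released scripts.

```python
import json
import tarfile
from pathlib import Path

# Extract the archive into a fresh directory
# (the Python equivalent of `tar -xzvf data-v1.3.tgz`).
data_dir = Path("dpa2-data")
data_dir.mkdir(exist_ok=True)
with tarfile.open("data-v1.3.tgz", "r:gz") as tar:
    tar.extractall(path=data_dir)

# Load the aggregated upstream/downstream splits from the root directory.
# Assumption: each file is a plain JSON list of DeePMD-format system paths.
with open(data_dir / "train.json") as f:
    upstream_systems = json.load(f)
with open(data_dir / "downstream.json") as f:
    downstream_systems = json.load(f)

print(f"{len(upstream_systems)} upstream systems, "
      f"{len(downstream_systems)} downstream systems")
```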

 

Code:

  • The 'code' directory, extracted from the archive Code_model_script.tgz, contains the DeePMD-kit source code, based on PyTorch 2.0. Installation and usage instructions are provided in the README file inside deepmd-pytorch-devel.zip. A hedged invocation sketch follows.
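
As an illustration only: DeePMD-kit exposes a `dp` command-line tool, and a training run is typically launched as below. The exact entry point and arguments for the bundled PyTorch branch may differ; the README inside deepmd-pytorch-devel.zip is authoritative.

```python
import subprocess

# Hedged sketch: `dp train` is the standard DeePMD-kit training command;
# the entry point shipped with this PyTorch branch may differ, so consult
# the README in deepmd-pytorch-devel.zip before running.
subprocess.run(["dp", "train", "input.json"], check=True)
```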

 

Model:

  • Within the 'model' directory, also extracted from Code_model_script.tgz, resides the multi-task pre-trained DPA-2 model used in this research, together with its configuration file, input.json, which specifies the simultaneous pre-training of this model across the 18 upstream datasets, with shared descriptor parameters, for 1 million steps. An illustrative configuration skeleton follows.
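
For orientation, here is an illustrative skeleton of a multi-task configuration of this kind, written as a Python dict. The key names are assumptions made for illustration; the bundled input.json is the authoritative schema. The point is the structure: one shared DPA-2 descriptor plus one fitting head per upstream dataset.

```python
# Illustrative skeleton only: the key names are assumed for illustration;
# the bundled input.json is the authoritative configuration.
multitask_config = {
    "model": {
        # One DPA-2 descriptor whose parameters are shared by every task.
        "shared_dict": {"dpa2_descriptor": ...},
        # One entry per upstream dataset (18 in total), each reusing the
        # shared descriptor while training its own fitting network.
        "model_dict": {
            "dataset_01": {"descriptor": "dpa2_descriptor", "fitting_net": ...},
            # ... 17 more entries, one per upstream dataset ...
        },
    },
    "training": {
        "numb_steps": 1_000_000,  # 1 million pre-training steps
    },
}
```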

 

Scripts:

  • The 'scripts' directory, also extracted from Code_model_script.tgz, comprises all the scripts used in this work for training, fine-tuning (learning-curve analysis), and distillation:
    • 1. Upstream_single_task_training: Contains the individual training scripts for DPA-2, Gemnet-OC, Equiformer-V2, Nequip, and Allegro on each of the 18 upstream datasets.
    • 2. Downstream_lcurve_workflow: Includes the code and input files for evaluating the learning curves, covering the tests of DPA-2 fine-tuning transferability across the 15 downstream datasets, as depicted in Figure 3 of the manuscript.
    • 3. Distillation_workflow: Provides the input files for distilling the fine-tuned DPA-2 models on datasets such as H2O-PBE0TS-MD, SSE-PBE-D, and FerroEle-D, as illustrated in Figure 4 of the manuscript.
  • Note that the scripts in 'Upstream_single_task_training' require the installation of deepmd-pytorch and of the other models from their respective repositories (Gemnet-OC and Equiformer-V2: commit hash 9bc9373; Nequip: commit hash dceaf49, tag v0.5.6; Allegro: commit hash 22f673c).
  • The scripts in 'Downstream_lcurve_workflow' and 'Distillation_workflow' leverage dflow, a Python framework for constructing scientific computing workflows, and dpgen2, the second-generation Deep Potential GENerator; both are repositories in the DeepModeling community. A minimal workflow sketch follows.
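
For readers unfamiliar with dflow, the following hypothetical sketch shows the Workflow/Step pattern that these workflow scripts build on. The image name and script line are placeholders, not taken from the bundled workflows; only the dflow API calls (ShellOPTemplate, Step, Workflow) are real.

```python
from dflow import ShellOPTemplate, Step, Workflow

# Hypothetical one-step workflow: the image and script are placeholders;
# the bundled Downstream_lcurve_workflow and Distillation_workflow scripts
# chain many such steps together via dpgen2.
templ = ShellOPTemplate(
    name="finetune",
    image="deepmodeling/deepmd-kit:latest",  # placeholder image name
    script="dp train input.json",            # placeholder training command
)
wf = Workflow(name="lcurve-demo")
wf.add(Step(name="train-step", template=templ))
wf.submit()  # submits to the configured Argo Workflows server
```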

Files (17.3 GB)

  • Code_model_script.tgz (185.2 MB, md5:6805fa08781263185b86e896bb7b6435)
  • data-v1.3.tgz (17.1 GB, md5:789bedf203d673bdc95a09b582d83823)