======= GENERAL INFORMATION =======

The solvated protein fragments dataset probes many-body intermolecular interactions between "protein fragments" and water molecules, which are important for the description of many biologically relevant condensed phase systems. It contains structures for all possible "amons" [1] (hydrogen-saturated covalently bonded fragments) of up to eight heavy atoms (C, N, O, S) that can be derived from chemical graphs of proteins containing the 20 natural amino acids connected via peptide bonds or disulfide bridges. For amino acids that can occur in different charge states due to (de-)protonation (i.e. carboxylic acids that can be negatively charged or amines that can be positively charged), all possible structures with up to a total charge of +-2e are included. In total, the dataset provides reference energies, forces, and dipole moments for 2731180 structures calculated at the revPBE-D3(BJ)/def2-TZVP level of theory [2-5] using the ORCA 4.0.1 code [6,7].

For more details, see https://arxiv.org/abs/1902.08408.

[1] Huang, B. and von Lilienfeld, O. A. arXiv:1707.04146 (2017).
[2] Grimme, S.; Antony, J.; Ehrlich, S. and Krieg, H. J. Chem. Phys. 132, 154104 (2010).
[3] Grimme, S.; Ehrlich, S. and Goerigk, L. J. Comput. Chem. 32, 1456-1465 (2011).
[4] Weigend, F. and Ahlrichs, R. Phys. Chem. Chem. Phys. 7, 3297-3305 (2005).
[5] Zhang, Y. and Yang, W. Phys. Rev. Lett. 80, 890 (1998).
[6] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2, 73-78 (2012).
[7] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. 8, e1327 (2018).

======= HOW TO CITE? =======

When using this dataset, please cite the following paper:

Unke, O. T. and Meuwly, M. "PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments and Partial Charges" arXiv:1902.08408 (2019).

and the digital object identifier (DOI):

Unke, O. T. and Meuwly, M. (2019). Solvated protein fragments dataset. Zenodo. http://doi.org/10.5281/zenodo.2605372.

======= DATA FORMAT =======

The dataset is stored as a Python dictionary in a compressed NumPy binary file (.npz). The dictionary contains seven NumPy arrays:

R (num_data, max_atoms, 3): Cartesian coordinates of nuclei (in Angstrom [A])
Q (num_data,): Total charge (in elementary charges [e])
D (num_data, 3): Dipole moment vector with respect to the origin (in elementary charges times Angstrom [eA])
E (num_data,): Potential energy with respect to free atoms (in electronvolt [eV])
F (num_data, max_atoms, 3): Forces acting on the nuclei (in electronvolt per Angstrom [eV/A])
Z (num_data, max_atoms): Nuclear charges/atomic numbers of nuclei
N (num_data,): Number of atoms in each structure (structures consisting of fewer than max_atoms atoms are zero-padded)

Please note that the potential energy is given with respect to free atoms (i.e. total atomization). The following constants were subtracted from the original values for each occurrence of the corresponding elements:

H: -13.717939590030356 eV
C: -1029.831662730747 eV
N: -1485.40806126101 eV
O: -2042.7920344362644 eV
S: -10831.264715514206 eV

In order to recover the original values, add the corresponding constant back once for each atom of that element.

To read the dataset, load the dictionary with Python:

>>> import numpy as np
>>> data = np.load("solvated_protein_fragments.npz")

and access individual entries with the appropriate dictionary key, e.g. "Z" for the nuclear charges:

>>> nuclear_charges = data["Z"]

See also "read_data.py" for a more comprehensive example.
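
The padding and energy conventions described above can be sketched in a few lines. This is a minimal illustration and is not taken from "read_data.py". The helper names (unpad, original_energy), the ATOM_REFERENCE_EV table keyed by atomic number, and the one-structure "toy" dictionary are hypothetical. It also assumes that the padding occupies the trailing slots of each structure, so that the first N[i] entries are the real atoms; if that does not hold for the file, a mask such as Z[i] > 0 would be needed instead.

```python
import numpy as np

# Per-element constants (eV) copied from the list above, keyed by atomic
# number: H=1, C=6, N=7, O=8, S=16.
ATOM_REFERENCE_EV = {
    1: -13.717939590030356,
    6: -1029.831662730747,
    7: -1485.40806126101,
    8: -2042.7920344362644,
    16: -10831.264715514206,
}


def unpad(data, i):
    """Return (Z, R, F) of structure i without the zero-padded entries.

    Assumption: padding is trailing, so the first N[i] entries are real atoms.
    """
    n = int(data["N"][i])
    return data["Z"][i, :n], data["R"][i, :n], data["F"][i, :n]


def original_energy(data, i):
    """Add the subtracted constants back, once per atom of each element."""
    z, _, _ = unpad(data, i)
    return float(data["E"][i]) + sum(ATOM_REFERENCE_EV[int(a)] for a in z)


# Synthetic stand-in with the same keys and shapes as the real file
# (one water-like structure, max_atoms = 4). The numbers are made up.
toy = {
    "N": np.array([3]),
    "Z": np.array([[8, 1, 1, 0]]),
    "R": np.array([[[0.00, 0.00, 0.00],
                    [0.96, 0.00, 0.00],
                    [-0.24, 0.93, 0.00],
                    [0.00, 0.00, 0.00]]]),
    "F": np.zeros((1, 4, 3)),
    "E": np.array([-10.0]),
}

z, r, f = unpad(toy, 0)
print(z, r.shape, f.shape)
print(original_energy(toy, 0))
```

The same functions should work on the object returned by np.load("solvated_protein_fragments.npz"), since it supports the same key-based access as the toy dictionary.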