A Reactive, Scalable and Transferable Model for Molecular Energies from a Neural Network Approach Based on Local Information

Despite the ever-increasing computer power, accurate ab initio calculations for large systems (thousands to millions of atoms) remain infeasible. Instead, approximate empirical energy functions are used. Most current approaches are either transferable between different chemical systems, but not particularly accurate, or they are fine-tuned to a specific application. In this work, a data-driven method to construct a potential energy surface based on neural networks is presented. Since the total energy is decomposed into local atomic contributions, the evaluation is easily parallelizable and scales linearly with system size. With prediction errors below 0.5 kcal mol⁻¹ for both unknown molecules and configurations, the method is accurate across chemical and configurational space, which is demonstrated by applying it to datasets from nonreactive and reactive molecular dynamics simulations and a diverse database of equilibrium structures. The possibility to use small molecules as reference data to predict larger structures is also explored. Since the descriptor only uses local information, high-level ab initio methods, which are computationally too expensive for large


Introduction
In 1929 Paul Dirac 1 noted that the (electronic and nuclear) Schrödinger equation (SE) contains all that is necessary to describe chemical phenomena and processes. As the underlying equation (SE) is too complicated to be solved in closed form but for the simplest systems, computational and numerical methods have been devised to find approximate solutions such that meaningful information about a system and/or a process can be obtained. Depending on the observable of interest, the meaning of "accuracy" may change, though. A total number of several tens of thousands of atoms is "large" from the perspective of what system size can be realistically investigated at the single-point energy level using density functional theory (DFT). 2 With increasing accuracy, or when considering optimized structures, vibrations or even (classical) nuclear dynamics, the size of the system that is computationally tractable by explicitly solving the electronic SE (i.e. by "ab initio" rather than semiempirical methods) reduces to fewer than a thousand atoms. 3 These limitations have spurred the development of alternative, more empirical methods.
For small systems (few atoms) it is common practice to directly interpolate a set of known and precomputed reference energies (obtained from a pointwise solution of the electronic SE) to construct a continuous functional form. Popular interpolation techniques include the modified Shepard algorithm, 4-6 the moving least-squares method, 7-9 permutational invariant polynomials [10][11][12] and the reproducing kernel Hilbert space method. [13][14][15][16] For big systems (proteins or condensed matter) a typical approach is to fit a large number (> 10³) of parameters of an empirical functional form, a so-called force field (FF), either to best reproduce reference energies, experimental data that can be derived from them (e.g. thermodynamic or spectroscopic observables) or both. 17 While some parameters can be determined from experiment, others require electronic structure calculations for fragments (e.g. partial atomic charges) or explicit molecular dynamics (MD) simulations (e.g. van der Waals parameters). Once parametrized, the total energy and corresponding forces required for MD simulations can be evaluated much more efficiently than by directly (and approximately) solving the SE. 18,19 With currently available computer power it is, for example, possible to run explicit atomistic MD simulations for small parts of a cell for several 100 ns. 20 However, general empirical FFs 21-25 also have a number of drawbacks, 26 including their limited accuracy, or the fact that most of them do not allow bond-breaking/bond-formation to be described. It is now possible to parametrize a FF to within fractions of 1 kcal mol⁻¹ (for energies) for single, isolated systems, and special potentials for metals, [27][28][29][30][31][32] bond-order-based (reactive) potentials, [33][34][35][36] and reactive force fields for particular systems [37][38][39][40][41] or processes (e.g. proton transfer) 42 have become available; nevertheless, it would be desirable to generalize this to larger classes of problems, irrespective of the particular type of application one has in mind.
One possible step in this direction has been taken during the past decade when machine learning (ML) approaches, which give computers the ability to learn without being explicitly programmed, 43 have been applied to train a computer system using large amounts of precomputed data (typically energies) to estimate properties for unknown compounds or structures. [44][45][46][47] Hence, instead of approximately solving the electronic SE or representing its solution through a ball-and-spring model as in a FF, a computer system learns to predict energies based on an increasing amount of data. Such an approach is motivated by the observation that the electronic Hamiltonian Ĥ is uniquely determined by the external potential, which in turn depends only on the set of nuclear charges {Z i } and atomic positions {r i } of the system. Therefore, all information necessary to determine E is contained in {Z i , r i } and there must exist an exact mapping f : {Z i , r i } → E, which returns the energy E given {Z i , r i }. If the mapping f : {Z i , r i } → E was known, directly solving the SE could be circumvented.
This situation is reminiscent of density functional theory (DFT) in that the existence of a suitable functional is guaranteed but its actual form is not known. As such, the fundamental object of interest in the present work is the potential energy surface (PES), an approximation to f : {Z i , r i } → E, which corresponds to a 3N-dimensional hypersurface that returns the total potential energy of a system E tot (r i ) given the positions r i of all N nuclei.
Artificial neural networks (NNs) [48][49][50][51][52][53][54] are a popular class of ML algorithms which have been used to tackle various difficult problems, including speech, 55 image 56 and face recognition. 57 In particular, feed-forward NNs have been proven to be general function approximators, 58,59 which makes them suitable for approximating f : {Z i , r i } → E. Ideally, the resulting PES should be accurate, rapid to evaluate, analytically differentiable, systematically improvable, scalable and applicable to bond-breaking/bond-formation problems ("reactive PES"). Additionally, it should be transferable between different systems and configurations. 60 Existing PESs typically fulfill only some of these requirements and the "ideal PES" does not exist yet, probably due to the difficulty of finding a functional form that would satisfy all needs simultaneously. In contrast, NNs do not assume a predefined functional form, and could offer important advantages.
NNs have been used previously to fit PESs for molecular systems in the spirit of many-body expansions. [61][62][63][64] While being accurate, they typically involve a large number of individual NNs (one for each term in the many-body expansion), making the method scale poorly for large systems. Recently, there have also been efforts to predict bond energies using a NN. 65 An alternative approach, known as high-dimensional NN (HDNN) and first proposed for bulk silicon, 66 decomposes the total energy of a system into atomic contributions, which is appealing because "energy" is an extensive property and it allows applying the same network to systems of different size. In an HDNN, an atomic descriptor vector (the "fingerprint for the atomic environment") is provided as input and yields the atomic energy E i as output.
All atomic contributions are added to give the total energy E tot of the system for a particular configuration {r i }.
It is useful to introduce an atomic descriptor because the dimensionality of the input vector x in needs to be fixed in a feed-forward NN and using Cartesian coordinates as input would limit the applicability of the network to specific system sizes. The descriptor combines the influence of all neighboring atoms up to a cutoff radius R (e.g. 6 Å) 60 with a continuous behaviour at the boundary. Introducing a cutoff allows the method to scale linearly with respect to the number of atoms. Another disadvantage of using Cartesian coordinates is that they are not invariant with respect to translation and rotation. Since NNs are purely numerical algorithms, they would output different values if the input coordinates changed due to such transformations of the system. In contrast, the descriptor is designed to be identical for all symmetry equivalent representations by construction.
In an HDNN, the entries of the descriptor vector are the values of several so-called symmetry functions, which algebraically combine distances and/or angles between the atom of interest and all other atoms in its neighborhood such that the resulting value is invariant with respect to translation, rotation and permutation of equivalent atoms. The individual symmetry functions are manually designed to respond differently to distinct combinations of distances and/or angles, such that a sufficient number (≈ 50) 66 of symmetry functions provides a unique fingerprint for an atomic environment. 60 Alternative methods to construct atomic environment descriptors as input for a NN based on orthonormal 3-D Zernike basis functions 67 or radial and angular distribution functions 68 have been discussed in the literature. In contrast, the smooth overlap of atomic positions (SOAP) approach 69 directly introduces a distance metric and a similarity kernel for atomic environments, such that it is not necessary to explicitly calculate the descriptor. Therefore, the SOAP approach is more suited for kernel-based ML methods. 70 In order to apply HDNNs to multi-component systems, 71 the symmetry functions are duplicated for each species and a separate NN is trained for each element. 60 Unfortunately, due to the rapidly increasing complexity of chemical space, this approach is still limited to systems containing only a few chemical elements. 72 Furthermore, such NNs are not transferable across chemical space and have to be retrained for every new system of interest.
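For concreteness, a radial symmetry function of the kind used in HDNNs can be sketched as follows (a minimal sketch; the parameter values η, r_s and R_c are arbitrary illustrative choices, not values from this work):

```python
import numpy as np

def cutoff_cosine(r, Rc=6.0):
    # cosine cutoff commonly used with Behler-Parrinello symmetry functions;
    # decays smoothly to zero at the cutoff radius Rc
    return np.where(r < Rc, 0.5 * (np.cos(np.pi * r / Rc) + 1.0), 0.0)

def radial_symmetry_function(r_ij, eta=0.5, r_s=0.0, Rc=6.0):
    # sums a Gaussian of each neighbor distance r_ij, weighted by the cutoff;
    # depending only on distances makes the value invariant under translation,
    # rotation and permutation of equivalent atoms
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff_cosine(r_ij, Rc)))
```

Several such functions with different η and r_s respond to different distance ranges; their values together form the fingerprint vector for one atom.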
A conceptually different approach, the deep tensor NN (DTNN), 73 allows reusing the same NN to predict energies of systems with different composition across chemical space. Similar to HDNNs, the DTNN accumulates atomic energy contributions to predict the total energy E tot . However, instead of an environment descriptor based on symmetry functions, it receives a vector of nuclear charges and a matrix of atomic distances as input. A tensor layer [74][75][76] then builds a coefficient vector c i for each atom i, which acts as the environment-dependent fingerprint. To do so, the coefficient vector c i is initialized depending on the species of atom i and recursively refined in T steps by adding interaction vectors v ij , which depend on the pairwise distance between atoms i and j ≠ i, as well as the current c j of atom j. After T refinements, the final c i is passed as input to a fully-connected layer to determine the atomic energy contribution E i of atom i.
Because each refinement step considers all pairwise distances, the evaluation of the DTNN scales quadratically with respect to the number of atoms. Although introducing a distance cutoff to achieve linear scaling has been proposed, 73 it is important to note that even with a cutoff, the network still requires information about all atoms present in the system in order to recursively refine the coefficient vectors c i (every refinement step requires knowledge about the current c j of other atoms). Using T = 3 interaction passes and 100k reference structures, the DTNN predicts the energy of structures in the QM9 dataset 77 accurately with a mean absolute error (MAE) of 0.84 kcal mol⁻¹. 73 More recently, the SchNet architecture was proposed, 78 which improves upon the DTNN.
Instead of refining the coefficient vectors c i with a tensor layer, they are iteratively updated by residual connections 79 between three interaction blocks, 78 which utilize interatomic continuous-filter convolutions. 78

In the present work, a NN-based method tailored for accurate energy evaluations, which can be applied to construct PESs for nonreactive and reactive dynamics of chemically heterogeneous systems in the condensed phase, is introduced. While being inspired by high-dimensional NNs, the descriptor does not rely on hand-crafted symmetry functions and encodes atomic species and environment simultaneously, similar to the coefficient vectors c i in the DTNN and SchNet. This allows training a single NN to predict the atomic energy contributions E i of all elements in their chemical environments. In contrast, high-dimensional NNs require separate NNs for each element. Contrary to iterative approaches based on tensor layers 73 or convolutions, 78 the descriptor contains strictly local information and is calculated in a single step. Thus, the proposed method scales linearly with respect to system size and can even be evaluated in parallel, because each atomic descriptor is independent of other descriptors and needs no iterative refinement. When applied to the QM9 dataset, 77 the proposed approach yields predictions with errors below 0.5 kcal mol⁻¹, transferable across chemical space. The predictions are also transferable across configurational space, as is demonstrated by applying the same method to several MD datasets. 81 When trained with appropriate reference data, the method is also able to describe reactions. By analyzing individual atomic energy contributions E i , it is shown that the network predicts energies in a chemically intuitive and interpretable way. Further, the possibility to train the network on small molecules to predict the energies of larger systems is demonstrated. Finally, possible future improvements are discussed.

Methods
In order to predict the energy of a system of interest, such as a molecule, a descriptor for each atom is supplied to a NN, which predicts an atomic energy contribution E i . The individual contributions are added to obtain the total energy E tot . Figure 1 gives a schematic overview of the computational protocol.
In the following, the atomic descriptor (section 2.1), the NN (section 2.2) and the process for training the NN (section 2.3) are described in more detail. It is important to note that only total energies are required as reference data during training, as the NN automatically learns to perform the energy decomposition into atomic contributions. This way, only true quantum mechanical observables are used as reference data and no, ultimately arbitrary, energy decomposition scheme 82-84 needs to be imposed.

[Figure 1, caption fragment: (c) the descriptor vector c i is supplied to a NN, which (d) outputs an atomic energy contribution E i ; (e) the individual contributions are accumulated to give E tot = Σ i E i . Since addition is commutative, E tot is automatically invariant with respect to atom permutations.]
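The accumulation step can be sketched as follows; `atomic_energy` here is an arbitrary stand-in for the trained NN, used only to show that the summation makes E tot independent of the atom ordering:

```python
import numpy as np

def atomic_energy(descriptor):
    # stand-in for the trained NN: any fixed function of a single
    # atomic descriptor vector suffices for this demonstration
    return float(np.tanh(descriptor).sum())

def total_energy(descriptors):
    # E_tot = sum_i E_i; since addition is commutative, permuting the
    # atoms (i.e. reordering the descriptors) leaves E_tot unchanged
    return sum(atomic_energy(d) for d in descriptors)
```

Because each term depends on one atom's local environment only, the individual contributions can also be evaluated in parallel.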

Atomic descriptor
Individual atoms and their local environment are represented by a descriptor, which needs to encode all information relevant to predicting its atomic energy contribution (relative positions and species of neighboring atoms). Further, due to the way feed-forward NNs are designed (see section 2.2), the descriptor must be of fixed size, no matter how many atoms are present. Finally, it is advantageous if the descriptor is invariant with respect to transformations which do not alter the energy of the system. This way, translational invariance, rotational invariance and invariance with respect to permutation of equivalent atoms need not be learned explicitly by the neural network.
In this work, the atomic descriptor consists of two parts: one part encoding the atomic species (C, H, O, . . . ) and a second part which encodes the local environment up to a cutoff radius R. Note that an atomic descriptor that encodes species and environment separately has been proposed previously. 68 There are several reasons for introducing a cutoff. First, the energy prediction scales linearly with respect to the number of atoms present in the system of interest. Second, while the network can be trained on rather small systems, it can then be applied to much larger systems, because locally, atomic environments of small and large systems are equivalent. Finally, it is a valid assumption that most (but not all) chemical interactions, which are relevant to the energy of the system, such as bonding, are inherently short-ranged. Methods to correct for long-range interactions are well-known in the literature 60,66,71,72 and are discussed in section 3. Hence the descriptor used here combines computationally advantageous aspects with a design based on physical/chemical principles.
Species descriptor. In principle, the atomic species could be encoded by a single number, either by an integer identifier (e.g. H= 1, C= 2, N= 3, . . . ) or by the nuclear charge Z (e.g. H= 1, C= 6, N= 7, . . . ). However, this introduces an ordinal relationship (e.g. H < C < N) between different atomic species, which can be detrimental to the network performance.
Since neural networks are purely numerical algorithms, ordinal relations in the inputs directly correlate with the network response, which is not meaningful for atomic species. Alternatively, a one-hot 85 encoding (e.g. H = [1 0 0 · · · ], C = [0 1 0 · · · ], N = [0 0 1 · · · ]) would be possible. However, two potential disadvantages of a one-hot encoding are that 1) the dimensionality of the encoding vector must necessarily be equal to the cardinality of the set of atomic species present in the data and 2) all encodings are equidistant by construction. Since it is intuitive to expect, for example, atomic species from the same group in the periodic table to behave similarly to one another, an optimal encoding should be able to directly represent these similarities.
For these reasons, the atomic species are instead encoded by embeddings. An embedding is a mapping from a discrete object i to a vector of real numbers v i ∈ R D , where D is the dimensionality of the embeddings. For example, word embeddings 86 find widespread use in the field of natural language processing. Here, words are mapped to a comparatively low-dimensional vector space, such that semantically similar words (e.g. "red", "green", and "blue" or "king", "monarch" and "emperor") appear close to each other (||v red − v blue || < ||v red − v king ||).
During the training process of the NN, the entries of the embedding vectors v i are free parameters, such that meaningful embeddings are directly learned from data. In this work, the dimensionality D of the embeddings is set equal to the number N g of distinct groups (columns) in the periodic table which are present in the reference data. For example, in the QM9 dataset, N g is 5. Note that a lower dimensionality would still allow a unique encoding of each element (albeit introducing an ordinal relation in the extreme case of D = 1).
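A species embedding of this kind amounts to a trainable lookup table. In the sketch below, the random initialization is a placeholder assumption; during training, the entries are optimized together with the network weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# the five elements in QM9 stem from N_g = 5 distinct groups of the
# periodic table, so the embedding dimensionality is chosen as D = 5
species = ["H", "C", "N", "O", "F"]
D = 5

# one D-dimensional vector per species; the random initialization here is
# an assumption -- in training these entries are free parameters
embeddings = {s: rng.normal(scale=0.1, size=D) for s in species}

def species_descriptor(symbol):
    # a lookup rather than a one-hot encoding: unlike one-hot vectors,
    # learned embeddings are not forced to be equidistant, so chemically
    # similar species can end up close to each other in embedding space
    return embeddings[symbol]
```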
However, elements from the same group in the periodic table are expected to have similar properties, and choosing D = N g in principle allows every distinct group to be encoded along orthogonal directions, thus avoiding ordinal relations between species. For more details on the concept of embeddings, the reader is referred to the literature. 87

Environment descriptor. All information about the local environment of a given atom i up to a cutoff radius R is contained in the neighborhood density function ρ i given by

ρ_i(r) = Σ_j Z_j δ(r − r_j), ||r_j|| < R   (1)

where the position r = (x, y, z)^T ∈ R³ is relative to atom i, Z j and r j are the nuclear charge and relative position of neighboring atom j, δ is the Dirac delta function, and the sum runs over all atoms j closer than R. The concept of a neighborhood density function has been used previously in the derivation of the SOAP similarity kernel. 69 Note that the use of relative positions r − r j makes ρ i translationally invariant and the commutativity of addition ensures permutational invariance. By construction, ρ i is zero everywhere except for positions r j of neighboring atoms j, where the function value encodes the atomic species of j by its nuclear charge. Thus, ρ i completely describes the local atomic environment of atom i up to a distance R.
In order to obtain a fixed-length input x in for use in a feed-forward layer, ρ i is expanded into a basis set of fixed dimension with expansion coefficients c klm and basis functions ψ klm ,

ρ_i(r) ≈ Σ_{k=0}^{K−1} Σ_{l=0}^{L−1} Σ_{m=−l}^{l} c_klm ψ_klm(r),  ψ_klm(r) = g_k(r; R) Y_lm(θ, φ)   (2)

where g k are radial basis functions and Y lm are spherical harmonics. The Zernike descriptor 67 also relies on a basis set expansion, but uses different basis functions. K and L define the maximum degrees of the radial and angular parts of the expansion, respectively, and R is the cutoff radius. In order to be consistent with the commonly used notation of spherical harmonics, the Cartesian coordinate vector r is transformed 88 to spherical coordinates.
Many different choices for the radial basis functions g k (r; R) are possible. Here

g_k(r; R) = exp(−(r − r_k)² / (2Δ²)) · s(r; R),  r_k = kR/K   (3)

is chosen, with evenly spaced Gaussian centers r k inside the cutoff sphere and a width Δ on the order of the spacing R/K. Due to the cutoff function s(r; R), g k (r; R) is zero whenever r > R. Choosing

s(r; R) = 1 for r ≤ r_s ;  1 − 10x³ + 15x⁴ − 6x⁵ with x = (r − r_s)/(R − r_s) for r_s < r < R ;  0 for r ≥ R   (4)

as cutoff function, with r s = R − R/K, ensures that g k (r; R) has smooth first and second derivatives, such that no numerical artifacts are introduced when an atom enters or leaves the cutoff sphere, while leaving the Gaussian part of g k (r; R) largely unaffected (see Figure S1). The cutoff function s(r; R) is a smooth approximation to the step function and influences the value of g k (r; R) only when r > r s . Although it would be possible to use a non-sigmoid cutoff function that starts decaying as soon as r > 0, this would lead to largely different numerical influences of g k (r; R) on the network predictions depending on the value of k, therefore effectively introducing an a priori distance-based weighting. In contrast, the present choice of s(r; R) allows the NN to learn to weigh the influence of different distances in a data-driven manner.
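Assuming a smootherstep polynomial for the cutoff s(r; R) (an assumption; any sigmoid-like function whose first and second derivatives vanish at r_s and R would serve), the boundary behavior can be checked numerically:

```python
import numpy as np

def cutoff(r, R=3.0, K=7):
    # smooth approximation to a step function: 1 for r <= r_s, 0 for r >= R;
    # the specific polynomial (smootherstep) is an assumption, chosen because
    # its first and second derivatives vanish at both ends of the switch region
    rs = R - R / K
    x = np.clip((r - rs) / (R - rs), 0.0, 1.0)
    return 1.0 - x**3 * (10.0 - 15.0 * x + 6.0 * x**2)

R, K = 3.0, 7
rs = R - R / K
h = 1e-5
# central finite differences: the first derivative is numerically zero where
# the polynomial joins the constant regions, so no kink is introduced when
# an atom crosses r_s or the cutoff radius R
d_at_rs = (cutoff(rs + h) - cutoff(rs - h)) / (2 * h)
d_at_R = (cutoff(R + h) - cutoff(R - h)) / (2 * h)
```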
As long as K and L are sufficiently large, the information stored in the coefficients c klm is comparable to that encoded in ρ i . Note that for predicting energies, some loss of information is not problematic as long as the resulting descriptor can distinguish different environments sufficiently well.
The expansion coefficients c klm for a general function f (r) can be obtained from the projection

c_klm = ∫ f(r) ψ_klm(r) dr   (5)

Fortunately, it is not necessary to calculate an integral to obtain the expansion coefficients for the neighborhood density function. Since ρ i (r) is a sum of δ functions (Eq. 1), the coefficients are efficiently obtained by summation:

c_klm = Σ_j Z_j ψ_klm(r_j)   (6)

Note that the values of the coefficients c klm still depend on the orientation of the chosen reference coordinate system, because the values of the 2l + 1 spherical harmonics for a particular l are orientation dependent. Fortunately, the 2l + 1 coefficients for a given combination of k and l can be combined to a rotationally invariant quantity a kl according to

a_kl = Σ_{m=−l}^{l} c_klm²   (7)
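A minimal sketch of the coefficient summation and the rotationally invariant combinations a kl is given below, building the spherical harmonics from SciPy's associated Legendre functions. The Gaussian radial functions, their centers and widths are illustrative assumptions, not the exact choices of this work:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def sph_harm_c(m, l, az, pol):
    # complex spherical harmonic Y_lm built from associated Legendre
    # polynomials; normalization follows the common quantum convention
    am = abs(m)
    norm = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - am) / factorial(l + am))
    y = norm * lpmv(am, l, np.cos(pol)) * np.exp(1j * am * az)
    return (-1) ** am * np.conj(y) if m < 0 else y

def radial_basis(r, R=3.0, K=7):
    # Gaussian radial functions with evenly spaced centers k*R/K and a smooth
    # cutoff; centers and widths here are illustrative assumptions
    centers = (np.arange(K) * R / K)[:, None]
    g = np.exp(-0.5 * ((r[None, :] - centers) / (R / K)) ** 2)
    rs = R - R / K
    x = np.clip((r - rs) / (R - rs), 0.0, 1.0)
    return g * (1.0 - x**3 * (10.0 - 15.0 * x + 6.0 * x**2))

def descriptor(Z, pos, R=3.0, K=7, L=7):
    """Rotationally invariant a_kl values for the atom at the origin;
    `pos` holds relative neighbor positions, `Z` their nuclear charges."""
    r = np.linalg.norm(pos, axis=1)
    keep = r < R
    Z, pos, r = Z[keep], pos[keep], r[keep]
    pol = np.arccos(np.clip(pos[:, 2] / r, -1.0, 1.0))  # polar angle
    az = np.arctan2(pos[:, 1], pos[:, 0])               # azimuthal angle
    g = radial_basis(r, R, K)                           # shape (K, n_neighbors)
    a = np.zeros((K, L))
    for l in range(L):
        for m in range(-l, l + 1):
            c_klm = g @ (Z * sph_harm_c(m, l, az, pol))  # sum over neighbors j
            a[:, l] += np.abs(c_klm) ** 2  # summing over m removes orientation
    return a.ravel()  # K*L descriptor entries
```

Rotating all neighbor positions by the same rotation matrix leaves the descriptor unchanged, which can be verified directly.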
In total, there are K · L different a kl values, which are concatenated to the atom embedding vector v of dimensionality N g to form the descriptor vector c. Because a kl has continuous first derivatives with respect to the atom coordinates, derivatives necessary for e.g. force calculations are easily obtained by the chain rule. Note that because a single vector c = x in ∈ R^(N g +K·L) is supplied to the NN, the network cannot explicitly distinguish between the species and environment parts of the descriptor. In this work, K = 7, L = 7 and R = 3 Å are chosen for all datasets.
Section S1.1 details how the values of K, L and R were chosen and how they influence the predictive accuracy of the NN.

Neural Network
A feed-forward NN consists of an input layer connected to one or multiple hidden layers and an output layer. Every layer can be considered as a function which takes an n in -dimensional input vector x and transforms it to an n out -dimensional output vector y. For most NNs, the transformation in each layer can be written as

y = φ(W^T x + b)   (8)

where W is an n in × n out weight matrix, b is an n out -dimensional bias vector, and φ is the activation function. For simplicity, the shorthand notation φ(x) symbolizes element-wise application of φ to x (performed independently on each vector entry). All entries of the weight matrix W and the bias vector b are free parameters, which are initialized randomly and optimized when the network is trained.
The output y of each layer is the input x to the next successive layer until the output layer is reached. Usually, the output layer uses the identity function as activation function and its output y out is the prediction of the neural network (it is possible to predict more than one quantity at once using the same network). The input layer applies no transformation to its input data x in at all (the activation function is the identity function, W is the identity matrix and the bias vector contains only zeros) and is only used to provide data for the first hidden layer. The complete NN can therefore be written as a nested version of Eq. 8 with different weight matrices W i , bias vectors b i , and activation functions φ i for each layer i. For example, a NN with two hidden layers can be written as

y_out = W_3^T φ_2(W_2^T φ_1(W_1^T x_in + b_1) + b_2) + b_3   (9)

Note that it is not necessary to symbolically differentiate an expression such as Eq. 9 if derivatives of y out with respect to x in (or any weight or bias parameter) are required. Instead, analytical derivatives are efficiently calculated using automatic differentiation. 89

In the present work, square unit augmented layers 92 given by

y = φ(W_1^T x + W_2^T x² + b)   (10)

are used to construct the NN instead of ordinary layers (see Eq. 8).
Here, x² is shorthand notation for the element-wise square of x. The independent weight matrices W 1 and W 2 are of size n in × n out , and b and φ are the bias vector and activation function, respectively (see Eq. 8). The reason for using square unit augmented layers is that properties reminiscent of radial basis function networks [93][94][95] can be included at little additional computational expense, 92 provided that a sigmoidal activation function is used (see Figure S2 for an illustration).
The activation function for the hidden layers is φ(x) = s · arcsinh(x) where s = 1.25673480 ensures that φ(x) has self-normalizing properties 96 (activations converge automatically to zero mean and unit variance), similar to the recently proposed SELU 96 function. For the output layer, the identity function is used. In the present work arcsinh(x) was found to give superior results compared to more commonly used activation functions such as tanh(x). One possible reason for the improved performance is that the function does not saturate for large or small values of x (see Figure S2), which alleviates the vanishing gradient problem 97 and helps to improve learning.
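A minimal sketch of a network built from such layers follows; the layer sizes, initialization scales and number of hidden layers are arbitrary illustrative choices (only the scaling constant s and the input length N g + K·L = 5 + 49 = 54 come from the text):

```python
import numpy as np

S = 1.25673480  # scaling constant from the text

def phi(x):
    # self-normalizing activation; unlike tanh it does not saturate for
    # large |x|, which alleviates the vanishing gradient problem
    return S * np.arcsinh(x)

def square_unit_layer(x, W1, W2, b):
    # square unit augmented layer: the element-wise square x**2 enters
    # through its own weight matrix W2, giving RBF-like flexibility cheaply
    return phi(x @ W1 + (x ** 2) @ W2 + b)

def forward(x, layers):
    # nested layers; the output layer uses the identity activation
    *hidden, (W1, W2, b) = layers
    for w1, w2, bb in hidden:
        x = square_unit_layer(x, w1, w2, bb)
    return x @ W1 + (x ** 2) @ W2 + b

rng = np.random.default_rng(0)
sizes = [54, 64, 64, 1]  # descriptor length 54 -> two hidden layers -> E_i
layers = [
    (rng.normal(scale=0.1, size=(i, o)), rng.normal(scale=0.1, size=(i, o)),
     np.zeros(o))
    for i, o in zip(sizes[:-1], sizes[1:])
]
E_i = forward(rng.normal(size=54), layers)  # one atomic energy contribution
```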
In summary, the energy prediction consists of the steps illustrated in Figure 1.

Training
NNs are trained to predict energies on the QM9 dataset 77 and on several MD datasets. 81 Prior to training, each dataset is split into three parts: the training set, the validation set and the test set. During training, the squared error per atom (SEpA) is minimized via Adam optimization in minibatches 104 of ten reference structures, using a learning rate of 10
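The training procedure can be sketched with a trivial linear stand-in model. The SEpA definition below is one plausible reading (squared error divided by the squared atom count), and all values except the minibatch size of ten reference structures are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: "molecules" with 3-12 atoms, descriptors of length 8, and
# reference energies from a hidden linear rule (stand-in for ab initio data)
w_true = rng.normal(size=8)
mols = [rng.normal(size=(rng.integers(3, 13), 8)) for _ in range(200)]
e_ref = [float((m @ w_true).sum()) for m in mols]

def sepa_grad(w, mol, e):
    # squared error per atom (one plausible definition) and its gradient
    n = mol.shape[0]
    diff = (mol @ w).sum() - e
    return (diff / n) ** 2, 2.0 * diff / n**2 * mol.sum(axis=0)

# Adam optimization in minibatches of ten reference structures
w = np.zeros(8)
m_t, v_t = np.zeros(8), np.zeros(8)
b1, b2, lr, eps, t = 0.9, 0.999, 1e-2, 1e-8, 0
loss0 = sum(sepa_grad(w, m, e)[0] for m, e in zip(mols, e_ref))
for epoch in range(50):
    for i in range(0, len(mols), 10):
        g = np.mean([sepa_grad(w, m, e)[1]
                     for m, e in zip(mols[i:i + 10], e_ref[i:i + 10])], axis=0)
        t += 1
        m_t = b1 * m_t + (1 - b1) * g          # first-moment estimate
        v_t = b2 * v_t + (1 - b2) * g * g      # second-moment estimate
        w -= lr * (m_t / (1 - b1**t)) / (np.sqrt(v_t / (1 - b2**t)) + eps)
loss1 = sum(sepa_grad(w, m, e)[0] for m, e in zip(mols, e_ref))
```

In the real setting, a validation set is used to monitor generalization during training and the test set is held out entirely.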

Atomic energies
Since the NNs are trained to decompose energies of a system into atomic contributions, it is instructive to visualize the "energy spectrum" for each atomic species in the QM9 dataset.

[Figure 2: "Spectra" of atomic energies in the QM9 dataset for different species (relative to the energy of a free atom). In order to obtain the spectra, the atomic energy predictions of all five NNs trained on 100k structures were averaged and the curves were obtained by kernel density estimation with the Sheather-Jones bandwidth selection method. 107 Figures S7, S8, S9, S10 and S11 show the respective unaveraged results.]

The atomic energies of C atoms span the widest range (> 100 kcal mol⁻¹), followed by N (> 60 kcal mol⁻¹). (See Table S1 for an illustration of the exponential growth of possible combinations when traversing the bonding graph.)
Interestingly, however, most atoms can be assigned to just a few clusters (see Figure S5).
For example, more than half of all C atoms belong to the 331 most common C-atom clusters.
Since only graph-based information (but no geometric information such as distances and angles) is considered in the clustering approach, it is not evident that atoms belonging to the same cluster are energetically similar. As a qualitative test of how meaningful the clustering is, the cluster statistics (mean and variance of atomic energies for each cluster) from the raw data are considered (see Figure 2). For this, every cluster is represented by a Gaussian distribution with mean and variance equal to the corresponding cluster statistics, and normalized according to the atom count. Even though assuming a Gaussian distribution is a crude approximation, the sum of all Gaussians (see Figure S6) closely resembles Figure 2, so the graph-based clustering approach is considered to be meaningful.
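The superposition test described above can be sketched as follows; the cluster means, variances and counts below are made-up numbers for illustration:

```python
import numpy as np

def cluster_density(x, means, variances, counts):
    # superposition of one normalized Gaussian per cluster, weighted by the
    # number of atoms assigned to that cluster; the result is a probability
    # density over atomic energies
    total = np.zeros_like(x, dtype=float)
    for mu, var, n in zip(means, variances, counts):
        total += n * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return total / np.sum(counts)

# hypothetical cluster statistics (kcal/mol relative to the free atom)
x = np.linspace(-250.0, -50.0, 2001)
dens = cluster_density(x, means=[-170.0, -150.0, -120.0],
                       variances=[25.0, 16.0, 36.0], counts=[500, 300, 200])
```

Comparing such a mixture with the kernel density estimate of the raw atomic energies indicates how well the graph-based clusters capture energetically distinct groups.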
In order to interpret the data, chemical similarities between different clusters are analyzed and the clusters are summarized, based on functional groups, into different atom types. Apart from allowing interpretation of the network predictions, the energies of different atom types can be tabulated and used for a rapid estimate of the energy of a molecule given only its chemical structure, similar to how NMR chemical shifts can be estimated. 111 Table 1 lists atomic energies (relative to a free atom) of functionally different C atom types.

One exception is conjugated sp²-hybridized C atoms, which are even more stable due to their "aromatic" nature. When bound to electronegative atoms, such as N, O and F, the stabilization energy of carbon atoms appears to be correlated with the electronegativity of the bonding partner. A physically appealing interpretation is that a large difference in electronegativity increases the ionic character of the bond and therefore increases the stabilization energy.
While such trends may be obvious to chemists, a somewhat more subtle effect can be seen in the increasing stability from primary to quaternary C-atoms. This can be explained by hyperconjugation 108 (electron density from occupied σ-bonds is donated to unoccupied orbitals, also known as the positive inductive or +I effect 112 ). Such a resonance stabilization is well known for carbocations and carbon radicals, which become more stable with increasing number of neighboring alkyl groups. A related trend is found from chemical shift measurements in 13 C-NMR experiments, where typical shifts increase from 15-30 ppm, to 22-45 ppm and further to 30-58 ppm when going from primary to tertiary C-atoms. 113 This is usually attributed to the increased nuclear shielding due to the additional electron density around the nucleus.
Similar observations are made for H, N, O and F atoms (see Tables S2, S3, S4 and S5).
Note that some of the previously discussed trends can be reversed for the other elements.
For example, instead of being stabilized by neighboring alkyl groups, O atoms typically are destabilized by the +I effect. However, this is to be expected since O atoms are already partially negatively charged due to their high electronegativity. The +I effect then leads to an amplification of this charge and therefore destabilization.

Errors
QM9 dataset. 77 Mean absolute errors (MAEs) and root mean squared errors (RMSEs) for the NNs trained with different training set sizes are summarized in Table 2 and compared with the performance of the DTNN 73 and SchNet. 78 The NN trained on 100k reference structures predicts structures in the QM9 dataset accurately, with a MAE of 0.41 kcal mol⁻¹ and an RMSE of 0.86 kcal mol⁻¹. Note that SchNet has lower errors for larger training set sizes, but is outperformed by the present approach for smaller training sets. Also, SchNet does not employ a cutoff radius R and therefore uses significantly more information in its prediction. Figure 3 shows the convergence of MAE and RMSE with increasing training set size.
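For reference, the two error measures compare as follows; the RMSE weights outliers more strongly, so RMSE ≥ MAE always holds:

```python
import numpy as np

def mae(pred, ref):
    # mean absolute error
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(ref))))

def rmse(pred, ref):
    # root mean squared error; the squaring penalizes large individual
    # errors more strongly than the MAE does
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(ref)) ** 2)))
```

A large gap between RMSE and MAE, as observed here, indicates that a small number of outliers dominates the squared error.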
While MAE and RMSE are useful measures for the overall performance of a method, it can also be instructive to consider how errors are distributed. Figure 4 reveals that for all training set sizes starting from 10k, more than half of all errors are below 0.5 kcal mol −1 . While most outliers can be explained by underrepresented environments in the training data, some of the largest prediction errors are probably due to a different reason. They belong to a group of eleven molecules in the QM9 dataset for which the electronic structure calculation did not converge at all (three molecules) or only using loose convergence criteria (eight molecules). 77 Most of these structures feature unconventional chemical bonding and their electronic structure potentially has multi-reference character. Therefore, it is possible that the quantum mechanical reference energies themselves are erroneous for these structures, explaining the large prediction errors. At the very least, they seem to be particularly difficult to predict for ab initio methods as well.
The ability of the NN to identify problematic structures can even be advantageous, e.g. to detect potentially erroneous entries in large reference databases.
[Figure caption: Structures in the QM9 dataset with the largest prediction errors. They all belong to a group of eleven molecules for which the reference electronic structure was difficult to converge. 77 The structures with the IDs 129158 and 117523 could not be converged at all. 77 Prediction errors are averaged across neural networks trained on 100k reference structures (only NNs that contain a given structure in the test set were considered). Note that, even though many of the structures contain a motif reminiscent of 1,2,3-oxadiazole, the presence of this motif alone cannot be the cause of the large prediction errors: the QM9 dataset contains close to 1k structures with similar motifs, for which accurate predictions are possible.]
NNs trained only on small molecules, but evaluated on all molecules (with up to 29 atoms), still yield accurate predictions (see Fig. S4). This demonstrates that the learned representations are transferable and can be used to accurately predict larger structures. Nonetheless, the performance is inferior compared to randomly chosen training sets drawn from the full data set. One possible physical explanation is the lack of an adequate description of long range interactions, which are more important for extended structures containing many atoms. These deficiencies could be addressed by explicitly including long range contributions into the prediction.
MD datasets. 81 MAEs and RMSEs for the NNs trained with different training set sizes are summarized in Table 3 and compared with results for gradient-domain machine learning (GDML). 81  Figure 6 shows a 10 ps MD trajectory of malonaldehyde. Note that the NN approach automatically leads to a reactive PES. A direct comparison of the NN-learned and MP2-reference energies yields a correlation coefficient of 0.997.

Discussion and Conclusion
Although the results show that accurate predictions can be obtained from training a NN with a descriptor based on encoding the chemical environment of an atom, it is useful to discuss potential problems and possible improvements to the prediction method. For example, even though introducing a cutoff radius R is necessary for computational efficiency, it can limit the accuracy of the neural network. Since all atoms beyond the cutoff radius of R = 3 Å are ignored in the descriptor by construction, interactions extending over larger distances cannot be captured by the present approach. Most interactions relevant in chemistry are sufficiently short ranged that this is not an issue, but there are important exceptions: Coulomb and dispersion interactions. These long range contributions to the total energy are especially important for the correct description of intermolecular interactions and are therefore crucial for condensed phase systems. While it is always possible to increase R until the error introduced by the cutoff is negligible, this is not very efficient, as a larger number of atoms would need to be considered for the calculation of the expansion coefficients c_klm (see Eq. 1 and Eq. 6). Further, it is likely that higher order expansion terms (see Eq. 2) are necessary to resolve differences between atomic environments for larger R, such that the calculation of the descriptor becomes more expensive. Fortunately, the physical laws governing Coulomb and dispersion interactions are well understood, such that it is possible to include both contributions explicitly without increasing the cutoff R.
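The locality imposed by the cutoff can be sketched as follows (function name and toy geometry are hypothetical illustrations, not code from this work): only atoms within R contribute to the descriptor of a given atom, so the number of contributing neighbors is bounded by the local density and the total evaluation cost scales linearly with system size.

```python
# Sketch: which atoms enter the local descriptor of atom i.
import numpy as np

def neighbors_within_cutoff(positions, i, R=3.0):
    """Indices of all atoms (excluding i) within distance R of atom i.
    Only these atoms contribute to the descriptor of atom i."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    mask = (d <= R) & (np.arange(len(positions)) != i)
    return np.where(mask)[0]

# Toy linear chain of atoms spaced 1.5 Å apart.
pos = np.array([[1.5 * k, 0.0, 0.0] for k in range(10)])
print(neighbors_within_cutoff(pos, 0))  # only atoms within 3.0 Å: [1 2]
```

For atoms in the middle of the chain the neighbor count is the same small constant, which is what makes the per-atom cost independent of the total number of atoms.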
For better describing Coulomb interactions, separately trained neural networks have previously been used 116 to predict environment-dependent Hirshfeld charges. 117 The electrostatic contribution E ele is then simply subtracted from the total energy E tot prior to training networks for predicting the short range contributions. The total energy can be recovered by combining electrostatic energies calculated from the predicted charges and the short range contributions. Note that only charge-charge interactions are necessary for the calculation of the electrostatic energy, as interactions between higher multipoles 118,119 decay faster and can therefore be implicitly described in the short range contributions. 60 In order to apply a similar method to the approach presented in this work, it is not necessary to introduce a second NN. Instead, the existing network could simply be trained to predict an atomic energy contribution E i and an environment dependent charge q i simultaneously, by introducing a second network output and an appropriate modification of the objective function (Eq. 11). Also, it is not necessary to rely on a charge decomposition scheme such as Hirshfeld's method 117 to obtain a reference value for q i . Recently, it was shown that a NN can be trained to predict environment dependent charges such that the electrostatic moments, a true quantum mechanical observable, are reproduced. 120 This way, no arbitrary decomposition scheme needs to be imposed.
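A minimal sketch of the subtraction scheme described above, with toy charges and a hypothetical reference energy: the pairwise charge-charge Coulomb sum is removed from the total energy before training, and added back at prediction time (the Coulomb constant is expressed in kcal mol −1 Å e −2).

```python
# Sketch: separating the electrostatic energy from the NN training target.
import numpy as np

KE = 332.0637  # Coulomb constant in kcal/mol * Å / e^2

def electrostatic_energy(charges, positions):
    """Charge-charge Coulomb energy summed over all unique atom pairs."""
    e = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            e += KE * charges[i] * charges[j] / r
    return e

# Training target:   E_short = E_tot - E_ele(reference charges)
# Prediction:        E_tot   = E_short(NN) + E_ele(predicted charges)
q   = np.array([0.4, -0.4])            # toy partial charges (e)
pos = np.array([[0.0, 0.0, 0.0],
                [2.0, 0.0, 0.0]])      # Å
e_ele   = electrostatic_energy(q, pos)
e_tot   = -150.0                       # hypothetical reference energy (kcal/mol)
e_short = e_tot - e_ele                # what the short-range NN would learn
```

Because the long-range 1/r tail is handled analytically, the NN only has to represent the remaining short-range part, which decays well within the cutoff.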
To account for long range dispersion interactions, it was shown 121 that the D3 scheme proposed by Grimme 122 for DFT calculations can be used for NNs without modification. Since the neural network is trained on DFT reference energies, the standard C 6 coefficients 122 for calculating the dispersion interaction can be reused. It should also be pointed out that environment-dependent C 6 coefficients could be predicted instead of using constant values. That way the dispersion correction is more flexible and can adapt to the reference data. This would require the introduction of another network output and a suitable modification of the objective function (Eq. 11), similar to the possible treatment of Coulomb interactions.
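The shape of such a pairwise correction can be sketched schematically (this is *not* the full D3 expression, which also includes C 8 terms, coordination-number dependence and specific damping parameters; the function, damping form and toy C 6 values below are illustrative assumptions only): an attractive −C 6 /r 6 term that is smoothly switched off at short range, where the NN already describes the interaction.

```python
# Schematic pairwise dispersion correction with Fermi-type short-range damping.
import numpy as np

def dispersion_energy(c6, positions, r0=3.0, alpha=20.0):
    """Sum of damped -C6/r^6 terms over all unique atom pairs.
    c6 is a (toy) matrix of pairwise C6 coefficients."""
    e = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            # damping ~0 for r << r0, ~1 for r >> r0
            damp = 1.0 / (1.0 + np.exp(-alpha * (r / r0 - 1.0)))
            e -= damp * c6[i, j] / r**6
    return e

c6 = np.full((2, 2), 100.0)  # toy C6 coefficient (arbitrary units)
pos_near = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])  # Å
pos_far  = np.array([[0.0, 0.0, 0.0], [6.0, 0.0, 0.0]])  # Å
e_near = dispersion_energy(c6, pos_near)
e_far  = dispersion_energy(c6, pos_far)
```

The correction is weakly attractive and grows in magnitude as the pair distance decreases toward typical nonbonded contact, exactly the regime the cutoff-limited descriptor cannot see.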
Recently, it was shown that van der Waals interactions are essential for the understanding of the properties of liquid water. 123 These findings show the importance of a correct treatment of long-range dispersion when studying condensed-phase systems.
In the present work a general atomic descriptor, which is applicable to any chemical system, was introduced. Using the descriptor as input, NNs trained on 100k reference structures can learn to accurately predict energies of structures in the QM9 dataset 77 across chemical space with a MAE of 0.41 kcal mol −1 . Although the performance is slightly worse than that of the SchNet architecture 78 (MAE of 0.34 kcal mol −1 ), the difference in accuracy is considered to be an acceptable trade-off for the increased computational efficiency, as the atomic descriptor developed here requires only strictly local information (due to the introduction of a cutoff radius R) and the network architecture is much simpler. This allows efficient calculation of thousands of atomic contributions in parallel, which is an advantage in the context of large molecular dynamics simulations. For smaller training set sizes (e.g. 50k reference structures), the method proposed in this work outperforms SchNet (Table 2). As such, fewer reference calculations are needed to obtain chemical accuracy.
Since the QM9 dataset contains exclusively equilibrium structures, it is only suited to assess transferability across chemical space. In order to demonstrate the predictive power of a NN across configurational space, the same method was also applied to data sampled from MD simulations. Using 100k reference structures, MAEs between 0.09 and 0.35 kcal mol −1 were obtained (see Table 3). Finally, it was also demonstrated that this network can be used to describe chemical reactions (here proton transfer), provided that appropriate reference structures are included in the training set. The NN is able to describe intramolecular H-transfer in malonaldehyde with a similar quality as high-level ab initio methods (Table 4).
The present approach is particularly suitable for evaluating accurate energies. In principle, it also allows forces to be evaluated efficiently, as required in molecular dynamics simulations.
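The energy decomposition and the force evaluation can be sketched as follows. The harmonic "atomic energy" below is a hypothetical stand-in for the NN output, and central finite differences stand in for the analytic gradient of the network that would be used in practice; the sketch only illustrates that forces follow directly from the total energy as its negative gradient.

```python
# Sketch: total energy as a sum of atomic contributions, forces as -dE/dx.
import numpy as np

def total_energy(positions, atomic_energy):
    """Total energy as a sum of local atomic contributions E_i."""
    return sum(atomic_energy(positions, i) for i in range(len(positions)))

def forces_fd(positions, atomic_energy, h=1e-5):
    """Forces via central finite differences of the total energy
    (a stand-in for the analytic gradient of the NN)."""
    f = np.zeros_like(positions)
    for i in range(positions.shape[0]):
        for k in range(3):
            p = positions.copy(); p[i, k] += h
            m = positions.copy(); m[i, k] -= h
            f[i, k] = -(total_energy(p, atomic_energy) -
                        total_energy(m, atomic_energy)) / (2 * h)
    return f

# Toy "atomic energy": harmonic attraction of each atom to the origin.
toy = lambda pos, i: 0.5 * np.dot(pos[i], pos[i])
pos = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(forces_fd(pos, toy))  # ≈ -positions for this toy model
```

Because each E_i depends only on atoms within the cutoff, both the energy sum and the gradient can be evaluated independently per atom, which is what makes the parallel, linear-scaling evaluation mentioned above possible.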
In addition, the method automatically leads to a reactive PES (provided that appropriate structures around the transition state are contained in the training set), as no notion of chemical bonds is introduced in the construction of the atomic descriptor. In the present work it was demonstrated that NNs trained on systems containing few atoms are transferable to larger systems, which opens the possibility of training networks using very accurate ab initio reference energies. While NNs are typically slower than empirical force fields by one to two orders of magnitude, the energy prediction is several orders of magnitude faster than ab initio methods (the energy prediction of a system with 17 atoms takes < 1 ms on a desktop computer equipped with an Intel Xeon Processor E3-1275 at 3.40 GHz) and scales linearly with respect to the number of atoms. On the same machine, training the NN takes approximately three weeks and only needs to be performed once. Depending on system size and level of theory, this is approximately the same time scale as a single ab initio calculation. While FFs are still undisputedly the fastest approximate method, NNs promise huge potential speedups, and it might be feasible to combine the two in a hybrid approach similar to QM/MM methods.
The atomic energy contributions predicted by the network are chemically intuitive and may offer new insights. For example, they can be used as a guideline for designing novel types of empirical force fields through atom typing based on quantitative information instead of chemical intuition. Finally, it is possible to systematically improve the predictions of the neural network by simply adding new reference data to the training set. As such, several properties of an "ideal PES" as put forward in the introduction are fulfilled by the present approach.
In order to use the present approach in MD simulations in a similar manner to FFs, an appropriate reference dataset is necessary to train the NN. Ideally, this dataset should contain a multitude of different chemical structures, representative of both equilibrium and non-equilibrium geometries. For a meaningful description of reactions, transition state geometries need to be included as well. Future work will focus on using the present technology in conventional and reactive MD simulations together with a physically motivated treatment of long range contributions to the energy. This is necessary to correctly describe the intermolecular interactions governing the dynamics in condensed phase simulations.