Investigation of Machine Learning-based Coarse-Grained Mapping Schemes for Organic Molecules

Due to the wide range of timescales that are present in macromolecular systems, hierarchical multiscale strategies are necessary for their computational study. Coarse-graining (CG) allows one to establish a link between different system resolutions and provides the backbone for the development of robust multiscale simulations and analyses. The CG mapping process is typically system- and application-specific, and it relies on chemical intuition. In this work, we explored the application of a Machine Learning strategy, based on Variational Autoencoders, for the development of suitable mapping schemes from the atomistic to the coarse-grained space of molecules with increasing chemical complexity. An extensive evaluation of the effect of the model hyperparameters on the training process and on the final output was performed, and an existing method was extended with the definition of different loss functions and the implementation of a selection criterion that ensures the physical consistency of the output. The relationship between the input feature choice and the reconstruction accuracy was analyzed, supporting the need to introduce rotational invariance into the system. The strengths and limitations of the approach, both in the mapping and in the backmapping steps, are highlighted and critically discussed.


Variational Autoencoders as CG tools
The Cartesian coordinates of the data generated from the aforementioned molecular simulations were used as input to train Variational Autoencoders (VAE). An Autoencoder is a type of neural network that is typically trained to reproduce its input data. Its middle layers have lower dimension compared to the input/output ones.
Therefore, owing to the presence of this bottleneck in the network architecture, the VAE learns to reconstruct the data from a compressed latent representation. The model consists of two parts, an Encoder, that in the present case maps the atomistic input to the CG latent representation, and a Decoder, that maps from the latent representation back to the atomistic level. In this work, one intermediate layer was used.
The choice of a VAE as a coarse-graining tool is based on its ability to reduce the dimensionality of the input data, by projecting them to a lower-dimensional space, and then reconstruct the initial data, by projecting them back to the initial space. Thus, in a way, it can model coarse-graining and reverse mapping at the same time. The projection to the lower-dimensional space through the Encoder should be performed in such a way that sufficient information is preserved to reconstruct the initial atomistic coordinates from the latent space. Through the Encoder, we are searching for a tensor that can project the atomistic coordinates to a desired coarse-grained space, and through the Decoder we are searching for a tensor that can project the latent vector back to the initial atomistic coordinates. Let E be the Encoder's tensor, which performs the projection E : ℝ^(3N) → ℝ^(3n), with N, n ∈ ℕ*, N > n,
where N is the number of atoms in the full atomistic representation and n is the number of coarse-grained moieties that substitute the atoms. E can be written in matrix form as E = (E_ij), with i = 1, …, n and j = 1, …, N. Let x be the initial atomistic coordinates and z the coordinates in the coarse-grained space. The i-th coarse-grained moiety's coordinates are given by the following relationship:

z_i = Σ_{j=1}^{N} E_ij x_j .

Some important characteristics of the matrix E are that E_ij ≥ 0 and that its rows are normalized to 1. This normalization is necessary in order to interpret the E_ij as coefficients indicating the participation of each atom in each coarse-grained moiety [12]:

Σ_{j=1}^{N} E_ij = 1 .

The matrix E is obtained here from the row-wise normalization of the weight matrix w of the Encoder. Therefore, the atomistic coordinates that are used as input data are connected to the Encoder's neurons via weights that are stored in w. The weight values are expressed as a function of stochastic parameters φ_ij as follows:

w_ij = exp[(log φ_ij + g_ij)/τ] / Σ_{k=1}^{n} exp[(log φ_kj + g_kj)/τ] .

The parameters φ_ij have a probabilistic interpretation. Specifically, φ_ij expresses the probability of assignment of an atom j to a coarse-grained moiety i; thus, the coefficients are normalized when summed over i. An assignment matrix A can be defined, whose column j contains the probabilities of atom j to belong to each coarse-grained moiety i. Each atom is assigned to the moiety with the highest assignment probability:

a_j = argmax_i (w_1j, w_2j, …, w_nj) .

Moreover, the w_ij are sampled from the Gumbel-Softmax distribution, where a reparametrization is applied to ensure that the network is differentiable and to allow backpropagation to take place during the training process [13]. The above is known as the Gumbel-Softmax reparametrization trick.
The parameters φ_ij correspond here to the Encoder's weights, and thus to the probability of an atom j to belong to a coarse-grained moiety i, while g_ij is a parameter sampled from the Gumbel distribution. Finally, τ is a "temperature" parameter whose value follows a stepwise decrease until the end of the training, making the distributions of atoms over coarse-grained moieties narrower and approaching the limit of a one-hot encoding.
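As an illustration of the sampling and normalization steps just described, the following sketch (NumPy; the function and variable names are hypothetical, and this is not the code used in this work) builds a mapping matrix E from assignment logits and applies it to one configuration:

```python
import numpy as np

def gumbel_softmax_encoder(log_phi, tau, rng):
    """Sample encoder weights w_ij from the Gumbel-Softmax distribution
    and row-normalize them into the mapping matrix E.

    log_phi : (n, N) assignment logits (n CG beads, N atoms)
    tau     : softmax temperature (annealed stepwise during training)
    """
    # Gumbel(0, 1) noise, one sample per logit
    g = -np.log(-np.log(rng.uniform(size=log_phi.shape)))
    scores = (log_phi + g) / tau
    # softmax over the bead index: each atom's weights sum to 1 over beads
    w = np.exp(scores - scores.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # row-normalize so each CG bead is a weighted average of atom positions
    E = w / w.sum(axis=1, keepdims=True)
    return E, w

rng = np.random.default_rng(0)
n_beads, n_atoms = 2, 8            # e.g. a 2-bead mapping of ethane (C2H6)
log_phi = rng.normal(size=(n_beads, n_atoms))
E, w = gumbel_softmax_encoder(log_phi, tau=0.5, rng=rng)

x = rng.normal(size=(n_atoms, 3))  # atomistic Cartesian coordinates
z = E @ x                          # CG bead coordinates, z = E x
assignment = w.argmax(axis=0)      # hard assignment: bead with highest weight
```

As τ decreases, the softmax over beads sharpens and each atom's weights approach a one-hot vector, reproducing the annealing behavior described above.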
Following the projection to the coarse-grained level, the Decoder uses the latent vector produced by the Encoder to reconstruct the full atomistic space. Let D be the Decoder's tensor. Mathematically, it can be described as D : ℝ^(3n) → ℝ^(3N), with N, n ∈ ℕ*, and it can be expressed in matrix form as D = (D_ji), with j = 1, …, N and i = 1, …, n. The reconstructed vectors are given by the following relation:

x'_j = Σ_{i=1}^{n} D_ji z_i .

In summary, after a successful VAE training, we obtain a coarse-grained mapping and a reverse mapping at the same time.
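For intuition only, the action of a linear decoder D can be mimicked by fitting it on a batch of configurations in closed form; this is a sketch with random data and hypothetical names, a stand-in for the Decoder weights that the model actually learns by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, M = 8, 2, 100                  # atoms, CG beads, configurations
E = np.abs(rng.normal(size=(n, N)))
E /= E.sum(axis=1, keepdims=True)    # row-normalized mapping matrix

X = rng.normal(size=(M, N, 3))       # batch of atomistic configurations
Z = np.einsum('bn,mnd->mbd', E, X)   # CG coordinates, z = E x

# flatten configurations and Cartesian components, then solve
# min_D ||X - D Z||^2 in closed form with the pseudoinverse
Zf = Z.transpose(1, 0, 2).reshape(n, -1)   # (n, M*3)
Xf = X.transpose(1, 0, 2).reshape(N, -1)   # (N, M*3)
D = Xf @ np.linalg.pinv(Zf)                # (N, n) decoder matrix
X_rec = np.einsum('nb,mbd->mnd', D, Z)     # lossy reconstruction x' = D z
```

Since n < N, the round trip x → z → x' cannot be exact: the bottleneck discards information, which is precisely what makes the choice of mapping matter.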

Loss Function
In order to train the VAE, a loss function should be defined. For VAEs, the most typical choice is an error metric, such as the mean squared error (MSE), calculated from the difference between the input x and the reconstructed output x':

ℒ_rec = (1/M) Σ_{m=1}^{M} ‖x_m − x'_m‖² .  (1)

In addition to the reconstruction, Wang et al. [6] proposed the addition of an instantaneous force regularization term, to promote the learning of mapping schemes that minimize the average force in the CG space, thus yielding a smoother potential energy hypersurface and facilitating the subsequent learning of a CG force field [14]. The loss function is thus defined as

ℒ = ℒ_rec + ρ ℒ_forces ,  (2)

where ρ acts as a weighting factor that controls the balance between the two components of the loss function. Indeed, depending on the choice of units, the instantaneous average force can be orders of magnitude larger than the reconstruction part of the loss, leading to extreme behaviors of the model during the optimization process, which will be discussed in Section 3. Choosing an appropriate value for the hyperparameter ρ can mitigate this effect. However, the choice of an optimal value for ρ is heavily dependent on the chemical system examined, which is an obstacle to a transferable application of the method. Therefore, an alternative loss function is proposed here, based on the observation that the optimal value of ρ for avoiding degenerate behavior was the one that brings both terms to similar orders of magnitude. To achieve this in a system-independent manner, both the reconstruction and the forces part of the loss are normalized. The normalization is performed by dividing each term by the maximum value it has assumed up to the current epoch of the optimization, and it can be written as

ℒ = α ℒ_rec / max_{k≤K}(ℒ_rec^(k)) + (1 − α) ℒ_forces / max_{k≤K}(ℒ_forces^(k)) ,  (3)

where α ∈ [0,1] is a hyperparameter that regulates the balance between the two terms of the loss function and K is the current optimization epoch. In each denominator, the maximum ℒ_rec and ℒ_forces value recorded up to the current epoch of the optimization process is used.
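The running-max normalization of Eq. (3) can be sketched as follows (a minimal illustration with hypothetical names, not the training code of this work):

```python
def normalized_loss(l_rec, l_forces, alpha, running_max):
    """One epoch of the normalized loss of Eq. (3): each term is divided
    by the largest value it has reached up to the current epoch, so the
    two stay on comparable scales regardless of the system and units."""
    running_max["rec"] = max(running_max["rec"], l_rec)
    running_max["forces"] = max(running_max["forces"], l_forces)
    return (alpha * l_rec / running_max["rec"]
            + (1.0 - alpha) * l_forces / running_max["forces"])

# toy per-epoch values: reconstruction ~1e-1 while the unscaled force
# term is orders of magnitude larger, as for ethylbenzene-sized systems
running_max = {"rec": 0.0, "forces": 0.0}
epoch_losses = [(0.5, 200.0), (0.3, 150.0), (0.2, 300.0)]
totals = [normalized_loss(lr, lf, 0.5, running_max)
          for lr, lf in epoch_losses]
```

Each normalized term lies in (0, 1], so the total loss stays bounded by 1 and the balance between the two contributions is controlled only by α, not by the units of the forces.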
For the loss function expressed by Eq. (2) and Eq. (3), two different schemes were examined. In the first one, the forces part is activated only after the reconstruction has reached a predefined threshold value, akin to what is suggested by Wang et al. [6]. The purpose of this criterion is to proceed with the forces part of the optimization only after the reconstruction part has reached a desired level. In the second case, both terms were computed from the beginning. No systematic difference resulted from these two implementations; therefore, both terms were kept from the beginning for simplicity.

Mapping connectivity to the CG space
The determination of the connections (CG bonds) between the coarse-grained moieties provides critical information about a system's configuration in the coarse-grained space, especially for complex systems with multi-bead representations. To map the connectivity from the atomistic to the CG space, as a first step, an adjacency matrix (an undirected graph) is built to represent the connections between the atoms in the full atomistic configuration. Each atom is assigned a unique identifier, which is an integer number. Let AC be the adjacency matrix: if an atom k is connected to an atom l, then AC,kl = AC,lk = 1, otherwise AC,kl = AC,lk = 0. After training the VAE model, we deduce from the assignment matrix to which moiety each atom is assigned. If there is a connection between two atoms k and l that are assigned to two different moieties s and t, i.e., if AC,kl = 1, then a connection between the moieties s and t is also present. Let CGC be the coarse-grained level connectivity matrix: when there is a connection between two moieties s and t, then CGC,st = CGC,ts = 1, else CGC,st = CGC,ts = 0. For the ethane example, if we end up with two moieties, the first containing the atoms with IDs 1, 2, 3 and 4 and the second the atoms with IDs 5, 6, 7 and 8, then, because of the bond between the carbons belonging to the two different moieties, the matrix describing the connections between the moieties can be written as CGC = (0 1 ; 1 0).

In all tests, the initial and final values used for τ were 4 and 10⁻¹, respectively. An early stopping criterion was implemented to consider training concluded if, for 10 consecutive epochs, the loss function did not vary by more than 10⁻². Therefore, the number of epochs used for training was either determined by the early stopping criterion or by an upper threshold of 5 × 10³. With these assumptions, five hyperparameters remain, and a grid search was conducted to determine their effect on the final value of the loss, on the mapping, and on the reconstruction.
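Returning to the connectivity mapping described above, the procedure can be sketched as follows (0-based atom indices and hypothetical names, unlike the 1-based IDs of the ethane example; not the code used in this work):

```python
import numpy as np

def cg_connectivity(A, assignment, n_beads):
    """Map atomistic connectivity to the CG level: two moieties s and t
    are bonded if any atomistic bond crosses between them."""
    C = np.zeros((n_beads, n_beads), dtype=int)
    # iterate over the upper triangle, i.e. each bond once
    for k, l in zip(*np.nonzero(np.triu(A, 1))):
        s, t = assignment[k], assignment[l]
        if s != t:                       # bond crosses two moieties
            C[s, t] = C[t, s] = 1
    return C

# ethane with 0-based IDs: carbons 0 and 4, each bonded to three hydrogens
A = np.zeros((8, 8), dtype=int)
for k, l in [(0, 4), (0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7)]:
    A[k, l] = A[l, k] = 1
assignment = np.array([0, 0, 0, 0, 1, 1, 1, 1])
CGC = cg_connectivity(A, assignment, 2)  # [[0, 1], [1, 0]]: one CG bond
```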
This preliminary investigation of hyperparameter optimization was conducted considering the loss definition given by Eq. (2). The data were split randomly into 3 sets: 80% of the configurations were used as the training set; 10% were used as the test set, which allows one to evaluate how the model performs on data unseen during training, for a fixed set of hyperparameters; and 10% were used as the validation set, to compare the accuracy of models trained with different hyperparameters. Both the Adam optimization algorithm and mini-batch Stochastic Gradient Descent (SGD) were tested, leading to similar results. The weights of the Neural Network were initialized randomly.
From the systematic testing of the various hyperparameter combinations, we observed significant variability in the obtained results, in terms of the mappings produced, the final values of the loss, and the reconstruction accuracy. No systematic trends or correlations with the hyperparameter values seemed to emerge for the learning rate, the batch size, or the decay ratio. Some weak tendencies were observed: on average, a larger batch size and larger decay ratio values yielded lower loss values; therefore, 10² and 10⁻¹ were adopted for the batch size and the decay ratio, respectively. Concerning the learning rate, provided that enough epochs were used to allow the training to converge, we did not observe a clear effect on the final results; therefore, a higher learning rate was used, for computational efficiency. The effect of the force regularization weight ρ is discussed in detail in Section 3.2, as it has a profound influence on the produced mappings. Apart from the hyperparameter values, it was noted that a significant source of variability of the results lies in the random initialization of the network weights, which will be further discussed in Section 3.3.

Effect of force regularization and loss normalization
The value of ρ and the relative magnitudes of the two loss terms are tightly interconnected, as noted also by Wang et al. [6]. In particular, for the systems analyzed in this work, the reconstruction component of the loss typically reaches values of the order of 10⁻¹ or 10⁻², while the unscaled force component grows by orders of magnitude as the complexity of the molecule increases: ethane ~10¹, ethylbenzene ~10², CTA ~10³, PIM-1 ~10³, in units of kcal²/(Å²·mol²). Even though this difference across systems is expected, it adds difficulties to the selection of the parameter values and to the interpretation of the results.
Moreover, it is not clear what criterion should be used to guide the selection of an appropriate value for ρ, as lower values will always result in lower losses. It seems important, though, that the force regularization does not significantly exceed the reconstruction loss, as this would otherwise trigger problematic behavior during model training.
Especially in the cases where the force regularization is predominant in the loss function, the model progressively lumps together more atoms, effectively using fewer CG beads to perform the assignment than the total number available. This continues until the extreme case is reached, in which the majority of the encoder weights are zeroed and the whole molecule is mapped into a single CG bead. This behavior is independent of the VAE latent dimension size. Although this clearly minimizes the force component of the loss, it significantly worsens the reconstruction accuracy. If the model is intended to be used to perform backmapping, through the Decoder, this decrease in the reconstruction accuracy could be unacceptable.
An example of this behavior is shown in Fig. 2 for the case of PIM-1, using ρ = 10⁻². Even though the latent dimension of the VAE is 10, the model progressively reduced the number of beads effectively used in the assignment until only one was left. Around epoch 300, when the reduction from 2 to 1 used CG beads takes place, it can be observed that the reconstruction component of the loss indeed increases almost by a factor of 3 compared to the lowest values reached during previous epochs. It is important to note that, even if the model does not reduce to a single bead, depending on the value of ρ it is not always possible to achieve the desired CG level, equal to the size of the latent dimension, since part of the latent variables will be discarded by the model. Moreover, it should be noted that, when the force regularization is present, the loss depends mostly on the number of used CG beads, rather than on how the atoms are subdivided among the moieties.
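The progressive bead collapse described above can be monitored during training by counting how many latent beads actually receive atoms under the hard assignment; a minimal check (hypothetical helper name, not the code of this work):

```python
import numpy as np

def n_effective_beads(w):
    """Count the CG beads that receive at least one atom under the hard
    assignment derived from the encoder weights w (n_beads x n_atoms)."""
    return np.unique(w.argmax(axis=0)).size

# collapsed case: bead 0 dominates every atom, the other rows are unused
w_collapsed = np.array([[0.90] * 6, [0.05] * 6, [0.05] * 6])
# partial case: atoms split between beads 0 and 2, bead 1 is discarded
w_partial = np.array([[0.8, 0.8, 0.8, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.8, 0.8, 0.8]])
```

Logged once per epoch, such a counter makes the transitions visible (e.g. the drop from 2 to 1 used beads around epoch 300 in Fig. 2), even when the latent dimension itself stays fixed.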
Since the value of ρ that brings the force and reconstruction terms of the loss to a comparable level is system-specific, the loss defined by Eq. (3) was introduced. With this normalization, the degenerate cases in which all atoms collapse into a single CG bead were avoided, and the need to search for an optimal value of ρ for each system is bypassed. Indeed, the equivalent of ρ in the normalized loss, α, did not have a systematic effect on the results for values of 0.1, 0.5 and 0.8. More extreme values were not considered, because they would defeat the purpose of the normalization, which is to bring the two components of the loss to the same order of magnitude. Therefore, we observed that α could also be eliminated (which is equivalent to setting it to 0.5), and the model implementation could be simplified by removing one hyperparameter.

Multiplicity of mappings and connectivity preservation
An unexpected finding of this study is that the mapping obtained from the trained VAE model depends on the random initialization of the network weights. Training the model multiple times with the same hyperparameters yielded different, non-equivalent mappings. This was a general behavior, manifested for all hyperparameter combinations and for all the molecules tested, both simple and complex ones, as shown in Fig. 3 for C2H6 and Fig. 4 for PIM-1. In particular, in the latter case, not only the assignments but also the number of CG moieties chosen by the model differ, which was a common occurrence. Fig. 4 also displays another problematic feature encountered in several of the produced results: the VAE model sometimes groups into the same CG bead atoms that are located far apart, thus breaking, in the CG representation, the link with the connectivity of the underlying atomistic model.
In order to avoid these unphysical mappings, an additional term was at first introduced into the loss function to penalize connectivity-breaking assignments. An assignment is considered connectivity-preserving if, for all pairs of atoms belonging to a CG moiety, a connecting path can be found that does not involve atoms belonging to a different CG moiety. Although a promising concept, this loss function did not produce satisfactory results. In particular, we observed that, in order to minimize the connectivity penalty, degenerate mappings with all atoms included in the same CG moiety were produced. Moreover, the evaluation of this loss function is significantly more costly, as it requires the creation of subgraphs for the atoms belonging to each moiety. Therefore, this constraint was instead enforced by implementing an acceptability verification at the end of training, ensuring that only connectivity-preserving mappings were retained, and repeating the training with a different random initialization of the network until an acceptable solution was found. It should be noted that the presence or absence of the force regularization does not prevent the generation of connectivity-breaking mappings.
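The acceptability verification can be sketched as a per-moiety breadth-first search restricted to each moiety's own atoms (hypothetical names; a minimal illustration of the criterion, not the code used in this work):

```python
import numpy as np
from collections import deque

def is_connectivity_preserving(A, assignment):
    """Accept a mapping only if, within every CG moiety, all atoms can be
    reached from one another through bonds internal to that moiety."""
    for bead in np.unique(assignment):
        atoms = np.flatnonzero(assignment == bead)
        inside = set(atoms.tolist())
        # BFS over the moiety's subgraph, starting from its first atom
        seen, queue = {atoms[0]}, deque([atoms[0]])
        while queue:
            k = queue.popleft()
            for l in np.flatnonzero(A[k]):
                if l in inside and l not in seen:
                    seen.add(l)
                    queue.append(l)
        if len(seen) != len(atoms):      # moiety falls apart internally
            return False
    return True

# ethane (0-based IDs): carbons 0 and 4, each bonded to three hydrogens
A = np.zeros((8, 8), dtype=int)
for k, l in [(0, 4), (0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7)]:
    A[k, l] = A[l, k] = 1
```

For example, grouping each carbon with its own hydrogens passes the check, while moving one hydrogen of the first carbon into the other bead fails it, because that atom is only bonded to atoms outside its moiety.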

Effect of diversity between configurations
As previously anticipated, when the force regularization is included in the loss, the reconstruction accuracy tends to decrease. Therefore, in order to investigate the possibility of using the Decoder to backmap from the CG space to the atomistic space, the following tests were conducted using the loss function defined by Eq. (1). To test the approach on a C2H6 molecule, two models with latent dimensions of 2 and 3 nodes were considered. Two CG beads correspond to the resolution obtained in the study by Wang et al. [6] for the same molecule. In Fig. 5, the trend of the loss function during the training can be observed, as well as the progression of the reconstructed molecule. For the 2-bead case, the model indeed converges to a chemically intuitive partitioning of the system; however, surprisingly, the reconstructed molecule lies on a single plane. Since a VAE model provides a lossy compression of the input, some information loss is to be expected. In the case of low dimensionality, both in the input/output and in the latent space, the expressive power of the model might be insufficient to reconstruct the system in 3D.
On the other hand, with 3 nodes in the latent dimension, the VAE learns to accurately reconstruct the molecule. However, as will be discussed in detail also for ethylbenzene, this positive result is strongly dependent on the molecular configurations present in the dataset. In particular, accurate reconstruction hinges on the fact that the various configurations considered did not differ significantly from one another in their positions, and only a limited portion of the 3D space was explored within that dataset.
In the case of ethylbenzene, configurations with larger differences in terms of positions are present in the dataset. When the model is trained using all available configurations, the reconstruction of the aliphatic section of the molecule fails (Fig. 6a). In the dataset, those atoms explored a larger portion of the 3D space, rotating around the bond connecting to the aromatic ring. In the reconstruction, each atom is placed by the VAE in the average position that it spanned in the dataset. As can be observed, this leads to a backmapped configuration with unrealistic overlaps between atoms. Interestingly, the aromatic section is reconstructed well; for this section of the molecule, the displacement around the average position of each atom was less significant. If the model is trained with fewer, more similar configurations (Fig. 6b), a good reconstruction is obtained for the whole molecule. This behavior is exemplified here for ethylbenzene, but similar results were obtained also for CTA and PIM-1.

Absence of rotational invariance
A limitation encountered in the use of the VAE model (in the current implementation) to perform backmapping lies in the fact that it does not satisfy rotational invariance, as exemplified in Fig. 7 for the case of CTA. In the upper part of the figure, it can be observed that a model trained with a small set of similar configurations (in the spirit of Fig. 6b) indeed exhibits high reconstruction accuracy. However, when the same model is applied to a rotated version of the same input configuration, it is unable to reconstruct it. This behavior stems from the use of Cartesian coordinates as input features and suggests that the investigation of a rotationally invariant input feature, such as interatomic distances, should be pursued to overcome this limitation. In the current implementation, these limitations severely hinder the application of the VAE model to a multi-molecule system, or to a set of configurations (trajectory) where the molecules explore different positions, rotations, and conformations. Training on such data would produce a model that only reconstructs the average position of each atom, yielding unrealistic molecular conformations.
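The issue can be seen directly by comparing Cartesian coordinates with a rotationally invariant alternative such as the interatomic distance matrix; a small sketch (hypothetical names, random coordinates):

```python
import numpy as np

def pairwise_distances(x):
    """Interatomic distance matrix: invariant under rigid rotations
    and translations of the molecule."""
    d = x[:, None, :] - x[None, :, :]
    return np.sqrt((d ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                    # Cartesian coordinates, 8 atoms
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
x_rot = x @ Q.T                                # the same molecule, rotated

# the Cartesian input changes, so a model trained on x sees x_rot as a
# "different" molecule, while the distance matrix is identical for both
```

A model fed with distances (or other invariant descriptors) would therefore receive identical inputs for all orientations of the same conformation, at the cost of having to recover Cartesian coordinates in a separate step.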

CONCLUSIONS
In this work, we investigated the use of a VAE to perform automatic mapping of atomistic systems into coarse-grained beads and to reconstruct the atomistic detail, i.e., backmapping. The model contains several hyperparameters, whose optimization is a costly step in terms of computational resources. In this architecture and for the systems studied, the learning rate, the batch size and the decay ratio did not appear to play a critical role in the results; therefore, computational efficiency was the guiding criterion in their selection. Within the range of values tested, their effect seemed to be overshadowed by the effect of randomness in the network initialization. Therefore, with the exception of the force regularization weight and the latent dimension, it would seem unnecessary to conduct an extensive search for optimal values of the VAE hyperparameters.
If a normalized definition of the loss function is introduced, it is possible to avoid the problematic behavior related to the strong effect of the force regularization, i.e., the collapse of all atoms into one single CG bead. Moreover, the normalization avoids the necessity to introduce, and optimize, the force regularization weight as an additional hyperparameter. In fact, the same regularizing effect can be achieved by adopting different values of the latent dimension (a lower number of beads corresponds to a smoother potential energy hypersurface in the CG space), combined with the filtering step of the output, which eliminates the connectivity-breaking mappings that would introduce unfavorably high-energy conformations of the CG molecule.
These observations possibly imply that the introduction of the force regularization is not necessary, and that avoiding it has the added benefit of a higher reconstruction accuracy.
One of the more general observations of this study was the variety of results obtained, which did not exhibit any apparent correlation with the network hyperparameters and seemed to result from the different initializations of its weights. Evaluating which of the outputs is preferable requires considerations related to the intended use of the produced mapping, and may vary from system to system and from application to application. The VAE model, as modified in this study, can produce a variety of physically consistent options at various levels of resolution, which can be further screened by devising fitness criteria for the specific application.
Concerning the reconstruction, in this implementation the model can only reproduce realistic configurations when trained on sufficiently similar input data: when a more diverse set is used in the training, only the average position of each atom is learned and reproduced in the reconstructed output.
From the analysis performed and the limitations highlighted, opportunities for future development concern the adoption of input features that preserve rotational and translational invariance during training. Moreover, as there is no unique way to backmap the system from the CG to the atomistic level, the reverse mapping strategy should be extended by modifying the Decoder implementation so as to produce a variety of backmapped configurations, following the probability distribution of the underlying atomistic system. The ultimate goal would be to couple the mapping procedure with the development of appropriate ML-based CG interaction potentials [7,15,16], which will enable molecular simulations at the CG level and provide a complete evaluation of the extracted mappings in conjunction with those potentials.