Direct Steering of de novo Molecular Generation using Descriptor Conditional Recurrent Neural Networks (cRNNs)

of


Introduction
The disruptive impact deep generative models have delivered over the past couple of years in the domain of de-novo drug design is profound, due to offering the capability of directing the generative process towards chemical regions of interest [1][2][3]. Specifically, deep learning has found applications in biological tasks such as bioactivity and synthesis prediction, image processing and de novo design of novel molecules [4]. A challenging task deep networks tried to address is the inverse molecular design [5] which refers to the generation of molecular structures that meet desired conditions, such as specific physicochemical properties or properties predicted by Quantitative Structure-Activity Relationship (QSAR) models.
The Simplified Molecular-Input Line-Entry System (SMILES) [6] is a popular choice [7] to represent molecules when using recurrent neural networks (RNNs). The alphanumeric nature of SMILES strings makes them compatible with state-of-the-art natural language processing algorithms performing sequence modelling and generation, such as RNNs. In particular, RNNs are a widely accepted approach to the task of sequence modeling because of their ability to memorize past inputs and incorporate them into their inference [8].
Unbiased RNN generative models trained on a relatively small number of SMILES strings were shown to be able to cover a much larger chemical space [9]. Moreover, the augmentation of a dataset using SMILES with randomized atom order has demonstrated state-of-the-art performance with respect to the uniformity and completeness of the coverage of chemical regions, compared to simply using their canonical variants [10]. After learning the general rules of the chemical space, e.g. atom types, bond types and size of molecules, the prior network can be further specialized using smaller datasets in a transfer learning fashion [11] or using reinforcement learning [12][13][14].
More complicated architectures such as autoencoders [15], which include two jointly-trained neural networks responsible for converting the input to and back from a latent representation, have been extensively benchmarked [16,17]. The quality of the latent space of an autoencoder was also proven to benefit from the usage of randomized SMILES strings [18][19][20]. Moreover, the latent space representation of a molecule can be used towards optimizing QSAR endpoints using GANs [21], Bayesian Optimization [15] or Particle Swarm Optimization [22]. The combination of a heteroencoder [19], that is trained on pairs of randomized SMILES strings of the same molecule, with a generative adversarial network (GAN) [23] has further demonstrated automatic navigation towards properties of interest.
Alternatively, learning to precondition the structure generation eliminates the need for optimization loops. One approach demonstrated this capability by concatenating SMILES strings with the properties of interest as input to a variational autoencoder [24]. Molecular graphs [25] have also been used in pairs along with the desired change in properties as conditions to train a variational autoencoder on.
Latent representations that are generated from a GAN architecture may also be exploited as input conditions for decoding neural networks [23].
In this work, we demonstrate that molecule side information, such as molecular descriptors, can be incorporated into the RNN-based generative process. We construct conditional recurrent neural networks (cRNNs) by setting the internal states of long short-term memory cells (LSTM [26]) as the input conditions. The architecture is related to, but conceptually simpler than, a conditional autoencoder as we only utilize an RNN-based decoder part. The generation is conditioned with properties calculated directly from the molecular structure or QSAR models, thus the encoder part is no longer needed. The conditional seed successfully steers the focus of the RNN towards a particular subset of the chemical domain, such as bioactive compounds with respect to a specific protein target.
Our approach complements the existing state-of-the-art conditional generative models such as conditional VAE, reinforcement learning etc. and may be used for populating molecular libraries.

Datasets
The datasets used in this work originate from two publicly available sources: ChEMBL [27] and ExCAPE-DB [28]. Data from ChEMBL were used to train the generative neural network and data from the dopamine receptor D2 (DRD2) target data in ExCAPE-DB were used to train a Quantitative Structure-Activity Relationship (QSAR) model using a support vector classification model to estimate the potency of a generated compound towards DRD2.

ChEMBL
The neural network was trained with a subset of the ChEMBL version 25. Initially, the complete dataset has been standardized using the MolVS Python module [29] using the super parent setting, which standardizes fragment, charge, isotope, stereochemistry and tautomeric states. Molecules were filtered to only contain the atoms [H, C, N, O, F, S, Cl, Br] with total heavy atoms less than 50. Next, the known active molecules found in the DRD2 dataset (see below) were removed from the dataset. The dataset was split into training and test subsets with a 9:1 ratio. During training, 10% of the training subset was used as a fixed validation set. [31] were removed from the DRD2 dataset. All compounds with a pXC50 value greater than five were selected as known actives along with 100,000 random DRD2 measured inactive compounds from ExCAPEDB. Stereochemical information was removed by converting all molecules to non-isomeric SMILES strings. The dataset was further reduced to exclude SMILES strings that were longer than the ones in ChEMBL or contained characters not found in ChEMBL. This led to removing strings with iodine and phosphorus. All active molecules were clustered based on the pairwise Tanimoto distance of their Morgan fingerprints with a radius of two using the implementation of the Butina algorithm [32] found in RDKit. The maximum distance threshold for the algorithm to associate neighbours was fixed to 0.4 with a value above it dictating different clusters. All clusters were sorted based on their size and were assigned to the train, validation and test subsets iteratively using a "4-1-1" scheme, i.e. for every four clusters assigned to the train set, one cluster was assigned to the validation set and one cluster to the test set in order of decreasing cluster size.

SMILES strings Randomization and Vectorization
During training, the atom order of all molecules was randomized using RDKit. After converting them back to SMILES, every constituting character was one-hot encoded. Every SMILES string was thus represented by a two-dimensional array with dimensions corresponding to the length of the vocabulary and the maximum canonical SMILES length found in ChEMBL, with an offset of five extra characters to account for randomized SMILES which were longer than their canonical representation.
The characters "^" and "$" were inserted in the beginning and end of each one-hot encoded string respectively. Resulting arrays that corresponded to shorter SMILES strings were padded with the endcharacter "$". The considered vocabulary consisted of 35 tokens that included all common unique alphanumeric characters found in ChEMBL and DRD2 datasets after filtering, the delimiters "^" and "$" and the token "?" to account for one-hot encoding of unknown characters.
The randomization and vectorization of all SMILES strings was performed dynamically using a modified version of the molvecgen Python package [33] during training.

Recurrent Neural Network
The neural network resembles the decoder architecture described in [19]. It was implemented in Keras v2.2.4 [35] with TensorFlowGPU v1.12.0 backend [36] and it is schematically shown in Figure 1 The model was trained for 100 epochs with randomized SMILES strings following the Teacher's Forcing method [38], using the ground truth at each step as prior knowledge instead of the character previously predicted by the network. A batch size of 128 sequences was used along with the Adam optimizer with default parameters [39] and an initial learning rate of 10 −3 . A custom learning rate schedule was used, where the learning rate was kept constant for the first 50 epochs and then decayed exponentially at each epoch, down to a value of 10 −6 at the final epoch.
A copy of the trained model was modified for the purpose of predicting single characters to jointly form SMILES strings. While maintaining the trained connection weights, the shape of the output of the last feedforward layer was set to a one-dimensional vector expressing the probability of sampling each of the known characters at every step. Also, the LSTM layers were set to stateful mode. During inference, a single character per iteration is sampled out of this vector of probabilities using multinomial sampling. After setting the initial states according to the descriptors of interest, the biased generation is triggered by feeding the start-character "^" to the network and ends when the endcharacter "$" is sampled.
Two different cRNN models were constructed and trained following this procedure, each based on different input descriptors. The first PhysChem-Based (PCB) model is shown schematically in Figure 1A.
The model uses the Wildman-Crippen partition coefficient (LogP) [40], topological polar surface area (TPSA), molecular weight (MW), number of hydrogen bond acceptors (HBA), number of hydrogen bond donors (HBD) and the drug-likeness score (QED) [41] calculated using their RDKit implementations as well as the soft label predicted by the QSAR SVC model described above. The calculated values were scaled individually to achieve a distribution with zero mean and unit variance and they were concatenated. The second FingerPrint-Based (FPB, Figure 1B) model was trained solely on Morgan fingerprints of radius two and 2048 bits, which are similar to Extended Connectivity Fingerprints (ECFP). The training and inference schemes of the cRNN models are described in Figure 1A-B and Figure   1C respectively. Model training and inferencing was performed on an NVIDIA Tesla V100 GPU on a 64-bit CentOS v7.5 server with 128 GB of RAM. The training process of the PCB and FPB models utilized 5 and 25 GB of RAM, respectively.

Figure 1: Conditional Recurrent Neural Network (cRNN) models based on different conditions. A) The PhysChem-Based (PCB) model accepts six scalar properties calculated by RDKit Python library and concatenates them with the probabilistic bioactivity prediction of the QSAR model. B) The FingerPrint-Based (FPB) model accepts a 2048bit Morgan fingerprint vector calculated by RDKit. Both models are trained on randomized SMILES strings as targets. C) Model inference is biased by the conditional seed and triggered by the starting character "^".
Inferencing stops when "$" is generated.

Transfer Learning Model
The baseline model consists of the same neural network architecture as described above with the notable difference that the initial states, instead of being set based on known descriptors, are rather being reset to zero in the beginning of the generation of each string. This approach is similar to the prior network described in [12] with the difference that each character is treated independently rather than within multi-character tokens. The network was likewise trained with Teacher's Forcing, learning the character set and the grammar of the SMILES strings found in ChEMBL. The selected RNN dimensions were identical to the ones in the case of the conditional RNN.
Next, the prior model was further trained exclusively with the known actives of the DRD2 train dataset for an additional 200 epochs, following a transfer learning strategy [42]. The initial learning rate was set to 10 −4.5 and it was decayed exponentially down to 10 −6 by the end of the training.

Likelihood of Known Sequences
The likelihood of sampling a given SMILES strings was estimated using the negative log likelihood (NLL) as previously described [9], with a modification that incorporates the knowledge that is induced into the initial states of the generation in the case of a conditional model. The conditional negative loglikelihood is described in Equation 1.
are the characters in the known SMILES sequence S, are the predicted model outputs, N is the length of the sequence S and c refers to the seeding conditions. The sign of the log-likelihood is negated to reflect that higher values correspond to more improbable sequences.

Resulting Datasets
The filtering process described in the previous section resulted in the sizes of datasets shown in Table   1. The QSAR Support Vector Classification model with parameters C=5.53 and γ=0.022 was selected as the one with the highest F1-score (0.92) towards the DRD2 validation set. This model was used to label all compounds in the ChEMBL dataset leading to 2.3% of ChEMBL compounds being classified with a probability greater than 50% of being active against the DRD2 receptor ( Table 1). As shown in Figure   2, the property distribution of the two datasets largely follows each other, except that the QED score of the DRD2 dataset is shifted towards higher values since those molecules are expected to be a priori drug-like.  (2) CHEMBL_TEST 149,679 2.3 (2) (1): Known active compounds (2): Predicted active compounds (probability ≥ 0.5) by the QSAR model

NLL distributions of datasets
The NLL of sampling all molecules in the ChEMBL25 dataset and all known active compounds in the DRD2 dataset was calculated with respect to all different models based on canonical SMILES strings.  Additionally, the position of the distributions can be interpreted in two ways: first, the closer to zero the NLL distribution moves, the more deterministic the output of the model gets. This can be due to either limited generalizability of the model or more detailed description of the target, such as in the case of a conditional network. Second, differences in NLL distributions between train and test sets can be a sign of overfitting or mode collapse [9]. This seems to be the case with the transfer learning model, which exhibits a distribution with a lower mean NLL towards the active compounds in the DRD2 train dataset compared to the unseen active ones in the DRD2 test set. In contrast, the NLL distributions of sampling all four datasets with either of the conditional networks regardless of the dataset are on par, which makes overfitting a less likely cause. Here, the similar distributions regardless of a dataset demonstrates that the conditional models are able to generate both active and inactive compounds at equal ease, given that the states are set accordingly.

Sampling of Active Molecules
The structures shown in Figure 4 were generated by the two conditional networks using known active compounds from the DRD2 test set as conditional seeds, which are shown in the centre. The exemplified molecules in the dashed circle were generated by the FPB model whereas the ones outside of it were generated by the PCB model. All the molecules were filtered to have a QED score greater than 0.8 and were predicted to be active by the QSAR model with a probability greater than 0.8. The FPB-based generations demonstrate almost identical structure to the seed at least at a scaffold level. On the other hand, the PCB-generated molecules have clearly different scaffolds to seed, which can be attributed to the fact that the selected physicochemical descriptors do not encode structural information directly.
The correlation between the seed and the output of the models was further investigated by calculating the Tanimoto similarity of multiple generations. For that purpose, 100 seeds were randomly selected from the unseen active compounds of the DRD2 test set and for each one, 256 molecules were generated in a batch by each of the conditional models yielding a total of 25,600 SMILES strings. For each batch, the pairwise Tanimoto similarities were calculated between the scaffolds of the associated seed and of all uniquely generated compounds. The results are plotted in Figure 5A, while the predicted probabilities of being active for each compound were plotted in Figure 5B. The PCBgenerated scaffolds tend to be dissimilar to their seeds in contrast to the FPB-generated ones, the similarity of which to the seeding scaffolds follows a bimodal distribution that is shifted to the right, showing that similar or identical scaffolds are generated. However, in both cases the distribution of active probabilities is comparable ( Figure 5B), proving that both models can generate predicted active compounds given the appropriate conditions.

Benchmarking
In an attempt to further compare all models, all three of them were tested with respect to the metrics provided by the MOSES framework [16]. For that purpose, 25,600 compounds were additionally sampled by the model trained with transfer learning, similar to the sampling done for the other models as described above. The metrics were calculated with respect to the active compounds of the DRD2 test set that was used as a reference dataset. The PCB model performs the worst with respect to most metrics, except for predicted active fraction and uniqueness among 10,000 samples. However, the metrics need to be interpreted carefully. The seed conditions used for the generation were extracted from active compounds of the DRD2 test set, which were not included in the training set of both conditional models. The active class is heavily underrepresented in the datasets that they were trained on (only 2.3% of predicted actives in ChEMBL, cf. Table 1) and thus the set of conditional seeds correspond to a demanding task, which becomes even harder for the PCB model to fulfil since much less information is included in the physicochemical descriptors than in the fingerprints. On the contrary, the transfer learning model was trained directly on known actives and it is independent of any input during generation, while trying to replicate what has been seen during training. Lacking input conditions offers an implicit advantage over the conditional models in terms of valid generated structures, because specific input combinations may cause a consistent drop in generated validity. However, within a sample of 10,000 generated structures, the PCB and the transfer learning models are on par regarding uniqueness. This metric is a performance indicator, yet it does not fully expose the differences between the models, simply because the output space is too large to generate enough duplicates within only 10,000 samples. On the other hand, the FCB model has low uniqueness, but this is expected as the more deterministic nature and the lower number of possible SMILES to sample from a single fingerprint naturally lead to duplicated outputs and penalized uniqueness.
The Frechét ChemNet distance (FCD) [43] underlines the chemical distance between the reference and the generated distributions. As such, it is heavily in favour of the conditional models, because the seeds drawn from the test set purposely force the generated distributions towards it and consequently It is noteworthy that a higher fraction of unique predicted active compounds was sampled by the PCB model whereas the least of them were generated by the FPB model. The FPB model was punished because of its high reconstructability, which negatively affects its uniqueness score.
More specifically, the molecular reconstructability of the input descriptors was assessed by trying to retrieve the molecule that was represented by them at each batch. By identifying the most frequently sampled molecule out of 256 generations using a single conditional seed, almost 65% of the FPB generations were proven to be successful reconstructions of the molecule behind the seeding fingerprint. Further experimentation with a deeper FPB model with four decoding layers and 512 LSTM units each made it possible to increase the reconstructability to 72%. Nonetheless, reconstructions were very scarce when using the physicochemical descriptors in the PCB model, because 256 samples were not enough to completely rediscover the diverse molecular space behind a given input condition.
In order to investigate whether novelty of the conditional models is influenced by the training or seeding dataset, 100 new conditions were drawn from each one of the training and test subsets of ChEMBL. Then, the novelty of the unique valid generated structures out of 256 generations (one batch) per set of conditions was assessed with respect to both datasets. The results are shown in Figure 6. As hypothesized, both models use the conditions stemming from unseen molecules and generate structures that are not present in either dataset. For any of the models, the difference between datasets is insignificant, reflecting a consistent generation of novel compounds regardless of the origin of the seeding conditions.

Control of Generated Properties
The primary advantage of the PCB model is the ability to generate molecules that follow the desirable properties. This was tested by using 10 conditional seeds derived from randomly selected active compounds from the DRD2 test set whose QED scores are all greater than 0.8. For each conditional seed, a batch of 256 SMILES were generated and the physicochemical properties defined in the condition were calculated for all the generated valid molecules using RDKit. As shown in Figure 7, most of the properties of generated compounds exhibit only small deviation from the defined conditional setpoint, with the QED property having relatively large variance around the reference level. LogP, TPSA, molecular weight and HBD setpoints were adequately matched in the generated molecular properties, followed by HBA which seems to be unstable for low values of LogP and high values of molecular weight. The QED formula contains the weighted sum [41] of all the other five properties and, consequently, the requested conditions along with the QED setpoint may render it impossible for that equation to be satisfied. Therefore, the QED property was hard to keep at the reference value as shown by the large spread around the target value ( Figure 8). Those are cases in which input combinations were ill-defined and resulted to either unattractive molecules or invalid structures, something that has also been observed in the latent space vectors of autoencoders [21]. In the cRNN context, such combinations may refer to under-represented regions in the training dataset, which are either due to the lack of relevant samples in the source or due to conflicts between the requested descriptor ranges. The conditions are entangled since they depend on each other, as observed from the behaviour of the QED score. In most cases, the user is probably interested in tuning only one of the properties rather than restraining many of them; nonetheless, the property conditions ought to be set at reasonable values to avoid the entanglement problem. More sophisticated sampling approaches, such as the LatentGAN architecture [23], could potentially address the entanglement problem. Particularly, the generator component of the LatentGAN may be used to autonomously propose a valid combination of input properties that lead to active generations towards bioactivity targets.

Exclusivity of Sampling
Sampling the cRNN model with the seed conditions derived from a query structure should theoretically make it more likely to generate structures similar to the seed (c.f. Figure 3) and less likely to sample dissimilar molecules. To investigate this hypothesis, 100,000 molecules were randomly selected from the ChEMBL test set and clustered using the DBSCAN algorithm [44], based on the Euclidean distance of their five scaled physicochemical properties (LogP, TPSA, MW, HBA, HBD). A value of ε = 0.1 and 10 minimum samples for associating core points were selected as parameters of the DBSCAN algorithm.
Next, two clusters of molecules (with size of 53 and 57 respectively) were manually selected to keep the variance of their descriptors within a range as narrow as possible, with preferably small overlap.
The distributions of LogP, TPSA and MW of the selected clusters are shown in Figure   The seed conditions were selected as the coordinates of the geometric centre of each cluster. The conditional NLL of sampling the canonical SMILES of each cluster under different seeds was calculated according to Equation 1 ( Figure 9D). The cross-conditional NLL was calculated for each cluster by swapping the conditional seeds of both clusters. Theoretically, the generation of molecules from these two clusters during conditional sampling should be mutually exclusive using their own cluster centre as seed. In other words, using each cluster centre as the seed should have a higher probability to sample the compounds within the same cluster.
This hypothesis is actually supported by Figure 10D. In overall, the molecules in cluster 1 (blue curve) are more likely to be sampled than the ones in cluster 2 (green curve), when the conditional vector is derived from the seed of cluster 1. When the seeds of both clusters are swapped, it is less probable to sample any molecule from one cluster using the seed of the other. Even though there is an overlap between the self-conditional and cross-conditional NLL curves, in both cases the former ones describe lower NLL values, thus showing that relevant molecules are more likely to be sampled. By comparing the self-conditional NLL curves of Figure 9 to the curves of Figure 3, it is observed that the NLL curves of Figure 8 are all shifted towards higher NLL values. This is expected though, since the conditions considered for each SMILES string were not derived from its own properties but from the mean properties of the cluster instead.

Applications to Drug Discovery
The cRNN architecture provides a way to address the inverse QSAR problem directly. Particularly, the PCB cRNN is able to generate molecular structures with desired properties. In contrast, other available methods suggest the use of optimization algorithms [15] [22] or reinforcement learning [12] to close the loop and steer one or more initial candidate molecules towards the aspired region of the chemical domain in an iterative process. Such optimization approaches require computationally expensive looping over a cost or desirability function, whereas in our case a batch of 256 potentially interesting SMILES strings with properties close to predefined target values can be generated within less than a second. Runtime is of concern when scaling up the generative process for the design of diverse molecular libraries, the populating of which in a timely and interactive fashion might be essential.
Having a quasi-instant generation also allows interactive applications to be built, where a fast feedback cycle permits experimentation with the target properties for library generation.
The pretrained cRNN could also serve as a starting point for additional optimization techniques such as reinforcement learning [12], where the standard conditions combined with the cRNN are complemented with project specific QSAR and desirability functions. Other optimization techniques such as Particle Swarm Optimization [22] or Bayesian Optimization [15] could be used to optimize the conditional seed, which in the case of the PCB model would be directly interpretable and for the FBP model could yield a series of similar compounds.

Conclusions
In this work, the effect of introducing molecular descriptors as inputs to an existing SMILES generator architecture based on recurrent neural networks has been investigated. Primarily, it was shown that known molecules are more likely to be rediscovered when sampling using the descriptor conditions that represent them as inputs to a cRNN, compared to a prior unbiased model that is simply trained on the complete molecular dataset. Our approach also demonstrated the capacity of generating novel compounds that were predicted active against the DRD2 receptor, which were also chemically closer to known active compounds than a baseline model trained with transfer learning. Additionally, a larger fraction of predicted actives was generated by the cRNN than the baseline model. Using molecular fingerprints as conditions focuses the molecular generation even more than physicochemical properties, by acting as structural restrictions that impose a scaffold on the output that is similar, if not identical, to the reference. This also demonstrated the proposed architecture's capability to function as a fingerprint inverter, by being able to resample the original molecule even up to 72% of the time by using a more complex network. On the other hand, physicochemical properties are more versatile and lead to molecules with more diverse structures and different scaffolds than the molecule that the conditions were derived from. The cRNN architecture tackles the inverse QSAR problem by directly shaping the properties of the generated molecules while avoiding computationally expensive optimization loops. Nonetheless, even though we have been able to optimize the conditions independently of each other, not all input combinations led to valid structures due to the conditions being correlated. As an example, this was observed when conditioning with a high QED setpoint while keeping the other conditions, which are constituents of the QED score calculation, fixed. The cRNN has thus been demonstrated as a potentially useful architecture with intermediate output space between unbiased character-based RNNs and fully steered autoencoders with a 1:1 relation between latent space vectors and molecules.