A Transformer Model for Retrosynthesis

We describe a Transformer model for a retrosynthetic reaction prediction task. The model is trained on 45 033 experimental reaction examples extracted from USA patents. It can successfully predict the reactants set for 42.7% of cases on the external test set. During the training procedure, we applied different learning rate schedules and snapshot learning. These techniques can prevent overfitting and thus can be a reason to get rid of internal validation dataset that is advantageous for deep models with millions of parameters. We thoroughly investigated different approaches to train Transformer models and found that snapshot learning with averaging weights on learning rates minima works best. While decoding the model output probabilities there is a strong influence of the temperature that improves at T=1.3 the accuracy of models up to 1-2%. We describe a Transformer model for a retrosynthetic reaction prediction task. The model is trained on 45 033 experimental reaction examples extracted from USA patents. It can successfully predict the reactants set for 42.7% of cases on the external test set. During the training procedure, we applied different learning rate schedules and snapshot learning. These techniques can prevent overﬁtting and thus can be a reason to get rid of internal validation dataset that is advantageous for deep models with millions of parameters. We thoroughly investigated different approaches to train Transformer models and found that snapshot learning with averaging weights on learning rates minima works best. While decoding the model output probabilities there is a strong inﬂuence of the temperature that improves at T=1.3 the accuracy of models up to 1-2%.


COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1
: An example of a retrosynthetic reaction: on the left side of the arrow the target molecule is depicted, and on the right side the one possible set of reactants that can lead to the target is shown in common chemistry-like scheme and using SMILES notation. Here two successive amidation reactions result in cyclisation and aromatization.
simplify the target molecule bringing to the scene less complex compounds. Some of them may be already available, while the others undergo the next step of retrosynthesis decomposition. Due to the recursive nature of the procedure, it can deal with thousands of putative compounds so computational retrosynthetic approaches can greatly help chemists in finding the best routes. Managing of the database of such rules is complicated and more critical the models based on it are not ready to accommodate new reactions and will always be outdated. Unfortunately, almost more than 60 years of developing rule-based systems ended with no remarkable success in synthesis planning programs [14]. Another approach to tackle the problem is to use so-called template-free methods inspired by the success of machine-translation. They don't require the database of templates and rules due to an inherent possibility to derive this information during training directly from a database of organic reactions with clearly designated roles of reactants, products, reagents, and conditions.
The analogy between machine translation and retrosynthesis is evident: each target molecule has its predecessors from which it can be synthesized as every meaningful sentence one can translate from source language to target one. If all parts of a reaction are written in SMILES notation, then our source and target sentence are composed of valid SMILES tokens as words. The main goal of the work is to build a model which could for a given target molecule for our example 2 COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1 in fig. 1 correctly predict the set of reactants. Namely, it should predict Nc1cccnc1N.COC(=O)c1cccc(C(=O)O)c1 in this case.
Neural sequence-to-sequence (seq2seq) approach has been recently applied for a direct reaction prediction task [15,16] with outstanding statistical parameters of final models -90.4% of accuracy on test set. Seq2seq modeling has been also tested on retrosynthesis task [17], but due to the complex nature of retrosynthesis itself and difficulty in estimating the correct predictions of reactants 3 , accuracy on the test set was moderate 37.4% but still comparable to rule-based systems 35. 4%. We questioned about the possibility of improvement models for one-step retrosynthesis utilizing modern neural network architectures and training techniques. Applying the Transformer Model [18], together with cyclical learning rate schedule [19], resulted in a model with accuracy 42.7%, that is > 5% higher compare to the baseline model [17].
Our main contributions are: • We show that Transformer can be efficiently used for a retrosynthesis prediction task.
• We show that for this particular task there is no advantage to use a validation dataset for early-stopping or other parameters optimization. We trained all the parameters directly from the training dataset.
• Applying weights averaging and snapshot learning helped to train the most precise model for one-step retrosynthesis prediction. We averaged weights on 5 successive cycles of learning rate schedule.
• Increasing the temperature while performing a beam-search procedure improves the accuracy up to 2%. 2 This reaction is in the test set and it was correctly predicted by our model. 3 A target molecule usually can be synthesized with different reactions starting from different sets of reactants. The predictions of the model may be correct from organic chemist point of view but differ from the reactant set in ground truth. This may lead to underestimation of effectiveness of models.

Dataset
In this study we used the same dataset of reactions as in [17]. This dataset was filtered from the USPTO database [20] originally derived from the USA patents and contains 50 000 reactions classified into 10 reaction types [21]. The authors [17] further preprocessed the database by splitting multiple products reactions into multiple single products reactions. The resulting dataset contains 40 029, 5 004, and 5 004 reactions for training, validation, and testing respectively. Information about the reaction type was discarded as we aimed at building a general model using SMILES of products and reactants only.

Model input
The seq2seq models were developed to support machine translation where the input is a sentence in one language, and the output is a sentence with approximately the same meaning but in another language. String nature of data implies some tokenization procedures similar to word2vec to be used for preprocessing the input. Most of works in cheminformatics dealing with SMILES tokenize the input with a regexp equal or similar to [15]. . The thing in brackets according to SMILES specification can be quite a complex gathering not only the element's name itself, but also its isotopic value, stereochemistry configuration, the formal charge, and the number of hydrogens 4 . Strictly speaking, to do tokenization right one should also parse the content of brackets just increasing the number of possible words in the vocabulary what eventually leads to the simplest tokenization only with letters. We tried different schemes of tokenization in this work but did not see any improvements in using them over simple character-based method.
Our final vocabulary has length of 66 symbols 5 : chars = "^#%()+-.0123456789=@ABCDEFGHIKLMNOPRSTVXYZ[\\]abcdefgilmnoprstuy$" To convert a token to a dense vector we used a trainable embedding 6 of size 64. It is well known that training neural networks in batches is more stable, faster, and leads to more accurate models. To facilitate batch training we also used masks of input strings of shape (batch_size, max_length) with elements equal to 1 for those positions where are valid SMILES symbols and 0 everywhere else.

Transformer model
We used a promising Transformer [18] model for this study which is a new generation of encoder-decoder neural networks family. The architecture is suited for exploration of the internal representation of data by deriving questions (Q) the data could be asked for, keys for its indexed knowledge (K), and answers written as values (V ) corresponding to queries and keys. Technically these three entities are simply matrixes learned during the network training. Multiplying them with the input (X) gives keys (k), questions (q), and values (v) relevant to a current batch. Equipped with these calculated parameters of the input the self-attention layers transforms it pointing out to some encoding (decoding) parts based on the attention vector.
The Transformer has wholly got rid of any recurrences or convolutional operations. To tackle distances between elements of a string a positional encoding matrix was proposed with elements equal to the values of trigonometric functions depending on the position in a string and also the position in the embedding direction. Summed with learned embeddings positional encodings do their job linking far located parts of the input together. The output of self-attention 4 We do not want to exclude stereochemistry information from our model as well as charges and explicit hydrogens that will lead to reducing of the dataset. Moreover, work in generative models showed excellent abilities of models to close cycles, for example, c1cc(COC)cccc1. If the model can capture such a long distance relation why should it be cracked on much simplier substrings enclosed by brackets? 5 This vocabulary derived from the complete USPTO set and is a little bit wider than needed for this study. But for future extending of the models it is better to fix the input shape to the biggest possible value. 6 The encoder and the decoder share embeddings in this study.
layers is then mixed with original data, layer-wise normalized, and passed position-wise through a couple of ordinary dense layers to go further either in next level of self-attention layers or to a decoder as an information-rich vector representing the input. The decoder part of Transformer resembles the encoder but has an additional self-attention layer which corresponds to encoder's output.
Transformer model shows the state-of-the-art results in machine translation and reaction prediction outcomes [16]. The latter work showed that training the Transformer on large and noisy datasets results in a model that can outperform not only other machine models but also well qualified and experienced organic chemists.

Model inference
The model estimates the probability of the next symbol over the model's vocabulary given all previous symbols in the string. Technically, the Transformer model first calculates logits, z i , and then transforms them to probabilities.
Here x i is the input of the models at i position; L -the length of the input string; y i is the decoded output of the model up to position (i − 1); and z i -logits that are to be converted to probabilities: where V is the size of the vocabulary (66 in this work) and T stands for the temperature 7 usually assigned to 1.0 in standard softmax layers. With higher T the landscape of the probability distribution becomes more smooth. During the training the model adapts its weights to better predict q i , so y i = q i .
During the inference however we have several possibilities how to convert q i into y i , namely greedy and beam search. The first one picks up a symbol with maximum probability whereas the second one at each step holds top − K (K = beam's size) suggestions of the model and summarises the overall likelihood for each of K final decodings. The beam search allows better inference and the probability landscape exploration compared to the greedy search because at a particular step of decoding it may choose a symbol with less than maximum probability, but the total likelihood of the result can be higher due to more significant probabilities on the next steps.

Training heuristics
Training a Transformer model is a challenge, and several heuristics have been proposed [19], some of them were used in this study: Using as bigger batch size as possible. Due to our hardware limitations we could not set the batch size more then 64 8 ; Increasing the learning rate at the beginning of training up to warmup steps 9 . The authors of the original Transformer paper [18] used 4 000 steps for warming. The Transformer model for reaction prediction task from [16] used 8 000 steps. We analysed different values for warmup and eventually found that 16 000 works well with our model.
Applying cyclic learning rate schedules. This tips can generally improve any model [22] through better loss landscape exploration with bigger learning rates after the optimiser felt down to some local minima. For this study we used the following scheme for learning rate calculation depending on the step: where cycle stands for the number of steps while the learning rate is decreasing before raising to the maximum again.
where f actor is just a constant. Big values of f actor introduce numerical instability during training, so after several trials we set f actor = 20.0. The curve for learning rate in this study is shown in fig. 3, plot (4, f). 7 Similar to formula of Boltzmann (Gibbs) distribution used in statistical mechanics. 8 Our first implementation of the model required a lot of memory to deal with masks of reactants and products. Though later we improved the code we still remained this size for consistency of the results. 9 In our implementation 1 step is equivalent to 1 batch. The number of reactions for training is 40 029 + 5 004, so one epoch is equal to 704 batches.
Averaging weights during last steps (usually [10][11][12][13][14][15][16][17][18][19][20] of training or at minima of learning rates in case of snapshop learning [23]. Also with cyclic learning rate schedules it is possible to average weights of those models that have minimum in losses just before increasing of the rate. Such approach leads to more stable and plain region in loss landscapes [22] Following the standard machine learning protocol, we trained our first models (T1) using three datasets for training, validation, and external testing (8:1:1) as was done in [17]. Learning curves for T1 are depicted in fig. 3, (c) and (d) for training and validation loss, respectively, (a) shows the original learning rate schedule developed by the authors of the Transformer but with 16 000 warmup steps. On reaching cross-entropy loss about 0.1 on the validation dataset, it stagnates without noticeable fluctuations as training loss steadily decreases. After warming up phase the learning rate begins fading and eventually after 1 000 epochs its value reaches 2.8 * 10 −5 inevitable causing to stop training because of too small updates.
During the decoding procedure, we explored the influence of the temperature parameter on the final quality of prediction and found that inferring at higher temperatures gives better result then at T=1. This observation similarly repeated for all our models. Fig. 2 shows the influence of this parameter on the reactants prediction of the part of the training set. Clearly, at T=1.3 the model reaches the maximum of chemically-based accuracy. This fact one can explain that at higher temperatures the landscape of output probabilities of the model is softer letting the beam-search procedure to find more suitable ways during decoding. Of course, the temperature influences only relative distances between peaks, so it does not affect the greedy search method.  If we applied the early stopping technique, the training of a model is stopped around 200 epoch 10 . Effectiveness of such a model marked T1 1 in table 1 resulted in TOP-1 37.9% on the test set. If we chose the last one model obtained at 1 000 epoch, then the model T1 2 gave us better value -39.8%. In this case, we did not see any need of the validation dataset and keeping in mind that our model has almost 2 millions of parameters we decided to combine training and validation sets and train our next models on both data, e.g., without validation. The model T2 was trained on all data and with the same learning rate schedule as T1. The results obtained when applying T2 to the test set are better than for T1 model namely 41.8% vs. 39.8%, respectively.
Then we trained our model with cyclic learning rate schedule, eq. 3, fig. 3 (b) for better exploration of loss landscape. During training, we also saved the character-based accuracy of the model, fig. 3, (f). This snapshot training regime [23] produces a set of different weights at each minimum of learning rate. Averaging them is to some extent equivalent to a consensus of models but within one model [22]. We tried different averaging regimes for T3 and found that averaging five last cycles gives better results.
Our final T3 model outperforms [17] by 5.3% with beam search and more critical it is also effective with greedy search 40.6%. The latter one is much faster and consequently more suitable for virtual screening campaigns.
It worth to notice that TOP-5 accuracy reaches almost 70%. That means the model can correctly predict reactants but sometimes scoring is wrong and TOP-1 is much less. We tried to improve TOP-1 scoring with internal confidence estimation.

Internal scoring
The beam search calculates the sum of negative logarithms of probabilities of selecting a token at a particular step, and thus, this value can be a measure of internal confidence. To check this hypothesis, we selected T3-2 model and estimated its internal performance to distinguish between correct and invalid predictions. The  parameters of the classifier were: AUC = 0.77, optimal threshold = 0.00678. Then we validated the model with an additional condition: if the score is less than optimal threshold we selected the answer, otherwise we went to the next candidate in the possible reactant sets returned by the beam search. The results were even worse than without thresholds, 28.45 vs. 42.42. A possible explanation is that the estimation does not deal with organic chemistry. The model tries to derive some character-based scoring relying only on tokens in a string and increasing this value does not influence the quality of prognosis. The same effect we saw during training when the character accuracy is 98% whereas chemistry-based metric is much lower.
Estimation of optimal thresholds on training sets almost always a bad idea due to the biasing of a model to its source data. The correct way is to use validation dataset instead. We built the classifier for the T1-2 with characteristics: AUC = 0.65, optimal threshold 0.00396, and applied it for testing the model. The results were again worse, 14.1% vs. 40.85%. There are no significant differencies of accuracies when using unnormalized or normalized on the length of the reactants string scores. Fig. 4 shows ROC curves for T1-2 and T3-2 models derived at T=1.3. Evidently one cannot use this estimation to improve TOP-1 scoring.

Discussion
Much attention paid in the scientific literature for rule-based approaches [14,25]. Since the authors of [26] have described the algorithm of automatic rule extraction from mapped reaction database several implementations of the procedure appeared, and then widely accepted by researchers. However, it should be noticed that, first, there is no algorithm to make atom-mapping [12] if it is absent (the typical situation with laboratory notebooks (ELN) for example). Second, all available information on synthesis usually contains only positive reactions, so all binary classification accuracies are inevitable overestimated because of artificial negative sets exploited in studies. Finally, the absence of commonly accepted dataset for testing makes the results of different groups practically disparate and biased to those problems the authors tried to solve. The authors of [25] selected 40 molecules from DrugBank database to test their multiscale models, whereas [17] used database specially prepared for classification [21].
Our model can correctly predict reactant set in TOP-5 with accuracy 69.8%. Internal confidence estimation cannot guarantee a correct ordering of reactants sets, so different scoring methods should be developed. One of the promising ways is to use a forward reaction prediction model to estimate whether it is possible to assemble a target molecule from reactants proposed. The scoring model should have excellent characteristics and probably it is possible to apply the same cycling learning rate and snapshot averaging to build it.
First work on applying reinforcement learning for the whole retrosynthetic path [14] showed superior performance compared to the rule-based methods developed before. More important if can deal with several steps of synthesis. But the policy learned during the training again used extracted rules limiting the method. Thus, the development of models for direct estimation of reactants is still of prime importance. During the encoding process, the Transformer finds an internal representation of a reaction which can be useful for multicomponent QSAR [27] for predicting rate constants [28] and yields of reactions. Embedding such systems in policy networks within reinforcement learning paradigm can bring forward an entirely data-driven approach to solve challenging organic synthesis problems.

Conclusions
We have described a Transformer model for retrosynthesis one-step prediction task. Our final model trained with cyclic learning rate schedule and its weights were averaged during last five loss minimum. The model outperforms the previous published retrosynthetic character-based model by 5.3%. It also does not require the extraction of specific rules, atom mappings, and reaction types in reaction dataset. We believe it is possible to improve the model further applying knowledge distillation method [29] for example. The current model can be used as a building block for reinforcement learning aimed at solving complex organic problems.
All source code and also models built are available online via github https://github.com/bigchem/retrosynthesis

Appendix
In this section we report all characteristics of repetitive runs of all models built in this study. Average values based on this data are summed in table 1. For every run we report learning curves with loss functions and learning rate, then  model relevant information are collected in a table including greedy search decoding accuracies, beam search accuracies  for Top-1, Top-3, and Top-5 Figure 9: Learning curves and rate schedule for T2 model. Loss for training on the left, usual learning rate schedule on the right.   https://github.com/bigchem/retrosynthesis/papers/retrosynthesis/models/t2-3-last.h5 https://github.com/bigchem/retrosynthesis/papers/retrosynthesis/models/t2-3-avg.h5 Learning rate * 100 Epochs Learning rate Averaging Figure 11: Learning curves and rate schedule for T3 model. Loss for training on the left, usual learning rate schedule on the right.  Figure 12: Learning curves and rate schedule for T3 model. Loss for training on the left, usual learning rate schedule on the right.  Figure 13: Learning curves and rate schedule for T3 model. Loss for training on the left, usual learning rate schedule on the right.