‘Ring Breaker’: Assessing Synthetic Accessibility of the Ring System Chemical Space

Abstract
Ring systems in pharmaceuticals, agrochemicals and dyes are ubiquitous chemical motifs. Whilst the synthesis of common ring systems is well described, and novel ring systems can be readily computationally enumerated, the synthetic accessibility of unprecedented ring systems remains a challenge. ‘Ring Breaker’ enables the prediction of ring-forming reactions, for which we have demonstrated its utility on frequently found and unprecedented ring systems, in agreement with literature syntheses. We demonstrate its performance on a range of ring fragments from the ZINC database and highlight its potential for incorporation into computer aided synthesis planning tools. Additionally, we generate a multi-label dataset using bipartite reaction graphs on which we train ‘Ring Breaker’ to model the relationship between one ring fragment and the multiple reactions recorded for its synthesis in the dataset; we thereby overcome the single-label approaches previously used. These approaches to ring formation and retrosynthetic disconnection offer opportunities for chemists to explore and select more efficient syntheses/synthetic routes.


Introduction
The recent wave of artificial intelligence (AI) within drug discovery has heavily impacted the fields of de novo design, synthesis planning, and bioactivity prediction, to name a few. 1,2 This holds the promise of accelerating Design, Make, Test, Analyze (DMTA) cycles, for which predictive models are desired to reduce failure rates in the drug discovery process. 1,2 Computer aided synthesis planning (CASP) has long been investigated as a means for predicting how to make a given compound. [3][4][5] However, despite recent progress in the field, [6][7][8][9][10][11][12] synthetic planning tools based on neural network classifiers have failed to recognize reactions that are infrequently used or rare, due to the heavily biased datasets available. 7,13 As such, CASP tools have not yet focused on the synthesis of ring systems, the reactions for which often fall within the noise of the datasets in question. The ability to deconstruct ring systems in novel ways offers medicinal and process chemists alike the opportunity to explore a wider range of chemical space and create more efficient synthetic routes, thereby leading to a competitive advantage. 11
Ring systems are key scaffold components in medicinal chemistry, and are fundamental motifs in a number of drugs on the market today. 14 They vary greatly in nature: ring systems can be saturated, unsaturated, or polycyclic, and range in size from small heterocyclic rings to large macrocycles. In addition, they span a range of chemical domains, from cyclic peptides to natural products, specialty chemicals, and dyes. As such, it is not surprising that many of the most frequently used reactions in organic synthesis pertain to the coupling of ring systems. 15,16 Although coupling reactions enable the synthesis of a wide range of structures, they are limited by the available building blocks.
Ring-forming strategies, on the other hand, can enable the synthesis of novel building blocks containing ring systems, which can then be coupled to other fragments, thus allowing for the expansion of the synthetically feasible chemical space.
Ring systems play a role in the electronic distribution, three dimensionality, and scaffold rigidity of the small molecules they are part of. 14,17 They can directly interact with a protein target, such as in the well-defined example of the hinge binding motifs for kinase targets. 18 In addition, they contribute to physicochemical properties such as lipophilicity or polarity and molecular reactivity, which in turn will determine a molecule's absorption, distribution, metabolism, excretion and toxicity (ADMET) profile. 14 Therefore, synthetic approaches to novel ring systems are desired in order to tune and exploit property profiles derived from their interaction with the target. As such, numerous publications have followed the exhaustive computational enumeration of heteroaromatic ring systems first described by Pitt et al. 19 These aim to enrich structure-activity relationship information, explore the chemical space of ring systems, and find motifs relevant for use in medicinal chemistry. 17,[20][21][22] However, the synthetic accessibility of ring systems remains poorly explored.
Furthermore, the neural networks upon which CASP tools are built 7,8,23 are trained using the single labels (templates) obtained from the dataset. As an analogy to retrosynthetic planning, this resembles a one compound to one reaction (template) situation, whereas in 'truth' a compound can be synthesized by multiple reactions at any given step in the pathway. In this study we propose a method for the extraction of multiple labels from the underlying dataset and demonstrate its use in the prediction of retrosynthetic ring disconnections. This was extended to the prediction of previously unseen fragments, such as the so-called 'Rings of the Future', for which we examine the predictive performance. 19,24 We show how 'Ring Breaker' can be viewed as a specialist for predicting ring formations and used alongside current CASP tools to guide route finding into avenues exploiting ring synthesis. The implications of predicting ring synthesis are far reaching and extend beyond the medicinal chemistry domain, to dyes, fragrances and agrochemicals, to name a few.

Dataset Generation
Reaction datasets, in their current form, contain records of individual reactions, whereby one compound can be the product of several different reaction classes or combinations of reactants. In previous approaches these individual records have been used to train neural networks either to predict a retrosynthetic step or for reaction prediction. 6-8, 25, 26 However, this strategy neglects the one-to-many nature of retrosynthetic analysis, where a given compound may be constructed in more than one way. To overcome the limitation imposed by direct use of the dataset entries, we first build a bipartite reaction graph to map the relationship between all compounds designated as products in the reaction dataset and their corresponding templates or reaction rules. 27,28 Using only templates that have been validated by applying them to the product and confirming that they regenerate the reactants recorded in the dataset, we ensure that the graph represents a 'partial' ground truth of the retrosynthetic space. For each compound designated as a product, the bipartite reaction graph is queried to obtain the neighboring connected nodes, from which we can extract a multi-label dataset for the subsequent training of neural networks. This approach allows us to train a multi-label multi-class classification neural network for the prediction of retrosynthetic steps, as opposed to the single-label multi-class classification network previously described (Figure 1). In doing so, the number of samples is limited to the number of products recorded in the dataset, rather than the number of individual reaction entries. This speeds up training of the network by reducing the number of samples, resulting in a more efficient way of scaling to the ever-growing chemical literature. Additionally, the label vectors better represent the nature of the problem and are closer to the ground truth, as their sparsity is reduced.
The ground truth is defined as containing all possible retrosynthetic disconnections, and as such all reaction templates that can be applied to any given product.
Figure 1. a) Previous models were trained considering a one product to one template relationship. However, as multiple templates/reactions may be used on a given compound, it is desirable to train the model considering a one product to multiple template relationship. Here we build a bipartite reaction graph connecting compounds with their associated templates, which we subsequently query to extract a multi-label dataset. b) Examples of ring formations described in the USPTO dataset; these include ring closing metathesis and the Diels-Alder reaction. The USPTO dataset was filtered using the crude measure of the difference in the number of rings between products and reactants to obtain a dataset describing ring formations.
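The graph construction and query described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the NetworkX implementation used in the study; the record format, the template identifiers (`tmpl_a`, `tmpl_b`, `tmpl_c`) and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical validated records: (product SMILES, template identifier).
records = [
    ("c1ccoc1", "tmpl_a"),
    ("c1ccoc1", "tmpl_b"),
    ("C1=CCCCC1", "tmpl_c"),
    ("C1=CCCCC1", "tmpl_c"),  # repeated entry, collapses to a single edge
    ("C1=CCCCC1", "tmpl_a"),
]


def build_bipartite_graph(records):
    """Return both sides of the adjacency: products <-> templates."""
    product_to_templates = defaultdict(set)
    template_to_products = defaultdict(set)
    for product, template in records:
        product_to_templates[product].add(template)
        template_to_products[template].add(product)
    return product_to_templates, template_to_products


def multi_label_dataset(product_to_templates):
    """One sample per product, labelled with all of its neighbouring templates."""
    return {p: sorted(ts) for p, ts in product_to_templates.items()}


p2t, t2p = build_bipartite_graph(records)
dataset = multi_label_dataset(p2t)
```

Five reaction entries collapse to two training samples here, which mirrors the reduction in sample count noted in the text.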
In this work, we limited the reaction templates to those describing ring formations by using the crude measure of the difference in the number of rings between the products and reactants. We retain only reactions in which the difference is greater than one, thereby allowing multiple ring formations in one synthetic step. The bipartite reaction graph is then built, describing the retrosynthetic space corresponding to ring formations, and queried to build a domain-specific multi-label dataset (Figure 1, Table 1). Compared to the entirety of the datasets from which the ring formations were extracted, we found that ring-forming reactions constitute 4.5 % and 5.8 % of the USPTO and Reaxys® † datasets, respectively. An even smaller percentage of all the templates extracted from these datasets correspond to ring formations (Table 1). Therefore, an all-encompassing classifier, which considers all extracted templates when predicting which can be applied in any given situation, has the difficult task of differentiating the applicable templates. We propose a specialized ring formation classifier called 'Ring Breaker' that overcomes the current limitations of predicting ring syntheses. This can be injected as needed into a full retrosynthetic tool to enable access to well-documented, as well as previously unreported, ring systems.
† Copyright © 2019 Elsevier Life Science IP Limited except certain content provided by third parties. Reaxys is a trademark of Elsevier Life Science IP Limited.
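The ring-count filter can be illustrated with a small sketch. To stay dependency-free, it assumes molecules are given as heavy-atom graphs (atom count plus bond list) and counts rings as the cyclomatic number; in practice a cheminformatics toolkit's ring perception would be used instead. The cutoff is exposed as a hypothetical parameter `min_new_rings`; the value 1 used in the example is an assumption chosen so that a single ring closure passes, and it should be set to match the criterion actually applied.

```python
def ring_count(n_atoms, bonds):
    """Independent rings of a molecular graph: bonds - atoms + components."""
    parent = list(range(n_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    components = n_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - n_atoms + components


def ring_delta(reactants, products):
    """Rings in the products minus rings in the reactants."""
    return (sum(ring_count(*m) for m in products)
            - sum(ring_count(*m) for m in reactants))


def is_ring_forming(reactants, products, min_new_rings=1):
    """Assumed filter: keep reactions gaining at least `min_new_rings` rings."""
    return ring_delta(reactants, products) >= min_new_rings


# Diels-Alder skeleton: butadiene + ethene -> cyclohexene (heavy atoms only).
butadiene = (4, [(0, 1), (1, 2), (2, 3)])
ethene = (2, [(0, 1)])
cyclohexene = (6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])

delta = ring_delta([butadiene, ethene], [cyclohexene])
```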

Prediction of well-known ring formations
To demonstrate the utility of 'Ring Breaker' on common ring formations, we retrieved examples from the organic chemistry literature which exemplify commonly used ring-forming reactions (Figure 2). In each case, the first applicable prediction (i.e. the first prediction for which application of the predicted template successfully generated a set of reactants) is shown. The predictions were made by two models, trained on the USPTO and Reaxys data respectively, to determine whether a difference in performance could be observed between the two datasets, considering their differing size and coverage as determined in a previous study. 29 We found that for the 20 substrates tested in this part of the study, 'Ring Breaker' performed better for the prediction of ring formations on average (Table 2). For comparative purposes, the predictions shown in the table have been restricted solely to ring formations to determine: 1) whether the quantity of ring formations in the top 50 predicted templates varied between the trained models and datasets, and 2) the rank (i.e. the placement of the prediction) of the first applicable ring formation within the top 50 predicted templates. Whilst one can argue that increasing the search space beyond the top 50 templates will give rise to more predicted templates that encode ring formations, this also increases the search breadth of the subsequent tree search. To this end, the computational expense associated with enumerating the tree to this extent must be balanced with the probability and accuracy of the prediction, where accuracy refers to the ability to predict a feasible set of reactants. In previous studies, the top 50 templates were used alongside a cumulative probability cutoff of 0.995 as a stopping criterion for further expansion of the search breadth. 7 As such, it is unlikely that predictions beyond the top 50 templates will be enumerated even if considered, because the cumulative probability of the preceding templates is already large, and there is no guarantee that a predicted template can be successfully applied to yield a set of reactants. 7,29 This motivates our measures of comparison outlined previously: the quantity of ring formations in the top 50 predicted templates and their rank.
Table 2 legend: number of ring formations predicted (rank of first applicable ring formation); e.g. the value 5(2) refers to five ring formations predicted in the top 50 templates, of which the first applicable one is the second prediction overall.
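The selection rule and the table notation discussed above can be sketched as follows. This is an illustrative stand-in, not the code used in the cited studies: the function names are hypothetical, keeping the template that crosses the cutoff is an assumption, and the `-` placeholder for "no applicable ring formation" is likewise assumed.

```python
def select_templates(probabilities, max_templates=50, cumulative_cutoff=0.995):
    """Rank template indices by probability and stop at either limit."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    selected, cumulative = [], 0.0
    for index in ranked[:max_templates]:
        selected.append(index)
        cumulative += probabilities[index]
        if cumulative >= cumulative_cutoff:
            break  # assumption: the template that crosses the cutoff is kept
    return selected


def table_entry(ranked_templates, is_ring_forming, is_applicable):
    """Format 'count(rank)' in the style of the Table 2 legend."""
    count = sum(1 for t in ranked_templates if is_ring_forming(t))
    first_rank = next(
        (rank for rank, t in enumerate(ranked_templates, start=1)
         if is_ring_forming(t) and is_applicable(t)),
        None,
    )
    # Assumption: '-' marks the case where no ring formation is applicable.
    return f"{count}({first_rank if first_rank is not None else '-'})"


# Toy network output over six hypothetical templates.
probs = [0.02, 0.60, 0.25, 0.10, 0.02, 0.01]
kept = select_templates(probs, max_templates=50, cumulative_cutoff=0.9)

ring_forming = {1, 3, 4}  # hypothetical indices of ring-forming templates
applicable = {3}          # hypothetical indices that yield reactants
entry = table_entry(kept, ring_forming.__contains__, applicable.__contains__)
```

A lower cutoff is used in the toy call only so that the stopping behaviour is visible with six templates; the defaults follow the values quoted in the text.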

Across the substrates examined ( Figure 2) we found that 'Ring Breaker' was able to suggest a ring-forming template in 98 % of cases using a model trained on Reaxys, compared to 45 % of cases when using the general model (Table 2). We also determined that on average the number of ring-forming templates predicted within the top 50 predictions by 'Ring Breaker' exceeded that predicted by the general model.
These were also ranked higher than the first applicable ring-forming template predicted by the general model. We thereby established a clear benefit of using a specialized ring-forming model, in conjunction with the general model, to increase the likelihood of predicting a ring-forming reaction during enumeration of the search tree. In addition, we found that the model trained on the Reaxys dataset outperformed that trained on the USPTO dataset for the 20 cases examined using 'Ring Breaker'. The examples shown henceforth serve to illustrate: 1) the strengths and drawbacks of the template-based approach, and 2) the differences between 'Ring Breaker' and the general model regarding the first applicable template.
The Diels-Alder reaction is one of the most well-known ring-forming reactions, and commonplace in an undergraduate chemist's education. However, the general model fails to predict the template leading to the correct set of reactants, as shown in Figure 3, for substrates 7 and 8 (Figure 2). The Diels-Alder approach cannot be predicted by either the USPTO or Reaxys models for the synthesis of quinolines.
The Paal-Knorr series of ring syntheses can be used to provide access to substituted furans, 32 pyrroles, and thiophenes (Figure 5). 33 Its versatility, together with the structural similarity between components, makes it an interesting case for testing retrosynthetic disconnections. The heteroaromatic ring varies by a single nitrogen, oxygen, or sulfur atom, and the ground truth disconnection in each case is almost the same. Figure 4 shows that the disconnections predicted by the model are dependent on the dataset. The Reaxys and USPTO datasets contain complementary templates, whereby the model trained on each dataset can predict retrosynthetic disconnections in some cases but not others. For the case of the Paal-Knorr furan synthesis, the model trained on the USPTO data is not able to predict a disconnection, whereas the model trained on the Reaxys data predicts the literature disconnection with a high probability (0.983 and 0.998). 33 The case of pyrrole synthesis further highlights an interesting problem, whereby the correct disconnection can be predicted by the USPTO model for a simplified ring system (Figure 4, 1). However, when the molecular complexity around the ring system was increased by replacement of the methyl groups with phenyl groups (Figure 4, 2), 33 the model failed to respond to the change and was not able to predict an outcome. On the other hand, the model trained on Reaxys was able to correctly identify the ring system (Figure 4, 2) and predict the retrosynthetic disconnection reported in the literature. 33 This highlights the underlying problem of template-based approaches.
The templates must be specific enough to yield a substructure match to the compound they are applied to and produce feasible reactants, whilst being general enough to be applicable across a broad range of suitable compounds without being promiscuous. Balancing these two requirements means that in cases such as the pyrrole synthesis, the template predicted for the simplified ring system cannot be applied to the more complex ring system shown. In contrast to 'Ring Breaker', the general model is only able to predict the correct set of precursors for compound 2, which uses the Paal-Knorr furan synthesis. In the case of thiophene synthesis, the model cannot identify a suitable template for the simple ring system (Figure 5, 6), regardless of the dataset used. However, in the more complex case (Figure 5, 5), it focuses its efforts on the morpholine ring, predicting the shown disconnection (Figure 5) with a high probability. Whilst this does not indicate that the system cannot predict thiophene formation, it suggests that these templates may be under-represented in the underlying dataset.

Prediction of Fragments
We performed a one-step retrosynthetic analysis to focus on the ring-forming step required to synthesize a range of ring-containing subsets from the ZINC database (Figure 6). 34 Examining a range of ring systems, from the most commonly occurring (in > 100 K substances) to the rarest (in < 1 K substances), we found that 'Ring Breaker' exhibited superior performance over the general models across all subsets examined, regardless of the dataset used. The reason for this may be two-fold. First, 'Ring Breaker' is exclusively limited to ring formations, so application of a promiscuous template may still lead to a result. However, this alone is not likely to lead to the large difference in performance observed. Second, the limited and domain-specific training set better allows the model to learn in which context ring-forming templates can be predicted. This is in comparison to the general model, in which ring-forming templates can be drowned out in the noise by more frequently occurring templates, as there are several possible options for disconnections aside from the ring-forming templates.
Furthermore, we found that the Reaxys 'Ring Breaker' outperformed that trained on the USPTO dataset (Figure 7). This is in contrast to our previous observations, where we reported that the ability to generate synthetic routes for the general model did not depend on the training dataset. 29 We have now determined that for the domain-specific case of ring formations there is a clear effect arising from the training set used, attributed to the number and diversity of the samples available to the network for training. The performance of the model on ring systems classed as 'rare' in the ZINC database is surprising (Figure 6). These ring systems can be assumed to be difficult to access synthetically, yet the model is able to predict a one-step retrosynthetic disconnection in most cases. Examples are shown in Figure 7, with their corresponding patent precedent, which refers to the patent containing the reaction from which the predicted template was extracted. Whilst the retrosynthetic disconnection may not be used as described in the forward sense, we show that 'Ring Breaker' can act as an idea generator upon which a trained synthetic chemist can build. In some cases where a disconnection could not be predicted for the unsubstituted ring system (e.g. furan synthesis, Figure 7), we have previously observed that a disconnection could be predicted for the substituted ring system (Figure 4). In such cases, it is a problem of template availability and the underlying dataset on which the model is trained. The template must be able to describe the changing atoms and bonds in the reaction, and is therefore specific to the reaction from which it was extracted in terms of the local molecular environment. Yet the template must also be generally applicable to a variety of compounds containing the same molecular environment from which the template was first extracted.
Finally, the network is trained on the products of the reactions, and uses the corresponding templates as labels.
Therefore, for the network to 'learn' in which context a given template can be applied, there must be a sufficient number of diverse examples containing the same local molecular environment to which the template has a substructure match. In this way, the network is better able to generalize to which compound a given template can be applied, and may explain why compounds, and by association templates that occur frequently within the dataset are better 'understood' by the network.

Accessing Virtual Fragments -'Rings of the Future'
Since the exhaustive computational enumeration of heteroaromatic ring systems first described by Pitt et al., 19 several articles have detailed the enumeration of ring systems, 17,[20][21][22] yet none have proposed syntheses to access the motifs described. We examined 'Ring Breaker' in the context of novel ring systems, the so-called 'Rings of the Future'. 19 Having trained the model on patent data up to 2016, we selected two novel ring systems from the literature for which the syntheses were reported in 2016, 24 and ensured that they were not present in the dataset. Rather than predicting the full synthetic route, we focused on the ring-forming step. We found that the first applicable template in both cases corresponded to the disconnection reported in the literature (Figure 8). This further demonstrates the applicability of 'Ring Breaker' to previously unseen ring systems and shows how the approach can be used as an idea generator to explore novel ring-based scaffolds. Furthermore, the literature or patent precedent allows researchers to look up reaction conditions and experimental procedures.

Incorporation into Computer Aided Synthetic Planning Tools
In their current state, template-based synthetic planning tools, which rely on a classification network to predict which templates are applicable in a given context, struggle to differentiate ring-forming reactions from the multitude of other suitable reactions that can be applied to any given compound. This is due to the large number of templates available and the relatively low frequency of ring-forming reactions within the datasets (Table 1). As such, in cases where a ring disconnection may be suitable or may lead to a more efficient synthetic route, the network and subsequent tree search do not often prioritize, apply, and generate synthetic routes which proceed through ring formations. To overcome this problem, the 'Ring Breaker' model can be viewed as a specialist which can be consulted at various stages of the tree search to yield routes that proceed through ring formations, in addition to its use as a standalone tool. In Figure 9 we demonstrate one such use case, whereby the heterocycle is formed as an alternative to its functionalization to yield the queried compound. This methodology, using a domain-specific model in conjunction with a general model, can be extended to other areas of synthetic chemistry in which the data is limited and domain-specific knowledge (i.e. a specialist) is required.
This highlights the utility of the method across the range of common, rare, and previously unseen ring systems, where the tool can be used as an idea generator. Given that the model varies in predictive capability depending on the substitution of the ring system, we established that this originates from the availability of a suitable template and, by association, the underlying dataset. Whilst suitable templates describing the reaction are suggested, they cannot be applied as they do not share an exact substructure match with the query compound.
We propose that the specialized model can be used alongside the 'all-encompassing' model currently used in synthetic planning tools, and as a stand-alone idea generator for proposing retrosynthetic disconnections to a wide range of ring systems, including those previously unseen. This has implications in the pharmaceutical, agrochemical, and dye industries, to name a few, where ring systems are an important and widely used motif at the center of many marketed compounds. 14 Furthermore, we propose that this methodology can be extended to other specialized domains within synthesis planning tasks where the data may be limited and domain-specific knowledge (i.e. a specialist) is required.
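One way a specialist could be consulted alongside a general model during a single tree-search expansion is sketched below. The text does not specify how the two sets of predictions are combined, so everything here is an assumption: the policy callables, the `has_ring` check, the function name `expand`, and the choice to append the specialist's suggestions after the general model's.

```python
def expand(molecule, general_policy, ring_policy, has_ring, top_n=50):
    """Candidate templates for one expansion step of a tree search.

    Both policies are assumed to return templates ordered by decreasing
    predicted probability.
    """
    candidates = list(general_policy(molecule)[:top_n])
    if has_ring(molecule):
        # Consult the specialist only when there is a ring to disconnect.
        for template in ring_policy(molecule)[:top_n]:
            if template not in candidates:
                candidates.append(template)
    return candidates


# Stub policies standing in for trained networks (hypothetical outputs).
def general_policy(molecule):
    return ["general_1", "general_2"]


def ring_policy(molecule):
    return ["ring_1", "general_2"]


with_ring = expand("c1ccoc1", general_policy, ring_policy, lambda m: True)
without_ring = expand("CCO", general_policy, ring_policy, lambda m: False)
```

Appending keeps the general model's ranking intact; interleaving by probability would be an equally plausible design, at the cost of comparing scores from two separately trained networks.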

Reaction Datasets and Template Extraction
The United States Patent and Trademark Office (USPTO) extracts, covering the years 1976 to 2016, are publicly available. 35 They are split into granted patents and patent applications.
The Reaxys 36 dataset is commercially available, provided by Elsevier under licensing agreements. The ring subsets described were obtained from the ZINC database and used as is. 34 All reactions were atom-mapped and classified using the commercially available Filbert and HazELNut packages (v. 3.1.8) provided by NextMove Software. 37 These were subsequently processed using RDKit and RDChiral for template extraction. 38, 39 The bipartite reaction graph was built using NetworkX and queried to yield the multi-label dataset as described previously. 40

Classification Network
The template library was constructed by filtering the respective dataset for templates that occurred a minimum of 3 times. In all cases duplicate reactions were removed prior to filtering. Products were represented as extended connectivity fingerprints (ECFP) with a radius of 2, using the Morgan algorithm in RDKit. 41 Templates were represented as binarized labels in a one-vs-all fashion using the 'LabelBinarizer' from the scikit-learn library. 42 Both the input ECFP4 and output vectors were precomputed.
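The library filtering and label encoding steps can be sketched as below. This is a plain-Python stand-in for illustration, not the scikit-learn call used in the study; the record format (reaction identifier, template identifier) and the function names are hypothetical.

```python
from collections import Counter


def build_template_library(reactions, min_count=3):
    """Deduplicate reactions, then keep templates seen at least `min_count` times."""
    unique_reactions = set(reactions)  # duplicates removed before counting
    counts = Counter(template for _, template in unique_reactions)
    return sorted(t for t, n in counts.items() if n >= min_count)


def encode_labels(templates, library):
    """Binary label vector with one position per library template."""
    index = {template: i for i, template in enumerate(library)}
    vector = [0] * len(library)
    for template in templates:
        if template in index:  # templates outside the library are dropped
            vector[index[template]] = 1
    return vector


# Hypothetical (reaction identifier, template identifier) records.
reactions = [
    ("rxn1", "A"), ("rxn2", "A"), ("rxn3", "A"), ("rxn3", "A"),
    ("rxn4", "B"), ("rxn5", "B"),
    ("rxn6", "C"), ("rxn7", "C"), ("rxn8", "C"),
]

library = build_template_library(reactions)
labels = encode_labels(["A", "C"], library)
```

Template `B` occurs only twice and is therefore excluded, so a product labelled with `A` and `B` would be encoded with the `A` position alone.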
Training, validation, and test sets were constructed as a random 90/5/5 split of the datasets, using a random state of 42, where the datasets were shuffled prior to splitting. This was conducted using the scikit-learn library. 42 The network, framed as a supervised multiclass classification problem, was trained using Keras 43 with TensorFlow 44 as the backend, the Adam optimizer with an initial learning rate of 0.001, 45 and categorical cross entropy as the loss function. The learning rate was decayed on plateau by a factor of 0.5, where a plateau was defined as no improvement of the validation loss after 5 epochs. The top 1, 5, 10, and 50 accuracies were monitored throughout the training process, and the loss on the validation set was used with early stopping (patience 10) to determine the number of epochs for which the model was trained.
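The split and the training schedule described above can be illustrated with a dependency-free sketch. It is a simplified stand-in, not the scikit-learn or Keras implementations: the function names are hypothetical, Python's `random` module replaces the library shuffling (so the resulting partition will differ from the one used in the study), and the plateau logic omits details such as cooldown or minimum improvement thresholds.

```python
import random


def split_90_5_5(samples, seed=42):
    """Shuffle with a fixed seed, then cut into 90/5/5 train/validation/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def simulate_schedule(val_losses, lr=0.001, factor=0.5,
                      lr_patience=5, stop_patience=10):
    """Return (epoch, learning rate) pairs until early stopping triggers."""
    best = float("inf")
    since_best = 0       # epochs since the best validation loss
    since_lr_step = 0    # non-improving epochs since the last decay
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_step = loss, 0, 0
        else:
            since_best += 1
            since_lr_step += 1
            if since_lr_step >= lr_patience:
                lr *= factor  # decay on plateau
                since_lr_step = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            break  # early stopping
    return history


train, val, test = split_90_5_5(range(100))

# Validation loss improves twice, then plateaus.
history = simulate_schedule([1.0, 0.8] + [0.9] * 20)
```

In this toy run the learning rate is halved at epochs 7 and 12, and training stops at epoch 12, ten epochs after the last improvement.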
The standard model deployed within this study was trained as described in our previous work. 29

Declarations
Availability of Data and Materials
Reaxys datasets were used with permission. Filbert, NameRxn and HazELNut were used for atom-mapping and classification under license from NextMove Software. All code used in the production of this work will be made available at: https://github.com/reymond-group/RingBreaker