Comparative Analysis of Anthraquinone and Chalcone Derivatives-Based Virtual Combinatorial Library. A Cheminformatics "Proof-of-Concept"
Description
This computational “proof-of-concept” study illustrates a combinatorial approach for analyzing the molecular diversity of selected natural-product structures. A virtual combinatorial library (1.6M compounds) based on 20 anthraquinones and 24 chalcones was enumerated. The resulting compounds were optimized toward near drug-like properties, and physicochemical descriptors were calculated for all datasets, including the FDA, Non-FDA, and natural products (NPs) datasets from ZINC 15. UMAP and principal component analysis (PCA) were applied to compare and represent the chemical space coverage of each dataset. Subsequently, the Laplacian score and the Gini coefficient were applied for feature selection and for assessing selectivity among properties, respectively. Finally, we assessed the diversity between the datasets using the Murcko and central scaffold systems, calculated three fingerprint descriptors, and analyzed their diversity by PCA and self-organizing maps (SOM). The optimized enumeration resulted in 1,610,268 compounds with mean NP-likeness and synthetic feasibility scores close to those of the FDA, Non-FDA, and NPs datasets. The chemical space of the 1.6M library overlapped most prominently with that of the NPs. The Laplacian score prioritized the NP-likeness and hydrogen bond acceptor properties (1.0 and 0.923, respectively), while the Gini coefficient showed that all properties have selective effects on the datasets (0.81 to 0.93). Scaffold and fingerprint diversity decreased in the order FDA, Non-FDA, NPs, 1.6M. Virtual combinatorial libraries based on NPs can be considered a source of combinatorial compounds with NP-likeness properties. Furthermore, molecular diversity should be measured by several different methods to allow for comparison and better judgment.
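As a rough illustration of the selectivity measure mentioned above, the following minimal sketch computes a Gini coefficient for a list of non-negative property values. The description does not state the exact formulation used in the study, so this assumes the standard rank-based definition; the function name `gini` and the sample values are hypothetical and are not taken from the dataset.

```python
def gini(values):
    """Gini coefficient of non-negative values (standard rank-based formula).

    Assumption: this is the textbook definition, not necessarily the exact
    variant used in the study. Returns 0.0 for a perfectly even distribution
    and approaches 1.0 as the total concentrates in a few entries.
    """
    xs = sorted(values)
    if any(x < 0 for x in xs):
        raise ValueError("Gini coefficient requires non-negative values")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sum of rank * value over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical property values, for illustration only.
even_values = [1.0, 1.0, 1.0, 1.0]      # evenly distributed
skewed_values = [0.0, 0.0, 0.0, 1.0]    # concentrated in one entry

print(gini(even_values))
print(gini(skewed_values))
```

Under this definition, values near 1 indicate that a property is concentrated in a small part of the data, which is the sense in which the reported range of 0.81 to 0.93 is read as selective.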
This link provides an illustration of the whole virtual combinatorial library using the TMAP algorithm, in addition to the complete dataset. TMAP is a recent algorithm for visualizing ultra-large, high-dimensional chemical libraries by structure and physicochemical properties (Probst & Reymond, 2020). It creates and distributes intuitive tree representations of large data sets of arbitrary dimensionality, on the order of 10^7 data points.
To visualize the whole library of compounds, download "index(2).rar", open it in the WinRAR application, and extract the index.html file that appears.