There is a newer version of the record available.

Published August 2, 2022 | Version v1
Dataset Open

Comparative Analysis of Anthraquinone and Chalcone Derivatives-Based Virtual Combinatorial Library. A Cheminformatics "Proof-of-Concept"

  • 1. Universiti Brunei Darussalam

Description

This computational “proof-of-concept” study illustrates a combinatorial approach to analyzing the molecular diversity of selected natural-product structures. A virtual combinatorial library (1.6M) based on 20 anthraquinones and 24 chalcones was enumerated. The resulting compounds were optimized toward near drug-like properties, and physicochemical descriptors were calculated for all datasets, including the FDA, Non-FDA, and natural products (NPs) datasets from ZINC 15. UMAP and principal component analysis (PCA) were applied to compare and represent the chemical space coverage of each dataset. Subsequently, the Laplacian score and the Gini coefficient were applied for feature selection and to assess selectivity among properties, respectively. Finally, we demonstrated the diversity between the datasets by employing Murcko and central scaffold systems, calculating three fingerprint descriptors, and analyzing their diversity by PCA and self-organizing maps (SOM). The optimized enumeration resulted in 1,610,268 compounds with mean NP-likeness and synthetic feasibility scores close to those of the FDA, Non-FDA, and NPs datasets. The chemical space of the 1.6M library overlapped most prominently with that of the NPs dataset. The Laplacian score prioritized the NP-likeness and hydrogen bond acceptor properties (1.0 and 0.923, respectively), while the Gini coefficient showed that all properties have selective effects on the datasets (0.81 to 0.93). Scaffold and fingerprint diversity indicated that the descending order of diversity for the tested datasets was FDA, Non-FDA, NPs, 1.6M. Virtual combinatorial libraries based on NPs can be considered a source of combinatorial compounds with NP-likeness properties. Furthermore, molecular diversity should be measured by several different methods to allow for comparison and better judgment.
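The description reports Gini coefficients per property but does not state how they were computed. As a point of reference only, the following is a minimal sketch of the generic Gini coefficient of a list of non-negative values, using the standard sorted-rank formula. The function name `gini` and the toy inputs are illustrative assumptions and are not taken from the study.

```python
def gini(values):
    """Generic Gini coefficient of non-negative values.

    Illustrative sketch only: this is the textbook definition, not
    necessarily the procedure used to obtain the reported 0.81 to 0.93.
    Returns 0.0 for perfectly equal values and approaches 1.0 as the
    total concentrates in a single value.
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0.0:
        return 0.0
    if xs[0] < 0.0:
        raise ValueError("Gini coefficient requires non-negative values")
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    print(gini([1, 1, 1, 1]))  # all values equal
    print(gini([0, 0, 0, 1]))  # total concentrated in one value
```

For `n` values with the whole total in one entry, this formula gives `(n - 1) / n`, so the upper bound of 1.0 is only approached for large `n`.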

This link provides the complete dataset together with a visualization of the whole virtual combinatorial library generated with the TMAP algorithm. TMAP is a recent algorithm for visualizing the structures and physicochemical properties of ultra-large, high-dimensional chemical libraries (Probst & Reymond, 2020). This approach creates and distributes intuitive tree representations of large data sets, on the order of 10^7, with arbitrary dimensionality.

To visualize the whole library of compounds, download "index(2).rar", then extract the index.html file that appears in the WinRAR application.

Files

Library_total_ligands_emp.csv

Files (498.2 MB)

MD5 checksum | Size
md5:958290f2621c60b6a5ace12794156f19 | 33.2 MB
md5:4c2f4f45ae1400974aecdfece6122987 | 85.7 kB
md5:2d2f4ddc684d7f1249c8ec6256bf6aaf | 290.5 MB
md5:93f0678e6a3bc99c168c936128d49cf0 | 174.5 MB
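The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses only Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are illustrative, and the listing does not state which checksum belongs to which file, so any pairing of a file name with a checksum is an assumption to check against the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:9582...'.

    The optional 'md5:' prefix used in the file listing is stripped,
    and the comparison is case-insensitive.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("Library_total_ligands_emp.csv", "md5:93f0678e6a3bc99c168c936128d49cf0")` would check the CSV file, assuming the last listed checksum corresponds to that file, which the listing does not confirm.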