Comparative Analysis of Anthraquinone and Chalcone Derivatives-Based Virtual Combinatorial Library. A Cheminformatics "Proof-of-Concept"

Moshawih, Said

doi:10.5281/zenodo.6950712

Published August 2, 2022 | Version v1

Dataset Open

Affiliation: Universiti Brunei Darussalam

This computational “proof-of-concept” study illustrates a combinatorial approach to analyzing the molecular diversity of selected natural product structures. A virtual combinatorial library (1.6M) based on 20 anthraquinones and 24 chalcones was enumerated. The resulting compounds were optimized toward near drug-like properties, and physicochemical descriptors were calculated for all datasets, including the FDA, Non-FDA, and natural products (NPs) datasets from ZINC 15. UMAP and principal component analysis (PCA) were applied to compare and represent the chemical space coverage of each dataset. Subsequently, the Laplacian score and the Gini coefficient were applied for feature selection and to assess selectivity among properties, respectively. Finally, we demonstrated the diversity between the datasets by employing Murcko and central scaffold systems, calculating three fingerprint descriptors, and analyzing their diversity by PCA and self-organizing maps (SOM). The optimized enumeration resulted in 1,610,268 compounds with NP-likeness and synthetic feasibility mean scores close to those of the FDA, Non-FDA, and NPs datasets. The chemical space of the 1.6M library overlapped most prominently with that of the NPs dataset. The Laplacian score prioritized the NP-likeness and hydrogen bond acceptor properties (1.0 and 0.923, respectively), while the Gini coefficient showed that all properties have selective effects on the datasets (0.81 to 0.93). Scaffold and fingerprint diversity indicated that the descending order of diversity for the tested datasets was FDA, Non-FDA, NPs, 1.6M. Virtual combinatorial libraries based on NPs can be considered a source of combinatorial compounds with NP-likeness properties. Furthermore, molecular diversity should be measured by several different methods to allow for comparison and better judgment.
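The description does not state how the Gini coefficient was computed for each property, so the following is only a minimal sketch of the textbook definition for a list of non-negative values, not the author's procedure. The function name `gini` and the example inputs are hypothetical. A result of 0 means the values are spread perfectly evenly, and values approaching 1 mean they are concentrated in a few entries.

```python
def gini(values):
    """Textbook Gini coefficient of a collection of non-negative numbers.

    Assumption: this is the standard rank-based formula, which may differ
    from the procedure used to obtain the 0.81 to 0.93 values in the study.
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    if n == 0:
        return 0.0
    if xs[0] < 0:
        raise ValueError("Gini coefficient is defined here for non-negative values only")
    total = sum(xs)
    if total == 0:
        return 0.0  # all values are zero, so the distribution is perfectly even
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical inputs: an even distribution and a fully concentrated one.
even = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 1])
```

With four values, the fully concentrated case reaches (n - 1)/n = 0.75 rather than 1, which is a known property of the unnormalized formula on small samples.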

This link provides an illustration of the whole virtual combinatorial library, generated with the TMAP algorithm, together with the complete dataset. TMAP is a recent algorithm for visualizing ultra-large, high-dimensional chemical libraries by structure and physicochemical properties (Probst & Reymond, 2020). It creates and distributes intuitive tree representations of big data sets of arbitrary dimensionality containing on the order of 10⁷ data points.

To visualize the whole library of compounds, download "index (2).rar", open it in WinRAR, and extract the index.html file that appears in the archive listing.

Files (498.2 MB)

Name	MD5	Size
index (2).rar	958290f2621c60b6a5ace12794156f19	33.2 MB
index.html	4c2f4f45ae1400974aecdfece6122987	85.7 kB
index.js	2d2f4ddc684d7f1249c8ec6256bf6aaf	290.5 MB
Library_total_ligands_emp.csv	93f0678e6a3bc99c168c936128d49cf0	174.5 MB
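Because two of the files are hundreds of megabytes, it may be worth checking each download against the MD5 values listed above before use. The sketch below uses only Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are hypothetical, and the code assumes the files were saved under the names shown in the listing.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "index (2).rar": "958290f2621c60b6a5ace12794156f19",
    "index.html": "4c2f4f45ae1400974aecdfece6122987",
    "index.js": "2d2f4ddc684d7f1249c8ec6256bf6aaf",
    "Library_total_ligands_emp.csv": "93f0678e6a3bc99c168c936128d49cf0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify("Library_total_ligands_emp.csv", EXPECTED_MD5["Library_total_ligands_emp.csv"])` should return `True` for an intact download saved in the current directory. A `False` result suggests the file is incomplete or was saved under a different name than assumed here.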

	All versions	This version
Views	280	110
Downloads	36	26
Data volume	5.8 GB	5.4 GB
