Published May 16, 2024 | Version v3
Dataset Open

Stereoisomers are not Machine Learning's Best Friends: Experimental results of the prediction of the association constant between a cyclodextrin and a guest with Stereo2vec

Description

This study addresses the challenge of accurately identifying stereoisomers in cheminformatics, a challenge that arose from our objective of applying machine learning to predict the association constant between a cyclodextrin and a guest. Identifying stereoisomers is crucial for machine learning applications of this kind. Current tools offer various molecular descriptors, including the textual Isomeric SMILES representation, which can distinguish stereoisomers. However, this representation is text-based and has no fixed size, so it must be converted before machine learning approaches can use it. Word embedding techniques can perform this conversion. Mol2vec, a word embedding approach for molecules, offers such a conversion. Unfortunately, it cannot distinguish between stereoisomers because it does not capture the spatial configuration of molecular structures. This study proposes several approaches that use word embedding techniques to discriminate between molecules, either by using the stereochemical information of the molecules or by treating Isomeric SMILES notation as text, as in Natural Language Processing. Our aim is to generate a distinct vector for each unique molecule, so that stereoisomer information is correctly identified. The proposed approaches are then compared on our original machine learning task: predicting the association constant between a cyclodextrin and a guest molecule.
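
To illustrate the point about stereoisomers, here is a minimal Python sketch of why a text-level view of Isomeric SMILES can separate them while a representation that ignores stereo markers cannot. It is not the Stereo2vec implementation or Mol2vec itself: the `tokenize` and `strip_stereo` helpers and the simplified regular expression are assumptions made for illustration, and the stripped strings serve only as a text-level comparison, not as chemically validated SMILES.

```python
import re

# Simplified SMILES tokenizer (an illustrative assumption, not the authors'
# tokenizer): bracket atoms, Br/Cl, two-digit ring closures, single-letter
# atoms, ring-closure digits, and bond/branch symbols each become one token.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#()/\\+\-.@:~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens ("words")."""
    return TOKEN_PATTERN.findall(smiles)


def strip_stereo(smiles):
    """Crudely remove chirality tags (@, @@) and cis/trans markers (/ and \\).

    This mimics, at the text level only, a representation that is blind to
    stereochemistry.
    """
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


# Two enantiomers of alanine: the strings differ only in the chirality tag.
ala_a = "N[C@@H](C)C(=O)O"
ala_b = "N[C@H](C)C(=O)O"

# Two cis/trans isomers of 1,2-difluoroethene: they differ in one bond marker.
dfe_a = "F/C=C/F"
dfe_b = "F/C=C\\F"

for first, second in [(ala_a, ala_b), (dfe_a, dfe_b)]:
    distinct_with_stereo = tokenize(first) != tokenize(second)
    distinct_without = tokenize(strip_stereo(first)) != tokenize(strip_stereo(second))
    print(first, second, distinct_with_stereo, distinct_without)
```

With the stereo markers kept, each pair yields two different token sequences, so an embedding trained on these tokens has the information it needs to assign distinct vectors. Once the markers are removed, both members of a pair collapse to the same string.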

Files

Base Features_LGBM.csv

Files (4.0 MB)

Checksum  Size
md5:6c83f3a4dc237b51fdc9ba6eca981d4a  188.4 kB
md5:603ebc87dc09738afd85fb7f0519c4a0  188.4 kB
md5:c43dffcc06a69fa90a89f032c4938b84  188.4 kB
md5:7338a973becb31bfd290d30ef6e97e48  3.0 kB
md5:398a2b8407f3a4863efdb4a426d7bdc0  188.4 kB
md5:92951d0200012c012e83fbbc4016b0d3  188.4 kB
md5:47c93a7e722ef82131f68a8d29ac06d0  188.4 kB
md5:3670963e57f7d3c7cfd354e295a402c8  188.4 kB
md5:7adc459a35b2f99539d8ae3151825eb3  188.4 kB
md5:08ed0f842ed06a74ff98bae73a72db3e  188.4 kB
md5:e5ba0e560e3d31286e652f4f5bd1cd1f  188.4 kB
md5:3581367c48ccc40a3571a62a9a5c9f60  188.5 kB
md5:1c34665b08587a5aea5c37f3fd532152  188.4 kB
md5:e35843fb855af0c25e926f144f6db512  1.5 kB
md5:99307da02ff083cb16bab08704cd8aea  188.4 kB
md5:a5429e5198648a76b518d6387ecf904b  188.4 kB
md5:51cd23349ace5e252d1fbde5540b9725  188.4 kB
md5:f2620d32667f9df89ae59a8127c979dc  188.4 kB
md5:daa4e76bc5748ca92efc75ebbdb14a60  188.4 kB
md5:3926627e298e578505443971812b8480  188.4 kB
md5:49e680a0ea9614911e90b4860e068c47  188.4 kB
md5:7a2b57de0b9b972d89d33130f84725ab  188.4 kB
md5:9febc692515372211f4e76c7017877b0  188.4 kB
md5:67b34b5fad23bdef47931d91c7aa2c71  740 Bytes