Published February 13, 2024 | Version v2
Model Open

Stereoisomers are not Machine Learning's Best Friends: Stereo2vec Models

Description

This study addresses the challenge of accurately identifying stereoisomers in cheminformatics, a challenge that arose from our goal of applying machine learning to predict the association constant between a cyclodextrin and a guest molecule. Distinguishing stereoisomers is crucial for such machine learning applications. Current tools offer various molecular descriptors, including the Isomeric SMILES textual representation, which can distinguish stereoisomers. However, this representation is text-based and has no fixed size, so it must be converted before machine learning approaches can use it. Word embedding techniques can perform this conversion. Mol2vec, a word embedding approach for molecules, offers such a conversion, but it cannot distinguish between stereoisomers because it does not capture the spatial configuration of molecular structures. This study proposes several approaches that use word embedding techniques to discriminate between molecules, either by using the stereochemical information of the molecules or by treating the Isomeric SMILES notation as text in the Natural Language Processing sense. Our aim is to generate a distinct vector for each unique molecule, so that stereoisomer information is correctly captured. The proposed approaches are then compared on our original machine learning task: predicting the association constant between a cyclodextrin and a guest molecule.
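To illustrate why treating Isomeric SMILES as text can preserve stereochemistry, the sketch below splits two enantiomeric SMILES strings into tokens, the "words" that a word embedding model would be trained on. It is a minimal illustration and not the Stereo2vec implementation: the regular expression is a simplified SMILES tokenizer, the function names `tokenize` and `strip_stereo` are hypothetical, and the two alanine enantiomers are used only as a convenient example pair.

```python
import re

# Simplified SMILES tokenizer (an assumption for illustration, not the
# tokenizer used by the authors). Bracket atoms such as [C@@H] are kept
# whole, so chirality marks stay attached to their atom.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()=#\-+\\/.:~*]|%\d{2}|\d"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def strip_stereo(tokens):
    """Drop stereo marks: '@' inside bracket atoms and '/' or '\\' bond marks."""
    out = []
    for tok in tokens:
        if tok in ("/", "\\"):
            continue
        out.append(tok.replace("@", "") if tok.startswith("[") else tok)
    return out


# Two enantiomers of alanine written as Isomeric SMILES.
enantiomer_a = tokenize("C[C@@H](C(=O)O)N")
enantiomer_b = tokenize("C[C@H](C(=O)O)N")

# The token sequences differ only in the chiral atom token.
print(enantiomer_a)
print(enantiomer_b)
print(enantiomer_a != enantiomer_b)                              # True
print(strip_stereo(enantiomer_a) == strip_stereo(enantiomer_b))  # True
```

Once the stereo marks are removed, the two sequences become identical, which is the kind of information loss that prevents a stereo-unaware representation from separating the two molecules.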

Files

LICENCE.md

Files (13.5 GB)

Checksum (MD5) Size
md5:f9f615ae90190f8347c5ae6a392351b7 660.8 MB
md5:b6de69f241cda5615c133bcfbc54d692 398.6 MB
md5:d9f0ce963c83ed5acb74c6eada4bba57 80.0 MB
md5:515931a6c01adad626d6f1777706bf8c 398.6 MB
md5:e8e60d5d37466c972ff425b79fbde07c 78.3 kB
md5:0c35559b2422359df4f76c3ae3ae1795 2.4 GB
md5:47dbadbf9711c6ac2b328f2db0554e90 78.0 kB
md5:05a6bb68ce2d2faa3fc87a56294fd2d1 26.6 kB
md5:e96518aedd9b7c50566b02d8a62eb99f 680.0 MB
md5:e35843fb855af0c25e926f144f6db512 1.5 kB
md5:212f52cc42a2fa1b059ce102555ee37a 10.8 MB
md5:e70abb61c5f3ccaef3876c192492e24b 2.4 GB
md5:e159751ca7c730242fd8b23078f83fcf 10.8 MB
md5:8f15a128905a82d58da28bee4744c878 13.3 MB
md5:ad9d9a1fa614ee4031d0aab8c08c68df 2.4 GB
md5:bb6d28ef089b85228dd1ecd564558dad 12.2 MB
md5:0620ecf6e7013a743ff5a715508be9bf 7.8 MB
md5:ea5c0c3877560bb09f6ae03869412fd9 1.6 GB
md5:9da2787d034ebcbccfad40ecfb022a7f 11.7 MB
md5:46a9a2046d4ac076d53749d9cd586504 2.4 GB
md5:1b58401724409529005fcbfcf5c24e2e 10.9 MB
md5:2baa7254018775a1718c9a9778de16b4 1.3 kB
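
Because several of the files are multi-gigabyte downloads, it can be worth checking each one against its listed MD5 checksum. The following is a minimal sketch using only the Python standard library. The helper names `file_md5` and `matches_listing` are hypothetical, and the path in the usage note is a placeholder, since the file names are not given in this listing.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return file_md5(path) == expected
```

For example, `matches_listing("path/to/downloaded_file", "md5:f9f615ae90190f8347c5ae6a392351b7")` would return `True` only if the downloaded file is byte-for-byte identical to the 660.8 MB file in the listing.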