Pretrained Transformer Encoder for SMILES Strings
Description
This dataset provides the pretrained parameters for a transformer encoder designed to extract feature representations from SMILES strings. The model was pretrained with a masked token prediction objective on a comprehensive dataset combining SMILES from several sources: ChEMBL 33 (~2.4M molecules), GuacaMol v1 (~1.6M molecules), MOSES (~1.8M molecules), BindingDB (~1.2M molecules), and PDBbind v2020 (~15,710 molecules).
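As a rough illustration of the pretraining objective (not the exact pipeline used for this checkpoint), a masked token prediction step on a tokenized SMILES string might look like the sketch below; the tokenizer, vocabulary, and 15% mask ratio are illustrative assumptions.

```python
# Minimal sketch of masked-token pretraining input preparation for SMILES.
# The tokenizer, vocabulary, and mask ratio are assumptions, not the exact
# setup used for this checkpoint.
import random

VOCAB = ["[PAD]", "[START]", "[MASK]", "C", "c", "N", "n", "O", "o",
         "(", ")", "=", "#", "1", "2", "3", "F", "Cl", "Br", "S"]
TOKEN_TO_ID = {t: i for i, t in enumerate(VOCAB)}

def tokenize(smiles):
    """Greedy character-level tokenization, keeping two-character halogens intact."""
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in TOKEN_TO_ID:       # e.g. 'Cl', 'Br'
            tokens.append(smiles[i:i + 2]); i += 2
        else:
            tokens.append(smiles[i]); i += 1
    return tokens

def mask_tokens(token_ids, mask_ratio=0.15, mask_id=TOKEN_TO_ID["[MASK]"]):
    """Randomly replace a fraction of tokens with [MASK]; return inputs and labels."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for pos in range(len(token_ids)):
        if random.random() < mask_ratio:
            labels[pos] = token_ids[pos]
            inputs[pos] = mask_id
    return inputs, labels

ids = [TOKEN_TO_ID["[START]"]] + [TOKEN_TO_ID[t] for t in tokenize("CC(=O)Nc1ccccc1")]
masked_inputs, labels = mask_tokens(ids)
print(masked_inputs, labels)
```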
The architecture consists of 10 sequential transformer blocks. Each block applies multi-head self-attention followed by a position-wise feed-forward layer with Gaussian Error Linear Unit (GELU) activation, with residual connections and layer normalization for stable training. The model produces contextualized embeddings for each token, and the global molecular representation is taken from the start token embedding, which aggregates sequence-wide information.
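The following is a minimal PyTorch sketch of an encoder matching that description (10 blocks, multi-head self-attention, GELU feed-forward, residuals and layer normalization, start-token pooling). The hidden size, head count, vocabulary size, and maximum sequence length are assumptions, not the checkpoint's actual hyperparameters.

```python
# Sketch of a SMILES transformer encoder as described above.
# d_model, n_heads, vocab_size, and max_len are illustrative assumptions.
import torch
import torch.nn as nn

class SmilesEncoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=512, n_heads=8,
                 n_layers=10, max_len=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)   # self-attention + GELU FFN,
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)  # 10 blocks

    def forward(self, token_ids, padding_mask=None):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.blocks(x, src_key_padding_mask=padding_mask)
        # Start-token embedding as the global molecular representation,
        # plus per-token contextualized embeddings.
        return x[:, 0], x

encoder = SmilesEncoder()
dummy = torch.randint(0, 100, (2, 32))        # batch of 2 token-ID sequences
mol_vec, token_states = encoder(dummy)
print(mol_vec.shape, token_states.shape)      # (2, 512) and (2, 32, 512)
```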
This model has been optimized for drug discovery applications, including protein-ligand binding affinity prediction, and can serve as a foundational tool for researchers working on cheminformatics, computational biology, and medicinal chemistry.
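For downstream use, the downloaded parameters can be loaded into an encoder and used to extract molecular features for tasks such as binding affinity prediction. The sketch below reuses the SmilesEncoder class from above; the checkpoint file name and state-dict key layout are hypothetical, so check the archive contents for the actual names.

```python
# Hypothetical feature-extraction usage of the downloaded checkpoint.
# File name and state-dict keys are assumptions.
import torch

encoder = SmilesEncoder()                                  # class sketched above
state_dict = torch.load("pretrained_smiles_encoder.pt",    # hypothetical file name
                        map_location="cpu")
encoder.load_state_dict(state_dict, strict=False)          # tolerate naming differences
encoder.eval()

with torch.no_grad():
    mol_vec, _ = encoder(torch.randint(0, 100, (1, 32)))   # features for a downstream head
```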
Files (854.4 MB)

| Size | MD5 |
|---|---|
| 854.4 MB | 50917f2649192b057c1f24a5ae52f641 |
Additional details
Software
- Programming language: Python