Published December 18, 2024 | Version v1
Dataset Open

Pretrained Transformer Encoder for SMILES Strings

Authors/Creators

  • Yale University

Description

This dataset provides the pretrained parameters for a transformer encoder that extracts feature representations from SMILES strings. The model was pretrained with a masked token prediction objective on a combined corpus of SMILES drawn from several sources: ChEMBL 33 (~2.4M molecules), GuacaMol v1 (~1.6M molecules), MOSES (~1.8M molecules), BindingDB (~1.2M molecules), and PDBbind v2020 (~15,710 molecules).
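The masked-token-prediction setup can be sketched as follows. The record does not specify the tokenizer or masking rate, so the character-level tokenization and the BERT-style 15% masking probability below are illustrative assumptions:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15  # 15% BERT-style masking is an assumption


def tokenize(smiles):
    """Naive character-level tokenizer; the actual vocabulary used for
    pretraining is not specified in this record."""
    return list(smiles)


def mask_tokens(tokens, rng):
    """Replace ~15% of tokens with [MASK]; during pretraining the model is
    trained to recover the original token at each masked position."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            inputs.append(MASK)
            labels.append(tok)   # prediction target at this position
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored by the loss
    return inputs, labels


rng = random.Random(0)
inp, lab = mask_tokens(tokenize("CC(=O)Oc1ccccc1C(=O)O"), rng)  # aspirin
```

The encoder then sees `inp` and is penalized only on the positions where `lab` holds a target token.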

The architecture consists of 10 sequential transformer blocks, each implementing multi-head self-attention followed by position-wise feed-forward layers with Gaussian Error Linear Unit (GELU) activation, residual connections, and layer normalization for stable training. The model produces contextualized embeddings for each token, and the global molecular representation is derived from the start token embedding, which aggregates sequence-wide information.
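The block structure described above can be sketched in numpy. Random matrices stand in for the pretrained weights, and the embedding width, head count, and post-layer-norm ordering are illustrative assumptions, not details taken from this record:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def encoder_block(x, n_heads, rng):
    """One block: multi-head self-attention, then a position-wise
    feed-forward layer, each with a residual connection and layer norm."""
    T, d = x.shape
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
    W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

    # multi-head self-attention: project, split d into n_heads subspaces
    q, k, v = ((x @ W).reshape(T, n_heads, dh).transpose(1, 0, 2)
               for W in (Wq, Wk, Wv))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh)) @ v  # (heads, T, dh)
    att = att.transpose(1, 0, 2).reshape(T, d) @ Wo
    x = layer_norm(x + att)                # residual + layer normalization

    # position-wise feed-forward with GELU activation
    x = layer_norm(x + gelu(x @ W1) @ W2)  # residual + layer normalization
    return x


rng = np.random.default_rng(0)
h = rng.standard_normal((24, 64))          # 24 tokens, 64-dim embeddings (toy sizes)
for _ in range(10):                        # 10 sequential transformer blocks
    h = encoder_block(h, n_heads=8, rng=rng)
mol_repr = h[0]                            # start-token embedding = global representation
```

After the ten blocks, each row of `h` is a contextualized token embedding, and the first row (the start token) serves as the molecule-level feature vector.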

This model is intended for drug discovery applications, including protein-ligand binding affinity prediction, and can serve as a foundational feature extractor for researchers working in cheminformatics, computational biology, and medicinal chemistry.

Files (854.4 MB)

md5:50917f2649192b057c1f24a5ae52f641

Additional details

Software

Programming language
Python