Published December 18, 2024 | Version v1
Dataset Open

Pretrained Transformer Encoder for SMILES Strings

Authors/Creators

  • Yale University

Description

This dataset provides the pretrained parameters for a transformer encoder that extracts feature representations from SMILES strings. The model was pretrained with a masked token prediction objective on a combined corpus of SMILES drawn from several sources: ChEMBL 33 (~2.4M molecules), GuacaMol v1 (~1.6M molecules), MOSES (~1.8M molecules), BindingDB (~1.2M molecules), and PDBbind v2020 (~15,710 molecules).
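The masked-token-prediction setup can be sketched as follows. The record does not specify the tokenizer or masking rate, so the character-level tokenization and the BERT-style 15% masking probability below are illustrative assumptions:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15  # 15% BERT-style masking is an assumption


def tokenize(smiles):
    """Naive character-level tokenizer; the actual vocabulary used for
    pretraining is not specified in this record."""
    return list(smiles)


def mask_tokens(tokens, rng):
    """Replace ~15% of tokens with [MASK]; during pretraining the model is
    trained to recover the original token at each masked position."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            inputs.append(MASK)
            labels.append(tok)   # prediction target at this position
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored by the loss
    return inputs, labels


rng = random.Random(0)
inp, lab = mask_tokens(tokenize("CC(=O)Oc1ccccc1C(=O)O"), rng)  # aspirin
```

The encoder then sees `inp` and is penalized only on the positions where `lab` holds a target token.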

The architecture consists of 10 sequential transformer blocks, each implementing multi-head self-attention followed by position-wise feed-forward layers with Gaussian Error Linear Unit (GELU) activation, residual connections, and layer normalization for stable training. The model produces contextualized embeddings for each token, and the global molecular representation is derived from the start token embedding, which aggregates sequence-wide information.
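The block structure described above can be sketched in numpy. Random matrices stand in for the pretrained weights, and the embedding width, head count, and post-layer-norm ordering are illustrative assumptions, not details taken from this record:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def encoder_block(x, n_heads, rng):
    """One block: multi-head self-attention, then a position-wise
    feed-forward layer, each with a residual connection and layer norm."""
    T, d = x.shape
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
    W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

    # multi-head self-attention: project, split d into n_heads subspaces
    q, k, v = ((x @ W).reshape(T, n_heads, dh).transpose(1, 0, 2)
               for W in (Wq, Wk, Wv))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh)) @ v  # (heads, T, dh)
    att = att.transpose(1, 0, 2).reshape(T, d) @ Wo
    x = layer_norm(x + att)                # residual + layer normalization

    # position-wise feed-forward with GELU activation
    x = layer_norm(x + gelu(x @ W1) @ W2)  # residual + layer normalization
    return x


rng = np.random.default_rng(0)
h = rng.standard_normal((24, 64))          # 24 tokens, 64-dim embeddings (toy sizes)
for _ in range(10):                        # 10 sequential transformer blocks
    h = encoder_block(h, n_heads=8, rng=rng)
mol_repr = h[0]                            # start-token embedding = global representation
```

After the ten blocks, each row of `h` is a contextualized token embedding, and the first row (the start token) serves as the molecule-level feature vector.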

This model is intended for drug discovery applications, including protein-ligand binding affinity prediction, and can serve as a foundational feature extractor for researchers working in cheminformatics, computational biology, and medicinal chemistry.

Files (854.4 MB)

md5:50917f2649192b057c1f24a5ae52f641

Additional details

Software

Programming language
Python