Published June 17, 2025 | Version v2
Dataset Open

Datasets, alphabets and models from the paper 'Reverse Engineering Molecules from Fingerprints through Deterministic Enumeration and Generative Models'

  • 1. National Research Institute For Agriculture, Food And Environment
  • 2. University of Manchester

Description

Files utilized and produced within the molecule-signature project:

  • alphabets.zip: Alphabets of molecule signatures built from MetaNetX, eMolecules and ChEMBL.
  • datasets.zip: Datasets from MetaNetX, eMolecules, DrugBank and MolForge used to build alphabets, train generative models, and evaluate methods.
  • models.zip: PyTorch/Lightning models and SentencePiece tokenization models for decoding SMILES from ECFP.

See the embedded README.md files and the publication for in-depth details.
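
The embedded README.md files can be located without unpacking a whole archive. The following is a minimal sketch using only the Python standard library. The helper names `find_readmes` and `read_member_text` are illustrative and not part of the molecule-signature project, and the internal layout of the archives is not described here, so the sketch simply searches every entry by file name.

```python
import zipfile
from pathlib import PurePosixPath


def find_readmes(archive_path):
    """Return the names of all README.md entries inside a zip archive.

    Hypothetical helper: the archive layout is not assumed, so every
    entry is checked by its final path component.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).name.lower() == "readme.md"
        ]


def read_member_text(archive_path, member, encoding="utf-8"):
    """Read a single archive member as text without extracting the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as handle:
            return handle.read().decode(encoding)
```

For example, once alphabets.zip has been downloaded to the working directory, `find_readmes("alphabets.zip")` returns the README paths it contains, and `read_member_text("alphabets.zip", path)` returns the text of one of them.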

Files

Files (3.5 GB)

Size and checksum of each file:

  • 120.5 MB, md5:44c8f21ab7425fc529d35734444d5346
  • 3.0 GB, md5:c0d1c47aaa075a52ea7e448c4f8d6423
  • 470.7 MB, md5:bb975a8cdec84297c92d23c85f92cddf
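
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using the Python standard library. The helper names `md5_of_file` and `matches_checksum` are illustrative, and because this listing does not state which checksum belongs to which archive, the expected value has to be matched to the file by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    The expected value may carry the "md5:" prefix used in the listing.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum("path/to/download.zip", "md5:44c8f21ab7425fc529d35734444d5346")` returns `True` only if the file at that (hypothetical) path has the given digest.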

Additional details

Funding

Agence Nationale de la Recherche

  • Galaxy-BioProd: An operating portal for the production of biosourced products (ANR-22-PEBB-0008)
  • GENCI (ANR-17-EQPX-0001)
  • IFB (ex Renabi-IFB): Institut français de bioinformatique (ANR-11-INBS-0013)

Software

  • Repository URL: https://github.com/brsynth/molecule-signature-paper
  • Programming language: Python
  • Development status: Active