Published December 5, 2025 | Version v3
Publication Open

Tokenization for Molecular Foundation Models

  • 1. University of Michigan–Ann Arbor

Description

Data Drop and Source Code for Tokenization for Molecular Foundation Models

File: Contents

smirk-0.1.1.tar.gz: Source code for the Smirk tokenizer.

TokenizerStats.tar.gz: Source code for the paper's plots, tokenizer analysis, substring ambiguities, and fine-tuned models.

ngram_tokenizer_stats.tar.gz: N-gram models and tabulated cross-entropy and information loss for all evaluated tokenizers, serialized using JLD2. The source code for loading and working with these files is in `TokenizerStats.tar.gz`. Summary statistics, fixed-effects models, and coverage statistics are also provided. Decompresses to ~30 GB.

tmQM.tar.qz: Source code and the generated property-prediction dataset constructed from the tmQM dataset. The dataset is provided in Apache Arrow format, with molecules transcoded into OpenSMILES from the source XYZ files, and can be read with HuggingFace's load_dataset function.

safetensors-models.tar.xz: Safetensors checkpoints for all 270 pre-trained and fine-tuned models trained as part of this work. Instructions for loading these models are provided in TokenizerStats.tar.gz. Decompresses to ~31.7 GiB.

pubchem_ambiguous_substrings.csv.xz: Collection of molecules with ambiguous substrings (e.g., Sc, Cn, Sn) retrieved from PubChem. Generation details are provided in the paper's supporting information. The filtering script (check_ambi_smiles.py) is included in TokenizerStats.tar.gz.
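To illustrate what an ambiguous substring looks like: OpenSMILES requires metals to be written as bracket atoms, so an unbracketed Sc reads as sulfur followed by an aromatic carbon, while a tokenizer that greedily matches element symbols could read the same two characters as scandium. The same holds for Cn (carbon then aromatic nitrogen, or copernicium) and Sn (sulfur then aromatic nitrogen, or tin). The sketch below flags such substrings outside bracket atoms. It is an illustration only, not the check_ambi_smiles.py script shipped in TokenizerStats.tar.gz; the function name and the three-entry list are assumptions made for this example.

```python
import re

# Two-letter strings that are element symbols but also arise in SMILES as an
# aliphatic atom followed by an aromatic atom. Illustrative subset only.
AMBIGUOUS = ("Sc", "Cn", "Sn")

# Matches a bracket atom such as [Sc] or [Sn+2].
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")


def ambiguous_substrings(smiles: str) -> list[str]:
    """Return the ambiguous substrings that occur outside bracket atoms.

    Hypothetical helper for illustration; not part of the released code.
    """
    # Blank out bracket atoms so an explicit [Sc] is not counted.
    outside = BRACKET_ATOM.sub(" ", smiles)
    return [token for token in AMBIGUOUS if token in outside]


if __name__ == "__main__":
    for smiles in ("CSc1ccccc1", "Cn1cccc1", "[Sc]", "CCO"):
        print(smiles, ambiguous_substrings(smiles))
```

Blanking bracket atoms before searching keeps explicitly bracketed metals out of the result, so only the unbracketed cases are reported.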


Files (48.6 GB)

Checksum Size
md5:5ff4f735400fec28f09a584cfda0218b 20.3 GB
md5:2fceeb45c4fc40e4b50d4c8febbf8e36 27.0 MB
md5:422b2f8afb9dd7f9c305920e4875ecbe 27.9 GB
md5:a8204551bec8a4c2c5f61c78bcae28e1 40.1 kB
md5:e06483086d6a6474fae81b7492859d1f 401.7 MB
md5:cdb161c2b8860bd6d358990cf852a87a 4.2 MB
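
The MD5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch using only the Python standard library follows; the function names are hypothetical, and because the listing does not state which checksum belongs to which archive, the caller has to supply the matching pair.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte archives are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:5ff4f735...'."""
    expected = listed.removeprefix("md5:").strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_file", "md5:...")` returns True only when the local file hashes to the listed value; the path here is a placeholder.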

Additional details

Software

Repository URL
https://github.com/BattModels/smirk
Programming language
Python, Rust, Julia