Published October 12, 2024 | Version 2.0
Dataset Open

Chemical Language Model Linker datasets and models

Description

Dataset

This record provides the unfiltered version of the PubChem dataset used for evaluation in the Chemical Language Model Linker (ChemLML) manuscript. The original data come from the PubChem database. If you use this dataset, please see the PubChem download policies and citation guidelines.

There are entries for 257,619 chemicals, each with the following fields:

  • description
  • Name
  • CID
  • ANID
  • SMILES
  • SELFIES
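As a quick sanity check after loading the data, the field list above can be turned into a small validation helper. This is only a minimal sketch: the on-disk file format is not stated here, so the example assumes each entry has already been parsed into a Python dictionary. The function name `missing_fields` and the placeholder record are hypothetical and are not taken from the dataset.

```python
# Field names as listed in the dataset description.
EXPECTED_FIELDS = ("description", "Name", "CID", "ANID", "SMILES", "SELFIES")


def missing_fields(record):
    """Return the expected field names that are absent from one entry."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Hypothetical placeholder entry for illustration only (not real dataset content).
example = {
    "description": "placeholder description",
    "Name": "placeholder name",
    "CID": 0,
    "ANID": 0,
    "SMILES": "C",
    "SELFIES": "[C]",
}

# An empty list means every expected field is present.
print(missing_fields(example))
```

Running the helper over every parsed entry is one way to confirm that a download or conversion step did not drop a column.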

The PubChem dataset is available under the Creative Commons Zero v1.0 Universal license.

Relevant citations:

  • S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E. E. Bolton, PubChem 2023 update. Nucleic Acids Research 51, D1373–D1380 (2023).
  • Y. Deng, S. S. Ericksen, A. Gitter, Chemical Language Model Linker: blending text and molecules with modular adapters. Journal of Chemical Information and Modeling (2025).

Models

The `.pth` files are saved PyTorch models. The filenames correspond to the ChemLML models in Table 1 of the ChemLML manuscript. These ChemLML models use the following models from Hugging Face:
  • MolT5
  • Text+Chem T5
  • MolGen
  • MolGen-7B
  • Fine-tuned LLaMA2-7B
  • SciBERT
  • Galactica
  • MolXPT

See the Hugging Face model cards for the original models' licenses, limitations, and citations.

The models are available under the Creative Commons Attribution 4.0 International license.

Files

Files (4.0 GB)

Checksum                                Size
md5:68943ef8d1cb1cb0d0dc1ba6d51805ff    665.2 MB
md5:04beabe042ab6b2635b6d89740daad81    18.9 MB
md5:515da00444c4717a42342c2a9d8e04e2    457.4 MB
md5:79032dcd992866cd1f5c3201b6d64c6f    18.9 MB
md5:13bacb56f6721845263c498daec99c4b    457.4 MB
md5:944bd8aadc90f2eae25726e3cc328e17    268.6 MB
md5:8da8717164f22764b31f98111ab55ff1    29.4 MB
md5:a5ad1eed02db91fe66ae6a431dbaeaca    726.9 MB
md5:00454f22a975cb26a54928e124af1dda    18.9 MB
md5:6b31430fad884ddf552f22f3d70ca2ce    519.1 MB
md5:589bb6ef94979fbb6fbe835c5e07d78c    29.4 MB
md5:668f6e4d5b63a479570d7a0a9a6b2ff6    226.7 MB
md5:a32890bb6ac6b1b841a023cfbf1f9e63    18.9 MB
md5:c51ef87710d894878816582f8f70b934    458.7 MB
md5:52632b7848364347d6e805af141261c2    135.2 MB
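
Each file in the listing carries an MD5 checksum, which can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_md5` are illustrative assumptions, and the local path passed in is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a local file against a listed checksum.

    Accepts either a bare hex digest or the "md5:<hex>" form used in the listing.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5(local_path, "md5:52632b7848364347d6e805af141261c2")` would return `True` only if the file at `local_path` is byte-for-byte identical to the 135.2 MB file in the listing.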

Additional details

Related works

Is supplement to
Software: https://github.com/gitter-lab/ChemLML (URL)
Preprint: 10.48550/arXiv.2410.20182 (DOI)
Journal article: 10.1021/acs.jcim.5c00853 (DOI)

References

  • Yifan Deng, Spencer S. Ericksen, Anthony Gitter. Chemical Language Model Linker: blending text and molecules with modular adapters. Journal of Chemical Information and Modeling 2025. doi:10.1021/acs.jcim.5c00853