Published October 13, 2020 | Version 1.0
Dataset Restricted

Leffingwell Odor Dataset

  • 1. Google Research, Brain Team
  • 2. School of Life Sciences, Arizona State University
  • 3. Department of Computer Science, University of Toronto

Description

NOTE: It's easier to download this dataset from pyrfume. Here's how:

# First install pyrfume in your Python environment. This can be done easily with pip.
# pip install pyrfume

import pyrfume
molecules = pyrfume.load_data('leffingwell/molecules.csv', remote=True)
behavior = pyrfume.load_data('leffingwell/behavior.csv', remote=True)
# e.g. to count the number of molecules with each descriptor
behavior.sum().sort_values(ascending=False).astype(int)  

Predicting properties of molecules is an area of growing research in machine learning, particularly as models for learning from graph-valued inputs improve in sophistication and robustness. A molecular property prediction problem that has received comparatively little attention during this surge in research activity is building Structure-Odor Relationships (SOR) models (as opposed to Quantitative Structure-Activity Relationships, a term from medicinal chemistry). This is a 70+ year-old problem straddling chemistry, physics, neuroscience, and machine learning.

To spur development on the SOR problem, we curated and cleaned a dataset of 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database. We provide featurizations of all molecules in the dataset using bit-based and count-based fingerprints, Mordred molecular descriptors, and the embeddings from our trained GNN model (Sanchez-Lengeling et al., 2019). The dataset consists of four files:

  1. leffingwell_data.csv: contains the molecular structures and their odor descriptors, along with train, test, and cross-validation splits. The file structure is described in more detail in leffingwell_readme.pdf.
  2. leffingwell_embeddings.npz: contains several featurizations of the molecules in the dataset.
  3. leffingwell_readme.pdf: a more detailed description of the data and its provenance, including expected performance metrics.
  4. LICENSE: a copy of the CC-BY-NC license language.
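
For readers who have obtained the files directly, the two data files listed above could be opened along the following lines. This is a minimal sketch and not part of the dataset's own tooling: the helper names load_leffingwell and summarize are hypothetical, the files are assumed to sit together in one local directory, and the arrays in the .npz archive are assumed to be plain numeric arrays. No column or array names are assumed, since those are documented in leffingwell_readme.pdf.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_leffingwell(data_dir):
    """Load both data files from a local directory (hypothetical helper)."""
    data_dir = Path(data_dir)
    # Table of molecular structures, odor descriptors, and split assignments.
    table = pd.read_csv(data_dir / "leffingwell_data.csv")
    # An .npz archive maps array names to arrays. Copy them into a dict so
    # the archive can be closed. Assumption: the arrays are numeric, so the
    # default allow_pickle=False is sufficient.
    with np.load(data_dir / "leffingwell_embeddings.npz") as archive:
        features = {name: archive[name] for name in archive.files}
    return table, features


def summarize(table, features):
    """Return the table's column names and the shape of each feature array."""
    shapes = {name: array.shape for name, array in features.items()}
    return list(table.columns), shapes
```

Calling summarize(*load_leffingwell("path/to/files")) is a quick way to see which columns and featurizations are present before consulting the readme for their meaning.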

The dataset, and all associated features, is freely available for research use under the CC-BY-NC license.

If you use the data in a publication, please cite:

@article{sanchez2019machine,
  title={Machine learning for scent: Learning generalizable perceptual representations of small molecules},
  author={Sanchez-Lengeling, Benjamin and Wei, Jennifer N and Lee, Brian K and Gerkin, Richard C and Aspuru-Guzik, Al{\'a}n and Wiltschko, Alexander B},
  journal={arXiv preprint arXiv:1910.10685},
  year={2019}
}

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

Access to these files must be requested. A request is accepted only under the following condition:

This dataset is available for free non-commercial use under the CC-BY-NC license.


Additional details

Related works

Is compiled by
Preprint: arXiv:1910.10685 (arXiv)

References

  • Sanchez-Lengeling et al. (2019). Machine Learning for Scent: Learning Generalizable Perceptual Representations of Small Molecules. arXiv:1910.10685