Data and weights for 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'

Chithrananda, Seyone; Amores Fernandez, Judith; Yang, Kevin

doi:10.1101/2024.09.16.613334

Published May 8, 2026 | Version 2

Publication Open

1. University of California, Berkeley
2. Microsoft Research New England (United States)

Mapping the combinatorial coding between olfactory receptors and perception with deep learning — data and model checkpoints (v2)

Description

Datasets and model checkpoints associated with 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'. This record provides the datasets and artifacts that are too large for the GitHub repository but are required to reproduce the percept-prediction results without re-running upstream MolOR inference. Download the files and place them under data/ in the repository, preserving the subfolder structure.

Abstract

The sense of smell remains poorly understood, especially in contrast to visual and auditory coding. At the core of our sense of smell is the olfactory information flow, in which odorant molecules activate a subset of our olfactory receptors and combinations of unique receptor activations code for unique odors. Understanding this relationship is crucial for unraveling the mysteries of human olfaction and its potential therapeutic applications. Despite this, predicting molecule-OR interactions remains incredibly difficult. Here, we develop a novel, biologically-inspired approach denoted MolOR that first maps odorant molecules to their respective olfactory receptor (OR) activation profiles and subsequently predicts their odor percepts. Despite a lack of overlap between molecules with OR activation data and percept annotations, our joint model improves percept prediction by leveraging the OR activation profile of each odorant as auxiliary features in predicting its percepts. We extend this cross receptor-percept approach, showing that sets of molecules with very different structures but similar percepts, a common challenge for chemosensory prediction, have similar predicted OR activation profiles. Lastly, we further probe the odorant-OR model’s predictive ability, showing it can distinguish binding patterns across unique OR families, as well as between protein-coding genes and frequently occurring pseudogenes in the human olfactory subgenome. This work may aid in the potential discovery of novel odorant ligands targeting functions of orphan ORs, and in further characterizing the relationship between chemical structures and percepts. In doing so, we hope to advance our understanding of olfactory perception and the design of new odorants with desired perceptual qualities.

Description of files

weighted_loss_olfactory_subgenome_OR_logits.pt: binarized OR activation logits from the weighted-loss MolOR (90/10 split) over 845 HORDE ORs, keyed against all 5,862 GS-LF odorants. Required for the headline percept result (Table 2: GCN 845 ORs Weighted HORDE, 87.62 AUROC). 19.8 MB.

olfactory_subgenome_OR_logits.pt: OR activation logits from the unweighted MolOR (90/10 split) over 845 HORDE ORs. Used for the unweighted HORDE percept variant (Table 2: 87.20 AUROC). 19.8 MB.

full_weighted_1237_ORs_logits.pt: OR activation logits from the weighted-loss MolOR (90/10 split) over 1237 M2OR ORs. Used for weighted M2OR percept variants and the 5/10/20/50/100/400/1237 ablation (Table S1). 29.0 MB.

full_1237_ORs_logits.pt: OR activation logits from the unweighted MolOR (90/10 split) over 1237 M2OR ORs. Used for unweighted M2OR percept variants. 29.0 MB.

null_distributions.pkl: permutation-test results backing Figure S4 (receptor-activation specificity). For each of 10 percepts (5 narrow, 5 broad), contains the distribution of significantly-associated receptors under 100 random permutations of percept labels. Loaded by notebooks/percept_receptor_null_distribution.ipynb.

m2or_MolOR_esm_650m_unweighted.pth: best-performing MolOR for OR-binding prediction (Table 1, row 5). GCN with cross-attention over per-residue ESM-2 (650M) embeddings, trained on M2OR with unweighted BCE loss on an 80/10/10 split. Recommended weights for receptor-binding inference.

m2or_MolOR_esm_650m_weighted_for_GSLF.pth: upstream MolOR used to generate the OR activation logits that feed the GS-LF percept models. Same architecture as above, trained with the weighted-loss scheme on a 90/10 split (no test set, maximizing training data). Required to regenerate weighted OR logits from scratch.

m2or_MolOR_MPNN_esm_650m_unweighted.pth: MolOR variant with an MPNN molecular encoder in place of the GCN (Table 1, row 7, new in v2). Same ESM-650M residue embeddings and cross-attention, unweighted loss. Pairs with the MolOR_MPNN_canonical.json config.

GS_LF_baseline_gcn.pth: multi-task GCN trained on GS-LF molecules alone, with no OR features (Table 2, row 3 — 0-OR baseline, 87.04 AUROC). Reference point for the OR-augmentation ablation.

GS_LF_845_ORs_gcn.pth: best percept model (Table 2, row 7). Same GCN as the baseline but concatenates binarized predictions for all 845 HORDE ORs (from the weighted upstream MolOR above) to the molecular embedding before the prediction head.

GS_LF_pseudogenes_only_HORDE_gcn.pth: pseudogene control (Table 2, row 2). Same as the 845-OR model but using only HORDE pseudogene ORs as features; matches the no-OR baseline, supporting the v2 claim that pseudogene logits do not drive percept gains.
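The OR-augmented percept models described above append binarized receptor predictions to the molecular embedding before the prediction head. The following is a minimal sketch of that feature construction in plain Python, not the repository's implementation. The function names, the toy vectors, and the cut-off (a logit above 0, i.e. a sigmoid probability above 0.5) are assumptions made for illustration; the actual binarization threshold is not stated in this record.

```python
from typing import List, Sequence


def binarize_logits(logits: Sequence[float], threshold: float = 0.0) -> List[float]:
    """Map each receptor logit to 1.0 or 0.0.

    ASSUMPTION: a logit strictly above 0 (sigmoid probability above 0.5)
    counts as an activation. The cut-off used in the paper may differ.
    """
    return [1.0 if value > threshold else 0.0 for value in logits]


def augment_embedding(mol_embedding: Sequence[float],
                      or_logits: Sequence[float]) -> List[float]:
    """Concatenate the molecular embedding with binarized OR predictions."""
    return list(mol_embedding) + binarize_logits(or_logits)


# Toy example: a 3-dimensional embedding and logits for 4 receptors.
embedding = [0.12, -0.40, 0.93]
or_logits = [2.1, -0.7, 0.0, 0.3]
features = augment_embedding(embedding, or_logits)
print(features)  # [0.12, -0.4, 0.93, 1.0, 0.0, 0.0, 1.0]
```

With the 845 HORDE receptors, the input to the prediction head would grow by 845 entries per molecule under this scheme.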

Usage notes

The training script classification_OR_feat_ESM.py selects which OR-logit file to load based on --prev_model_loss (weighted_loss vs unweighted_loss) and --OR_db (HORDE vs M2OR). When --n_ORs is below the full receptor count (1237 for M2OR, 845 for HORDE), the loaded logits are truncated to the top-N ORs sorted by broadness. Each .pth is a single representative seed; full per-seed metrics for the paper ablations live alongside eval.txt files in the source training directories.
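The mapping from the two flags to the four OR-logit files can be sketched as a lookup table. This is an illustration built from the file descriptions in this record, not code taken from classification_OR_feat_ESM.py. The helper names are hypothetical, and the broadness score (assumed here to be a per-receptor number where larger means broader, with the broadest receptors kept first) is an assumption.

```python
from typing import List, Sequence

# File names and receptor counts as listed in this record.
OR_LOGIT_FILES = {
    ("weighted_loss", "HORDE"): "weighted_loss_olfactory_subgenome_OR_logits.pt",
    ("unweighted_loss", "HORDE"): "olfactory_subgenome_OR_logits.pt",
    ("weighted_loss", "M2OR"): "full_weighted_1237_ORs_logits.pt",
    ("unweighted_loss", "M2OR"): "full_1237_ORs_logits.pt",
}
MAX_ORS = {"HORDE": 845, "M2OR": 1237}


def select_logit_file(prev_model_loss: str, or_db: str) -> str:
    """Return the logit file matching --prev_model_loss and --OR_db."""
    try:
        return OR_LOGIT_FILES[(prev_model_loss, or_db)]
    except KeyError:
        raise ValueError(f"unsupported combination: {prev_model_loss}, {or_db}")


def top_n_by_broadness(broadness: Sequence[float], n_ors: int) -> List[int]:
    """Return indices of the n_ors broadest receptors (hypothetical helper).

    ASSUMPTION: larger score means broader, and ties keep file order.
    """
    order = sorted(range(len(broadness)), key=lambda i: broadness[i], reverse=True)
    return order[:n_ors]


print(select_logit_file("weighted_loss", "HORDE"))
print(top_n_by_broadness([3, 10, 1, 7], n_ors=2))  # [1, 3]
```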

Regenerating from scratch

OR-logit tensors can be regenerated end-to-end given (1) the corresponding upstream MolOR checkpoint above, (2) HORDE/M2OR sequence files in the GitHub repo at data/datasets/HORDE/ and data/datasets/M2OR/, and (3) the GS-LF molecule list at data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv. Run classification_OR_feat_ESM.py with -pmp <MolOR checkpoint dir> — the script will compute and cache logits to data/datasets/full_*.pt automatically. ESM-2 (650M) per-residue embeddings will also be regenerated and cached on first run (~6.3 GB; not bundled here).
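Before launching a regeneration run, it can help to confirm that the three prerequisites are in place. The sketch below checks the paths named above and assembles a command line using only the flags mentioned in this record (-pmp, --prev_model_loss, --OR_db). The helper names are hypothetical, the script's location inside the repository is not specified here, and the script may require further arguments, so the command is illustrative rather than complete.

```python
import sys
from pathlib import Path
from typing import List

GS_LF_CSV = "NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv"


def missing_prerequisites(repo_root: str, checkpoint_dir: str) -> List[str]:
    """List the required paths that do not exist yet (hypothetical helper)."""
    datasets = Path(repo_root) / "data" / "datasets"
    required = [
        Path(checkpoint_dir),      # (1) upstream MolOR checkpoint directory
        datasets / "HORDE",        # (2) HORDE sequence files
        datasets / "M2OR",         # (2) M2OR sequence files
        datasets / GS_LF_CSV,      # (3) GS-LF molecule list
    ]
    return [str(path) for path in required if not path.exists()]


def build_command(checkpoint_dir: str,
                  prev_model_loss: str = "weighted_loss",
                  or_db: str = "HORDE") -> List[str]:
    """Assemble an argument list; other required flags may be missing."""
    return [
        sys.executable, "classification_OR_feat_ESM.py",
        "-pmp", checkpoint_dir,
        "--prev_model_loss", prev_model_loss,
        "--OR_db", or_db,
    ]


# Placeholder path; replace with the real checkpoint directory.
print(" ".join(build_command("path/to/MolOR_checkpoint_dir")))
```

If missing_prerequisites returns an empty list, the argument list can be passed to subprocess.run from the directory that contains the script.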

Files

Files (211.6 MB)

Name	MD5	Size
full_1237_ORs_logits.pt	ef3018d0243c098ca96ac8de8c3e2cd3	29.0 MB
full_weighted_1237_ORs_logits.pt	186cea9a10d56984da4451a530affd9f	29.0 MB
GS_LF_845_ORs_gcn.pth	0e2511d7090e46d6433b46e2eea2ced3	1.5 MB
GS_LF_baseline_gcn.pth	87262638148468a42d1790adaa4024a0	1.0 MB
GS_LF_pseudogenes_only_HORDE_gcn.pth	20a14562198aa3c8a1ae6c3a9c5fab59	1.3 MB
m2or_MolOR_esm_650m_unweighted.pth	98b3bcc6f0bae4c8faddb40ffb22c7d5	25.1 MB
m2or_MolOR_esm_650m_weighted_for_GSLF.pth	dee94df2b1b44aa80c1ad1b1d2aac5a0	25.1 MB
m2or_MolOR_MPNN_esm_650m_unweighted.pth	a0f0acdd17032710c7f8cafe311cbd41	59.9 MB
null_distributions.pkl	f910483b0f07bdf376c5920d2616c7ee	8.5 kB
olfactory_subgenome_OR_logits.pt	d5f9af2640d50a5ad94f58313dfa6f3e	19.8 MB
weighted_loss_olfactory_subgenome_OR_logits.pt	0456ca577236731844fe88b095eb82a8	19.8 MB
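The MD5 checksums in the file listing above can be used to confirm that downloads are intact before running anything. The sketch below uses only the Python standard library. The checksums are copied from this record; the function names are hypothetical, and the code assumes all files sit directly in one directory, which may not match the subfolder layout the repository expects.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "full_1237_ORs_logits.pt": "ef3018d0243c098ca96ac8de8c3e2cd3",
    "full_weighted_1237_ORs_logits.pt": "186cea9a10d56984da4451a530affd9f",
    "GS_LF_845_ORs_gcn.pth": "0e2511d7090e46d6433b46e2eea2ced3",
    "GS_LF_baseline_gcn.pth": "87262638148468a42d1790adaa4024a0",
    "GS_LF_pseudogenes_only_HORDE_gcn.pth": "20a14562198aa3c8a1ae6c3a9c5fab59",
    "m2or_MolOR_esm_650m_unweighted.pth": "98b3bcc6f0bae4c8faddb40ffb22c7d5",
    "m2or_MolOR_esm_650m_weighted_for_GSLF.pth": "dee94df2b1b44aa80c1ad1b1d2aac5a0",
    "m2or_MolOR_MPNN_esm_650m_unweighted.pth": "a0f0acdd17032710c7f8cafe311cbd41",
    "null_distributions.pkl": "f910483b0f07bdf376c5920d2616c7ee",
    "olfactory_subgenome_OR_logits.pt": "d5f9af2640d50a5ad94f58313dfa6f3e",
    "weighted_loss_olfactory_subgenome_OR_logits.pt": "0456ca577236731844fe88b095eb82a8",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(data_dir: str,
                     expected: Dict[str, str] = EXPECTED_MD5) -> Dict[str, str]:
    """Return 'ok', 'mismatch', or 'missing' for each expected file.

    ASSUMPTION: every file is located directly inside data_dir.
    """
    status = {}
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            status[name] = "missing"
        elif md5_of(path) == want:
            status[name] = "ok"
        else:
            status[name] = "mismatch"
    return status
```

Calling verify_downloads("data") and printing any entry whose status is not "ok" gives a quick pre-flight check.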

Additional details

Submitted: 2026-05-08

In final round of revisions prior to acceptance at Cell Systems

Repository URL: https://github.com/microsoft/olfaction
Programming language: Python

Resource type: Publication
Publisher: Zenodo

License: Creative Commons Attribution 4.0 International; MIT License

Creative Commons Attribution 4.0 International: allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

MIT License: a short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Technical metadata

Created: May 13, 2026
Modified: May 13, 2026