Published May 8, 2026 | Version 2

Data and weights for 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'

  • 1. University of California, Berkeley
  • 2. Microsoft Research New England (United States)

Description

Mapping the combinatorial coding between olfactory receptors and perception with deep learning — data and model checkpoints (v2)

Datasets and model checkpoints associated with 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'. This record provides datasets and artifacts that are too large for the GitHub repository but are required to reproduce the percept-prediction results without re-running upstream MolOR inference. Download the files and place them under data/ in the repo, preserving the subfolder structure.

Abstract

The sense of smell remains poorly understood, especially in contrast to visual and auditory coding. At the core of our sense of smell is the olfactory information flow, in which odorant molecules activate a subset of our olfactory receptors and combinations of unique receptor activations code for unique odors. Understanding this relationship is crucial for unraveling the mysteries of human olfaction and its potential therapeutic applications. Despite this, predicting molecule-OR interactions remains incredibly difficult. Here, we develop a novel, biologically-inspired approach denoted MolOR that first maps odorant molecules to their respective olfactory receptor (OR) activation profiles and subsequently predicts their odor percepts. Despite a lack of overlap between molecules with OR activation data and percept annotations, our joint model improves percept prediction by leveraging the OR activation profile of each odorant as auxiliary features in predicting its percepts. We extend this cross receptor-percept approach, showing that sets of molecules with very different structures but similar percepts, a common challenge for chemosensory prediction, have similar predicted OR activation profiles. Lastly, we further probe the odorant-OR model’s predictive ability, showing it can distinguish binding patterns across unique OR families, as well as between protein-coding genes and frequently occurring pseudogenes in the human olfactory subgenome. This work may aid in the potential discovery of novel odorant ligands targeting functions of orphan ORs, and in further characterizing the relationship between chemical structures and percepts. In doing so, we hope to advance our understanding of olfactory perception and the design of new odorants with desired perceptual qualities.

Description of files

weighted_loss_olfactory_subgenome_OR_logits.pt: binarized OR activation logits from the weighted-loss MolOR (90/10 split) over 845 HORDE ORs, keyed against all 5,862 GS-LF odorants. Required for the headline percept result (Table 2: GCN 845 ORs Weighted HORDE, 87.62 AUROC). 19 MB.

olfactory_subgenome_OR_logits.pt: OR activation logits from the unweighted MolOR (90/10) over 845 HORDE ORs. Used for the unweighted HORDE percept variant (Table 2: 87.20 AUROC). 19 MB.

full_weighted_1237_ORs_logits.pt: OR activation logits from the weighted-loss MolOR (90/10) over 1237 M2OR ORs. Used for weighted M2OR percept variants and the 5/10/20/50/100/400/1237 ablation (Table S1). 28 MB.

full_1237_ORs_logits.pt: OR activation logits from the unweighted MolOR (90/10) over 1237 M2OR ORs. Used for unweighted M2OR percept variants. 28 MB.

null_distributions.pkl: permutation-test results backing Figure S4 (receptor-activation specificity). For each of 10 percepts (5 narrow, 5 broad), contains the distribution of significantly-associated receptors under 100 random permutations of percept labels. Loaded by notebooks/percept_receptor_null_distribution.ipynb.

m2or_MolOR_esm_650m_unweighted.pth: best-performing MolOR for OR-binding prediction (Table 1, row 5). GCN with cross-attention over per-residue ESM-2 (650M) embeddings, trained on M2OR with unweighted BCE loss on an 80/10/10 split. Recommended weights for receptor-binding inference.

m2or_MolOR_esm_650m_weighted_for_GSLF.pth: upstream MolOR used to generate the OR activation logits that feed the GS-LF percept models. Same architecture as above, trained with the weighted-loss scheme on a 90/10 split (no test set, maximizing training data). Required to regenerate weighted OR logits from scratch.

m2or_MolOR_MPNN_esm_650m_unweighted.pth: MolOR variant with an MPNN molecular encoder in place of the GCN (Table 1, row 7, new in v2). Same ESM-650M residue embeddings and cross-attention, unweighted loss. Pairs with the MolOR_MPNN_canonical.json config.

GS_LF_baseline_gcn.pth: multi-task GCN trained on GS-LF molecules alone, with no OR features (Table 2, row 3 — 0-OR baseline, 87.04 AUROC). Reference point for the OR-augmentation ablation.

GS_LF_845_ORs_gcn.pth: best percept model (Table 2, row 7). Same GCN as the baseline but concatenates binarized predictions for all 845 HORDE ORs (from the weighted upstream MolOR above) to the molecular embedding before the prediction head.

GS_LF_pseudogenes_only_HORDE_gcn.pth: pseudogene control (Table 2, row 2). Same as the 845-OR model but using only HORDE pseudogene ORs as features; matches the no-OR baseline, supporting the v2 claim that pseudogene logits do not drive percept gains.
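The OR augmentation described for GS_LF_845_ORs_gcn.pth can be illustrated with a minimal, framework-free sketch. It assumes that binarization thresholds each logit at 0 (equivalent to a sigmoid probability of 0.5) and that the binarized features are appended to the molecular embedding by plain concatenation. The function names, the threshold, and the toy dimensions are hypothetical and are not taken from the repository.

```python
from typing import List, Sequence


def binarize_logits(logits: Sequence[float], threshold: float = 0.0) -> List[float]:
    """Map raw OR logits to 0/1 activations.

    Assumption: a logit >= 0 (sigmoid >= 0.5) counts as "active".
    """
    return [1.0 if x >= threshold else 0.0 for x in logits]


def augment_embedding(mol_embedding: Sequence[float],
                      or_logits: Sequence[float]) -> List[float]:
    """Concatenate binarized OR activations onto a molecular embedding."""
    return list(mol_embedding) + binarize_logits(or_logits)


# Toy example: a 4-dim embedding and 3 OR logits (the real model uses 845 ORs).
embedding = [0.12, -0.40, 0.88, 0.05]
or_logits = [2.3, -1.1, 0.4]
features = augment_embedding(embedding, or_logits)
print(len(features))   # 7
print(features[-3:])   # [1.0, 0.0, 1.0]
```

In the actual models the concatenated vector would feed the prediction head; the baseline and pseudogene-control checkpoints differ only in which OR columns, if any, are appended.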

Usage notes

The training script classification_OR_feat_ESM.py selects which OR-logit file to load based on --prev_model_loss (weighted_loss vs unweighted_loss) and --OR_db (HORDE vs M2OR). When --n_ORs is below the full count (1237 for M2OR, 845 for HORDE), the logits are truncated to the top-N ORs ranked by broadness. Each .pth is a single representative seed; full per-seed metrics for the paper ablations live alongside eval.txt files in the source training directories.
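The selection and truncation logic can be sketched as follows. The file names come from the file descriptions in this record, but the lookup table, both function names, and the definition of broadness (here, the number of odorants an OR is predicted active for) are assumptions for illustration and may differ from what classification_OR_feat_ESM.py actually does.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed mapping from (--prev_model_loss, --OR_db) to the bundled logit file.
LOGIT_FILES: Dict[Tuple[str, str], str] = {
    ("weighted_loss", "HORDE"): "weighted_loss_olfactory_subgenome_OR_logits.pt",
    ("unweighted_loss", "HORDE"): "olfactory_subgenome_OR_logits.pt",
    ("weighted_loss", "M2OR"): "full_weighted_1237_ORs_logits.pt",
    ("unweighted_loss", "M2OR"): "full_1237_ORs_logits.pt",
}


def select_logit_file(prev_model_loss: str, or_db: str) -> str:
    """Return the logit file name for a flag combination (hypothetical helper)."""
    try:
        return LOGIT_FILES[(prev_model_loss, or_db)]
    except KeyError:
        raise ValueError(f"unsupported combination: {prev_model_loss}, {or_db}")


def top_n_by_broadness(activations: Sequence[Sequence[int]], n_ors: int) -> List[int]:
    """Return column indices of the n_ors broadest ORs.

    Assumption: broadness = number of odorants (rows) with a 1 in that column.
    Ties keep their original column order because Python's sort is stable.
    """
    if not activations:
        return []
    n_cols = len(activations[0])
    broadness = [sum(row[j] for row in activations) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: broadness[j], reverse=True)
    return ranked[:n_ors]


# Toy example: 3 odorants x 3 ORs of binarized activations.
toy = [[1, 0, 1],
       [1, 0, 0],
       [1, 1, 1]]
print(select_logit_file("weighted_loss", "HORDE"))
print(top_n_by_broadness(toy, 2))  # [0, 2]
```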

Regenerating from scratch

OR-logit tensors can be regenerated end-to-end given (1) the corresponding upstream MolOR checkpoint above, (2) HORDE/M2OR sequence files in the GitHub repo at data/datasets/HORDE/ and data/datasets/M2OR/, and (3) the GS-LF molecule list at data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv. Run classification_OR_feat_ESM.py with -pmp <MolOR checkpoint dir> — the script will compute and cache logits to data/datasets/full_*.pt automatically. ESM-2 (650M) per-residue embeddings will also be regenerated and cached on first run (~6.3 GB; not bundled here).

Files (211.6 MB)

Checksum  Size
md5:ef3018d0243c098ca96ac8de8c3e2cd3  29.0 MB
md5:186cea9a10d56984da4451a530affd9f  29.0 MB
md5:0e2511d7090e46d6433b46e2eea2ced3  1.5 MB
md5:87262638148468a42d1790adaa4024a0  1.0 MB
md5:20a14562198aa3c8a1ae6c3a9c5fab59  1.3 MB
md5:98b3bcc6f0bae4c8faddb40ffb22c7d5  25.1 MB
md5:dee94df2b1b44aa80c1ad1b1d2aac5a0  25.1 MB
md5:a0f0acdd17032710c7f8cafe311cbd41  59.9 MB
md5:f910483b0f07bdf376c5920d2616c7ee  8.5 kB
md5:d5f9af2640d50a5ad94f58313dfa6f3e  19.8 MB
md5:0456ca577236731844fe88b095eb82a8  19.8 MB
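
After downloading, each file can be checked against its listed MD5 checksum. The sketch below uses only the Python standard library; the helper names are hypothetical, and the path passed in should be wherever the file was placed under data/.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file against a bare hex digest or the 'md5:<hex>' form."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, matches_checksum(path_to_download, "md5:ef3018d0243c098ca96ac8de8c3e2cd3") should return True for whichever file carries that checksum on the record page.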

Additional details

Dates

Submitted
2026-05-08
In final round of revisions prior to acceptance at Cell Systems

Software

Repository URL
https://github.com/microsoft/olfaction
Programming language
Python