Data and weights for 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'
Description
Mapping the combinatorial coding between olfactory receptors and perception with deep learning — data and model checkpoints (v2)
Datasets and model checkpoints associated with 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning'. This record provides datasets and artifacts that are too large for the GitHub repository but are required to reproduce the percept-prediction results without re-running upstream MolOR inference. Download the files and place them under data/ in the repository, preserving the subfolder structure.
Abstract
The sense of smell remains poorly understood, especially in contrast to visual and auditory coding. At the core of our sense of smell is the olfactory information flow, in which odorant molecules activate a subset of our olfactory receptors and combinations of unique receptor activations code for unique odors. Understanding this relationship is crucial for unraveling the mysteries of human olfaction and its potential therapeutic applications. Despite this, predicting molecule-OR interactions remains incredibly difficult. Here, we develop a novel, biologically-inspired approach denoted MolOR that first maps odorant molecules to their respective olfactory receptor (OR) activation profiles and subsequently predicts their odor percepts. Despite a lack of overlap between molecules with OR activation data and percept annotations, our joint model improves percept prediction by leveraging the OR activation profile of each odorant as auxiliary features in predicting its percepts. We extend this cross receptor-percept approach, showing that sets of molecules with very different structures but similar percepts, a common challenge for chemosensory prediction, have similar predicted OR activation profiles. Lastly, we further probe the odorant-OR model’s predictive ability, showing it can distinguish binding patterns across unique OR families, as well as between protein-coding genes and frequently occurring pseudogenes in the human olfactory subgenome. This work may aid in the potential discovery of novel odorant ligands targeting functions of orphan ORs, and in further characterizing the relationship between chemical structures and percepts. In doing so, we hope to advance our understanding of olfactory perception and the design of new odorants with desired perceptual qualities.
Description of files
weighted_loss_olfactory_subgenome_OR_logits.pt: binarized OR activation logits from the weighted-loss MolOR (90/10 split) over 845 HORDE ORs, keyed against all 5,862 GS-LF odorants. Required for the headline percept result (Table 2: GCN 845 ORs Weighted HORDE, 87.62 AUROC). 19 MB.
olfactory_subgenome_OR_logits.pt: OR activation logits from the unweighted MolOR (90/10) over 845 HORDE ORs. Used for the unweighted HORDE percept variant (Table 2: 87.20 AUROC). 19 MB.
full_weighted_1237_ORs_logits.pt: OR activation logits from the weighted-loss MolOR (90/10) over 1237 M2OR ORs. Used for weighted M2OR percept variants and the 5/10/20/50/100/400/1237 ablation (Table S1). 28 MB.
full_1237_ORs_logits.pt: OR activation logits from the unweighted MolOR (90/10) over 1237 M2OR ORs. Used for unweighted M2OR percept variants. 28 MB.
null_distributions.pkl: permutation-test results backing Figure S4 (receptor-activation specificity). For each of 10 percepts (5 narrow, 5 broad), contains the distribution of significantly-associated receptors under 100 random permutations of percept labels. Loaded by notebooks/percept_receptor_null_distribution.ipynb.
m2or_MolOR_esm_650m_unweighted.pth: best-performing MolOR for OR-binding prediction (Table 1, row 5). GCN with cross-attention over per-residue ESM-2 (650M) embeddings, trained on M2OR with unweighted BCE loss on an 80/10/10 split. Recommended weights for receptor-binding inference.
m2or_MolOR_esm_650m_weighted_for_GSLF.pth: upstream MolOR used to generate the OR activation logits that feed the GS-LF percept models. Same architecture as above, trained with the weighted-loss scheme on a 90/10 split (no test set, maximizing training data). Required to regenerate weighted OR logits from scratch.
m2or_MolOR_MPNN_esm_650m_unweighted.pth: MolOR variant with an MPNN molecular encoder in place of the GCN (Table 1, row 7, new in v2). Same ESM-650M residue embeddings and cross-attention, unweighted loss. Pairs with the MolOR_MPNN_canonical.json config.
GS_LF_baseline_gcn.pth: multi-task GCN trained on GS-LF molecules alone, with no OR features (Table 2, row 3 — 0-OR baseline, 87.04 AUROC). Reference point for the OR-augmentation ablation.
GS_LF_845_ORs_gcn.pth: best percept model (Table 2, row 7). Same GCN as the baseline but concatenates binarized predictions for all 845 HORDE ORs (from the weighted upstream MolOR above) to the molecular embedding before the prediction head.
GS_LF_pseudogenes_only_HORDE_gcn.pth: pseudogene control (Table 2, row 2). Same as the 845-OR model but using only HORDE pseudogene ORs as features; matches the no-OR baseline, supporting the v2 claim that pseudogene logits do not drive percept gains.
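As an illustration of how the permutation results in null_distributions.pkl might be used, the sketch below computes an empirical p-value for an observed count of significantly associated receptors. The pickle layout shown here (a dict mapping each percept name to a list of counts, one per permutation) and the function names `load_null_distributions` and `empirical_p_value` are assumptions made for illustration and are not taken from the repository; notebooks/percept_receptor_null_distribution.ipynb is the reference for the actual structure.

```python
import pickle
from pathlib import Path


def load_null_distributions(path):
    """Load the pickled permutation results.

    Only unpickle files obtained from a trusted source.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def empirical_p_value(observed, null_counts):
    """One-sided permutation p-value with the usual +1 correction.

    Counts how many permuted values are at least as large as the
    observed value, so the result is never exactly zero.
    """
    null_counts = list(null_counts)
    if not null_counts:
        raise ValueError("null distribution is empty")
    at_least_as_extreme = sum(1 for value in null_counts if value >= observed)
    return (at_least_as_extreme + 1) / (len(null_counts) + 1)


if __name__ == "__main__":
    # Hypothetical layout: {percept name: [count under each permutation]}.
    toy_nulls = {"example_percept": [3, 5, 2, 4, 6, 1, 3, 2, 4, 5]}
    p = empirical_p_value(12, toy_nulls["example_percept"])
    print(f"empirical p-value: {p:.3f}")
```

With 100 permutations, as described for this file, the smallest p-value this estimator can return is 1/101.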
Usage notes
The training script classification_OR_feat_ESM.py selects which OR-logit file to load based on --prev_model_loss (weighted_loss vs unweighted_loss) and --OR_db (HORDE vs M2OR). When --n_ORs is smaller than the full receptor count (1237 for M2OR, 845 for HORDE), the logits are truncated to the top-N ORs sorted by broadness. Each .pth is a single representative seed; full per-seed metrics for the paper ablations live alongside eval.txt files in the source training directories.
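The selection and truncation described above can be sketched roughly as follows. This is not the code from classification_OR_feat_ESM.py: the lookup table is assembled from the file descriptions in this record, the helper names `select_logit_file` and `top_n_broadest` are hypothetical, and "broadness" is assumed here to mean the number of odorants predicted to activate each OR, which the script may define differently.

```python
# Hypothetical lookup built from the file descriptions in this record;
# the real script may resolve file paths differently.
OR_LOGIT_FILES = {
    ("weighted_loss", "HORDE"): "weighted_loss_olfactory_subgenome_OR_logits.pt",
    ("unweighted_loss", "HORDE"): "olfactory_subgenome_OR_logits.pt",
    ("weighted_loss", "M2OR"): "full_weighted_1237_ORs_logits.pt",
    ("unweighted_loss", "M2OR"): "full_1237_ORs_logits.pt",
}


def select_logit_file(prev_model_loss, or_db):
    """Map the two command-line choices to an OR-logit file name."""
    try:
        return OR_LOGIT_FILES[(prev_model_loss, or_db)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {prev_model_loss!r}, {or_db!r}"
        ) from None


def top_n_broadest(activations, n_ors):
    """Keep the n_ors receptor columns with the highest broadness.

    activations: list of rows (one per odorant), each a list of 0/1
    values (one per OR). Broadness is ASSUMED to be the column sum.
    Returns (kept column indices, truncated rows).
    """
    if not activations:
        return [], []
    n_total = len(activations[0])
    if not 0 < n_ors <= n_total:
        raise ValueError(f"n_ors must be between 1 and {n_total}")
    broadness = [sum(row[j] for row in activations) for j in range(n_total)]
    # sorted() is stable, so ties keep their original column order.
    order = sorted(range(n_total), key=lambda j: -broadness[j])
    kept = order[:n_ors]
    truncated = [[row[j] for j in kept] for row in activations]
    return kept, truncated
```

For example, `select_logit_file("weighted_loss", "HORDE")` returns the file name used for the headline percept result, and `top_n_broadest(rows, 100)` would correspond to one point of an ablation over receptor counts under the broadness assumption stated above.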
Regenerating from scratch
OR-logit tensors can be regenerated end-to-end given (1) the corresponding upstream MolOR checkpoint above, (2) HORDE/M2OR sequence files in the GitHub repo at data/datasets/HORDE/ and data/datasets/M2OR/, and (3) the GS-LF molecule list at data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv. Run classification_OR_feat_ESM.py with -pmp <MolOR checkpoint dir> — the script will compute and cache logits to data/datasets/full_*.pt automatically. ESM-2 (650M) per-residue embeddings will also be regenerated and cached on first run (~6.3 GB; not bundled here).
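Before starting a regeneration run, it can help to confirm that the three prerequisites listed above are in place. The sketch below checks for them; the relative paths come from this record, while the function name `missing_prerequisites` and the assumption that the checkpoint directory holds at least one .pth file are illustrative and not part of the repository.

```python
from pathlib import Path

# Paths named in this record, relative to the root of the GitHub repo.
REQUIRED_RELATIVE_PATHS = (
    "data/datasets/HORDE",
    "data/datasets/M2OR",
    "data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv",
)


def missing_prerequisites(repo_root, checkpoint_dir):
    """Return a list describing every prerequisite that is absent."""
    repo_root = Path(repo_root)
    missing = [
        str(repo_root / rel)
        for rel in REQUIRED_RELATIVE_PATHS
        if not (repo_root / rel).exists()
    ]
    checkpoint_dir = Path(checkpoint_dir)
    # ASSUMPTION: the directory passed via -pmp contains a .pth checkpoint.
    has_checkpoint = checkpoint_dir.is_dir() and any(checkpoint_dir.glob("*.pth"))
    if not has_checkpoint:
        missing.append(f"{checkpoint_dir} (no .pth checkpoint found)")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites(".", "checkpoints")  # hypothetical paths
    for problem in problems:
        print("missing:", problem)
```

An empty return value means the sequence files, the GS-LF molecule list, and a checkpoint are all present; it says nothing about whether the roughly 6.3 GB of disk space for the cached ESM-2 embeddings is available.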
Files
(211.6 MB)
| MD5 checksum | Size |
|---|---|
| ef3018d0243c098ca96ac8de8c3e2cd3 | 29.0 MB |
| 186cea9a10d56984da4451a530affd9f | 29.0 MB |
| 0e2511d7090e46d6433b46e2eea2ced3 | 1.5 MB |
| 87262638148468a42d1790adaa4024a0 | 1.0 MB |
| 20a14562198aa3c8a1ae6c3a9c5fab59 | 1.3 MB |
| 98b3bcc6f0bae4c8faddb40ffb22c7d5 | 25.1 MB |
| dee94df2b1b44aa80c1ad1b1d2aac5a0 | 25.1 MB |
| a0f0acdd17032710c7f8cafe311cbd41 | 59.9 MB |
| f910483b0f07bdf376c5920d2616c7ee | 8.5 kB |
| d5f9af2640d50a5ad94f58313dfa6f3e | 19.8 MB |
| 0456ca577236731844fe88b095eb82a8 | 19.8 MB |
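The checksums listed above can be used to confirm that downloads completed intact. The following is a minimal sketch, assuming the files have been placed somewhere under a local directory; the function names `md5_of_file` and `unverified_files` are hypothetical. Because the listing does not pair checksums with file names, the sketch only tests whether each file's MD5 appears in the published set.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
PUBLISHED_MD5 = frozenset({
    "ef3018d0243c098ca96ac8de8c3e2cd3",
    "186cea9a10d56984da4451a530affd9f",
    "0e2511d7090e46d6433b46e2eea2ced3",
    "87262638148468a42d1790adaa4024a0",
    "20a14562198aa3c8a1ae6c3a9c5fab59",
    "98b3bcc6f0bae4c8faddb40ffb22c7d5",
    "dee94df2b1b44aa80c1ad1b1d2aac5a0",
    "a0f0acdd17032710c7f8cafe311cbd41",
    "f910483b0f07bdf376c5920d2616c7ee",
    "d5f9af2640d50a5ad94f58313dfa6f3e",
    "0456ca577236731844fe88b095eb82a8",
})


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unverified_files(directory):
    """Return files under directory whose MD5 is not in the published set."""
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")
        if path.is_file() and md5_of_file(path) not in PUBLISHED_MD5
    )
```

Running `unverified_files` on a directory that holds only the downloaded artifacts should return an empty list; any path it does return is either corrupted, incomplete, or not part of this record.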
Additional details
Dates
- Submitted: 2026-05-08. In final round of revisions prior to acceptance at Cell Systems.
Software
- Repository URL: https://github.com/microsoft/olfaction
- Programming language: Python