InChIKey-Deduplicated ClassyFire/ChemOnt Label Collection
Description
An InChIKey-deduplicated aggregation of ClassyFire / ChemOnt chemical class labels for 73,356,229 unique compounds, assembled from PubChem, ZINC20, and Enamine REAL aligned sources contributed by collaborating laboratories. Every row carries the canonical five-tier hierarchical path [kingdom, superclass, class, subclass, direct_parent], and the non-hierarchical ClassyFire labels (intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features) are exposed in a chemont_other_json column. This release also ships a model-ready Parquet train/validation/test split with a per-field label vocabulary.
- These labels cannot be fully verified against the live ClassyFire model. The dataset is an aggregation of ClassyFire results contributed by several laboratories across different time periods, not a fresh classification run. The exact source of every individual label is not always recoverable. Re-classifying every compound through the official endpoint at classyfire.wishartlab.com to confirm that the labels match what ClassyFire would emit today is not practically feasible. The service accepts one InChIKey per request and is rate-limited to about 12 requests per minute, so a single re-classification pass over the ~73 M compounds in this release would take more than a decade of uninterrupted querying (roughly 11.6 years at that rate). Historical version differences or occasional classification errors may therefore be present. However, the same InChIKey is occasionally reported by several of the contributing sources, and their independent ClassyFire classifications agree on the labels we retain. Cross-source agreement therefore acts as a confidence signal in place of direct re-verification, because broad label corruption would have produced systematic disagreement between sources, which is not what we observe.
- Identity is InChIKey-only. The dataset is keyed by the 27-character standard InChIKey. It is not tautomer-normalized and does not attempt cross-tautomer identity resolution. If you need that, normalize the SMILES yourself before joining.
- Coverage is not complete. 704 of the 4,824 ChemOnt classes (14.6 %) do not appear on any molecule in this release, mostly in exotic inorganic and lanthanide or actinide chemistry that is genuinely under-represented in the contributing catalogs. See the Coverage breakdown section.
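The rate-limit arithmetic in the first caveat can be checked with a short calculation. This is a minimal sketch using only the figures quoted above; the `requests_per_compound` parameter is an assumption added for illustration, since any polling or retries would raise the total.

```python
# Figures quoted in the caveat above.
COMPOUNDS = 73_356_229
REQUESTS_PER_MINUTE = 12


def reclassification_years(compounds=COMPOUNDS,
                           requests_per_minute=REQUESTS_PER_MINUTE,
                           requests_per_compound=1):
    """Years of uninterrupted querying needed for one full pass.

    requests_per_compound is a hypothetical knob: 1 is the best case,
    larger values model polling or retries.
    """
    minutes = compounds * requests_per_compound / requests_per_minute
    return minutes / (60 * 24 * 365.25)


years = reclassification_years()
print(f"{years:.1f} years")  # 11.6 years
```

The result is a lower bound: it assumes one request per compound, no downtime, and no failed requests.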
What is new in this release
Structures recovered from Enamine REAL. An additional 250,948 labelled structures were provided by colleagues working on the Enamine dataset.
Model-ready Parquet split. In addition to the full TSV, this release ships train.parquet, validation.parquet, and test.parquet plus vocabulary.json. The split is 80/10/10, stratified by the deepest resolved ChemOnt tree slot so that fine-grained classes are represented across all three partitions. Label columns store compact uint16 indices into a per-field vocabulary, decoded through vocabulary.json. See the Parquet split section.
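As an illustration of what stratifying by the deepest resolved tree slot can look like, the sketch below groups rows by the last non-null ID in the five-slot path and splits each group 80/10/10. It is not the procedure used to build the released files; the function names, the seed, and the toy IDs are all hypothetical.

```python
import random
from collections import defaultdict


def deepest_resolved(path):
    """Return the last non-null ChemOnt ID in a five-slot path, or None."""
    for node in reversed(path):
        if node is not None:
            return node
    return None


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """rows: iterable of (inchikey, path). Returns (train, validation, test)."""
    strata = defaultdict(list)
    for inchikey, path in rows:
        strata[deepest_resolved(path)].append(inchikey)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for label in sorted(strata, key=str):  # fixed order for reproducibility
        members = sorted(strata[label])
        rng.shuffle(members)
        n_val = int(len(members) * fractions[1])
        n_test = int(len(members) * fractions[2])
        validation += members[:n_val]
        test += members[n_val:n_val + n_test]
        train += members[n_val + n_test:]
    return train, validation, test


# Toy data: 50 rows resolved down to direct_parent 5,
# 50 rows resolved only as far as class 3.
rows = [(f"KEY{i:03d}", [1, 2, 3, 4, 5]) for i in range(50)]
rows += [(f"KEY{i:03d}", [1, 2, 3, None, None]) for i in range(50, 100)]
train, validation, test = stratified_split(rows)
print(len(train), len(validation), len(test))  # 80 10 10
```

In this simple version a stratum with fewer than ten rows lands entirely in the training partition, so a real pipeline would need an explicit rule for very rare classes.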
Lineage
Corrected hierarchical paths. An earlier release inadvertently encoded 3,388,432 rows (4.64 %) in the ClassyFire v2.1 schema layout, in which the meta-root Chemical entities (numeric ID 9999999) occupies the kingdom slot and every other tier is shifted down by one, pushing the rightful subclass out of the five-slot path. The aggregator detects the v2.1 layout at label-mapping time and shifts it back. All rows carry the canonical [kingdom, superclass, class, subclass, direct_parent] layout, and 9999999 never appears in any tree slot.
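One way to picture the correction is sketched below. It is not the aggregator's code. It assumes that a v2.1-layout row has the form [meta-root, kingdom, superclass, class, direct_parent], and that the displaced subclass can be recovered by walking the parent map (as provided by chemont_dictionary.tsv) upward from direct_parent. The function names and the toy IDs are hypothetical.

```python
META_ROOT = 9999999  # "Chemical entities"


def lineage(node, parent):
    """Ancestors of node in root-first order, meta-root excluded."""
    chain, seen = [], set()
    while node is not None and node != META_ROOT and node not in seen:
        seen.add(node)
        chain.append(node)
        node = parent.get(node)
    return chain[::-1]


def canonical_path(path, parent):
    """Return [kingdom, superclass, class, subclass, direct_parent]."""
    if not path or path[0] != META_ROOT:
        return list(path)  # already in the canonical layout
    direct_parent = path[-1]
    upper = lineage(direct_parent, parent)[:4]
    upper += [None] * (4 - len(upper))  # unresolved slots stay null
    return upper + [direct_parent]


# Toy parent map: 1 is a kingdom, 2 its child, and so on down to 5.
parent = {1: META_ROOT, 2: 1, 3: 2, 4: 3, 5: 4}
print(canonical_path([META_ROOT, 1, 2, 3, 5], parent))  # [1, 2, 3, 4, 5]
```

Rows that already start with a real kingdom pass through unchanged, so the function can be applied to every row.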
Out-of-tree labels. Beyond the five-slot hierarchical path, the chemont_other_json column records the non-hierarchical labels ClassyFire reports per molecule, namely intermediate_nodes, alternative_parents, geometric_descriptor, substituents, and mapped_features. Each value resolves to numeric ChemOnt IDs through the bundled dictionary. Including these labels raises class coverage from 3,798 / 4,824 (78.7 %) as a strict hierarchical-path label to 4,120 / 4,824 (85.4 %) anywhere in the classification.
Headline figures
| Metric | Value |
|---|---|
| Unique InChIKeys (rows) | 73,356,229 |
| Rows with PubChem CID | 68,693,295 |
| Rows with ZINC20 ID | 30,187,323 |
| Rows carrying both PubChem CID and ZINC20 ID | 25,775,344 |
| Rows added from Enamine REAL (no CID or ZINC20 ID) | 250,948 |
| Rows with at least one unresolved tree slot | 119,839 |
| ChemOnt classes seen at least once | 4,120 / 4,824 (85.4 %) |
| ChemOnt classes never assigned to any molecule | 704 / 4,824 (14.6 %) |
Coverage breakdown
Distribution of the 4,824 ChemOnt classes by the number of molecules that carry the class as a label anywhere in the classification (five-slot hierarchical path plus the chemont_other_json fields).
| Bucket | Classes | Share of 4,824 |
|---|---|---|
| 0 rows (class never assigned) | 704 | 14.59 % |
| Exactly 1 row | 74 | 1.53 % |
| >= 1 row (covered) | 4,120 | 85.41 % |
| >= 10 rows | 3,819 | 79.17 % |
| >= 100 rows | 3,259 | 67.56 % |
| >= 1,000 rows | 2,422 | 50.21 % |
| >= 10,000 rows | 1,490 | 30.89 % |
| >= 100,000 rows | 684 | 14.18 % |
| >= 1,000,000 rows | 183 | 3.79 % |
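The buckets above can be recomputed from the TSV with a counting pass of the following shape. This is a minimal sketch on toy records with made-up IDs; `rows_per_class` and `coverage_buckets` are hypothetical helper names, and each record is assumed to be an already-parsed (tree IDs, other-labels object) pair.

```python
from collections import Counter

THRESHOLDS = (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000)


def rows_per_class(records):
    """Count, per ChemOnt ID, the rows carrying it anywhere in the classification."""
    counts = Counter()
    for tree_ids, other in records:
        ids = {i for i in tree_ids if i is not None}
        for values in other.values():
            ids.update(values)
        counts.update(ids)  # each row counts once per class
    return counts


def coverage_buckets(counts, total_classes):
    """Return {bucket label: (number of classes, share of total in %)}."""
    table = {
        "0 rows": total_classes - len(counts),
        "exactly 1 row": sum(1 for c in counts.values() if c == 1),
    }
    for t in THRESHOLDS:
        table[f">= {t:,}"] = sum(1 for c in counts.values() if c >= t)
    return {k: (v, round(100 * v / total_classes, 2)) for k, v in table.items()}


# Toy input: three rows, ten classes in the (toy) ontology.
records = [
    ([1, 2, 3, None, 3], {"substituents": [7]}),
    ([1, 2, 4, 5, 6], {}),
    ([1, 2, 3, None, 3], {"alternative_parents": [4]}),
]
buckets = coverage_buckets(rows_per_class(records), total_classes=10)
print(buckets["0 rows"], buckets[">= 1"])  # (3, 30.0) (7, 70.0)
```

Using a set per row keeps a class that appears both in the tree path and in an out-of-tree field from being counted twice for the same molecule.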
Where the 704 never-seen classes cluster
Most of the unrepresented classes belong to exotic inorganic chemistry or to lanthanide and actinide oxoanionic chemistry, which is not present at scale in the contributing catalogs.
| Children at zero coverage | Parent class |
|---|---|
| 44 | Actinide oxoanionic compounds |
| 32 | Metalloid oxoanionic compounds |
| 29 | Lanthanide oxoanionic compounds |
| 24 | Organic acids and derivatives |
| 22 | Post-transition metal oxoanionic compounds |
| 14 | Miscellaneous inorganic compounds |
| 14 | Alkaline earth metal oxoanionic compounds |
| 13 | Triterpenoids |
| 12 | Organic oxoanionic compounds |
| 11 | Nucleosides, nucleotides, and analogues |
Call for contributions
If your group has run ClassyFire on compounds that fall into any of the 704 unrepresented classes or into rare classes with ten or fewer examples, please contribute them to the next release. The list of zero-coverage parent groups above and the Coverage breakdown section indicate which classes are most in need of additional evidence.
The most useful contribution is a TSV of (InChIKey, SMILES, ClassyFire JSON response) tuples covering one or more of those classes.
Files in this release
| File | Description |
|---|---|
| classyfire_dedup_inchikey_smiles.enriched.tsv.zst | The labelled dataset, zstd-compressed TSV with one row per InChIKey |
| train.parquet | Training split (58,684,999 rows), stratified 80 % |
| validation.parquet | Validation split (7,335,446 rows), 10 % |
| test.parquet | Test split (7,335,784 rows), 10 % |
| vocabulary.json | Per-field ordered list of ChemOnt class names that decodes the uint16 label indices in the Parquet files |
| chemont_dictionary.tsv | ChemOnt numeric-ID, name, and parent map needed to decode the TSV JSON columns |
Dataset schema (TSV)
| Column | Description |
|---|---|
| inchikey | 27-character standard InChIKey |
| cid | PubChem CID (empty when the compound is not in PubChem) |
| zinc_id | ZINC20 identifier (empty when the compound is not in ZINC) |
| smiles | Representative SMILES |
| chemont_tree_json | Five-element JSON array of numeric ChemOnt IDs for [kingdom, superclass, class, subclass, direct_parent]. null indicates an unresolved slot |
| chemont_other_json | JSON object with optional keys intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features. Each value holds sorted, unique numeric ChemOnt IDs |
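A row of the TSV can be unpacked with the standard library alone, as in the sketch below. It assumes the decompressed file starts with a header row naming the six columns above and uses no quoting; neither is stated in this description. The sample row and its IDs are made up, and `parse_row` is a hypothetical helper.

```python
import csv
import io
import json

TREE_SLOTS = ("kingdom", "superclass", "class", "subclass", "direct_parent")


def parse_row(row):
    """Turn one TSV row (a dict of strings) into typed Python values."""
    tree_ids = json.loads(row["chemont_tree_json"])  # JSON null becomes None
    other_raw = row["chemont_other_json"]
    return {
        "inchikey": row["inchikey"],
        "cid": int(row["cid"]) if row["cid"] else None,
        "zinc_id": row["zinc_id"] or None,
        "smiles": row["smiles"],
        "tree": dict(zip(TREE_SLOTS, tree_ids)),
        "other": json.loads(other_raw) if other_raw else {},
    }


# Made-up sample: placeholder InChIKey, empty zinc_id, unresolved subclass.
SAMPLE = (
    "inchikey\tcid\tzinc_id\tsmiles\tchemont_tree_json\tchemont_other_json\n"
    'AAAAAAAAAAAAAA-BBBBBBBBBB-N\t123\t\tCCO\t[1, 2, 3, null, 3]\t{"substituents": [7, 9]}\n'
)
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
records = [parse_row(r) for r in reader]
print(records[0]["tree"]["subclass"], records[0]["other"]["substituents"])
```

`csv.QUOTE_NONE` keeps the double quotes inside the JSON columns intact. For the real file, the same reader can be wrapped around a streaming zstd decompressor.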
Parquet split
The three Parquet files share one schema. Identity and input columns are cid (int64, null when absent), zinc_id (uint32, null when absent), smiles (string), and inchikey (string). The nine label columns are kingdom_ids, superclass_ids, class_ids, subclass_ids, direct_parent_ids, intermediate_nodes_ids, alternative_parents_ids, substituents_ids, and mapped_features_ids, each a list<uint16>. The five tree-slot columns hold zero or one element, and the four out-of-tree columns hold zero or more. Every value is a 0-based index into the matching list in vocabulary.json. For example, vocabulary["class"][class_ids[0]] gives the class name. The geometric_descriptor field is not included in the Parquet label set.
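Decoding a row's label indices then reduces to list lookups. The sketch below assumes vocabulary.json is a JSON object whose keys are the column names without the _ids suffix. The text above confirms this only for "class"; the other keys are an assumption. The toy vocabulary and the `decode_row` helper are illustrative only.

```python
LABEL_COLUMNS = (
    "kingdom_ids", "superclass_ids", "class_ids", "subclass_ids",
    "direct_parent_ids", "intermediate_nodes_ids", "alternative_parents_ids",
    "substituents_ids", "mapped_features_ids",
)


def decode_row(row, vocabulary):
    """Map every list of uint16 indices in a row to class names."""
    decoded = {}
    for column in LABEL_COLUMNS:
        field = column[: -len("_ids")]      # assumed vocabulary key
        names = vocabulary.get(field, [])
        indices = row.get(column) or []     # treat null lists as empty
        decoded[field] = [names[i] for i in indices]
    return decoded


# Toy vocabulary and row; the real lists come from vocabulary.json.
vocabulary = {
    "kingdom": ["Inorganic compounds", "Organic compounds"],
    "class": ["Benzene and substituted derivatives"],
    "substituents": ["Alcohol", "Ether", "Phenol"],
}
row = {"kingdom_ids": [1], "class_ids": [0], "subclass_ids": [],
       "substituents_ids": [0, 2]}
labels = decode_row(row, vocabulary)
print(labels["kingdom"], labels["substituents"])
# ['Organic compounds'] ['Alcohol', 'Phenol']
```

An empty list in a tree-slot column decodes to an empty list, which corresponds to an unresolved slot.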
Files (4.8 GB)
ChemOnt_2_1.obo.zip
| MD5 checksum | Size |
|---|---|
| 0616fc94c8963764830fa552eb02092e | 307.9 kB |
| 312aa8b25735de8849345cbf1e7540d8 | 362.2 kB |
| f327be279d8655a518be7a7fdd7dcf25 | 2.1 GB |
| 888982b9d1365a39b62d3a382e256d0c | 294.4 MB |
| ba88e29a2403504e4e60f14759faca3a | 2.1 GB |
| 65c6f11cc41bbbcb815f92e10cbb1dc4 | 294.5 MB |
| 55667d9adb56519fcca4f364cdd7a600 | 321.3 kB |