Published June 2, 2026 | Version v4
Dataset Open

InChIKey-Deduplicated ClassyFire/ChemOnt Label Collection

Description

An InChIKey-deduplicated aggregation of ClassyFire / ChemOnt chemical class labels for 73,356,229 unique compounds, assembled from PubChem, ZINC20, and Enamine REAL aligned sources contributed by collaborating laboratories. Every row carries the canonical five-tier hierarchical path [kingdom, superclass, class, subclass, direct_parent], and the non-hierarchical ClassyFire labels (intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features) are exposed in a chemont_other_json column. This release also ships a model-ready Parquet train/validation/test split with a per-field label vocabulary.

  • These labels cannot be fully verified against the live ClassyFire model. The dataset aggregates ClassyFire results contributed by several laboratories over different time periods; it is not a fresh classification run, and the exact source of an individual label is not always recoverable. Re-classifying every compound through the official endpoint at classyfire.wishartlab.com, to confirm that the labels match what ClassyFire would emit today, is not practically feasible: the service accepts one InChIKey per request and is rate-limited to about 12 requests per minute, so a single pass over the ~73 M compounds in this release would take more than eleven years of uninterrupted querying. Historical version differences or occasional classification errors may therefore be present. As a partial safeguard, some InChIKeys are reported by several contributing sources, and their independent ClassyFire classifications agree on the labels we retain. This cross-source agreement serves as a confidence signal in place of direct re-verification, because broad label corruption would have produced systematic disagreement between sources, which we do not observe.
  • Identity is InChIKey-only. The dataset is keyed by the 27-character standard InChIKey. It is not tautomer-normalized and does not attempt cross-tautomer identity resolution. If you need that, normalize the SMILES yourself before joining.
  • Coverage is not complete. 704 of the 4,824 ChemOnt classes (14.6 %) do not appear on any molecule in this release, mostly in exotic inorganic and lanthanide or actinide chemistry that is genuinely under-represented in the contributing catalogs. See the Coverage breakdown section.
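The re-classification estimate in the first caveat is simple arithmetic and can be reproduced with the sketch below. It assumes the quoted limit of about 12 requests per minute holds continuously, with no downtime, retries, or parallel access; the function name is illustrative.

```python
# Back-of-the-envelope estimate of one full re-classification pass.
N_COMPOUNDS = 73_356_229       # unique InChIKeys in this release
REQUESTS_PER_MINUTE = 12       # approximate rate limit quoted in the caveat

MINUTES_PER_YEAR = 60 * 24 * 365.25


def reclassification_years(n_compounds: int, requests_per_minute: float) -> float:
    """Years of continuous querying at one InChIKey per request."""
    minutes = n_compounds / requests_per_minute
    return minutes / MINUTES_PER_YEAR


years = reclassification_years(N_COMPOUNDS, REQUESTS_PER_MINUTE)
print(f"{years:.1f} years of uninterrupted requests")
```

Any realistic run would take longer than this lower bound, since the calculation ignores failed requests and service interruptions.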

What is new in this release

Structures recovered from Enamine REAL. An additional 250,948 labelled structures were provided by colleagues working on the Enamine dataset.

Model-ready Parquet split. In addition to the full TSV, this release ships train.parquet, validation.parquet, and test.parquet plus vocabulary.json. The split is 80/10/10, stratified by the deepest resolved ChemOnt tree slot so that fine-grained classes are represented across all three partitions. Label columns store compact uint16 indices into a per-field vocabulary, decoded through vocabulary.json. See the Parquet split section.
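To make the stratification idea concrete, here is a minimal sketch of an 80/10/10 split keyed on the deepest resolved tree slot. It is not the procedure used to build the released files: the function names, the seed, and the rounding rule are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def deepest_resolved(tree):
    """Return the last non-null ID in a five-slot path, or None if all are null."""
    for chem_id in reversed(tree):
        if chem_id is not None:
            return chem_id
    return None


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (key, tree) pairs into train/validation/test lists of keys.

    Rows are grouped by deepest_resolved(tree) and each group is split
    separately, so every sufficiently large stratum reaches all partitions.
    """
    groups = defaultdict(list)
    for key, tree in rows:
        groups[deepest_resolved(tree)].append(key)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for stratum in sorted(groups, key=str):      # deterministic stratum order
        members = groups[stratum]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        validation += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]        # remainder goes to test
    return train, validation, test


# Toy input: two strata of 100 rows each, with made-up ChemOnt IDs.
rows = [(f"A{i}", [1, 2, 3, None, None]) for i in range(100)]
rows += [(f"B{i}", [1, 2, 4, 5, 6]) for i in range(100)]
train, validation, test = stratified_split(rows)
```

With this rounding rule, strata holding only a handful of rows cannot appear in all three partitions; how the released split treats such strata is not specified here.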

Lineage

Corrected hierarchical paths. An earlier release inadvertently encoded 3,388,432 rows (4.64 %) in the ClassyFire v2.1 schema layout, in which the meta-root Chemical entities (numeric ID 9999999) occupies the kingdom slot and every other tier is shifted down by one, pushing the rightful subclass out of the five-slot path. The aggregator detects the v2.1 layout at label-mapping time and shifts it back. All rows carry the canonical [kingdom, superclass, class, subclass, direct_parent] layout, and 9999999 never appears in any tree slot.
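Consumers who want to confirm the stated invariant on their own copy can run a check along the following lines. This is a sketch of a validator, not the aggregator's correction logic; the function name is hypothetical, and it relies only on the layout described above (five slots, null for unresolved, 9999999 never present).

```python
import json

META_ROOT_ID = 9999999   # "Chemical entities" meta-root, as described above
TREE_SLOTS = ("kingdom", "superclass", "class", "subclass", "direct_parent")


def check_tree_path(chemont_tree_json: str) -> list:
    """Parse a chemont_tree_json value and verify the canonical layout."""
    path = json.loads(chemont_tree_json)
    if not isinstance(path, list) or len(path) != len(TREE_SLOTS):
        raise ValueError(f"expected {len(TREE_SLOTS)} slots, got {path!r}")
    for slot, chem_id in zip(TREE_SLOTS, path):
        if chem_id is None:
            continue                                  # unresolved slot
        if isinstance(chem_id, bool) or not isinstance(chem_id, int):
            raise ValueError(f"non-integer ID in {slot} slot: {chem_id!r}")
        if chem_id == META_ROOT_ID:
            raise ValueError(f"meta-root found in {slot} slot (v2.1 layout?)")
    return path


# Made-up IDs, used only to exercise the check.
ok_path = check_tree_path("[1, 2, 3, null, null]")
```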

Out-of-tree labels. Beyond the five-slot hierarchical path, the chemont_other_json column records the non-hierarchical labels ClassyFire reports per molecule, namely intermediate_nodes, alternative_parents, geometric_descriptor, substituents, and mapped_features. Each value resolves to numeric ChemOnt IDs through the bundled dictionary. Including these labels raises class coverage from 3,798 / 4,824 (78.7 %), when a class is counted only if it appears in the hierarchical path, to 4,120 / 4,824 (85.4 %), when it may appear anywhere in the classification.
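The phrase "anywhere in the classification" can be read as the union of the tree slots and every list in chemont_other_json. A small sketch of that reading follows; the function name is hypothetical and the IDs in the example are made up.

```python
import json


def labelled_class_ids(chemont_tree_json: str, chemont_other_json: str) -> set:
    """Collect every ChemOnt ID attached to one row, in or out of the tree."""
    tree = json.loads(chemont_tree_json)
    other = json.loads(chemont_other_json) if chemont_other_json else {}

    ids = {chem_id for chem_id in tree if chem_id is not None}
    for values in other.values():        # e.g. substituents, alternative_parents
        ids.update(values)
    return ids


ids = labelled_class_ids(
    "[1, 2, 3, null, null]",
    '{"substituents": [7, 8], "alternative_parents": [3, 9]}',
)
```

Counting distinct IDs over all rows with the tree alone, and then with this union, gives the two coverage figures being compared.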

Headline figures

Metric Value
Unique InChIKeys (rows) 73,356,229
Rows with PubChem CID 68,693,295
Rows with ZINC20 ID 30,187,323
Rows carrying both PubChem CID and ZINC20 ID 25,775,344
Rows added from Enamine REAL (no CID or ZINC20 ID) 250,948
Rows with at least one unresolved tree slot 119,839
ChemOnt classes seen at least once 4,120 / 4,824 (85.4 %)
ChemOnt classes never assigned to any molecule 704 / 4,824 (14.6 %)

Coverage breakdown

Distribution of the 4,824 ChemOnt classes by the number of molecules that carry the class as a label anywhere in the classification (five-slot hierarchical path plus the chemont_other_json fields).

Bucket Classes Share of 4,824
0 rows (class never assigned) 704 14.59 %
Exactly 1 row 74 1.53 %
>= 1 row (covered) 4,120 85.41 %
>= 10 rows 3,819 79.17 %
>= 100 rows 3,259 67.56 %
>= 1,000 rows 2,422 50.21 %
>= 10,000 rows 1,490 30.89 %
>= 100,000 rows 684 14.18 %
>= 1,000,000 rows 183 3.79 %
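The buckets above are cumulative counts over a per-class row tally. A sketch of how such a table could be derived is shown below, assuming a mapping from ChemOnt class ID to the number of rows carrying it; the function names and the toy tally are illustrative, not part of the release.

```python
THRESHOLDS = (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000)


def coverage_buckets(rows_per_class: dict, n_classes: int) -> dict:
    """Count classes per bucket; classes absent from the tally have zero rows."""
    seen = [n for n in rows_per_class.values() if n > 0]
    buckets = {
        "0 rows": n_classes - len(seen),
        "exactly 1 row": sum(1 for n in seen if n == 1),
    }
    for threshold in THRESHOLDS:
        buckets[f">= {threshold:,} rows"] = sum(1 for n in seen if n >= threshold)
    return buckets


def share(count: int, n_classes: int) -> str:
    """Format a class count as a percentage of the ontology size."""
    return f"{100 * count / n_classes:.2f} %"


# Toy tally: three classes seen out of a five-class ontology.
toy = coverage_buckets({101: 1, 102: 15, 103: 2000}, n_classes=5)
print(toy)
print(share(704, 4824), share(4120, 4824))
```

Note that "0 rows" and ">= 1 rows" always sum to the ontology size, which is a quick sanity check on any recomputed table.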

Where the 704 never-seen classes cluster

Most of the unrepresented classes belong to exotic inorganic chemistry or to lanthanide and actinide oxoanionic chemistry, which is not present at scale in the contributing catalogs.

Children at zero coverage Parent class
44 Actinide oxoanionic compounds
32 Metalloid oxoanionic compounds
29 Lanthanide oxoanionic compounds
24 Organic acids and derivatives
22 Post-transition metal oxoanionic compounds
14 Miscellaneous inorganic compounds
14 Alkaline earth metal oxoanionic compounds
13 Triterpenoids
12 Organic oxoanionic compounds
11 Nucleosides, nucleotides, and analogues

Call for contributions

If your group has run ClassyFire on compounds that fall into any of the 704 unrepresented classes or into rare classes with ten or fewer examples, please contribute them to the next release. The list of zero-coverage parent groups above and the Coverage breakdown section indicate which classes are most in need of additional evidence.

The most useful contribution is a TSV of (InChIKey, SMILES, ClassyFire JSON response) tuples covering one or more of those classes.
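One possible way to serialize such tuples is sketched below. The column names and the choice of compact single-line JSON are assumptions, not a required submission format, and the response object is a placeholder standing in for a real ClassyFire reply.

```python
import io
import json


def write_contribution(records, handle):
    """Write (InChIKey, SMILES, ClassyFire response) tuples as TSV lines."""
    handle.write("inchikey\tsmiles\tclassyfire_json\n")   # assumed header names
    for inchikey, smiles, response in records:
        # json.dumps escapes tabs and newlines, so the JSON stays in one field.
        payload = json.dumps(response, sort_keys=True, separators=(",", ":"))
        handle.write(f"{inchikey}\t{smiles}\t{payload}\n")


# Ethanol, with a placeholder where the full ClassyFire response would go.
records = [
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "CCO",
     {"placeholder": "full ClassyFire JSON response goes here"}),
]
buffer = io.StringIO()
write_contribution(records, buffer)
print(buffer.getvalue())
```

Keeping the raw response, rather than a hand-extracted subset, lets the aggregator apply its own label mapping.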

Files in this release

File Description
classyfire_dedup_inchikey_smiles.enriched.tsv.zst The labelled dataset, zstd-compressed TSV with one row per InChIKey
train.parquet Training split (58,684,999 rows), stratified 80 %
validation.parquet Validation split (7,335,446 rows), 10 %
test.parquet Test split (7,335,784 rows), 10 %
vocabulary.json Per-field ordered list of ChemOnt class names that decodes the uint16 label indices in the Parquet files
chemont_dictionary.tsv ChemOnt numeric-ID, name, and parent map needed to decode the TSV JSON columns

Dataset schema (TSV)

Column Description
inchikey 27-character standard InChIKey
cid PubChem CID (empty when the compound is not in PubChem)
zinc_id ZINC20 identifier (empty when the compound is not in ZINC)
smiles Representative SMILES
chemont_tree_json Five-element JSON array of numeric ChemOnt IDs for [kingdom, superclass, class, subclass, direct_parent]. null indicates an unresolved slot
chemont_other_json JSON object with optional keys intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features. Each value holds sorted, unique numeric ChemOnt IDs
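As a reading aid, the sketch below parses one decompressed TSV line into typed values. It assumes the columns appear in the order of the table above, that fields are tab-separated without quoting, and that any header line is skipped by the caller; none of this is confirmed by the release notes. The sample line is illustrative, with made-up ChemOnt IDs, and is not copied from the dataset.

```python
import json

# Column order assumed to follow the schema table.
COLUMNS = ("inchikey", "cid", "zinc_id", "smiles",
           "chemont_tree_json", "chemont_other_json")


def parse_line(line: str) -> dict:
    """Turn one TSV line into a dict with decoded JSON columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    if len(row["inchikey"]) != 27:
        raise ValueError(f"not a 27-character InChIKey: {row['inchikey']!r}")
    return {
        "inchikey": row["inchikey"],
        "cid": int(row["cid"]) if row["cid"] else None,   # empty when absent
        "zinc_id": row["zinc_id"] or None,                # empty when absent
        "smiles": row["smiles"],
        "tree": json.loads(row["chemont_tree_json"]),     # null becomes None
        "other": json.loads(row["chemont_other_json"]),
    }


sample = ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N\t702\t\tCCO\t"
          '[1, 2, 3, null, null]\t{"substituents": [7, 8]}\n')
parsed = parse_line(sample)
```

The numeric IDs in the tree and other fields can then be resolved to names through chemont_dictionary.tsv.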

Parquet split

The three Parquet files share one schema. Identity and input columns are cid (int64, null when absent), zinc_id (uint32, null when absent), smiles (string), and inchikey (string). The nine label columns are kingdom_ids, superclass_ids, class_ids, subclass_ids, direct_parent_ids, intermediate_nodes_ids, alternative_parents_ids, substituents_ids, and mapped_features_ids, each a list<uint16>. The five tree-slot columns hold zero or one element, the four out-of-tree columns hold zero or more. Every value is a 0-based index into the matching list in vocabulary.json, for example vocabulary["class"][class_ids[0]] gives the class name. The geometric_descriptor field is not included in the Parquet label set.
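The index lookup can be applied to all label columns at once, as in the sketch below. The text confirms the key only for class (vocabulary["class"]); deriving the other keys by stripping the _ids suffix is an assumption, and the toy vocabulary and row are invented for illustration. In practice the row would come from a Parquet reader and the vocabulary from vocabulary.json.

```python
def decode_labels(row: dict, vocabulary: dict) -> dict:
    """Map every *_ids column of one row to its list of class names."""
    decoded = {}
    for column, indices in row.items():
        if not column.endswith("_ids"):
            continue                          # identity/input column
        field = column[: -len("_ids")]        # assumed vocabulary key
        names = vocabulary[field]
        decoded[field] = [names[i] for i in indices]   # 0-based indices
    return decoded


# Toy vocabulary; the real lists and their order come from vocabulary.json.
vocabulary = {
    "kingdom": ["Inorganic compounds", "Organic compounds"],
    "class": ["Benzene and substituted derivatives",
              "Carboxylic acids and derivatives"],
    "subclass": ["Amino acids, peptides, and analogues"],
    "substituents": ["Alcohol", "Carboxylic acid", "Hydrocarbon derivative"],
}
row = {
    "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "smiles": "CCO",
    "kingdom_ids": [1],
    "class_ids": [0],
    "subclass_ids": [],            # unresolved tree slot: empty list
    "substituents_ids": [0, 2],
}
labels = decode_labels(row, vocabulary)
```

An empty list decodes to an empty list, so unresolved tree slots need no special handling.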

Files (4.8 GB)

ChemOnt_2_1.obo.zip

Checksum Size
md5:0616fc94c8963764830fa552eb02092e 307.9 kB
md5:312aa8b25735de8849345cbf1e7540d8 362.2 kB
md5:f327be279d8655a518be7a7fdd7dcf25 2.1 GB
md5:888982b9d1365a39b62d3a382e256d0c 294.4 MB
md5:ba88e29a2403504e4e60f14759faca3a 2.1 GB
md5:65c6f11cc41bbbcb815f92e10cbb1dc4 294.5 MB
md5:55667d9adb56519fcca4f364cdd7a600 321.3 kB