Published April 19, 2026 | Version v11
Dataset Open

PubChem CID-SMILES topology classification snapshot

Authors/Creators

Description

Topology annotations for the current PubChem CID-SMILES snapshot.

The Parquet artifact stores one row per PubChem CID with connected-component counts, exact diameters for connected molecules, triangle and square motif counts, mean local and square clustering coefficients, and the following topology predicates computed with smiles-parser and geometric-traits: tree, forest, cactus, chordal, planar, outerplanar, k23_homeomorph, k33_homeomorph, k4_homeomorph, bipartite.
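As a rough illustration of what a few of these annotations mean, the sketch below computes connected components, the tree, forest, and bipartite predicates, and the exact diameter on small hand-written graphs. It is plain Python and does not use smiles-parser or geometric-traits; the function names and the edge-list representation are assumptions made for this example, not the dataset's implementation.

```python
from collections import deque


def components(n, edges):
    """Return (components, adjacency) for an undirected graph on nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * n
    comps = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        comps.append(comp)
    return comps, adj


def is_forest(n, edges):
    # A graph is acyclic exactly when edges == nodes - components.
    comps, _ = components(n, edges)
    return len(edges) == n - len(comps)


def is_tree(n, edges):
    # A tree is a connected forest.
    comps, _ = components(n, edges)
    return len(comps) == 1 and len(edges) == n - 1


def is_bipartite(n, edges):
    # Two-colour every component with BFS; an edge joining equal colours
    # means there is an odd cycle.
    _, adj = components(n, edges)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def diameter(n, edges):
    """Exact diameter via BFS from every node; None for disconnected graphs."""
    comps, adj = components(n, edges)
    if len(comps) != 1:
        return None
    best = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist))
    return best


# Heavy-atom skeletons written out by hand (no SMILES parsing here).
propane = (3, [(0, 1), (1, 2)])                        # CCC
cyclopropane = (3, [(0, 1), (1, 2), (2, 0)])           # C1CC1
benzene = (6, [(i, (i + 1) % 6) for i in range(6)])    # c1ccccc1

print(is_tree(*propane), is_forest(*cyclopropane), is_bipartite(*benzene))
print(diameter(*propane), diameter(*benzene))
```

The all-pairs BFS is fine for molecule-sized graphs, and returning None for disconnected inputs mirrors the description's statement that diameters are reported only for connected molecules.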

The JSON sidecar stores aggregate counts, parse and topology error totals, and run metadata, while the SVG infographic provides an accessible visual summary of the run. Source snapshot URL: https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-SMILES.gz.
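For readers who want to look at the source snapshot itself, the following is a minimal sketch of a streaming reader. It assumes the snapshot is a gzip-compressed text file with one record per line in the form CID, tab, SMILES; check that layout against the downloaded file. The function name `iter_cid_smiles` and the sample rows are made up for illustration and are not real PubChem records.

```python
import gzip
import io


def iter_cid_smiles(stream):
    """Yield (cid, smiles) pairs from a gzip-compressed binary stream.

    Assumes one record per line in the form "<CID>\t<SMILES>".
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            line = line.rstrip("\n")
            if not line:
                continue
            cid, smiles = line.split("\t", 1)
            yield int(cid), smiles


# Tiny in-memory stand-in for the real file (illustrative rows only).
sample = gzip.compress(b"1\tCCO\n2\tc1ccccc1\n")
records = list(iter_cid_smiles(io.BytesIO(sample)))
print(records)
```

Reading line by line keeps memory use flat, which matters for a file with more than a hundred million rows. For the real snapshot, pass an opened binary file handle instead of the in-memory sample.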

Notes

Rows: 123853571. Parsed: 123853571. Parse errors: 0. Topology errors: 0. Estimated in-memory result size: 3.84 GiB. Analysis runtime: 169 seconds.
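The figures above imply a throughput and a per-row memory footprint. The short calculation below derives both. It assumes GiB means 2^30 bytes and treats the reported size as exact, although it is stated as an estimate, so the per-row value is approximate.

```python
rows = 123_853_571      # rows reported in the notes
runtime_s = 169         # analysis runtime in seconds
result_gib = 3.84       # estimated in-memory result size in GiB

rows_per_second = rows / runtime_s
bytes_per_row = result_gib * 2**30 / rows

print(f"{rows_per_second:,.0f} rows/s")    # about 732,861 rows/s
print(f"{bytes_per_row:.1f} bytes/row")    # about 33.3 bytes/row
```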

Files

pubchem-topology-summary.json

Files (569.1 MB)

Checksum                                Size
md5:b29c7c0ec3ad575cdb7f725729e4d91d    44.3 kB
md5:3e6657d138c99b902d3570e64320d3cd    63.2 kB
md5:036e60b44623e31adb7e101c6e0b1f32    569.0 MB
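The MD5 checksums in the file listing can be used to confirm that a download is intact. The sketch below hashes a file in chunks with Python's standard hashlib module and compares the result with a listed value. The helper names are assumptions for this example, and the demonstration uses a small temporary file rather than one of the dataset files.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' (Python 3.9+)."""
    expected = listed.removeprefix("md5:").lower()
    return md5_of_file(path) == expected


# Demonstration on a throwaway file whose MD5 is well known.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches_listing(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print(demo_ok)
```

To check a real download, call `matches_listing` with the local path of the file and the checksum shown next to its size in the listing. Chunked reading avoids loading the 569.0 MB artifact into memory at once.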