Published June 27, 2025 | Version v1
Dataset Open

Beyond benchmarking: an expert-guided consensus approach to spatially aware clustering - Supporting Data

  • 1. University Hospital of Lausanne
  • 2. ETH Zurich
  • 3. Delft University of Technology
  • 4. Leiden University Medical Center
  • 5. University of Zurich
  • 6. RWTH Aachen University
  • 7. Allen Institute for Brain Science
  • 8. Medical University of Graz
  • 9. University of Heidelberg Bioquant
  • 10. Charité - Universitätsmedizin Berlin
  • 11. Berlin Institute of Health at Charité - Universitätsmedizin Berlin
  • 12. University of Lausanne
  • 13. Fred Hutch Cancer Center
  • 14. SIB Swiss Institute of Bioinformatics

Description

This Zenodo record consists of:

The following datasets are included:

CosMx human liver dataset (cosmx_liver)

The CosMx human liver dataset was obtained from the NanoString website (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/human-liver-rna-ffpe-dataset). The dataset consists of 2 Formalin-Fixed Paraffin-Embedded (FFPE) samples from 2 patients, one being normal liver and the other from a hepatocellular carcinoma patient with grade G3 cancer. Data was generated on the CosMx platform using the Human Universal Cell Characterization Panel 1000 plex. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.
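The neighborhood feature described above (the frequency of each cell type among a cell's nearest neighbors) can be illustrated with a minimal sketch. This is not the original annotation code: the function name, the brute-force neighbor search, and the toy coordinates are assumptions made for illustration, and the subsequent Mclust step (Gaussian mixture model clustering from the R package mclust) is not reproduced here.

```python
from collections import Counter
from math import dist

def neighborhood_composition(coords, cell_types, k):
    """Fraction of each cell type among the k nearest neighbors of every cell.

    Hypothetical helper for illustration; assumes at least two cells.
    """
    types = sorted(set(cell_types))
    profiles = []
    for i, p in enumerate(coords):
        # Rank all other cells by Euclidean distance to cell i (brute force, O(n^2)).
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: dist(p, coords[j]))
        neighbors = others[:k]
        counts = Counter(cell_types[j] for j in neighbors)
        # Divide by the number of neighbors actually found, so rows sum to 1.
        profiles.append([counts[t] / len(neighbors) for t in types])
    return types, profiles

# Toy data (made up): two spatially separated groups of cells.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cell_types = ["hepatocyte"] * 3 + ["tumor"] * 3
types, profiles = neighborhood_composition(coords, cell_types, k=2)
```

With the description's setting, `k` would be 200; each row of `profiles` would then serve as the feature vector passed to the clustering step.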

CosMx human non-small-cell lung cancer dataset (cosmx_lung)

The CosMx human non-small-cell lung cancer dataset was obtained from the NanoString website (https://staging.nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset). The dataset consists of 8 FFPE samples from 5 patients presenting with non-small-cell lung cancer grade G1-G3. Data was generated on a CosMx prototype instrument using a 960-gene panel [1]. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.

MERFISH mouse brain thalamus dataset (abc_atlas_wmb_thalamus)

The MERFISH mouse brain thalamus dataset [2] was obtained from the Brain Knowledge Platform (https://alleninstitute.github.io/abc_atlas_access/descriptions/MERFISH-C57BL6J-638850.html). The dataset consists of 59 fresh frozen (FF) serial full coronal sections at 200-µm intervals spanning one entire mouse brain. Data was generated on a Vizgen MERSCOPE instrument using a custom gene panel of 500 genes. The ground truth annotations were identified by aligning the MERFISH data to the CCFv3 coordinate space and labeling cells with the corresponding CCFv3 anatomical parcellation term [3]. Only the thalamus (TH; CCFv3 structure ID 549) and hypothalamic zona incerta (ZI; CCFv3 structure ID 797) were analyzed in this study. Spatially variable genes in the thalamus were identified by differential gene expression analysis on neighboring consensus clusters.

MERFISH human developmental heart dataset (merfish_devheart)

The MERFISH human developmental heart dataset [4] was obtained from Dryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.w0vt4b8vp). The dataset consists of 4 FF samples from 2 donors at 13 and 15 post-conception weeks (PCW). Data was generated using MERFISH with a custom 238-gene panel. The ground truth annotations (referred to as cellular communities in the original study) were computationally identified using k-means clustering of relative cell-type composition within 150 µm of each cell.
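As a rough illustration of the idea behind these annotations, the sketch below computes the relative cell-type composition within a fixed radius of each cell and clusters the resulting vectors with a plain Lloyd's k-means. It is a sketch under stated assumptions, not the original study's pipeline: the helper names, the inclusion of the cell itself in its own neighborhood, the caller-supplied initial centroids, and the toy coordinates are all illustrative choices.

```python
from collections import Counter
from math import dist

def composition_within_radius(coords, cell_types, radius):
    """Relative cell-type composition among cells within `radius` of each cell.

    Assumption: the cell itself counts as part of its own neighborhood.
    """
    types = sorted(set(cell_types))
    rows = []
    for p in coords:
        near = [t for q, t in zip(coords, cell_types) if dist(p, q) <= radius]
        counts = Counter(near)
        rows.append([counts[t] / len(near) for t in types])
    return rows

def kmeans(rows, init_centroids, n_iter=10):
    """Plain Lloyd's k-means with caller-supplied initial centroids (n_iter >= 1)."""
    centroids = [list(c) for c in init_centroids]  # do not mutate the caller's list
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(len(centroids)), key=lambda c: dist(r, centroids[c]))
                  for r in rows]
        # Update step: mean of the members; an empty cluster keeps its old centroid.
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (made up), coordinates in µm.
coords = [(0, 0), (50, 0), (100, 0), (1000, 0), (1050, 0), (1100, 0)]
cell_types = ["cardiomyocyte"] * 3 + ["fibroblast"] * 3
rows = composition_within_radius(coords, cell_types, radius=150)
labels = kmeans(rows, init_centroids=[rows[0], rows[-1]])
```

Each resulting label plays the role of a cellular community; a production analysis would more likely rely on a spatial index and a library k-means implementation with multiple random restarts.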

STARmap PLUS mouse brain dataset (STARmap_plus)

The STARmap PLUS mouse brain dataset [5] was obtained from Zenodo (https://zenodo.org/records/8327576). The dataset consists of 20 FF samples from 3 mice. Data was generated using STARmap PLUS with a custom 1,022-gene panel. The ground truth annotations were manually identified by aligning the data to the CCFv3.

Xenium human breast cancer dataset (xenium-ffpe-bc-idc)

The Xenium breast cancer dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/datasets/xenium-ffpe-human-breast-with-custom-add-on-panel-1-standard). The dataset consists of 1 FFPE sample from a patient with infiltrating ductal carcinoma breast cancer. Data was generated on a Xenium Analyzer using the Xenium human breast gene expression panel v1 (280 genes) with 100 additional custom genes. The ground truth annotation was manually identified using the matched histopathology image, annotating for eight region types: ductal carcinoma in situ, invasive tumor, normal ducts, immune cells, cysts, blood vessels, adipose tissue, and stroma [6].

Xenium mouse brain dataset (xenium-mouse-brain-SergioSalas)

The Xenium mouse brain dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard). The dataset consists of 1 FF sample of a full coronal section. Data was generated on a Xenium Analyzer using the v1 mouse brain gene expression panel (247 genes). The ground truth annotation was manually identified using the mouse coronal P56 sample from the Allen Brain Atlas [3] to specify anatomical regions [7].

Slide-seqV2 mouse brain olfactory bulb dataset (slideseq2_olfactory_bulb)

The Slide-seqV2 mouse brain olfactory bulb dataset [8] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/datasets/STDS0000172/data). The dataset consists of 20 samples of a mouse olfactory bulb evenly spaced along the anterior-posterior axis. Data was generated using Slide-seqV2 and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument, targeting 200 million reads per sample. The ground truth annotations were manually identified based on the expression of marker genes.

Stereo-seq mouse liver dataset (stereoseq_liver)

The Stereo-seq mouse liver dataset [9] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/lista/spatial). The dataset consists of 6 FF samples. Data was generated on Stereo-seq chips and sequenced using paired-end reads on a DIPSEQ T1 instrument. The ground truth annotations were computationally identified: zonation layers were annotated based on the differences between the scores of pericentral and periportal hepatocyte landmark genes.
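One way to picture a score-difference annotation of this kind is sketched below. Everything specific in it is an assumption rather than the original study's procedure: the score is taken as the mean expression of each gene set, the layers are equal-width bins of the score difference, and the gene names (Glul and Cyp2e1 as pericentral, Cyp2f2 and Pck1 as periportal) and expression values are illustrative placeholders.

```python
def zonation_layers(spots, pericentral, periportal, n_layers):
    """Assign each spot a layer index from its pericentral-minus-periportal score.

    Hypothetical helper: layer 0 is the most periportal, n_layers - 1 the most
    pericentral. `spots` is a list of {gene: expression} dictionaries.
    """
    def score(spot, genes):
        # Assumed score: mean expression over the landmark genes (missing genes count as 0).
        return sum(spot.get(g, 0.0) for g in genes) / len(genes)

    diffs = [score(s, pericentral) - score(s, periportal) for s in spots]
    lo, hi = min(diffs), max(diffs)
    width = (hi - lo) / n_layers or 1.0  # fall back to 1.0 if all differences are equal
    # Equal-width bins; the maximum difference is clamped into the last layer.
    return [min(int((d - lo) / width), n_layers - 1) for d in diffs]

# Toy expression values (made up) for three spots.
spots = [
    {"Glul": 10, "Cyp2e1": 8, "Cyp2f2": 0, "Pck1": 1},
    {"Glul": 4, "Cyp2e1": 5, "Cyp2f2": 4, "Pck1": 5},
    {"Glul": 0, "Cyp2e1": 1, "Cyp2f2": 9, "Pck1": 10},
]
layers = zonation_layers(spots, ["Glul", "Cyp2e1"], ["Cyp2f2", "Pck1"], n_layers=3)
```

Equal-width binning is only one possible choice; quantile-based bins would instead give layers with similar numbers of spots.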

Stereo-seq mouse embryo dataset (stereoseq_mouse_embryo)

The Stereo-seq mouse embryo dataset [10] was obtained from the STOmicsDB website (https://ftp.cngb.org/pub/SciRAID/stomics/STDS0000058/stomics). The dataset consists of 53 FF samples from mouse embryos spanning E9.5–E16.5 with one-day intervals. Data was generated on Stereo-seq chips and sequenced using paired-end reads on an MGI DNBSEQ-Tx sequencer. The ground truth annotations were computationally identified using Spatially Constrained Clustering (SCC), which is built on top of the Leiden clustering algorithm.

Visium human brain LIBD DLPFC dataset 1 (libd_dlpfc)

The Visium human brain LIBD DLPFC dataset 1 [11] was obtained from the spatialLIBD Bioconductor package (https://research.libd.org/spatialLIBD). The dataset consists of 12 FF samples from 3 donors. The data was generated on Visium chips and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument. The ground truth annotations were manually identified based on cytoarchitecture and selected gene markers.

osmFISH mouse brain somatosensory cortex dataset (osmfish_Ssp)

The osmFISH mouse brain somatosensory cortex dataset [12] was obtained from the Linnarsson Lab website (https://linnarssonlab.org/osmFISH). The dataset consists of a single FF sample from the mouse brain somatosensory cortex. Data was generated using osmFISH with a custom 33-gene panel. The ground truth annotation was computationally identified using an iterative graph-based algorithm.

Visium human breast cancer dataset (visium_breast_cancer_SEDR)

The Visium human breast cancer dataset, originally from 10x Genomics (https://www.10xgenomics.com/resources/datasets/human-breast-cancer-block-a-section-1-1-standard-1-1-0), was obtained from GitHub (https://github.com/JinmiaoChenLab/SEDR_analyses). The dataset consists of a single FF sample of invasive ductal carcinoma breast tissue. The data was generated on a Visium chip and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument. The ground truth annotation was manually identified based on the H&E image.

Visium chicken heart dataset (visium_chicken_heart)

The Visium chicken heart dataset [13] was obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149457). The dataset consists of 11 FF samples from four hearts at different stages of ventricular development. The data was generated on a Visium chip and sequenced using paired-end reads on an Illumina NextSeq 500/550 instrument. The ground truth annotations were computationally identified using Louvain clustering as implemented in Seurat v3.

Visium HD human colorectal cancer dataset (Visium_hd_cancer_colon)

The Visium HD human colorectal cancer dataset [14] was obtained from the 10x Genomics website (https://www.10xgenomics.com/datasets/visium-hd-cytassist-gene-expression-libraries-of-human-crc). The dataset consists of 1 FFPE sample of colorectal cancer obtained from the sigmoid colon of a patient. Data was generated on a Visium HD chip and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument.

 

  1. He, S., Bhatt, R., Brown, C. et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat Biotechnol 40, 1794–1806 (2022). https://doi.org/10.1038/s41587-022-01483-z
  2. Yao, Z., van Velthoven, C.T.J., Kunst, M. et al. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. Nature 624, 317–332 (2023). https://doi.org/10.1038/s41586-023-06812-z
  3. Wang, Q., Ding, S., Li, Y. et al. The Allen Mouse Brain Common Coordinate Framework: A 3D Reference Atlas. Cell 181(4) (2020). https://doi.org/10.1016/j.cell.2020.04.007
  4. Farah, E.N., Hu, R.K., Kern, C. et al. Spatially organized cellular communities form the developing human heart. Nature 627, 854–864 (2024). https://doi.org/10.1038/s41586-024-07171-z
  5. Shi, H., He, Y., Zhou, Y. et al. Spatial atlas of the mouse central nervous system at molecular resolution. Nature 622, 552–561 (2023). https://doi.org/10.1038/s41586-023-06569-5
  6. Bhuva, D.D., Tan, C.W., Salim, A. et al. Library size confounds biology in spatial transcriptomics data. Genome Biol 25, 99 (2024). https://doi.org/10.1186/s13059-024-03241-7
  7. Marco Salas, S., Kuemmerle, L.B., Mattsson-Langseth, C. et al. Optimizing Xenium In Situ data utility by quality assessment and best-practice analysis workflows. Nat Methods 22, 813–823 (2025). https://doi.org/10.1038/s41592-025-02617-2
  8. Wang, IH., Murray, E., Andrews, G. et al. Spatial transcriptomic reconstruction of the mouse olfactory glomerular map suggests principles of odor processing. Nat Neurosci 25, 484–492 (2022). https://doi.org/10.1038/s41593-022-01030-8
  9. Xu, J., Guo, P., Hao, S. et al. A spatiotemporal atlas of mouse liver homeostasis and regeneration. Nat Genet 56, 953–969 (2024). https://doi.org/10.1038/s41588-024-01709-7
  10. Chen, A., Liao, S., Cheng, M. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185(10) (2022). https://doi.org/10.1016/j.cell.2022.04.003
  11. Maynard, K.R., Collado-Torres, L., Weber, L.M. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci 24, 425–436 (2021). https://doi.org/10.1038/s41593-020-00787-0
  12. Codeluppi, S., Borm, L.E., Zeisel, A. et al. Spatial organization of the somatosensory cortex revealed by osmFISH. Nat Methods 15, 932–935 (2018). https://doi.org/10.1038/s41592-018-0175-z
  13. Mantri, M., Scuderi, G.J., Abedini-Nassab, R. et al. Spatiotemporal single-cell RNA sequencing of developing chicken hearts identifies interplay between cellular differentiation and morphogenesis. Nat Commun 12, 1771 (2021). https://doi.org/10.1038/s41467-021-21892-z
  14. Oliveira, M. F., Romero, J. P., Chung, M. et al. Characterization of immune cell populations in the tumor microenvironment of colorectal cancer using high definition spatial profiling. bioRxiv (2024). https://doi.org/10.1101/2024.06.04.597233

Files

clusters.zip

Files (47.4 GB)

Checksum  Size
md5:2f622c0511948b85330da3222845a2c4  34.2 MB
md5:92f96782ba2c0fe022dd10e588ccf367  9.6 kB
md5:ce74d44106215b878d76a6439a0c1eba  751.4 MB
md5:2179f034e8716f67a9efc7e1689676f6  503.5 MB
md5:b39a51eef96ede6415c1fe7dc9d108b6  285.4 MB
md5:f62178631982405a416d0e12920d56fe  250.4 MB
md5:811aef68897f3f2e6778c5f11806f191  71.3 MB
md5:7085be117636a1cf91229b6d73a41a7c  11.4 MB
md5:25c95f9721b6e5a34b1814ff1100c1d9  179.5 kB
md5:133dd798a8765505354d9d6aa511ccd4  946.8 kB
md5:afe978e825eb8527fab30862951b37a8  1.1 GB
md5:a199e0252c280975577d3ab5b305ed63  312.6 MB
md5:389216de7871a49afd34f68324918269  1.4 GB
md5:f67379f9ea8b4ec3dfc0ab00500ffb87  36.0 GB
md5:f85c9c1eec8a5471649a120f18959bcc  72.7 MB
md5:70516b6199abd8a922be39accf21bbcb  50.3 MB
md5:58115fe14451f102687153148ea2a6f7  3.0 GB
md5:9bd9bb944e954bc111647c2fcf211bfd  1.0 GB
md5:21b120c5c2ff18aa368c92d89a61a828  2.3 GB
md5:7c21d028378b21544299805b4f7ed683  205.4 MB
md5:df53498bd052f620ce5e157caa62ecd5  45.6 MB
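The md5 checksums in the file listing can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names are illustrative (not part of the record or its repository), and `str.removeprefix` assumes Python 3.9 or later.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected):
    """Compare a file's MD5 with a listed checksum.

    `expected` may carry the `md5:` prefix used in the file listing.
    """
    return md5_of_file(path) == expected.lower().removeprefix("md5:")
```

A call such as `verify_download(local_path, listed_checksum)`, where both arguments are placeholders for a downloaded file and its entry in the listing, returns `True` when the digests match. Chunked reading matters here because the largest file in the record is 36.0 GB.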

Additional details

Funding

Federal Ministry of Education and Research
Dutch Research Council
BRAINSCAPES: A Roadmap from Neurogenetics to Neurobiology 34522
Swiss National Science Foundation
National Institute of Neurological Disorders and Stroke

Software

Repository URL
https://github.com/SpatialHackathon/SACCELERATOR
Programming language
Python, R
Development Status
Active