Published June 16, 2023 | Version v2
Dataset Open

CimpleG DNAm benchmarking datasets for cell-type classification and deconvolution

  • 1. Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen University Medical School, 52074 Aachen, Germany
  • 2. Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, 52074 Aachen, Germany

Description

This record contains two large DNA methylation (DNAm) benchmarking datasets, gathered and curated specifically for cell-type classification and deconvolution problems.

It includes a leukocytes dataset and a somatic cells dataset, each stored as a GenomicRatioSet object (the format defined by the minfi package).

These can be easily loaded into R with the readRDS function:

my_data <- readRDS("CimpleG_benchmarking_datasets_2/leukocytes/tidy_leuk_data.rds")
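Once loaded, the object can be inspected with the usual Bioconductor accessors. The following is only a minimal sketch, assuming the minfi and SummarizedExperiment packages are installed and that the archive has been extracted so that the path resolves. The helper name summarise_grs is hypothetical and not part of the dataset or of CimpleG, and no particular column names in the sample table are assumed.

```r
# Sketch: summarise a GenomicRatioSet stored in one of the .rds files.
# Assumes minfi and SummarizedExperiment (Bioconductor) are installed.
# `summarise_grs` is a hypothetical helper name used for illustration.
summarise_grs <- function(rds_path) {
  if (!file.exists(rds_path)) {
    stop("File not found: ", rds_path)
  }
  grs <- readRDS(rds_path)

  # Probes x samples matrix of methylation Beta values
  beta <- minfi::getBeta(grs)
  # Per-sample annotation table stored alongside the methylation data
  samples <- SummarizedExperiment::colData(grs)

  list(
    n_probes = nrow(beta),
    n_samples = ncol(beta),
    sample_columns = colnames(samples)
  )
}
```

For example, summarise_grs("CimpleG_benchmarking_datasets_2/leukocytes/tidy_leuk_data.rds") would return the matrix dimensions and the names of the available sample annotation columns.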

Each dataset also contains sample-level data, including GEO accession numbers, the sample name or ID used in the original dataset, the cell-type label, one-hot encoded labels for each cell type, preferred train/test splits, and others.

Alternatively, you can load the individual .csv files. If you choose this option, I recommend using the fread function from the data.table package. Below I briefly describe these files (.csv and .txt) for the leukocytes dataset; the same logic applies to the somatic cells dataset:

  • tidy_leuk_data_beta-values.csv
    • Methylation Beta values matrix
  • tidy_leuk_data_m-values.csv
    • Methylation M values matrix
  • tidy_leuk_data_probe-annotation.txt
    • Note regarding probe annotation
  • tidy_leuk_data_probe-metadata.csv
    • Probe metadata matrix (chr and location)
  • tidy_leuk_data_sample-metadata.csv
    • Sample metadata matrix (sample ID, cell type labels, one-hot encoded labels, etc.)
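Loading the .csv files listed above with fread could look like the sketch below. It assumes the data.table package is installed and that data_dir points to the folder containing the leukocytes files. The helper name load_leuk_csv is hypothetical, and the internal layout of each file (for example, whether the first column holds probe IDs) is not assumed here and should be checked after loading.

```r
# Sketch: read the leukocytes .csv files with data.table::fread.
# Assumes data.table is installed; `load_leuk_csv` is a hypothetical helper.
load_leuk_csv <- function(data_dir) {
  files <- c(
    beta            = "tidy_leuk_data_beta-values.csv",
    probe_metadata  = "tidy_leuk_data_probe-metadata.csv",
    sample_metadata = "tidy_leuk_data_sample-metadata.csv"
  )
  paths <- file.path(data_dir, files)
  names(paths) <- names(files)

  # Fail early with a clear message if any expected file is absent
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Missing file(s): ", paste(not_found, collapse = ", "))
  }

  # Returns a named list of data.tables, one per file
  lapply(paths, data.table::fread)
}
```

The M values file is left out of this sketch to limit memory use; it can be read the same way by adding "tidy_leuk_data_m-values.csv" to the files vector.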

Files

CimpleG_benchmarking_datasets_v2.zip

Files (5.8 GB)

md5:9579e7da2dd2554b9b3b217590e6a548
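
Given the size of the archive, it may be worth checking the download against the MD5 checksum listed above before extracting it. A minimal sketch in base R, where verify_md5 is a hypothetical helper name and the archive is assumed to sit in the current working directory:

```r
# Sketch: compare the MD5 checksum of a downloaded file with an expected value.
# Uses tools::md5sum from base R; `verify_md5` is a hypothetical helper name.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  # md5sum returns NA for files that cannot be read
  !is.na(actual) && identical(tolower(actual), tolower(expected))
}
```

For this record the call would be verify_md5("CimpleG_benchmarking_datasets_v2.zip", "9579e7da2dd2554b9b3b217590e6a548"), which returns TRUE only when the checksums match.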

Additional details

Related works

Is supplement to
Software: 10.5281/zenodo.8045462 (DOI)
Requires
Software: 10.5281/zenodo.8045495 (DOI)