CimpleG DNAm benchmarking datasets for cell-type classification and deconvolution
Authors/Creators
- 1. Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen University Medical School, 52074 Aachen, Germany
- 2. Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, 52074 Aachen, Germany
Description
Two large DNAm benchmarking datasets specifically gathered and curated for cell-type classification and deconvolution problems.
The collection includes a leukocytes dataset and a somatic cells dataset, both stored as GenomicRatioSet objects from the minfi package.
These can be easily loaded into R with the readRDS function:
my_data <- readRDS("CimpleG_benchmarking_datasets_2/leukocytes/tidy_leuk_data.rds")
Each dataset also carries sample metadata, including GEO accession numbers, the sample name or ID used in the original dataset, the cell-type label, one-hot encoded columns for each cell type, and the preferred train/test splits, among other fields.
Alternatively, you can load the individual .csv files. For this option, I recommend the fread function from the data.table package. The .csv and .txt files of the leukocytes dataset are briefly described below; the same logic applies to the somatic cells dataset:
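Once the object is loaded, its contents can be inspected with the usual Bioconductor accessors. The following is a minimal sketch, assuming the minfi and SummarizedExperiment packages are installed; `summarise_methylation_set()` is a hypothetical helper name, and the sample metadata column names are printed rather than assumed because they are not listed here.

```r
# Sketch only: summarise_methylation_set() is a hypothetical helper, not part
# of the dataset or of minfi. It assumes minfi and SummarizedExperiment
# (both Bioconductor packages) are installed.
summarise_methylation_set <- function(grs) {
  # Sample-level annotation, one row per sample.
  sample_info <- SummarizedExperiment::colData(grs)
  # Beta values, with probes in rows and samples in columns.
  beta_values <- minfi::getBeta(grs)
  list(
    n_probes = nrow(beta_values),
    n_samples = ncol(beta_values),
    sample_columns = names(sample_info)
  )
}

# Path taken from the readRDS call above; adjust it to wherever the archive
# was extracted.
rds_path <- "CimpleG_benchmarking_datasets_2/leukocytes/tidy_leuk_data.rds"
if (file.exists(rds_path)) {
  my_data <- readRDS(rds_path)
  str(summarise_methylation_set(my_data))
}
```

The printed `sample_columns` entry is the place to look for the cell-type label, the one-hot encoded columns, and the train/test split columns.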
- tidy_leuk_data_beta-values.csv
  - Methylation Beta values matrix
- tidy_leuk_data_m-values.csv
  - Methylation M values matrix
- tidy_leuk_data_probe-annotation.txt
  - Note regarding probe annotation
- tidy_leuk_data_probe-metadata.csv
  - Probe metadata matrix (chr and location)
- tidy_leuk_data_sample-metadata.csv
  - Sample metadata matrix (sample ID, cell type labels, one-hot encoded labels, etc.)
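The .csv route can be sketched as follows, assuming the data.table package is installed and that the .csv files sit in the same directory as the .rds file (an assumption, not something stated here). `load_cimpleg_csv()` and its arguments are hypothetical names. By default it reads only the two metadata tables, since the Beta and M value matrices are likely to be much larger.

```r
# Sketch only: load_cimpleg_csv() is a hypothetical helper built on
# data.table::fread(). File names follow the pattern listed above:
# <prefix>_<table>.csv
load_cimpleg_csv <- function(data_dir,
                             prefix = "tidy_leuk_data",
                             tables = c("sample-metadata", "probe-metadata")) {
  paths <- file.path(data_dir, paste0(prefix, "_", tables, ".csv"))
  missing_paths <- paths[!file.exists(paths)]
  if (length(missing_paths) > 0) {
    stop("File(s) not found: ", paste(missing_paths, collapse = ", "))
  }
  # Return a named list with one data.table per requested file.
  stats::setNames(lapply(paths, data.table::fread), tables)
}

# Assumed location of the extracted leukocytes files; adjust as needed.
leuk_dir <- "CimpleG_benchmarking_datasets_2/leukocytes"
if (dir.exists(leuk_dir)) {
  leuk <- load_cimpleg_csv(leuk_dir)
  print(lapply(leuk, dim))
}
```

Passing `tables = "beta-values"` or `tables = "m-values"` reads the corresponding matrix. For the somatic cells dataset, set `prefix` to whatever file prefix that directory actually uses.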
Files
| Name | Size | MD5 |
|---|---|---|
| CimpleG_benchmarking_datasets_v2.zip | 5.8 GB | 9579e7da2dd2554b9b3b217590e6a548 |
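Given the size of the archive, it may be worth checking the download against the MD5 checksum listed above before extracting it. A minimal sketch using `tools::md5sum()`, which ships with R; `verify_md5()` is a hypothetical helper name, and the archive is assumed to be in the working directory.

```r
# Sketch only: verify_md5() is a hypothetical helper around tools::md5sum().
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  # Compare case-insensitively; a missing file yields NA and therefore FALSE.
  identical(tolower(actual), tolower(expected))
}

# Assumes the archive was downloaded into the current working directory.
zip_path <- "CimpleG_benchmarking_datasets_v2.zip"
if (file.exists(zip_path)) {
  print(verify_md5(zip_path, "9579e7da2dd2554b9b3b217590e6a548"))
}
```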
Additional details
Related works
- Is supplement to
- Software: 10.5281/zenodo.8045462 (DOI)
- Requires
- Software: 10.5281/zenodo.8045495 (DOI)