Published March 15, 2021 | Version v1.0.0
Dataset Open

Single-Cell Gene Expression Profiles for Classification Problems

  • 1. University of Pavia

Description

This repository contains the three datasets used in [1] to introduce the Gene Mover's Distance, which is described below. Each dataset is exported in a plain-text format (.csv file), like other public datasets widely used in the Machine Learning community.

The three datasets are extracted from the Gene Expression Omnibus (GEO) database [2], where they appear with accession numbers GSE116256 (blood leukemia, [3]), GSE84133 (human pancreas, [4]), and GSE67835 (human brain, [5]), respectively. In GEO, each dataset is split across several files, which contain many more details than are reported in this version.

However, the simplified format proposed here should make it easier for other researchers to use this data.

The Gene Mover's Distance (GMD) is a measure of similarity between a pair of cells based on their gene expression profiles obtained via single-cell RNA sequencing. The underlying idea of GMD is to interpret the gene expression array of a single cell as a discrete probability measure. The distance between two cells is then computed by solving an Optimal Transport problem between the two corresponding discrete measures. GMD can be used, for instance, to solve two classification problems: classifying cells according to their condition and classifying them according to their type.
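The idea described above can be illustrated with a minimal sketch: each expression array is normalized to sum to one, and the transport problem between the two resulting measures is solved as a linear program. This is not the implementation from [1]. The function name `transport_distance`, the use of `scipy.optimize.linprog`, and the toy ground-cost matrix are all assumptions made for illustration; in particular, how the cost between two genes is defined is left to the caller.

```python
import numpy as np
from scipy.optimize import linprog


def transport_distance(expr_a, expr_b, cost):
    """Optimal Transport cost between two expression arrays (illustrative only).

    expr_a, expr_b: non-negative expression values over n and m genes.
    cost: (n, m) ground-cost matrix between genes (an assumption of this sketch).
    """
    a = np.asarray(expr_a, dtype=float)
    b = np.asarray(expr_b, dtype=float)
    # Interpret each expression array as a discrete probability measure.
    a = a / a.sum()
    b = b / b.sum()
    n, m = a.size, b.size

    # The transport plan is flattened row-major: entry (i, j) sits at i * m + j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # mass leaving gene i of cell A
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # mass arriving at gene j of cell B
    b_eq = np.concatenate([a, b])

    result = linprog(
        np.asarray(cost, dtype=float).ravel(),
        A_eq=A_eq,
        b_eq=b_eq,
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.fun


# Toy example with three hypothetical genes and cost |i - j| between genes.
toy_cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
distance = transport_distance([2.0, 0.0, 0.0], [0.0, 0.0, 5.0], toy_cost)
print(distance)
```

The dense linear program grows with the product of the two array lengths, so this formulation is only practical for small examples; a dedicated Optimal Transport solver would be needed for full expression profiles.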

The repository contains a Python script that reports basic statistics of the data.
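As a rough idea of what such a check might look like, the sketch below counts rows, columns, and class labels in a .csv file. It is not the script shipped in the repository. The function name `basic_stats`, the column name `label`, and the toy data are hypothetical, since the actual file names and column layout are not described here.

```python
import csv
import io
from collections import Counter


def basic_stats(csv_file, label_column):
    """Count rows, columns, and label frequencies in an open .csv file."""
    reader = csv.DictReader(csv_file)
    n_rows = 0
    label_counts = Counter()
    for row in reader:
        n_rows += 1
        label_counts[row[label_column]] += 1
    n_columns = len(reader.fieldnames or [])
    return {"rows": n_rows, "columns": n_columns, "label_counts": dict(label_counts)}


# Toy stand-in for a dataset file; the layout (one cell per row, one label
# column, one column per gene) is an assumption, not the documented format.
toy_file = io.StringIO("label,GENE_A,GENE_B\nalpha,3,0\nbeta,0,5\nalpha,1,2\n")
stats = basic_stats(toy_file, "label")
print(stats)
```

For a real file, the same function could be called on a handle opened with `open(path, newline="")`, after adjusting `label_column` to whatever the dataset actually uses.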

 

[1] Bellazzi, R., Codegoni, A., Gualandi, S., Nicora, G. and Vercesi, E., 2021. The Gene Mover's Distance: Single-cell similarity via Optimal Transport. arXiv:2102.01218, https://arxiv.org/abs/2102.01218

[2] Gene Expression Omnibus (GEO) database, http://www.ncbi.nlm.nih.gov/geo

[3] van Galen, P., Hovestadt, V., Wadsworth II, M.H., Hughes, T.K., Griffin, G.K., Battaglia, S., Verga, J.A., Stephansky, J., Pastika, T.J., Story, J.L. and Pinkus, G.S., 2019. Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Cell, 176(6), pp.1265-1281.

[4] Baron, M., Veres, A., Wolock, S.L., Faust, A.L., Gaujoux, R., Vetere, A., Ryu, J.H., Wagner, B.K., Shen-Orr, S.S., Klein, A.M. and Melton, D.A., 2016. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems, 3(4), pp.346-360.

[5] Darmanis, S., Sloan, S.A., Zhang, Y., Enge, M., Caneda, C., Shuer, L.M., Gephart, M.G.H., Barres, B.A. and Quake, S.R., 2015. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23), pp.7285-7290.

Files

gmd_v1.0.0.zip (78.7 MB)
md5:91b47965e75517ed653f139774ac2e0e
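
The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local path of the downloaded file is whatever the user chose.

```python
import hashlib

# Checksum listed for gmd_v1.0.0.zip on this page.
EXPECTED_MD5 = "91b47965e75517ed653f139774ac2e0e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's MD5 digest with the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `archive_is_intact("gmd_v1.0.0.zip")` from the download directory returns `True` when the file matches the listed checksum.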

Additional details

Related works

Is supplement to
http://arxiv.org/abs/2102.01218 (URL)