Published May 10, 2019 | Version 1.0.0
Journal article Open

Biological data sets for SMBA

  • 1. Parthenope University

Description

The following is a brief description of all data sets employed in the experiments.

1. The ALLAML data set contains 72 samples in 2 classes, ALL and AML, with 47 and 25 samples, respectively. Each sample contains 7,129 gene expression values.

2. The LEUKEMIA data set contains 72 samples in 2 classes: acute lymphoblastic and acute myeloid. Of the original 7,129 genes, the baseline genes were removed before further analysis, leaving 7,070 genes for the classification task.

3. The CLL_SUB_111 data set contains gene expression data from high-density oligonucleotide arrays covering genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). It consists of 111 instances, 11,340 attributes and 3 classes.

4. The GLIOMA data set contains 50 samples in 4 classes: cancer glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer oligodendrogliomas, with 14, 14, 7 and 15 samples, respectively. Each sample originally has 12,625 genes. After preprocessing, the data set was reduced to 50 samples and 4,433 genes.

5. The LUNG data set contains 203 samples in 5 classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6 and 17 samples, respectively. Genes with standard deviations smaller than 50 expression units were removed, yielding a data set with 203 samples and 3,312 genes.

6. The LUNG_DISCRETE data set contains 73 samples in 7 classes, where each sample consists of 325 gene expressions. The class cardinalities in the LUNG_DISCRETE data set are 6, 5, 5, 16, 7, 13 and 21, respectively.

7. The DLBCL data set is a modified version of the original DLBCL data set. It consists of 96 samples in 9 classes, where each sample is defined by the expression of 4,026 genes. The class cardinalities in the DLBCL data set are 46, 10, 9, 11, 6, 6, 4, 2 and 2, respectively.

8. The CARCINOM data set contains 174 samples in 11 classes: prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas and lung squamous cell carcinoma, with 26, 8, 26, 23, 12, 11, 7, 27, 6, 14 and 14 samples, respectively. After preprocessing, the data set was reduced to 174 samples and 9,182 genes.

9. The GCM data set contains 190 samples in 14 classes: breast, prostate, lung, colorectal, lymphoma, bladder, melanoma, uterus, leukemia, renal, pancreas, ovary, mesothelioma and central nervous system, where each sample consists of 16,063 gene expression signatures. The class cardinalities in the data set are 11, 11, 20, 11, 30, 11, 22, 10, 11, 11, 11, 10, 11 and 10, respectively.
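
The sample and class counts listed above can be cross-checked with a few lines of code. The following is a minimal sketch in Python that records the figures stated in the descriptions and verifies that the per-class cardinalities add up to the stated number of samples. The dictionary layout and the names `DATASETS` and `check_class_counts` are illustrative assumptions, not part of the deposited files. For GLIOMA, LUNG and CARCINOM the gene count is the one after preprocessing, and for LEUKEMIA and CLL_SUB_111 the class sizes are not given in the description, so they are left as `None`.

```python
# Figures copied from the data set descriptions; the structure itself is
# an illustrative assumption and does not mirror the deposited file format.
DATASETS = {
    "ALLAML": {"samples": 72, "genes": 7129, "class_counts": [47, 25]},
    "LEUKEMIA": {"samples": 72, "genes": 7070, "class_counts": None},
    "CLL_SUB_111": {"samples": 111, "genes": 11340, "class_counts": None},
    "GLIOMA": {"samples": 50, "genes": 4433, "class_counts": [14, 14, 7, 15]},
    "LUNG": {"samples": 203, "genes": 3312,
             "class_counts": [139, 21, 20, 6, 17]},
    "LUNG_DISCRETE": {"samples": 73, "genes": 325,
                      "class_counts": [6, 5, 5, 16, 7, 13, 21]},
    "DLBCL": {"samples": 96, "genes": 4026,
              "class_counts": [46, 10, 9, 11, 6, 6, 4, 2, 2]},
    "CARCINOM": {"samples": 174, "genes": 9182,
                 "class_counts": [26, 8, 26, 23, 12, 11, 7, 27, 6, 14, 14]},
    "GCM": {"samples": 190, "genes": 16063,
            "class_counts": [11, 11, 20, 11, 30, 11, 22,
                             10, 11, 11, 11, 10, 11, 10]},
}


def check_class_counts(datasets):
    """Return the names of data sets whose class sizes do not sum to the sample total."""
    mismatches = []
    for name, info in datasets.items():
        counts = info["class_counts"]
        # Skip data sets for which the description gives no class sizes.
        if counts is not None and sum(counts) != info["samples"]:
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    print(check_class_counts(DATASETS))
```

With the figures as stated in the descriptions, the check reports no mismatches.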

Files

Files (29.8 MB)

Checksum                              Size
md5:c725a199381adfe1c6367bf5196d6a49  3.6 MB
md5:620f7f2330a112703c82d72c59e1f6a9  6.9 MB
md5:9088ade7546d4c26961f102f3dd7dd3e  5.9 MB
md5:bc55e35bcc0a3edd1bf12c30ef25f282  6.8 MB
md5:933648d6e07a33cad6cabfd7ff14ef87  1.5 MB
md5:3504abad747b1c70749a290a58372241  154.7 kB
md5:65ac206d8c050198a9d17bad9f67ab02  4.8 MB
md5:c3c8e002f65c653d9b14b1408c9c231b  7.5 kB
md5:07f5f37a21453bb1a223c99ee0ee5df0  110.2 kB
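
The MD5 checksums above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of_file` and `matches_checksum` are illustrative assumptions, and since the file names are not shown in this listing, the path and the expected checksum are taken from the command line rather than hard-coded.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a checksum such as 'md5:c725...'."""
    # Accept the checksum with or without the 'md5:' prefix used in the listing.
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # Usage (hypothetical script name): python verify_md5.py <file> <checksum>
    file_path, checksum = sys.argv[1], sys.argv[2]
    print("OK" if matches_checksum(file_path, checksum) else "MISMATCH")
```

Reading in chunks keeps memory use small even for the larger files in the record.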