GEO gene expression dataset recompute for selected tumor samples

doi:10.5281/zenodo.10893923

Published March 29, 2024 | Version v4

Dataset Open

Visentin, Luca¹

1. University of Turin

Project member:

Ruffinatti, Federico Alessandro¹

1. University of Turin

We aligned and quantified RNA-Seq data present in GEO with a standardized pipeline to homogenize data preprocessing for downstream applications.

All uploaded files are gzip-compressed, UTF-8 encoded `.csv` tables. The `*_expected_count.csv.gz` files hold raw (not log-transformed) expected counts as reported by `rsem-calculate-expression` (see details below). The associated `*_metadata.csv.gz` files contain metadata pertinent to each column of the corresponding expression matrix.
Some metadata files may have more rows than the associated expression matrix has sample columns. This happens for series that were only partially RNA-Seq based (e.g. combined RNA-Seq and miRNA-Seq samples under the same GEO accession ID).
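As an illustration of the file format described above, the following is a minimal sketch for reading one of the downloaded files with the Python standard library only. The function names `load_table` and `to_floats` are hypothetical helpers, not part of the dataset or of any pipeline, and the sketch assumes a comma-separated file with a single header row whose first column is an identifier. RSEM expected counts can be fractional, so the cells are parsed as floats rather than integers.

```python
import csv
import gzip


def load_table(path):
    """Read a gzipped, UTF-8 CSV file and return (header, rows)."""
    # Mode "rt" decompresses and decodes in one step; newline="" is what
    # the csv module expects from the file object it reads.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows


def to_floats(row, first_data_column=1):
    """Convert the count cells of one matrix row to floats.

    Assumption: the first column holds an identifier, not a count.
    """
    return [float(cell) for cell in row[first_data_column:]]
```

For example, `header, rows = load_table("GSE121842_expected_count.csv.gz")` returns the column names and the raw rows, and `to_floats(rows[0])` yields the counts of the first gene across all samples.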

Metadata columns are derived from GEO series files, and follow their definitions. See each GEO entry directly to determine metadata meaning.

Each expression matrix has at least the `gene_id` column, which holds Ensembl Gene IDs. Every other column is named after the ENA run accession ID of the recomputed sample it contains.
Each associated metadata file has at least the following columns:
- `geo_accession`: The GEO sample ID of the sample.
- `sample_accession`: The ENA sample ID of the sample.
- `run_accession`: The ENA run accession ID of the sample, to be cross-referenced with the expression matrices.
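The cross-referencing described above can be sketched as follows. This is an illustrative helper under stated assumptions, not code from the original pipeline: `match_metadata` is a hypothetical name, the matrix header is assumed to be a list of column names, and the metadata is assumed to be a list of dictionaries keyed by column name (for example, as produced by `csv.DictReader`).

```python
def match_metadata(matrix_header, metadata_rows):
    """Pair each sample column of an expression matrix with its metadata row.

    matrix_header: column names of the matrix, including `gene_id`.
    metadata_rows: one dict per metadata row, keyed by column name.
    Returns (matched, unmatched_runs, extra_rows).
    """
    # Every column except `gene_id` is an ENA run accession ID.
    run_ids = [name for name in matrix_header if name != "gene_id"]
    by_run = {row["run_accession"]: row for row in metadata_rows}

    # Metadata rows in the same order as the matrix columns.
    matched = [by_run[run] for run in run_ids if run in by_run]
    # Matrix columns with no metadata row (not expected, but worth checking).
    unmatched_runs = [run for run in run_ids if run not in by_run]
    # Metadata rows with no matrix column, e.g. non-RNA-Seq samples.
    run_set = set(run_ids)
    extra_rows = [
        row for row in metadata_rows if row["run_accession"] not in run_set
    ]
    return matched, unmatched_runs, extra_rows
```

The third return value collects the metadata rows that have no matching column, which is where samples from partially RNA-Seq based series would end up.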

## Pipeline Details

Alignment and quantification were performed with the `x.FASTQ` tool, available [on GitHub](https://github.com/TCP-Lab/x.FASTQ), installed locally on an Arch Linux machine running the Linux `6.7.8-zen1-1-zen` kernel with an `11th Gen Intel i7-1185G7 (8)` CPU and an `Intel TigerLake-LP GT2 [Iris Xe Graphics]` GPU.

Files

Files (9.1 MB)

| Name | MD5 | Size |
|---|---|---|
| GSE121842_expected_count.csv.gz | 7a7e362b6874baae7dfdcec425614a61 | 270.0 kB |
| GSE121842_metadata.csv.gz | 61a48d9699d05c330c1aadf4fbeada0d | 1.8 kB |
| GSE159857_expected_count.csv.gz | 962b514222f894665c5aa00a23aa0b34 | 2.3 MB |
| GSE159857_metadata.csv.gz | 6f97560e55b6add6cbc8487e098817ce | 4.2 kB |
| GSE22260_expected_count.csv.gz | f668552471a99fcb563023adac4bc24e | 1.7 MB |
| GSE22260_metadata.csv.gz | bbda66b06c198df1f6f336fa18b4f197 | 2.4 kB |
| GSE29580_expected_count.csv.gz | bb6edde4350c8e9005610a074c0e7aab | 414.3 kB |
| GSE29580_metadata.csv.gz | 65d6762bc4531660488e9044ab47054d | 1.2 kB |
| GSE60052_expected_count.csv.gz | 6a4bf3c125f40b5b3af7b26272df5fe6 | 4.3 MB |
| GSE60052_metadata.csv.gz | caa6126f2c5fa104135bde24a712f7c1 | 3.3 kB |
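The MD5 checksums in the file listing can be used to confirm that a download is intact. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a local file against a checksum from the file listing.

    Accepts either a bare hex digest or the "md5:<hex>" form.
    """
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("GSE29580_metadata.csv.gz", "65d6762bc4531660488e9044ab47054d")` should return `True` for an uncorrupted copy of that file.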

Additional details

Is derived from: Dataset: GSE22260 (Other); Dataset: GSE29580 (Other); Dataset: GSE121842 (Other); Dataset: GSE159857 (Other); Dataset: GSE60052 (Other)

