Predicting Phenotypic Traits Using a Massive RNA-seq Dataset
Description
Abstract
The included datasets are a compilation of all Arabidopsis thaliana RNA-seq data available from NCBI as of November 2022, processed to count data. In addition, the associated annotation files from the NCBI BioProject database and processed versions of these data are included. Data have been processed according to the "Data Description Methods" in the manuscript titled "Predicting Phenotypic Traits Using a Massive RNA-seq Dataset" (in publication). The associated Methods can be found at this repository: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics. These datasets can be used for exploring machine learning methods for predicting both continuous (Age) and categorical (Tissue) phenotypic traits using gene expression. Additionally, the gene expression data can be used on its own for the investigation of gene expression in Arabidopsis thaliana.
Note to Researchers
This repository contains all of the datasets and information necessary to recreate the experiments in our paper. However, it may be that you are interested in our dataset for testing your own hypotheses or programs. If this is the case, you are likely looking for one or more of the following 5 datasets.
Note on File Compression
All files in this repository are compressed using bzip2 to conserve space and allow for easier file transfer. The decompression command on Linux systems is `bzip2 -d FILE_NAME`. For other computer systems (Windows and Apple) please consult your user manual.
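For systems without the `bzip2` command, the same decompression can be sketched with Python's standard library. This is a minimal sketch, not part of the original repository; the function name `decompress_bz2` and its parameters are hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest=None, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file and return the output path.

    Hypothetical helper: by default the output name is the input name
    with its final suffix (expected to be ".bz2") removed.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix("")
    if dest == src:
        raise ValueError("input has no suffix to strip; pass dest explicitly")
    # Copy in chunks so multi-gigabyte files never sit fully in memory.
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest
```

Unlike `bzip2 -d`, this sketch leaves the compressed file in place, so make sure there is enough disk space for both copies.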
Description of All Datasets:
Title: Gene Expression Count Data of all Arabidopsis thaliana data available from NCBI SRA as of November 2022
Abstract: Gene Expression Count data was created using the workflow GEMmaker. The resulting Gene Expression Matrix (GEM) was then normalized and thresholded. The following 4 files are normalizations of the same data for Trimmed Mean of M values (TMM), Median Ratios Normalization (MRN), Transcripts Per kilobase Million (TPM), and No Normalization (NoNo) respectively. Additionally, each file is included as a tsv and a python pickle. The tsv file is human readable, whereas the pickle file can be read into memory substantially faster. In the tsv format, each row represents a sample and each column represents a gene. NCBI_Nov2022_SRR_runinfo.csv is the starting file from NCBI, which reports SRR information for each sample.
Note 1 to Researchers: MRN normalization performed the best in our experiments and is likely what you want to use if you are doing additional experimentation with this dataset. Otherwise start with NoNo and perform your own normalizations.
Note 2 to Researchers: the 54547 dataset will need to be thresholded prior to use. We include it in addition to the 32432 datasets in case you wish to try a different thresholding from the one outlined in our manuscript.
Author: John Anthony Hadish
Data Type: Gene Expression Count Data
Organism: Arabidopsis thaliana
Files:
NCBI_Nov2022_SRR_runinfo.csv - Arabidopsis RNA-seq SRA RunInfo Retrieved from NCBI November 2022. This is the unprocessed data.
Dataset_54547_NoFilter_raw.pkl - Raw File Before thresholding (".pkl" format). Same as NoNo normalization without thresholding.
Dataset_54547_NoFilter_raw.tsv - Raw File Before thresholding (".tsv" format). Same as NoNo normalization without thresholding.
Dataset_32432_MRN.pkl - MRN normalized (".pkl" format)
Dataset_32432_MRN.tsv - MRN normalized (".tsv" format)
Dataset_32432_NoNo.pkl - NoNo normalized (".pkl" format)
Dataset_32432_NoNo.tsv - NoNo normalized (".tsv" format)
Dataset_32432_TMM.pkl - TMM normalized (".pkl" format)
Dataset_32432_TMM.tsv - TMM normalized (".tsv" format)
Dataset_32432_TPM.pkl - TPM normalized (".pkl" format)
Dataset_32432_TPM.tsv - TPM normalized (".tsv" format)
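As an illustration of reading either format, the following is a minimal sketch using pandas. It assumes that the ".pkl" files hold a pandas DataFrame and that the first column of the ".tsv" files holds the sample identifiers; neither detail is stated above, so check the loaded object before relying on it. The helper name `load_gem` is hypothetical.

```python
import pandas as pd


def load_gem(path):
    """Load a gene expression matrix (rows = samples, columns = genes).

    Hypothetical helper. Assumptions: ".pkl" files contain a pandas
    DataFrame; ".tsv" files keep sample identifiers in the first column.
    """
    path = str(path)
    if path.endswith(".pkl"):
        # Pickles can execute code on load: only open files you trust.
        return pd.read_pickle(path)
    return pd.read_csv(path, sep="\t", index_col=0)
```

For example, `gem = load_gem("Dataset_32432_MRN.pkl")` followed by `gem.shape` gives a quick check of the sample and gene counts after decompression.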
Title: Meta Data Arabidopsis Age and Tissue
Abstract: Meta Data for Age and Tissue after processing. In our experiment this was used as the response variable to gene expression. The columns shared by both files are "bio_sample", "bioproject_name", and "experiment". In addition to these processed datasets, NCBI_Nov2022_BioSample_data.tsv is the unprocessed starting material for these two data frames.
Author: John Anthony Hadish
Data Type: Metadata on phenotypes. ".tsv" format
Organism: Arabidopsis thaliana
Files:
NCBI_Nov2022_BioSample_data.tsv - Arabidopsis BioSample data retrieved from NCBI November 2022. This is the unprocessed data.
df_metadata_tissue.tsv - Tissue Annotations for 24876 samples
df_metadata_age.tsv - Age Annotations for 16078 samples. In addition to the shared columns, includes "days_age" (the age of the sample, converted to days) and "annotation_age" (the unit in which the age was reported for this sample in the raw data file, i.e. "days", "weeks" etc.)
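To pair these annotations with an expression matrix, one possible approach is sketched below. It assumes the row labels of the expression matrix correspond to the "experiment" column of the metadata; that correspondence is an assumption, not something stated above. The function name `attach_phenotype` is hypothetical.

```python
import pandas as pd


def attach_phenotype(gem, metadata, column, key="experiment"):
    """Return (X, y) for the samples present in both tables.

    Hypothetical helper. Assumption: gem.index holds the same
    identifiers as metadata[key].
    """
    # Keep one annotation row per identifier, then index by it.
    meta = metadata.drop_duplicates(subset=key).set_index(key)
    shared = gem.index.intersection(meta.index)
    # Selecting both tables with the same labels keeps X and y aligned.
    return gem.loc[shared], meta.loc[shared, column]
```

With the age metadata, `column="days_age"` would give a continuous response; the tissue column name is not listed above, so inspect `df_metadata_tissue.tsv` for it.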
Title: Machine Learning Dataset for Arabidopsis thaliana Age
Abstract: The dataset used for machine learning on the phenotype Age, which combines the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test
Organism: Arabidopsis thaliana
Files:
Dataset_Age_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Age_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Age_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Age_TrainTestSplits_tpm.pkl - TPM normalized
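Because the order of the 4 list elements is not documented here, a cautious first step is to load a file and inspect each element before assigning names to them. The sketch below does only that; `describe_splits` is a hypothetical helper, and loading will likely require the libraries the objects were created with (probably pandas or NumPy) to be installed.

```python
import pickle


def describe_splits(path):
    """Load a TrainTestSplits pickle and summarize each list element.

    Hypothetical helper. No assumption is made about which element is
    train or test, features or labels; it only reports type and size.
    """
    # Pickles can execute code on load: only open files you trust.
    with open(path, "rb") as handle:
        splits = pickle.load(handle)
    summary = []
    for position, part in enumerate(splits):
        shape = getattr(part, "shape", None)
        if shape is None and hasattr(part, "__len__"):
            shape = (len(part),)
        summary.append((position, type(part).__name__, shape))
    return splits, summary
```

Comparing the reported sizes against each other usually makes it clear which elements are feature matrices and which are response vectors; the code repository linked above is the authoritative reference for the order.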
Title: Machine Learning Dataset for Arabidopsis thaliana Tissue
Abstract: The dataset used for machine learning on the phenotype Tissue, which combines the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits. Saved as python ".pkl" files.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test. Saved as python ".pkl" files.
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Tissue_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Tissue_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Tissue_TrainTestSplits_tpm.pkl - TPM normalized
Dataset_Tissue_TrainTestSplits_mrn_4category.pkl - MRN for the tissue-4 dataset
Title: BioProject Names
Abstract: Three Column File With BioProject Name, BioSample Name, and Experiment Name
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
BioProject_Names_All.tsv
Title: Manuscript Supplemental Material
Abstract: Supplemental tables and figures described in the manuscript (included with manuscript and here for convenience). Please see manuscript for additional information.
Author: John Anthony Hadish
Data Type: ".tsv", ".png", ".pdf"
Organism: Arabidopsis thaliana
Files:
Supplemental_Figures.zip - Supplemental figures from the manuscript. Includes description of each figure.
Supplemental_Tables.zip - Supplemental tables from the manuscript. Includes description of each table.
Title: Splits of data for 3 experiments
Abstract: 2 column tsv files. The first column is the experiment (sample) name, and the second column indicates whether the sample is included in the train or test data. These files are not used by the scripts. They are included so that the splits stored in the pkl files remain reproducible if the python pickle format becomes unreadable in the future.
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_4category_namesOnly.tsv
Dataset_Tissue_TrainTestSplits_namesOnly.tsv
Dataset_Age_TrainTestSplits_namesOnly.tsv
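If the pickled splits ever become unreadable, the split can in principle be rebuilt from these name files and an expression matrix. The sketch below shows one way to do so under several assumptions that are not confirmed above: the files have no header row, the second column contains the literal labels "train" and "test", and the expression matrix is indexed by experiment name. The helper name `split_by_names` is hypothetical.

```python
import pandas as pd


def split_by_names(gem, names_path, has_header=False):
    """Split an expression matrix into (train, test) using a namesOnly tsv.

    Hypothetical helper. Assumptions: column 1 = experiment name,
    column 2 = "train" or "test", and gem.index holds experiment names.
    """
    names = pd.read_csv(
        names_path,
        sep="\t",
        header=0 if has_header else None,
        names=["experiment", "split"],
    )
    # Normalize the labels so "Train " and "train" are treated alike.
    label = names["split"].astype(str).str.strip().str.lower()
    train_ids = names.loc[label == "train", "experiment"]
    test_ids = names.loc[label == "test", "experiment"]
    return gem[gem.index.isin(train_ids)], gem[gem.index.isin(test_ids)]
```

Set `has_header=True` if the first line of the file turns out to be column names rather than a sample.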
Title: Git Code Repository
Abstract: A tar bz2 archive of the git repository containing all of the code created for this manuscript. The same code found in this file is also available on GitLab at the link: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics
Author: John Anthony Hadish
Data Type: Git Repository, python code
Files:
modeling-with-transcriptomics-main.tar.bz2 - Compressed Git repository of all code used in paper.
Files
(71.5 GB)
| MD5 checksum | Size |
|---|---|
| md5:407ec35ca433b1a3d1ca3ba26ef6e47f | 311.5 kB |
| md5:be36f395adafab181170a114d0f62df2 | 8.4 GB |
| md5:5b0c650780350839c064eb73b57d3658 | 7.3 GB |
| md5:b1ba4115e8177990fa0d280bf9eabcdc | 3.1 GB |
| md5:3f5ef5127ab054f822bfab19589c2f48 | 3.1 GB |
| md5:9ec8c60036771f2d263e5239370c791f | 8.4 GB |
| md5:1635ca747ff17a409e763116f9a37bd6 | 7.3 GB |
| md5:c9c8bba86c1d98ef9cdbba6d8c03c380 | 4.7 GB |
| md5:82eec7a36f277116ac34e6c9aa468d67 | 3.7 GB |
| md5:5a2f8aad213d1541cf73b32f93307b07 | 5.0 GB |
| md5:574bb44ca01918c95f0f2aaccfd92d30 | 707.1 MB |
| md5:680b8e3aa18b41dfb3f4566f29ff5d6b | 1.7 GB |
| md5:cd054b067b52dada5e6a3114447b39e3 | 15.0 kB |
| md5:6f7d49afc4760b029c0ed867ee944455 | 662.4 MB |
| md5:9117ad15d39cddcaf47d1ce5a096530b | 1.7 GB |
| md5:38318b92669d22bdfeb5834eea251b93 | 960.9 MB |
| md5:706e188cabd47da9698e7308104c9fb9 | 22.6 kB |
| md5:a4d6ffffa7617927dc1eef8eede31c82 | 4.2 GB |
| md5:c9af8b151983dd0f426436810f73eba6 | 2.4 GB |
| md5:874ee8d710d65f05fec5dbfabb694364 | 40.9 kB |
| md5:51801816a0dbd74ab077a4a42d2e8f69 | 1.6 GB |
| md5:b6c7edd71eda39ba9a5c230365474fad | 4.2 GB |
| md5:a59942347a87f1c92567488ed0a5462d | 2.4 GB |
| md5:3ff2c35b6625a8235c436cec681d83c4 | 97.1 kB |
| md5:b2f6a89db81bf6561831b368ae3618f1 | 142.0 kB |
| md5:07a7aa2d48e5a6aad09ab95542fcc521 | 2.2 MB |
| md5:b2505c9f0c1e737c191173ba4566820f | 813.4 kB |
| md5:49347c9c09118bda9f880c7c995ba702 | 5.6 MB |
| md5:065f8047719923794a452a15fdb2453f | 2.1 MB |
| md5:72eae7e5e5c197b873cd277940e6e951 | 784.9 kB |
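The md5 values listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the checksum is assumed to apply to the file exactly as downloaded (that is, still compressed).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a listed checksum such as "md5:407e...".

    Hypothetical helper: accepts the value with or without the "md5:" prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A result of `False` from `matches_checksum` suggests the file should be downloaded again before decompressing it.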
Additional details
Dates
- Created: 2023-11 (Files Uploaded to Zenodo)