Published November 22, 2023 | Version v1
Dataset Open

Predicting Phenotypic Traits Using a Massive RNA-seq Dataset

  • 1. Washington State University

Description

Abstract

The included datasets are a compilation of all Arabidopsis thaliana RNA-seq data available from NCBI as of November 2022, processed to count data. In addition, the associated annotation files from the NCBI BioProject database and processed versions of these annotations are included. The data were processed according to the "Data Description Methods" in the manuscript titled "Predicting Phenotypic Traits Using a Massive RNA-seq Dataset" (in publication). The associated methods can be found at this repository: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics. These datasets can be used for exploring machine learning methods for predicting both continuous (Age) and categorical (Tissue) phenotypic traits from gene expression. Additionally, the gene expression data can be used on their own to investigate gene expression in Arabidopsis thaliana.

Note to Researchers

This repository contains all of the datasets and information necessary to recreate the experiments in our paper. However, it may be that you are interested in our dataset for testing your own hypotheses or programs. If this is the case, we predict that you are looking for one or more of the following 5 datasets:
 

Note on File Compression

All files in this repository are compressed using bzip2 to conserve space and allow for easier file transfer. To decompress a file on Linux systems, run `bzip2 -d FILE_NAME`. For other operating systems (Windows and macOS), please consult your system documentation.
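
For systems without the `bzip2` command, the same decompression can be done from Python's standard library. The following is a minimal sketch, not part of the published code: the function name `decompress_bz2` is hypothetical, and it assumes the downloaded file name ends in ".bz2". Unlike `bzip2 -d`, it leaves the compressed file in place.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest=None, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file so multi-GB files never sit fully in memory."""
    src = Path(src)
    if dest is None:
        if src.suffix != ".bz2":
            raise ValueError("pass dest explicitly when src does not end in .bz2")
        # Default output name drops the trailing ".bz2", like `bzip2 -d` does.
        dest = src.with_suffix("")
    dest = Path(dest)
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest
```

For example, `decompress_bz2("FILE_NAME.bz2")` would write `FILE_NAME` next to the compressed file.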

Description of All Datasets:

Title: Gene Expression Count Data of all Arabidopsis thaliana data available from NCBI SRA as of November 2022
Abstract: Gene Expression Count data was created using the workflow GEMmaker. The resulting Gene Expression Matrix (GEM) was then normalized and thresholded. The files below provide four versions of the same data: Trimmed Mean of M values (TMM), Median Ratios Normalization (MRN), Transcripts Per kilobase Million (TPM), and No Normalization (NoNo). Each version is included as both a tsv and a Python pickle. The tsv file is human readable, whereas the pickle file can be read into memory substantially faster. In the tsv files, each row represents a sample and each column represents a gene. NCBI_Nov2022_SRR_runinfo.csv is the starting file from NCBI, which reports SRR information for each sample. Note 1 to Researchers: MRN normalization performed the best in our experiments and is likely what you want to use if you are doing additional experimentation with this dataset. Otherwise, start with NoNo and perform your own normalizations. Note 2 to Researchers: the 54547 dataset will need to be thresholded prior to use. We include it in addition to the 32432 datasets in case you wish to try a different thresholding from the one outlined in our manuscript.
Author: John Anthony Hadish
Data Type: Gene Expression Count Data
Organism: Arabidopsis thaliana
Files:

NCBI_Nov2022_SRR_runinfo.csv - Arabidopsis RNA-seq SRA RunInfo Retrieved from NCBI November 2022. This is the unprocessed data.
Dataset_54547_NoFilter_raw.pkl - Raw File Before thresholding (".pkl" format). Same as NoNo normalization without thresholding.
Dataset_54547_NoFilter_raw.tsv - Raw File Before thresholding (".tsv" format). Same as NoNo normalization without thresholding.
Dataset_32432_MRN.pkl - MRN normalized (".pkl" format)
Dataset_32432_MRN.tsv - MRN normalized (".tsv" format)
Dataset_32432_NoNo.pkl - NoNo normalized (".pkl" format)
Dataset_32432_NoNo.tsv - NoNo normalized (".tsv" format)
Dataset_32432_TMM.pkl - TMM normalized (".pkl" format)
Dataset_32432_TMM.tsv - TMM normalized (".tsv" format)
Dataset_32432_TPM.pkl - TPM normalized (".pkl" format)
Dataset_32432_TPM.tsv - TPM normalized (".tsv" format)
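
Because the tsv files are several gigabytes each, it can help to look at the first few rows before loading a whole matrix. The sketch below streams the file with the standard library. It is an illustration only: `peek_gem` is a hypothetical helper, and it assumes the first line is a header of gene identifiers and the first field of each later line is the sample name, which the description above implies but does not state explicitly.

```python
import csv
from itertools import islice


def peek_gem(path, n_rows=3, n_cols=5):
    """Return a small corner of a GEM tsv without reading the whole file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # assumed: one header line of column names
        # Read only the first n_rows data lines, keeping the first n_cols fields.
        preview = [row[:n_cols] for row in islice(reader, n_rows)]
    return header[:n_cols], preview, len(header)
```

The third return value is the number of fields in the header line, which gives a quick check on the matrix width after decompression.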


Title: Meta Data Arabidopsis Age and Tissue
Abstract: Meta Data for Age and Tissue after processing. In our experiments these were used as the response variables predicted from gene expression. The two processed files share the columns "bio_sample", "bioproject_name", and "experiment". In addition to these processed datasets, NCBI_Nov2022_BioSample_data.tsv is the unprocessed starting material for these two data frames.
Author: John Anthony Hadish
Data Type: Metadata on phenotypes. ".tsv" format 
Organism: Arabidopsis thaliana
Files: 
NCBI_Nov2022_BioSample_data.tsv - Arabidopsis BioSample data retrieved from NCBI November 2022. This is the unprocessed data.
df_metadata_tissue.tsv - Tissue Annotations for 24876 samples
df_metadata_age.tsv - Age Annotations for 16078 samples. In addition to the shared columns, includes "days_age" (the age of the sample converted to days) and "annotation_age" (how the age was reported for this sample in the raw data file, i.e. "days", "weeks", etc.)
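
Since the two processed metadata files share "bio_sample", "bioproject_name", and "experiment", samples annotated for both Age and Tissue can be found by joining on those columns. The following is a rough sketch using only the standard library. The helper names are hypothetical, and it assumes that each combination of the three shared columns appears at most once in the tissue file.

```python
import csv

# Shared columns named in the metadata description above.
SHARED = ("bio_sample", "bioproject_name", "experiment")


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def join_on_shared(age_rows, tissue_rows, keys=SHARED):
    """Inner-join two lists of row dicts on the shared key columns."""
    # Assumption: the key tuple is unique per row in tissue_rows.
    index = {tuple(row[k] for k in keys): row for row in tissue_rows}
    joined = []
    for row in age_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**match, **row})
    return joined
```

A call such as `join_on_shared(read_tsv("df_metadata_age.tsv"), read_tsv("df_metadata_tissue.tsv"))` would return only the samples present in both files.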


Title: Machine Learning Dataset for Arabidopsis thaliana Age
Abstract: The dataset used for machine learning on the phenotype Age, which combines the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 objects holding the train and test splits. Saved as Python ".pkl" files.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test 
Organism: Arabidopsis thaliana
Files:
Dataset_Age_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Age_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Age_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Age_TrainTestSplits_tpm.pkl - TPM normalized
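
The description above does not state the order of the 4 elements in each train/test pickle, so it is safer to inspect a file before unpacking it. The sketch below is one way to do that. `describe_splits` is a hypothetical helper, and it assumes the elements either expose a `.shape` attribute (as pandas and NumPy objects do) or support `len()`. If the pickled objects are pandas data frames, pandas must be installed for the load to succeed.

```python
import pickle


def describe_splits(path):
    """Load a train/test pickle and summarize each element without assuming its order."""
    with open(path, "rb") as handle:
        # Unpickling can execute code: only load files from a source you trust.
        splits = pickle.load(handle)
    summary = []
    for item in splits:
        # Prefer .shape when available; fall back to len() for plain sequences.
        size = getattr(item, "shape", None)
        if size is None:
            size = (len(item),)
        summary.append((type(item).__name__, tuple(size)))
    return summary
```

Comparing the reported sizes against the companion "namesOnly" tsv files is one way to work out which element is which.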


Title: Machine Learning Dataset for Arabidopsis thaliana Tissue
Abstract: The dataset used for machine learning on the phenotype Tissue, which combines the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 objects holding the train and test splits. Saved as Python ".pkl" files.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test. Saved as python ".pkl" files.
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Tissue_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Tissue_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Tissue_TrainTestSplits_tpm.pkl - TPM normalized
Dataset_Tissue_TrainTestSplits_mrn_4category.pkl - MRN for the tissue-4 dataset


Title: BioProject Names
Abstract: Three Column File With BioProject Name, BioSample Name, and Experiment Name
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
BioProject_Names_All.tsv


Title: Manuscript Supplemental Material
Abstract: Supplemental tables and figures described in the manuscript (included with manuscript and here for convenience). Please see manuscript for additional information.
Author: John Anthony Hadish
Data Type: ".tsv", ".png", ".pdf"
Organism: Arabidopsis thaliana
Files:
Supplemental_Figures.zip - Supplemental figures from the manuscript. Includes description of each figure.
Supplemental_Tables.zip - Supplemental tables from the manuscript. Includes description of each table.

 

Title: Splits of data for 3 experiments
Abstract: 2 column tsv files. The first column is the experiment (sample) name, and the second column indicates whether it is included in the train or test data. These files are not used by the scripts. They are included so that the train and test splits remain reproducible if the pickle files become unreadable in the future.
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_4category_namesOnly.tsv
Dataset_Tissue_TrainTestSplits_namesOnly.tsv
Dataset_Age_TrainTestSplits_namesOnly.tsv
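
These two-column files can be read back into train and test name lists with a few lines of Python. The sketch below is illustrative only: `read_split_names` is a hypothetical helper, and because the exact labels in the second column and the presence of a header row are not documented here, it simply groups names under whatever label each line carries.

```python
import csv
from collections import defaultdict


def read_split_names(path):
    """Group experiment names by the label found in the second column."""
    groups = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            name, label = row[0], row[1]
            # Normalize case and whitespace so "Train" and "train" land together.
            groups[label.strip().lower()].append(name)
    return dict(groups)
```

If a file does start with a header row, that row shows up as its own single-entry group and can be dropped.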

 

Title: Git Code Repository
Abstract: A bzip2-compressed tar archive of the git repository containing all of the code created for this manuscript. The same code is also available on GitLab at: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics
Author: John Anthony Hadish
Data Type: Git Repository, python code
Files:
modeling-with-transcriptomics-main.tar.bz2 - Compressed Git repository of all code used in paper.

Files

Files (71.5 GB)

Checksum Size
md5:407ec35ca433b1a3d1ca3ba26ef6e47f 311.5 kB
md5:be36f395adafab181170a114d0f62df2 8.4 GB
md5:5b0c650780350839c064eb73b57d3658 7.3 GB
md5:b1ba4115e8177990fa0d280bf9eabcdc 3.1 GB
md5:3f5ef5127ab054f822bfab19589c2f48 3.1 GB
md5:9ec8c60036771f2d263e5239370c791f 8.4 GB
md5:1635ca747ff17a409e763116f9a37bd6 7.3 GB
md5:c9c8bba86c1d98ef9cdbba6d8c03c380 4.7 GB
md5:82eec7a36f277116ac34e6c9aa468d67 3.7 GB
md5:5a2f8aad213d1541cf73b32f93307b07 5.0 GB
md5:574bb44ca01918c95f0f2aaccfd92d30 707.1 MB
md5:680b8e3aa18b41dfb3f4566f29ff5d6b 1.7 GB
md5:cd054b067b52dada5e6a3114447b39e3 15.0 kB
md5:6f7d49afc4760b029c0ed867ee944455 662.4 MB
md5:9117ad15d39cddcaf47d1ce5a096530b 1.7 GB
md5:38318b92669d22bdfeb5834eea251b93 960.9 MB
md5:706e188cabd47da9698e7308104c9fb9 22.6 kB
md5:a4d6ffffa7617927dc1eef8eede31c82 4.2 GB
md5:c9af8b151983dd0f426436810f73eba6 2.4 GB
md5:874ee8d710d65f05fec5dbfabb694364 40.9 kB
md5:51801816a0dbd74ab077a4a42d2e8f69 1.6 GB
md5:b6c7edd71eda39ba9a5c230365474fad 4.2 GB
md5:a59942347a87f1c92567488ed0a5462d 2.4 GB
md5:3ff2c35b6625a8235c436cec681d83c4 97.1 kB
md5:b2f6a89db81bf6561831b368ae3618f1 142.0 kB
md5:07a7aa2d48e5a6aad09ab95542fcc521 2.2 MB
md5:b2505c9f0c1e737c191173ba4566820f 813.4 kB
md5:49347c9c09118bda9f880c7c995ba702 5.6 MB
md5:065f8047719923794a452a15fdb2453f 2.1 MB
md5:72eae7e5e5c197b873cd277940e6e951 784.9 kB

Additional details

Funding

Washington Tree Fruit Research Commission
AP-22-101

Dates

Created
2023-11
Files Uploaded to Zenodo