Published November 22, 2023 | Version v1

Dataset Open

Predicting Phenotypic Traits Using a Massive RNA-seq Dataset

Hadish, John (Data collector)¹

1. Washington State University

Abstract

The included datasets are a compilation of all Arabidopsis thaliana RNA-seq data available from NCBI as of November 2022, processed to count data. In addition, the associated annotation files from the NCBI BioProject database and processed versions of these data are included. Data have been processed according to the "Data Description Methods" in the manuscript titled "Predicting Phenotypic Traits Using a Massive RNA-seq Dataset" (in publication). The associated Methods can be found at this repository: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics. These datasets can be used for exploring machine learning methods for predicting both continuous (Age) and categorical (Tissue) phenotypic traits using gene expression. Additionally, the gene expression data can be used on its own for the investigation of gene expression in Arabidopsis thaliana.

Note to Researchers

This repository contains all of the datasets and information necessary to recreate the experiments in our paper. However, it may be that you are interested in our dataset for testing your own hypotheses or programs. If this is the case, we predict that you are looking for one or more of the following 5 datasets.

Note on File Compression

All files in this repository, except the two supplemental ".zip" archives, are compressed using bzip2 to conserve space and allow for easier file transfer. On Linux systems, decompress a file with `bzip2 -d FILE_NAME`. For other systems (Windows and macOS), please consult your user manual.
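For systems without the `bzip2` command, the same decompression can be sketched with Python's standard library. This is a minimal illustration, not part of the published code; the function name `decompress_bz2` is hypothetical. It streams the data in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest=None, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to disk without loading it into memory.

    `decompress_bz2` is a hypothetical helper name used for illustration.
    """
    src = Path(src)
    # Default output name: drop the trailing ".bz2" suffix,
    # e.g. "Dataset_32432_MRN.tsv.bz2" -> "Dataset_32432_MRN.tsv".
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=chunk_size)
    return dest
```

For example, `decompress_bz2("df_metadata_age.tsv.bz2")` would write `df_metadata_age.tsv` next to the archive. Unlike `bzip2 -d`, which removes the compressed file by default, this sketch leaves the original in place.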

Description of All Datasets:

Title: Gene Expression Count Data of all Arabidopsis thaliana data available from NCBI SRA as of November 2022
Abstract: Gene expression count data were created using the GEMmaker workflow. The resulting Gene Expression Matrix (GEM) was then normalized and thresholded. The following 4 files are normalizations of the same data: Trimmed Mean of M-values (TMM), Median Ratios Normalization (MRN), Transcripts Per Kilobase Million (TPM), and No Normalization (NoNo), respectively. Additionally, each file is included as both a tsv and a Python pickle. The tsv file is human readable, whereas the pickle file can be read into memory substantially faster. In the tsv format, each row represents a sample and each column represents a gene. NCBI_Nov2022_SRR_runinfo.csv is the starting file from NCBI, which reports SRR information for each sample. Note 1 to Researchers: MRN normalization performed the best in our experiments and is likely what you want to use if you are doing additional experimentation with this dataset. Otherwise, start with NoNo and perform your own normalizations. Note 2 to Researchers: the 54547 dataset will need to be thresholded prior to use. We include it in addition to the 32432 datasets in case you wish to try a different thresholding from the one outlined in our manuscript.
Author: John Anthony Hadish
Data Type: Gene Expression Count Data
Organism: Arabidopsis thaliana
Files:

NCBI_Nov2022_SRR_runinfo.csv - Arabidopsis RNA-seq SRA RunInfo Retrieved from NCBI November 2022. This is the unprocessed data.
Dataset_54547_NoFilter_raw.pkl - Raw File Before thresholding (".pkl" format). Same as NoNo normalization without thresholding.
Dataset_54547_NoFilter_raw.tsv - Raw File Before thresholding (".tsv" format). Same as NoNo normalization without thresholding.
Dataset_32432_MRN.pkl - MRN normalized (".pkl" format)
Dataset_32432_MRN.tsv - MRN normalized (".tsv" format)
Dataset_32432_NoNo.pkl - NoNo normalized (".pkl" format)
Dataset_32432_NoNo.tsv - NoNo normalized (".tsv" format)
Dataset_32432_TMM.pkl - TMM normalized (".pkl" format)
Dataset_32432_TMM.tsv - TMM normalized (".tsv" format)
Dataset_32432_TPM.pkl - TPM normalized (".pkl" format)
Dataset_32432_TPM.tsv - TPM normalized (".tsv" format)
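Because the tsv files are several gigabytes each, it can help to look at the first few rows before decompressing or loading a whole matrix. The sketch below reads directly from the bzip2-compressed file using only the standard library. The function name `peek_tsv_bz2` is hypothetical, and the presence of a header row of gene names is an assumption based on the "rows are samples, columns are genes" description, not something confirmed here.

```python
import bz2
import csv
from itertools import islice


def peek_tsv_bz2(path, n_rows=3, n_cols=5):
    """Return the first cells of the first line plus `n_rows` further lines.

    Assumption: the first line is a header of gene names and each later
    line is one sample. Only the start of the file is decompressed.
    """
    with bz2.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [row[:n_cols] for row in islice(reader, n_rows + 1)]
```

Calling `peek_tsv_bz2("Dataset_32432_MRN.tsv.bz2")` would return a small list of lists that shows how sample and gene identifiers are laid out, which is useful to check before choosing options for a full load.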

Title: Meta Data Arabidopsis Age and Tissue
Abstract: Metadata for Age and Tissue after processing. In our experiments these were used as the response variables predicted from gene expression. The columns shared by both files are "bio_sample", "bioproject_name", and "experiment". In addition to these processed datasets, NCBI_Nov2022_BioSample_data.tsv is the unprocessed starting material for these two data frames.
Author: John Anthony Hadish
Data Type: Metadata on phenotypes. ".tsv" format
Organism: Arabidopsis thaliana
Files:
NCBI_Nov2022_BioSample_data.tsv - Arabidopsis BioSample data retrieved from NCBI November 2022. This is the unprocessed data.
df_metadata_tissue.tsv - Tissue Annotations for 24876 samples
df_metadata_age.tsv - Age Annotations for 16078 samples. In addition to the shared columns, includes "days_age" (the age of the sample, converted to days) and "annotation_age" (how the age was reported for this sample in the raw data file, i.e. "days", "weeks", etc.)
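Since the age and tissue metadata share the "experiment" column, samples annotated for both traits can be found with a simple join. The following is a minimal sketch using only the standard library; `read_tsv` and `join_on` are hypothetical helper names, and it assumes that each "experiment" value appears at most once per file (if not, the last matching row from the second file is used).

```python
import bz2
import csv


def read_tsv(path):
    """Read a TSV file (optionally bzip2-compressed) into a list of dict rows."""
    opener = bz2.open if str(path).endswith(".bz2") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def join_on(left_rows, right_rows, key="experiment"):
    """Inner-join two lists of dict rows on a shared column.

    Assumption: `key` is unique within `right_rows`. For columns present
    in both inputs, the value from `left_rows` is kept.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**match, **row})
    return joined
```

A call such as `join_on(read_tsv("df_metadata_age.tsv.bz2"), read_tsv("df_metadata_tissue.tsv.bz2"))` would return one row per sample that has both an age and a tissue annotation.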

Title: Machine Learning Dataset for Arabidopsis thaliana Age
Abstract: The dataset used for machine learning on the phenotype Age, which is a combination of the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test
Organism: Arabidopsis thaliana
Files:
Dataset_Age_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Age_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Age_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Age_TrainTestSplits_tpm.pkl - TPM normalized

Title: Machine Learning Dataset for Arabidopsis thaliana Tissue
Abstract: The dataset used for machine learning on the phenotype Tissue, which is a combination of the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits. Saved as Python ".pkl" files.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations Combined, split into train and test. Saved as python ".pkl" files.
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Tissue_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Tissue_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Tissue_TrainTestSplits_tpm.pkl - TPM normalized
Dataset_Tissue_TrainTestSplits_mrn_4category.pkl - MRN for the tissue-4 dataset
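The TrainTestSplits files for Age and Tissue are described only as a list of 4, so it is safer to inspect the list than to assume an ordering. The sketch below loads a pickle (reading through bzip2 if needed) and reports the type and size of each element. `load_pickle` and `describe_splits` are hypothetical helper names. Two caveats apply: unpickling can execute arbitrary code, so only load files whose checksums you trust, and loading will likely require the same libraries that created the objects to be installed (which ones those are is not stated here).

```python
import bz2
import pickle


def load_pickle(path):
    """Load a pickle file, reading through bzip2 if the name ends in .bz2."""
    opener = bz2.open if str(path).endswith(".bz2") else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


def describe_splits(splits):
    """Summarize each element as (type name, shape or length).

    No ordering of the four elements is assumed; consult the code
    repository for the authoritative layout.
    """
    summary = []
    for part in splits:
        size = getattr(part, "shape", None)
        if size is None and hasattr(part, "__len__"):
            size = len(part)
        summary.append((type(part).__name__, size))
    return summary
```

Running `describe_splits(load_pickle("Dataset_Age_TrainTestSplits_mrn.pkl.bz2"))` should make it clear which elements hold expression values and which hold annotations, based on their sizes.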

Title: BioProject Names
Abstract: Three Column File With BioProject Name, BioSample Name, and Experiment Name
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
BioProject_Names_All.tsv

Title: Manuscript Supplemental Material
Abstract: Supplemental tables and figures described in the manuscript (included with manuscript and here for convenience). Please see manuscript for additional information.
Author: John Anthony Hadish
Data Type: ".tsv", ".png", ".pdf"
Organism: Arabidopsis thaliana
Files:
Supplemental_Figures.zip - Supplemental figures from the manuscript. Includes description of each figure.
Supplemental_Tables.zip - Supplemental tables from the manuscript. Includes description of each table.

Title: Splits of data for 3 experiments
Abstract: Two-column tsv files. The first column is the experiment (sample) name, and the second column indicates whether the sample is in the train or test data. Included here so that the splits remain reproducible in case the pickle files become unreadable in the future. Not used by the scripts; included to prevent potential future loss of data.
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_4category_namesOnly.tsv
Dataset_Tissue_TrainTestSplits_namesOnly.tsv
Dataset_Age_TrainTestSplits_namesOnly.tsv
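The names-only files make it possible to rebuild a split from the full expression matrices without the pickles. As a rough sketch, the function below groups sample names by whatever label appears in the second column. `read_split_names` is a hypothetical name, and the exact label strings (and whether the file has a header row) are not specified here, so the code does not hard-code them; a header row, if present, would simply show up as its own one-element group.

```python
import bz2
import csv
from collections import defaultdict


def read_split_names(path):
    """Group sample names (column 1) by split label (column 2)."""
    opener = bz2.open if str(path).endswith(".bz2") else open
    groups = defaultdict(list)
    with opener(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            name, label = row[0], row[1]
            groups[label].append(name)
    return dict(groups)
```

The resulting dictionary maps each label to a list of experiment names, which can then be used to select the matching rows from one of the Dataset_32432 files.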

Title: Git Code Repository
Abstract: A tar.bz2 archive of the git repository containing all of the code created for this manuscript. The same code is also available on GitLab at: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics
Author: John Anthony Hadish
Data Type: Git Repository, python code
Files:
modeling-with-transcriptomics-main.tar.bz2 - Compressed Git repository of all code used in paper.
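The code archive is a tar file compressed with bzip2, so it needs to be unpacked rather than simply decompressed. A minimal standard-library sketch follows; `list_archive` and `extract_archive` are hypothetical helper names. Listing the members first is a cheap way to see the top-level layout before writing anything to disk.

```python
import tarfile


def list_archive(path, limit=20):
    """Return the names of the first `limit` members of a .tar.bz2 archive."""
    with tarfile.open(path, "r:bz2") as archive:
        return [member.name for member in archive.getmembers()[:limit]]


def extract_archive(path, dest="."):
    """Unpack every member of a .tar.bz2 archive into `dest`."""
    with tarfile.open(path, "r:bz2") as archive:
        archive.extractall(dest)
```

For example, `extract_archive("modeling-with-transcriptomics-main.tar.bz2", "code")` would unpack the repository under a `code` directory. On the command line, `tar -xjf` does the same job.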

Files

Files (71.5 GB)

Name	MD5	Size
BioProject_Names_All.tsv.bz2	407ec35ca433b1a3d1ca3ba26ef6e47f	311.5 kB
Dataset_32432_MRN.pkl.bz2	be36f395adafab181170a114d0f62df2	8.4 GB
Dataset_32432_MRN.tsv.bz2	5b0c650780350839c064eb73b57d3658	7.3 GB
Dataset_32432_NoNo.pkl.bz2	b1ba4115e8177990fa0d280bf9eabcdc	3.1 GB
Dataset_32432_NoNo.tsv.bz2	3f5ef5127ab054f822bfab19589c2f48	3.1 GB
Dataset_32432_TMM.pkl.bz2	9ec8c60036771f2d263e5239370c791f	8.4 GB
Dataset_32432_TMM.tsv.bz2	1635ca747ff17a409e763116f9a37bd6	7.3 GB
Dataset_32432_TPM.pkl.bz2	c9c8bba86c1d98ef9cdbba6d8c03c380	4.7 GB
Dataset_32432_TPM.tsv.bz2	82eec7a36f277116ac34e6c9aa468d67	3.7 GB
Dataset_54547_NoFilter_raw.pkl.bz2	5a2f8aad213d1541cf73b32f93307b07	5.0 GB
Dataset_54547_NoFilter_raw.tsv.bz2	574bb44ca01918c95f0f2aaccfd92d30	707.1 MB
Dataset_Age_TrainTestSplits_mrn.pkl.bz2	680b8e3aa18b41dfb3f4566f29ff5d6b	1.7 GB
Dataset_Age_TrainTestSplits_namesOnly.tsv.bz2	cd054b067b52dada5e6a3114447b39e3	15.0 kB
Dataset_Age_TrainTestSplits_NoNo.pkl.bz2	6f7d49afc4760b029c0ed867ee944455	662.4 MB
Dataset_Age_TrainTestSplits_tmm.pkl.bz2	9117ad15d39cddcaf47d1ce5a096530b	1.7 GB
Dataset_Age_TrainTestSplits_tpm.pkl.bz2	38318b92669d22bdfeb5834eea251b93	960.9 MB
Dataset_Tissue_TrainTestSplits_4category_namesOnly.tsv.bz2	706e188cabd47da9698e7308104c9fb9	22.6 kB
Dataset_Tissue_TrainTestSplits_mrn.pkl.bz2	a4d6ffffa7617927dc1eef8eede31c82	4.2 GB
Dataset_Tissue_TrainTestSplits_mrn_4category.pkl.bz2	c9af8b151983dd0f426436810f73eba6	2.4 GB
Dataset_Tissue_TrainTestSplits_namesOnly.tsv.bz2	874ee8d710d65f05fec5dbfabb694364	40.9 kB
Dataset_Tissue_TrainTestSplits_NoNo.pkl.bz2	51801816a0dbd74ab077a4a42d2e8f69	1.6 GB
Dataset_Tissue_TrainTestSplits_tmm.pkl.bz2	b6c7edd71eda39ba9a5c230365474fad	4.2 GB
Dataset_Tissue_TrainTestSplits_tpm.pkl.bz2	a59942347a87f1c92567488ed0a5462d	2.4 GB
df_metadata_age.tsv.bz2	3ff2c35b6625a8235c436cec681d83c4	97.1 kB
df_metadata_tissue.tsv.bz2	b2f6a89db81bf6561831b368ae3618f1	142.0 kB
modeling-with-transcriptomics-main.tar.bz2	07a7aa2d48e5a6aad09ab95542fcc521	2.2 MB
NCBI_Nov2022_BioSample_data.tsv.bz2	b2505c9f0c1e737c191173ba4566820f	813.4 kB
NCBI_Nov2022_SRR_runinfo.csv.bz2	49347c9c09118bda9f880c7c995ba702	5.6 MB
Supplemental_Figures.zip	065f8047719923794a452a15fdb2453f	2.1 MB
Supplemental_Tables.zip	72eae7e5e5c197b873cd277940e6e951	784.9 kB
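The file listing gives an MD5 checksum for each download. With transfers of several gigabytes, checking the checksum before decompressing can catch a truncated or corrupted file early. A minimal standard-library sketch follows; `md5_of_file` and `verify_md5` are hypothetical helper names.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file matches `expected` (with or without 'md5:')."""
    return md5_of_file(path) == expected.strip().lower().removeprefix("md5:")
```

For instance, `verify_md5("BioProject_Names_All.tsv.bz2", "md5:407ec35ca433b1a3d1ca3ba26ef6e47f")` should return `True` for an intact download. The checksum applies to the compressed file as listed, not to its decompressed contents.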

Additional details

Washington Tree Fruit Research Commission
AP-22-101

Created: 2023-11 (files uploaded to Zenodo)

	All versions	This version
Views	193	193
Downloads	1,148	1,148
Data volume	2.9 TB	2.9 TB

DOI

Resource type

Dataset

Publisher

Zenodo

Published in

In Publication, 2023.

Thesis

Washington State University

Languages

English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: December 4, 2023
Modified: April 24, 2025
