TagSeq for gene expression in non-model plants: a pilot study at the Santa Rita Experimental Range NEON core site

Marx, Hannah; Scheidt, Stephen; Barker, Michael; Dlugosch, Katrina

doi:10.5281/zenodo.3740232

Published April 4, 2020 | Version v1

Dataset Open

1. University of Michigan
2. NASA Goddard Space Flight Center
3. University of Arizona

TagSeq analysis scripts and assembled transcriptomes for four vascular plant species from the Santa Rita Experimental Range, AZ. Transcriptomes for each species were sequenced and assembled as described below. Additional details are available in the associated manuscript: MS LINK. Raw reads for each species are available at NCBI BioProject PRJNA599443.

Taxon selection and sampling

This study focused on four commonly occurring species at the Santa Rita Experimental Range Long Term Research and Core NEON site (SRER). These include the native species Tidestromia lanuginosa (Nutt.) Standl. (Amaranthaceae; ‘woolly tidestromia’), Parkinsonia florida (Benth. ex A. Gray) S. Watson (Fabaceae; ‘blue palo verde’), and Bouteloua aristidoides (Kunth) Griseb. (Poaceae; ‘needle grama’), as well as the introduced species Eragrostis lehmanniana Nees (Poaceae; ‘Lehmann lovegrass’; native to southern Africa). All species were identified using a combination of the historical flora of the Santa Rita Experimental Range (Medina, 2003), the Arizona Flora (Kearney et al., 1960), and the Flora of North America (Flora of North America Editorial Committee, eds. 1993). Vouchers were deposited in the University of Arizona herbarium (ARIZ). For each target species, tissue was collected from one mature, apparently healthy individual during the 2017 growing season. An entire stem was sampled for B. aristidoides (with flowers and fruits) and E. lehmanniana (without flowers or fruits). Only leaves and leaflets were sampled for P. florida and T. lanuginosa.

RNA extraction and RNA-seq

Total RNA was extracted from tissue using the Spectrum Plant Total RNA Kit (Sigma-Aldrich Co., St. Louis, MO, USA) following Protocol A. cDNA was prepared from RNA by single primer isothermal amplification using Nugen’s Ovation RNA-Seq System (Catalogue # 7102-A01), automated on the Apollo 324 liquid handler (Wafergen). cDNA was quantified on the NanoDrop (Thermo Fisher Scientific) and sheared to approximately 300 bp fragments using the Covaris M220 ultrasonicator. Libraries were generated using the Kapa Biosystems library preparation kit (KK8201). Fragments were end repaired and A-tailed, and individual indexes and adapters (Bioo, catalogue #520999) were ligated to each sample separately. The adapter-ligated molecules were cleaned using AMPure beads (Agencourt Bioscience/Beckman Coulter, A63883) and amplified with Kapa’s HIFI enzyme (KK2502). Each library was then analyzed for fragment size on an Agilent TapeStation and quantified by qPCR (KAPA Library Quantification Kit, KK4835) on Thermo Fisher Scientific’s QuantStudio 5 before multiplex pooling (13-16 samples per lane) and paired-end sequencing at 2x150 bp on the Illumina NextSeq 500 platform at Arizona State University’s CLAS Genomics Core facility. Raw read quality was assessed using FastQC (Andrews, 2010).

De novo transcriptome assembly

Raw sequence reads were processed using the SnoWhite pipeline (Barker et al., 2010a; Dlugosch et al., 2013), which included trimming adapter sequences and bases with a quality score below 20 from the 3' ends of all reads, removing reads that are entirely primer and/or adapter fragments using TagDust (Lassmann et al., 2009), and removing polyA/T tails with SeqClean (https://sourceforge.net/projects/seqclean/). All transcriptomes were assembled with SOAPdenovo-Trans v1.03 (Xie et al., 2014) using a k-mer of 57. Assembled sequences for each species are in the files ending ".scafSeq".
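The 3' quality-trimming step described above can be illustrated with a minimal Python sketch. This is not the SnoWhite implementation, whose trimming rules may differ (for example, by using a sliding window); the function name trim_3prime and the assumption of Phred+33 quality encoding are ours.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until a base with quality >= min_q is reached.

    Assumes Phred+33 encoded quality strings (an assumption, not stated
    in the dataset description).
    """
    end = len(seq)
    # Walk backwards from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTAC", "IIII##"))
```

Only the 3' end is trimmed here, so a low-quality base in the middle of a read is left in place.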

Protein Translations

We used TransPipe (Barker et al., 2010) to identify plant proteins within the assembled transcripts for each reference transcriptome and provide protein and in-frame nucleic acid sequences for each species. The reading frame and protein translation for each sequence were identified by comparison to protein sequences from 25 sequenced and annotated plant genomes from Phytozome (Goodstein et al., 2012). Using BLASTX (Wheeler et al., 2008), best hit proteins were paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that did not have a best hit protein at this level were removed. To determine the reading frame and generate estimated amino acid sequences, each gene was aligned against its best hit protein by Genewise 2.2.2 (Birney et al., 2004). Based on the highest scoring Genewise DNA-protein alignments, stop codons and codons containing 'N' were removed to produce estimated amino acid sequences for each gene. Output included paired DNA and protein sequences, with the DNA sequence reading frame corresponding to each protein sequence. Nucleic acid sequence files end in “.fna”, whereas amino acid sequence files end in “.faa”. Sequence numbers in these files correspond to the position of each sequence in the associated assembly file.
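As a rough illustration of two steps in this paragraph, the Python sketch below selects a best hit per gene from BLAST tabular output and removes stop and 'N'-containing codons from an in-frame sequence. It is not TransPipe code. It assumes the standard 12-column BLAST tabular layout, uses the percent-identity column as a stand-in for "sequence similarity", treats alignment length as the number of "sites", and ranks hits by bit score; all of these are assumptions, as are the function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def best_hits(lines, min_pident=30.0, min_length=150):
    """Return {query: subject} for the top-scoring hit that passes both cutoffs.

    Expects tab-separated rows in the standard 12-column BLAST tabular layout:
    query, subject, % identity, alignment length, ..., bit score (last column).
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        pident, length, bitscore = float(cols[2]), int(cols[3]), float(cols[11])
        if pident < min_pident or length < min_length:
            continue  # hit does not meet the cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def drop_stop_and_n_codons(in_frame_dna):
    """Remove stop codons and codons containing 'N' from an in-frame sequence."""
    dna = in_frame_dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(c for c in codons if c not in STOP_CODONS and "N" not in c)
```

Genes with no hit passing the cutoffs are simply absent from the returned dictionary, which mirrors the removal of genes without a qualifying best hit.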

Custom scripts

“removePCRdups57.pl” is a Perl script that takes an input FASTQ file and removes exact duplicates identified over a supplied length at the beginning (3’ end) of the read.

Run: perl removePCRdups57.pl <inputFASTQ> <length>
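The idea behind this script can be sketched in Python as follows. This is an illustration only, not a port of removePCRdups57.pl: it assumes four-line FASTQ records, keeps the first read seen for each distinct prefix, and compares whole reads when they are shorter than the supplied length. The Perl script may handle these cases differently, and the function names are hypothetical.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from 4-line FASTQ records."""
    lines = (line.rstrip("\n") for line in handle if line.strip())
    while True:
        record = tuple(islice(lines, 4))
        if len(record) < 4:
            return  # end of input (an incomplete trailing record is ignored)
        yield record


def remove_prefix_duplicates(records, length):
    """Keep only the first read for each distinct sequence prefix of `length` bases."""
    seen = set()
    for header, seq, plus, qual in records:
        key = seq[:length]
        if key in seen:
            continue  # same prefix already seen: treated as a PCR duplicate
        seen.add(key)
        yield header, seq, plus, qual


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python dedup_sketch.py reads.fastq 57 > deduplicated.fastq
    with open(sys.argv[1]) as fh:
        for rec in remove_prefix_duplicates(read_fastq(fh), int(sys.argv[2])):
            print("\n".join(rec))
```

Because every distinct prefix is held in memory, very large FASTQ files may need a different strategy, such as sorting reads first.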

“create_GTF.pl” is a Perl script that takes an input FASTA file and creates a GTF file suitable for input into HTSeq-count v.0.5.4 (Anders et al., 2015).

Run: perl create_GTF.pl <inputFASTA>
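One plausible way to build such a GTF file is to write a single feature per assembled sequence that spans its full length. The Python sketch below does this under our own assumptions: the feature type is "exon" and the identifier attribute is gene_id (the defaults that htseq-count looks for), the strand is "+", and the source label "assembly" is a placeholder. The actual fields written by create_GTF.pl are not documented here and may differ.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each FASTA record; name is the first header word."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def gtf_line(name, length, source="assembly"):
    """Build one GTF line: a single 'exon' covering the whole sequence (1-based, inclusive)."""
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    fields = [name, source, "exon", "1", str(length), ".", "+", ".", attributes]
    return "\t".join(fields)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python gtf_sketch.py assembly.fasta > assembly.gtf
    with open(sys.argv[1]) as fh:
        for seq_name, seq_len in fasta_lengths(fh):
            if seq_len > 0:
                print(gtf_line(seq_name, seq_len))
```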

“combine_HtSeq.pl” is a Perl script that takes a set of HTSeq output files and produces a tab-delimited table of counts, with a header row of sample names and a first column of row names. The input file list should be a text file that lists the HTSeq files to combine, with tab-delimited lines of the following form:

Run: perl combine_HtSeq.pl <inputFileList>
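The merging step itself can be sketched in Python as shown below. The sketch does not parse the input file list, because its exact line format is not given here; instead it takes a mapping from sample name to parsed counts. It assumes each htseq-count output is a two-column, tab-separated file of feature name and count, fills missing features with 0, keeps any summary rows as ordinary rows, and labels the first column "feature". These choices and the function names are ours, not necessarily those of combine_HtSeq.pl.

```python
def parse_counts(text):
    """Parse two-column, tab-separated count text into {feature: count}."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        feature, count = line.split("\t")[:2]
        counts[feature] = int(count)
    return counts


def combine_counts(samples):
    """Merge {sample_name: {feature: count}} into one tab-delimited table.

    The header row holds sample names; the first column holds feature names.
    Features missing from a sample are reported as 0.
    """
    names = list(samples)
    features = sorted(set().union(*(samples[name].keys() for name in names)))
    rows = ["\t".join(["feature"] + names)]
    for feature in features:
        values = [str(samples[name].get(feature, 0)) for name in names]
        rows.append("\t".join([feature] + values))
    return "\n".join(rows)


if __name__ == "__main__":
    demo = {
        "sample1": parse_counts("gene1\t5\ngene2\t0\n"),
        "sample2": parse_counts("gene1\t2\ngene3\t7\n"),
    }
    print(combine_counts(demo))
```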

Files

Files (659.3 MB)

Name	MD5	Size
combine_HtSeq.pl	6ea290344bfb7373bf8eecd7a64e7100	2.3 kB
create_GTF.pl	7012a2f020a0e4682417cd4eeb84ab05	2.3 kB
out.Bouteloua.kmer57.scafSeq	c9ab70e9b9766311abc5d58175bfc1dc	93.5 MB
out.Bouteloua.kmer57.scafSeq.faa	6ceadccfc0e66e07bea6628daa19e6da	4.6 MB
out.Bouteloua.kmer57.scafSeq.fna	2eec6fb8b18dc5562ffb29e586ebcaa1	13.2 MB
out.Eragrostis.kmer57.scafSeq	16bd01d77aa4a102d5b9ac4066027b76	104.0 MB
out.Eragrostis.kmer57.scafSeq.faa	affa168b540f895c103c0dc380bd4af6	3.0 MB
out.Eragrostis.kmer57.scafSeq.fna	dd372cb5758abe1569c1ba6a2835eeda	8.8 MB
out.Parkinsonia.kmer57.scafSeq	8d7bfd39bb407c2aa871fab46c734df2	101.8 MB
out.Parkinsonia.kmer57.scafSeq.faa	3f3bb3ec6fde99cb26aa71aaf8021857	6.3 MB
out.Parkinsonia.kmer57.scafSeq.fna	e1cb5c910ace09b538445335b1c4c5bf	18.3 MB
out.Tidestromia.kmer57.scafSeq	6816f4579da0508f15fcfc726d47b3a9	278.8 MB
out.Tidestromia.kmer57.scafSeq.faa	9a1f1171d726cabb9fff8761b22e6308	6.9 MB
out.Tidestromia.kmer57.scafSeq.fna	d23cb818876135ca1e215cb79fda77ff	20.1 MB
removePCRdups57.pl	7f772d6866c37307e02a945e36d36926	2.1 kB

Additional details

Cites: Dataset: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA599443 (URL)

U.S. National Science Foundation
EAGER-NEON: Genomic Plasticity in Response to Variable Environments 1550838

