Published April 4, 2020 | Version v1
Dataset Open

TagSeq for gene expression in non-model plants: a pilot study at the Santa Rita Experimental Range NEON core site

  • 1. University of Michigan
  • 2. NASA Goddard Space Flight Center
  • 3. University of Arizona


TagSeq analysis scripts and assembled transcriptomes for four vascular plant species from the Santa Rita Experimental Range, AZ. Transcriptomes for each species were sequenced and assembled as described below. Additional details are available in the associated manuscript: MS LINK. Raw reads for each species are available at NCBI BioProject #PRJNA599443.


Taxon selection and sampling 

This study focused on four commonly occurring species at the Santa Rita Experimental Range Long Term Research and Core NEON site (SRER). These include the native species Tidestromia lanuginosa (Nutt.) Standl. (Amaranthaceae; ‘woolly tidestromia’), Parkinsonia florida (Benth. ex A. Gray) S. Watson (Fabaceae; ‘blue palo verde’), and Bouteloua aristidoides (Kunth) Griseb. (Poaceae; ‘needle grama’), as well as the introduced species Eragrostis lehmanniana Nees (Poaceae; ‘Lehmann lovegrass’; native to southern Africa). All species were identified using a combination of the historical flora of the Santa Rita Experimental Range (Medina, 2003), the Arizona Flora (Kearney et al., 1960), and the Flora of North America (Flora of North America Editorial Committee, eds. 1993). Vouchers were deposited in the University of Arizona herbarium (ARIZ). Tissue was collected from one mature, apparently healthy individual of each target species during the 2017 growing season. An entire stem was sampled for B. aristidoides (with flowers and fruits) and E. lehmanniana (without flowers or fruits). Only leaves and leaflets were sampled for P. florida and T. lanuginosa.


RNA extraction and RNA-seq

Total RNA was extracted from tissue using the Spectrum Plant Total RNA Kit (Sigma-Aldrich Co., St. Louis, MO, USA) following Protocol A. RNA was used to prepare cDNA using Nugen’s Ovation RNA-Seq System via single primer isothermal amplification (Catalogue # 7102-A01), automated on the Apollo 324 liquid handler (Wafergen). cDNA was quantified on the Nanodrop (Thermo Fisher Scientific) and sheared to approximately 300 bp fragments using the Covaris M220 ultrasonicator. Libraries were generated using Kapa Biosystem’s library preparation kit (KK8201). Fragments were end repaired and A-tailed, and individual indexes and adapters (Bioo, catalogue #520999) were ligated to each sample. The adapter-ligated molecules were cleaned using AMPure beads (Agencourt Bioscience/Beckman Coulter, A63883) and amplified with Kapa’s HIFI enzyme (KK2502). Each library was then analyzed for fragment size on an Agilent Tapestation and quantified by qPCR (KAPA Library Quantification Kit, KK4835) on Thermo Fisher Scientific’s Quantstudio 5 before multiplex pooling (13-16 samples per lane) and paired-end sequencing at 2x150 bp on the Illumina NextSeq500 platform at Arizona State University’s CLAS Genomics Core facility. Raw read quality was assessed using fastQC (Andrews, 2010).


De novo transcriptome assembly

Raw sequence reads were processed using the SnoWhite pipeline (Barker et al., 2010a; Dlugosch et al., 2013), which included trimming adapter sequences and bases with a quality score below 20 from the 3' ends of all reads, removing reads that are entirely primer and/or adapter fragments using TagDust (Lassmann et al., 2009), and removing polyA/T tails with SeqClean. All transcriptomes were assembled with SOAPdenovo-Trans v1.03 (Xie et al., 2014) using a k-mer size of 57. Assembled sequences for each species are in the files ending ".scafSeq".
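
The 3' quality-trimming step described above can be illustrated with a minimal Python sketch. This is not SnoWhite's implementation, whose exact trimming rule may differ. It assumes Phred+33 quality encoding, and the function name trim_3prime is hypothetical.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until one meets the quality cutoff.

    Assumes Phred+33 encoded quality strings (an assumption; not stated
    in the record). Returns the trimmed sequence and quality string.
    """
    end = len(seq)
    # Walk back from the 3' end while the last base is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

A read whose bases are all below the cutoff is trimmed to an empty string, so a downstream step would still need to discard very short reads.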


Protein Translations

We used TransPipe (Barker et al., 2010) to identify plant proteins within the assembled transcripts for each reference transcriptome and provide protein and in-frame nucleic acid sequences for each species. The reading frame and protein translation for each sequence were identified by comparison to protein sequences from 25 sequenced and annotated plant genomes from Phytozome (Goodstein et al., 2012). Using BLASTX (Wheeler et al., 2008), best hit proteins were paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that did not have a best hit protein at this level were removed. To determine the reading frame and generate estimated amino acid sequences, each gene was aligned against its best hit protein by Genewise 2.2.2 (Birney et al., 2004). Based on the highest scoring Genewise DNA-protein alignments, stop and 'N' containing codons were removed to produce estimated amino acid sequences for each gene. Output included paired DNA and protein sequences with the DNA sequence reading frame corresponding to each protein sequence. Nucleic acid sequence files end in “.fna”, whereas amino acid sequence files end in “.faa”. Sequence numbers in each of these files correspond to the position of the sequence in the associated assembly file.
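
As a rough illustration of the filtering logic described above, the following Python sketch selects a best hit per gene from BLAST tabular output and strips stop and 'N' containing codons from an in-frame sequence. It is not TransPipe's code. It assumes the standard 12-column tabular BLAST format, treats percent identity as a stand-in for "sequence similarity" and alignment length as "sites", and picks the best hit by bit score; all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def best_hits(lines, min_pct=30.0, min_len=150):
    """Return {query: subject} for the top-scoring hit passing both cutoffs.

    Assumes 12-column tabular BLAST lines: query, subject, percent
    identity, alignment length, ..., bit score (last column).
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pct, length, bits = float(fields[2]), int(fields[3]), float(fields[11])
        if pct < min_pct or length < min_len:
            continue  # hit fails the similarity or length cutoff
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}

def drop_stop_and_n_codons(cds):
    """Remove stop codons and codons containing 'N' from an in-frame sequence.

    Any trailing partial codon (1 or 2 bases) is discarded.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return "".join(c for c in codons if c not in STOP_CODONS and "N" not in c)
```

Genes absent from the dictionary returned by best_hits correspond to those that would be removed for lacking a qualifying best hit.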


Custom scripts

“” is a Perl script that takes an input FASTQ file and removes exact duplicates identified over a supplied length at the beginning (3’ end) of the read. 

Run: perl <inputFASTQ> <length>
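
The deduplication idea can be sketched in Python as follows. This is an illustration, not a port of the Perl script. It assumes four-line FASTQ records and that the first read seen for each prefix is the one kept; the function names are hypothetical.

```python
def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        yield header, seq, plus, qual

def dedup_by_prefix(records, length):
    """Keep only the first record for each distinct sequence prefix.

    Two reads are treated as duplicates when their first `length` bases
    match exactly (assumed behavior; the original script may differ).
    """
    seen = set()
    for header, seq, plus, qual in records:
        key = seq[:length]
        if key in seen:
            continue
        seen.add(key)
        yield header, seq, plus, qual
```

Because the set of seen prefixes is held in memory, this approach scales with the number of distinct prefixes rather than the number of reads.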


“” is a Perl script that takes an input FASTA file and creates a GTF file suitable for input into HtSeq-count v.0.5.4 (Anders et al., 2015).

Run: perl <inputFASTA>
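
One plausible way to build such a GTF file is to emit a single feature spanning each sequence, as in the Python sketch below. The details are assumptions rather than a description of the Perl script: each sequence becomes one "exon" feature on the "+" strand with a gene_id attribute taken from the FASTA header (the feature type and attribute that htseq-count uses by default), and the source label "transcriptome" is arbitrary. The function names are hypothetical.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each FASTA record; name is the first header word."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumes a non-empty header after '>'.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

def gtf_lines(lines, source="transcriptome"):
    """Yield one nine-column GTF line per sequence, covering its full length."""
    for name, length in fasta_lengths(lines):
        attrs = f'gene_id "{name}"; transcript_id "{name}";'
        yield "\t".join(
            [name, source, "exon", "1", str(length), ".", "+", ".", attrs]
        )
```

Treating each assembled sequence as its own reference with one full-length feature lets reads mapped to the transcriptome be counted per sequence.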


“” is a Perl script that takes a set of HTSeq output files and writes a tab-delimited table of counts, with sample names in the header row and row names in the first column. The input file list should be a text file in which each line gives the name of an output file followed by the HTSeq files to combine into it. Each line is tab-delimited and has the following form:

   <NameForOutputFile> <firstHtseqFile> <NextHtseqFile> <...etc...>

Run: perl <inputFileList>
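
The merging step can be sketched in Python as below. This is an illustration under stated assumptions, not the Perl script itself: each HTSeq file is taken to hold two tab-separated columns (feature name and count), features missing from a sample are filled with 0, and the first header cell is labelled "feature". All of these choices, and the function names, are hypothetical.

```python
def parse_list_line(line):
    """Split one tab-delimited line of the file list into (output name, HTSeq files)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]

def merge_counts(samples):
    """Build table rows from {sample name: iterable of 'feature<TAB>count' lines}.

    Returns a list of tab-delimited strings: a header row of sample names,
    then one row per feature in order of first appearance.
    """
    names = list(samples)
    table = {}
    for i, name in enumerate(names):
        for line in samples[name]:
            line = line.rstrip("\n")
            if not line:
                continue
            feature, count = line.split("\t")
            # Features absent from a sample default to a count of 0.
            table.setdefault(feature, ["0"] * len(names))[i] = count
    rows = ["\t".join(["feature"] + names)]
    for feature, counts in table.items():
        rows.append("\t".join([feature] + counts))
    return rows
```

In practice the sample names would likely be derived from the HTSeq file names, and the summary rows that htseq-count appends to its output would be carried through unless filtered out.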



Files (659.3 MB)


Additional details


EAGER-NEON: Genomic Plasticity in Response to Variable Environments 1550838
National Science Foundation