Published November 1, 2018 | Version 1
Dataset Open

Phaeocystis globosa colonial gene expression

Description

Data and analysis for the paper: 

Differential gene expression supports a resource-intensive, defensive role for colony production in the bloom-forming haptophyte, Phaeocystis globosa

by: Margaret Mars Brisbin and Satoshi Mitarai

The Phaeocystis globosa CCMP1528 transcriptome used in the study (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta) was assembled with trimmed sequencing reads from 8 biological replicates (4 colonial replicates and 4 solitary replicates) with the Trinity software (v2.3.2).

Raw sequencing reads are available from the NCBI SRA with accession numbers: SRR7811979–SRR7811986.
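
The accession range above spans eight runs, which matches the number of biological replicates. As a minimal sketch, the range can be expanded into individual accession strings as follows. The helper name `expand_sra_range` is hypothetical, and which accession belongs to which replicate is not stated here.

```python
def expand_sra_range(first: str, last: str) -> list[str]:
    """Expand an inclusive SRA run range, e.g. SRR7811979 to SRR7811986."""
    prefix = first[:3]
    if last[:3] != prefix:
        raise ValueError("accessions must share the same prefix")
    start, end = int(first[3:]), int(last[3:])
    width = len(first) - 3  # keep any zero padding in the numeric part
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


accessions = expand_sra_range("SRR7811979", "SRR7811986")
for accession in accessions:
    print(accession)
```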

Before assembling the transcriptome, reads were quality filtered and trimmed with the Trimmomatic software (v0.36) using the command:

java -jar $TRIM/trimmomatic-0.36.jar PE -phred33 $DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R1_001.fastq.gz \
$DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R2_001.fastq.gz \
$OUT/S${SLURM_ARRAY_TASK_ID}_1_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_1_unpaired.fq \
$OUT/S${SLURM_ARRAY_TASK_ID}_2_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_2_unpaired.fq \
ILLUMINACLIP:$TRIM/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
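
To illustrate what the quality settings in the command above do to a single read, here is a simplified Python sketch of the LEADING, TRAILING, SLIDINGWINDOW, and MINLEN steps applied in that order. It is an approximation for explanation only: it does not model ILLUMINACLIP adapter clipping or paired-read bookkeeping, and Trimmomatic's own implementation may differ in detail. The function names are hypothetical.

```python
def phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string that uses the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_read(seq, qual, leading=3, trailing=3, window=4, min_avg=15, minlen=36):
    """Return (seq, qual) after trimming, or None if the read is dropped."""
    scores = phred33(qual)
    start, end = 0, len(scores)
    # LEADING:3 - drop low-quality bases from the 5' end.
    while start < end and scores[start] < leading:
        start += 1
    # TRAILING:3 - drop low-quality bases from the 3' end.
    while end > start and scores[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW:4:15 - cut where the window mean first falls below 15.
    for i in range(start, end - window + 1):
        if sum(scores[i:i + window]) / window < min_avg:
            end = i
            break
    # MINLEN:36 - discard reads that are too short after trimming.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]


# Hypothetical read: 40 high-quality bases followed by 10 bases at Q10.
example = trim_read("A" * 50, "I" * 40 + "+" * 10)
```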

Trimmed reads were mapped to the ERCC Mix 1 spike-in reference sequences, and read pairs that mapped to them were filtered out with the following bowtie2 (v2.2.6), samtools, and bedtools commands:

bowtie2 -t -x $REF \
-1 $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
-2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
-S $OUT/S${SLURM_ARRAY_TASK_ID}_ercc.sam

samtools view -bS $DATA/S${SLURM_ARRAY_TASK_ID}_ercc.sam >$DATA/S${SLURM_ARRAY_TASK_ID}.bam

samtools sort $DATA/S${SLURM_ARRAY_TASK_ID}.bam $DATA/S${SLURM_ARRAY_TASK_ID}_sorted

samtools view -b -f 13 S${SLURM_ARRAY_TASK_ID}_sorted.bam > S${SLURM_ARRAY_TASK_ID}_unmapped.bam

samtools sort -n $DATA/S${SLURM_ARRAY_TASK_ID}_unmapped.bam $DATA/S${SLURM_ARRAY_TASK_ID}.qsort

bedtools bamtofastq -i $DATA/S${SLURM_ARRAY_TASK_ID}.qsort.bam -fq $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq -fq2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq
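
The `-f 13` filter in the commands above keeps only alignments whose SAM flag has bits 1, 4, and 8 all set, that is, paired reads where neither the read nor its mate aligned to the ERCC reference. A short sketch of that bit arithmetic, using the flag meanings from the SAM specification (the helper names are hypothetical):

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}


def decode_flag(flag: int) -> list[str]:
    """List the names of the bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_required(flag: int, required: int) -> bool:
    """Mimic `samtools view -f`: every bit in `required` must be set."""
    return flag & required == required


print(decode_flag(13))
# 77 = paired + read unmapped + mate unmapped + first in pair: kept.
print(passes_required(77, 13))
# 73 = paired + mate unmapped + first in pair, but the read itself mapped: dropped.
print(passes_required(73, 13))
```

Because only pairs with both mates unmapped are kept, the FASTQ files written by bedtools contain the reads that did not originate from the spike-in sequences.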

The resulting trimmed reads, with ERCC sequences removed, were used to build the transcriptome assembly:

Trinity --seqType fq --max_memory 475G \
--left  $DATA2/C1_1_paired.fq,$DATA2/C2_1_paired.fq,$DATA2/C3_1_paired.fq,$DATA2/C4_1_paired.fq,$DATA2/S1_1_paired.fq,$DATA2/S2_1_paired.fq,$DATA2/S3_1_paired.fq,$DATA2/S4_1_paired.fq \
--right $DATA2/C1_2_paired.fq,$DATA2/C2_2_paired.fq,$DATA2/C3_2_paired.fq,$DATA2/C4_2_paired.fq,$DATA2/S1_2_paired.fq,$DATA2/S2_2_paired.fq,$DATA2/S3_2_paired.fq,$DATA2/S4_2_paired.fq \
--CPU 12

The Trinity assembly was dereplicated with CD-HIT-EST (v2016-0304) at a 95% sequence identity threshold:

cd-hit-est -i $DATA/Trinity.fasta -o Trinity_Pg_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40

The Trinity assembly was filtered to remove bacterial contamination by first running a blastn (v2.6.0+) search against the NCBI nr/nt database:

blastn -query $DATA/Trinity_Pg_clustered_95.fasta -task blastn -db $REF -num_threads 12 -max_target_seqs 1 -outfmt 5 > TrinityBlast.xml

and then removing bacterial sequences with the custom Python scripts included here: TrinityBlastXML.ipynb and FIlterTrinityEukNotEuk.ipynb
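
The notebooks themselves are the authoritative record of the filtering. As a rough illustration of the general idea only, the sketch below reads the top hit for each query from BLAST XML (`-outfmt 5`) and partitions contigs with a caller-supplied predicate. The sample XML, the contig names, the `BACTERIAL_GENERA` set, and the choice to keep contigs without hits are all assumptions made for this example, not the authors' method; a real workflow would more likely use a taxonomy lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated BLAST XML with three queries.
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_query-def>TRINITY_DN1_c0_g1_i1 len=300</Iteration_query-def>
<Iteration_hits><Hit><Hit_def>Emiliania huxleyi mRNA</Hit_def></Hit></Iteration_hits>
</Iteration>
<Iteration><Iteration_query-def>TRINITY_DN2_c0_g1_i1 len=450</Iteration_query-def>
<Iteration_hits><Hit><Hit_def>Alteromonas macleodii genome</Hit_def></Hit></Iteration_hits>
</Iteration>
<Iteration><Iteration_query-def>TRINITY_DN3_c0_g1_i1 len=500</Iteration_query-def>
<Iteration_hits></Iteration_hits>
</Iteration>
</BlastOutput_iterations></BlastOutput>"""


def top_hits(xml_text: str) -> dict:
    """Map each query ID to the definition line of its first hit (or None)."""
    hits = {}
    for iteration in ET.fromstring(xml_text).iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-def").split()[0]
        hits[query_id] = iteration.findtext("Iteration_hits/Hit/Hit_def")
    return hits


def split_contigs(hits: dict, is_bacterial) -> tuple:
    """Return (kept, removed) query IDs; contigs without hits are kept here."""
    kept, removed = [], []
    for query_id, hit_def in hits.items():
        flagged = hit_def is not None and is_bacterial(hit_def)
        (removed if flagged else kept).append(query_id)
    return kept, removed


# Placeholder classifier: first word of the hit definition is a listed genus.
BACTERIAL_GENERA = {"Alteromonas"}
kept, removed = split_contigs(
    top_hits(SAMPLE_XML), lambda d: d.split()[0] in BACTERIAL_GENERA
)
```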

RSEM (v1.2.22) was run with the final transcriptome assembly (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta): 

rsem-calculate-expression --bowtie2 --paired-end \
$DATA/C${SLURM_ARRAY_TASK_ID}_1_paired.fq \
$DATA/C${SLURM_ARRAY_TASK_ID}_2_paired.fq \
$REF/rsemref_longISO/pg_euks_RSEMref \
$REF/rsemout_longISO/C${SLURM_ARRAY_TASK_ID}

rsem-calculate-expression --bowtie2 --paired-end \
$DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
$DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
$REF/rsemref_longISO/pg_euks_RSEMref \
$REF/rsemout_longISO/S${SLURM_ARRAY_TASK_ID}

The resulting data files, C*.genes.results and S*.genes.results, were used with DESeq2 in the R environment to analyze differential gene expression. The code for these analyses is available in HTML and R Markdown (PhaeoColSol_DE.html, PhaeoColSol_DE.Rmd).
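
For readers who want to inspect the RSEM output outside R, the following sketch collects the `expected_count` column of several `*.genes.results` files into one gene-by-sample table. The column names follow RSEM's standard gene-level output; the in-memory sample data and the rounding to integers are assumptions for this example, and the actual import into DESeq2 is documented in PhaeoColSol_DE.Rmd.

```python
import csv
import io

HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
# Hypothetical stand-ins for C1.genes.results and S1.genes.results.
C1_TEXT = HEADER + "g1\tt1\t1000.00\t850.00\t10.00\t5.00\t4.00\n" \
                   "g2\tt2\t800.00\t650.00\t3.70\t2.00\t1.50\n"
S1_TEXT = HEADER + "g1\tt1\t1000.00\t850.00\t0.00\t0.00\t0.00\n" \
                   "g2\tt2\t800.00\t650.00\t7.20\t4.00\t3.00\n"


def read_expected_counts(handle) -> dict:
    """Read one genes.results table and round expected counts to integers."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: round(float(row["expected_count"])) for row in reader}


def build_count_matrix(samples: dict) -> tuple:
    """Return (sample names, {gene_id: [count per sample]}); missing genes get 0."""
    names = list(samples)
    genes = sorted(set().union(*samples.values()))
    return names, {g: [samples[s].get(g, 0) for s in names] for g in genes}


samples = {
    "C1": read_expected_counts(io.StringIO(C1_TEXT)),
    "S1": read_expected_counts(io.StringIO(S1_TEXT)),
}
names, matrix = build_count_matrix(samples)
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")` for each sample.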

The transcriptome assembly was annotated with the Dammit software (v1.0rc2), which wraps TransDecoder, HMMER, and BUSCO, and by submitting the translated amino acid sequences to GhostKOALA.

The raw Pfam annotation results from Dammit are included: pg_euk_seqs.fasta.x.pfam.gff3. These results were parsed with the script Pfam_gffParsing.ipynb. The resulting file, pfam_parsed_annotation.csv, is used in the script PhaeoColSol_DE.Rmd together with pfam2go4R.txt for GO enrichment analysis. The script shinycolsol.Rmd creates an interactive plot of the GO enrichment results.
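
Pfam_gffParsing.ipynb is the authoritative parser for this file. As a generic sketch of how a GFF3 record can be split into its nine tab-separated columns and its `key=value` attributes, consider the following. The sample line, including its attribute keys and values, is hypothetical and may not match the layout that Dammit writes.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    record = dict(zip(GFF3_COLUMNS, cols))
    record["start"], record["end"] = int(cols[3]), int(cols[4])
    attributes = {}
    for field in cols[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters inside attribute values.
            attributes[key.strip()] = unquote(value.strip().strip('"'))
    record["attributes"] = attributes
    return record


# Hypothetical feature line for illustration only.
sample = ("Transcript_1\tHMMER\tprotein_hmm_match\t10\t270\t1.2e-30\t.\t.\t"
          "ID=homology:1;Name=PF00069.20;Target=Pkinase 1 260")
record = parse_gff3_line(sample)
# Strip the version suffix to get the bare Pfam accession.
pfam_accession = record["attributes"]["Name"].split(".")[0]
```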

The GhostKOALA results (user_ko.csv) are used in the script PhaeoColSol_DE.Rmd for KEGG pathway enrichment analysis.
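
The enrichment method actually used is defined in PhaeoColSol_DE.Rmd. For orientation only, many GO and KEGG enrichment tools are built on a hypergeometric over-representation test, which can be sketched as below. This is a generic illustration and not necessarily the authors' procedure; the gene counts in the example are invented.

```python
from math import comb


def hypergeom_sf(k: int, population: int, annotated: int, drawn: int) -> float:
    """P(X >= k): chance of seeing at least k annotated genes among `drawn`
    genes sampled without replacement from `population` genes, of which
    `annotated` belong to the pathway or term of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Invented example: 10 genes in total, 5 in a pathway, 5 differentially
# expressed genes, all 5 of which fall in the pathway.
p_value = hypergeom_sf(5, population=10, annotated=5, drawn=5)
print(p_value)
```

In practice the p-values for all tested pathways would also be corrected for multiple testing.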

Files (537.2 MB)

Checksum Size
md5:ef359d66cbc61ab78b3670c83c7ea9c5 5.5 MB
md5:26570542318338696005f32330eef34b 5.5 MB
md5:d864df259fb623fdc81b24691683ffe4 5.5 MB
md5:d19dd7c0b8e84e89d9855e1173d80b27 5.5 MB
md5:8890c69f79b9cfecf04c532fc2621515 3.3 kB
md5:b867e9542524e8c6bfbeb8591bb88072 708.5 kB
md5:3c805c64f841b8d55b86d72ba2201031 4.7 kB
md5:f9cf4051238bab76c3e3276418d74711 790.4 kB
md5:67b098e9997a88d1b4fef8568c56e514 14.1 MB
md5:110c4f6fd66a5bb00732078ba52f8c15 15.7 MB
md5:9fc875389cc2144f1fdc7376fec3cb35 45.2 MB
md5:582544b4fc5efc04abfddf174fbc5c89 2.7 MB
md5:e3bb2a155b992edae1e131441847a50b 27.1 kB
md5:791eefd7d3157f31d12c834bed20bf7f 57.6 MB
md5:955c9ddd7f5cee2edb52a43e0abb1548 72.0 kB
md5:d0709f34195f5341b646c28391309073 5.5 MB
md5:a3efc2c0179a34261e9c3fb0c8eaa89e 5.5 MB
md5:c0c9381153b859416c510bfcb06154e5 5.5 MB
md5:3a74f538db5aa806a42d4602c4c04d23 5.5 MB
md5:f68e1176f34b4f8fa334f279b078de5a 7.2 kB
md5:082e4a177f362a74744561014325417c 352.2 MB
md5:60669bb77abfa52703b7ce9315b60470 20.0 kB
md5:d74d4705eefee3718a702da4cd7e3f1c 3.7 MB