Phaeocystis globosa colonial gene expression
Description
Data and analysis for the paper:
Differential gene expression supports a resource-intensive, defensive role for colony production in the bloom-forming haptophyte, Phaeocystis globosa
by: Margaret Mars Brisbin and Satoshi Mitarai
The Phaeocystis globosa CCMP1528 transcriptome used in the study (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta) was assembled with Trinity (v2.3.2) from trimmed sequencing reads from 8 biological replicates (4 colonial and 4 solitary).
Raw sequencing reads are available from the NCBI SRA with accession numbers: SRR7811979–SRR7811986.
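The accession range above spans eight runs, which matches the eight biological replicates. As a minimal sketch, the range can be expanded into individual accession strings before fetching them with whichever SRA client is preferred. The helper name `expand_accessions` is hypothetical, and which accession corresponds to which replicate is not stated here.

```python
def expand_accessions(first, last):
    """Expand an inclusive accession range that shares one alphabetic prefix."""
    prefix = first.rstrip("0123456789")
    if not last.startswith(prefix):
        raise ValueError("accessions do not share a prefix")
    width = len(first) - len(prefix)          # keep any zero padding
    start = int(first[len(prefix):])
    end = int(last[len(prefix):])
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = expand_accessions("SRR7811979", "SRR7811986")
for run in runs:
    print(run)
```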
Before assembling the transcriptome, reads were quality filtered and trimmed with Trimmomatic (v0.36) using the command:
java -jar $TRIM/trimmomatic-0.36.jar PE -phred33 $DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R1_001.fastq.gz \
$DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R2_001.fastq.gz \
$OUT/S${SLURM_ARRAY_TASK_ID}_1_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_1_unpaired.fq \
$OUT/S${SLURM_ARRAY_TASK_ID}_2_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_2_unpaired.fq \
ILLUMINACLIP:$TRIM/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Trimmed reads were mapped to the ERCC spike-in reference sequences for Mix1, and read pairs that mapped were filtered out using the following commands from bowtie2 (v2.2.6), samtools, and bedtools:
bowtie2 -t -x $REF \
-1 $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
-2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
-S $OUT/S${SLURM_ARRAY_TASK_ID}_ercc.sam
samtools view -bS $DATA/S${SLURM_ARRAY_TASK_ID}_ercc.sam >$DATA/S${SLURM_ARRAY_TASK_ID}.bam
samtools sort $DATA/S${SLURM_ARRAY_TASK_ID}.bam $DATA/S${SLURM_ARRAY_TASK_ID}_sorted
samtools view -b -f 13 $DATA/S${SLURM_ARRAY_TASK_ID}_sorted.bam > $DATA/S${SLURM_ARRAY_TASK_ID}_unmapped.bam
samtools sort -n $DATA/S${SLURM_ARRAY_TASK_ID}_unmapped.bam $DATA/S${SLURM_ARRAY_TASK_ID}.qsort
bedtools bamtofastq -i $DATA/S${SLURM_ARRAY_TASK_ID}.qsort.bam -fq $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq -fq2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq
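The `-f 13` option in the `samtools view` step keeps only records with flag bits 1, 4 and 8 all set, that is, paired reads where neither mate aligned to the ERCC reference. The sketch below illustrates that bit test using the flag values from the SAM specification. The function name and the toy flag values are illustrative only.

```python
# Flag bits as defined in the SAM format specification.
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read did not align
MATE_UNMAPPED = 0x8   # its mate did not align


def has_all_flags(flag, required):
    """Mimic `samtools view -f`: keep a record only if every required bit is set."""
    return (flag & required) == required


REQUIRED = PAIRED | READ_UNMAPPED | MATE_UNMAPPED  # equals 13

# Toy flags: 77 and 141 are the two mates of a fully unaligned pair,
# 99 is a properly aligned first mate, 73 is an aligned read whose mate is unaligned.
kept = [flag for flag in (77, 141, 99, 73) if has_all_flags(flag, REQUIRED)]
print(kept)
```

Under this filter a pair is discarded as soon as either mate aligns to an ERCC sequence, so the FASTQ files written by `bedtools bamtofastq` stay strictly paired.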
The resulting trimmed reads, with ERCC sequences removed, were used for the transcriptome assembly:
Trinity --seqType fq --max_memory 475G \
--left $DATA2/C1_1_paired.fq,$DATA2/C2_1_paired.fq,$DATA2/C3_1_paired.fq,$DATA2/C4_1_paired.fq,$DATA2/S1_1_paired.fq,$DATA2/S2_1_paired.fq,$DATA2/S3_1_paired.fq,$DATA2/S4_1_paired.fq \
--right $DATA2/C1_2_paired.fq,$DATA2/C2_2_paired.fq,$DATA2/C3_2_paired.fq,$DATA2/C4_2_paired.fq,$DATA2/S1_2_paired.fq,$DATA2/S2_2_paired.fq,$DATA2/S3_2_paired.fq,$DATA2/S4_2_paired.fq \
--CPU 12
The Trinity assembly was dereplicated with CD-HIT-EST (v2016-0304) at a 95% sequence identity threshold:
cd-hit-est -i $DATA/Trinity.fasta -o Trinity_Pg_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40
The dereplicated assembly was filtered to remove bacterial contamination by first running blastn (v2.6.0+) against the NCBI nr/nt database:
blastn -query $DATA/Trinity_Pg_clustered_95.fasta -task blastn -db $REF -num_threads 12 -max_target_seqs 1 -outfmt 5 > TrinityBlast.xml
and then removing bacterial sequences with the custom Python notebooks included here: TrinityBlastXML.ipynb and FIlterTrinityEukNotEuk.ipynb.
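The notebooks themselves are not reproduced in this description. As a rough sketch of the general idea only, and not the authors' method, the following reads BLAST XML (`-outfmt 5`) with the Python standard library, takes the first reported hit per query, and flags queries whose hit description contains a keyword. The keyword tuple, the function names, the toy XML, and the choice to keep sequences without any hit are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder keyword list: an assumption for illustration, not the notebooks' criteria.
FLAG_KEYWORDS = ("Escherichia",)


def top_hits(xml_text):
    """Map each query ID to the description of its first reported hit (or None)."""
    root = ET.fromstring(xml_text)
    hits = {}
    for iteration in root.iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-def").split()[0]
        hit = iteration.find("Iteration_hits/Hit")
        hits[query_id] = hit.findtext("Hit_def") if hit is not None else None
    return hits


def is_flagged(description, keywords=FLAG_KEYWORDS):
    """True when a hit description mentions any of the keywords."""
    if description is None:
        return False
    lowered = description.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Toy BLAST XML with three queries; the last one has no hits.
EXAMPLE_XML = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_query-def>contig_1 len=500</Iteration_query-def>
<Iteration_hits><Hit><Hit_def>Escherichia coli genome</Hit_def></Hit></Iteration_hits></Iteration>
<Iteration><Iteration_query-def>contig_2 len=800</Iteration_query-def>
<Iteration_hits><Hit><Hit_def>Emiliania huxleyi mRNA</Hit_def></Hit></Iteration_hits></Iteration>
<Iteration><Iteration_query-def>contig_3 len=300</Iteration_query-def>
<Iteration_hits></Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""

hits = top_hits(EXAMPLE_XML)
retained = [query for query, description in hits.items() if not is_flagged(description)]
print(retained)
```

A real filter would more likely rely on taxonomy identifiers than on free-text matching, and a multi-gigabyte XML file would call for `ET.iterparse` rather than loading the whole document at once.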
RSEM (v1.2.22) was run with the final transcriptome assembly (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta):
rsem-calculate-expression --bowtie2 --paired-end \
$DATA/C${SLURM_ARRAY_TASK_ID}_1_paired.fq \
$DATA/C${SLURM_ARRAY_TASK_ID}_2_paired.fq \
$REF/rsemref_longISO/pg_euks_RSEMref \
$REF/rsemout_longISO/C${SLURM_ARRAY_TASK_ID}
rsem-calculate-expression --bowtie2 --paired-end \
$DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
$DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
$REF/rsemref_longISO/pg_euks_RSEMref \
$REF/rsemout_longISO/S${SLURM_ARRAY_TASK_ID}
The resulting data files, C*.genes.results and S*.genes.results, were used with DESeq2 in the R environment to analyze differential gene expression. The code for these analyses is available in HTML and R Markdown (PhaeoColSol_DE.html, PhaeoColSol_DE.Rmd).
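For readers who want to inspect the RSEM output outside R, the sketch below combines the `expected_count` column of several `*.genes.results` files into one gene-by-sample matrix of integer counts. The column names follow the standard RSEM gene-level output. The rounding step, the function names, and the toy values are assumptions, and the R Markdown file may import the data differently (for example via tximport).

```python
import csv
import io


def read_expected_counts(handle):
    """Read one RSEM genes.results table and return {gene_id: rounded expected_count}."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Python's round() sends exact halves to the nearest even integer.
    return {row["gene_id"]: round(float(row["expected_count"])) for row in reader}


def build_count_matrix(samples):
    """Turn {sample: {gene: count}} into (sample_names, {gene: [counts in sample order]})."""
    names = sorted(samples)
    genes = sorted(set().union(*(samples[name] for name in names)))
    return names, {gene: [samples[name].get(gene, 0) for name in names] for gene in genes}


# Toy tables standing in for C1.genes.results and S1.genes.results (values are made up).
HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
c1 = io.StringIO(HEADER + "g1\tt1\t1000\t850.0\t10.00\t5.0\t4.0\n"
                          "g2\tt2\t500\t350.0\t3.49\t2.0\t1.5\n")
s1 = io.StringIO(HEADER + "g1\tt1\t1000\t850.0\t7.80\t4.0\t3.0\n"
                          "g2\tt2\t500\t350.0\t0.00\t0.0\t0.0\n")

names, matrix = build_count_matrix({
    "C1": read_expected_counts(c1),
    "S1": read_expected_counts(s1),
})
print(names, matrix)
```

With real files, each `io.StringIO` object would be replaced by `open(path, newline="")` on the corresponding results file.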
The transcriptome assembly was annotated with the Dammit software (v1.0rc2), which wraps Transdecoder, HMMER, and BUSCO, and by submitting the translated amino acid sequences to GhostKOALA.
The raw pfam Dammit annotation results are included: pg_euk_seqs.fasta.x.pfam.gff3. These results were parsed with the script: Pfam_gffParsing.ipynb. The resulting file, pfam_parsed_annotation.csv, is used in the script PhaeoColSol_DE.Rmd with pfam2go4R.txt for GO enrichment analysis. The script shinycolsol.Rmd creates an interactive plot of GO enrichment results.
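Pfam_gffParsing.ipynb is not reproduced here. As a generic illustration of the parsing step, the sketch below reads GFF3 lines (nine tab-separated columns, with `key=value` pairs separated by semicolons in the last column) and collects Pfam accessions per sequence. Treating `Dbxref` as the attribute that carries the accession, and stripping the version suffix, are assumptions about the file layout rather than confirmed details of the Dammit output. The toy records are invented.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a {key: value} dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip().strip('"')
    return attributes


def pfam_by_sequence(lines, attribute="Dbxref"):
    """Collect unversioned Pfam accessions for each sequence ID (column 1)."""
    domains = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip directives, comments, blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue                      # not a well-formed GFF3 feature line
        value = parse_attributes(columns[8]).get(attribute)
        if value:
            # "Pfam:PF00069.25" -> "PF00069"
            domains[columns[0]].add(value.split(":")[-1].split(".")[0])
    return {seq_id: sorted(found) for seq_id, found in domains.items()}


# Invented example records in GFF3 layout.
EXAMPLE_GFF3 = [
    "##gff-version 3\n",
    'contig_1\tHMMER\tprotein_hmm_match\t10\t250\t1e-20\t.\t.\tID=hit1;Name=Pkinase;Dbxref="Pfam:PF00069.25"\n',
    'contig_1\tHMMER\tprotein_hmm_match\t300\t420\t1e-08\t.\t.\tID=hit2;Name=ABC_tran;Dbxref="Pfam:PF00005.27"\n',
    'contig_2\tHMMER\tprotein_hmm_match\t5\t120\t1e-12\t.\t.\tID=hit3;Name=Pkinase;Dbxref="Pfam:PF00069.25"\n',
]

annotation = pfam_by_sequence(EXAMPLE_GFF3)
print(annotation)
```

Unversioned accessions of this form are what a Pfam-to-GO mapping table is typically keyed on, which is why the version suffix is dropped in the sketch.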
The GhostKOALA results (user_ko.csv) are used in the script PhaeoColSol_DE.Rmd for KEGG pathway enrichment analysis.
Files (537.2 MB)
MD5 checksum | Size |
---|---|
md5:ef359d66cbc61ab78b3670c83c7ea9c5 | 5.5 MB |
md5:26570542318338696005f32330eef34b | 5.5 MB |
md5:d864df259fb623fdc81b24691683ffe4 | 5.5 MB |
md5:d19dd7c0b8e84e89d9855e1173d80b27 | 5.5 MB |
md5:8890c69f79b9cfecf04c532fc2621515 | 3.3 kB |
md5:b867e9542524e8c6bfbeb8591bb88072 | 708.5 kB |
md5:3c805c64f841b8d55b86d72ba2201031 | 4.7 kB |
md5:f9cf4051238bab76c3e3276418d74711 | 790.4 kB |
md5:67b098e9997a88d1b4fef8568c56e514 | 14.1 MB |
md5:110c4f6fd66a5bb00732078ba52f8c15 | 15.7 MB |
md5:9fc875389cc2144f1fdc7376fec3cb35 | 45.2 MB |
md5:582544b4fc5efc04abfddf174fbc5c89 | 2.7 MB |
md5:e3bb2a155b992edae1e131441847a50b | 27.1 kB |
md5:791eefd7d3157f31d12c834bed20bf7f | 57.6 MB |
md5:955c9ddd7f5cee2edb52a43e0abb1548 | 72.0 kB |
md5:d0709f34195f5341b646c28391309073 | 5.5 MB |
md5:a3efc2c0179a34261e9c3fb0c8eaa89e | 5.5 MB |
md5:c0c9381153b859416c510bfcb06154e5 | 5.5 MB |
md5:3a74f538db5aa806a42d4602c4c04d23 | 5.5 MB |
md5:f68e1176f34b4f8fa334f279b078de5a | 7.2 kB |
md5:082e4a177f362a74744561014325417c | 352.2 MB |
md5:60669bb77abfa52703b7ce9315b60470 | 20.0 kB |
md5:d74d4705eefee3718a702da4cd7e3f1c | 3.7 MB |
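The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's `hashlib`. The function names are illustrative, and because the listing does not pair checksums with file names, the path in the usage comment is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    return md5_of_file(path) == expected.split(":")[-1].strip().lower()


# Usage (placeholder path; substitute a downloaded file and its listed checksum):
# matches_checksum("path/to/downloaded_file", "md5:ef359d66cbc61ab78b3670c83c7ea9c5")
```

Reading in chunks keeps memory use flat even for the largest file in the record (352.2 MB).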