Published November 10, 2021 | Version v1
Dataset Open

Supplemental data for the publication: "Resolving the microalgal gene landscape at the strain level: A novel hybrid transcriptome of Emiliania huxleyi CCMP3266"

  • 1. Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
  • 2. Nancy and Stephen Grand Israel National Center for Personalized Medicine, Weizmann Institute of Science, Rehovot, Israel

Description

The dataset is part of a peer-reviewed publication that can be accessed here: https://doi.org/10.1128/aem.01418-21. The dataset includes the following files:

Data S1: Hybrid transcriptome of E. huxleyi CCMP3266 (FASTA format). The FASTA header matches the “TransID.SPAdes”, “GeneID” and “TransID.TSA” columns of Data S2.

Data S2: E. huxleyi CCMP3266 hybrid transcriptome annotation table (tsv format). Columns 1-4: CCMP3266 gene and transcript IDs; columns 5-6: gene and transcript length; column 7: longest transcript per gene; column 8: transcripts with protein-coding ORF; columns 9-10: Illumina short-read counts; columns 11-13: results of differential gene expression analysis; column 14: PacBio CCS long-read counts; columns 15-27: blastx/blast2GO functional annotations.
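The column layout of Data S2 can be expressed as a small lookup table, which makes it easier to pull out one group of fields at a time. The following Python sketch is illustrative only: the group names (`ids`, `illumina_counts`, and so on) are labels chosen here rather than the actual column headers of the file, and whether the file starts with a header row is not stated in this record, so check the first line before relying on it.

```python
# Column groups of Data S2 as described in this record (1-based, inclusive).
# The group names are illustrative labels, not the file's real header names.
DATA_S2_GROUPS = {
    "ids": (1, 4),
    "lengths": (5, 6),
    "longest_transcript_per_gene": (7, 7),
    "protein_coding_orf": (8, 8),
    "illumina_counts": (9, 10),
    "differential_expression": (11, 13),
    "pacbio_ccs_counts": (14, 14),
    "functional_annotation": (15, 27),
}


def split_row(line, groups=DATA_S2_GROUPS):
    """Split one tab-separated line into the column groups listed above."""
    fields = line.rstrip("\r\n").split("\t")
    expected = max(end for _, end in groups.values())
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    # Convert the 1-based inclusive ranges into Python slices.
    return {name: fields[start - 1:end] for name, (start, end) in groups.items()}


# Synthetic 27-column row, used only to show the shape of the result.
demo_row = "\t".join(f"c{i}" for i in range(1, 28))
demo_groups = split_row(demo_row)
```

With the synthetic row, `demo_groups["illumina_counts"]` holds the values from columns 9 and 10. The same function can be applied line by line to the real table after skipping a header row, if one is present.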

Data S3: E. huxleyi CCMP1516 reference genes used for transcriptome completeness estimates (FASTA format). The FASTA file contains nucleotide sequences of E. huxleyi CCMP1516 core genes supported by expressed sequence tags (ESTs). The gene set was compiled from data provided by Read et al. (2013; PubMed ID: 23760476).

Data S4: E. huxleyi CCMP3266 sGenome (FASTA format).

Data S5: E. huxleyi CCMP3266 sGenome gene annotation file (GFF3 format).

Data S6: E. huxleyi CCMP3266 novel genes that were absent from the CCMP1516 reference genome (tsv format). Column 1 (GeneID) and column 2 (TransID.SPAdes) include CCMP3266 gene and transcript identifiers, which can be used to retrieve nucleotide sequences from the hybrid transcriptome (Data S1). Columns 3-4: gene and transcript length; column 5: gene expression levels determined by mapping Illumina QC reads to the sGenome (RPK normalized); column 6: gene expression levels determined by mapping PacBio CCS reads to the sGenome (read counts); column 7: number of publicly available E. huxleyi transcriptomes (n = 17; available in the TSA database) that produced a significant BLAT hit; columns 8-15: blastx/blast2GO functional annotations.
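Retrieving Data S1 sequences for identifiers taken from Data S6 might look like the Python sketch below. It rests on an assumption that should be checked against the real file: that each identifier appears in the FASTA header as a separate token delimited by whitespace, `|`, `;` or `,`. The exact header layout is not specified in this record. The identifiers in the example (`tx1.1`, `gene1`) are made up for illustration.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_id(fasta_lines, wanted_ids):
    """Return {header: sequence} for records whose header contains a wanted ID.

    Assumption: IDs are separate header tokens split on whitespace, '|', ';' or ','.
    """
    wanted = set(wanted_ids)
    selected = {}
    for header, sequence in read_fasta(fasta_lines):
        tokens = set(re.split(r"[\s|;,]+", header))
        if tokens & wanted:
            selected[header] = sequence
    return selected


# Made-up records, used only to demonstrate the lookup.
example_fasta = [">tx1.1 gene1", "ACGT", "TTGA", ">tx2.1 gene2", "GGCC"]
hits = select_by_id(example_fasta, {"tx1.1"})
```

For the real data, `fasta_lines` would be an open file handle on Data S1, and `wanted_ids` the values read from column 1 or column 2 of Data S6.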

Files (341.9 MB)

Checksum                                  Size
md5:dade50ee524b6bbed3397dc45101d9af      64.0 MB
md5:84ac4af29ce45bde9968af6bcbb2abde      30.3 MB
md5:5b6d797e4549283378603f4e1750239f      26.8 MB
md5:8cd6b8c38cda114cd24c47ffc711b206      187.3 MB
md5:8fafc6ed790758560eea6cc865d65b45      33.2 MB
md5:26e854b4283a4b171cf3d0c8b87cc362      276.2 kB
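
The MD5 values listed above can be used to confirm that a download is complete and uncorrupted. Below is a minimal Python sketch using the standard-library `hashlib` module. The helper names are illustrative, and the local file path is whatever name the file was saved under, since file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a local file against a listed value such as 'md5:dade50ee...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Calling `matches_listed_checksum(local_path, "md5:26e854b4283a4b171cf3d0c8b87cc362")` on the smallest file should return `True` for an intact download. Reading in chunks keeps memory use low for the larger files.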