Published January 31, 2024 | Version 0.91
Dataset | Open Access

The North Pacific Eukaryotic Gene Catalog: metatranscriptome assemblies with taxonomy, function and abundance annotations

Contributors

Project manager; Researcher
  • University of Washington

Description

This dataset builds on the unprocessed NPEGC Trinity de novo metatranscriptome assemblies, which are deposited in a separate Zenodo repository for the raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3.

A full description of these data is published in Scientific Data: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this dataset:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.


Excerpts of key processing steps are shown below, with links to the detailed code in the main GitHub repository: https://github.com/armbrustlab/NPac_euk_gene_catalog


Processing and annotation of the protein-level NPEGC metatranscripts proceeds in six primary steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3
6. Functional annotation against KOfam HMM profiles (KEGG release 104.0) using KofamScan v1.3.0

# Define local NPEGC base directory here:
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"

# Raw assemblies are located in the /assemblies/raw/ directory
# for each of the metatranscriptome projects
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"

# Raw Trinity assemblies for a given ${PROJECT} (one entry of PROJECT_LIST):
RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
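
The per-project raw assembly path can then be resolved by looping over PROJECT_LIST; a minimal sketch using only the variables defined above:

# Minimal sketch: resolve the raw assembly directory for each project in PROJECT_LIST
for PROJECT in $PROJECT_LIST; do
    RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
    echo "Raw assemblies for ${PROJECT}: ${RAW_ASSEMBLY_DIR}"
done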

Translation
We began processing the raw metatranscriptome assemblies with six-frame translation of the nucleotide transcripts into three forward and three reverse reading-frame translations, using the transeq tool from the EMBOSS package. We added a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies). See NPEGC.6tr_frame_selection_clustering.sh for the full code.

Example of six-frame translation using transeq
transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta
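
The cruise and sample prefix can be prepended to each FASTA header with a simple stream edit; the exact command and step in the linked script may differ, but a minimal sketch (assuming ${PREFIX} holds the cruise and sample label, e.g. G1PA_S09C1_3um) is:

# Prepend the cruise/sample prefix to every sequence ID in the translated FASTA
# (illustrative sketch; see NPEGC.6tr_frame_selection_clustering.sh for the actual step)
sed "s/^>/>${PREFIX}_/" 6tr/${PREFIX}.Trinity.6tr.fasta > 6tr/${PREFIX}.Trinity.6tr.renamed.fasta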

Frame selection
We used the custom frame-selection Python script keep_longest_frame.py to determine which translation frame contains the longest coding length and retained that sequence (or multiple sequences in the case of a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for the full code.
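
A hypothetical invocation of the frame-selection step (the real arguments and output naming are defined in the linked script) could look like:

# Hypothetical call to the custom frame-selection script; the argument shown here
# is illustrative only -- consult NPEGC.6tr_frame_selection_clustering.sh for the exact usage
python keep_longest_frame.py 6tr/${PREFIX}.Trinity.6tr.fasta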

Clustering by sequence identity
To reduce redundancy from near-identical sequences, we clustered the protein sequences at 99% sequence identity and retained each cluster representative in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for the full linclust/mmseqs clustering code.

Sample of linclust clustering script: core mmseqs function
function NPEGC_linclust {
# make an index of the fasta file:
$MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
# cluster sequences at $MIN_SEQ_ID
$MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
# retrieve cluster representatives:
$MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
# generate flat FASTA output with cluster reps
$MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
}
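
A brief usage sketch for the function above; the variable values here are illustrative assumptions, not values taken from the original script:

# Illustrative driver for NPEGC_linclust (all values below are assumptions for demonstration)
MMSEQS_DIR="/opt/mmseqs/bin"                # path to the mmseqs binary
FASTA_PATH="${NPEGC_DIR}/G1PA/assemblies"   # directory holding the frame-selected FASTA
FASTA_FILE="NPac.G1PA.bf100.fasta"          # frame-selected protein FASTA
STUDY="G1PA"
MIN_SEQ_ID=0.99                             # 99% sequence identity threshold
NPEGC_linclust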

Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (.bf100.id99.aa.fasta.gz)
    NPac.G1PA.bf100.id99.aa.fasta.gz
    NPac.G2PA.bf100.id99.aa.fasta.gz
    NPac.G3PA.bf100.id99.aa.fasta.gz
    NPac.G3PA_diel.bf100.id99.aa.fasta.gz
    NPac.D1PA.bf100.id99.aa.fasta.gz
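
A downloaded file can be checked against the MD5 listed under Files, and the number of cluster representatives counted, with standard shell tools (shown here for G1PA; not part of the original pipeline):

# Verify the download and count protein sequences in the clustered FASTA
md5sum NPac.G1PA.bf100.id99.aa.fasta.gz
zcat NPac.G1PA.bf100.id99.aa.fasta.gz | grep -c '^>'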

MarFERReT + MARMICRODB taxonomic annotation with DIAMOND

Taxonomy was inferred for the NPEGC metatranscripts using the DIAMOND fast protein aligner against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for a full description of the DIAMOND annotation.

Excerpt of core DIAMOND function:
function NPEGC_diamond {
# FASTA filename for $STUDY
FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# Output filename for LCA results in lca.tab file:
LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
echo "Beginning ${STUDY}"
singularity exec --no-home --bind ${DATA_DIR} \
        "${CONTAINER_DIR}/diamond.sif" diamond blastp \
        -c 4 --threads $N_THREADS \
        --db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
        --memory-limit 110 \
        --query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
}
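
A brief driver sketch for the function above; the database path, e-value cutoff and thread count are illustrative assumptions rather than values from the original script:

# Illustrative driver for NPEGC_diamond (values below are assumptions for demonstration)
DATA_DIR="${NPEGC_DIR}"
CONTAINER_DIR="/opt/containers"                  # location of the diamond.sif image
MFT_MMDB_DMND_DB="MarFERReT_v1.1_MMDB.dmnd"      # combined MarFERReT + MARMICRODB DIAMOND database
EVALUE="1e-5"
N_THREADS=16
for STUDY in $PROJECT_LIST; do
    NPEGC_diamond
done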

Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor (LCA) predictions with NCBI Taxonomy against the combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
    NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
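
The LCA tables are three-column, tab-separated files in DIAMOND's taxonomic classification format (-f 102): query sequence ID, assigned NCBI taxID (0 if unclassified), and the e-value of the best alignment. For example, assignments per taxID can be tallied from a downloaded table with:

# Tally LCA assignments per NCBI taxID (column 2); taxID 0 means unclassified
zcat NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz | \
    awk -F'\t' '{n[$2]++} END {for (t in n) print t, n[t]}' | \
    sort -k2,2nr | head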

Pfam 35.0 functional annotation using HMMER3
Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3. Pfam annotation code is documented here: NPEGC.hmmer_function.sh

Excerpt of core hmmsearch function:

function NPEGC_hmmer {
# Define input FASTA
INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# hmmsearch call:
hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
# compress output file:
gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
}
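
The resulting files follow the standard HMMER domtblout layout, in which the target name (column 1) is the protein sequence ID and the query name and accession (columns 4 and 5) identify the Pfam family. A quick extraction of sequence-to-Pfam assignments from a downloaded table:

# List protein ID, Pfam family name and Pfam accession from the domain table
# ('#' lines are headers/comments in domtblout format)
zcat G1PA.Pfam35.domtblout.tab.gz | grep -v '^#' | awk '{print $1, $4, $5}' | head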

Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
    G1PA.Pfam35.domtblout.tab.gz
    G2PA.Pfam35.domtblout.tab.gz
    G3PA.Pfam35.domtblout.tab.gz
    G3PA_diel.Pfam35.domtblout.tab.gz
    D1PA.Pfam35.domtblout.tab.gz

KEGG functional annotation using KofamScan v1.3.0

Clustered protein sequences were annotated against the KOfam collection (KEGG release 104.0) of 20,819 protein family Hidden Markov Models (HMMs) using KofamScan, the command-line version of KofamKOALA. KOfam annotation code is documented here: NPEGC.kofamscan_function.sh

Excerpt of core NPEGC_kofam function:

# Core function to perform KofamScan annotation
function NPEGC_kofam {
    # Define input FASTA
    local INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"

    # KofamScan call
    ${KOFAM_DIR}/kofam_scan-1.3.0/exec_annotation -f detail-tsv -E ${EVALUE} -o ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv ${FASTA_DIR}/${INPUT_FASTA}

    # Keep best hit (data is already sorted by KofamScan)
    sort -uk1,1 ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv > ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv

    # Compress output file
    gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv

    # Compress best.kofam output file
    gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv
}


Hits with a score > 30 were then filtered in R to produce the final annotation tables.
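
A rough shell equivalent of that R filtering step (assuming the score is the fifth tab-separated field of the KofamScan detail-tsv output) would be:

# Approximate shell version of the R filtering step: keep best-hit rows with score > 30
# (assumes the score is field 5 of the detail-tsv output; the original filtering was done in R)
zcat NPac.G1PA.bf100.id99.aa.best.kofam.tsv.gz | awk -F'\t' '$5+0 > 30' > NPac.G1PA.bf100.id99.aa.best.Kofam.incT30.tsv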

Corresponding files uploaded to this repository: Gzip-compressed KofamScan best-hit tables filtered at score > 30 (*.best.Kofam.incT30.csv.gz):
    NPac.G1PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G2PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G3PA.UW.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G3PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz
    NPac.D1PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz

The full KofamScan tables with score > 30 are deposited here: https://zenodo.org/records/13743267

Files (29.0 GB)

MD5 checksum                            Size
md5:0c5da5f7d452ba75772f95d539b4afe5    596.7 MB
md5:d4535def95a5de06ed142f1ab1a15b8d    477.8 MB
md5:4ce03cab6e4cab21dd976063b865dc4f    616.0 MB
md5:f6bb78ba55eeb10104f8b4ce45b285e9    569.5 MB
md5:3fae82dae3f986957dbd52a5545a1deb    371.8 MB
md5:c7f99dd654070d79d236647838170dfe    5.1 GB
md5:f1cdc989a5b391c051c737b1a89a9b5e    344.1 MB
md5:30e0c16af89a4735aaff7240b530b4f9    461.6 MB
md5:9877247ccd9e833f4b95610d1d3f9b67    255.2 MB
md5:e914cdc30518a2555b932996a10747e1    4.4 GB
md5:ea7bce18bce8470a62075aba0f2e3017    351.3 MB
md5:7820e3c522bfb92450644447de12f474    312.9 MB
md5:7c1e14fb1f2a6cc54160e7ceb53e0f7e    5.8 GB
md5:92739bfc47d23937770f4f08ef3c8372    463.0 MB
md5:90f85fcaf34e176964e3a30b861b38ca    4.7 GB
md5:495bf4d59a2c00605722d4bde54c67f2    211.9 MB
md5:83637e8d4177cb562003b2b5111f593b    367.3 MB
md5:4723fe8dccd0cb9e40ef11e5a3555a22    214.3 MB
md5:c5a5a9431e39eb58dcb8b50eba49c615    3.1 GB
md5:0c08ef9bba4297016e8c051bf238e25e    234.6 MB

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.7332795 (DOI)