Published January 31, 2024 | Version 0.91
Dataset | Open Access

The North Pacific Eukaryotic Gene Catalog: metatranscriptome assemblies with taxonomy, function and abundance annotations

Contributors

Project manager; Researcher
  • University of Washington

Description

This dataset builds on the unprocessed NPEGC Trinity de novo metatranscriptome assemblies, which are deposited in a separate Zenodo repository for the raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3.

A full description of these data is published in Scientific Data: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this dataset:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.


Excerpts of key processing steps are shown below, with links to the detailed code in the main GitHub repository: https://github.com/armbrustlab/NPac_euk_gene_catalog


Processing and annotation of the protein-level NPEGC metatranscripts proceeds in six primary steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3
6. Functional annotation against KOfam HMM profiles (KEGG release 104.0) using KofamScan v1.3.0

# Define local NPEGC base directory here:
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"

# Raw assemblies are located in the /assemblies/raw/ directory
# for each of the metatranscriptome projects
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"

# Raw Trinity assemblies for a given ${PROJECT} (one entry of PROJECT_LIST):
RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
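
The per-project raw assembly path can then be resolved by looping over PROJECT_LIST; a minimal sketch using only the variables defined above:

# Minimal sketch: resolve the raw assembly directory for each project in PROJECT_LIST
for PROJECT in $PROJECT_LIST; do
    RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
    echo "Raw assemblies for ${PROJECT}: ${RAW_ASSEMBLY_DIR}"
done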

Translation
We began processing the raw metatranscriptome assemblies with six-frame translation of the nucleotide transcripts into three forward and three reverse reading-frame translations, using the transeq tool from the EMBOSS package. We added a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies). See NPEGC.6tr_frame_selection_clustering.sh for the full code.

Example of six-frame translation using transeq
transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta
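
The cruise and sample prefix can be prepended to each FASTA header with a simple stream edit; the exact command and step in the linked script may differ, but a minimal sketch (assuming ${PREFIX} holds the cruise and sample label, e.g. G1PA_S09C1_3um) is:

# Prepend the cruise/sample prefix to every sequence ID in the translated FASTA
# (illustrative sketch; see NPEGC.6tr_frame_selection_clustering.sh for the actual step)
sed "s/^>/>${PREFIX}_/" 6tr/${PREFIX}.Trinity.6tr.fasta > 6tr/${PREFIX}.Trinity.6tr.renamed.fasta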

Frame selection
We used the custom frame-selection Python script keep_longest_frame.py to determine which translation frame contains the longest coding length and retained that sequence (or multiple sequences in the case of a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for the full code.
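
A hypothetical invocation of the frame-selection step (the real arguments and output naming are defined in the linked script) could look like:

# Hypothetical call to the custom frame-selection script; the argument shown here
# is illustrative only -- consult NPEGC.6tr_frame_selection_clustering.sh for the exact usage
python keep_longest_frame.py 6tr/${PREFIX}.Trinity.6tr.fasta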

Clustering by sequence identity
To reduce redundancy from near-identical sequences, we clustered the protein sequences at 99% sequence identity and retained each cluster representative in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for the full linclust/mmseqs clustering code.

Sample of linclust clustering script: core mmseqs function
function NPEGC_linclust {
# make an index of the fasta file:
$MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
# cluster sequences at $MIN_SEQ_ID
$MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
# retrieve cluster representatives:
$MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
# generate flat FASTA output with cluster reps
$MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
}
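
A brief usage sketch for the function above; the variable values here are illustrative assumptions, not values taken from the original script:

# Illustrative driver for NPEGC_linclust (all values below are assumptions for demonstration)
MMSEQS_DIR="/opt/mmseqs/bin"                # path to the mmseqs binary
FASTA_PATH="${NPEGC_DIR}/G1PA/assemblies"   # directory holding the frame-selected FASTA
FASTA_FILE="NPac.G1PA.bf100.fasta"          # frame-selected protein FASTA
STUDY="G1PA"
MIN_SEQ_ID=0.99                             # 99% sequence identity threshold
NPEGC_linclust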

Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (.bf100.id99.aa.fasta.gz)
    NPac.G1PA.bf100.id99.aa.fasta.gz
    NPac.G2PA.bf100.id99.aa.fasta.gz
    NPac.G3PA.bf100.id99.aa.fasta.gz
    NPac.G3PA_diel.bf100.id99.aa.fasta.gz
    NPac.D1PA.bf100.id99.aa.fasta.gz
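
A downloaded file can be checked against the MD5 listed under Files, and the number of cluster representatives counted, with standard shell tools (shown here for G1PA; not part of the original pipeline):

# Verify the download and count protein sequences in the clustered FASTA
md5sum NPac.G1PA.bf100.id99.aa.fasta.gz
zcat NPac.G1PA.bf100.id99.aa.fasta.gz | grep -c '^>'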

MarFERReT + MARMICRODB taxonomic annotation with DIAMOND

Taxonomy was inferred for the NPEGC metatranscripts using the DIAMOND fast protein aligner against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for a full description of the DIAMOND annotation.

Excerpt of core DIAMOND function:
function NPEGC_diamond {
# FASTA filename for $STUDY
FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# Output filename for LCA results in lca.tab file:
LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
echo "Beginning ${STUDY}"
singularity exec --no-home --bind ${DATA_DIR} \
        "${CONTAINER_DIR}/diamond.sif" diamond blastp \
        -c 4 --threads $N_THREADS \
        --db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
        --memory-limit 110 \
        --query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
}
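
A brief driver sketch for the function above; the database path, e-value cutoff and thread count are illustrative assumptions rather than values from the original script:

# Illustrative driver for NPEGC_diamond (values below are assumptions for demonstration)
DATA_DIR="${NPEGC_DIR}"
CONTAINER_DIR="/opt/containers"                  # location of the diamond.sif image
MFT_MMDB_DMND_DB="MarFERReT_v1.1_MMDB.dmnd"      # combined MarFERReT + MARMICRODB DIAMOND database
EVALUE="1e-5"
N_THREADS=16
for STUDY in $PROJECT_LIST; do
    NPEGC_diamond
done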

Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor (LCA) predictions with NCBI Taxonomy against the combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
    NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
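
The LCA tables are three-column, tab-separated files in DIAMOND's taxonomic classification format (-f 102): query sequence ID, assigned NCBI taxID (0 if unclassified), and the e-value of the best alignment. For example, assignments per taxID can be tallied from a downloaded table with:

# Tally LCA assignments per NCBI taxID (column 2); taxID 0 means unclassified
zcat NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz | \
    awk -F'\t' '{n[$2]++} END {for (t in n) print t, n[t]}' | \
    sort -k2,2nr | head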

Pfam 35.0 functional annotation using HMMER3
Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3. Pfam annotation code is documented here: NPEGC.hmmer_function.sh

Excerpt of core hmmsearch function:

function NPEGC_hmmer {
# Define input FASTA
INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# hmmsearch call:
hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
# compress output file:
gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
}
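
The resulting files follow the standard HMMER domtblout layout, in which the target name (column 1) is the protein sequence ID and the query name and accession (columns 4 and 5) identify the Pfam family. A quick extraction of sequence-to-Pfam assignments from a downloaded table:

# List protein ID, Pfam family name and Pfam accession from the domain table
# ('#' lines are headers/comments in domtblout format)
zcat G1PA.Pfam35.domtblout.tab.gz | grep -v '^#' | awk '{print $1, $4, $5}' | head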

Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
    G1PA.Pfam35.domtblout.tab.gz
    G2PA.Pfam35.domtblout.tab.gz
    G3PA.Pfam35.domtblout.tab.gz
    G3PA_diel.Pfam35.domtblout.tab.gz
    D1PA.Pfam35.domtblout.tab.gz

KEGG functional annotation using KofamScan v1.3.0

Clustered protein sequences were annotated against the KOfam collection (KEGG release 104.0) of 20,819 protein family Hidden Markov Models (HMMs) using KofamScan, the command-line version of KofamKOALA. KOfam annotation code is documented here: NPEGC.kofamscan_function.sh

Excerpt of core NPEGC_kofam function:

# Core function to perform KofamScan annotation
function NPEGC_kofam {
    # Define input FASTA
    local INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"

    # KofamScan call
    ${KOFAM_DIR}/kofam_scan-1.3.0/exec_annotation -f detail-tsv -E ${EVALUE} -o ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv ${FASTA_DIR}/${INPUT_FASTA}

    # Keep best hit (data is already sorted by KofamScan)
    sort -uk1,1 ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv > ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv

    # Compress output file
    gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv

    # Compress best.kofam output file
    gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv
}


Hits with a score > 30 were then filtered in R to produce the final annotation tables.
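
A rough shell equivalent of that R filtering step (assuming the score is the fifth tab-separated field of the KofamScan detail-tsv output) would be:

# Approximate shell version of the R filtering step: keep best-hit rows with score > 30
# (assumes the score is field 5 of the detail-tsv output; the original filtering was done in R)
zcat NPac.G1PA.bf100.id99.aa.best.kofam.tsv.gz | awk -F'\t' '$5+0 > 30' > NPac.G1PA.bf100.id99.aa.best.Kofam.incT30.tsv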

Corresponding files uploaded to this repository: Gzip-compressed KofamScan best-hit tables filtered at score > 30 (*.best.Kofam.incT30.csv.gz):
    NPac.G1PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G2PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G3PA.UW.bf100.id99.aa.best.Kofam.incT30.csv.gz
    NPac.G3PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz
    NPac.D1PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz

The full KofamScan tables with score > 30 are deposited here: https://zenodo.org/records/13743267

Files (29.0 GB)

MD5 checksum                            Size
md5:0c5da5f7d452ba75772f95d539b4afe5    596.7 MB
md5:d4535def95a5de06ed142f1ab1a15b8d    477.8 MB
md5:4ce03cab6e4cab21dd976063b865dc4f    616.0 MB
md5:f6bb78ba55eeb10104f8b4ce45b285e9    569.5 MB
md5:3fae82dae3f986957dbd52a5545a1deb    371.8 MB
md5:c7f99dd654070d79d236647838170dfe    5.1 GB
md5:f1cdc989a5b391c051c737b1a89a9b5e    344.1 MB
md5:30e0c16af89a4735aaff7240b530b4f9    461.6 MB
md5:9877247ccd9e833f4b95610d1d3f9b67    255.2 MB
md5:e914cdc30518a2555b932996a10747e1    4.4 GB
md5:ea7bce18bce8470a62075aba0f2e3017    351.3 MB
md5:7820e3c522bfb92450644447de12f474    312.9 MB
md5:7c1e14fb1f2a6cc54160e7ceb53e0f7e    5.8 GB
md5:92739bfc47d23937770f4f08ef3c8372    463.0 MB
md5:90f85fcaf34e176964e3a30b861b38ca    4.7 GB
md5:495bf4d59a2c00605722d4bde54c67f2    211.9 MB
md5:83637e8d4177cb562003b2b5111f593b    367.3 MB
md5:4723fe8dccd0cb9e40ef11e5a3555a22    214.3 MB
md5:c5a5a9431e39eb58dcb8b50eba49c615    3.1 GB
md5:0c08ef9bba4297016e8c051bf238e25e    234.6 MB

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.7332795 (DOI)