The North Pacific Eukaryotic Gene Catalog: metatranscriptome assemblies with taxonomy, function and abundance annotations
Creators
- University of Washington
Description
This dataset builds on the unprocessed NPEGC Trinity de novo metatranscriptome assemblies, which are available in the companion Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3
A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.
Excerpts of key processing steps are sampled below, with links to the detailed code in the main GitHub repository: https://github.com/armbrustlab/NPac_euk_gene_catalog
Processing and annotation of the protein-level NPEGC metatranscripts proceeds in six primary steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3
6. Functional annotation against KOfam HMM profiles (KEGG release 104.0) using KofamScan v1.3.0

# Define local NPEGC base directory here:
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"
# Raw assemblies are located in the /assemblies/raw/ directory
# for each of the metatranscriptome projects
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"
# raw Trinity assemblies:
RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
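For context, ${PROJECT} in the path above is not set directly: the per-project directories come from iterating over PROJECT_LIST. A minimal sketch of that loop, reusing the variables defined above:

```shell
# Base directory and project identifiers as defined above;
# each ${PROJECT} value yields one raw-assembly directory.
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"
for PROJECT in ${PROJECT_LIST}; do
  RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
  echo "${RAW_ASSEMBLY_DIR}"
done
```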
Translation
We began processing the raw metatranscriptome assemblies by six-frame translation of the nucleotide transcripts into three forward and three reverse reading-frame translations, using the transeq function in the EMBOSS package. We added a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies). See NPEGC.6tr_frame_selection_clustering.sh for full code description.
Example of six-frame translation using transeq:
transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta
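The ID-prefixing step is a one-line substitution on the FASTA headers. The helper below is an assumed implementation (the actual command lives in NPEGC.6tr_frame_selection_clustering.sh), shown with the S09C1_3um example from the text:

```shell
# Hypothetical helper: prepend a cruise/sample prefix to every FASTA header
# so sequence IDs remain unique when assemblies are pooled downstream.
PREFIX="G1PA_S09C1_3um"
add_sample_prefix() {
  sed "s/^>/>${PREFIX}_/" "$1"
}
```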
Frame selection
We use a custom frame-selection python script keep_longest_frame.py to determine the longest coding length in each open reading frame and retain this sequence (or multiple sequences if there is a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for full code description.
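keep_longest_frame.py itself is not excerpted here; the awk sketch below is a hypothetical stand-in for the core idea only (the frame with the longest stop-free stretch wins). It assumes unwrapped, single-line FASTA and, unlike the real script, keeps only one frame on ties:

```shell
# Sketch of frame selection: for each transcript, report the frame ID
# whose longest '*'-free (stop-free) stretch is maximal.
# Assumes transeq-style IDs ending in _1.._6 and one sequence line per record.
longest_frame_ids() {
  awk '
    /^>/ { id = substr($1, 2); next }
    {
      base = id; sub(/_[1-6]$/, "", base)
      # longest run of residues between stop codons (*)
      n = split($0, seg, "*"); best = 0
      for (i = 1; i <= n; i++) if (length(seg[i]) > best) best = length(seg[i])
      if (best > max[base]) { max[base] = best; keep[base] = id }
    }
    END { for (b in keep) print keep[b] }
  ' "$1"
}
```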
Clustering by sequence identity
To reduce sequence redundancy and near-identical sequences, we cluster protein sequences at the 99% sequence identity level and retain the sequence cluster representative in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for full code description of linclust/mmseqs clustering.
Sample of linclust clustering script: core mmseqs function
function NPEGC_linclust {
# make an index of the fasta file:
$MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
# cluster sequences at $MIN_SEQ_ID
$MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
# retrieve cluster representatives:
$MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
# generate flat FASTA output with cluster reps
$MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
}
Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (*.bf100.id99.aa.fasta.gz)
NPac.G1PA.bf100.id99.aa.fasta.gz
NPac.G2PA.bf100.id99.aa.fasta.gz
NPac.G3PA.bf100.id99.aa.fasta.gz
NPac.G3PA_diel.bf100.id99.aa.fasta.gz
NPac.D1PA.bf100.id99.aa.fasta.gz
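A quick sanity check after downloading any of these files is to count the cluster representatives in the compressed FASTA. This helper is illustrative only, not part of the pipeline:

```shell
# Count sequence records (cluster representatives) in a gzip-compressed FASTA.
count_fasta_records() {
  gzip -cd "$1" | grep -c '^>'
}
```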
MarFERReT + MARMICRODB taxonomic annotation with DIAMOND
Taxonomy was inferred for the NPEGC metatranscripts with the DIAMOND fast read alignment software against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for a full description of the DIAMOND annotation.
Excerpt of core DIAMOND function:
function NPEGC_diamond {
# FASTA filename for $STUDY
FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# Output filename for LCA results in lca.tab file:
LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
echo "Beginning ${STUDY}"
singularity exec --no-home --bind ${DATA_DIR} \
"${CONTAINER_DIR}/diamond.sif" diamond blastp \
-c 4 --threads $N_THREADS \
--db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
--memory-limit 110 \
--query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
}
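The -f 102 output used above is DIAMOND's taxonomic classification format: three tab-separated columns (query ID, NCBI taxid of the lowest common ancestor, e-value), with taxid 0 for unclassified queries. An illustrative way to tally the LCA assignments in one of these tables:

```shell
# Count how many queries were assigned to each LCA taxid,
# most frequent first (column 2 of the -f 102 output).
top_lca_taxa() {
  cut -f2 "$1" | sort | uniq -c | sort -rn
}
```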
Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor predictions with NCBI Taxonomy against the combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
Pfam 35.0 functional annotation using HMMER3
Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3. Pfam annotation code is documented here: NPEGC.hmmer_function.sh
Excerpt of core hmmsearch function:
function NPEGC_hmmer {
# Define input FASTA
INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# hmmsearch call:
hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
# compress output file:
gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
}
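domtblout tables are whitespace-delimited with '#' comment lines; column 1 holds the sequence ID and column 4 the Pfam profile name. An illustrative extractor of (sequence, family) pairs that handles both plain and gzip-compressed tables:

```shell
# Extract sequence-ID / Pfam-family pairs from an hmmsearch --domtblout table.
# gzip -cdf passes non-gzip input through unchanged, so both .tab and
# .tab.gz files work.
domtblout_pairs() {
  gzip -cdf "$1" | awk '!/^#/ { print $1 "\t" $4 }'
}
```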
Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
G1PA.Pfam35.domtblout.tab.gz
G2PA.Pfam35.domtblout.tab.gz
G3PA.Pfam35.domtblout.tab.gz
G3PA_diel.Pfam35.domtblout.tab.gz
D1PA.Pfam35.domtblout.tab.gz
KEGG functional annotation using KofamScan v1.3.0
Clustered protein sequences were annotated against the KEGG collection (release 104.0) of 20,819 protein family Hidden Markov Models (HMMs) using KofamScan v1.3.0, the command-line implementation of KofamKOALA. Kofam annotation code is documented here: NPEGC.kofamscan_function.sh
Excerpt of core NPEGC_kofam function:
# Core function to perform KofamScan annotation
function NPEGC_kofam {
# Define input FASTA
local INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# KofamScan call
${KOFAM_DIR}/kofam_scan-1.3.0/exec_annotation -f detail-tsv -E ${EVALUE} -o ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv ${FASTA_DIR}/${INPUT_FASTA}
# Keep best hit (data is already sorted by KofamScan)
sort -uk1,1 ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv > ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv
# Compress output file
gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv
# Compress best.kofam output file
gzip ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.best.kofam.tsv
}
# filter hits with a score > 30 in R
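The score filter noted above is performed in R in the pipeline; an equivalent awk sketch is shown below. The score-column index (5) is an assumption about the detail-tsv layout (marker, gene name, KO, threshold, score, E-value, definition):

```shell
# Hypothetical stand-in for the R step: keep KofamScan detail-tsv rows
# whose score (assumed column 5) exceeds 30.
filter_kofam_score() {
  awk -F'\t' '$5 + 0 > 30' "$1"
}
```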
Corresponding files uploaded to this repository: Gzip-compressed KofamScan best-hit tables, filtered at score > 30 (*.best.Kofam.incT30.csv.gz):
NPac.G1PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
NPac.G2PA.bf100.id99.aa.best.Kofam.incT30.csv.gz
NPac.G3PA.UW.bf100.id99.aa.best.Kofam.incT30.csv.gz
NPac.G3PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz
NPac.D1PA.diel.bf100.id99.aa.best.kofam.incT30.csv.gz
The full KofamScan tables with score > 30 are deposited here: https://zenodo.org/records/13743267
Files
(29.0 GB)

MD5 checksum | Size
---|---
0c5da5f7d452ba75772f95d539b4afe5 | 596.7 MB
d4535def95a5de06ed142f1ab1a15b8d | 477.8 MB
4ce03cab6e4cab21dd976063b865dc4f | 616.0 MB
f6bb78ba55eeb10104f8b4ce45b285e9 | 569.5 MB
3fae82dae3f986957dbd52a5545a1deb | 371.8 MB
c7f99dd654070d79d236647838170dfe | 5.1 GB
f1cdc989a5b391c051c737b1a89a9b5e | 344.1 MB
30e0c16af89a4735aaff7240b530b4f9 | 461.6 MB
9877247ccd9e833f4b95610d1d3f9b67 | 255.2 MB
e914cdc30518a2555b932996a10747e1 | 4.4 GB
ea7bce18bce8470a62075aba0f2e3017 | 351.3 MB
7820e3c522bfb92450644447de12f474 | 312.9 MB
7c1e14fb1f2a6cc54160e7ceb53e0f7e | 5.8 GB
92739bfc47d23937770f4f08ef3c8372 | 463.0 MB
90f85fcaf34e176964e3a30b861b38ca | 4.7 GB
495bf4d59a2c00605722d4bde54c67f2 | 211.9 MB
83637e8d4177cb562003b2b5111f593b | 367.3 MB
4723fe8dccd0cb9e40ef11e5a3555a22 | 214.3 MB
c5a5a9431e39eb58dcb8b50eba49c615 | 3.1 GB
0c08ef9bba4297016e8c051bf238e25e | 234.6 MB
Additional details
Related works
- Is derived from: Dataset 10.5281/zenodo.7332795 (DOI)