MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

doi:10.5281/zenodo.10170983

Published November 21, 2023 | Version 1.1

Dataset Open

1. University of Washington

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT.

MarFERReT is linked to a code repository hosting containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT here: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (https://github.com/armbrustlab/marferret).

The following MarFERReT data products are available in this repository:

MarFERReT.v1.metadata.csv
This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier.
accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.
marferret_name: A human- and machine-friendly string derived from the NCBI Taxonomy organism name, maintaining strain-level designation wherever possible.
tax_id: The NCBI Taxonomy ID (taxID).
pr2_accession: Best-matching PR2 accession ID associated with entry
pr2_rank: The lowest shared rank between the entry and the pr2_accession
pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession
data_type: Type of sequence data: transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).
data_source: Online location of sequence data: the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).
source_link: URL where the original sequence data and/or metadata was collected.
pub_year: Year of data release or publication of linked reference.
ref_link: PubMed URL of the published reference for the entry, if available.
ref_doi: DOI of entry data from source, if available.
source_filename: Name of the original sequence file from the data source.
seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.
n_seqs_raw: Number of sequences in the original sequence file.
source_name: Full organism name from entry source
original_taxID: Original NCBI taxID from entry data source metadata, if available
alias: Additional identifiers for the entry, if available
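
As a minimal sketch of how the 'accepted' field might be used, the following Python keeps only the accepted rows of the metadata table. It assumes the CSV has a header row whose names match the fields listed above; the function name `accepted_entries` and the sample rows are hypothetical and for illustration only.

```python
import csv
import io

def accepted_entries(handle):
    """Yield metadata rows whose 'accepted' field is 'Y'."""
    for row in csv.DictReader(handle):
        if row["accepted"].strip().upper() == "Y":
            yield row

# Made-up sample with a subset of the columns, for illustration only.
sample = io.StringIO(
    "entry_id,accepted,marferret_name,tax_id\n"
    "1,Y,Example_organism_A,1111\n"
    "2,N,Example_organism_B,2222\n"
)
kept = list(accepted_entries(sample))
print([row["entry_id"] for row in kept])
```

To run this against the real file, pass a handle opened with `open("MarFERReT.v1.metadata.csv", newline="")` instead of the in-memory sample.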

MarFERReT.v1.entry_curation.csv
This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier
marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.
tax_id: Verified NCBI taxID used in MarFERReT
taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)
taxID_notes: Notes on the original_taxID
n_seqs_raw: Number of sequences in the original sequence file
n_pfams: Number of Pfam domains identified in protein sequences
qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).
flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.
VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).
flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.
rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.
rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.
flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.
flag_sum: Number of flag columns (`qc_flag`, `flag_Lasek`, `flag_VanVlierberghe`, and `flag_rp63`) in which the entry is flagged. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).
accepted: Acceptance into the final MarFERReT build (Y or N).
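
The relationship between the flag columns, 'flag_sum' and 'accepted' can be sketched as follows. This is an illustration under stated assumptions, not the project's own curation code: it assumes an unflagged cell is empty or holds 'NA', and the function name `summarize_flags` is hypothetical.

```python
FLAG_COLUMNS = ("qc_flag", "flag_Lasek", "flag_VanVlierberghe", "flag_rp63")

def summarize_flags(row):
    """Return (flag_sum, accepted) for one curation row given as a dict."""
    # Assumption: a column with no flag is empty, missing, or 'NA'.
    flag_sum = sum(
        1 for col in FLAG_COLUMNS
        if (row.get(col) or "").strip() not in ("", "NA")
    )
    # Any flag nominally rejects the entry; no flags means accepted.
    return flag_sum, "Y" if flag_sum == 0 else "N"
```

Note that this counts flagged columns rather than individual flags, so an entry carrying both LOW_SEQS and LOW_PFAMS in 'qc_flag' contributes one to the sum.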

MarFERReT.v1.RP63_QC_estimates.csv
This CSV file contains the results of the 'RP63' cross-contamination check using ribosomal proteins. The lineage bin columns are the taxonomic categories that define whether a query sequence is placed within or outside the expected lineage.

entry_handle: Human-readable tag concatenating the MarFERReT 'entry_id' with the 'marferret_name' (from MarFERReT.v1.metadata.csv)
entry_id: Unique MarFERReT sequence entry identifier
tax_id: The NCBI Taxonomy ID (taxID).
n_seqs: Number of protein sequences annotated as a Pfam ribosomal protein family
n_pfams: Number of unique Pfam protein families
tax_group: The expected lineage of this entry sample from the 'predefined lineage' categories below
contam_pct: The percentage of ribosomal protein sequences identified in a lineage other than the expected 'tax_group' lineage.
[lineage bins]: Series of 21 columns: Amoebozoa, Ciliophora, Colpodellida, Cryptophyceae, Dinophyceae, Euglenozoa, Glaucocystophyceae, Haptophyta, Heterolobosea, Opisthokonta, Palpitomonas, Perkinsozoa, Rhizaria, Rhodophyta, Stramenopiles, Viridiplantae, Bacteria, Archaea, Viruses, Other, Unknown.
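
One way to read 'contam_pct' is as the share of ribosomal protein sequences that fall outside the expected lineage bin. The sketch below rests on two assumptions that this record does not confirm: that each lineage bin column holds a sequence count, and that every bin other than 'tax_group' (including Other and Unknown) counts as outside the expected lineage. The function name `contamination_pct` is hypothetical.

```python
LINEAGE_BINS = [
    "Amoebozoa", "Ciliophora", "Colpodellida", "Cryptophyceae", "Dinophyceae",
    "Euglenozoa", "Glaucocystophyceae", "Haptophyta", "Heterolobosea",
    "Opisthokonta", "Palpitomonas", "Perkinsozoa", "Rhizaria", "Rhodophyta",
    "Stramenopiles", "Viridiplantae", "Bacteria", "Archaea", "Viruses",
    "Other", "Unknown",
]

def contamination_pct(row, expected):
    """Percent of binned sequences outside the expected lineage, or None if empty."""
    # Assumption: bin columns hold counts; missing or empty cells count as zero.
    counts = {name: float(row.get(name) or 0) for name in LINEAGE_BINS}
    total = sum(counts.values())
    if total == 0:
        return None
    return 100.0 * (total - counts[expected]) / total
```

Comparing the result with the published 'contam_pct' column for a few rows would show whether these assumptions hold for the real file.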

MarFERReT.v1.proteins.faa.gz
This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).
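
The defline convention can be checked with a short script. This sketch assumes the identifier is the first whitespace-delimited token of each defline and has the literal form 'mft' followed by ten digits; the helper names `read_fasta_ids` and `count_ids` are hypothetical.

```python
import gzip
import re

# Assumed identifier pattern: 'mft' followed by exactly ten digits.
MFT_ID = re.compile(r"^mft\d{10}$")

def read_fasta_ids(lines):
    """Yield the first token of every FASTA defline in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            yield parts[0] if parts else ""

def count_ids(path):
    """Return (total, well_formed) identifier counts for a gzipped FASTA file."""
    total = well_formed = 0
    with gzip.open(path, "rt") as handle:
        for seq_id in read_fasta_ids(handle):
            total += 1
            well_formed += bool(MFT_ID.match(seq_id))
    return total, well_formed
```

Calling `count_ids("MarFERReT.v1.proteins.faa.gz")` streams the file line by line, so memory use stays small even for a file of this size.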

MarFERReT.v1.taxonomies.tab.gz
This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

accession: (NA)
accession.version: The unique MarFERReT sequence identifier ('mftX').
taxid: The NCBI Taxonomy ID associated with this reference sequence.
gi: (NA).
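
For use outside DIAMOND, the same table can be read into a sequence-to-taxID lookup. This is a minimal sketch that assumes the file begins with a header line naming the four columns above; the function name `load_taxid_map` is hypothetical.

```python
import csv
import gzip

def load_taxid_map(lines):
    """Map each 'accession.version' value to its integer 'taxid'."""
    # Assumption: the first line is a tab-separated header with these names.
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}

def load_taxid_map_from_gzip(path):
    """Open a gzipped taxonomy table and build the lookup."""
    with gzip.open(path, "rt", newline="") as handle:
        return load_taxid_map(handle)
```

With one row per protein sequence, the full table holds close to 28 million rows, so an in-memory dictionary is only practical on machines with ample RAM; filtering to the identifiers of interest while reading is a lighter alternative.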

MarFERReT.v1.proteins_info.tab.gz
This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

aa_id: the unique identifier for each MarFERReT protein sequence.
entry_id: The unique numeric identifier for each MarFERReT entry.
source_defline: The original, unformatted sequence identifier

MarFERReT.candidate_entry_Pfam_annotations.tar.gz
This Gzip-compressed archive contains the raw HMMER3 output from the search of Pfam 34.0 HMM profiles against the full set of protein sequences from candidate entries. The archive contains one file per entry, with the suffix '.Pfam34.domtblout.tab' and a prefix built from the 'entry_id' and 'marferret_name' values in MarFERReT.v1.metadata.csv. The 'domtblout.tab' files are the output of hmmsearch run with the --domtblout parameter. Each contains 3 header and 10 footer rows beginning with '#', plus one row per hmmsearch match with 22 whitespace-delimited fields followed by a target sequence description (see the HMMER user guide for more information on the hmmsearch output file formats). The 'target name' (original sequence identifier from MarFERReT.v1.proteins_info.tab.gz), 'query name' (Pfam name), 'accession' (Pfam ID), 'E-value' and 'score' (full sequence match scores) are retained in downstream data products.
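
The row layout described above can be parsed with a few lines of Python. This sketch skips the '#' rows, splits each match row into its 22 fields plus the free-text description, and keeps the five values named above. The field positions follow the standard hmmsearch domain table layout; the function name `parse_domtblout` and the output key names are hypothetical.

```python
def parse_domtblout(lines):
    """Yield one dict per hmmsearch match row from a --domtblout file."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # header, footer, and blank rows
        # Split at most 22 times so the trailing description stays intact.
        fields = line.rstrip("\n").split(None, 22)
        if len(fields) < 22:
            raise ValueError(f"expected at least 22 fields, got {len(fields)}")
        yield {
            "target_name": fields[0],   # original sequence identifier
            "query_name": fields[3],    # Pfam name
            "accession": fields[4],     # Pfam ID
            "evalue": float(fields[6]), # full-sequence E-value
            "score": float(fields[7]),  # full-sequence bit score
            "description": fields[22] if len(fields) > 22 else "",
        }
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on a member read from the tar archive.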

MarFERReT.v1.best_pfam.csv.gz

This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the raw hmmsearch annotations in MarFERReT.candidate_entry_Pfam_annotations.tar.gz. This file contains the following fields:

aa_id: The unique MarFERReT protein sequence ID ('mftX').
entry_id: Unique MarFERReT sequence entry identifier
source_defline: Original FASTA sequence identifier
pfam_name: The shorthand Pfam protein family name.
pfam_id: The Pfam identifier.
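
Selecting one annotation per sequence can be sketched as below. The sketch assumes that "best-scoring" means the highest full-sequence bit score, which this record does not spell out, and that each hit is a dict with hypothetical 'target_name' and 'score' keys; the function name `best_hits` is also hypothetical.

```python
def best_hits(hits):
    """Return {target_name: hit}, keeping the highest-scoring hit per sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["target_name"])
        # Assumption: a higher full-sequence bit score is a better match.
        # Ties keep the hit seen first.
        if current is None or hit["score"] > current["score"]:
            best[hit["target_name"]] = hit
    return best
```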

MarFERReT.v1.entry_pfam_sums.csv.gz

This Gzip-compressed CSV file contains a reduced version of MarFERReT.v1.best_pfam.csv.gz, grouped by `entry_id` and `pfam_id` to summarize the number of sequences (`n_seqs`) for each unique entry_id-pfam_id pair. It contains the `entry_id`, `pfam_id`, `pfam_name` and `n_seqs` columns.
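
The grouping step described above amounts to counting rows per `entry_id` and `pfam_id` pair. A minimal sketch, assuming each input row is a dict with the `entry_id`, `pfam_id` and `pfam_name` keys (the function name `entry_pfam_sums` is hypothetical):

```python
from collections import Counter

def entry_pfam_sums(rows):
    """Count sequences per (entry_id, pfam_id, pfam_name) combination."""
    counts = Counter(
        (row["entry_id"], row["pfam_id"], row["pfam_name"]) for row in rows
    )
    return [
        {"entry_id": entry, "pfam_id": pfam, "pfam_name": name, "n_seqs": n}
        for (entry, pfam, name), n in sorted(counts.items())
    ]
```

Rows from MarFERReT.v1.best_pfam.csv.gz could be supplied by wrapping `gzip.open(path, "rt", newline="")` in `csv.DictReader`.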

MarFERReT.v1.core_genes.csv

This CSV file contains the core transcribed gene (CTG) catalog derived from MarFERReT transcribed reference sequence data (see Methods) to be used in environmental metatranscriptome analysis in conjunction with other MarFERReT data products. The columns contain the following values:

lineage: Name of major marine microbial eukaryote lineage
n_taxa: Number of species- and strain-level taxa in which this Pfam is observed
pfam_id: Pfam protein family identifier
frequency: Proportion of species (n_species) in the lineage in which pfam_id is observed
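
As an illustration of how the catalog might be queried, the sketch below lists the Pfam IDs of one lineage at or above a frequency cutoff. It assumes rows are dicts keyed by the column names above; the function name `core_pfams` and any cutoff value passed to it are hypothetical choices, not values taken from this record.

```python
def core_pfams(rows, lineage, min_frequency):
    """Return sorted pfam_id values for a lineage at or above a frequency cutoff."""
    return sorted(
        row["pfam_id"]
        for row in rows
        if row["lineage"] == lineage and float(row["frequency"]) >= min_frequency
    )
```

The lineage string must match the spelling used in the 'lineage' column exactly, since the comparison is case-sensitive.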

Files (6.4 GB)

Name	MD5	Size
MarFERReT.candidate_entry_Pfam_annotations.tar.gz	14b78f946098eca6e12c2253ea264436	1.2 GB
MarFERReT.v1.best_pfam.csv.gz	1b5ab5ee3a79be8a11092b6795b98120	131.9 MB
MarFERReT.v1.core_genes.csv	b77d4feaf88bf005129ae245de70d40b	316.1 kB
MarFERReT.v1.curation.csv	2e6321bb878c30edbfa4412a9203ba3b	96.4 kB
MarFERReT.v1.entry_pfam_sums.csv.gz	1219ba6b951c57c9d336f2f4fd9d9479	18.6 MB
MarFERReT.v1.metadata.csv	cf438beef5924e0051f728528a4e7165	384.3 kB
MarFERReT.v1.proteins.faa.gz	db804c5d3071343367949203e7f5e59d	4.7 GB
MarFERReT.v1.proteins_info.tab.gz	a02716eed7ef6e8a4b05c72a889ebef3	246.2 MB
MarFERReT.v1.RP63_QC_estimates.csv	68a05fc270c78360f554e5b3fc137472	94.8 kB
MarFERReT.v1.taxonomies.tab.gz	b36ed2da2921e1cd4eaf1126c6dc9294	90.6 MB
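
The MD5 checksums listed with each file can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_hex` and `verify` are hypothetical, and the file is read in chunks so that the multi-gigabyte archives do not need to fit in memory.

```python
import hashlib
import io

def md5_hex(handle, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_md5):
    """Return True if the file at 'path' matches the expected MD5 checksum."""
    with open(path, "rb") as handle:
        return md5_hex(handle) == expected_md5.strip().lower()

# In-memory demonstration on three bytes rather than a real download.
print(md5_hex(io.BytesIO(b"abc")))
```

For example, `verify("MarFERReT.v1.core_genes.csv", "b77d4feaf88bf005129ae245de70d40b")` would check the core gene catalog against the checksum given for it in this record.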

