Published January 22, 2024 | Version 1.1.1
Dataset Open

MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

  • 1. University of Washington

Description

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT.

The MarFERReT code repository, which hosts the containerized build scripts, installation documentation, use case examples, and information on new versions, is available at https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each entry is listed in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (https://github.com/armbrustlab/marferret).

This repository release contains the database files of MarFERReT v1.1.1, generated with the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh

The following MarFERReT data products are available in this repository:

MarFERReT.v1.1.1.metadata.csv
This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

  1. entry_id: Unique MarFERReT sequence entry identifier.
  2. accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.
  3. marferret_name: A human- and machine-friendly string derived from the NCBI Taxonomy organism name, maintaining strain-level designation wherever possible.
  4. tax_id: The NCBI Taxonomy ID (taxID).
  5. pr2_accession: Best-matching PR2 accession ID associated with entry
  6. pr2_rank: The lowest shared rank between the entry and the pr2_accession
  7. pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession
  8. data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).
  9. data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).
  10. source_link: URL where the original sequence data and/or metadata was collected.
  11. pub_year: Year of data release or publication of linked reference.
  12. ref_link: PubMed URL of the published reference for the entry, if available.
  13. ref_doi: DOI of entry data from source, if available.
  14. source_filename: Name of the original sequence file name from the data source.
  15. seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.
  16. n_seqs_raw: Number of sequences in the original sequence file.
  17. source_name: Full organism name from entry source
  18. original_taxID: Original NCBI taxID from entry data source metadata, if available
  19. alias: Additional identifiers for the entry, if available
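
The `accepted` and `data_type` fields lend themselves to simple filtering. The following is a minimal sketch, assuming the CSV has a header row whose column names match the field names listed above. The helper name `accepted_by_data_type` and the sample rows are illustrative and not part of MarFERReT.

```python
import csv
import io
from collections import Counter


def accepted_by_data_type(handle):
    """Count accepted entries (accepted == 'Y') per data_type value."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if row["accepted"].strip().upper() == "Y":
            counts[row["data_type"]] += 1
    return counts


# Hypothetical sample rows; only three of the 19 columns are shown.
sample = io.StringIO(
    "entry_id,accepted,data_type\n"
    "1,Y,TSA\n"
    "2,N,genome\n"
    "3,Y,SAG\n"
)
print(dict(accepted_by_data_type(sample)))
```

For the real file, the same function can be given `open("MarFERReT.v1.1.1.metadata.csv", newline="")` instead of the in-memory sample.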


MarFERReT.v1.1.1.curation.csv
This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

  1. entry_id: Unique MarFERReT sequence entry identifier
  2. marferret_name: Organism name in human- and machine-friendly format, including additional NCBI Taxonomy strain identifiers if available.
  3. tax_id: Verified NCBI taxID used in MarFERReT
  4. taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)
  5. taxID_notes: Notes on the original_taxID
  6. n_seqs_raw: Number of sequences in the original sequence file
  7. n_pfams: Number of Pfam domains identified in protein sequences
  8. qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).
  9. flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.
  10. VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al. (2021).
  11. flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.
  12. rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.
  13. rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.
  14. flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct'  values over 50%: FLAG_RP63.
  15. flag_sum: Number of flag columns (`qc_flag`, `flag_Lasek`, `flag_VanVlierberghe`, and `flag_rp63`) that contain a flag. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).
  16. accepted: Acceptance into the final MarFERReT build (Y or N).
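
The flag logic described in the field list can be sketched as a small function. The thresholds (1,200 raw sequences, 500 Pfam domains, 50% estimated contamination) come from the descriptions above; the function name, argument names, the `;` separator for multiple QC flags, and the treatment of missing contamination estimates as unflagged are assumptions made for illustration. The curation code in the MarFERReT repository may differ.

```python
def curation_flags(n_seqs_raw, n_pfams, lasek_contaminated=False,
                   vv_contam_pct=None, rp63_contam_pct=None):
    """Return (flags, flag_sum, accepted) for one candidate entry."""
    qc = []
    if n_seqs_raw < 1200:
        qc.append("LOW_SEQS")
    if n_pfams < 500:
        qc.append("LOW_PFAMS")

    def over_half(pct):
        # A missing estimate (None) is treated as "not flagged" (assumption).
        return pct is not None and pct > 50

    flags = {
        "qc_flag": ";".join(qc),  # separator is an assumption
        "flag_Lasek": "FLAG_LASEK" if lasek_contaminated else "",
        "flag_VanVlierberghe": "FLAG_VV" if over_half(vv_contam_pct) else "",
        "flag_rp63": "FLAG_RP63" if over_half(rp63_contam_pct) else "",
    }
    # flag_sum counts flag columns that are non-empty, not individual flags.
    flag_sum = sum(1 for value in flags.values() if value)
    accepted = "Y" if flag_sum == 0 else "N"
    return flags, flag_sum, accepted


# Hypothetical entry: too few sequences and Pfam hits, high VV contamination.
print(curation_flags(800, 300, vv_contam_pct=62.5))
```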

 

MarFERReT.v1.1.1.proteins.faa.gz
This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value). 
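
Because the file holds nearly 28 million sequences, streaming it is usually preferable to loading it whole. The sketch below reads a gzip-compressed FASTA record by record and extracts the identifier, assuming only what is stated above: that the defline contains an ID of the form `mft` followed by ten digits. Any other defline content is not assumed. The helper names and the in-memory sample are illustrative.

```python
import gzip
import io
import re

# 'mft' followed by exactly ten digits, per the description above.
MFT_ID = re.compile(r"mft\d{10}")


def iter_fasta(handle):
    """Yield (defline, sequence) pairs from a text-mode FASTA handle."""
    defline, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def sequence_id(defline):
    """Return the MarFERReT sequence ID found in a defline, or None."""
    match = MFT_ID.search(defline)
    return match.group(0) if match else None


# Tiny in-memory stand-in for MarFERReT.v1.1.1.proteins.faa.gz (made-up data).
raw = b">mft0000000001\nMKT\nAAL\n>mft0000000002\nMSS\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rt") as handle:
    for defline, seq in iter_fasta(handle):
        print(sequence_id(defline), len(seq))
```

For the real file, `gzip.open("MarFERReT.v1.1.1.proteins.faa.gz", "rt")` can replace the in-memory handle.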

 

MarFERReT.v1.1.1.taxonomies.tab.gz
This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses, and therefore contains some columns without data. Each row corresponds to one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns, while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

  1. accession: (NA)
  2. accession.version: The unique MarFERReT sequence identifier ('mftX').
  3. taxid: The NCBI Taxonomy ID associated with this reference sequence.
  4. gi: (NA).
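
The four columns follow the layout of NCBI's prot.accession2taxid files. Outside of DIAMOND, the table can also be used directly to look up the taxID of a sequence. The sketch below assumes the file begins with a header row naming the four columns as listed above; the function name, the optional `wanted` filter, and the sample values are illustrative.

```python
import csv
import io


def load_taxid_map(handle, wanted=None):
    """Map accession.version (the 'mftX' ID) to an integer NCBI taxID.

    If `wanted` is a set of IDs, only those rows are kept, which avoids
    holding all ~28 million rows in memory.
    """
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        seq_id = row["accession.version"]
        if wanted is None or seq_id in wanted:
            mapping[seq_id] = int(row["taxid"])
    return mapping


# Hypothetical rows; 'accession' and 'gi' carry NA as described above.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t2850\tNA\n"
    "NA\tmft0000000002\t35128\tNA\n"
)
print(load_taxid_map(sample, wanted={"mft0000000002"}))
```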

 

MarFERReT.v1.1.1.proteins_info.tab.gz
This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

  1. aa_id: the unique identifier for each MarFERReT protein sequence.
  2. entry_id: The unique numeric identifier for each MarFERReT entry.
  3. source_defline: The original, unformatted sequence identifier

 

MarFERReT.v1.1.1.best_pfam_annotations.csv.gz
This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

  1. aa_id: The unique MarFERReT protein sequence ID ('mftX').
  2. pfam_name: The shorthand Pfam protein family name.
  3. pfam_id: The Pfam identifier.
  4. pfam_eval: HMM profile match E-value.
  5. pfam_score: HMM profile match bit score.
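
To illustrate what "best-scoring" means in practice, the sketch below reduces a list of candidate Pfam hits to one hit per protein by keeping the highest bit score. Ranking by `pfam_score` and keeping the first hit on ties are assumptions; the selection rule actually used by pfam_annotate.sh is documented in the MarFERReT repository. The hit values shown are made up.

```python
def best_hits(hits):
    """Keep, for each aa_id, the hit with the highest pfam_score."""
    best = {}
    for hit in hits:
        current = best.get(hit["aa_id"])
        # Strict '>' keeps the first-seen hit when scores tie (assumption).
        if current is None or hit["pfam_score"] > current["pfam_score"]:
            best[hit["aa_id"]] = hit
    return best


# Hypothetical hits using the field names listed above.
hits = [
    {"aa_id": "mft0000000001", "pfam_name": "ABC_tran", "pfam_id": "PF00005",
     "pfam_eval": 1e-30, "pfam_score": 105.2},
    {"aa_id": "mft0000000001", "pfam_name": "Pkinase", "pfam_id": "PF00069",
     "pfam_eval": 1e-5, "pfam_score": 24.1},
    {"aa_id": "mft0000000002", "pfam_name": "Pkinase", "pfam_id": "PF00069",
     "pfam_eval": 1e-60, "pfam_score": 201.7},
]
for aa_id, hit in best_hits(hits).items():
    print(aa_id, hit["pfam_name"], hit["pfam_score"])
```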


MarFERReT.v1.1.1.dmnd
This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.

Files (14.1 GB)

  • md5:3a8789b1134574a96a518aaccd275320 (121.9 MB)
  • md5:2e6321bb878c30edbfa4412a9203ba3b (96.4 kB)
  • md5:9b0f1b9bc07920beda6cbc816e39d114 (8.9 GB)
  • md5:cf438beef5924e0051f728528a4e7165 (384.3 kB)
  • md5:87b72276ad7dc369185e7978b230c297 (4.7 GB)
  • md5:0379b3773c6e0a43eb9f1767d3b1e2ca (246.1 MB)
  • md5:1911e85928287c13bddd46eb418b5cf1 (90.6 MB)
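
The md5 checksums listed above can be used to confirm that a download completed intact. The sketch below computes a file's MD5 in chunks, since several of the files are multiple gigabytes. The pairing of checksum to file name is not shown in this listing, so the path and expected value are supplied by the user; the function names and the temporary demo file are illustrative.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, listed):
    """Compare a file against a value written as 'md5:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5sum(path) == expected


# Demo on a small temporary file standing in for a downloaded data product.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    print(matches(tmp.name, "md5:" + hashlib.md5(b"hello").hexdigest()))
finally:
    os.remove(tmp.name)
```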

Additional details

Related works

Is version of
Dataset: 10.5281/zenodo.10170983 (DOI)

Dates

Submitted
2024-01-22