Published November 21, 2023 | Version 1.1
Dataset Open

MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

  • 1. University of Washington


Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT.

MarFERReT is linked to a code repository hosting containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT here:

The raw source data for the 902 candidate entries considered for MarFERReT v1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each entry is listed in MarFERReT.v1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

The following MarFERReT data products are available in this repository:

This CSV file (MarFERReT.v1.metadata.csv) contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

  1. entry_id: Unique MarFERReT sequence entry identifier.
  2. accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.
  3. marferret_name: A human- and machine-friendly string derived from the NCBI Taxonomy organism name, maintaining strain-level designation wherever possible.
  4. tax_id: The NCBI Taxonomy ID (taxID).
  5. pr2_accession: Best-matching PR2 accession ID associated with entry
  6. pr2_rank: The lowest shared rank between the entry and the pr2_accession
  7. pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession
  8. data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).
  9. data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the repository (, MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).
  10. source_link: URL where the original sequence data and/or metadata was collected.
  11. pub_year: Year of data release or publication of linked reference.
  12. ref_link: PubMed URL of the published reference for the entry, if available.
  13. ref_doi: DOI of entry data from source, if available.
  14. source_filename: Name of the original sequence file from the data source.
  15. seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.
  16. n_seqs_raw: Number of sequences in the original sequence file.
  17. source_name: Full organism name from entry source
  18. original_taxID: Original NCBI taxID from entry data source metadata, if available
  19. alias: Additional identifiers for the entry, if available
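
As an illustration of how the 'accepted' field can drive a custom build, the following minimal sketch filters metadata rows to the accepted entries. The sample rows and organism names are hypothetical placeholders, not real MarFERReT records, and only the column names are taken from the field list above.

```python
import csv
import io


def accepted_entries(csv_text):
    """Return the metadata rows whose 'accepted' field is 'Y'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["accepted"].strip().upper() == "Y"]


# Hypothetical excerpt using a subset of the documented column names.
sample = (
    "entry_id,accepted,marferret_name,tax_id,data_type\n"
    "1,Y,Example_organism_A,1111,TSA\n"
    "2,N,Example_organism_B,2222,genome\n"
)

kept = accepted_entries(sample)
print([row["entry_id"] for row in kept])
```

To process the real file, the same function could be given the text read from the metadata CSV.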

This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

  1. entry_id: Unique MarFERReT sequence entry identifier
  2. marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.
  3. tax_id: Verified NCBI taxID used in MarFERReT
  4. taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)
  5. taxID_notes: Notes on the original_taxID
  6. n_seqs_raw: Number of sequences in the original sequence file
  7. n_pfams: Number of Pfam domains identified in protein sequences
  8. qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).
  9. flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.
  10. VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al. (2021).
  11. flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.
  12. rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.
  13. rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.
  14. flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct'  values over 50%: FLAG_RP63.
  15. flag_sum: Number of flag columns (`qc_flag`, `flag_Lasek`, `flag_VanVlierberghe`, and `flag_rp63`) that contain a flag. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).
  16. accepted: Acceptance into the final MarFERReT build (Y or N).
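
The flag logic described in the fields above can be sketched as follows. This is an illustration under stated assumptions, not the project's curation code: the function name, the ';' separator used when both QC flags apply, and the handling of missing contamination values are all hypothetical. Only the thresholds and flag names come from the field descriptions.

```python
def curation_summary(n_seqs_raw, n_pfams, lasek_flagged=False,
                     vv_contam_pct=None, rp63_contam_pct=None):
    """Derive flag columns, flag_sum and the nominal 'accepted' value."""
    qc = []
    if n_seqs_raw < 1200:      # LOW_SEQS: fewer than 1,200 raw sequences
        qc.append("LOW_SEQS")
    if n_pfams < 500:          # LOW_PFAMS: fewer than 500 Pfam annotations
        qc.append("LOW_PFAMS")

    def over_50(pct):
        # Missing estimates (None) are assumed not to raise a flag.
        return pct is not None and pct > 50

    columns = {
        "qc_flag": ";".join(qc),  # separator is an assumption
        "flag_Lasek": "FLAG_LASEK" if lasek_flagged else "",
        "flag_VanVlierberghe": "FLAG_VV" if over_50(vv_contam_pct) else "",
        "flag_rp63": "FLAG_RP63" if over_50(rp63_contam_pct) else "",
    }
    # flag_sum counts flag columns that are non-empty, not individual flags.
    flag_sum = sum(1 for value in columns.values() if value)
    columns["flag_sum"] = flag_sum
    columns["accepted"] = "N" if flag_sum else "Y"
    return columns


print(curation_summary(5000, 2000))
print(curation_summary(800, 300, vv_contam_pct=75.0))
```

Because rejection is described as nominal, the 'accepted' value in the distributed file may differ from this derived value for individual entries.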

This CSV file contains the results of the 'RP63' cross-contamination check using ribosomal proteins. The lineage bin columns are the taxonomic categories that define whether a query sequence is placed within or outside the expected lineage.  

  1. entry_handle: Human-readable tag concatenating the MarFERReT 'entry_id' with the 'marferret_name' (from MarFERReT.v1.metadata.csv)
  2. entry_id: Unique MarFERReT sequence entry identifier
  3. tax_id: The NCBI Taxonomy ID (taxID).
  4. n_seqs: Number of protein sequences annotated as a Pfam ribosomal protein family
  5. n_pfams: Number of unique Pfam protein families
  6. tax_group: The expected lineage of this entry sample from the 'predefined lineage' categories below
  7. contam_pct: The percentage of ribosomal protein sequences identified in a lineage other than the expected 'tax_group' lineage.
  8. [lineage bins]: Series of 21 columns: Amoebozoa, Ciliophora, Colpodellida, Cryptophyceae, Dinophyceae, Euglenozoa, Glaucocystophyceae, Haptophyta, Heterolobosea, Opisthokonta, Palpitomonas, Perkinsozoa, Rhizaria, Rhodophyta, Stramenopiles, Viridiplantae, Bacteria, Archaea, Viruses, Other, Unknown.
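
The relationship between the lineage bin counts and 'contam_pct' can be illustrated with a small calculation. The sketch below assumes that every bin other than the expected 'tax_group' counts toward contamination, including Other and Unknown; how the actual analysis treats those two bins is not stated here, and the function name and counts are hypothetical.

```python
def contam_pct(bin_counts, tax_group):
    """Percent of ribosomal protein sequences placed outside tax_group."""
    total = sum(bin_counts.values())
    if total == 0:
        return 0.0  # no ribosomal protein hits: nothing to estimate
    outside = total - bin_counts.get(tax_group, 0)
    return 100.0 * outside / total


# Hypothetical counts for an entry expected to fall in Stramenopiles.
counts = {"Stramenopiles": 45, "Bacteria": 5}
print(contam_pct(counts, "Stramenopiles"))
```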


This Gzip-compressed FASTA file (MarFERReT.v1.proteins.faa.gz) contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses, and some of its columns contain no data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

  1. accession: (NA)
  2. accession.version: The unique MarFERReT sequence identifier ('mftX').
  3. taxid: The NCBI Taxonomy ID associated with this reference sequence.
  4. gi: (NA).
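
For users who want to look up taxonomy IDs outside of DIAMOND, a minimal sketch of reading this four-column layout is shown below. It assumes the file begins with a header row naming the four columns; the sample identifier and taxID are hypothetical.

```python
import csv
import io


def load_taxon_map(handle):
    """Map each 'accession.version' (mftX) to its integer NCBI taxID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


# Hypothetical two-line excerpt: header plus one sequence record.
sample = (
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t1111\tNA\n"
)

taxon_map = load_taxon_map(io.StringIO(sample))
print(taxon_map["mft0000000001"])
```

For the compressed file, a text-mode handle from `gzip.open(path, "rt")` could be passed in place of the `StringIO` object.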

This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

  1. aa_id: the unique identifier for each MarFERReT protein sequence.
  2. entry_id: The unique numeric identifier for each MarFERReT entry.
  3. source_defline: The original, unformatted sequence identifier


This Gzip-compressed archive (MarFERReT.candidate_entry_Pfam_annotations.tar.gz) contains the raw HMMER3 output from the search of Pfam 34.0 HMM profiles against the full set of protein sequences from candidate entries. The archive contains one file per entry, with the suffix '' and a prefix built from the 'entry_id' and 'marferret_name' values from MarFERReT.v1.metadata.csv. The '' files are the output of hmmsearch run with the --domtblout parameter. Each file contains 3 header and 10 footer rows beginning with '#', plus one row per hmmsearch match with 22 whitespace-delimited fields and a target sequence description (see the HMMER documentation for more information on the hmmsearch output file formats). The 'target name' (original sequence identifier from, 'query name' (Pfam name), 'accession' (Pfam ID), 'E-value' and 'score' (full sequence match scores) are retained in downstream data products.
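
A minimal sketch of reading one of these --domtblout files and keeping only the retained fields is given below. The column positions follow the standard HMMER3 domain table layout; the output key names and the sample line (sequence name, Pfam name and accession) are hypothetical.

```python
def parse_domtblout(lines):
    """Yield the retained fields from hmmsearch --domtblout rows."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/footer comment rows and blank lines
        # 22 whitespace-delimited fields, then a free-text description.
        fields = line.rstrip("\n").split(None, 22)
        yield {
            "target_name": fields[0],   # original sequence identifier
            "pfam_name": fields[3],     # 'query name'
            "pfam_id": fields[4],       # query 'accession'
            "evalue": float(fields[6]), # full sequence E-value
            "score": float(fields[7]),  # full sequence score
        }


# Hypothetical row laid out like a domtblout record.
sample = [
    "# target name  accession  tlen  query name ...",
    "seq1 - 300 Example_dom PF00000.0 120 1e-30 100.5 0.1 1 1 "
    "1e-33 1e-30 99.0 0.1 1 120 10 130 5 135 0.95 hypothetical protein",
]

for hit in parse_domtblout(sample):
    print(hit)
```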



This Gzip-compressed CSV file (MarFERReT.v1.best_pfam.csv.gz) contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the raw hmmsearch annotations in MarFERReT.candidate_entry_Pfam_annotations.tar.gz. This file contains the following fields:

  1. aa_id: The unique MarFERReT protein sequence ID ('mftX').
  2. entry_id: Unique MarFERReT sequence entry identifier
  3. source_defline: Original FASTA sequence identifier
  4. pfam_name: The shorthand Pfam protein family name.
  5. pfam_id: The Pfam identifier.
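
Selecting a best-scoring annotation per sequence can be sketched as below. This assumes "best" means the highest full sequence score and that the first hit wins a tie; the project's own selection rule may differ in those details, and the hit records are hypothetical.

```python
def best_hits(hits):
    """Keep the highest-scoring hit for each source sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["source_defline"])
        # Strict '>' keeps the first hit seen when scores tie (assumption).
        if current is None or hit["score"] > current["score"]:
            best[hit["source_defline"]] = hit
    return best


# Hypothetical hits: two candidate Pfam matches for seq1, one for seq2.
hits = [
    {"source_defline": "seq1", "pfam_id": "PF_A", "score": 40.0},
    {"source_defline": "seq1", "pfam_id": "PF_B", "score": 95.5},
    {"source_defline": "seq2", "pfam_id": "PF_A", "score": 12.3},
]

for defline, hit in best_hits(hits).items():
    print(defline, hit["pfam_id"], hit["score"])
```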



This Gzip-compressed CSV file contains a reduced version of MarFERReT.v1.best_pfam.csv.gz, grouped by `entry_id` and `pfam_id` to summarize the number of sequences (`n_seqs`) with each unique entry_id-pfam_id pair. It contains the `entry_id`, `pfam_id`, `pfam_name` and `n_seqs` columns.
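
The grouping described here amounts to counting rows per (`entry_id`, `pfam_id`) pair. A minimal sketch, using hypothetical rows shaped like the best-scoring annotation table:

```python
from collections import Counter


def count_pfams_per_entry(rows):
    """Count sequences for each unique (entry_id, pfam_id) pair."""
    return Counter((row["entry_id"], row["pfam_id"]) for row in rows)


# Hypothetical best-hit rows; only the two grouping columns are needed.
rows = [
    {"entry_id": "1", "pfam_id": "PF_A"},
    {"entry_id": "1", "pfam_id": "PF_A"},
    {"entry_id": "1", "pfam_id": "PF_B"},
    {"entry_id": "2", "pfam_id": "PF_A"},
]

for (entry_id, pfam_id), n_seqs in sorted(count_pfams_per_entry(rows).items()):
    print(entry_id, pfam_id, n_seqs)
```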



This CSV file contains the core transcribed gene (CTG) catalog derived from MarFERReT transcribed reference sequence data (see Methods) to be used in environmental metatranscriptome analysis in conjunction with other MarFERReT data products.  The columns contain the following values:

  1. lineage: Name of major marine microbial eukaryote lineage
  2. n_taxa: Number of species- and strain-level taxa in which this Pfam is observed
  3. pfam_id: Pfam protein family identifier
  4. frequency: Proportion of species (n_species) in lineage where pfam_id is observed
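
To make the 'n_taxa' and 'frequency' columns concrete, the sketch below computes both for a single lineage from a mapping of taxa to observed Pfam IDs. It assumes the denominator is the number of taxa supplied for that lineage; the catalog's actual construction and any inclusion threshold are defined in the Methods, and all names and values here are hypothetical.

```python
def pfam_frequencies(pfams_by_taxon):
    """Return {pfam_id: (n_taxa, frequency)} for one lineage."""
    n_total = len(pfams_by_taxon)
    counts = {}
    for pfams in pfams_by_taxon.values():
        for pfam_id in set(pfams):  # count each taxon once per Pfam
            counts[pfam_id] = counts.get(pfam_id, 0) + 1
    return {pfam_id: (n, n / n_total) for pfam_id, n in counts.items()}


# Hypothetical lineage with two taxa.
lineage = {
    "taxon_a": ["PF_X", "PF_Y"],
    "taxon_b": ["PF_X"],
}

for pfam_id, (n_taxa, frequency) in sorted(pfam_frequencies(lineage).items()):
    print(pfam_id, n_taxa, frequency)
```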



