EukRibo: a manually curated eukaryotic 18S rDNA reference database

Berney, Cédric

doi:10.5281/zenodo.6896896

Published July 24, 2022 | Version 2.0

Dataset Open

EukRibo is a manually curated database of reference small-subunit ribosomal RNA gene (18S rDNA) sequences of eukaryotes, specifically aimed at taxonomic annotation of high-throughput metabarcoding datasets. Unlike other reference databases of ribosomal genes, it is not meant to exhaustively capture all publicly available 18S rDNA sequences from the INSDC repositories, but to represent a subset of highly trustable sequences covering the whole known diversity of eukaryotes, with a focus on protists, manually verified taxonomic identifications, and relatively low genetic redundancy.

EukRibo is part of a suite of public resources generated by the UniEuk project (www.unieuk.org), all designed to follow a common taxonomic framework for maximal interoperability. The high taxonomic accuracy of EukRibo, combined with a newly designed, phylogenetically informed annotation approach, allows high-confidence taxonomic annotation of environmental metabarcodes, as well as the identification of new eukaryotic diversity at various taxonomic levels using a connected-components approach.

* * *

Accompanying preprint available at https://doi.org/10.1101/2022.11.03.515105.

* * *

EukRibo ReadMe file, versions 1 and 2

Each EukRibo release consists of 4 files:
- a tsv table containing the taxonomic and other information about the 18S rDNA sequences included in the release
- a fasta file containing the full sequences as retrieved from the INSDC repositories (NCBI, EMBL-EBI/ENA, DDBJ)
- a fasta file containing the variable region V4 extracted from all these sequences (based on the fragment amplified with the Tara-Oceans V4 primers)
- a fasta file containing the variable region V9 extracted from the subset of sequences where it is present (based on the fragment amplified with the Tara-Oceans V9 primers)

EukRibo was primarily built to annotate the EukBank meta-dataset of available V4 metabarcoding datasets, so all sequences included in EukRibo contain the variable region V4.
Only a subset of these sequences (about 75%) also contain the variable region V9; this is because many 18S rDNA sequences in the INSDC repositories stop before the V9 fragment.

Sequences with slightly incomplete V4 or V9 fragments were kept if phylogenetically useful, i.e. if they are the only available representatives of a certain taxonomic lineage.
- V4: we allowed up to 50 missing positions in the relatively conserved area at the 5' end of the V4 fragment (for an average fragment length of about 380 bp); no sequence incomplete at the 3' end of the V4 fragment is included.
- V9: we allowed up to 30 missing positions in the relatively conserved area at the 3' end of the V9 fragment (for an average fragment length of about 135 bp); no sequence incomplete at the 5' end of the V9 fragment is included.
We allowed a higher proportion of missing positions for the V9 region because a stricter threshold would have meant losing too many sequences, including entire taxonomic lineages.
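The thresholds above can be restated as a small check. The following is a minimal sketch, not the procedure used to build EukRibo: the function names and the three labels are illustrative assumptions, and the thresholds are only a necessary condition, since partial fragments were kept only when phylogenetically useful. It also shows why the V9 allowance is proportionally higher (roughly 22% of the average fragment length, against roughly 13% for V4).

```python
# Thresholds and average fragment lengths as stated in the ReadMe.
MAX_MISSING = {"V4": 50, "V9": 30}        # missing positions tolerated
AVERAGE_LENGTH = {"V4": 380, "V9": 135}   # approximate fragment length (bp)


def classify_fragment(region, missing):
    """Label a fragment by its number of missing positions.

    Hypothetical helper: the labels are illustrative, and a 'partial'
    result only means the fragment is within the stated tolerance.
    """
    if missing < 0:
        raise ValueError("missing must be non-negative")
    if missing == 0:
        return "complete"
    if missing <= MAX_MISSING[region]:
        return "partial"
    return "too incomplete"


def max_missing_fraction(region):
    """Largest tolerated gap as a fraction of the average fragment length."""
    return MAX_MISSING[region] / AVERAGE_LENGTH[region]


if __name__ == "__main__":
    for region in ("V4", "V9"):
        print(region, f"{max_missing_fraction(region):.1%}")
```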

Version 1 of EukRibo
This is the starting version of EukRibo that was used for the taxonomic annotation of the EukBank dataset, with taxonomy strings that were fixed as of October 2020.
- Contains 46,345 sequences with a sufficiently complete V4 region; 46,299 with a fully complete V4 region and 46 (about 0.1%) with missing positions at the 5' end.
- Of these, 34,438 also include a sufficiently complete V9 region; 23,226 with a fully complete V9 region and 11,206 (about 33%) with missing positions at the 3' end.

Version 2 of EukRibo
This version of EukRibo was made taxonomically compatible with version 3 of the EukProt database (https://doi.org/10.1101/2020.06.30.180687) and includes taxonomic revisions as of July 2022. Its tsv file also provides additional information on the included selection of sequences that was not given in the tsv file of version 1.
- Contains the exact same selection of sequences as in version 1, with the addition of genus Meteora, the last remaining known supergroup-level eukaryotic lineage for which an 18S rDNA sequence was not previously available. (The Meteora sequence contains the full V4 fragment but does not include a sufficiently complete V9 fragment.)
- Only 34,432 sequences with a sufficiently complete V9 region are now retained (down from 34,438), because 6 previously unrecognised chimeric sequences were identified in which the V9 fragment does not originate from the same organism as the V4 fragment.

Files in EukRibo version 1:
46345_EukRibo.tsv.gz
46345_EukRibo_full_seqs.fas.gz
46345_EukRibo_V4.fas.gz
34438_EukRibo_V9.fas.gz

The tsv file contains 6 columns:
gb_accession - INSDC accession number of the sequence
supergroup, taxogroup1, taxogroup2 - binning of the taxa into strictly monophyletic clades of evolutionary and/or ecological significance
UniEuk_taxonomy_string - full UniEuk-compatible taxonomic annotation of the sequence
- an unlimited number of levels is allowed (going down to strain for isolated organisms or to clone for environmental sequences)
- informal names are used for phylogenetically supported clades without formal name
V9 - presence ('Y') or absence ('N') of a sufficiently complete V9 fragment in the sequence

Files in EukRibo version 2:
46346_EukRibo-02.tsv.gz
46346_EukRibo-02_full_seqs.fas.gz
46346_EukRibo-02_V4.fas.gz
34432_EukRibo-02_V9.fas.gz
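
The numeric prefix of each file name appears to match the number of sequences reported for that file in this ReadMe (for example 46,346 and 34,432 for version 2). Under that assumption, a downloaded fasta file can be sanity-checked by counting its header lines. This is a minimal sketch; the function names are illustrative and not part of any EukRibo tooling.

```python
import gzip
import os
import re


def expected_count(filename):
    """Return the leading integer of an EukRibo file name, e.g. 46346."""
    match = re.match(r"(\d+)_EukRibo", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))


def count_fasta_records(path):
    """Count records in a gzip-compressed fasta file (lines starting with '>')."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_release_file(path):
    """True if the record count matches the prefix of the file name.

    Assumption: the prefix is the sequence count, as the ReadMe suggests.
    """
    return count_fasta_records(path) == expected_count(os.path.basename(path))
```

For instance, `check_release_file("46346_EukRibo-02_V4.fas.gz")` would be expected to return True for an intact download, if the assumption about the prefix holds.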

The tsv file now contains 12 columns:
gb_accession, supergroup, taxogroup1, taxogroup2, UniEuk_taxonomy_string
- same columns as in version 1
alternative_strain_names (new) - provides alternative strain/isolate names when known to help cross-linking genetic data coming from the same organism
V4 (new) - indicates whether the V4 fragment is complete ('yes - complete') or missing positions at the 5' end ('yes - partial')
V9 (emended content) - now contains more precise information than in version 1 about whether it is complete ('yes - complete'), missing positions at the 3' end ('yes - partial'), or was excluded, and the 6 possible reasons why ('no - missing', 'no - too incomplete', 'no - chimera', 'no - bad quality', 'no - deletion in V9', 'no - Ns in V9')
EukProt_ID_same_strain (new) - accession of EukProt datasets from the same isolate
EukProt_ID_different_strain (new) - accession of EukProt datasets from a different isolate of the same species
columns_modified_since_previous_version (new) - lists which of the 6 pre-existing columns have modified content compared to version 1
remarks (new) - additional information such as presence of an intron in the V9 fragment, taxonomic identity of the two parts of chimeric sequences, or the presence of Ns or a deletion in the V4 or the V9 fragment (but insufficient to warrant exclusion)
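
The V9 status values described above lend themselves to simple filtering. The sketch below tallies V9 statuses and collects the accessions with a usable V9 fragment. It assumes the tsv file has a header row with the column names listed in this ReadMe; the sample rows and accession values are invented placeholders, and only three of the 12 columns are shown.

```python
import csv
import io
from collections import Counter

# Invented excerpt for illustration; real rows have 12 tab-separated columns.
SAMPLE_TSV = (
    "gb_accession\tV4\tV9\n"
    "ACC0001\tyes - complete\tyes - complete\n"
    "ACC0002\tyes - partial\tno - missing\n"
    "ACC0003\tyes - complete\tyes - partial\n"
    "ACC0004\tyes - complete\tno - chimera\n"
)


def v9_usable(status):
    """True for 'yes - complete' and 'yes - partial', False for any 'no - ...'."""
    return status.startswith("yes")


def summarise_v9(handle):
    """Return (Counter of V9 statuses, list of accessions with a usable V9)."""
    counts = Counter()
    usable = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["V9"]] += 1
        if v9_usable(row["V9"]):
            usable.append(row["gb_accession"])
    return counts, usable


counts, usable = summarise_v9(io.StringIO(SAMPLE_TSV))
```

For the real file, the in-memory handle could be replaced with `gzip.open(path, "rt", newline="")` from the standard library.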

Files (19.9 MB)

Name	md5	Size
34432_EukRibo-02_V9_2022-07-22.fas.gz	198ec9494a040e5fc98efa55f0141a41	1.4 MB
46346_EukRibo-02_2022-07-22.tsv.gz	1171c9953f24f40c9f90ddf8b141cc11	825.9 kB
46346_EukRibo-02_full_seqs_2022-07-22.fas.gz	11a5613891ce51d267d4db4c0fb03fc1	14.2 MB
46346_EukRibo-02_V4_2022-07-22.fas.gz	d3a836ccee254f72b7a63ce33afdc604	3.4 MB
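
The md5 values listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the assumption that all four files sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the version 2 files.
EXPECTED_MD5 = {
    "34432_EukRibo-02_V9_2022-07-22.fas.gz": "198ec9494a040e5fc98efa55f0141a41",
    "46346_EukRibo-02_2022-07-22.tsv.gz": "1171c9953f24f40c9f90ddf8b141cc11",
    "46346_EukRibo-02_full_seqs_2022-07-22.fas.gz": "11a5613891ce51d267d4db4c0fb03fc1",
    "46346_EukRibo-02_V4_2022-07-22.fas.gz": "d3a836ccee254f72b7a63ce33afdc604",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True if present with a matching md5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads(".")` in the download directory returns a dictionary in which any False entry marks a missing or corrupted file.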
