pr2database/pr2database: PR2 version 4.14.0

Daniel Vaulot

Main changes

A single SSU database

From version 4.14.0, a single SSU database is provided which contains sequences for:

  • 18S rRNA from nuclear and nucleomorph
  • 16S rRNA from plastid, apicoplast, chromatophore, mitochondrion
  • 16S rRNA from a small selection of bacteria

The rationale is that the database can now be used to detect bacterial sequences that are amplified with either 18S rRNA or "universal" primers. These sequences can be further assigned with Silva or GTDB.

To allow correct assignment of organelle sequences with software such as DECIPHER (IDTAXA), a short suffix (4 or 5 letters) identifying the organelle is appended to the taxonomy:

Organelle        Taxonomy suffix
nucleus          (none)
nucleomorph      :nucl
plastid          :plas
apicoplast       :apic
chromatophore    :chrom
mitochondrion    :mito

Major groups for which taxonomy has been updated

  • Apicomplexa
  • Labyrinthulids
  • Radiolaria
  • Foraminifera
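
As an illustration of the organelle suffix convention described earlier in these notes, here is a minimal Python sketch that separates a taxon name from its organelle suffix. It assumes the suffix sits at the very end of the name, after a colon, as in the suffix list. The function name `split_organelle` and the example strings are hypothetical and are not part of PR2.

```python
# Suffixes as listed in these release notes; nuclear sequences carry no suffix.
ORGANELLE_SUFFIXES = {
    "nucl": "nucleomorph",
    "plas": "plastid",
    "apic": "apicoplast",
    "chrom": "chromatophore",
    "mito": "mitochondrion",
}


def split_organelle(taxon: str) -> tuple[str, str]:
    """Return (taxon name without suffix, organelle name)."""
    name, sep, suffix = taxon.rpartition(":")
    if sep and suffix in ORGANELLE_SUFFIXES:
        return name, ORGANELLE_SUFFIXES[suffix]
    # No recognised suffix: treat the sequence as nuclear.
    return taxon, "nucleus"


# Hypothetical example strings, for illustration only.
print(split_organelle("Ostreobium:plas"))
print(split_organelle("Ostreobium"))
```

Stripping the suffix this way makes it possible to compare organelle and nuclear assignments of the same taxon.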

Quarantined sequences (makes sense in these COVID times...)

We are introducing sequences that have been quarantined. These sequences have been reassigned with DECIPHER IDTAXA, but their bootstrap values were low or they were flagged as problematic by DECIPHER during the LearnTaxa phase. These sequences are not provided with the current version but will be added in the future after verification of their taxonomic assignment.
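
The quarantine rule described above can be sketched in Python as follows. This is only an illustration: the threshold value, the `Assignment` class, its field names, and the accessions are all assumptions, since these notes do not state the cut-off or the data layout actually used.

```python
from dataclasses import dataclass

# Hypothetical cut-off: the release notes do not give the value used.
MIN_BOOTSTRAP = 80.0


@dataclass
class Assignment:
    accession: str    # sequence identifier (made-up values below)
    bootstrap: float  # bootstrap support of the assignment, in percent
    flagged: bool     # True if marked problematic while training the classifier


def is_quarantined(a: Assignment, min_bootstrap: float = MIN_BOOTSTRAP) -> bool:
    """A sequence is quarantined if it was flagged or has low bootstrap support."""
    return a.flagged or a.bootstrap < min_bootstrap


records = [
    Assignment("SEQ0001", 98.0, False),
    Assignment("SEQ0002", 45.0, False),  # low bootstrap
    Assignment("SEQ0003", 99.0, True),   # flagged as problematic
]
quarantined = [a.accession for a in records if is_quarantined(a)]
print(quarantined)
```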

List of sequences added or updated

  • Added: 9,710
  • Updated: 25,298
  • Quarantined: 614
  • Removed: 462


Taxonomic groups updated

  • Alveolata - Javier del Campo

    • Apicomplexa
      • 9955 sequences updated or added.
      • 303 sequences quarantined, needing assignment by phylogeny.
      • 583 taxonomy entries revised
  • Chlorophyta

    • Ostreobium: 2 sequences added
  • Stramenopiles

    • Labyrinthulids - Javier del Campo
      • Sequences updated or added: 1280
      • Sequences quarantined: 133
      • Taxonomy fully revised: 69 species
    • Cafeteria - Alex Schoenle following Schoenle et al. (2020)
      • sequences updated: 30
      • sequences added: 31
      • script
    • Cafileria marina: 8 sequences added
  • Haptophyta

    • Rappephyceae - Kawachi et al. (2021)
      • Rappemonads moved into Rappephyceae
      • 4 sequences added
  • Radiolaria - Miguel Sandin

  • Foraminifera - Raphaël Morard

    • Total number of validated sequences: 3839
    • Taxonomy updated or added: 315 entries
    • Sequences added: 1149
    • Sequences updated (including new sequences): 2164
    • script to upload to PR2
  • Excavata - Javier del Campo and EukRef team

    • EukRef team: Martin Kolisko, Olga Flegontova, Anna Karnkowska, Gordon Lax, Julia M. Maritz, Tomáš Pánek, Petr Táborský, Jane M. Carlton, Ivan Cepička, Aleš Horák, Julius Lukeš, Alastair G.B. Simpson, and Vera Tai
    • Total number of validated sequences: 6265
    • Taxa updated or added: 735
    • Sequences added from GenBank: 75
    • Sequences updated (existing + new): 1347 + 2875
    • Sequences quarantined: 104
    • Metadata updated with EukRef fields: 6091
  • 16S plastid sequences (Ostreobium and Apicomplexa) - Javier del Campo

    • 87 sequences reassigned
    • 482 sequences added
  • Bacteria, Archaea - Daniel Vaulot

    • Sequences added: 7945
    • Taxa added: 1571
    • These sequences originate from Silva seed alignment v. 132 as found on the mothur site
    • They are used as "control" sequences when assigning metabarcodes, especially for primers that are either "universal" (i.e. amplify both 18S and 16S) or "imperfect" (i.e. also amplify a small fraction of 16S sequences).
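
To illustrate how such control sequences might be used, here is a minimal Python sketch that separates metabarcodes assigned to prokaryotes from the rest after taxonomic assignment. It assumes each lineage is a list of ranks whose first element is the domain, and that the domain labels are "Bacteria" and "Archaea". The function name and the example ASV identifiers are hypothetical.

```python
# Assumed top-level labels for the prokaryotic control sequences.
PROKARYOTE_DOMAINS = {"Bacteria", "Archaea"}


def split_by_domain(assignments: dict[str, list[str]]):
    """Split {asv_id: lineage} into (non-prokaryotic, prokaryotic) dictionaries."""
    kept, controls = {}, {}
    for asv, lineage in assignments.items():
        is_prokaryote = bool(lineage) and lineage[0] in PROKARYOTE_DOMAINS
        (controls if is_prokaryote else kept)[asv] = lineage
    return kept, controls


# Made-up assignments, for illustration only.
assignments = {
    "asv_001": ["Eukaryota", "Alveolata"],
    "asv_002": ["Bacteria", "Proteobacteria"],
    "asv_003": ["Archaea"],
}
kept, controls = split_by_domain(assignments)
print(sorted(kept), sorted(controls))
```

Metabarcodes that fall in the control set can then be reassigned with a dedicated prokaryotic reference such as Silva or GTDB.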

Sequences uploaded but not yet annotated

  • 8763 18S rRNA sequences added from GenBank - 2020-05-27 to 2021-03-23 - Script

Sequences removed

  • Potential chimeras in Radiolaria: 343 (M. Sandin)
  • Bad sequences: 6 (F. Mahé)
  • Chimeras: 95 (A M Fiore-Donno)
  • ITS: 20 (A M Fiore-Donno)
  • Badly assigned: 6 (A M Fiore-Donno)

Sequences modified (F. Mahé)

  • complemented: 26
  • reverse complemented: 114 + 189 - Script
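
For readers unfamiliar with the two operations, here is a minimal Python sketch of complementing and reverse-complementing a DNA sequence. It is not the script used for PR2. It handles only A, C, G, T and N, and leaves any other character (including IUPAC ambiguity codes) unchanged.

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement(seq: str) -> str:
    """Replace each base by its complement, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the order."""
    return complement(seq)[::-1]


print(complement("AACG"))          # TTGC
print(reverse_complement("AACG"))  # CGTT
```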

Metadata added

  • A large number of metadata have been downloaded from GenBank, such as GenBank taxonomy and references associated with sequences.

Database structure

  • pr2_main
    • quarantined_version: sequences flagged as quarantined will need to be re-assigned later.
  • pr2_metadata
    • gb_references: removed (empty)
    • gb_locus: removed (empty)
    • gb_division: added - three-letter code for the GenBank division (e.g. PLN, ENV...)

Metadata added

The following fields were populated from GenBank when the data were missing (413,230 records updated):

  • gb_taxonomy
  • gb_project
  • gb_authors, gb_publication, gb_journal
  • gb_sequence
  • gb_division
  • gb_date
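
The "populate only when missing" behaviour can be sketched in Python as below. The field names come from the list above, but the dictionary layout, the function name `fill_missing`, and the example values are assumptions made for illustration. The actual update was performed against the MySQL PR2 database.

```python
GB_FIELDS = (
    "gb_taxonomy", "gb_project", "gb_authors", "gb_publication",
    "gb_journal", "gb_sequence", "gb_division", "gb_date",
)


def fill_missing(record: dict, genbank: dict, fields=GB_FIELDS) -> dict:
    """Return a copy of record where empty fields are filled from genbank.

    Values already present in record are never overwritten.
    """
    updated = dict(record)
    for field in fields:
        if updated.get(field) in (None, "") and genbank.get(field) not in (None, ""):
            updated[field] = genbank[field]
    return updated


# Made-up values, for illustration only.
record = {"gb_division": "PLN", "gb_date": None}
genbank = {"gb_division": "ENV", "gb_date": "2021-03-23"}
print(fill_missing(record, genbank))
```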


Scripts are provided only to show some of the procedures used to update the PR2 database. Do not try to run them: they will not work because they require access to the MySQL PR2 database.

