PR2 version 5.0.0
Cilates L. Santoferrara et al.

Author

Daniel Vaulot

Published

February 22, 2023

Comments

We only made changes (in yellow) in the “sequences” tab of the attached file.

  • We didn’t fully replace the current Spirotrichea annotation with the EukRef one. Too bad it was not done at the right time, but it would be too much work now and it is not fully justified as there have been updates and EukRef rules resulted too conservative for our target groups.
  • We focused on Oligotrichida and Choreotrichida. Rather than replacing the current annotation, we reconciled it with useful EukRef rules and other published work (Ganser, Santoferrara & Agatha MPE 2022). For these groups we:
    • added new sequences, made some corrections, and updated names (sensu Adl et al. 2019).
    • removed these artificial groups:
      • Leegardiellidae_A and _B: replaced by Leegardiellidae
      • Strobilidiidae_A to _J: replaced by Strobilidiidae
      • Strombidiidae_A to _R: replaced by Strombidiidae
      • Tontoniidae_A and _B: replaced by Tontoniidae
      • Strombidiida_A to _H: replaced by Strombidiida
    • For tintinnids, we included both the order and suborder (Choreotrichida-Tintinnina) in the order column (best compromise, hopefully acceptable)
  • We made only minor changes in other Spirotrichea (Euplotia, etc), and most entries don’t seem problematic. Perhaps Vittorio can have a look when he has the chance, as he is the expert in these groups.
  • We acknowledge that the full Spirotrichea annotation should eventually be re-assessed by re-running the EukRef pipeline, and/or complementing with other methods (e.g. phylogenetic placement of environmental sequences; use of molecular signatures). This is a project for future.

Init

source(here::here("R",'PR2_init.R'), echo=FALSE)
source(here::here("R",'PR2_read_google_quick.R'), echo=FALSE)

  pr2_18S <- pr2 %>% 
    filter(gene == "18S_rRNA")
  
  pr2_18S_active <- pr2_active %>% 
    filter(gene == "18S_rRNA")

Set up the files

target_group =  c("Spirotrichea")
target_level = "class"

dir_pr2_update <- "D - Ciliates"

pr2.env$editor <- "L. Santoferrara et al."

full_path <- function(file_name){here::here("5.0",dir_pr2_update , file_name)}

# create the directory for taxonomy output
dir.create(full_path("taxo"))

Save current state of PR2 for group

pr2_18S_target <- pr2_18S_active  %>% 
  filter(!!as.symbol(target_level) %in% target_group) %>% 
  arrange(across(any_of(pr2.env$taxo_levels[[9]])))

rio::export(pr2_18S_target, 
            full_path("pr2_Spirotrichea_v_5.0_2023-02-22.xlsx"),
            firstActiveRow = 2, 
            firstActiveCol = 2,
            zoom = 80)

Read the original data and reformat

Read the data

  • Number of sequences = 3918
# File has been updated by interactions with Luciana
file_pr2_update_excel = full_path("pr2_Spirotrichea_2022_09_11_LS-MG_2023_02_21.xlsx")

pr2_update <- import(file_pr2_update_excel, sheet = "sequences", guess_max=200000, na=c("", "-"))
  
str_c("Number of sequences : ", nrow(pr2_update))

Add to PR2 missing sequences from Genbank

  • Run the script script_genbank_xml.R on server

  • Run second part PR2-update-GenBank.R

  pr2_new <-  filter(pr2_update, !(genbank_accession %in% pr2_18S$genbank_accession)) 

# Updated sequences that are not present in PR2
  pr2_new
  
  # Use this data frame to download new sequences and then restart from beginning
  pr2_new <- pr2_new %>% 
    pull(genbank_accession)
  
 saveRDS(pr2_new, full_path("accessions_new_ciliates.rds"))

Compare with sequences in PR2

  • Sequences in target group in PR2 that are active: 3877

  • Sequences in target group in PR2 that need update: 3918

  • Sequences in update that are not active in PR2: 41 (this corresponds to sequence which no species_9 field)

  • Sequences in target group in PR2 that are not updated: 0

  • Sequences duplicated (e.g. with and without introns): 1

  # Sequences of target group in pr2
  pr2_18S_target

  
  # Sequences of PR2 that need update
  filter(pr2_18S, (genbank_accession %in% pr2_update$genbank_accession)) 
  
  # Updated sequences that are not active in PR2
  
  filter(pr2_update, !(genbank_accession %in% pr2_18S_active$genbank_accession)) 
  
  # Updated sequences that are not present in PR2  
  filter(pr2_update, !(genbank_accession %in% pr2_18S$genbank_accession))
  
  
  # Sequences from target group in PR2 that are not in update
  pr2_target_not_updated <- filter(pr2_18S_active, (!!as.symbol(target_level) %in% target_group) & 
                                             !(genbank_accession %in% pr2_update$genbank_accession)) 
  
  pr2_target_not_updated  

  #  rio::export(pr2_target_not_updated, full_path("pr2_ciliates_not_updated.xlsx"))
  
  
  # Sequences updated with 2 entries in PR2  (e.g. with and without introns) 
  left_join(select(pr2_update, genbank_accession), 
            select(pr2_18S, genbank_accession, pr2_accession)) %>% 
   count(genbank_accession) %>% 
   filter(n > 1) 

Update pr2_taxonomy

Build and check

  • Taxa to be updated: 428
  • Taxa do be added: 46
  • Taxa total: 474
pr2_taxo_updated <- pr2_update %>% 
  group_by_at(pr2.env$taxo_levels[[pr2.env$taxo_levels_number]]) %>%  
  count() 

pr2_taxo_check(pr2_taxo_updated, pr2.env$taxo_levels[[pr2.env$taxo_levels_number]], full_path("taxo"))

pr2_taxo_raw_targeted <- pr2_taxo_raw %>% 
  filter(class_9 == target_group)

pr2_taxo_updated <- pr2_taxo_updated %>% 
  rename_all(~ str_c(.,"_9_new" )) %>%
  dplyr::rename(species_9 = species_9_new) %>% 
  left_join(pr2_taxo_raw_targeted) %>% 
  rename_at(pr2.env$taxo_levels_9, ~ str_c(.,"_old" )) %>% 
  rename_all( ~ str_replace(.,"_9_new", "_9" )) %>% 
  dplyr::rename(species_9 = species_9_old) %>% 
  mutate(species_8 = species_9, 
         genus_8 = genus_9,
         family_8 = family_9,
         order_8 = order_9,
         taxo_edited_version = str_c(pr2.env$version,"; ",  replace_na(taxo_edited_version, "")),
         taxo_edited_by = str_c("L. Santoferrara; ", replace_na(taxo_edited_by, "")),
         taxo_remark = str_c("5.0: Taxo updated to be compatible EukRef; ", replace_na(taxo_remark, ""))) %>% 
  relocate (contains("_9_old"), .before = domain_9)
  

rio::export(pr2_taxo_updated, 
            full_path("pr2_taxo_added_updated.xlsx"), 
            firstActiveRow = 2, 
            firstActiveCo = 9, 
            zoom = 80)

Find taxa in PR2 that are not included in the update

  • These taxa should be removed
  • Taxa to be removed: 68
pr2_taxo_not_updated <- pr2_taxo_raw_targeted %>%  
  filter(!(species_9 %in% pr2_taxo_updated$species_9)) %>% 
  select(taxo_id, species_9) %>% 
  mutate(taxo_removed_version = pr2.env$version)

export(pr2_taxo_not_updated , full_path("pr2_taxo_removed.xlsx"))

Update of table pr2_main

Sequences that need updating

pr2_update_final <- pr2_update %>% 
  select(genbank_accession, species) %>% 
  dplyr::rename(species_new = species) %>% 
  left_join(select(pr2_main, pr2_accession, genbank_accession, species, edited_version, edited_by))

Sequences without species name or with different species

  • Sequences added: 38
  • Sequences updated: 1988
pr2_main_updated <- pr2_update_final %>% 
  filter((species != species_new)|is.na(species)) 

pr2_main_updated 

glue::glue("Number of updated sequences {nrow(filter(pr2_main_updated, !is.na(species)))}")
glue::glue("Number of new sequences {nrow(filter(pr2_main_updated, is.na(species)))}")

Add fields

pr2_main_updated <- pr2_main_updated %>%  
    select (pr2_accession, 
            species_8 = species_new, 
            species_9 = species_new, 
            species_old = species,
            edited_version,
            edited_by) %>% 
    mutate(edited_version = str_c(pr2.env$version,"; ",  replace_na(edited_version, "")),
           edited_by = str_c("L. Santoferrara; ", replace_na(edited_by, "")) ) 

Save everything to an Excel file

 file_pr2_imports <-  full_path("pr2_imports_final.xlsx")
 onglets <- list("pr2_main_updated" = pr2_main_updated
                 # "pr2_sequences_updated" = pr2_sequence_updated
                 )
 rio::export(onglets, file_pr2_imports, 
            firstActiveRow = 2, 
            firstActiveCol = 2,
            zoom = 80 
             )