mikemc/speedyseq: speedyseq v0.2.0

doi:10.5281/zenodo.3923184

Published June 30, 2020 | Version v0.2.0

Software Open

mikemc/speedyseq: speedyseq v0.2.0

Michael McLaren¹

1. North Carolina State Unversity

Breaking changes

The default ordering of new taxa output by tax_glom() is different from previous versions and from phyloseq::tax_glom() in phyloseq objects that do not have phylogenetic trees. See "Minor improvements and fixes" for more information.

New features New general-purpose vectorized taxa-merging function

The new merge_taxa_vec() function provides a vectorized version of phyloseq::merge_taxa() that can quickly merge arbitrary groups of taxa and now forms the basis of all other merging functions. phyloseq::merge_taxa() takes a phyloseq object or component object x and a set of taxa eqtaxa and merges them into a single taxon. In place of the eqtaxa argument, speedyseq's merge_taxa_vec() takes a vector group of length ntaxa(physeq) that defines how all the taxa in x should be merged into multiple new groups. Its syntax and behavior is patterned after that of base::rowsum(), which it uses to do the merging in the OTU table. When aiming to merge a large number of taxa into a smaller but still large number of groups, it is much faster to do all the merging with one call to merge_taxa_vec() than to loop through many calls to merge_taxa().

A practical example is clustering amplicon sequence variants (ASVs) into OTUs defined by a given similarity threshold. Suppose we have a phyloseq object ps that has the ASV sequences stored in its refseq slot. We can cluster the ASV sequences into 97% OTUs using the DECIPHER package with

dna <- refseq(ps)
nproc <- 1 # Increase to use multiple processors
aln <- DECIPHER::AlignSeqs(dna, processors = nproc)
d <- DECIPHER::DistanceMatrix(aln, processors = nproc)
clusters <- DECIPHER::IdClusters(
  d, 
  method = "complete",
  cutoff = 0.03, # corresponds to 97% OTUs
  processors = nproc
)

Next we merge_taxa_vec() to get the merged phyloseq object,

ps0 <- merge_taxa_vec(
  ps, 
  group = clusters$cluster,
  tax_adjust = 2
)

The names of the new taxa are set to the name of the most abundant taxon within each group (the same behavior as the tax_glom() and tip_glom() functions). Future versions will likely have a names argument to control the naming behavior.

The tax_adjust argument controls how NAs and within-group disagreements in taxonomy are handled to determine the taxonomy of the new taxa (see help(merge_taxa_vec) for details). An example of the difference between tax_adjust = 1 (phyloseq::merge_taxa() behavior) and tax_adjust = 2 can be seen in the following example from the new tip_glom() documentation,

data(GlobalPatterns)

set.seed(20190421)
ps <- prune_taxa(sample(taxa_names(GlobalPatterns), 2e2), GlobalPatterns)

ps1 <- tip_glom(ps, 0.1, tax_adjust = 1)
ps2 <- tip_glom(ps, 0.1, tax_adjust = 2)
tax_table(ps1)[c(108, 136, 45),]
#> Taxonomy Table:     [3 taxa by 7 taxonomic ranks]:
#>        Kingdom    Phylum           Class                 Order               
#> 578831 "Bacteria" "Bacteroidetes"  "Sphingobacteria"     "Sphingobacteriales"
#> 2801   "Bacteria" "Planctomycetes" "Planctomycea"        "Pirellulales"      
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales" 
#>        Family Genus            Species                             
#> 578831 NA     "Niabella"       NA                                  
#> 2801   NA     "Rhodopirellula" NA                                  
#> 185581 "OM60" NA               "marinegammaproteobacteriumHTCC2080"
tax_table(ps2)[c(108, 136, 45),]
#> Taxonomy Table:     [3 taxa by 7 taxonomic ranks]:
#>        Kingdom    Phylum           Class                 Order               
#> 578831 "Bacteria" "Bacteroidetes"  "Sphingobacteria"     "Sphingobacteriales"
#> 2801   "Bacteria" "Planctomycetes" "Planctomycea"        "Pirellulales"      
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales" 
#>        Family Genus Species
#> 578831 NA     NA    NA     
#> 2801   NA     NA    NA     
#> 185581 "OM60" NA    NA

Faster and lower-memory implementation of phyloseq::tip_glom()

The new tip_glom() function provides a speedy version of phyloseq::tip_glom(). This function performs a form of indirect phylogenetic merging of taxa using the phylogenetic tree in phy_tree(physeq) by 1) using the tree to create a distance matrix, 2) performing hierarchical clustering on the distance matrix, and 3) defining new taxonomic groups by cutting the dendrogram at the height specified by the h parameter. Speedyseq's tip_glom() provides a faster and less memory-intensive alternative to phyloseq::tip_glom() through the use of vectorized merging (via merge_taxa_vec()) and faster and lower-memory phylogenetic-distance computation (via get_all_pairwise_distances() from the castor package).

Speedyseq's tip_glom() also has the new tax_adjust argument, which is passed on to merge_taxa_vec(). It is set to 1 by default for phyloseq compatibility and should give identical results to phyloseq in this case.

For phyloseq compatibility, the default clustering function is left as cluster::agnes(). However, equivalent but faster results can be obtained by using the hclust function from base R with the method == "average" option.

Direct phylogenetic merging with tree_glom()

The new tree_glom() function performs direct phylogenetic merging of taxa. This function is much faster and arguably more intuitive than tip_glom(). Advantages of direct merging over the indirect merging of tip_glom() are

A merged group of taxa correspond to a clade in the original tree being collapsed to a single taxon.
The resolution parameter that controls the degree of merging has units in terms of the tree's branch lengths, making it potentially more biologically meaningful than the h parameter in tip_glom().
The distance-matrix computation and hierarchical clustering in tip_glom() can be skipped, making tree_glom() much faster and less memory intensive than tip_glom() when the number of taxa is large.

tree_glom() uses functions from the castor package to determine which clades are to be merged using one of three criteria. The default behavior is to merge a clade if the maximum distance from a node to each of its tips is less than the distance resolution.

data(GlobalPatterns)
ps1 <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
ntaxa(ps1)
ps2 <- tree_glom(ps1, 0.05)
ntaxa(ps2)

library(dplyr)
library(ggtree)
library(cowplot)

plot1 <- phy_tree(ps1) %>%
  ggtree +
  geom_tiplab() +
  geom_label(aes(x = branch, label = round(branch.length, 4)))
plot2 <- phy_tree(ps2) %>%
  ggtree +
  geom_tiplab() +
  geom_label(aes(x = branch, label = round(branch.length, 4)))
plot_grid(plot1, plot2)

Minor improvements and fixes

Fixed bug that applied to taxonomic merge functions when an object named new_tax_mat exists outside the function environment; described in Issue #31
Merging functions now maintain the original order of new taxa by default, except in phyloseq objects with phylogenetic trees (for which order is and has always been determined by how archetypes are ordered in phy_tree(ps)$tip.label). This behavior can lead to different taxa orders from past speedyseq versions and from phyloseq::tax_glom() function. However, it makes the resulting taxa order more predictable. New taxa can be be reordered according to group or taxonomy in merge_taxa_vec() and tax_glom() by setting reorder = TRUE.
Merging/glom functions now work on relevant phyloseq components as well as phyloseq objects

Files

mikemc/speedyseq-v0.2.0.zip

Files (1.6 MB)

Name	Size	Download all
mikemc/speedyseq-v0.2.0.zip md5:8e58ea998b1fac750fb40ae01f551070	1.6 MB	Preview Download

Additional details

Is supplement to: https://github.com/mikemc/speedyseq/tree/v0.2.0 (URL)

	All versions	This version
Views	429	379
Downloads	11	6
Data volume	9.8 MB	9.5 MB

mikemc/speedyseq: speedyseq v0.2.0

Creators

Description

Files

mikemc/speedyseq-v0.2.0.zip

Files (1.6 MB)

Additional details

Related works