Microbial community analysis using R and phyloseq

This document is a guideline for community analysis using R, Phyloseq, microbiome, vegan, ggplot2, tidyverse and custom made functions to facilitate the analyses and visualization stored under the DivComAnalyses repository.

We recommend using post-clustering approach for processing raw metabarcoding data and developed an R friendly pipeline named metabaRpipe.

The learning curve of R might be a bit steep but analyzing your metabarcoding data will give you access to state-of-the art approaches - developed in R before getting integrated into user friendly pipelines. Moreover, you will be then able to apply your R skills to any other type of data.

2 Getting ready:

2.1 Install R/ Rstudio:

Install R/ Rstudio from the ETH AppV Kiosk or following the instructions here. It is important to start with an up to date R version otherwise you will likely not been able to install the necessary packages and use the functions.

2.2 Install the required packages:

2.2.1 MAC:

macOS users - might need to install xquartz for some of the function (to do this with homebrew, run the following command in your mac’s Terminal: brew install --cask xquartz)

#devtools
install.packages("devtools")
devtools::install_github("hadley/devtools")

#tidyverse
devtools::install_github("tidyverse/tidyverse")

#phyloseq
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("phyloseq")

2.2.2 Windows:

Windows users - will need to have RTools installed so that your computer can build this package. Follow instructions here.

# devtools
install.packages("devtools")
devtools::install_github("hadley/devtools")

#tidyverse
devtools::install_github("tidyverse/tidyverse")

#phyloseq
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("phyloseq")

2.3 Load the necessary packages:

Tidyverse and phyloseq are the two main packages we are going to use but the functions we will source later will require additional packages you would need to install.

library(tidyverse); packageVersion("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## [1] '1.3.1'
library(phyloseq); packageVersion("phyloseq")
## [1] '1.34.0'

3 Getting familiar with R & phyloseq:

3.1 Data import:

We are going to illustrate some functions using two different datasets generated in the group and processed with the metabaRpipe pipeline. Those files are located under the Github repository. You can download those directly https://github.com/fconstancias/DivComAnalyses/raw/master/data-raw/ps_invivo.RDS.

Mice were subjected with different treatments and 16S metabarcoding was use to describe feces level changes in the microbiota ps_invivo.RDS.

Define the phyloseq object’s location:

ps1 = "../../data-raw/ps_invivo.RDS"

Load the phyloseq object:

ps1 %>% # ps object wich is path of our physeq file on our computer
  readRDS() -> physeq_1  # the readRDS function and then it is directed/stored to a R object we named physeq

3.1.1 The different facets of the phyloseq object:

Call the phyloseq object:

physeq_1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]

The dataset contains 3704 taxa, ASV and recorded in 116 samples otu_table() and 26 variables as metadata sample_data(). There are 7 levels in the taxonomic path of those ASV and tax_table() an ASV phylogenetic tree is stored phy_tree() in addition to the sequences refseq()

The different facets of the data can be easily access:

OTU/ASV table:

physeq_1 %>%
  otu_table() %>%
  head() # print only the first few rows
## OTU Table:          [6 taxa and 116 samples]
##                      taxa are rows
##         S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    12     0     0     3     0    30    84    15    26     2     0     6
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     0     3     0     1     3    10    39     3     0     0     3     0
## ASV0012     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_231 S_232 S_233 S_234 S_251 S_252 S_253 S_254 S_255 S_256 S_257 S_259
## ASV0002     8     0     0     0     0     0     0     0     0     0     0     0
## ASV0003     8     0    43     5     0     0     0     0     0    14    11   185
## ASV0006     0     0     0     0     0     0     1     0     0     0     0     0
## ASV0008     8     0     0     0     1    20    26     6     9     0     0     0
## ASV0012     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_260 S_261 S_262 S_263 S_264 S_265 S_267 S_268 S_269 S_270 S_271 S_272
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    14    13    27    40     0     4   678   823     0     0    30    30
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     0     9     0     1    24     6     0     0     1     5     0     0
## ASV0012     0     0     0     0     0     0     0     0     0     0    81    67
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_273 S_274 S_275 S_276 S_277 S_278 S_279 S_280 S_281 S_298 S_299 S_300
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    18     0     0     0     0   137   311     4   222     6    14    89
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     0     0     2     0     0     0     0     0     0     0     0     0
## ASV0012    45     0     0     0     0    12    12     0     9     8     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     3
##         S_301 S_302 S_303 S_304 S_305 S_306 S_308 S_309 S_310 S_311 S_312 S_313
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    72    35     0     0    35     5     3     9     3    60    14     0
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     1     0     0     0     0     2     0     0     0     0     0     0
## ASV0012     0     0     0     0     0     3     0     0     0     0     0    15
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_315 S_316 S_317 S_318 S_319 S_320 S_321 S_322 S_323 S_324 S_325 S_326
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003     0     0     2     0     0     0     0     0     0     0     0     4
## ASV0006     0     3     2     0     0     0     0     0     0     0     0     0
## ASV0008     0     0     0     0     1     0     7     0     0     0     2     1
## ASV0012    15     0     0     0    99     0     0     0     8     0     0    14
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_327 S_341 S_342 S_343 S_344 S_345 S_346 S_347 S_348 S_349 S_350 S_351
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003     0     3    61    13     0     0     3     0     0    17     3    25
## ASV0006     3     1     0     0     0     0     0     2     0     0     0     0
## ASV0008     0     1     2     0     0     0     0     3     0     2     0     4
## ASV0012    71    57    84     0     0     0    13    33     0     0    11     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_352 S_353 S_354 S_355 S_356 S_438 S_439 S_440 S_441 S_442 S_443 S_444
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    41    14    19     5    23     6    75    95     7    63     0    98
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     6    15     2     8     0     3     0     0     0     0     1     2
## ASV0012     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_445 S_446 S_447 S_448 S_449 S_450 S_465 S_466 S_467 S_468 S_469 S_470
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003     9    70    84    47   263     9   263     0   173    14     6    26
## ASV0006     0     0     1     0     0     0     0     0     0     0     0     0
## ASV0008     0     0     0     0     1     0     1     2     0     0     0     0
## ASV0012     0     0     0     0     0     6     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
##         S_471 S_472 S_473 S_474 S_475 S_476 S_477 S_478
## ASV0002     0     0     0     0     0     0     0     0
## ASV0003    12     0     2    14    18     8     0     7
## ASV0006     0     0     0     0     0     0     2     0
## ASV0008     1     0     1     0     8     2     3     3
## ASV0012     0     0     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0

N.B.: OTU (Operational Taxonomic Unit)can cover a lot of things including bands on a DGGE profile to OTU based on sequence similarity threshold or ASV. ASV (Amplicon Sequence Variant) is an type of OTU generated using post clustering approaches. See here, here or there for more details.

Check the tax table:

physeq_1 %>%
  tax_table() %>%
  head() # print only the first few rows
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##         Kingdom    Phylum       Class        Order           Family           
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
##         Genus            Species  
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown"        "unknown"
## ASV0008 "unknown"        "unknown"
## ASV0012 "unknown"        "unknown"
## ASV0013 "unknown"        "unknown"

Our phyloseq object has a ‘Strain’ taxonomic information we created - know taxonomical information at highest level + ASV id

physeq_1 %>%
  tax_table() %>%
  head()# print only the first few rows
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##         Kingdom    Phylum       Class        Order           Family           
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
##         Genus            Species  
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown"        "unknown"
## ASV0008 "unknown"        "unknown"
## ASV0012 "unknown"        "unknown"
## ASV0013 "unknown"        "unknown"

N.B.: I prefer to use tydiverse’s pipe: %>% which makes the code more readable as compared to classical R. For more info: https://dplyr.tidyverse.org/ & https://rstudio.com/resources/cheatsheets/

Which is similar to:

head(tax_table(physeq_1))
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##         Kingdom    Phylum       Class        Order           Family           
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
##         Genus            Species  
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown"        "unknown"
## ASV0008 "unknown"        "unknown"
## ASV0012 "unknown"        "unknown"
## ASV0013 "unknown"        "unknown"

Check nucleotide sequences of the ASV:

physeq_1 %>%
  refseq()
## DNAStringSet object of length 1872:
##        width seq                                            names               
##    [1]   253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0002
##    [2]   253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0003
##    [3]   253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0006
##    [4]   253 TACGTAGGTGGCAAGCGTTATC...AAAGTGTGGGGAGCAAACAGG ASV0008
##    [5]   253 TACGTAGGATGCGAGCGTTATC...AAAGTGTGGGGAGCAAACAGG ASV0012
##    ...   ... ...
## [1868]   253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5274
## [1869]   253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5277
## [1870]   253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5278
## [1871]   253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5279
## [1872]   253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCGAATAGG ASV5280

Check the metadata:

physeq_1 %>%
  sample_variables()
##  [1] "Sample"             "input"              "filtered"          
##  [4] "filtered_pc"        "denoisedF"          "denoisedR"         
##  [7] "denoisedF_pc"       "denoisedR_pc"       "merged"            
## [10] "merged_pc"          "tabled"             "chimera_out"       
## [13] "length_filtered"    "tabled_pc"          "chimera_out_pc"    
## [16] "length_filtered_pc" "day"                "treatment"         
## [19] "mouse_label"        "BW"                 "BW_percent"        
## [22] "BW_delta"           "DAI"                "treatment_grouped" 
## [25] "treatment_invivo"   "day_num"
physeq_1 %>%
  sample_data() %>%
  head() # print only the first few rows
##       Sample input filtered filtered_pc denoisedF denoisedR denoisedF_pc
## S_219  S_219 44852    44233        0.99     42985     43681         0.97
## S_220  S_220 72545    72048        0.99     70726     70818         0.98
## S_221  S_221 76560    76051        0.99     74773     75012         0.98
## S_222  S_222 62479    62192        1.00     61528     61201         0.99
## S_223  S_223 78944    78670        1.00     77717     77643         0.99
## S_224  S_224 72930    72622        1.00     71294     71600         0.98
##       denoisedR_pc merged merged_pc tabled chimera_out length_filtered
## S_219         0.99  37694      0.88  37694       26929           26929
## S_220         0.98  63881      0.90  63881       44125           44125
## S_221         0.99  67904      0.91  67904       49814           49814
## S_222         0.98  54905      0.89  54905       42532           42532
## S_223         0.99  68254      0.88  68254       49859           49859
## S_224         0.99  60502      0.85  60502       44187           44187
##       tabled_pc chimera_out_pc length_filtered_pc    day treatment mouse_label
## S_219         1           0.71                  1 Day_-1    TreatA         m_8
## S_220         1           0.69                  1 Day_-1    TreatA        m_19
## S_221         1           0.73                  1 Day_-1    TreatA        m_34
## S_222         1           0.77                  1 Day_-1    TreatA        m_26
## S_223         1           0.73                  1 Day_-1    TreatA        m_39
## S_224         1           0.73                  1 Day_-1    TreatA        m_46
##          BW BW_percent BW_delta DAI treatment_grouped treatment_invivo day_num
## S_219 20.55          0        0   0            TreatA             none      -1
## S_220 21.32          0        0   0            TreatA             none      -1
## S_221 20.11          0        0   0            TreatA             none      -1
## S_222 21.90          0        0   0            TreatA             none      -1
## S_223 20.94          0        0   0            TreatA             none      -1
## S_224 21.12          0        0   0            TreatA             none      -1

The phyloseq package has also some useful functions:

For instance to access sample names:

physeq_1 %>%
  sample_names() %>%
  head() # print only the first few rows
## [1] "S_219" "S_220" "S_221" "S_222" "S_223" "S_224"

which is similar to:

physeq_1 %>%
  otu_table() %>%
  colnames() %>%
  head() # print only the first few rows
## [1] "S_219" "S_220" "S_221" "S_222" "S_223" "S_224"

Some checks:

First top samples with highest highest total number of reads.

physeq_1 %>%
  sample_sums() %>%
  sort(decreasing = TRUE) %>%
  head()
##  S_225  S_231  S_465  S_444  S_466  S_274 
## 124774  89836  63618  60835  59991  59927

First top ASV with highest highest total number of reads:

physeq_1 %>%
  taxa_sums() %>%
  sort(decreasing = TRUE) %>%
  head()
## ASV0840 ASV1253 ASV0411 ASV0385 ASV4998 ASV4836 
##  889239  231459  180720  126535  111485  106082

Getting help on a particular function:

?taxa_sums() 
taxa_sums
# then google
# then R package manual
# then ... colleague/ researchgate

3.2 Manipulating a phyloseq object:

Filtering, subseting samples or OTU based on metadata, taxonomy, abundance and prevalence is very easy in phyloseq:

3.3 Filtering:

3.3.1 prune_taxa() prune_samples():

Two functions which prunes unwanted taxa/samples from a phyloseq object based on a vector of taxa to keep:

#Create a logical vector answering the logical question: does the ASV represent more than 10 reads in total?

physeq_1 %>%
  taxa_sums() > 10 -> taxa_top

#taxa_top is indeed a logical vector

taxa_top %>%
  head()
## ASV0002 ASV0003 ASV0006 ASV0008 ASV0012 ASV0013 
##   FALSE    TRUE    TRUE    TRUE    TRUE   FALSE
#filter the physeq object based on that vector

physeq_1 %>%
  prune_taxa(taxa = taxa_top) -> physeq_1_filtered

Note that all the data in our physeq_filtered object has been filtered: otu_table(), tax_table(), refseq(), …

physeq_1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]
physeq_1_filtered
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1043 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1043 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1043 tips and 1041 internal nodes ]
## refseq()      DNAStringSet:      [ 1043 reference sequences ]

It is also possible to filter based on mean proportion, prevalence, variation coefficient:

sum_filter = 10
prev_filter = 1
varcoef_filter = 2
mean_filter = 10

physeq_1  %>%
  filter_taxa(., function(x){mean(x) > mean_filter}, TRUE)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 288 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 288 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 288 tips and 286 internal nodes ]
## refseq()      DNAStringSet:      [ 288 reference sequences ]
physeq_1  %>%
  filter_taxa(., function(x){sum(x > sum_filter) > 1}, prune = TRUE)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 789 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 789 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 789 tips and 787 internal nodes ]
## refseq()      DNAStringSet:      [ 789 reference sequences ]
physeq_1  %>%
  filter_taxa(function(x) sd(x)/mean(x) > varcoef_filter, TRUE)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1534 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1534 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1534 tips and 1532 internal nodes ]
## refseq()      DNAStringSet:      [ 1534 reference sequences ]

3.3.2 subset_taxa() subset_samples()

Those functions subsets unwanted taxa/samples from a phyloseq object based on conditions that must be met:

Keep only OTU belonging to Firmicutes Phyla:

physeq_1 %>%
  subset_taxa(Phylum == "Firmicutes")
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1370 taxa and 116 samples ]
## sample_data() Sample Data:       [ 116 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1370 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1370 tips and 1368 internal nodes ]
## refseq()      DNAStringSet:      [ 1370 reference sequences ]

Keep only samples from treatment A and no treatment to our metadata:

physeq_1 %>%
  subset_samples(treatment_invivo %in% c("TreatA"))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 18 samples ]
## sample_data() Sample Data:       [ 18 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]

Keep only samples from treatment A and no treatment to our metadata:

physeq_1 %>%
  subset_samples(treatment_invivo %in% c("TreatA", "none"))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 71 samples ]
## sample_data() Sample Data:       [ 71 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]

Keep only samples which are not defined by “none” in the “treatment_invivo” metadata:

physeq_1 %>%
  subset_samples(treatment_invivo != "none") # != different from
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 63 samples ]
## sample_data() Sample Data:       [ 63 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]

Keep only samples which are not defined by “none” in the “treatment_invivo” metadata and with samples with BW values > 20:

physeq_1 %>%
  subset_samples(treatment_invivo != "none" &  BW  > 20 ) # | = or, & = AND
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1872 taxa and 28 samples ]
## sample_data() Sample Data:       [ 28 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq()      DNAStringSet:      [ 1872 reference sequences ]

3.4 updating the metadata:

Most of the time we fine tune the metadata along the analysis process, physeq_add_metadata() helps to quickly change the metadata of a phyloseq object. Below an example using an excel file directly. The sample_column defines the column in the external file to be use to link with the sample_names() of the phyloseq object.

source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")

physeq_1 %>%
  physeq_add_metadata(physeq = .,
                      metadata = "../../data-raw/ps_invivo_meta.xlsx" %>%
                        readxl::read_xlsx(),
                      sample_column = "Sample") -> physeq_1_new_meta

physeq_1_new_meta %>% 
  sample_data() %>% 
  data.frame() %>% 
  DT::datatable()

3.5 Smoothing taxonomical information:

3.5.1 ASV taxonomic agglomeration:

ASV obtained 16S V4 metabarcoding are mostly defined from a taxonomic perspective at the Family/ Genus.

tax_table(physeq_1)[,c("Family","Genus","Species")] %>% 
  head()
## Taxonomy Table:     [6 taxa by 3 taxonomic ranks]:
##         Family            Genus            Species  
## ASV0002 "Ruminococcaceae" "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcaceae" "Ruminococcus_1" "unknown"
## ASV0006 "Ruminococcaceae" "unknown"        "unknown"
## ASV0008 "Ruminococcaceae" "unknown"        "unknown"
## ASV0012 "Ruminococcaceae" "unknown"        "unknown"
## ASV0013 "Ruminococcaceae" "unknown"        "unknown"

The function tax_glom() agglomerates ASV at a given taxonomic level. Finer taxonomic information is lost even if ASV id e.g., ASV0002 is maintained.

Agglomerate taxonomy at the Family level:

physeq_1 %>%
  tax_glom(taxrank = "Family") %>% 
  tax_table() %>% 
  head()
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##         Kingdom    Phylum          Class             Order                
## ASV0066 "Bacteria" "Firmicutes"    "Clostridia"      "Clostridiales"      
## ASV0251 "Bacteria" "Tenericutes"   "Mollicutes"      "Mollicutes_RF39"    
## ASV0308 "Bacteria" "Cyanobacteria" "Melainabacteria" "Gastranaerophilales"
## ASV0385 "Bacteria" "Bacteroidetes" "Bacteroidia"     "Bacteroidales"      
## ASV0406 "Bacteria" "Bacteroidetes" "Bacteroidia"     "unknown"            
## ASV0411 "Bacteria" "Bacteroidetes" "Bacteroidia"     "Bacteroidales"      
##         Family             Genus Species
## ASV0066 "Clostridiaceae_1" NA    NA     
## ASV0251 "unknown"          NA    NA     
## ASV0308 "unknown"          NA    NA     
## ASV0385 "Prevotellaceae"   NA    NA     
## ASV0406 "unknown"          NA    NA     
## ASV0411 "Marinifilaceae"   NA    NA

3.5.2 Maxiumum ASV resolution taxonomic assignments:

Agglomerated dataset at a particular taxonomic rank might be of interest for specific analysis or also to summarize the taxonomic profile at a broader level. It is also meaningful to know the highest taxonomic definition of the ASV. We can do that using the function phyloseq_get_strains()

We first need to source the function from the DivComAnalyses repository. Those function can be accessed from your internet browser.

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_normalisation.R") 

Apply the function:

physeq_1 %>% 
  phyloseq_get_strains()  -> physeq_1_st
## Joining, by = "ASV"

We now have the a new Strain taxonomic rank which summarizes the taxonomic path:

physeq_1_st %>% 
  tax_table() %>% 
  head()
## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##         Kingdom    Phylum       Class        Order           Family           
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
##         Genus            Species Strain                                    
## ASV0002 "Ruminococcus_1" NA      "unknown Ruminococcus_1 (Genus) ASV0002"  
## ASV0003 "Ruminococcus_1" NA      "unknown Ruminococcus_1 (Genus) ASV0003"  
## ASV0006 NA               NA      "unknown Ruminococcaceae (Family) ASV0006"
## ASV0008 NA               NA      "unknown Ruminococcaceae (Family) ASV0008"
## ASV0012 NA               NA      "unknown Ruminococcaceae (Family) ASV0012"
## ASV0013 NA               NA      "unknown Ruminococcaceae (Family) ASV0013"

3.6 Count transformation:

Metabarcoding data are count data with two important properties: sparcitiy (i.e., there are usually a lot of 0) and compositionality (i.e., number of sequences assigned to taxa/ASV are linked to the total number of sequences per sample which is arbitrary: imposed by the instrument and the library preparation method but do not indicate any density) - see here or there for more details.

The information stored in the otu_table() is counts data.

physeq_1 %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##         S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0003    12     0     0     3     0    30    84    15    26     2     0     6
## ASV0006     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0008     0     3     0     1     3    10    39     3     0     0     3     0
## ASV0012     0     0     0     0     0     0     0     0     0     0     0     0
## ASV0013     0     0     0     0     0     0     0     0     0     0     0     0
## # … with 5 more samples (columns)

And the total number of sequences per samples varies and is not an indication of any absolute quantification.

physeq_1 %>% 
  sample_sums() %>% 
  head()
## S_219 S_220 S_221 S_222 S_223 S_224 
## 26927 44122 49814 42532 49859 44187

We can transform the count data in the otu_table() into other type of data.

physeq_1 %>% 
  transform_sample_counts(function(x) x/sum(x) * 100) -> physeq_1_pc # transform as percentage - total is 100%


physeq_1_pc %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##          S_219   S_220 S_221   S_222   S_223  S_224  S_225   S_226  S_227
## ASV0002 0      0           0 0       0       0      0      0       0     
## ASV0003 0.0446 0           0 0.00705 0       0.0679 0.0673 0.0384  0.0551
## ASV0006 0      0           0 0       0       0      0      0       0     
## ASV0008 0      0.00680     0 0.00235 0.00602 0.0226 0.0313 0.00767 0     
## ASV0012 0      0           0 0       0       0      0      0       0     
## ASV0013 0      0           0 0       0       0      0      0       0     
## # … with 8 more samples (columns)
physeq_1_pc %>% 
  sample_sums() %>% 
  head()
## S_219 S_220 S_221 S_222 S_223 S_224 
##   100   100   100   100   100   100

3.6.0.1 using microbiome::transform function:

?microbiome::transform()

# compositional' (ie relative abundance), 'Z', 'log10', 'log10p', 'hellinger', 'identity', 'clr'

The log10p transformation refers to log10(1 + x):

physeq_1 %>% 
  microbiome::transform("log10p") %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##         S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002  0    0         0 0     0      0     0    0      0    0     0     0    
## ASV0003  1.11 0         0 0.602 0      1.49  1.93 1.20   1.43 0.477 0     0.845
## ASV0006  0    0         0 0     0      0     0    0      0    0     0     0    
## ASV0008  0    0.602     0 0.301 0.602  1.04  1.60 0.602  0    0     0.602 0    
## ASV0012  0    0         0 0     0      0     0    0      0    0     0     0    
## ASV0013  0    0         0 0     0      0     0    0      0    0     0     0    
## # … with 5 more samples (columns)

Centered-Log-Ratio (CLR) transform - see here:

physeq_1 %>% 
  microbiome::transform("clr") %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##         S_219  S_220  S_221  S_222 S_223 S_224 S_225 S_226 S_227  S_228 S_229
## ASV0002 -1.09 -0.941 -0.964 -1.08  -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0003  3.63 -0.941 -0.964  1.85  -1.39  3.67  3.72  3.44  3.71  1.77  -1.14
## ASV0006 -1.09 -0.941 -0.964 -1.08  -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0008 -1.09  1.95  -0.964  0.850  1.38  2.59  2.96  1.87 -1.22 -0.850  1.64
## ASV0012 -1.09 -0.941 -0.964 -1.08  -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0013 -1.09 -0.941 -0.964 -1.08  -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## # … with 6 more samples (columns)

compositional transformation - proportion scaled to 1:

physeq_1 %>% 
  microbiome::transform("compositional") %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##           S_219   S_220 S_221   S_222   S_223   S_224   S_225   S_226   S_227
## ASV0002 0       0           0 0       0       0       0       0       0      
## ASV0003 4.46e-4 0           0 7.05e-5 0       6.79e-4 6.73e-4 3.84e-4 5.51e-4
## ASV0006 0       0           0 0       0       0       0       0       0      
## ASV0008 0       6.80e-5     0 2.35e-5 6.02e-5 2.26e-4 3.13e-4 7.67e-5 0      
## ASV0012 0       0           0 0       0       0       0       0       0      
## ASV0013 0       0           0 0       0       0       0       0       0      
## # … with 8 more samples (columns)

sqrt transform:

physeq_1 %>% 
  transform_sample_counts(., sqrt) %>% 
  otu_table() %>% 
  head()
## OTU Table:          [ 6 taxa and 116 samples ]:
## Taxa are rows
##         S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002  0     0        0  0     0     0     0     0     0     0     0     0   
## ASV0003  3.46  0        0  1.73  0     5.48  9.17  3.87  5.10  1.41  0     2.45
## ASV0006  0     0        0  0     0     0     0     0     0     0     0     0   
## ASV0008  0     1.73     0  1     1.73  3.16  6.24  1.73  0     0     1.73  0   
## ASV0012  0     0        0  0     0     0     0     0     0     0     0     0   
## ASV0013  0     0        0  0     0     0     0     0     0     0     0     0   
## # … with 5 more samples (columns)

3.7 Rarefaction:

The rarefy_evendepth() function down all samples to the same sequencing depth (a.k.a., Library Size) and prunes ASV that disappear from all samples as well as samples with a lower sequencing depth as a result.

physeq_1 %>%
  rarefy_even_depth(rngseed = 123) %>% # important to specify rngseed: random number generator for reproducibility
  sample_sums() %>%
  head()
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 289OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...
## S_219 S_220 S_221 S_222 S_223 S_224 
## 15459 15459 15459 15459 15459 15459
physeq_1 %>%
  sample_sums() %>%
  min()
## [1] 15459

In this example, we rarefied all the dataset to 15459 sequences per sample because it is the minumum in the dataset. You might not want to do that if some of the samples are Negative Controls (with likely low PCR amplification and therefore low reads after sequencing) or samples which failed to amplify. In those case you can specify the the sample.size argument.

physeq_1 %>%
  rarefy_even_depth(rngseed = 123,
                    sample.size = 20000) -> physeq_1_rare # important to specify rngseed: random number generator for reproducibility
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 2 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## S_345S_469
## ...
## 207OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...
physeq_1_rare
## phyloseq-class experiment-level object
## otu_table()   OTU Table:          [ 1665 taxa and 114 samples ]:
## sample_data() Sample Data:        [ 114 samples by 26 sample variables ]:
## tax_table()   Taxonomy Table:     [ 1665 taxa by 7 taxonomic ranks ]:
## phy_tree()    Phylogenetic Tree:  [ 1665 tips and 1663 internal nodes ]:
## refseq()      DNAStringSet     :      [ 1665 reference sequences ]
## taxa are rows

A warning message specify that 2 samples were removed because of lower Library size. Altough in that case 2070 ASV were removed because they are no longer present. As you can imagine they are low abundant and low prevalent. We will come back to this a bit later.

Should you rarefy your data ? This question is still under ‘debate’: see here and there. It depend on the type of analysis you are going to do, the shape of your rarefaction curves, … but to make it short for alpha-diversity and beta diversity rarefaction is advised for standard analysis, not much for differential abundance testing. Again if you have a decent sequencing depth it should not matter. It also depends how the rarefaction curves look like, which nicely brings us to the next topic. Most of the case it does not impact much the alpha and beta diversity patterns if the sequencing depth is high enough to allow an exhaustive characterization of the diversity of the samples.

Since sequencing depth is important, phyloseq_check_lib_size() function allows to summarize the sequencing depth per sample. The function returns two objects a data.frame df and a plot plot stored in the output.

physeq_1_st %>% 
  phyloseq_check_lib_size(data_color = "treatment_invivo",
                          data_facet = NULL,
                          nreads_display = 30000,
                          first_n = nsamples(physeq_1_st)) -> lib

lib$df %>% 
  DT::datatable()

The DT::datatable() function allows to generate interactive table using marqdown.

We can easily customize the plot using ggplot2 grammar:

lib$plot + 
  scale_y_log10() + 
  theme_classic() -> p_lib

p_lib
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

We can also use tidyverse to select a few columns from the dataframe output:

lib$df %>% 
  select(Sample, mouse_label, day_num, treatment_invivo, LibrarySize, Index) %>% 
  DT::datatable()

3.8 Rarefaction curves:

Rarefaction curves depict the relationship between sequencing depth (i.e., Library Size) and the observed richness (i.e., number of different ASV) and ultimately allow to estimate if the sequencing effort seems sufficient to characterize the richness of the samples. We can use the function phyloseq_rarefaction_curves():

physeq_1_st %>%
  phyloseq_rarefaction_curves(stepsize = 500, 
                              color_data = NULL, 
                              facet_data = NULL) -> p

p

You are now familiar with ggplot objects and can easily customize the plot, faceting with the level of treatment_invivo() and adding a vertical line with the minimum sequencing depth.

physeq_1_st %>%  sample_sums() %>%  min() -> min_lib

p + geom_vline(xintercept = min_lib,
               color = "red",
               linetype = "dashed", size=0.25) +
  facet_wrap(~ treatment_invivo) + ylab("ASV Richness") -> plot

plot + theme(legend.position = "none")

Those curves look nice.

We are going to rarefy the samples to 15459 the minimum library size from our dataset.

physeq_1_st %>%
  rarefy_even_depth(rngseed = 123,
                    sample.size = min_lib) -> physeq_1_rare # important to specify rngseed: random number generator for reproducibility
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 289OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...
physeq_1_rare
## phyloseq-class experiment-level object
## otu_table()   OTU Table:          [ 1583 taxa and 116 samples ]:
## sample_data() Sample Data:        [ 116 samples by 26 sample variables ]:
## tax_table()   Taxonomy Table:     [ 1583 taxa by 8 taxonomic ranks ]:
## phy_tree()    Phylogenetic Tree:  [ 1583 tips and 1581 internal nodes ]:
## refseq()      DNAStringSet     :      [ 1583 reference sequences ]
## taxa are rows

Let’s now extract the ASV which were removed to see whether.

Extract ASV id from the initial / unrarefied dataset:

physeq_1_st %>% 
  filter_taxa(function(x) sum(x > 0) > 0, TRUE) %>%  # to remove any ASV kept but with only 0 from the dataset
  taxa_names() -> asv_raw

Extract ASV id from the rarefied dataset:

physeq_1_rare %>% 
  filter_taxa(function(x) sum(x > 0) > 0, TRUE) %>%  # to remove any ASV kept but with only 0 from the dataset
  taxa_names() -> asv_rare

Get the ASV id which were removed:

setdiff(asv_raw,
        asv_rare) %>% 
  as.vector() -> rm_rare_asv

rm_rare_asv %>% 
  length()
## [1] 289

260 were removed.

Generate a heatmap with the proportion of those 260 ASV in the initial dataset.

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_heatmap.R")

physeq_1_st %>%
  transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
  prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the phyloseq object
  phyloseq_ampvis_heatmap(physeq = .,
                          transform = FALSE, # extract only the taxa to display - after percentage normalisation
                          facet_by = "treatment_invivo",
                          group_by = "SampleID",
                          ntax =  Inf) + ggtitle("ASV removed during rarefaction ‰") -> heat_rm_rare_asv

heat_rm_rare_asv

So we see that those ASV were quite low abundant and very sparse in the full dataset.

physeq_1_st %>%
  transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
  prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the phyloseq object
  plot_bar(fill = "OTU") +
  facet_grid(~ treatment_invivo, scales = "free", space = "free") +
  ggtitle("Bar plot colored by ASV ") +
  ylab("Proportion - ‰") +
  theme_light() +
  theme(legend.position = "none") + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

You should not worry too much regarding the ASV removed during the rarefaction step. What if we had rarefy the dataset to 1000 sequences per samples?

physeq_1_rare %>% 
  phyloseq_rarefaction_curves(stepsize = 500, 
                              color_data = "treatment_invivo", 
                              facet_data = NULL)

Looks good!

physeq_1_st %>%
  transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
  prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the 
microbiomeutilities::plot_abund_prev(.,
                                     label.core = TRUE,
                                     color = "Class", # NA or "blue"
                                     mean.abund.thres = 0.01,
                                     mean.prev.thres = 0.95,
                                     dot.opacity = 0.7,
                                     label.size = 3,
                                     label.opacity = 1.0,
                                     nudge.label=-0.15,
                                     bs.iter=999, # increase for actual analysis e.g. 999
                                     size = length(rm_rare_asv), # increase to match your nsamples(asv_ps)
                                     replace = TRUE,
                                     label.color="#5f0f40") +
  geom_vline(xintercept = 0.95, lty="dashed", alpha=0.7) +
  geom_hline(yintercept = 0.01,lty="dashed", alpha=0.7)  -> pv
## Warning: replacing previous import 'ggplot2::alpha' by 'microbiome::alpha' when
## loading 'microbiomeutilities'
## Make sure to set.seed!!!
## Joining, by = "Taxa"
## Joining, by = "Taxa"
## Joining, by = "Taxa"
pv
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis

This plot displays the mean prevalence and relative abundance (0-100%) of the ASV we removed. We clearly see those taxa are very low prevalent and abundant in propotion.

3.9 Normalisation with total bacterial density:

See here. This function multiplies the proportion of ASV with absolute density values (i.e., as qPCR16S of flow cytometry) allowing to esimate the density of each ASV.

physeq_1_new_meta %>%
  phyloseq_density_normalize(value_idx = "qPCR16S") %>% 
  otu_table()
## OTU Table:          [ 1872 taxa and 116 samples ]:
## Taxa are rows
##          S_219  S_220  S_221  S_222  S_223  S_224  S_225  S_226  S_227  S_228
## ASV0002 0           0 0           0      0 0           0 0      0      0     
## ASV0003 2.79e6      0 0      323996      0 6.47e6 534111 1.63e6 1.87e6 3.55e5
## ASV0006 0           0 0           0      0 0           0 0      0      0     
## ASV0008 0      236983 0      107999 320497 2.16e6 247980 3.25e5 0      0     
## ASV0012 0           0 0           0      0 0           0 0      0      0     
## ASV0013 0           0 0           0      0 0           0 0      0      0     
## ASV0015 0           0 0           0      0 0           0 0      0      0     
## ASV0018 0      868938 1.50e6      0 747827 8.63e5 133528 4.33e5 0      2.31e6
## ASV0020 0           0 0           0      0 0           0 0      0      0     
## ASV0022 0           0 0           0      0 0           0 0      0      0     
## # … with 1,862 more taxa (rows), and 106 more samples (columns)

4 Community analysis:

There are three main dimension of community analysis using metbarcoding data: alpha diversity, beta diversity and ASV/ taxa centric appraoch

4.1 Alpha-diversity:

Alpha diversity is often defined as univariate within sample community characteristic, for more information please refer to slides 63, 66 to 79 Mahendra Mariadassou’s Lecture Chao1 and ACE are richness estimator trying to extrapolate the total richness observed + not detected due to unsufficient sequencing depth.

4.1.1 phyloseq::plot_richness()

Using the rarefied physeq_1_rare object we are going to visualize alpha-diversity metrics:

physeq_1_rare %>%
  plot_richness(measures = c("Observed", "Shannon", "ACE"),
                shape = "treatment_invivo") + 
  facet_grid(variable ~ treatment_invivo, scale = "free",  space = "fixed", drop = TRUE) -> alpha_plot

alpha_plot %>%
  print()

The function plot_richness() actually use estimate_richness() and then ggplot() to plot alpha_diversity values. alpha_plot is the ggplot and data are stored in alpha_plot$data:

alpha_plot$data %>%
  DT::datatable()

Let’s modify the plot removing ACE values, and a one sample (for demonstration purpose):

alpha_plot$data %>%
  filter(variable != "ACE", samples != "S_234") %>% # we can filter anything, metadata, alphadiversity metric
  ggplot(aes(treatment_invivo,
             value,
             colour = treatment_invivo,
             fill = treatment_invivo,
             shape = treatment_invivo)) +
  facet_grid(variable ~ treatment_invivo ,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA, alpha = 0.2) +
  geom_point(size=2,position=position_jitterdodge(dodge.width = 0.2)) +
  ylab("Diversity indices") + xlab(NULL) + theme_bw()

4.1.2 Perform statistical evaluation:

The data are stored in long format:

alpha_plot$data %>%
  head() %>% 
  DT::datatable()

You can transform in wide format and export your data to a tsv file and run statistical tests outside of R:

alpha_plot$data %>%
  filter(variable != "ACE", samples != "S_234") %>% # we can filter anything, metadata, alphadiversity metric
  pivot_wider(names_from = variable, values_from = value) -> alpha_wide

alpha_wide %>% 
  head() %>% 
  DT::datatable()
alpha_wide %>% 
  write_tsv("alphadiv.tsv") # check the path

Or you can do it in R using an anova aov:

aov(Observed ~  treatment_invivo, alpha_wide) %>%
  summary()
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## treatment_invivo   3 332333  110778   34.55 7.54e-16 ***
## Residuals        111 355891    3206                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggpubr::compare_means() function allows to nicely compare among multiple groups for all the variables.

alpha_plot$data %>%
  filter(variable != "ACE", samples != "S_234") %>%  
  ggpubr::compare_means(value ~ treatment_invivo, 
                        group.by = "variable", 
                        data = ., # . indicates that the object is coming from the pipe above.
                        method = "wilcox.test")
## # A tibble: 12 × 9
##    variable .y.   group1 group2             p    p.adj p.format  p.signif method
##    <fct>    <chr> <chr>  <chr>          <dbl>    <dbl> <chr>     <chr>    <chr> 
##  1 Observed value none   TreatA 0.0000000235    2.6e-7 0.000000… ****     Wilco…
##  2 Observed value none   TreatB 0.0000000463    4.6e-7 0.000000… ****     Wilco…
##  3 Observed value none   TreatC 0.00000000570   6.8e-8 0.000000… ****     Wilco…
##  4 Observed value TreatA TreatB 0.286           1  e+0 0.286     ns       Wilco…
##  5 Observed value TreatA TreatC 0.778           1  e+0 0.778     ns       Wilco…
##  6 Observed value TreatB TreatC 0.280           1  e+0 0.280     ns       Wilco…
##  7 Shannon  value none   TreatA 0.0538          4.8e-1 0.054     ns       Wilco…
##  8 Shannon  value none   TreatB 0.334           1  e+0 0.334     ns       Wilco…
##  9 Shannon  value none   TreatC 0.167           1  e+0 0.167     ns       Wilco…
## 10 Shannon  value TreatA TreatB 0.307           1  e+0 0.307     ns       Wilco…
## 11 Shannon  value TreatA TreatC 0.549           1  e+0 0.549     ns       Wilco…
## 12 Shannon  value TreatB TreatC 0.884           1  e+0 0.884     ns       Wilco…

4.1.3 Compute alpha-diversity indices.

phyloseq_alphas() function compute all kind of alpha diversity metrics. If phylo = TRUE, it will also compute phylogenetic metrics -if you have a phylogenetic tree included in your phyloseq object - but could take quite some time. The idea is not to loose too much time exploring many indices, but on the other hand it can help to better understand the pattern and guide the analysis (phylogeny, richness, evenness).

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_alpha.R")

physeq_1_rare %>%
  phyloseq_alphas(phylo = FALSE) -> alphas

Let’s check the colanmes and the computed indices:

alphas %>% 
  colnames()
##  [1] "id...1"                     "Sample"                    
##  [3] "input"                      "filtered"                  
##  [5] "filtered_pc"                "denoisedF"                 
##  [7] "denoisedR"                  "denoisedF_pc"              
##  [9] "denoisedR_pc"               "merged"                    
## [11] "merged_pc"                  "tabled"                    
## [13] "chimera_out"                "length_filtered"           
## [15] "tabled_pc"                  "chimera_out_pc"            
## [17] "length_filtered_pc"         "day"                       
## [19] "treatment"                  "mouse_label"               
## [21] "BW"                         "BW_percent"                
## [23] "BW_delta"                   "DAI"                       
## [25] "treatment_grouped"          "treatment_invivo"          
## [27] "day_num"                    "id...28"                   
## [29] "observed"                   "chao1"                     
## [31] "diversity_inverse_simpson"  "diversity_gini_simpson"    
## [33] "diversity_shannon"          "diversity_fisher"          
## [35] "diversity_coverage"         "evenness_camargo"          
## [37] "evenness_pielou"            "evenness_simpson"          
## [39] "evenness_evar"              "evenness_bulla"            
## [41] "dominance_dbp"              "dominance_dmn"             
## [43] "dominance_absolute"         "dominance_relative"        
## [45] "dominance_simpson"          "dominance_core_abundance"  
## [47] "dominance_gini"             "rarity_log_modulo_skewness"
## [49] "rarity_low_abundance"       "rarity_rare_abundance"     
## [51] "id...51"                    "Observed"                  
## [53] "Chao1"                      "se.chao1"                  
## [55] "ACE"                        "se.ACE"
alphas %>% 
  head() %>% 
  DT::datatable()

4.1.4 Plot the data and run basic stats:

plot_alphas() function takes as an input the output from phyloseq_alphas(). As always, the alphas dataframe can be manipulated/filtered and plot as you wish with ggplot2.

alphas %>%
  plot_alphas(measure = c("Observed", "diversity_shannon", "evenness_pielou", "dominance_gini"),
              x_group = "treatment_invivo",
              colour_group = "treatment_invivo",
              fill_group = "treatment_invivo",
              shape_group = NULL,
              facet_group = ".",
              test_group = "treatment_invivo",
              test_group_2 = NULL) -> out

The function generates two outputs. The first is a plot stored in the $plot object.

out$plot %>%
  print()

The second is a data frame of pairwise non parametric tests for all indices. We can check the significant comparisons. We will use adjusted p values here.

out$stat %>%
  filter(p.adj <= 0.05) %>% 
  arrange(p.adj) %>%
  DT::datatable()

We can also try to correlate alpha-diversity metrics with metadata - here Body Weight - using correlate_alpha() function.

alphas %>%
  correlate_alpha(colnums_to_plot = c("Observed", "diversity_shannon",
                                      "BW"),
                  method = "spearman") -> alpha_corr

alpha_corr

N.B.: correlations between alpha-diversity metrics are also included here.

4.2 Beta-diversity:

Beta-diversity is often defined as multidimensional between sample community characteristic, for more information please refer to slides 63, 80 to 92 of Mahendra Mariadassou’s Lecture. This type of data is multidimensional and we need specific multidimensional approaches to visualize analyze such data. Beta-diversity depict similarity between samples based on a synthetic matrix which contains pairwise similarity/dissimilarity values for all the samples combinations. There are phylogenetic / non phylogenetic beta-diversity metrics and abundance or presence based metrics. The most used are Bray-Curtis and Binary-Jaccard: non-phylogenetic abundance and presence based metrics, respectively as well as weighted and unweighted unifrac metrics: phylogenetic abundance and presence based metrics, respectively. The use of phylogenetic metrics was introduced at a time when clustering was apply to analyze metabarcoding data. This led to noisy OTU and the use of phylogenetic metric allowed to lower the influence of very similar OTU sequences see here. I now prefer to use binary and weigthed jaccard metrics when exploring a dataset. Aitchinson distance is now popular again but is usually very corelated with weighted jaccard.

First, we generate a distance matrix from the phyloseq_object:

physeq_1_rare %>%
  phyloseq::distance(method = "bray") -> bc

4.2.1 Visualisation:

4.2.1.1 Heatmap:

We can visualize the distance matrix using heatmap() function:

bc %>%
  as.matrix() %>%
  heatmap()

As you can see it is a symetric matrix: the diagonal are 0 values dissimilarity or distance between same sample = 0. Difficult to identify patterns from there.

4.2.2 PCoA/NMDS:

PCoA and NMDS are unconstrained ordination techniques in order to visualize data with high dimensions. They are similar as a PCA but the input distance matrix can be specified (PCA is based only on euclidean distance.). More information on PCoA or NMDS

physeq_1_rare %>%
  ordinate(method = "PCoA",
           distance = bc) -> ord

Samples coordinate on the PCoA axis are stored in but plot_ordination can make use of ord object we have just created easily.

ord$vectors[c(1:10), c(1,2)]
##            Axis.1      Axis.2
## S_219 -0.18477138 -0.10158338
## S_220 -0.23657906  0.04852235
## S_221 -0.24458723  0.05939406
## S_222 -0.23035443 -0.02873114
## S_223 -0.12984620 -0.24477378
## S_224 -0.07919993 -0.34760950
## S_225 -0.21062096 -0.03771307
## S_226 -0.22533786  0.01167839
## S_227 -0.23866155 -0.11826223
## S_228 -0.25078434  0.07550421

Plot ordination can be then used to plot the ordination object:

physeq_1_rare %>% 
  plot_ordination(physeq = ., 
                  ordination = ord,
                  color = "BW", 
                  shape = "treatment_invivo", 
                  title = "PCoA Bray-Curtis", 
                  label= "Sample") + 
  theme_bw()

4.2.2.1 Dendrogram:

source("https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/415075fb7f690f6ac39eab18f62aeb4e4967ef84/R/graphical_methods.R")

require(ape)
## Loading required package: ape
physeq_1_rare %>% 
  plot_clust(physeq = ., 
             dist = bc, 
             # label = "desc01",
             color = "treatment_invivo")

Similar patterns as we saw on the PCoA. Makes sense because the same distance object is an input ;)

4.2.3 Distance boxplot:

Another way to visualize pairwise distance comparaison is to plot them as boxplot. The phyloseq_distance_boxplot() performs all pairwise distance comparisons between the group in the facet header, with all of the other groups (including itself). Each boxplot is shown twice, once in each group. For example, “Mock” in the “Soil” facet has the same data has “Soil” in the “Mock” facet.

Red is the within group comparisons, which is why it is typically lower distance compared to the black boxplots. There are some within group comparisons that are actually more distant than comparing across groups.

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_beta.R")

physeq_1_rare %>% 
  phyloseq_distance_boxplot(p = ., # phyloseq object
                            dist = bc, # distance matrix
                            d = "treatment_invivo") -> out # column in metadata for comparaisons

The function generates the following plot:

out$plot + xlab(NULL) 

We can see samples from TreatA, B and C are mainly distant from the “none” group.

As well as the distance matrix that can be used to customize distance boxplot.

out$matrix %>%
  head()
##    Var1  Var2     value Type1 Type2
## 1 S_220 S_219 0.4986739  none  none
## 2 S_221 S_219 0.4546866  none  none
## 3 S_222 S_219 0.3976324  none  none
## 4 S_223 S_219 0.4727343  none  none
## 5 S_224 S_219 0.5096707  none  none
## 6 S_225 S_219 0.4509994  none  none

4.2.4 Statistical tests

4.2.4.1 PERMANOVA:

Let’s check if the observed patterns we visualized in PCoA, dendrogram are significant using PERMANOVA i.e., adonis function from vegan:

vegan::adonis2(bc ~ get_variable(physeq_1_rare, "treatment_invivo"))
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## vegan::adonis2(formula = bc ~ get_variable(physeq_1_rare, "treatment_invivo"))
##                                                  Df SumOfSqs      R2      F
## get_variable(physeq_1_rare, "treatment_invivo")   3    4.223 0.21616 10.295
## Residual                                        112   15.314 0.78384       
## Total                                           115   19.537 1.00000       
##                                                 Pr(>F)    
## get_variable(physeq_1_rare, "treatment_invivo")  0.001 ***
## Residual                                                  
## Total                                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on Bray-Curtis distance treatment_invivo is significant which means there is at least one group different from the others in terms ob beta-diversity/ community similarity.

I have computed some functions to help with the process:

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_beta.R") 

physeq_1_rare %>% 
  phyloseq_adonis2(dm = bc,
                   formula = "treatment_invivo")
##              terms  Df  SumOfSqs        R2        F Pr..F.
## 1 treatment_invivo   3  4.223042 0.2161567 10.29524  0.001
## 2         Residual 112 15.313904 0.7838433       NA     NA
## 3            Total 115 19.536946 1.0000000       NA     NA

The ‘Pr..F.’ is the pvalue - significant here - and the R2 indicates the strength or the % of explained variation. Here treatment_invivo explains 22% of the variations and is significant.

When having more than two levels, it makes sense to run pairwise comparisons:

physeq_1_rare %>% 
  physeq_pairwise_permanovas_adonis2(dm = bc,
                                     compare_header = "treatment_invivo") %>% 
  arrange(pvalFDR)
##       X1     X2         R2  pval pvalBon pvalFDR
## 1   none TreatA 0.17801503 0.001   0.006   0.003
## 2   none TreatB 0.19216850 0.001   0.006   0.003
## 3   none TreatC 0.17722244 0.001   0.006   0.003
## 4 TreatA TreatC 0.03931906 0.119   0.714   0.143
## 5 TreatB TreatC 0.03443042 0.113   0.678   0.170
## 6 TreatA TreatB 0.02233659 0.446   2.676   0.446

When running many statistical tests we increase the risk of false positive results and we need to correct the pvalues. Here we have the raw pvalues pval and the corrected ones using bonnferroni or FDR approaches. The results clearly showed mice from the group TreatA, B and C have different microbiota as compared to the non group (i.e., the Control). BTW it was already the case when we run the statistical test on the alpha-diversity metrics comparing different groups.

We can also test the effect of the combinaition of different variables in explaining the variation of beta-diversity.

physeq_1_rare %>% 
  phyloseq_adonis2(dm = bc,
                   formula = " BW + treatment_invivo")
##              terms  Df  SumOfSqs         R2        F Pr..F.
## 1               BW   1  1.287502 0.06590091 9.601193  0.001
## 2 treatment_invivo   3  3.364545 0.17221449 8.363387  0.001
## 3         Residual 111 14.884898 0.76188460       NA     NA
## 4            Total 115 19.536946 1.00000000       NA     NA
physeq_1_rare %>% 
  phyloseq_adonis2(dm = bc,
                   formula = " treatment_invivo + BW")
##              terms  Df   SumOfSqs         R2         F Pr..F.
## 1 treatment_invivo   3  4.2230424 0.21615673 10.497389  0.001
## 2               BW   1  0.4290052 0.02195866  3.199187  0.003
## 3         Residual 111 14.8848984 0.76188460        NA     NA
## 4            Total 115 19.5369460 1.00000000        NA     NA

Note that by default the order matters and the first variable will be fitted against the community distance. The residuals will be fed to the other variable. By default the terms_margins argument is set on “terms”.

We can also check the interaction between the two:

physeq_1_rare %>% 
  phyloseq_adonis2(dm = bc,
                   formula = " treatment_invivo * BW") # same as treatment_invivo + BW + treatment_invivo:BW
##                 terms  Df   SumOfSqs         R2         F Pr..F.
## 1    treatment_invivo   3  4.2230424 0.21615673 10.858025  0.001
## 2                  BW   1  0.4290052 0.02195866  3.309094  0.004
## 3 treatment_invivo:BW   3  0.8833163 0.04521261  2.271128  0.001
## 4            Residual 108 14.0015820 0.71667199        NA     NA
## 5               Total 115 19.5369460 1.00000000        NA     NA

treatment_invivo variations explains most of the variations but BW Body Weight also explains some variations. Interestingly, the interaction between the 2: treatment_invivo:BW is also significant meaning BW do not impact the communities in a similar way among the differnet treatments.

We can also use terms_margins = "margin"which allows to test for the effect of the variables after controlling the effect of the other.

physeq_1_rare %>% 
  phyloseq_adonis2(dm = bc,
                  formula = " treatment_invivo + BW",
                  terms_margins = "margin")
##              terms  Df   SumOfSqs         R2        F Pr..F.
## 1 treatment_invivo   3  3.3645451 0.17221449 8.363387  0.001
## 2               BW   1  0.4290052 0.02195866 3.199187  0.005
## 3         Residual 111 14.8848984 0.76188460       NA     NA
## 4            Total 115 19.5369460 1.00000000       NA     NA

More details here.

4.2.4.2 Beta Dispersion:

In order to conclude we need to check PERMANOVA’s main assumption: variance homoscedasticity.

physeq_1_rare %>% 
  physeq_betadisper(dm = bc,
                    variable = "treatment_invivo") -> betadisp

betadisp$pval
## [1] 0.089

Variation in communities among the different groups do not differ in terms of spread of beta diversity. More info here.

4.2.4.3 Multivariate Welch t-test:

As an alternative to PERMANOVA (i.e., `vegan::adonis``) - and espacially in the case of unbalanced and heteroscedastic data we can use distance multivariate Welch t-test.

physeq_1_rare %>%
  phyloseq_TW(dm = bc, 
              physeq = .,
              variable = "treatment_invivo") -> out 

out %>% 
  DT::datatable()

Multivariate Welch t-test confirmed the patterns.

4.2.4.4 db-RDA:

Constrained ordination detects how much of community variations is explained / correlated with variables. See here

phyloseq_dbRDA() takes a phyloseq object, a distance matrix and formula with the covariable(s) to test.

physeq_1 %>%
  phyloseq_dbRDA(dm = bc,
                 ps = .,
                 group_plot = "treatment_invivo",
                 forumla = paste0(c("BW", "DAI"), collapse=" + ")) -> out

The function generates four outputs: 1- the classic vegan dbRDA:

out$dbRDA
## Call: vegan::capscale(formula = dist ~ BW + DAI, data = metadata, add =
## TRUE)
## 
##                Inertia Proportion Rank
## Total         31.45909    1.00000     
## Constrained    2.20316    0.07003    2
## Unconstrained 29.25593    0.92997  113
## Inertia is Lingoes adjusted squared Unknown distance 
## 
## Eigenvalues for constrained axes:
##   CAP1   CAP2 
## 1.8456 0.3576 
## 
## Eigenvalues for unconstrained axes:
##  MDS1  MDS2  MDS3  MDS4  MDS5  MDS6  MDS7  MDS8 
## 3.891 2.553 1.582 1.142 0.934 0.770 0.747 0.657 
## (Showing 8 of 113 unconstrained eigenvalues)
## 
## Constant added to distances: 0.1036708

2- the ordination plot:

out$plot2

3- the anova results to test the significance of the model overall:

out$anova_all %>%
  DT::datatable()

4- the anova results testing the influence of each of the covaraibles:

out$anova_terms %>%
  DT::datatable()

Here The analysis confirm that Body Weight is correlating with the differences in BW Body Weight. When coloring the samples by treatment group we see that treatment also impact body weight.

4.2.4.5 partial db-RDA:

This approach seeks to remove the effect of one or more explanatory variables on a set of response variables prior to a standard RDA. This may be useful when well-characterised variables with strong effects obscure the effects of more interesting explanatory variables. This can be achieved with vegan::capscale and the use of Condition(variable) in the formula. Here we can try to remove the effect of donor microbiota + fermentation and see constrained ordination plot as well as the statistics. See here

physeq_1 %>% 
  phyloseq_dbRDA(ps = .,
                 dm = bc,
                 forumla = " treatment_invivo + Condition(BW)",
                 group_plot = "treatment_invivo",
                 vec_ext = 0.4) -> dbRDA_test

dbRDA_test$anova_all %>% 
  mutate_if(is.numeric, round, 3) %>%
  DT::datatable()

We can access the dbRDA object displaying constrained and unconstrained variations as well as conditioned:

dbRDA_test$dbRDA
## Call: vegan::capscale(formula = dist ~ treatment_invivo +
## Condition(BW), data = metadata, add = TRUE)
## 
##                Inertia Proportion Rank
## Total         31.45909    1.00000     
## Conditional    1.39117    0.04422    1
## Constrained    3.67556    0.11684    3
## Unconstrained 26.39236    0.83894  111
## Inertia is Lingoes adjusted squared Unknown distance 
## 
## Eigenvalues for constrained axes:
##   CAP1   CAP2   CAP3 
## 3.0445 0.3757 0.2553 
## 
## Eigenvalues for unconstrained axes:
##   MDS1   MDS2   MDS3   MDS4   MDS5   MDS6   MDS7   MDS8 
## 3.1580 1.6347 1.2949 1.0873 0.8306 0.7255 0.6542 0.5580 
## (Showing 8 of 111 unconstrained eigenvalues)
## 
## Constant added to distances: 0.1036708

We can access the constrained ordination plot:

dbRDA_test$plot2
## Warning: Removed 114 rows containing missing values (geom_point).

# can be fine tuned with ?ggord()
dbRDA_test$anova_terms
##                   Df  SumOfSqs        F Pr..F.
## treatment_invivo   3  3.675558 5.152841  0.001
## Residual         111 26.392361       NA     NA

Since treatment has an effect we can further investigate the pairwise comparisons accross levels of treatment:

# physeq_1 %>% 
# phyloseq_pairwise_dbRDA(ps = .,
#                         dm = dm,
#                         RHS_formula = " treatment_invivo + Condition(BW)") -> pw_db_treat_cond
# 
# pw_db_treat_cond %>%
#   mutate_if(is.numeric, round, 3) %>%
#   DT::datatable()

4.2.4.6 Usefull functions:

phyloseq_compute_bdiv() function compute all kind of beta diversity distances. If phylo = TRUE, it will also compute phylogenetic metrics -if you have a phylogenetic tree included in your phyloseq object - but will take some time. The idea is not to loose too much time exploring many distances, but on the other hand it can help to better understand the pattern and guide the analysis. The included distances are binary jaccard: bjaccard, weigthed jaccard: wjaccard, bray-curtis, un-weighted unifrac: uunifrac, wunifrac and d_0 and d_0.5 from the generalized unifrac.

physeq_1_rare %>% 
  phyloseq_compute_bdiv(phylo = TRUE) -> dlist
## Loading required package: GUniFrac
## Registered S3 method overwritten by 'rmutil':
##   method         from
##   print.response httr
dlist %>% 
  summary()
##          Length Class Mode   
## bray     6670   dist  numeric
## bjaccard 6670   dist  numeric
## wjaccard 6670   dist  numeric
## uunifrac 6670   dist  numeric
## wunifrac 6670   dist  numeric
## d_0      6670   dist  numeric
## d_0.5    6670   dist  numeric

We can remove some of those distance from the list:

dlist$bray <- NULL
dlist$d_0.5 <- NULL
dlist$d_0 <- NULL

Compute ordinations:

physeq_1_rare %>% 
  phyloseq_plot_bdiv(dlist = dlist, # list of distance computed from a phyloseq object
                     ps_rare = ., # phyloseq object
                     m = "PCoA", # PCoA or NMDS
                     seed = 123, # for reproducibility
                     axis1 = 1, # axis to plot
                     axis2 = 2) -> plot_list
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "uunifrac"
## [1] "wunifrac"

Explore all the ordination plots:

plot_list %>%
  phyloseq_plot_ordinations_facet(color_group = "BW",
                                  shape_group = "treatment_invivo") -> pcoas
## Warning in if (class(plot_list[[1]]) == "list") {: the condition has length > 1
## and only the first element will be used
pcoas

Check % of explained variance for all ordinations:

phyloseq_ordinations_expl_var() generates:

plot_list %>%
  phyloseq_ordinations_expl_var() %>%
  DT::datatable()

Perform PERMANOVA on all distances using lapply:

A way to apply a function to a list is to use lapply. This is an example to run PERMANOVA (Adonis vegan function on a list of distances). We can test here if SampleType influences community distances:

lapply(
  dlist,
  FUN = phyloseq_adonis2,
  physeq = physeq_1_rare,
  formula = paste0(c("treatment_invivo"), collapse=" * "),
  nrep = 999,
  strata = FALSE
) %>%
  bind_rows(.id = "Distance") %>%
  mutate_if(is.numeric, round, 3) %>%
  # filter(! terms %in% (c("Residuals", "Total"))) %>%
  DT::datatable()

Since it is significant, we can run pairwise comparisons to further investigate which groups are different from the other(s).

lapply(
  dlist,
  FUN = physeq_pairwise_permanovas_adonis2,
  physeq = physeq_1_rare,
  compare_header = "treatment_invivo",
  n_perm = 999,
  strat = FALSE
) %>%
  bind_rows(.id = "Distance") %>%
  mutate_if(is.numeric, round, 3) %>%
  # filter(! terms %in% (c("Residuals", "Total"))) %>%
  DT::datatable()

We could do the same for betadisper:

lapply(
  dlist,
  FUN = physeq_betadisper,
  physeq = physeq_1_rare,
  variable = "treatment_invivo"
)

## $bjaccard
## $bjaccard$pval
## [1] 0.041
## 
## $bjaccard$plot
## $bjaccard$plot$stats
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3222634 0.3539865 0.3362548 0.3467928
## [2,] 0.3561419 0.3616801 0.3595863 0.3602318
## [3,] 0.3769350 0.3769323 0.3851296 0.3890985
## [4,] 0.3995027 0.4007802 0.4204568 0.4111401
## [5,] 0.4523347 0.4359965 0.4790630 0.4524837
## 
## $bjaccard$plot$n
## [1] 53 18 24 21
## 
## $bjaccard$plot$conf
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3675244 0.3623711 0.3654979 0.3715461
## [2,] 0.3863456 0.3914935 0.4047613 0.4066509
## 
## $bjaccard$plot$out
##     S_251     S_264     S_303     S_304     S_348 
## 0.4963659 0.4716019 0.6169250 0.6568167 0.6003817 
## 
## $bjaccard$plot$group
## [1] 1 1 4 4 4
## 
## $bjaccard$plot$names
## [1] "none"   "TreatA" "TreatB" "TreatC"
## 
## 
## 
## $wjaccard
## $wjaccard$pval
## [1] 0.127
## 
## $wjaccard$plot
## $wjaccard$plot$stats
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3097886 0.3429259 0.3568736 0.3570027
## [2,] 0.3596359 0.3826550 0.4045826 0.3942019
## [3,] 0.3918497 0.4411128 0.4269739 0.4564468
## [4,] 0.5011181 0.4822541 0.5055536 0.5733960
## [5,] 0.6993111 0.5996836 0.6005157 0.7636644
## 
## $wjaccard$plot$n
## [1] 53 18 24 21
## 
## $wjaccard$plot$conf
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3611439 0.4040211 0.3944092 0.3946635
## [2,] 0.4225556 0.4782044 0.4595387 0.5182301
## 
## $wjaccard$plot$out
##     S_251 
## 0.7449809 
## 
## $wjaccard$plot$group
## [1] 1
## 
## $wjaccard$plot$names
## [1] "none"   "TreatA" "TreatB" "TreatC"
## 
## 
## 
## $uunifrac
## $uunifrac$pval
## [1] 0.002
## 
## $uunifrac$plot
## $uunifrac$plot$stats
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.1471045 0.1728876 0.1670796 0.1865194
## [2,] 0.1811844 0.2040400 0.2059661 0.2023125
## [3,] 0.1960632 0.2208035 0.2263794 0.2136097
## [4,] 0.2130277 0.2376040 0.2388214 0.2299618
## [5,] 0.2519337 0.2877896 0.2848743 0.2614012
## 
## $uunifrac$plot$n
## [1] 53 18 24 21
## 
## $uunifrac$plot$conf
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.1891522 0.2083039 0.2157830 0.2040767
## [2,] 0.2029741 0.2333031 0.2369758 0.2231428
## 
## $uunifrac$plot$out
##     S_251     S_303     S_304     S_348 
## 0.3352518 0.4396205 0.4944090 0.3922345 
## 
## $uunifrac$plot$group
## [1] 1 4 4 4
## 
## $uunifrac$plot$names
## [1] "none"   "TreatA" "TreatB" "TreatC"
## 
## 
## 
## $wunifrac
## $wunifrac$pval
## [1] 0.067
## 
## $wunifrac$plot
## $wunifrac$plot$stats
##            [,1]      [,2]      [,3]      [,4]
## [1,] 0.08547445 0.1127499 0.1042706 0.1303766
## [2,] 0.11738724 0.1462398 0.1548101 0.1473732
## [3,] 0.13688849 0.1791199 0.1733416 0.2142894
## [4,] 0.19618800 0.2361543 0.2618620 0.2866256
## [5,] 0.28471590 0.3586046 0.3696712 0.4442491
## 
## $wunifrac$plot$n
## [1] 53 18 24 21
## 
## $wunifrac$plot$conf
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.1197864 0.1456349 0.1388156 0.1662774
## [2,] 0.1539906 0.2126049 0.2078675 0.2623014
## 
## $wunifrac$plot$out
##     S_224     S_229     S_231     S_251     S_308     S_310     S_353     S_471 
## 0.3296568 0.4615768 0.4549740 0.5386644 0.3302266 0.3468041 0.3158119 0.3382327 
##     S_474     S_478     S_303     S_304 
## 0.4846597 0.3582213 0.6905075 0.6993610 
## 
## $wunifrac$plot$group
##  [1] 1 1 1 1 1 1 1 1 1 1 4 4
## 
## $wunifrac$plot$names
## [1] "none"   "TreatA" "TreatB" "TreatC"

3d PCoA:

physeq_1_rare %>% 
  phyloseq_plot_PCoA_3d(ps_rare = .,
                        dlist = dlist,
                        shape = FALSE,
                        color = "treatment_invivo")
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "uunifrac"
## [1] "wunifrac"
## $bjaccard
## 
## $wjaccard
## 
## $uunifrac
## 
## $wunifrac

adding taxa correlated with the ordination pattern:

First, generate a PCoA:

physeq_1_rare %>% 
  plot_ordination(physeq = ., 
                  ordination = ord,
                  color = "treatment_invivo", 
                  title = "PCoA Bray-Curtis", 
                  label= NULL) + 
  theme_bw()  -> pca

pca

phyloseq_add_taxa_vector():

We can use then the function phyloseq_add_taxa_vector() to add top correlated taxa, here at the Genus level. The function uses `vegan::envfit()``.

physeq_1 %>% 
  phyloseq_add_taxa_vector(phyloseq = .,
                           dist = bc,
                           join_cbind = "join",
                           tax_rank_plot = "Genus",
                           figure_ord = pca, 
                           fact = 0.25, pval_cutoff = 0.05,
                           top_r = 10) -> out #top 10 correlated genera

out$plot

We can export the data (correlation and taxa) of all the significant correlations:

out$signenvfit
##         Axis.1      Axis.2      id  pval         r  Kingdom          Phylum
## 1  -0.07227232  0.60108796 ASV0385 0.001 0.3665300 Bacteria   Bacteroidetes
## 2   0.76901561 -0.09789388 ASV0411 0.001 0.6009682 Bacteria   Bacteroidetes
## 3   0.27075224  0.23924967 ASV0555 0.001 0.1305472 Bacteria   Bacteroidetes
## 4  -0.57298504 -0.26051444 ASV1197 0.001 0.3961796 Bacteria  Actinobacteria
## 5   0.23587135  0.63187487 ASV1253 0.001 0.4549011 Bacteria Verrucomicrobia
## 6  -0.37024644  0.36835532 ASV1386 0.001 0.2727681 Bacteria  Proteobacteria
## 7   0.16372285  0.52030672 ASV1635 0.001 0.2975243 Bacteria   Bacteroidetes
## 8   0.58105525 -0.17341278 ASV2364 0.001 0.3676972 Bacteria      Firmicutes
## 9   0.01139161 -0.85679816 ASV3987 0.001 0.7342329 Bacteria      Firmicutes
## 10  0.43448797  0.10670162 ASV4836 0.001 0.2001650 Bacteria      Firmicutes
##                  Class                 Order              Family
## 1          Bacteroidia         Bacteroidales      Prevotellaceae
## 2          Bacteroidia         Bacteroidales      Marinifilaceae
## 3          Bacteroidia         Bacteroidales       Rikenellaceae
## 4       Coriobacteriia      Coriobacteriales     Eggerthellaceae
## 5     Verrucomicrobiae    Verrucomicrobiales     Akkermansiaceae
## 6  Gammaproteobacteria Betaproteobacteriales    Burkholderiaceae
## 7          Bacteroidia         Bacteroidales      Tannerellaceae
## 8           Clostridia         Clostridiales     Ruminococcaceae
## 9           Clostridia         Clostridiales     Lachnospiraceae
## 10    Erysipelotrichia    Erysipelotrichales Erysipelotrichaceae
##                    tax_rank_plot                     Species
## 1         Prevotellaceae_UCG-001                     unknown
## 2                    Odoribacter                     unknown
## 3                      Alistipes                     unknown
## 4                  Enterorhabdus                     unknown
## 5                    Akkermansia     Akkermansia_muciniphila
## 6                 Parasutterella                     unknown
## 7                Parabacteroides Parabacteroides_goldsteinii
## 8        Ruminococcaceae_UCG-014                     unknown
## 9  Lachnospiraceae_NK4A136_group                     unknown
## 10                  Turicibacter                     unknown

phyloseq_add_metadata_vector() allows to see correlation between ordination and metadata also using vegan::envfit().

physeq_1 %>% 
  phyloseq_add_metadata_vector(dist = bc,
                               phyloseq = .,
                               figure_ord = out$plot,
                               metadata_sel = c("BW_percent", "BW",  "colon_weight"),# , "histo_colo_total"),
                               top_r = 10, fact = 0.6, na.rm = TRUE) -> out2

out2$plot

phyloseq_generate_pcoa_per_variables() can be usefull to generate PCoA per variable levels. We can apply this on the list of distances.

We will first reduce the distances to wjaccard and bjaccard:

dlist$uunifrac <- NULL
dlist$wunifrac <- NULL

dlist %>% 
  names()
## [1] "bjaccard" "wjaccard"
physeq_1 %>% 
  phyloseq_generate_pcoa_per_variables(tmp = ,
                                     group = "treatment_invivo",
                                     dist = dlist) -> pcoa_group
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
pcoa_group %>% 
  phyloseq_plot_ordinations_facet(plot_list = ., 
                                  return_eig_df = TRUE, 
                                  color_group = "day")
## $plots

## 
## $eig_df
##           .id...1           V1...2         .id...3           V1...4
## 1   none.bjaccard Axis.1   [10.3%]   none.bjaccard  Axis.2   [7.4%]
## 2   none.wjaccard Axis.1   [19.4%]   none.wjaccard Axis.2   [10.1%]
## 3 TreatA.bjaccard Axis.1   [17.1%] TreatA.bjaccard Axis.2   [10.4%]
## 4 TreatA.wjaccard Axis.1   [25.2%] TreatA.wjaccard Axis.2   [13.9%]
## 5 TreatB.bjaccard Axis.1   [17.6%] TreatB.bjaccard  Axis.2   [8.6%]
## 6 TreatB.wjaccard Axis.1   [24.4%] TreatB.wjaccard Axis.2   [12.2%]
## 7 TreatC.bjaccard   Axis.1   [23%] TreatC.bjaccard   Axis.2   [11%]
## 8 TreatC.wjaccard Axis.1   [24.6%] TreatC.wjaccard Axis.2   [16.6%]

4.3 Taxa/ASV centric appraoches:

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_taxa_tests.R") 

phyloseq_run_DESeq2_pair_plots_formula

phyloseq_Maaslin2

phyloseq_boxplot_abundance

phyloseq_A_B_ratio

phyloseq_run_compare_means

phyloseq_run_Deseq

phyloseq_run_ALDEx2

phyloseq_correlate_taxa

4.3.1 Correlations with metadata:

4.3.2 Differential abundance tests:

4.3.3 Taxaonomic ratios:

4.4 Miscellaneous:

4.4.0.1 physeq_simplify_tax():

physeq_simplify_tax() agglomerate the table at a specific level and rename taxa id accordingly while removing the rest of the table:

physeq_1_st %>% 
  physeq_simplify_tax(tax_sel = c("Family")) -> physeq_1_st_fam

physeq_1_st_fam %>% 
  tax_table() %>% 
  head()
## Taxonomy Table:     [ 6 taxa by 8 taxonomic ranks ]:
##                    Kingdom Phylum Class Order Genus Species Strain X.Family.
##                    <chr>   <chr>  <chr> <chr> <chr> <chr>   <chr>  <chr>    
## Clostridiaceae_1   <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>     
## Prevotellaceae     <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>     
## Marinifilaceae     <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>     
## Rikenellaceae      <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>     
## Muribaculaceae     <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>     
## Saccharimonadaceae <NA>    <NA>   <NA>  <NA>  <NA>  <NA>    <NA>   <NA>

4.4.0.2 Obtain unique features:

sample_variables() allows to list the names of the columns from the sample_data:

physeq_1_st %>% 
  sample_variables()
##  [1] "Sample"             "input"              "filtered"          
##  [4] "filtered_pc"        "denoisedF"          "denoisedR"         
##  [7] "denoisedF_pc"       "denoisedR_pc"       "merged"            
## [10] "merged_pc"          "tabled"             "chimera_out"       
## [13] "length_filtered"    "tabled_pc"          "chimera_out_pc"    
## [16] "length_filtered_pc" "day"                "treatment"         
## [19] "mouse_label"        "BW"                 "BW_percent"        
## [22] "BW_delta"           "DAI"                "treatment_grouped" 
## [25] "treatment_invivo"   "day_num"

physeq_get_unique() allows to list the different levels of the categorical metadata:

physeq_1_st %>% 
  physeq_get_unique("treatment_invivo") -> treatment

treatment
## [1] none   TreatB TreatA TreatC
## Levels: none TreatA TreatB TreatC

The variable treatment_invivo() has 4 levels: none, TreatA, TreatB and TreatC.

Same for another metadata variable:

physeq_1_st %>% 
  physeq_get_unique("mouse_label") -> mice

mice
##  [1] "m_32" "m_48" "m_18" "m_7"  "m_22" "m_13" "m_34" "m_37" "m_19" "m_29"
## [11] "m_44" "m_26" "m_11" "m_2"  "m_49" "m_35" "m_46" "m_51" "m_21" "m_54"
## [21] "m_53" "m_27" "m_36" "m_39" "m_28" "m_56" "m_52" "m_24" "m_10" "m_25"
## [31] "m_8"  "m_55"

We can also apply this function to identify taxa. Below we extract the different phyla.

physeq_1_st %>% 
  physeq_get_unique("Phylum") -> phyla

phyla
## [1] "Firmicutes"      "Bacteroidetes"   "Verrucomicrobia" "Proteobacteria" 
## [5] "Cyanobacteria"   "Actinobacteria"  "Tenericutes"     "Patescibacteria"

4.4.0.3 Getting top taxa per treatment levels:

The function phyloseq_ampvis_heatmap() allows to generate taxonomic heatmaps and the parameter ntax controls the top taxa among the explored samples.

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_heatmap.R")

physeq_1_st %>% 
  phyloseq_ampvis_heatmap(physeq = ., 
                          facet_by = "treatment_invivo" , 
                          group_by = "treatment_invivo",  
                          ntax =  5) -> heat_overall

heat_overall

We might be also interested in visualizing the 5 most abundant taxa per groups: per the levels of treatment_invivo i.e., “none”, “TreatA”, “TreatB”, “TreatC”. The function physeq_most_abundant() returns a vector of the most abundant taxa per levels.

physeq_1_st %>% 
  physeq_most_abundant("treatment_invivo", 
                       tax_level = "Strain",
                       ntax = 5) -> most_ab_treat

most_ab_treat
##  [1] "Akkermansia Akkermansia_muciniphila ASV1253"          
##  [2] "unknown Alloprevotella (Genus) ASV0365"               
##  [3] "unknown Clostridium_sensu_stricto_1 (Genus) ASV0066"  
##  [4] "unknown Lachnospiraceae_NK4A136_group (Genus) ASV3104"
##  [5] "unknown Lachnospiraceae_NK4A136_group (Genus) ASV3987"
##  [6] "unknown Lactobacillus (Genus) ASV4998"                
##  [7] "unknown Muribaculaceae (Family) ASV0840"              
##  [8] "unknown Odoribacter (Genus) ASV0411"                  
##  [9] "unknown Prevotellaceae_UCG-001 (Genus) ASV0385"       
## [10] "unknown Turicibacter (Genus) ASV4836"

There are 9 ASV when considering the most abundant among the 4 groups. We can visualize those using a heatmap. We will transform the data in proportion before subseting the taxa of interest and generating the heatmap omitting the transformation.

physeq_1_st %>% 
  transform_sample_counts(function(x) x/sum(x) * 100) %>% # transform as percentage before filtering
  subset_taxa(Strain %in% most_ab_treat) %>% # extract only the taxa to display - after percentage normalisation
  phyloseq_ampvis_heatmap(physeq = ., 
                          transform = FALSE, # extract only the taxa to display - after percentage normalisation
                          facet_by = "treatment_invivo" , 
                          group_by = "treatment_invivo",  
                          ntax =  Inf) -> heat_top_taxa_per_group

heat_top_taxa_per_group

4.4.0.4 Generate color palette from metadata:

Important aspect to keep homogeneity between plots in terms of colors - group assignments.

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_varia.R") 

physeq_1_st %>% 
  generate_color_palette(var = "treatment_invivo",
                         pal = "npg",
                         print = FALSE) -> treat_pal
## Loading required package: ggpubr
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
## 
##     rotate
treat_pal
##        none      TreatB      TreatA      TreatC 
## "#E64B35FF" "#4DBBD5FF" "#00A087FF" "#3C5488FF"

We can use show_col() from the scales package to visualise the palette. Using scales::show_col() allows to run the function without loading the full package.

treat_pal %>% 
  scales::show_col(.,
                   cex_label = 0.5)

Compare rarefied vs un-rarefied alpha-diversity metrics:

This might be misleading when comparing Observed (i.e., number of observed ASV - ASV richness) since as we have seen rarefaction - with a decent sequencing depth i.e., when the the rarefaction curves starts saturating - removes some ASV which are very low abundant and prevalent. We will use that function to illustrate how we can keep colors assigned to treatment levels homogeneous and how to combine plots.

physeq_1_st %>%
  phyloseq_rarefied_unrarefied_richness(measure = "Observed", 
                                        sample_size = physeq_1_st %>%  sample_sums() %>%  min(),
                                        color_data = "treatment_invivo") -> p1

p1

We can also specify the colors manually using the treat_pal vector we created which associates levels of treatments and colors:

treat_pal
##        none      TreatB      TreatA      TreatC 
## "#E64B35FF" "#4DBBD5FF" "#00A087FF" "#3C5488FF"
p1 + scale_color_manual(values = treat_pal) -> p1

p1

The colors will also be preserved if one of the level is filtered out:

physeq_1_st %>%
  subset_samples(treatment_invivo != "none") %>% 
  phyloseq_rarefied_unrarefied_richness(measure = "Observed", 
                                        sample_size = physeq_1_st %>%  sample_sums() %>%  min(),
                                        color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) # since the output of the function is a ggplot we can use + to finetune the graph
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 759OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...
## `geom_smooth()` using formula 'y ~ x'

Using a richness estimator such as Chao1 or ACE might be more meaningfull.

physeq_1_st %>%
  phyloseq_rarefied_unrarefied_richness(measure = "Chao1", 
                                        sample_size = physeq_1_st %>%  sample_sums() %>%  min(),
                                        color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) -> p2

p2

Using Shannon we see that the value does not change much - because Shannon metric is also impacted by proportion of taxa:

physeq_1_st %>%
  phyloseq_rarefied_unrarefied_richness(measure = "Shannon", 
                                        sample_size = physeq_1_st %>%  sample_sums() %>%  min(),
                                        color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) -> p3

p3

Even rarefying with 2000 sequences will preserve the Shannon pattern:

physeq_1_st %>%
  phyloseq_rarefied_unrarefied_richness(measure = "Shannon", 
                                        sample_size = 2000,
                                        color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) 

We can combine the graphs using ggarrange from the ggpubr package:

ggpubr::ggarrange(p1,
                  p2,
                  p3,
                  common.legend = TRUE)

source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_normalisation.R") 
phyloseq_filter_per_sample_OTU

phyloseq_remove_chloro_mitho

4.4.0.5 Compare phyloseq taxonomy of two phyloseq objects:

# source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")

# physeq_1_rare %>% 
  # phyloseq_DECIPHER_tax(physeq = ., 
  #                       threshold = 60,
  #                       db="~/db/DADA2/SILVA_SSU_r132_March2018.RData"  # check where you downloaded your database
  #                       tryRC = TRUE,
  #                       return = TRUE) -> physeq_1_rare_2
# compare_phyloseq_taxonomy(physeq_A = physeq_1_rare,
#                           physeq_B = physeq_1_rare_2) %>% 
#   head() %>% 
#   DT::datatable()

4.4.0.6 Exporting plots:

Please check here

ggplot(mtcars, aes(mpg, wt)) +
  geom_point() -> p

p

ggsave("mtcars.pdf", 
       plot = p,
       device = "pdf", width = 20, height = 20, units = "cm")

4.4.0.7 Exporting tables:

as tsv.

physeq_1 %>% 
  sample_data() %>% 
  write_tsv("my_sample_data.tsv")

as an excel file.

# physeq_1 %>% 
#   sample_data() %>% 
#   xlsx::write.xlsx("my_sample_data.xlsx", 
#                    sheetName = "sample_data",
#                    append = TRUE)

Good opportunity to introduce the phloseq_export_otu_tax() function, it basically exports otu_table() and tax_table() of a phyloseq object.

source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")

physeq_1 %>% 
  phloseq_export_otu_tax() %>% 
  head() %>% 
  DT::datatable()

We can use write.xlsx under another sheet from the same excel file:

 # physeq_1 %>% 
 #  phloseq_export_otu_tax() %>% 
 #  xlsx::write.xlsx("my_sample_data.xlsx", 
 #                   sheetName = "otu_tax",
 #                   append = TRUE) # allows to add the new sheet to an existing file.

4.4.1 To explore:

physeq_1_rare %>% 
  phyloseq_ntaxa_by_tax(TaxRank = "Genus",
                        add_meta_data = FALSE) %>% 
  arrange(-N.OTU) %>% 
  DT::datatable()
## Warning in phyloseq::psmelt(physeq): The sample variables: 
## Sample
##  have been renamed to: 
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
physeq_1 %>% 
  phyloseq_coverage(., 
                    correct_singletons = FALSE, 
                    add_attr = T) %>% 
  arrange(-SampleCoverage) %>% 
  DT::datatable()
physeq_1 %>%
  phyloseq_coverage_raref(seeds = 123, multithread = TRUE)
# physeq_1 %>%
#   phyloseq_effort_div()
# physeq_1 %>%
#   phyloseq_effort_div_rangeplot(range = seq(from = 10, to = 100, by = 10),
#                                 # range_type = "coverage",
#                                 show_plot = TRUE)
# physeq_1 %>%
#   metagMisc::phyloseq_mult_raref_dist(dissimilarity = "bray",
#                            parallel = TRUE)
# physeq_1 %>%
#   phyloseq_mult_raref_div(parallel = TRUE)
physeq_1 %>% 
  phyloseq_phylo_div()
physeq_1 %>% 
  phyloseq_phylo_ses()
physeq_1 %>% 
  phyloseq_sep_samp()
physeq_1 %>% 
  phyloseq_sep_tax()
# physeq_1 %>% 
#   phyloseq_standardize_otu_abundance()
physeq_1 %>% 
  phyloseq_summary(., physeq_1_rare, more_stats = TRUE)
##                                                 Parameter           Phys1
## 1                         Average number of reads per OTU    2536.0731838
## 2                      Average number of reads per sample   40926.9741379
## 3                        Average OTU occurrence, percents      21.1322023
## 4      Coefficient of quartile variation in OTU abundance       0.9865244
## 5   Coefficient of quartile variation in sample abundance       0.1653620
## 6                         Data sparsity (number of zeros)  171263.0000000
## 7                     Data sparsity (percentage of zeros)      78.8677977
## 8                                 Max total OTU abundance  889239.0000000
## 9                              Max total sample abundance  124774.0000000
## 10                         Median number of reads per OTU      39.0000000
## 11                      Median number of reads per sample   37874.0000000
## 12                        Median OTU occurrence, percents       5.1724138
## 13                                Min total OTU abundance       1.0000000
## 14                             Min total sample abundance   15459.0000000
## 15                                         Number of OTUs    1872.0000000
## 16                                      Number of samples     116.0000000
## 17                                   Number of singletons      71.0000000
## 18                               Percentage of singletons       3.7927350
## 19                              Q1 of total OTU abundance       3.0000000
## 20                           Q1 of total sample abundance   33680.2500000
## 21                              Q3 of total OTU abundance     442.2500000
## 22                           Q3 of total sample abundance   47026.0000000
## 23                                  Total number of reads 4747529.0000000
## 231                                   Percentage of reads     100.0000000
## 151                                    Percentage of OTUs     100.0000000
##               Phys2
## 1      1132.8136450
## 2     15459.0000000
## 3        21.5653386
## 4         0.9844358
## 5         0.0000000
## 6    144028.0000000
## 7        78.4346614
## 8    347030.0000000
## 9     15459.0000000
## 10       34.0000000
## 11    15459.0000000
## 12        8.6206897
## 13        1.0000000
## 14    15459.0000000
## 15     1583.0000000
## 16      116.0000000
## 17      258.0000000
## 18       16.2981680
## 19        2.0000000
## 20    15459.0000000
## 21      255.0000000
## 22    15459.0000000
## 23  1793244.0000000
## 231      37.7721547
## 151      84.5619658
# physeq_1 %>% 
#   phyloseq_taxonomy_imputation()
# source("https://raw.githubusercontent.com/vmikk/metagMisc/master/R/phyloseq_taxonomic_resolution.R")
# 
# physeq_1 %>% 
#   phyloseq_taxonomic_resolution()
# physeq_1 %>% 
#   phyloseq_to_MetaCommunity() -> meta_out
# 
# meta_out %>% 
#   str()
physeq_1 %>% 
  phyloseq_get_strains() %>% 
  physeq_glom_rename(rename_ASV = "Strain") %>% 
  prevalence(count = TRUE, sort = TRUE) %>% 
  head()
## Joining, by = "ASV"
##        unknown Lactobacillus (Genus) ASV5082 
##                                          116 
##     unknown Lachnospiraceae (Family) ASV4233 
##                                          116 
##      unknown Peptococcaceae (Family) ASV3327 
##                                          116 
## unknown Desulfovibrionaceae (Family) ASV1903 
##                                          116 
## Bacteroides Bacteroides_acidifaciens ASV1690 
##                                          116 
##          unknown Bacteroides (Genus) ASV1684 
##                                          116
physeq_1 %>% nsamples()
## [1] 116

https://cran.r-project.org/web/packages/iNEXT/vignettes/Introduction.html

# physeq_1 %>% 
#   phyloseq_get_strains() %>% 
#   physeq_glom_rename(rename_ASV = "Strain") %>% 
#   prepare_inext(., correct_singletons = F) -> inext
# 
# inext %>% 
#   iNEXT::iNEXT(., q=c(0, 1, 2), 
#                datatype = "incidence_freq",
#                nboot = 2,
#                se = FALSE) -> out
# 
# out %>% 
#   iNEXT::ggiNEXT(., type=1, facet.var="site")
# physeq_1_st %>%
#   # subset_samples(PCR_plate != "P3" | sample_number > 30) %>%
#   prune_taxa(taxa =  taxa_sums(physeq_1_st) > 1000) %>%
#   subset_taxa(Phylum != "unknown" & Order %in% c("Clostridiales", "Bacteroidales")) %>%
#   # filter_taxa(function(x) sum(x) > 0, TRUE) %>%
#   rarefy_even_depth(sample.size = 1000, rngseed = 123)  %>%
#   tax_glom(taxrank = "Order") %>%
#   transform_sample_counts(function(x) x/sum(x) * 100)
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       macOS High Sierra 10.13.6   
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Paris                
##  date     2022-04-13                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package             * version    date       lib
##  abind                 1.4-5      2016-07-21 [1]
##  ade4                  1.7-17     2021-06-17 [1]
##  ampvis2             * 2.7.17     2022-03-25 [1]
##  ape                 * 5.5        2021-04-25 [1]
##  assertthat            0.2.1      2019-03-21 [1]
##  backports             1.4.1      2021-12-13 [1]
##  Biobase               2.50.0     2021-02-17 [1]
##  BiocGenerics          0.34.0     2020-04-27 [1]
##  biomformat            1.16.0     2020-04-27 [1]
##  Biostrings            2.56.0     2020-04-27 [1]
##  bit                   4.0.4      2020-08-04 [1]
##  bit64                 4.0.5      2020-08-30 [1]
##  bookdown              0.25       2022-03-16 [1]
##  broom                 0.7.11     2022-01-03 [1]
##  bslib                 0.2.5.1    2021-05-18 [1]
##  cachem                1.0.6      2021-08-19 [1]
##  callr                 3.7.0      2021-04-20 [1]
##  car                   3.0-12     2021-11-06 [1]
##  carData               3.0-5      2022-01-06 [1]
##  cellranger            1.1.0      2016-07-27 [1]
##  cli                   3.1.0      2021-10-27 [1]
##  cluster               2.1.2      2021-04-17 [1]
##  codetools             0.2-18     2020-11-04 [1]
##  colorspace            2.0-2      2021-06-24 [1]
##  cowplot               1.1.1      2020-12-30 [1]
##  crayon                1.4.2      2021-10-29 [1]
##  crosstalk             1.2.0      2021-11-04 [1]
##  data.table            1.14.2     2021-09-27 [1]
##  DBI                   1.1.1      2021-01-15 [1]
##  dbplyr                2.1.1      2021-04-06 [1]
##  desc                  1.3.0      2021-03-05 [1]
##  devtools              2.4.2      2021-06-07 [1]
##  digest                0.6.29     2021-12-01 [1]
##  dplyr               * 1.0.7      2021-06-18 [1]
##  DT                    0.20       2021-11-15 [1]
##  ellipsis              0.3.2      2021-04-29 [1]
##  evaluate              0.14       2019-05-28 [1]
##  fansi                 0.5.0      2021-05-25 [1]
##  fantaxtic             0.1.0      2020-08-25 [1]
##  farver                2.1.0      2021-02-28 [1]
##  fastmap               1.1.0      2021-01-25 [1]
##  forcats             * 0.5.1      2021-01-27 [1]
##  foreach               1.5.1      2020-10-15 [1]
##  fs                    1.5.2      2021-12-08 [1]
##  generics              0.1.1      2021-10-25 [1]
##  GGally                2.1.2      2021-06-21 [1]
##  gghalves              0.1.1      2020-11-10 [1]
##  ggord               * 1.1.6      2021-08-16 [1]
##  ggplot2             * 3.3.5      2021-06-25 [1]
##  ggpubr                0.4.0      2020-06-27 [1]
##  ggrepel               0.9.1      2021-01-15 [1]
##  ggsci                 2.9        2018-05-14 [1]
##  ggsignif              0.6.3      2021-09-09 [1]
##  ggvegan             * 0.1-0      2021-09-02 [1]
##  glue                  1.6.2      2022-02-24 [1]
##  gridExtra             2.3        2017-09-09 [1]
##  gtable                0.3.0      2019-03-25 [1]
##  GUniFrac            * 1.3        2021-08-18 [1]
##  haven                 2.4.3      2021-08-04 [1]
##  highr                 0.9        2021-04-16 [1]
##  hms                   1.1.0      2021-05-17 [1]
##  htmltools             0.5.2      2021-08-25 [1]
##  htmlwidgets           1.5.4      2021-09-08 [1]
##  httr                  1.4.2      2020-07-20 [1]
##  igraph                1.2.6      2020-10-06 [1]
##  iNEXT                 2.0.20     2020-01-28 [1]
##  IRanges               2.22.2     2020-05-21 [1]
##  iterators             1.0.13     2020-10-15 [1]
##  jquerylib             0.1.4      2021-04-26 [1]
##  jsonlite              1.8.0      2022-02-22 [1]
##  knitr                 1.38       2022-03-25 [1]
##  labeling              0.4.2      2020-10-20 [1]
##  lattice             * 0.20-44    2021-05-02 [1]
##  lazyeval              0.2.2      2019-03-15 [1]
##  lifecycle             1.0.1      2021-09-24 [1]
##  lubridate             1.8.0      2021-10-07 [1]
##  magrittr              2.0.3      2022-03-30 [1]
##  MASS                  7.3-54     2021-05-03 [1]
##  Matrix                1.3-4      2021-06-01 [1]
##  matrixStats           0.61.0     2021-09-17 [1]
##  memoise               2.0.0      2021-01-26 [1]
##  metagMisc           * 0.0.4      2022-04-12 [1]
##  mgcv                  1.8-36     2021-06-01 [1]
##  microbiome          * 1.10.0     2020-04-27 [1]
##  microbiomeutilities   1.00.11    2020-11-10 [1]
##  modelr                0.1.8      2020-05-19 [1]
##  multtest              2.44.0     2020-04-27 [1]
##  munsell               0.5.0      2018-06-12 [1]
##  nlme                  3.1-152    2021-02-04 [1]
##  permute             * 0.9-5      2019-03-12 [1]
##  pheatmap              1.0.12     2019-01-04 [1]
##  phyloseq            * 1.34.0     2021-02-17 [1]
##  pillar                1.6.4      2021-10-18 [1]
##  pkgbuild              1.2.0      2020-12-15 [1]
##  pkgconfig             2.0.3      2019-09-22 [1]
##  pkgload               1.2.1      2021-04-06 [1]
##  plotly                4.9.4.1    2021-06-18 [1]
##  plyr                  1.8.6      2020-03-03 [1]
##  prettyunits           1.1.1      2020-01-24 [1]
##  processx              3.5.2      2021-04-30 [1]
##  ps                    1.6.0      2021-02-28 [1]
##  purrr               * 0.3.4      2020-04-17 [1]
##  R6                    2.5.1      2021-08-19 [1]
##  RColorBrewer          1.1-2      2014-12-07 [1]
##  Rcpp                  1.0.7      2021-07-07 [1]
##  readr               * 2.0.1      2021-08-10 [1]
##  readxl                1.3.1      2019-03-13 [1]
##  remotes               2.4.0      2021-06-02 [1]
##  reprex                2.0.1      2021-08-05 [1]
##  reshape               0.8.8      2018-10-23 [1]
##  reshape2            * 1.4.4      2020-04-09 [1]
##  rhdf5                 2.32.4     2020-10-05 [1]
##  Rhdf5lib              1.10.1     2020-07-09 [1]
##  rlang                 1.0.2      2022-03-04 [1]
##  rmarkdown             2.13       2022-03-10 [1]
##  rmdformats            1.0.3.9000 2022-04-11 [1]
##  rmutil                1.1.5      2020-06-09 [1]
##  rprojroot             2.0.2      2020-11-15 [1]
##  rstatix               0.7.0      2021-02-13 [1]
##  rstudioapi            0.13       2020-11-12 [1]
##  Rtsne                 0.15       2018-11-10 [1]
##  rvest                 1.0.1      2021-07-26 [1]
##  S4Vectors             0.26.1     2020-05-16 [1]
##  sass                  0.4.1      2022-03-23 [1]
##  scales              * 1.1.1      2020-05-11 [1]
##  sessioninfo           1.1.1      2018-11-05 [1]
##  speedyseq           * 0.5.0      2020-09-29 [1]
##  statmod               1.4.36     2021-05-10 [1]
##  stringi               1.7.6      2021-11-29 [1]
##  stringr             * 1.4.0      2019-02-10 [1]
##  survival              3.2-13     2021-08-24 [1]
##  testthat              3.0.4      2021-07-01 [1]
##  tibble              * 3.1.6      2021-11-07 [1]
##  tidyr               * 1.1.4      2021-09-27 [1]
##  tidyselect            1.1.1      2021-04-30 [1]
##  tidyverse           * 1.3.1      2021-04-15 [1]
##  tzdb                  0.1.2      2021-07-20 [1]
##  usethis               2.0.1      2021-02-10 [1]
##  utf8                  1.2.2      2021-07-24 [1]
##  vctrs                 0.3.8      2021-04-29 [1]
##  vegan               * 2.5-7      2020-11-28 [1]
##  viridisLite           0.4.0      2021-04-13 [1]
##  vroom                 1.5.4      2021-08-05 [1]
##  withr                 2.4.3      2021-11-30 [1]
##  xfun                  0.30       2022-03-02 [1]
##  xml2                  1.3.2      2020-04-23 [1]
##  XVector               0.28.0     2020-04-27 [1]
##  yaml                  2.3.5      2022-02-21 [1]
##  zlibbioc              1.34.0     2020-04-27 [1]
##  source                                       
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Github (MadsAlbertsen/ampvis2@f48b708)       
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  bioc_git2r (@9927f90)                        
##  Bioconductor                                 
##  Bioconductor                                 
##  Bioconductor                                 
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.1)                               
##  CRAN (R 4.0.2)                               
##  Github (gmteunisse/Fantaxtic@4653973)        
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Github (erocoar/gghalves@0e0ea16)            
##  https://fawda123.r-universe.dev (R 4.0.5)    
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Github (gavinsimpson/ggvegan@cabb641)        
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Github (vmikk/metagMisc@9cb3131)             
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  Github (microsud/microbiomeutilities@fd0dd94)
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  bioc_git2r (@cbed93e)                        
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  Bioconductor                                 
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.5)                               
##  Github (juba/rmdformats@b151102)             
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  Github (mikemc/speedyseq@64932ce)            
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.2)                               
##  CRAN (R 4.0.5)                               
##  CRAN (R 4.0.2)                               
##  Bioconductor                                 
##  CRAN (R 4.0.5)                               
##  Bioconductor                                 
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

5 Alternative analysis approaches using user-friendlier platformes:

If you do not have experience with R and you do not want to invest time learning it or you want basic descriptions of your data, the easiest way to proceed is to use user friendly interfaces. There are several options:

5.1 R-Shiny-interfaces:

These are R based interactive interfaces enabling a smooth code free experience.

Select your data > RDS allows to directly load the phyloseq object.

5.2 qiime2:

QIIME™ (pronounced chime) stands for Quantitative Insights Into Microbial Ecology. It is a bioinformatic pipeline developed to ease the process of bioinformatic analysis of raw sequencing microbiome data - not explored here since we have processed the data using metabaRpipe. It is also useful to visualize and analyze the processed data, in a user friendly experience:

5.2.1 Generate qiime2 compatibles files from the phyloseq object:

Check the metabaRpipe tutorial here or there

Rscript metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript \
-i dada2/physeq_phylo/phyloseq_phylo.RDS \
-o qiime2 \
-f metabaRpipe/Rscripts/functions.R 

Alternatively, this can also be run from R/Rstudio:

require(tidyverse); source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")

readRDS("dada2/phyloseq.RDS") %>%  
  physeq_export_qiime(output_dir = "~/qiime2/")

This command will generate all the necessary processed files to analyze the metabarcoding data and as seen here.

# List/ check the files are here:
ls
asv.fna         asv_neweek.qza      asv_rep_set.qza     qiime2_mapping_file.txt qiime2_otu.qza      tax.txt
asv_biom.biom       asv_neweek.tre      qiime1_mapping_file.txt qiime2_metadata.qzv qiime2_taxonomy.qza

We have generated:

  • A fasta file containing the sequence of the ASV: asv.fna. This corresponds to the ps %>% refseq() in R.
  • A text file with the metadata: qiime2_mapping_file.txt. This corresponds to the ps %>% sample_data() in R.
  • A text with the taxonomic path of the ASV: tax.txt. This corresponds to the ps %>% sample_data() in R.
  • A biom file with the sample wise ASV counts: asv_biom.biom. This to the ps %>% otu_table() in R.
  • A neweek ASV phylogenetic tree : asv_neweek.tre. This to the ps %>% phy_tree() in R.

5.2.2 Import the data in qiime2:

wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-osx-conda.yml
conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-osx-conda.yml
# OPTIONAL CLEANUP
rm qiime2-2022.2-py38-osx-conda.yml
# Activate the conda environment
conda activate qiime2-2022.2
# Test your installation
qiime --help
  • Activate the environment change directory:
# Activate the qiime conda environment

conda activate qiime2-2022.2

# Navigate were the output of phyloseq_export_qiime.Rscript were generated:

cd qiime2/ 

# List/ check the files are here:
ls
asv.fna         asv_biom.biom       asv_neweek.tre      qiime1_mapping_file.txt qiime2_mapping_file.txt tax.txt
  • Import the data in qiime2 compatible format:

The following commands are required to import the files in qza qiime2 specific format

# Import OTU/ASV table:
qiime tools import   \
--input-path asv_biom.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV100Format \
--output-path qiime2_otu.qza

# Import taxonomic table:
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path tax.txt \
--output-path qiime2_taxonomy.qza

# Import phylogenetic tree:
qiime tools import \
--input-path asv_neweek.tre \
--output-path asv_neweek.qza \
--type 'Phylogeny[Rooted]'

# Import ASV sequences:
qiime tools import \
--input-path asv.fna \
--output-path asv_rep_set.qza \
--type 'FeatureData[Sequence]'

# Import metadata:
qiime metadata tabulate \
--m-input-file qiime2_mapping_file.txt \
--o-visualization qiime2_metadata.qzv

The import successfully generated .qza files we will then use to visualize/ analyze the data:

# List/ check the files are here:
ls
asv.fna         asv_neweek.qza      asv_rep_set.qza     qiime2_mapping_file.txt qiime2_otu.qza      tax.txt
asv_biom.biom       asv_neweek.tre      qiime1_mapping_file.txt qiime2_metadata.qzv qiime2_taxonomy.qza

5.2.3 Diversity analyses:

5.2.4 Data summary:

The first step is to summarize the ASV table using qiime feature-table summarize

qiime feature-table summarize \
--i-table qiime2_otu.qza \
--o-visualization  qiime2_otu.qzv \
--m-sample-metadata-file qiime2_mapping_file.txt

This will generate a qiime2_otu.qzv which we can open using:

qiime tools view qiime2_otu.qzv

In addition to the Overview, the Interactive Sample Detail summaries the number of Features (i.e., here ASV) among samples. We can adjust the Sampling Depth from the Plot Controls panel to see how many samples would be discarded when considering a minimum sampling depth cutoff or a rarefaction threshold.

An example of such summary can be found here

5.2.5 Rarafaction analysis:

Another meaningful preliminary analysis is to perform rarefaction curves investigating the changes of alpha-diversity metrics (e.g., Richness or Observed_features) as a function of the sequencing depth.

qiime diversity alpha-rarefaction \
--i-phylogeny asv_neweek.qza \
--i-table qiime2_otu.qza \
--p-max-depth 5000 \
--o-visualization CORE-METRICS-RESULTS/alpha-rarefaction.qzv
qiime tools view CORE-METRICS-RESULTS/alpha-rarefaction.qzv

An example of rarefaction curves can be found here. Ideally they should reach a plateau meaning the sequencing effort allows an exhaustive description of the diversity.

5.2.6 Generating core-metrics for diversity analyses.

An important parameter here is the --p-sampling-depth which is the sequencing depth at which all the samples will be rarefied in order to avoid biases due to uneven sequencing depth.

qiime diversity core-metrics-phylogenetic \
--i-phylogeny asv_neweek.qza \
--i-table qiime2_otu.qza \
--p-sampling-depth 3500 \
--m-metadata-file qiime2_mapping_file.txt \
--output-dir CORE-METRICS-RESULTS

This command generates a lot of qza files corresponding to different alpha (i.e., faith_pd, observed_features, shannon, evenness) and beta diversity metrics (i.e., unweighted_unifrac, weighted_unifrac, jaccard_distance, bray_curtis_distance) and including PCoA ordniations for beta diversity metrics. (e.g., jaccard_pcoa_results.qza).

Saved FeatureTable[Frequency] to: CORE-METRICS-RESULTS/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/evenness_vector.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/jaccard_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/bray_curtis_distance_matrix.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/weighted_unifrac_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/jaccard_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/bray_curtis_pcoa_results.qza
Saved Visualization to: CORE-METRICS-RESULTS/unweighted_unifrac_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/weighted_unifrac_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/jaccard_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/bray_curtis_emperor.qzv

5.2.6.1 Alpha-diversity:

Alpha diversity is often defined as univariate within sample community characteristic, for more information please refer to slides 63, 66 to 79 Mahendra Mariadassou’s Lecture Now we have generated different metrics, we can visualize and explore associations with the metadata.

qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/faith_pd_vector.qza \
--m-metadata-file qiime2_mapping_file.txt \
--o-visualization CORE-METRICS-RESULTS/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/evenness_vector.qza \
--m-metadata-file qiime2_mapping_file.txt \
--o-visualization CORE-METRICS-RESULTS/evenness-group-significance.qzv

You will get errors related to the mapping file, we can quickly generate a new artificial mapping file linking samples to group A or B.

vim qiime2_mapping_file_2.txt
sampleid        group
R1F1-S66        A
R1F2-S300       A
R1F3-S90        B
Y2A15-2M-06-S78 B
Y2A15-2M-12-S77 B
Y3-R1F4-S136    B

Let’s run the functions again:

qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/faith_pd_vector.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/evenness_vector.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/evenness-group-significance.qzv

And visualize the output, for instance for the evenness metric:

qiime tools view CORE-METRICS-RESULTS/evenness-group-significance.qzv

Here and example of the output.

If you want to explore the results for all the metrics you will have to open all the files and just considering one metadata covariate.

CORE-METRICS-RESULTS/evenness-group-significance.qzv                CORE-METRICS-RESULTS/unweighted-unifrac-group-significance.qzv
CORE-METRICS-RESULTS/faith-pd-group-significance.qzv

5.2.6.2 Beta-diversity:

Beta-diversity is often defined as multidimensional between sample community characteristic, for more information please refer to slides 63, 80 to 92 of Mahendra Mariadassou’s Lecture. This type of data is multidimensional and we need specific multidimensional approaches to visualize analyze such data. Now we have generated different metrics, we can visualize and explore associations with the metadata.

5.2.6.2.1 PCoA ordination:

Principal Coordinate analysis is and unconstrained ordination aiming at exploring the main variation in the data. It is similar as a PCA but can deal with any type of metric.

Below are the commands to generate a PCoA using the unweighted-unifrac metric we generated using the qiime diversity core-metrics-phylogenetic with the data rarefied at 3500 sequences per sample.

qiime emperor plot \
--i-pcoa CORE-METRICS-RESULTS/unweighted_unifrac_pcoa_results.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/unweighted-unifrac-emperor.qzv

We can then visualize the 3 first axes of the PCoA using the following command:

qiime tools view CORE-METRICS-RESULTS/unweighted-unifrac-emperor.qzv

An example of ordination plot can be found here

5.2.7 Taxonomic visualisation:

The diversity analyses we have explored are not incorporating any taxonomic information and are considering ASV (i.e., features in qiime2). We can visualize community composition in terms of taxa proportion among samples using qiime taxa barplot:

qiime taxa barplot \
--i-table qiime2_otu.qza \
--i-taxonomy qiime2_taxonomy.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization tax-bar-plots.qzv

The function generates a qzv plot we can explore using qiime tools view:

qiime tools view tax-bar-plots.qzv

An example of the output can be found here

5.2.8 Filtering data:

There are many ways to filter the data: the samples, the ASV based on metadata, proportions, taxonomic information… Below an example on sample filtering based on metadata.

5.2.8.1 Based on metadata:

qiime feature-table filter-samples \
--i-table qiime2_otu.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--p-where "[group]='B'" \
--o-filtered-table qiime2_otu_groupB.qza

Analyzing the filtered dataset would require to start again with the beginning qiime diversity core-metrics-phylogenetic which will lead to loads of folders, outputs and makes it difficult to keep track of the code and analyses when things are getting a bit complex.

5.2.8.2 More details about the filtering options.

A much more detailed qiime2 tutorial can be found here