Microbial community analysis using R and phyloseq
This document is a guideline for community analysis using R
, Phyloseq, microbiome, vegan, ggplot2, tidyverse and custom made functions to facilitate the analyses and visualization stored under the DivComAnalyses repository.
We recommend using post-clustering approach for processing raw metabarcoding data and developed an R
friendly pipeline named metabaRpipe.
The learning curve of R
might be a bit steep but analyzing your metabarcoding data will give you access to state-of-the art approaches - developed in R
before getting integrated into user friendly pipelines. Moreover, you will be then able to apply your R skills to any other type of data.
1 Resources:
1.1 Commnuity ecology:
- Lecture from Mahendra Mariadassou including
phyloseq
R package (slides 4 - 61
) and community ecology concepts (slides 62 to 156
) - GustaMe website
- Ramette A. Multivariate analyses in microbial ecology. FEMS Microbiol Ecol. 2007 Nov;62(2):142-60. doi: 10.1111/j.1574-6941.2007.00375.x. Epub 2007 Sep 20. PMID: 17892477; PMCID: PMC2121141.
- Figures from inspiring papers.
- This tutorial.
2 Getting ready:
2.1 Install R/ Rstudio:
Install R/ Rstudio from the ETH AppV Kiosk or following the instructions here. It is important to start with an up to date R
version otherwise you will likely not been able to install the necessary packages and use the functions.
2.2 Install the required packages:
2.2.1 MAC:
macOS users - might need to install xquartz for some of the function (to do this with homebrew, run the following command in your mac’s Terminal: brew install --cask xquartz
)
2.2.2 Windows:
Windows users - will need to have RTools installed so that your computer can build this package. Follow instructions here.
2.3 Load the necessary packages:
Tidyverse
and phyloseq
are the two main packages we are going to use but the functions we will source later will require additional packages you would need to install.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## [1] '1.3.1'
## [1] '1.34.0'
3 Getting familiar with R
& phyloseq
:
3.1 Data import:
We are going to illustrate some functions using two different datasets generated in the group and processed with the metabaRpipe
pipeline. Those files are located under the Github repository. You can download those directly https://github.com/fconstancias/DivComAnalyses/raw/master/data-raw/ps_invivo.RDS.
Mice were subjected with different treatments and 16S metabarcoding was use to describe feces level changes in the microbiota ps_invivo.RDS
.
Define the phyloseq object’s location:
Load the phyloseq object:
ps1 %>% # ps object wich is path of our physeq file on our computer
readRDS() -> physeq_1 # the readRDS function and then it is directed/stored to a R object we named physeq
3.1.1 The different facets of the phyloseq object:
Call the phyloseq object:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
The dataset contains 3704 taxa, ASV and recorded in 116 samples otu_table()
and 26 variables as metadata sample_data()
. There are 7 levels in the taxonomic path of those ASV and tax_table()
an ASV phylogenetic tree is stored phy_tree()
in addition to the sequences refseq()
The different facets of the data can be easily access:
OTU/ASV table:
## OTU Table: [6 taxa and 116 samples]
## taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 12 0 0 3 0 30 84 15 26 2 0 6
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 3 0 1 3 10 39 3 0 0 3 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_231 S_232 S_233 S_234 S_251 S_252 S_253 S_254 S_255 S_256 S_257 S_259
## ASV0002 8 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 8 0 43 5 0 0 0 0 0 14 11 185
## ASV0006 0 0 0 0 0 0 1 0 0 0 0 0
## ASV0008 8 0 0 0 1 20 26 6 9 0 0 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_260 S_261 S_262 S_263 S_264 S_265 S_267 S_268 S_269 S_270 S_271 S_272
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 14 13 27 40 0 4 678 823 0 0 30 30
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 9 0 1 24 6 0 0 1 5 0 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 81 67
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_273 S_274 S_275 S_276 S_277 S_278 S_279 S_280 S_281 S_298 S_299 S_300
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 18 0 0 0 0 137 311 4 222 6 14 89
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 0 2 0 0 0 0 0 0 0 0 0
## ASV0012 45 0 0 0 0 12 12 0 9 8 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 3
## S_301 S_302 S_303 S_304 S_305 S_306 S_308 S_309 S_310 S_311 S_312 S_313
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 72 35 0 0 35 5 3 9 3 60 14 0
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 1 0 0 0 0 2 0 0 0 0 0 0
## ASV0012 0 0 0 0 0 3 0 0 0 0 0 15
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_315 S_316 S_317 S_318 S_319 S_320 S_321 S_322 S_323 S_324 S_325 S_326
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 0 0 2 0 0 0 0 0 0 0 0 4
## ASV0006 0 3 2 0 0 0 0 0 0 0 0 0
## ASV0008 0 0 0 0 1 0 7 0 0 0 2 1
## ASV0012 15 0 0 0 99 0 0 0 8 0 0 14
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_327 S_341 S_342 S_343 S_344 S_345 S_346 S_347 S_348 S_349 S_350 S_351
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 0 3 61 13 0 0 3 0 0 17 3 25
## ASV0006 3 1 0 0 0 0 0 2 0 0 0 0
## ASV0008 0 1 2 0 0 0 0 3 0 2 0 4
## ASV0012 71 57 84 0 0 0 13 33 0 0 11 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_352 S_353 S_354 S_355 S_356 S_438 S_439 S_440 S_441 S_442 S_443 S_444
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 41 14 19 5 23 6 75 95 7 63 0 98
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 6 15 2 8 0 3 0 0 0 0 1 2
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_445 S_446 S_447 S_448 S_449 S_450 S_465 S_466 S_467 S_468 S_469 S_470
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 9 70 84 47 263 9 263 0 173 14 6 26
## ASV0006 0 0 1 0 0 0 0 0 0 0 0 0
## ASV0008 0 0 0 0 1 0 1 2 0 0 0 0
## ASV0012 0 0 0 0 0 6 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## S_471 S_472 S_473 S_474 S_475 S_476 S_477 S_478
## ASV0002 0 0 0 0 0 0 0 0
## ASV0003 12 0 2 14 18 8 0 7
## ASV0006 0 0 0 0 0 0 2 0
## ASV0008 1 0 1 0 8 2 3 3
## ASV0012 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0
N.B.: OTU (Operational Taxonomic Unit)can cover a lot of things including bands on a DGGE profile to OTU based on sequence similarity threshold or ASV. ASV (Amplicon Sequence Variant) is an type of OTU generated using post clustering approaches. See here, here or there for more details.
Check the tax table:
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## Genus Species
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown" "unknown"
## ASV0008 "unknown" "unknown"
## ASV0012 "unknown" "unknown"
## ASV0013 "unknown" "unknown"
Our phyloseq object has a ‘Strain’ taxonomic information we created - know taxonomical information at highest level + ASV id
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## Genus Species
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown" "unknown"
## ASV0008 "unknown" "unknown"
## ASV0012 "unknown" "unknown"
## ASV0013 "unknown" "unknown"
N.B.: I prefer to use tydiverse’s pipe: %>% which makes the code more readable as compared to classical R. For more info: https://dplyr.tidyverse.org/ & https://rstudio.com/resources/cheatsheets/
Which is similar to:
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## Genus Species
## ASV0002 "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcus_1" "unknown"
## ASV0006 "unknown" "unknown"
## ASV0008 "unknown" "unknown"
## ASV0012 "unknown" "unknown"
## ASV0013 "unknown" "unknown"
Check nucleotide sequences of the ASV:
## DNAStringSet object of length 1872:
## width seq names
## [1] 253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0002
## [2] 253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0003
## [3] 253 TACGTAGGGAGCGAGCGTTGTC...AAAGCGTGGGGAGCAAACAGG ASV0006
## [4] 253 TACGTAGGTGGCAAGCGTTATC...AAAGTGTGGGGAGCAAACAGG ASV0008
## [5] 253 TACGTAGGATGCGAGCGTTATC...AAAGTGTGGGGAGCAAACAGG ASV0012
## ... ... ...
## [1868] 253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5274
## [1869] 253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5277
## [1870] 253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5278
## [1871] 253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCAAACAGG ASV5279
## [1872] 253 TACGTAGGGGGCAAGCGTTATC...AAAGCGTGGGGAGCGAATAGG ASV5280
Check the metadata:
## [1] "Sample" "input" "filtered"
## [4] "filtered_pc" "denoisedF" "denoisedR"
## [7] "denoisedF_pc" "denoisedR_pc" "merged"
## [10] "merged_pc" "tabled" "chimera_out"
## [13] "length_filtered" "tabled_pc" "chimera_out_pc"
## [16] "length_filtered_pc" "day" "treatment"
## [19] "mouse_label" "BW" "BW_percent"
## [22] "BW_delta" "DAI" "treatment_grouped"
## [25] "treatment_invivo" "day_num"
## Sample input filtered filtered_pc denoisedF denoisedR denoisedF_pc
## S_219 S_219 44852 44233 0.99 42985 43681 0.97
## S_220 S_220 72545 72048 0.99 70726 70818 0.98
## S_221 S_221 76560 76051 0.99 74773 75012 0.98
## S_222 S_222 62479 62192 1.00 61528 61201 0.99
## S_223 S_223 78944 78670 1.00 77717 77643 0.99
## S_224 S_224 72930 72622 1.00 71294 71600 0.98
## denoisedR_pc merged merged_pc tabled chimera_out length_filtered
## S_219 0.99 37694 0.88 37694 26929 26929
## S_220 0.98 63881 0.90 63881 44125 44125
## S_221 0.99 67904 0.91 67904 49814 49814
## S_222 0.98 54905 0.89 54905 42532 42532
## S_223 0.99 68254 0.88 68254 49859 49859
## S_224 0.99 60502 0.85 60502 44187 44187
## tabled_pc chimera_out_pc length_filtered_pc day treatment mouse_label
## S_219 1 0.71 1 Day_-1 TreatA m_8
## S_220 1 0.69 1 Day_-1 TreatA m_19
## S_221 1 0.73 1 Day_-1 TreatA m_34
## S_222 1 0.77 1 Day_-1 TreatA m_26
## S_223 1 0.73 1 Day_-1 TreatA m_39
## S_224 1 0.73 1 Day_-1 TreatA m_46
## BW BW_percent BW_delta DAI treatment_grouped treatment_invivo day_num
## S_219 20.55 0 0 0 TreatA none -1
## S_220 21.32 0 0 0 TreatA none -1
## S_221 20.11 0 0 0 TreatA none -1
## S_222 21.90 0 0 0 TreatA none -1
## S_223 20.94 0 0 0 TreatA none -1
## S_224 21.12 0 0 0 TreatA none -1
The phyloseq
package has also some useful functions:
For instance to access sample names:
## [1] "S_219" "S_220" "S_221" "S_222" "S_223" "S_224"
which is similar to:
## [1] "S_219" "S_220" "S_221" "S_222" "S_223" "S_224"
Some checks:
First top samples with highest highest total number of reads.
## S_225 S_231 S_465 S_444 S_466 S_274
## 124774 89836 63618 60835 59991 59927
First top ASV with highest highest total number of reads:
## ASV0840 ASV1253 ASV0411 ASV0385 ASV4998 ASV4836
## 889239 231459 180720 126535 111485 106082
Getting help on a particular function:
3.2 Manipulating a phyloseq object:
Filtering, subseting samples or OTU based on metadata, taxonomy, abundance and prevalence is very easy in phyloseq:
3.3 Filtering:
3.3.1 prune_taxa() prune_samples():
Two functions which prunes unwanted taxa/samples from a phyloseq object based on a vector of taxa to keep:
#Create a logical vector answering the logical question: does the ASV represent more than 10 reads in total?
physeq_1 %>%
taxa_sums() > 10 -> taxa_top
#taxa_top is indeed a logical vector
taxa_top %>%
head()
## ASV0002 ASV0003 ASV0006 ASV0008 ASV0012 ASV0013
## FALSE TRUE TRUE TRUE TRUE FALSE
#filter the physeq object based on that vector
physeq_1 %>%
prune_taxa(taxa = taxa_top) -> physeq_1_filtered
Note that all the data in our physeq_filtered object has been filtered: otu_table()
, tax_table()
, refseq()
, …
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1043 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1043 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1043 tips and 1041 internal nodes ]
## refseq() DNAStringSet: [ 1043 reference sequences ]
It is also possible to filter based on mean proportion, prevalence, variation coefficient:
sum_filter = 10
prev_filter = 1
varcoef_filter = 2
mean_filter = 10
physeq_1 %>%
filter_taxa(., function(x){mean(x) > mean_filter}, TRUE)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 288 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 288 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 288 tips and 286 internal nodes ]
## refseq() DNAStringSet: [ 288 reference sequences ]
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 789 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 789 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 789 tips and 787 internal nodes ]
## refseq() DNAStringSet: [ 789 reference sequences ]
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1534 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1534 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1534 tips and 1532 internal nodes ]
## refseq() DNAStringSet: [ 1534 reference sequences ]
3.3.2 subset_taxa() subset_samples()
Those functions subsets unwanted taxa/samples from a phyloseq object based on conditions that must be met:
Keep only OTU belonging to Firmicutes Phyla:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1370 taxa and 116 samples ]
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1370 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1370 tips and 1368 internal nodes ]
## refseq() DNAStringSet: [ 1370 reference sequences ]
Keep only samples from treatment A and no treatment to our metadata:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 18 samples ]
## sample_data() Sample Data: [ 18 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
Keep only samples from treatment A and no treatment to our metadata:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 71 samples ]
## sample_data() Sample Data: [ 71 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
Keep only samples which are not defined by “none” in the “treatment_invivo” metadata:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 63 samples ]
## sample_data() Sample Data: [ 63 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
Keep only samples which are not defined by “none” in the “treatment_invivo” metadata and with samples with BW values > 20:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1872 taxa and 28 samples ]
## sample_data() Sample Data: [ 28 samples by 26 sample variables ]
## tax_table() Taxonomy Table: [ 1872 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1872 tips and 1870 internal nodes ]
## refseq() DNAStringSet: [ 1872 reference sequences ]
3.4 updating the metadata:
Most of the time we fine tune the metadata along the analysis process, physeq_add_metadata()
helps to quickly change the metadata of a phyloseq object. Below an example using an excel file directly. The sample_column
defines the column in the external file to be use to link with the sample_names()
of the phyloseq object.
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")
physeq_1 %>%
physeq_add_metadata(physeq = .,
metadata = "../../data-raw/ps_invivo_meta.xlsx" %>%
readxl::read_xlsx(),
sample_column = "Sample") -> physeq_1_new_meta
physeq_1_new_meta %>%
sample_data() %>%
data.frame() %>%
DT::datatable()
3.5 Smoothing taxonomical information:
3.5.1 ASV taxonomic agglomeration:
ASV obtained 16S V4 metabarcoding are mostly defined from a taxonomic perspective at the Family/ Genus.
## Taxonomy Table: [6 taxa by 3 taxonomic ranks]:
## Family Genus Species
## ASV0002 "Ruminococcaceae" "Ruminococcus_1" "unknown"
## ASV0003 "Ruminococcaceae" "Ruminococcus_1" "unknown"
## ASV0006 "Ruminococcaceae" "unknown" "unknown"
## ASV0008 "Ruminococcaceae" "unknown" "unknown"
## ASV0012 "Ruminococcaceae" "unknown" "unknown"
## ASV0013 "Ruminococcaceae" "unknown" "unknown"
The function tax_glom()
agglomerates ASV at a given taxonomic level. Finer taxonomic information is lost even if ASV id e.g., ASV0002
is maintained.
Agglomerate taxonomy at the Family level:
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order
## ASV0066 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
## ASV0251 "Bacteria" "Tenericutes" "Mollicutes" "Mollicutes_RF39"
## ASV0308 "Bacteria" "Cyanobacteria" "Melainabacteria" "Gastranaerophilales"
## ASV0385 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## ASV0406 "Bacteria" "Bacteroidetes" "Bacteroidia" "unknown"
## ASV0411 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## Family Genus Species
## ASV0066 "Clostridiaceae_1" NA NA
## ASV0251 "unknown" NA NA
## ASV0308 "unknown" NA NA
## ASV0385 "Prevotellaceae" NA NA
## ASV0406 "unknown" NA NA
## ASV0411 "Marinifilaceae" NA NA
3.5.2 Maxiumum ASV resolution taxonomic assignments:
Agglomerated dataset at a particular taxonomic rank might be of interest for specific analysis or also to summarize the taxonomic profile at a broader level. It is also meaningful to know the highest taxonomic definition of the ASV. We can do that using the function phyloseq_get_strains()
We first need to source the function from the DivComAnalyses
repository. Those function can be accessed from your internet browser.
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_normalisation.R")
Apply the function:
## Joining, by = "ASV"
We now have the a new Strain
taxonomic rank which summarizes the taxonomic path:
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## ASV0002 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0006 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0008 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0012 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## ASV0013 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
## Genus Species Strain
## ASV0002 "Ruminococcus_1" NA "unknown Ruminococcus_1 (Genus) ASV0002"
## ASV0003 "Ruminococcus_1" NA "unknown Ruminococcus_1 (Genus) ASV0003"
## ASV0006 NA NA "unknown Ruminococcaceae (Family) ASV0006"
## ASV0008 NA NA "unknown Ruminococcaceae (Family) ASV0008"
## ASV0012 NA NA "unknown Ruminococcaceae (Family) ASV0012"
## ASV0013 NA NA "unknown Ruminococcaceae (Family) ASV0013"
3.6 Count transformation:
Metabarcoding data are count data with two important properties: sparcitiy
(i.e., there are usually a lot of 0) and compositionality
(i.e., number of sequences assigned to taxa/ASV are linked to the total number of sequences per sample which is arbitrary: imposed by the instrument and the library preparation method but do not indicate any density) - see here or there for more details.
The information stored in the otu_table()
is counts data.
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 12 0 0 3 0 30 84 15 26 2 0 6
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 3 0 1 3 10 39 3 0 0 3 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## # … with 5 more samples (columns)
And the total number of sequences per samples varies and is not an indication of any absolute quantification.
## S_219 S_220 S_221 S_222 S_223 S_224
## 26927 44122 49814 42532 49859 44187
We can transform the count data in the otu_table()
into other type of data.
physeq_1 %>%
transform_sample_counts(function(x) x/sum(x) * 100) -> physeq_1_pc # transform as percentage - total is 100%
physeq_1_pc %>%
otu_table() %>%
head()
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227
## ASV0002 0 0 0 0 0 0 0 0 0
## ASV0003 0.0446 0 0 0.00705 0 0.0679 0.0673 0.0384 0.0551
## ASV0006 0 0 0 0 0 0 0 0 0
## ASV0008 0 0.00680 0 0.00235 0.00602 0.0226 0.0313 0.00767 0
## ASV0012 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0
## # … with 8 more samples (columns)
## S_219 S_220 S_221 S_222 S_223 S_224
## 100 100 100 100 100 100
3.6.0.1 using microbiome::transform
function:
?microbiome::transform()
# compositional' (ie relative abundance), 'Z', 'log10', 'log10p', 'hellinger', 'identity', 'clr'
The log10p transformation refers to log10(1 + x):
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 1.11 0 0 0.602 0 1.49 1.93 1.20 1.43 0.477 0 0.845
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 0.602 0 0.301 0.602 1.04 1.60 0.602 0 0 0.602 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## # … with 5 more samples (columns)
Centered-Log-Ratio (CLR) transform - see here:
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229
## ASV0002 -1.09 -0.941 -0.964 -1.08 -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0003 3.63 -0.941 -0.964 1.85 -1.39 3.67 3.72 3.44 3.71 1.77 -1.14
## ASV0006 -1.09 -0.941 -0.964 -1.08 -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0008 -1.09 1.95 -0.964 0.850 1.38 2.59 2.96 1.87 -1.22 -0.850 1.64
## ASV0012 -1.09 -0.941 -0.964 -1.08 -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## ASV0013 -1.09 -0.941 -0.964 -1.08 -1.39 -1.46 -1.41 -1.13 -1.22 -0.850 -1.14
## # … with 6 more samples (columns)
compositional transformation - proportion scaled to 1:
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227
## ASV0002 0 0 0 0 0 0 0 0 0
## ASV0003 4.46e-4 0 0 7.05e-5 0 6.79e-4 6.73e-4 3.84e-4 5.51e-4
## ASV0006 0 0 0 0 0 0 0 0 0
## ASV0008 0 6.80e-5 0 2.35e-5 6.02e-5 2.26e-4 3.13e-4 7.67e-5 0
## ASV0012 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0
## # … with 8 more samples (columns)
sqrt transform:
## OTU Table: [ 6 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228 S_229 S_230
## ASV0002 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0003 3.46 0 0 1.73 0 5.48 9.17 3.87 5.10 1.41 0 2.45
## ASV0006 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 1.73 0 1 1.73 3.16 6.24 1.73 0 0 1.73 0
## ASV0012 0 0 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0 0 0
## # … with 5 more samples (columns)
3.7 Rarefaction:
The rarefy_evendepth()
function down all samples to the same sequencing depth (a.k.a., Library Size) and prunes ASV that disappear from all samples as well as samples with a lower sequencing depth as a result.
physeq_1 %>%
rarefy_even_depth(rngseed = 123) %>% # important to specify rngseed: random number generator for reproducibility
sample_sums() %>%
head()
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 289OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
## S_219 S_220 S_221 S_222 S_223 S_224
## 15459 15459 15459 15459 15459 15459
## [1] 15459
In this example, we rarefied all the dataset to 15459 sequences per sample because it is the minumum in the dataset. You might not want to do that if some of the samples are Negative Controls (with likely low PCR amplification and therefore low reads after sequencing) or samples which failed to amplify. In those case you can specify the the sample.size
argument.
physeq_1 %>%
rarefy_even_depth(rngseed = 123,
sample.size = 20000) -> physeq_1_rare # important to specify rngseed: random number generator for reproducibility
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 2 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## S_345S_469
## ...
## 207OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1665 taxa and 114 samples ]:
## sample_data() Sample Data: [ 114 samples by 26 sample variables ]:
## tax_table() Taxonomy Table: [ 1665 taxa by 7 taxonomic ranks ]:
## phy_tree() Phylogenetic Tree: [ 1665 tips and 1663 internal nodes ]:
## refseq() DNAStringSet : [ 1665 reference sequences ]
## taxa are rows
A warning message specify that 2 samples were removed because of lower Library size. Altough in that case 2070 ASV were removed because they are no longer present. As you can imagine they are low abundant and low prevalent. We will come back to this a bit later.
Should you rarefy your data ? This question is still under ‘debate’: see here and there. It depend on the type of analysis you are going to do, the shape of your rarefaction curves, … but to make it short for alpha-diversity and beta diversity rarefaction is advised for standard analysis, not much for differential abundance testing. Again if you have a decent sequencing depth it should not matter. It also depends how the rarefaction curves look like, which nicely brings us to the next topic. Most of the case it does not impact much the alpha and beta diversity patterns if the sequencing depth is high enough to allow an exhaustive characterization of the diversity of the samples.
Since sequencing depth is important, phyloseq_check_lib_size()
function allows to summarize the sequencing depth per sample. The function returns two objects a data.frame df
and a plot plot
stored in the output.
physeq_1_st %>%
phyloseq_check_lib_size(data_color = "treatment_invivo",
data_facet = NULL,
nreads_display = 30000,
first_n = nsamples(physeq_1_st)) -> lib
lib$df %>%
DT::datatable()
The DT::datatable()
function allows to generate interactive table using marqdown
.
We can easily customize the plot using ggplot2
grammar:
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We can also use tidyverse
to select a few columns from the dataframe output:
3.8 Rarefaction curves:
Rarefaction curves depict the relationship between sequencing depth (i.e., Library Size) and the observed richness (i.e., number of different ASV) and ultimately allow to estimate if the sequencing effort seems sufficient to characterize the richness of the samples. We can use the function phyloseq_rarefaction_curves()
:
physeq_1_st %>%
phyloseq_rarefaction_curves(stepsize = 500,
color_data = NULL,
facet_data = NULL) -> p
p
You are now familiar with ggplot objects and can easily customize the plot, faceting with the level of treatment_invivo()
and adding a vertical line with the minimum sequencing depth.
physeq_1_st %>% sample_sums() %>% min() -> min_lib
p + geom_vline(xintercept = min_lib,
color = "red",
linetype = "dashed", size=0.25) +
facet_wrap(~ treatment_invivo) + ylab("ASV Richness") -> plot
plot + theme(legend.position = "none")
Those curves look nice.
We are going to rarefy the samples to 15459
the minimum library size from our dataset.
physeq_1_st %>%
rarefy_even_depth(rngseed = 123,
sample.size = min_lib) -> physeq_1_rare # important to specify rngseed: random number generator for reproducibility
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 289OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1583 taxa and 116 samples ]:
## sample_data() Sample Data: [ 116 samples by 26 sample variables ]:
## tax_table() Taxonomy Table: [ 1583 taxa by 8 taxonomic ranks ]:
## phy_tree() Phylogenetic Tree: [ 1583 tips and 1581 internal nodes ]:
## refseq() DNAStringSet : [ 1583 reference sequences ]
## taxa are rows
Let’s now extract the ASV which were removed to see whether.
Extract ASV id from the initial / unrarefied dataset:
physeq_1_st %>%
filter_taxa(function(x) sum(x > 0) > 0, TRUE) %>% # to remove any ASV kept but with only 0 from the dataset
taxa_names() -> asv_raw
Extract ASV id from the rarefied dataset:
physeq_1_rare %>%
filter_taxa(function(x) sum(x > 0) > 0, TRUE) %>% # to remove any ASV kept but with only 0 from the dataset
taxa_names() -> asv_rare
Get the ASV id which were removed:
## [1] 289
260 were removed.
Generate a heatmap with the proportion of those 260 ASV in the initial dataset.
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_heatmap.R")
physeq_1_st %>%
transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the phyloseq object
phyloseq_ampvis_heatmap(physeq = .,
transform = FALSE, # extract only the taxa to display - after percentage normalisation
facet_by = "treatment_invivo",
group_by = "SampleID",
ntax = Inf) + ggtitle("ASV removed during rarefaction ‰") -> heat_rm_rare_asv
heat_rm_rare_asv
So we see that those ASV were quite low abundant and very sparse in the full dataset.
physeq_1_st %>%
transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the phyloseq object
plot_bar(fill = "OTU") +
facet_grid(~ treatment_invivo, scales = "free", space = "free") +
ggtitle("Bar plot colored by ASV ") +
ylab("Proportion - ‰") +
theme_light() +
theme(legend.position = "none") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
You should not worry too much regarding the ASV removed during the rarefaction step. What if we had rarefy the dataset to 1000 sequences per samples?
physeq_1_rare %>%
phyloseq_rarefaction_curves(stepsize = 500,
color_data = "treatment_invivo",
facet_data = NULL)
Looks good!
physeq_1_st %>%
transform_sample_counts(function(x) x/sum(x) * 1000) %>% # transform as percentage before filtering
prune_taxa(rm_rare_asv, .) %>% # keep only the ASV with id matching the rm_rare_asv vector from the
microbiomeutilities::plot_abund_prev(.,
label.core = TRUE,
color = "Class", # NA or "blue"
mean.abund.thres = 0.01,
mean.prev.thres = 0.95,
dot.opacity = 0.7,
label.size = 3,
label.opacity = 1.0,
nudge.label=-0.15,
bs.iter=999, # increase for actual analysis e.g. 999
size = length(rm_rare_asv), # increase to match your nsamples(asv_ps)
replace = TRUE,
label.color="#5f0f40") +
geom_vline(xintercept = 0.95, lty="dashed", alpha=0.7) +
geom_hline(yintercept = 0.01,lty="dashed", alpha=0.7) -> pv
## Warning: replacing previous import 'ggplot2::alpha' by 'microbiome::alpha' when
## loading 'microbiomeutilities'
## Make sure to set.seed!!!
## Joining, by = "Taxa"
## Joining, by = "Taxa"
## Joining, by = "Taxa"
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis
This plot displays the mean prevalence and relative abundance (0-100%) of the ASV we removed. We clearly see those taxa are very low prevalent and abundant in propotion.
3.9 Normalisation with total bacterial density:
See here. This function multiplies the proportion of ASV with absolute density values (i.e., as qPCR16S of flow cytometry) allowing to esimate the density of each ASV.
## OTU Table: [ 1872 taxa and 116 samples ]:
## Taxa are rows
## S_219 S_220 S_221 S_222 S_223 S_224 S_225 S_226 S_227 S_228
## ASV0002 0 0 0 0 0 0 0 0 0 0
## ASV0003 2.79e6 0 0 323996 0 6.47e6 534111 1.63e6 1.87e6 3.55e5
## ASV0006 0 0 0 0 0 0 0 0 0 0
## ASV0008 0 236983 0 107999 320497 2.16e6 247980 3.25e5 0 0
## ASV0012 0 0 0 0 0 0 0 0 0 0
## ASV0013 0 0 0 0 0 0 0 0 0 0
## ASV0015 0 0 0 0 0 0 0 0 0 0
## ASV0018 0 868938 1.50e6 0 747827 8.63e5 133528 4.33e5 0 2.31e6
## ASV0020 0 0 0 0 0 0 0 0 0 0
## ASV0022 0 0 0 0 0 0 0 0 0 0
## # … with 1,862 more taxa (rows), and 106 more samples (columns)
4 Community analysis:
There are three main dimension of community analysis using metbarcoding data: alpha diversity
, beta diversity
and ASV/ taxa centric appraoch
4.1 Alpha-diversity:
Alpha diversity is often defined as univariate within sample community characteristic, for more information please refer to slides 63, 66 to 79 Mahendra Mariadassou’s Lecture Chao1 and ACE are richness estimator trying to extrapolate the total richness observed + not detected due to unsufficient sequencing depth.
4.1.1 phyloseq::plot_richness()
Using the rarefied physeq_1_rare
object we are going to visualize alpha-diversity metrics:
physeq_1_rare %>%
plot_richness(measures = c("Observed", "Shannon", "ACE"),
shape = "treatment_invivo") +
facet_grid(variable ~ treatment_invivo, scale = "free", space = "fixed", drop = TRUE) -> alpha_plot
alpha_plot %>%
print()
The function plot_richness()
actually use estimate_richness()
and then ggplot()
to plot alpha_diversity values. alpha_plot is the ggplot and data are stored in alpha_plot$data
:
Let’s modify the plot removing ACE values, and a one sample (for demonstration purpose):
alpha_plot$data %>%
filter(variable != "ACE", samples != "S_234") %>% # we can filter anything, metadata, alphadiversity metric
ggplot(aes(treatment_invivo,
value,
colour = treatment_invivo,
fill = treatment_invivo,
shape = treatment_invivo)) +
facet_grid(variable ~ treatment_invivo ,scale="free",space="fixed") +
geom_boxplot(outlier.colour = NA, alpha = 0.2) +
geom_point(size=2,position=position_jitterdodge(dodge.width = 0.2)) +
ylab("Diversity indices") + xlab(NULL) + theme_bw()
4.1.2 Perform statistical evaluation:
The data are stored in long format:
You can transform in wide format and export your data to a tsv file and run statistical tests outside of R
:
alpha_plot$data %>%
filter(variable != "ACE", samples != "S_234") %>% # we can filter anything, metadata, alphadiversity metric
pivot_wider(names_from = variable, values_from = value) -> alpha_wide
alpha_wide %>%
head() %>%
DT::datatable()
Or you can do it in R
using an anova aov
:
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment_invivo 3 332333 110778 34.55 7.54e-16 ***
## Residuals 111 355891 3206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggpubr::compare_means()
function allows to nicely compare among multiple groups for all the variables.
alpha_plot$data %>%
filter(variable != "ACE", samples != "S_234") %>%
ggpubr::compare_means(value ~ treatment_invivo,
group.by = "variable",
data = ., # . indicates that the object is coming from the pipe above.
method = "wilcox.test")
## # A tibble: 12 × 9
## variable .y. group1 group2 p p.adj p.format p.signif method
## <fct> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Observed value none TreatA 0.0000000235 2.6e-7 0.000000… **** Wilco…
## 2 Observed value none TreatB 0.0000000463 4.6e-7 0.000000… **** Wilco…
## 3 Observed value none TreatC 0.00000000570 6.8e-8 0.000000… **** Wilco…
## 4 Observed value TreatA TreatB 0.286 1 e+0 0.286 ns Wilco…
## 5 Observed value TreatA TreatC 0.778 1 e+0 0.778 ns Wilco…
## 6 Observed value TreatB TreatC 0.280 1 e+0 0.280 ns Wilco…
## 7 Shannon value none TreatA 0.0538 4.8e-1 0.054 ns Wilco…
## 8 Shannon value none TreatB 0.334 1 e+0 0.334 ns Wilco…
## 9 Shannon value none TreatC 0.167 1 e+0 0.167 ns Wilco…
## 10 Shannon value TreatA TreatB 0.307 1 e+0 0.307 ns Wilco…
## 11 Shannon value TreatA TreatC 0.549 1 e+0 0.549 ns Wilco…
## 12 Shannon value TreatB TreatC 0.884 1 e+0 0.884 ns Wilco…
4.1.3 Compute alpha-diversity indices.
phyloseq_alphas()
function compute all kind of alpha diversity metrics. If phylo = TRUE,
it will also compute phylogenetic metrics -if you have a phylogenetic tree included in your phyloseq object - but could take quite some time.
The idea is not to loose too much time exploring many indices, but on the other hand it can help to better understand the pattern and guide the analysis (phylogeny, richness, evenness).
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_alpha.R")
physeq_1_rare %>%
phyloseq_alphas(phylo = FALSE) -> alphas
Let’s check the colanmes and the computed indices:
## [1] "id...1" "Sample"
## [3] "input" "filtered"
## [5] "filtered_pc" "denoisedF"
## [7] "denoisedR" "denoisedF_pc"
## [9] "denoisedR_pc" "merged"
## [11] "merged_pc" "tabled"
## [13] "chimera_out" "length_filtered"
## [15] "tabled_pc" "chimera_out_pc"
## [17] "length_filtered_pc" "day"
## [19] "treatment" "mouse_label"
## [21] "BW" "BW_percent"
## [23] "BW_delta" "DAI"
## [25] "treatment_grouped" "treatment_invivo"
## [27] "day_num" "id...28"
## [29] "observed" "chao1"
## [31] "diversity_inverse_simpson" "diversity_gini_simpson"
## [33] "diversity_shannon" "diversity_fisher"
## [35] "diversity_coverage" "evenness_camargo"
## [37] "evenness_pielou" "evenness_simpson"
## [39] "evenness_evar" "evenness_bulla"
## [41] "dominance_dbp" "dominance_dmn"
## [43] "dominance_absolute" "dominance_relative"
## [45] "dominance_simpson" "dominance_core_abundance"
## [47] "dominance_gini" "rarity_log_modulo_skewness"
## [49] "rarity_low_abundance" "rarity_rare_abundance"
## [51] "id...51" "Observed"
## [53] "Chao1" "se.chao1"
## [55] "ACE" "se.ACE"
4.1.4 Plot the data and run basic stats:
plot_alphas()
function takes as an input the output from phyloseq_alphas()
. As always, the alphas dataframe can be manipulated/filtered and plot as you wish with ggplot2.
alphas %>%
plot_alphas(measure = c("Observed", "diversity_shannon", "evenness_pielou", "dominance_gini"),
x_group = "treatment_invivo",
colour_group = "treatment_invivo",
fill_group = "treatment_invivo",
shape_group = NULL,
facet_group = ".",
test_group = "treatment_invivo",
test_group_2 = NULL) -> out
The function generates two outputs. The first is a plot stored in the $plot
object.
The second is a data frame of pairwise non parametric tests for all indices. We can check the significant comparisons. We will use adjusted p values here.
We can also try to correlate alpha-diversity metrics with metadata - here Body Weight - using correlate_alpha()
function.
alphas %>%
correlate_alpha(colnums_to_plot = c("Observed", "diversity_shannon",
"BW"),
method = "spearman") -> alpha_corr
alpha_corr
N.B.: correlations between alpha-diversity metrics are also included here.
4.2 Beta-diversity:
Beta-diversity is often defined as multidimensional between sample community characteristic, for more information please refer to slides 63, 80 to 92 of Mahendra Mariadassou’s Lecture. This type of data is multidimensional and we need specific multidimensional approaches to visualize analyze such data. Beta-diversity depict similarity between samples based on a synthetic matrix which contains pairwise similarity/dissimilarity values for all the samples combinations. There are phylogenetic / non phylogenetic beta-diversity metrics and abundance or presence based metrics. The most used are Bray-Curtis and Binary-Jaccard: non-phylogenetic abundance and presence based metrics, respectively as well as weighted and unweighted unifrac metrics: phylogenetic abundance and presence based metrics, respectively. The use of phylogenetic metrics was introduced at a time when clustering was apply to analyze metabarcoding data. This led to noisy OTU and the use of phylogenetic metric allowed to lower the influence of very similar OTU sequences see here. I now prefer to use binary and weigthed jaccard metrics when exploring a dataset. Aitchinson distance is now popular again but is usually very corelated with weighted jaccard.
First, we generate a distance matrix from the phyloseq_object:
4.2.1 Visualisation:
4.2.2 PCoA/NMDS:
PCoA and NMDS are unconstrained ordination techniques in order to visualize data with high dimensions. They are similar as a PCA but the input distance matrix can be specified (PCA is based only on euclidean distance.). More information on PCoA or NMDS
Samples coordinate on the PCoA axis are stored in but plot_ordination can make use of ord object we have just created easily.
## Axis.1 Axis.2
## S_219 -0.18477138 -0.10158338
## S_220 -0.23657906 0.04852235
## S_221 -0.24458723 0.05939406
## S_222 -0.23035443 -0.02873114
## S_223 -0.12984620 -0.24477378
## S_224 -0.07919993 -0.34760950
## S_225 -0.21062096 -0.03771307
## S_226 -0.22533786 0.01167839
## S_227 -0.23866155 -0.11826223
## S_228 -0.25078434 0.07550421
Plot ordination can be then used to plot the ordination object:
physeq_1_rare %>%
plot_ordination(physeq = .,
ordination = ord,
color = "BW",
shape = "treatment_invivo",
title = "PCoA Bray-Curtis",
label= "Sample") +
theme_bw()
4.2.2.1 Dendrogram:
source("https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/415075fb7f690f6ac39eab18f62aeb4e4967ef84/R/graphical_methods.R")
require(ape)
## Loading required package: ape
Similar patterns as we saw on the PCoA. Makes sense because the same distance object is an input ;)
4.2.3 Distance boxplot:
Another way to visualize pairwise distance comparaison is to plot them as boxplot. The phyloseq_distance_boxplot()
performs all pairwise distance comparisons between the group in the facet header, with all of the other groups (including itself). Each boxplot is shown twice, once in each group. For example, “Mock” in the “Soil” facet has the same data has “Soil” in the “Mock” facet.
Red is the within group comparisons, which is why it is typically lower distance compared to the black boxplots. There are some within group comparisons that are actually more distant than comparing across groups.
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_beta.R")
physeq_1_rare %>%
phyloseq_distance_boxplot(p = ., # phyloseq object
dist = bc, # distance matrix
d = "treatment_invivo") -> out # column in metadata for comparaisons
The function generates the following plot:
We can see samples from TreatA, B and C are mainly distant from the “none” group.
As well as the distance matrix that can be used to customize distance boxplot.
## Var1 Var2 value Type1 Type2
## 1 S_220 S_219 0.4986739 none none
## 2 S_221 S_219 0.4546866 none none
## 3 S_222 S_219 0.3976324 none none
## 4 S_223 S_219 0.4727343 none none
## 5 S_224 S_219 0.5096707 none none
## 6 S_225 S_219 0.4509994 none none
4.2.4 Statistical tests
4.2.4.1 PERMANOVA:
Let’s check if the observed patterns we visualized in PCoA, dendrogram are significant using PERMANOVA i.e., adonis function from vegan:
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## vegan::adonis2(formula = bc ~ get_variable(physeq_1_rare, "treatment_invivo"))
## Df SumOfSqs R2 F
## get_variable(physeq_1_rare, "treatment_invivo") 3 4.223 0.21616 10.295
## Residual 112 15.314 0.78384
## Total 115 19.537 1.00000
## Pr(>F)
## get_variable(physeq_1_rare, "treatment_invivo") 0.001 ***
## Residual
## Total
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on Bray-Curtis distance treatment_invivo is significant which means there is at least one group different from the others in terms ob beta-diversity/ community similarity.
I have computed some functions to help with the process:
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_beta.R")
physeq_1_rare %>%
phyloseq_adonis2(dm = bc,
formula = "treatment_invivo")
## terms Df SumOfSqs R2 F Pr..F.
## 1 treatment_invivo 3 4.223042 0.2161567 10.29524 0.001
## 2 Residual 112 15.313904 0.7838433 NA NA
## 3 Total 115 19.536946 1.0000000 NA NA
The ‘Pr..F.’ is the pvalue - significant here - and the R2 indicates the strength or the % of explained variation. Here treatment_invivo explains 22% of the variations and is significant.
When having more than two levels, it makes sense to run pairwise comparisons:
physeq_1_rare %>%
physeq_pairwise_permanovas_adonis2(dm = bc,
compare_header = "treatment_invivo") %>%
arrange(pvalFDR)
## X1 X2 R2 pval pvalBon pvalFDR
## 1 none TreatA 0.17801503 0.001 0.006 0.003
## 2 none TreatB 0.19216850 0.001 0.006 0.003
## 3 none TreatC 0.17722244 0.001 0.006 0.003
## 4 TreatA TreatC 0.03931906 0.119 0.714 0.143
## 5 TreatB TreatC 0.03443042 0.113 0.678 0.170
## 6 TreatA TreatB 0.02233659 0.446 2.676 0.446
When running many statistical tests we increase the risk of false positive results and we need to correct the pvalues. Here we have the raw pvalues pval and the corrected ones using bonnferroni or FDR approaches. The results clearly showed mice from the group TreatA, B and C have different microbiota as compared to the non group (i.e., the Control). BTW it was already the case when we run the statistical test on the alpha-diversity metrics comparing different groups.
We can also test the effect of the combinaition of different variables in explaining the variation of beta-diversity.
## terms Df SumOfSqs R2 F Pr..F.
## 1 BW 1 1.287502 0.06590091 9.601193 0.001
## 2 treatment_invivo 3 3.364545 0.17221449 8.363387 0.001
## 3 Residual 111 14.884898 0.76188460 NA NA
## 4 Total 115 19.536946 1.00000000 NA NA
## terms Df SumOfSqs R2 F Pr..F.
## 1 treatment_invivo 3 4.2230424 0.21615673 10.497389 0.001
## 2 BW 1 0.4290052 0.02195866 3.199187 0.003
## 3 Residual 111 14.8848984 0.76188460 NA NA
## 4 Total 115 19.5369460 1.00000000 NA NA
Note that by default the order matters and the first variable will be fitted against the community distance. The residuals will be fed to the other variable. By default the terms_margins
argument is set on “terms”.
We can also check the interaction between the two:
physeq_1_rare %>%
phyloseq_adonis2(dm = bc,
formula = " treatment_invivo * BW") # same as treatment_invivo + BW + treatment_invivo:BW
## terms Df SumOfSqs R2 F Pr..F.
## 1 treatment_invivo 3 4.2230424 0.21615673 10.858025 0.001
## 2 BW 1 0.4290052 0.02195866 3.309094 0.004
## 3 treatment_invivo:BW 3 0.8833163 0.04521261 2.271128 0.001
## 4 Residual 108 14.0015820 0.71667199 NA NA
## 5 Total 115 19.5369460 1.00000000 NA NA
treatment_invivo variations explains most of the variations but BW Body Weight also explains some variations. Interestingly, the interaction between the 2: treatment_invivo:BW is also significant meaning BW do not impact the communities in a similar way among the differnet treatments.
We can also use terms_margins = "margin"
which allows to test for the effect of the variables after controlling the effect of the other.
physeq_1_rare %>%
phyloseq_adonis2(dm = bc,
formula = " treatment_invivo + BW",
terms_margins = "margin")
## terms Df SumOfSqs R2 F Pr..F.
## 1 treatment_invivo 3 3.3645451 0.17221449 8.363387 0.001
## 2 BW 1 0.4290052 0.02195866 3.199187 0.005
## 3 Residual 111 14.8848984 0.76188460 NA NA
## 4 Total 115 19.5369460 1.00000000 NA NA
More details here.
4.2.4.2 Beta Dispersion:
In order to conclude we need to check PERMANOVA’s main assumption: variance homoscedasticity.
## [1] 0.089
Variation in communities among the different groups do not differ in terms of spread of beta diversity. More info here.
4.2.4.3 Multivariate Welch t-test:
As an alternative to PERMANOVA (i.e., `vegan::adonis``) - and espacially in the case of unbalanced and heteroscedastic data we can use distance multivariate Welch t-test.
physeq_1_rare %>%
phyloseq_TW(dm = bc,
physeq = .,
variable = "treatment_invivo") -> out
out %>%
DT::datatable()
Multivariate Welch t-test confirmed the patterns.
4.2.4.4 db-RDA:
Constrained ordination detects how much of community variations is explained / correlated with variables. See here
phyloseq_dbRDA()
takes a phyloseq object, a distance matrix and formula with the covariable(s) to test.
physeq_1 %>%
phyloseq_dbRDA(dm = bc,
ps = .,
group_plot = "treatment_invivo",
forumla = paste0(c("BW", "DAI"), collapse=" + ")) -> out
The function generates four outputs: 1- the classic vegan dbRDA:
## Call: vegan::capscale(formula = dist ~ BW + DAI, data = metadata, add =
## TRUE)
##
## Inertia Proportion Rank
## Total 31.45909 1.00000
## Constrained 2.20316 0.07003 2
## Unconstrained 29.25593 0.92997 113
## Inertia is Lingoes adjusted squared Unknown distance
##
## Eigenvalues for constrained axes:
## CAP1 CAP2
## 1.8456 0.3576
##
## Eigenvalues for unconstrained axes:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
## 3.891 2.553 1.582 1.142 0.934 0.770 0.747 0.657
## (Showing 8 of 113 unconstrained eigenvalues)
##
## Constant added to distances: 0.1036708
2- the ordination plot:
3- the anova results to test the significance of the model overall:
4- the anova results testing the influence of each of the covaraibles:
Here The analysis confirm that Body Weight is correlating with the differences in BW Body Weight. When coloring the samples by treatment group we see that treatment also impact body weight.
4.2.4.5 partial db-RDA:
This approach seeks to remove the effect of one or more explanatory variables on a set of response variables prior to a standard RDA. This may be useful when well-characterised variables with strong effects obscure the effects of more interesting explanatory variables. This can be achieved with vegan::capscale
and the use of Condition(variable)
in the formula. Here we can try to remove the effect of donor microbiota + fermentation and see constrained ordination plot as well as the statistics. See here
physeq_1 %>%
phyloseq_dbRDA(ps = .,
dm = bc,
forumla = " treatment_invivo + Condition(BW)",
group_plot = "treatment_invivo",
vec_ext = 0.4) -> dbRDA_test
dbRDA_test$anova_all %>%
mutate_if(is.numeric, round, 3) %>%
DT::datatable()
We can access the dbRDA object displaying constrained and unconstrained variations as well as conditioned:
## Call: vegan::capscale(formula = dist ~ treatment_invivo +
## Condition(BW), data = metadata, add = TRUE)
##
## Inertia Proportion Rank
## Total 31.45909 1.00000
## Conditional 1.39117 0.04422 1
## Constrained 3.67556 0.11684 3
## Unconstrained 26.39236 0.83894 111
## Inertia is Lingoes adjusted squared Unknown distance
##
## Eigenvalues for constrained axes:
## CAP1 CAP2 CAP3
## 3.0445 0.3757 0.2553
##
## Eigenvalues for unconstrained axes:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
## 3.1580 1.6347 1.2949 1.0873 0.8306 0.7255 0.6542 0.5580
## (Showing 8 of 111 unconstrained eigenvalues)
##
## Constant added to distances: 0.1036708
We can access the constrained ordination plot:
## Warning: Removed 114 rows containing missing values (geom_point).
## Df SumOfSqs F Pr..F.
## treatment_invivo 3 3.675558 5.152841 0.001
## Residual 111 26.392361 NA NA
Since treatment has an effect we can further investigate the pairwise comparisons accross levels of treatment:
4.2.4.6 Usefull functions:
phyloseq_compute_bdiv()
function compute all kind of beta diversity distances. If phylo = TRUE,
it will also compute phylogenetic metrics -if you have a phylogenetic tree included in your phyloseq object - but will take some time.
The idea is not to loose too much time exploring many distances, but on the other hand it can help to better understand the pattern and guide the analysis. The included distances are binary jaccard: bjaccard, weigthed jaccard: wjaccard, bray-curtis, un-weighted unifrac: uunifrac, wunifrac and d_0 and d_0.5 from the generalized unifrac.
## Loading required package: GUniFrac
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
## Length Class Mode
## bray 6670 dist numeric
## bjaccard 6670 dist numeric
## wjaccard 6670 dist numeric
## uunifrac 6670 dist numeric
## wunifrac 6670 dist numeric
## d_0 6670 dist numeric
## d_0.5 6670 dist numeric
We can remove some of those distance from the list:
Compute ordinations:
physeq_1_rare %>%
phyloseq_plot_bdiv(dlist = dlist, # list of distance computed from a phyloseq object
ps_rare = ., # phyloseq object
m = "PCoA", # PCoA or NMDS
seed = 123, # for reproducibility
axis1 = 1, # axis to plot
axis2 = 2) -> plot_list
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "uunifrac"
## [1] "wunifrac"
Explore all the ordination plots:
plot_list %>%
phyloseq_plot_ordinations_facet(color_group = "BW",
shape_group = "treatment_invivo") -> pcoas
## Warning in if (class(plot_list[[1]]) == "list") {: the condition has length > 1
## and only the first element will be used
Check % of explained variance for all ordinations:
phyloseq_ordinations_expl_var()
generates:
Perform PERMANOVA on all distances using lapply:
A way to apply a function to a list is to use lapply. This is an example to run PERMANOVA (Adonis vegan function on a list of distances). We can test here if SampleType influences community distances:
lapply(
dlist,
FUN = phyloseq_adonis2,
physeq = physeq_1_rare,
formula = paste0(c("treatment_invivo"), collapse=" * "),
nrep = 999,
strata = FALSE
) %>%
bind_rows(.id = "Distance") %>%
mutate_if(is.numeric, round, 3) %>%
# filter(! terms %in% (c("Residuals", "Total"))) %>%
DT::datatable()
Since it is significant, we can run pairwise comparisons to further investigate which groups are different from the other(s).
lapply(
dlist,
FUN = physeq_pairwise_permanovas_adonis2,
physeq = physeq_1_rare,
compare_header = "treatment_invivo",
n_perm = 999,
strat = FALSE
) %>%
bind_rows(.id = "Distance") %>%
mutate_if(is.numeric, round, 3) %>%
# filter(! terms %in% (c("Residuals", "Total"))) %>%
DT::datatable()
We could do the same for betadisper:
## $bjaccard
## $bjaccard$pval
## [1] 0.041
##
## $bjaccard$plot
## $bjaccard$plot$stats
## [,1] [,2] [,3] [,4]
## [1,] 0.3222634 0.3539865 0.3362548 0.3467928
## [2,] 0.3561419 0.3616801 0.3595863 0.3602318
## [3,] 0.3769350 0.3769323 0.3851296 0.3890985
## [4,] 0.3995027 0.4007802 0.4204568 0.4111401
## [5,] 0.4523347 0.4359965 0.4790630 0.4524837
##
## $bjaccard$plot$n
## [1] 53 18 24 21
##
## $bjaccard$plot$conf
## [,1] [,2] [,3] [,4]
## [1,] 0.3675244 0.3623711 0.3654979 0.3715461
## [2,] 0.3863456 0.3914935 0.4047613 0.4066509
##
## $bjaccard$plot$out
## S_251 S_264 S_303 S_304 S_348
## 0.4963659 0.4716019 0.6169250 0.6568167 0.6003817
##
## $bjaccard$plot$group
## [1] 1 1 4 4 4
##
## $bjaccard$plot$names
## [1] "none" "TreatA" "TreatB" "TreatC"
##
##
##
## $wjaccard
## $wjaccard$pval
## [1] 0.127
##
## $wjaccard$plot
## $wjaccard$plot$stats
## [,1] [,2] [,3] [,4]
## [1,] 0.3097886 0.3429259 0.3568736 0.3570027
## [2,] 0.3596359 0.3826550 0.4045826 0.3942019
## [3,] 0.3918497 0.4411128 0.4269739 0.4564468
## [4,] 0.5011181 0.4822541 0.5055536 0.5733960
## [5,] 0.6993111 0.5996836 0.6005157 0.7636644
##
## $wjaccard$plot$n
## [1] 53 18 24 21
##
## $wjaccard$plot$conf
## [,1] [,2] [,3] [,4]
## [1,] 0.3611439 0.4040211 0.3944092 0.3946635
## [2,] 0.4225556 0.4782044 0.4595387 0.5182301
##
## $wjaccard$plot$out
## S_251
## 0.7449809
##
## $wjaccard$plot$group
## [1] 1
##
## $wjaccard$plot$names
## [1] "none" "TreatA" "TreatB" "TreatC"
##
##
##
## $uunifrac
## $uunifrac$pval
## [1] 0.002
##
## $uunifrac$plot
## $uunifrac$plot$stats
## [,1] [,2] [,3] [,4]
## [1,] 0.1471045 0.1728876 0.1670796 0.1865194
## [2,] 0.1811844 0.2040400 0.2059661 0.2023125
## [3,] 0.1960632 0.2208035 0.2263794 0.2136097
## [4,] 0.2130277 0.2376040 0.2388214 0.2299618
## [5,] 0.2519337 0.2877896 0.2848743 0.2614012
##
## $uunifrac$plot$n
## [1] 53 18 24 21
##
## $uunifrac$plot$conf
## [,1] [,2] [,3] [,4]
## [1,] 0.1891522 0.2083039 0.2157830 0.2040767
## [2,] 0.2029741 0.2333031 0.2369758 0.2231428
##
## $uunifrac$plot$out
## S_251 S_303 S_304 S_348
## 0.3352518 0.4396205 0.4944090 0.3922345
##
## $uunifrac$plot$group
## [1] 1 4 4 4
##
## $uunifrac$plot$names
## [1] "none" "TreatA" "TreatB" "TreatC"
##
##
##
## $wunifrac
## $wunifrac$pval
## [1] 0.067
##
## $wunifrac$plot
## $wunifrac$plot$stats
## [,1] [,2] [,3] [,4]
## [1,] 0.08547445 0.1127499 0.1042706 0.1303766
## [2,] 0.11738724 0.1462398 0.1548101 0.1473732
## [3,] 0.13688849 0.1791199 0.1733416 0.2142894
## [4,] 0.19618800 0.2361543 0.2618620 0.2866256
## [5,] 0.28471590 0.3586046 0.3696712 0.4442491
##
## $wunifrac$plot$n
## [1] 53 18 24 21
##
## $wunifrac$plot$conf
## [,1] [,2] [,3] [,4]
## [1,] 0.1197864 0.1456349 0.1388156 0.1662774
## [2,] 0.1539906 0.2126049 0.2078675 0.2623014
##
## $wunifrac$plot$out
## S_224 S_229 S_231 S_251 S_308 S_310 S_353 S_471
## 0.3296568 0.4615768 0.4549740 0.5386644 0.3302266 0.3468041 0.3158119 0.3382327
## S_474 S_478 S_303 S_304
## 0.4846597 0.3582213 0.6905075 0.6993610
##
## $wunifrac$plot$group
## [1] 1 1 1 1 1 1 1 1 1 1 4 4
##
## $wunifrac$plot$names
## [1] "none" "TreatA" "TreatB" "TreatC"
3d PCoA:
physeq_1_rare %>%
phyloseq_plot_PCoA_3d(ps_rare = .,
dlist = dlist,
shape = FALSE,
color = "treatment_invivo")
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "uunifrac"
## [1] "wunifrac"
## $bjaccard
##
## $wjaccard
##
## $uunifrac
##
## $wunifrac
adding taxa correlated with the ordination pattern:
First, generate a PCoA:
physeq_1_rare %>%
plot_ordination(physeq = .,
ordination = ord,
color = "treatment_invivo",
title = "PCoA Bray-Curtis",
label= NULL) +
theme_bw() -> pca
pca
phyloseq_add_taxa_vector()
:
We can use then the function phyloseq_add_taxa_vector()
to add top correlated taxa, here at the Genus level. The function uses `vegan::envfit()``.
physeq_1 %>%
phyloseq_add_taxa_vector(phyloseq = .,
dist = bc,
join_cbind = "join",
tax_rank_plot = "Genus",
figure_ord = pca,
fact = 0.25, pval_cutoff = 0.05,
top_r = 10) -> out #top 10 correlated genera
out$plot
We can export the data (correlation and taxa) of all the significant correlations:
## Axis.1 Axis.2 id pval r Kingdom Phylum
## 1 -0.07227232 0.60108796 ASV0385 0.001 0.3665300 Bacteria Bacteroidetes
## 2 0.76901561 -0.09789388 ASV0411 0.001 0.6009682 Bacteria Bacteroidetes
## 3 0.27075224 0.23924967 ASV0555 0.001 0.1305472 Bacteria Bacteroidetes
## 4 -0.57298504 -0.26051444 ASV1197 0.001 0.3961796 Bacteria Actinobacteria
## 5 0.23587135 0.63187487 ASV1253 0.001 0.4549011 Bacteria Verrucomicrobia
## 6 -0.37024644 0.36835532 ASV1386 0.001 0.2727681 Bacteria Proteobacteria
## 7 0.16372285 0.52030672 ASV1635 0.001 0.2975243 Bacteria Bacteroidetes
## 8 0.58105525 -0.17341278 ASV2364 0.001 0.3676972 Bacteria Firmicutes
## 9 0.01139161 -0.85679816 ASV3987 0.001 0.7342329 Bacteria Firmicutes
## 10 0.43448797 0.10670162 ASV4836 0.001 0.2001650 Bacteria Firmicutes
## Class Order Family
## 1 Bacteroidia Bacteroidales Prevotellaceae
## 2 Bacteroidia Bacteroidales Marinifilaceae
## 3 Bacteroidia Bacteroidales Rikenellaceae
## 4 Coriobacteriia Coriobacteriales Eggerthellaceae
## 5 Verrucomicrobiae Verrucomicrobiales Akkermansiaceae
## 6 Gammaproteobacteria Betaproteobacteriales Burkholderiaceae
## 7 Bacteroidia Bacteroidales Tannerellaceae
## 8 Clostridia Clostridiales Ruminococcaceae
## 9 Clostridia Clostridiales Lachnospiraceae
## 10 Erysipelotrichia Erysipelotrichales Erysipelotrichaceae
## tax_rank_plot Species
## 1 Prevotellaceae_UCG-001 unknown
## 2 Odoribacter unknown
## 3 Alistipes unknown
## 4 Enterorhabdus unknown
## 5 Akkermansia Akkermansia_muciniphila
## 6 Parasutterella unknown
## 7 Parabacteroides Parabacteroides_goldsteinii
## 8 Ruminococcaceae_UCG-014 unknown
## 9 Lachnospiraceae_NK4A136_group unknown
## 10 Turicibacter unknown
phyloseq_add_metadata_vector()
allows to see correlation between ordination and metadata also using vegan::envfit()
.
physeq_1 %>%
phyloseq_add_metadata_vector(dist = bc,
phyloseq = .,
figure_ord = out$plot,
metadata_sel = c("BW_percent", "BW", "colon_weight"),# , "histo_colo_total"),
top_r = 10, fact = 0.6, na.rm = TRUE) -> out2
out2$plot
phyloseq_generate_pcoa_per_variables()
can be usefull to generate PCoA per variable levels. We can apply this on the list of distances.
We will first reduce the distances to wjaccard and bjaccard:
## [1] "bjaccard" "wjaccard"
physeq_1 %>%
phyloseq_generate_pcoa_per_variables(tmp = ,
group = "treatment_invivo",
dist = dlist) -> pcoa_group
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
## [1] "bjaccard"
## [1] "wjaccard"
pcoa_group %>%
phyloseq_plot_ordinations_facet(plot_list = .,
return_eig_df = TRUE,
color_group = "day")
## $plots
##
## $eig_df
## .id...1 V1...2 .id...3 V1...4
## 1 none.bjaccard Axis.1 [10.3%] none.bjaccard Axis.2 [7.4%]
## 2 none.wjaccard Axis.1 [19.4%] none.wjaccard Axis.2 [10.1%]
## 3 TreatA.bjaccard Axis.1 [17.1%] TreatA.bjaccard Axis.2 [10.4%]
## 4 TreatA.wjaccard Axis.1 [25.2%] TreatA.wjaccard Axis.2 [13.9%]
## 5 TreatB.bjaccard Axis.1 [17.6%] TreatB.bjaccard Axis.2 [8.6%]
## 6 TreatB.wjaccard Axis.1 [24.4%] TreatB.wjaccard Axis.2 [12.2%]
## 7 TreatC.bjaccard Axis.1 [23%] TreatC.bjaccard Axis.2 [11%]
## 8 TreatC.wjaccard Axis.1 [24.6%] TreatC.wjaccard Axis.2 [16.6%]
4.3 Taxa/ASV centric appraoches:
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_taxa_tests.R")
phyloseq_run_DESeq2_pair_plots_formula
phyloseq_Maaslin2
phyloseq_boxplot_abundance
phyloseq_A_B_ratio
phyloseq_run_compare_means
phyloseq_run_Deseq
phyloseq_run_ALDEx2
phyloseq_correlate_taxa
4.3.1 Correlations with metadata:
4.3.2 Differential abundance tests:
4.3.3 Taxaonomic ratios:
4.4 Miscellaneous:
4.4.0.1 physeq_simplify_tax():
physeq_simplify_tax()
agglomerate the table at a specific level and rename taxa id accordingly while removing the rest of the table:
physeq_1_st %>%
physeq_simplify_tax(tax_sel = c("Family")) -> physeq_1_st_fam
physeq_1_st_fam %>%
tax_table() %>%
head()
## Taxonomy Table: [ 6 taxa by 8 taxonomic ranks ]:
## Kingdom Phylum Class Order Genus Species Strain X.Family.
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## Clostridiaceae_1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Prevotellaceae <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Marinifilaceae <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Rikenellaceae <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Muribaculaceae <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Saccharimonadaceae <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4.4.0.2 Obtain unique features:
sample_variables()
allows to list the names of the columns from the sample_data:
## [1] "Sample" "input" "filtered"
## [4] "filtered_pc" "denoisedF" "denoisedR"
## [7] "denoisedF_pc" "denoisedR_pc" "merged"
## [10] "merged_pc" "tabled" "chimera_out"
## [13] "length_filtered" "tabled_pc" "chimera_out_pc"
## [16] "length_filtered_pc" "day" "treatment"
## [19] "mouse_label" "BW" "BW_percent"
## [22] "BW_delta" "DAI" "treatment_grouped"
## [25] "treatment_invivo" "day_num"
physeq_get_unique()
allows to list the different levels of the categorical metadata:
## [1] none TreatB TreatA TreatC
## Levels: none TreatA TreatB TreatC
The variable treatment_invivo()
has 4 levels: none, TreatA, TreatB and TreatC.
Same for another metadata variable:
## [1] "m_32" "m_48" "m_18" "m_7" "m_22" "m_13" "m_34" "m_37" "m_19" "m_29"
## [11] "m_44" "m_26" "m_11" "m_2" "m_49" "m_35" "m_46" "m_51" "m_21" "m_54"
## [21] "m_53" "m_27" "m_36" "m_39" "m_28" "m_56" "m_52" "m_24" "m_10" "m_25"
## [31] "m_8" "m_55"
We can also apply this function to identify taxa. Below we extract the different phyla.
## [1] "Firmicutes" "Bacteroidetes" "Verrucomicrobia" "Proteobacteria"
## [5] "Cyanobacteria" "Actinobacteria" "Tenericutes" "Patescibacteria"
4.4.0.3 Getting top taxa per treatment levels:
The function phyloseq_ampvis_heatmap()
allows to generate taxonomic heatmaps and the parameter ntax
controls the top taxa among the explored samples.
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_heatmap.R")
physeq_1_st %>%
phyloseq_ampvis_heatmap(physeq = .,
facet_by = "treatment_invivo" ,
group_by = "treatment_invivo",
ntax = 5) -> heat_overall
heat_overall
We might be also interested in visualizing the 5 most abundant taxa per groups: per the levels of treatment_invivo i.e., “none”, “TreatA”, “TreatB”, “TreatC”. The function physeq_most_abundant()
returns a vector of the most abundant taxa per levels.
physeq_1_st %>%
physeq_most_abundant("treatment_invivo",
tax_level = "Strain",
ntax = 5) -> most_ab_treat
most_ab_treat
## [1] "Akkermansia Akkermansia_muciniphila ASV1253"
## [2] "unknown Alloprevotella (Genus) ASV0365"
## [3] "unknown Clostridium_sensu_stricto_1 (Genus) ASV0066"
## [4] "unknown Lachnospiraceae_NK4A136_group (Genus) ASV3104"
## [5] "unknown Lachnospiraceae_NK4A136_group (Genus) ASV3987"
## [6] "unknown Lactobacillus (Genus) ASV4998"
## [7] "unknown Muribaculaceae (Family) ASV0840"
## [8] "unknown Odoribacter (Genus) ASV0411"
## [9] "unknown Prevotellaceae_UCG-001 (Genus) ASV0385"
## [10] "unknown Turicibacter (Genus) ASV4836"
There are 9 ASV when considering the most abundant among the 4 groups. We can visualize those using a heatmap. We will transform the data in proportion before subseting the taxa of interest and generating the heatmap omitting the transformation.
physeq_1_st %>%
transform_sample_counts(function(x) x/sum(x) * 100) %>% # transform as percentage before filtering
subset_taxa(Strain %in% most_ab_treat) %>% # extract only the taxa to display - after percentage normalisation
phyloseq_ampvis_heatmap(physeq = .,
transform = FALSE, # extract only the taxa to display - after percentage normalisation
facet_by = "treatment_invivo" ,
group_by = "treatment_invivo",
ntax = Inf) -> heat_top_taxa_per_group
heat_top_taxa_per_group
4.4.0.4 Generate color palette from metadata:
Important aspect to keep homogeneity between plots in terms of colors - group assignments.
source("https://raw.githubusercontent.com/fconstancias/DivComAnalyses/master/R/phyloseq_varia.R")
physeq_1_st %>%
generate_color_palette(var = "treatment_invivo",
pal = "npg",
print = FALSE) -> treat_pal
## Loading required package: ggpubr
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
##
## rotate
## none TreatB TreatA TreatC
## "#E64B35FF" "#4DBBD5FF" "#00A087FF" "#3C5488FF"
We can use show_col()
from the scales package to visualise the palette. Using scales::show_col()
allows to run the function without loading the full package.
Compare rarefied vs un-rarefied alpha-diversity metrics:
This might be misleading when comparing Observed (i.e., number of observed ASV - ASV richness) since as we have seen rarefaction - with a decent sequencing depth i.e., when the the rarefaction curves starts saturating - removes some ASV which are very low abundant and prevalent. We will use that function to illustrate how we can keep colors assigned to treatment levels homogeneous and how to combine plots.
physeq_1_st %>%
phyloseq_rarefied_unrarefied_richness(measure = "Observed",
sample_size = physeq_1_st %>% sample_sums() %>% min(),
color_data = "treatment_invivo") -> p1
p1
We can also specify the colors manually using the
treat_pal
vector we created which associates levels of treatments and colors:
## none TreatB TreatA TreatC
## "#E64B35FF" "#4DBBD5FF" "#00A087FF" "#3C5488FF"
The colors will also be preserved if one of the level is filtered out:
physeq_1_st %>%
subset_samples(treatment_invivo != "none") %>%
phyloseq_rarefied_unrarefied_richness(measure = "Observed",
sample_size = physeq_1_st %>% sample_sums() %>% min(),
color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) # since the output of the function is a ggplot we can use + to finetune the graph
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 759OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
## `geom_smooth()` using formula 'y ~ x'
Using a richness estimator such as Chao1 or ACE might be more meaningfull.
physeq_1_st %>%
phyloseq_rarefied_unrarefied_richness(measure = "Chao1",
sample_size = physeq_1_st %>% sample_sums() %>% min(),
color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) -> p2
p2
Using Shannon we see that the value does not change much - because Shannon metric is also impacted by proportion of taxa:
physeq_1_st %>%
phyloseq_rarefied_unrarefied_richness(measure = "Shannon",
sample_size = physeq_1_st %>% sample_sums() %>% min(),
color_data = "treatment_invivo") + scale_color_manual(values = treat_pal) -> p3
p3
Even rarefying with 2000 sequences will preserve the Shannon pattern:
physeq_1_st %>%
phyloseq_rarefied_unrarefied_richness(measure = "Shannon",
sample_size = 2000,
color_data = "treatment_invivo") + scale_color_manual(values = treat_pal)
We can combine the graphs using ggarrange
from the ggpubr
package:
4.4.0.5 Compare phyloseq taxonomy of two phyloseq objects:
# source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")
# physeq_1_rare %>%
# phyloseq_DECIPHER_tax(physeq = .,
# threshold = 60,
# db="~/db/DADA2/SILVA_SSU_r132_March2018.RData" # check where you downloaded your database
# tryRC = TRUE,
# return = TRUE) -> physeq_1_rare_2
4.4.0.7 Exporting tables:
as tsv.
as an excel file.
# physeq_1 %>%
# sample_data() %>%
# xlsx::write.xlsx("my_sample_data.xlsx",
# sheetName = "sample_data",
# append = TRUE)
Good opportunity to introduce the phloseq_export_otu_tax()
function, it basically exports otu_table()
and tax_table()
of a phyloseq object.
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")
physeq_1 %>%
phloseq_export_otu_tax() %>%
head() %>%
DT::datatable()
We can use write.xlsx
under another sheet from the same excel file:
4.4.1 To explore:
physeq_1_rare %>%
phyloseq_ntaxa_by_tax(TaxRank = "Genus",
add_meta_data = FALSE) %>%
arrange(-N.OTU) %>%
DT::datatable()
## Warning in phyloseq::psmelt(physeq): The sample variables:
## Sample
## have been renamed to:
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
physeq_1 %>%
phyloseq_coverage(.,
correct_singletons = FALSE,
add_attr = T) %>%
arrange(-SampleCoverage) %>%
DT::datatable()
# physeq_1 %>%
# phyloseq_effort_div_rangeplot(range = seq(from = 10, to = 100, by = 10),
# # range_type = "coverage",
# show_plot = TRUE)
## Parameter Phys1
## 1 Average number of reads per OTU 2536.0731838
## 2 Average number of reads per sample 40926.9741379
## 3 Average OTU occurrence, percents 21.1322023
## 4 Coefficient of quartile variation in OTU abundance 0.9865244
## 5 Coefficient of quartile variation in sample abundance 0.1653620
## 6 Data sparsity (number of zeros) 171263.0000000
## 7 Data sparsity (percentage of zeros) 78.8677977
## 8 Max total OTU abundance 889239.0000000
## 9 Max total sample abundance 124774.0000000
## 10 Median number of reads per OTU 39.0000000
## 11 Median number of reads per sample 37874.0000000
## 12 Median OTU occurrence, percents 5.1724138
## 13 Min total OTU abundance 1.0000000
## 14 Min total sample abundance 15459.0000000
## 15 Number of OTUs 1872.0000000
## 16 Number of samples 116.0000000
## 17 Number of singletons 71.0000000
## 18 Percentage of singletons 3.7927350
## 19 Q1 of total OTU abundance 3.0000000
## 20 Q1 of total sample abundance 33680.2500000
## 21 Q3 of total OTU abundance 442.2500000
## 22 Q3 of total sample abundance 47026.0000000
## 23 Total number of reads 4747529.0000000
## 231 Percentage of reads 100.0000000
## 151 Percentage of OTUs 100.0000000
## Phys2
## 1 1132.8136450
## 2 15459.0000000
## 3 21.5653386
## 4 0.9844358
## 5 0.0000000
## 6 144028.0000000
## 7 78.4346614
## 8 347030.0000000
## 9 15459.0000000
## 10 34.0000000
## 11 15459.0000000
## 12 8.6206897
## 13 1.0000000
## 14 15459.0000000
## 15 1583.0000000
## 16 116.0000000
## 17 258.0000000
## 18 16.2981680
## 19 2.0000000
## 20 15459.0000000
## 21 255.0000000
## 22 15459.0000000
## 23 1793244.0000000
## 231 37.7721547
## 151 84.5619658
# source("https://raw.githubusercontent.com/vmikk/metagMisc/master/R/phyloseq_taxonomic_resolution.R")
#
# physeq_1 %>%
# phyloseq_taxonomic_resolution()
physeq_1 %>%
phyloseq_get_strains() %>%
physeq_glom_rename(rename_ASV = "Strain") %>%
prevalence(count = TRUE, sort = TRUE) %>%
head()
## Joining, by = "ASV"
## unknown Lactobacillus (Genus) ASV5082
## 116
## unknown Lachnospiraceae (Family) ASV4233
## 116
## unknown Peptococcaceae (Family) ASV3327
## 116
## unknown Desulfovibrionaceae (Family) ASV1903
## 116
## Bacteroides Bacteroides_acidifaciens ASV1690
## 116
## unknown Bacteroides (Genus) ASV1684
## 116
## [1] 116
https://cran.r-project.org/web/packages/iNEXT/vignettes/Introduction.html
# physeq_1 %>%
# phyloseq_get_strains() %>%
# physeq_glom_rename(rename_ASV = "Strain") %>%
# prepare_inext(., correct_singletons = F) -> inext
#
# inext %>%
# iNEXT::iNEXT(., q=c(0, 1, 2),
# datatype = "incidence_freq",
# nboot = 2,
# se = FALSE) -> out
#
# out %>%
# iNEXT::ggiNEXT(., type=1, facet.var="site")
# physeq_1_st %>%
# # subset_samples(PCR_plate != "P3" | sample_number > 30) %>%
# prune_taxa(taxa = taxa_sums(physeq_1_st) > 1000) %>%
# subset_taxa(Phylum != "unknown" & Order %in% c("Clostridiales", "Bacteroidales")) %>%
# # filter_taxa(function(x) sum(x) > 0, TRUE) %>%
# rarefy_even_depth(sample.size = 1000, rngseed = 123) %>%
# tax_glom(taxrank = "Order") %>%
# transform_sample_counts(function(x) x/sum(x) * 100)
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os macOS High Sierra 10.13.6
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Paris
## date 2022-04-13
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib
## abind 1.4-5 2016-07-21 [1]
## ade4 1.7-17 2021-06-17 [1]
## ampvis2 * 2.7.17 2022-03-25 [1]
## ape * 5.5 2021-04-25 [1]
## assertthat 0.2.1 2019-03-21 [1]
## backports 1.4.1 2021-12-13 [1]
## Biobase 2.50.0 2021-02-17 [1]
## BiocGenerics 0.34.0 2020-04-27 [1]
## biomformat 1.16.0 2020-04-27 [1]
## Biostrings 2.56.0 2020-04-27 [1]
## bit 4.0.4 2020-08-04 [1]
## bit64 4.0.5 2020-08-30 [1]
## bookdown 0.25 2022-03-16 [1]
## broom 0.7.11 2022-01-03 [1]
## bslib 0.2.5.1 2021-05-18 [1]
## cachem 1.0.6 2021-08-19 [1]
## callr 3.7.0 2021-04-20 [1]
## car 3.0-12 2021-11-06 [1]
## carData 3.0-5 2022-01-06 [1]
## cellranger 1.1.0 2016-07-27 [1]
## cli 3.1.0 2021-10-27 [1]
## cluster 2.1.2 2021-04-17 [1]
## codetools 0.2-18 2020-11-04 [1]
## colorspace 2.0-2 2021-06-24 [1]
## cowplot 1.1.1 2020-12-30 [1]
## crayon 1.4.2 2021-10-29 [1]
## crosstalk 1.2.0 2021-11-04 [1]
## data.table 1.14.2 2021-09-27 [1]
## DBI 1.1.1 2021-01-15 [1]
## dbplyr 2.1.1 2021-04-06 [1]
## desc 1.3.0 2021-03-05 [1]
## devtools 2.4.2 2021-06-07 [1]
## digest 0.6.29 2021-12-01 [1]
## dplyr * 1.0.7 2021-06-18 [1]
## DT 0.20 2021-11-15 [1]
## ellipsis 0.3.2 2021-04-29 [1]
## evaluate 0.14 2019-05-28 [1]
## fansi 0.5.0 2021-05-25 [1]
## fantaxtic 0.1.0 2020-08-25 [1]
## farver 2.1.0 2021-02-28 [1]
## fastmap 1.1.0 2021-01-25 [1]
## forcats * 0.5.1 2021-01-27 [1]
## foreach 1.5.1 2020-10-15 [1]
## fs 1.5.2 2021-12-08 [1]
## generics 0.1.1 2021-10-25 [1]
## GGally 2.1.2 2021-06-21 [1]
## gghalves 0.1.1 2020-11-10 [1]
## ggord * 1.1.6 2021-08-16 [1]
## ggplot2 * 3.3.5 2021-06-25 [1]
## ggpubr 0.4.0 2020-06-27 [1]
## ggrepel 0.9.1 2021-01-15 [1]
## ggsci 2.9 2018-05-14 [1]
## ggsignif 0.6.3 2021-09-09 [1]
## ggvegan * 0.1-0 2021-09-02 [1]
## glue 1.6.2 2022-02-24 [1]
## gridExtra 2.3 2017-09-09 [1]
## gtable 0.3.0 2019-03-25 [1]
## GUniFrac * 1.3 2021-08-18 [1]
## haven 2.4.3 2021-08-04 [1]
## highr 0.9 2021-04-16 [1]
## hms 1.1.0 2021-05-17 [1]
## htmltools 0.5.2 2021-08-25 [1]
## htmlwidgets 1.5.4 2021-09-08 [1]
## httr 1.4.2 2020-07-20 [1]
## igraph 1.2.6 2020-10-06 [1]
## iNEXT 2.0.20 2020-01-28 [1]
## IRanges 2.22.2 2020-05-21 [1]
## iterators 1.0.13 2020-10-15 [1]
## jquerylib 0.1.4 2021-04-26 [1]
## jsonlite 1.8.0 2022-02-22 [1]
## knitr 1.38 2022-03-25 [1]
## labeling 0.4.2 2020-10-20 [1]
## lattice * 0.20-44 2021-05-02 [1]
## lazyeval 0.2.2 2019-03-15 [1]
## lifecycle 1.0.1 2021-09-24 [1]
## lubridate 1.8.0 2021-10-07 [1]
## magrittr 2.0.3 2022-03-30 [1]
## MASS 7.3-54 2021-05-03 [1]
## Matrix 1.3-4 2021-06-01 [1]
## matrixStats 0.61.0 2021-09-17 [1]
## memoise 2.0.0 2021-01-26 [1]
## metagMisc * 0.0.4 2022-04-12 [1]
## mgcv 1.8-36 2021-06-01 [1]
## microbiome * 1.10.0 2020-04-27 [1]
## microbiomeutilities 1.00.11 2020-11-10 [1]
## modelr 0.1.8 2020-05-19 [1]
## multtest 2.44.0 2020-04-27 [1]
## munsell 0.5.0 2018-06-12 [1]
## nlme 3.1-152 2021-02-04 [1]
## permute * 0.9-5 2019-03-12 [1]
## pheatmap 1.0.12 2019-01-04 [1]
## phyloseq * 1.34.0 2021-02-17 [1]
## pillar 1.6.4 2021-10-18 [1]
## pkgbuild 1.2.0 2020-12-15 [1]
## pkgconfig 2.0.3 2019-09-22 [1]
## pkgload 1.2.1 2021-04-06 [1]
## plotly 4.9.4.1 2021-06-18 [1]
## plyr 1.8.6 2020-03-03 [1]
## prettyunits 1.1.1 2020-01-24 [1]
## processx 3.5.2 2021-04-30 [1]
## ps 1.6.0 2021-02-28 [1]
## purrr * 0.3.4 2020-04-17 [1]
## R6 2.5.1 2021-08-19 [1]
## RColorBrewer 1.1-2 2014-12-07 [1]
## Rcpp 1.0.7 2021-07-07 [1]
## readr * 2.0.1 2021-08-10 [1]
## readxl 1.3.1 2019-03-13 [1]
## remotes 2.4.0 2021-06-02 [1]
## reprex 2.0.1 2021-08-05 [1]
## reshape 0.8.8 2018-10-23 [1]
## reshape2 * 1.4.4 2020-04-09 [1]
## rhdf5 2.32.4 2020-10-05 [1]
## Rhdf5lib 1.10.1 2020-07-09 [1]
## rlang 1.0.2 2022-03-04 [1]
## rmarkdown 2.13 2022-03-10 [1]
## rmdformats 1.0.3.9000 2022-04-11 [1]
## rmutil 1.1.5 2020-06-09 [1]
## rprojroot 2.0.2 2020-11-15 [1]
## rstatix 0.7.0 2021-02-13 [1]
## rstudioapi 0.13 2020-11-12 [1]
## Rtsne 0.15 2018-11-10 [1]
## rvest 1.0.1 2021-07-26 [1]
## S4Vectors 0.26.1 2020-05-16 [1]
## sass 0.4.1 2022-03-23 [1]
## scales * 1.1.1 2020-05-11 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## speedyseq * 0.5.0 2020-09-29 [1]
## statmod 1.4.36 2021-05-10 [1]
## stringi 1.7.6 2021-11-29 [1]
## stringr * 1.4.0 2019-02-10 [1]
## survival 3.2-13 2021-08-24 [1]
## testthat 3.0.4 2021-07-01 [1]
## tibble * 3.1.6 2021-11-07 [1]
## tidyr * 1.1.4 2021-09-27 [1]
## tidyselect 1.1.1 2021-04-30 [1]
## tidyverse * 1.3.1 2021-04-15 [1]
## tzdb 0.1.2 2021-07-20 [1]
## usethis 2.0.1 2021-02-10 [1]
## utf8 1.2.2 2021-07-24 [1]
## vctrs 0.3.8 2021-04-29 [1]
## vegan * 2.5-7 2020-11-28 [1]
## viridisLite 0.4.0 2021-04-13 [1]
## vroom 1.5.4 2021-08-05 [1]
## withr 2.4.3 2021-11-30 [1]
## xfun 0.30 2022-03-02 [1]
## xml2 1.3.2 2020-04-23 [1]
## XVector 0.28.0 2020-04-27 [1]
## yaml 2.3.5 2022-02-21 [1]
## zlibbioc 1.34.0 2020-04-27 [1]
## source
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (MadsAlbertsen/ampvis2@f48b708)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## bioc_git2r (@9927f90)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.1)
## CRAN (R 4.0.2)
## Github (gmteunisse/Fantaxtic@4653973)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (erocoar/gghalves@0e0ea16)
## https://fawda123.r-universe.dev (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (gavinsimpson/ggvegan@cabb641)
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Bioconductor
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.5)
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (vmikk/metagMisc@9cb3131)
## CRAN (R 4.0.2)
## Bioconductor
## Github (microsud/microbiomeutilities@fd0dd94)
## CRAN (R 4.0.2)
## Bioconductor
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## bioc_git2r (@cbed93e)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Bioconductor
## Bioconductor
## CRAN (R 4.0.5)
## CRAN (R 4.0.5)
## Github (juba/rmdformats@b151102)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Bioconductor
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## Github (mikemc/speedyseq@64932ce)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.2)
## CRAN (R 4.0.5)
## CRAN (R 4.0.2)
## Bioconductor
## CRAN (R 4.0.5)
## Bioconductor
##
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
5 Alternative analysis approaches using user-friendlier platformes:
If you do not have experience with R
and you do not want to invest time learning it or you want basic descriptions of your data, the easiest way to proceed is to use user friendly interfaces.
There are several options:
5.1 R-Shiny-interfaces:
These are R
based interactive interfaces enabling a smooth code free experience.
Select your data > RDS allows to directly load the phyloseq object.
5.2 qiime2:
QIIME™ (pronounced chime) stands for Quantitative Insights Into Microbial Ecology. It is a bioinformatic pipeline developed to ease the process of bioinformatic analysis of raw sequencing microbiome data - not explored here since we have processed the data using metabaRpipe. It is also useful to visualize and analyze the processed data, in a user friendly experience:
5.2.1 Generate qiime2
compatibles files from the phyloseq object:
Check the metabaRpipe tutorial here or there
Rscript metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript \
-i dada2/physeq_phylo/phyloseq_phylo.RDS \
-o qiime2 \
-f metabaRpipe/Rscripts/functions.R
Alternatively, this can also be run from R/Rstudio:
require(tidyverse); source("https://raw.githubusercontent.com/fconstancias/metabaRpipe/master/Rscripts/functions.R")
readRDS("dada2/phyloseq.RDS") %>%
physeq_export_qiime(output_dir = "~/qiime2/")
This command will generate all the necessary processed files to analyze the metabarcoding data and as seen here.
# List/ check the files are here:
ls
asv.fna asv_neweek.qza asv_rep_set.qza qiime2_mapping_file.txt qiime2_otu.qza tax.txt
asv_biom.biom asv_neweek.tre qiime1_mapping_file.txt qiime2_metadata.qzv qiime2_taxonomy.qza
We have generated:
- A fasta file containing the sequence of the ASV:
asv.fna
. This corresponds to theps %>% refseq()
inR
. - A text file with the metadata:
qiime2_mapping_file.txt
. This corresponds to theps %>% sample_data()
inR
. - A text with the taxonomic path of the ASV:
tax.txt
. This corresponds to theps %>% sample_data()
inR
. - A biom file with the sample wise ASV counts:
asv_biom.biom
. This to theps %>% otu_table()
inR
. - A neweek ASV phylogenetic tree :
asv_neweek.tre
. This to theps %>% phy_tree()
inR
.
5.2.2 Import the data in qiime2
:
- Install
qiime2
in a dedicated environment https://docs.qiime2.org/2022.2/install/native/
wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-osx-conda.yml
conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-osx-conda.yml
# OPTIONAL CLEANUP
rm qiime2-2022.2-py38-osx-conda.yml
# Activate the conda environment
conda activate qiime2-2022.2
# Test your installation
qiime --help
- Activate the environment change directory:
# Activate the qiime conda environment
conda activate qiime2-2022.2
# Navigate were the output of phyloseq_export_qiime.Rscript were generated:
cd qiime2/
# List/ check the files are here:
ls
asv.fna asv_biom.biom asv_neweek.tre qiime1_mapping_file.txt qiime2_mapping_file.txt tax.txt
- Import the data in
qiime2
compatible format:
The following commands are required to import the files in qza
qiime2
specific format
# Import OTU/ASV table:
qiime tools import \
--input-path asv_biom.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV100Format \
--output-path qiime2_otu.qza
# Import taxonomic table:
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path tax.txt \
--output-path qiime2_taxonomy.qza
# Import phylogenetic tree:
qiime tools import \
--input-path asv_neweek.tre \
--output-path asv_neweek.qza \
--type 'Phylogeny[Rooted]'
# Import ASV sequences:
qiime tools import \
--input-path asv.fna \
--output-path asv_rep_set.qza \
--type 'FeatureData[Sequence]'
# Import metadata:
qiime metadata tabulate \
--m-input-file qiime2_mapping_file.txt \
--o-visualization qiime2_metadata.qzv
The import successfully generated .qza files we will then use to visualize/ analyze the data:
5.2.3 Diversity analyses:
5.2.4 Data summary:
The first step is to summarize the ASV table using qiime feature-table summarize
qiime feature-table summarize \
--i-table qiime2_otu.qza \
--o-visualization qiime2_otu.qzv \
--m-sample-metadata-file qiime2_mapping_file.txt
This will generate a qiime2_otu.qzv
which we can open using:
In addition to the Overview
, the Interactive Sample Detail
summaries the number of Features (i.e., here ASV) among samples. We can adjust the Sampling Depth
from the Plot Controls
panel to see how many samples would be discarded when considering a minimum sampling depth cutoff or a rarefaction threshold.
An example of such summary can be found here
5.2.5 Rarafaction analysis:
Another meaningful preliminary analysis is to perform rarefaction curves investigating the changes of alpha-diversity metrics (e.g., Richness or Observed_features) as a function of the sequencing depth.
qiime diversity alpha-rarefaction \
--i-phylogeny asv_neweek.qza \
--i-table qiime2_otu.qza \
--p-max-depth 5000 \
--o-visualization CORE-METRICS-RESULTS/alpha-rarefaction.qzv
An example of rarefaction curves can be found here. Ideally they should reach a plateau meaning the sequencing effort allows an exhaustive description of the diversity.
5.2.6 Generating core-metrics for diversity analyses.
An important parameter here is the --p-sampling-depth
which is the sequencing depth at which all the samples will be rarefied
in order to avoid biases due to uneven sequencing depth.
qiime diversity core-metrics-phylogenetic \
--i-phylogeny asv_neweek.qza \
--i-table qiime2_otu.qza \
--p-sampling-depth 3500 \
--m-metadata-file qiime2_mapping_file.txt \
--output-dir CORE-METRICS-RESULTS
This command generates a lot of qza
files corresponding to different alpha
(i.e., faith_pd, observed_features, shannon, evenness) and beta
diversity metrics (i.e., unweighted_unifrac, weighted_unifrac, jaccard_distance, bray_curtis_distance) and including PCoA ordniations for beta diversity metrics. (e.g., jaccard_pcoa_results.qza).
Saved FeatureTable[Frequency] to: CORE-METRICS-RESULTS/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: CORE-METRICS-RESULTS/evenness_vector.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/jaccard_distance_matrix.qza
Saved DistanceMatrix to: CORE-METRICS-RESULTS/bray_curtis_distance_matrix.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/weighted_unifrac_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/jaccard_pcoa_results.qza
Saved PCoAResults to: CORE-METRICS-RESULTS/bray_curtis_pcoa_results.qza
Saved Visualization to: CORE-METRICS-RESULTS/unweighted_unifrac_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/weighted_unifrac_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/jaccard_emperor.qzv
Saved Visualization to: CORE-METRICS-RESULTS/bray_curtis_emperor.qzv
5.2.6.1 Alpha-diversity:
Alpha diversity is often defined as univariate within sample community characteristic, for more information please refer to slides 63, 66 to 79 Mahendra Mariadassou’s Lecture Now we have generated different metrics, we can visualize and explore associations with the metadata.
qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/faith_pd_vector.qza \
--m-metadata-file qiime2_mapping_file.txt \
--o-visualization CORE-METRICS-RESULTS/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/evenness_vector.qza \
--m-metadata-file qiime2_mapping_file.txt \
--o-visualization CORE-METRICS-RESULTS/evenness-group-significance.qzv
You will get errors related to the mapping file, we can quickly generate a new artificial mapping file linking samples to group A or B.
vim qiime2_mapping_file_2.txt
sampleid group
R1F1-S66 A
R1F2-S300 A
R1F3-S90 B
Y2A15-2M-06-S78 B
Y2A15-2M-12-S77 B
Y3-R1F4-S136 B
Let’s run the functions again:
qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/faith_pd_vector.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity CORE-METRICS-RESULTS/evenness_vector.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/evenness-group-significance.qzv
And visualize the output, for instance for the evenness metric:
Here and example of the output.
If you want to explore the results for all the metrics you will have to open all the files and just considering one metadata covariate.
5.2.6.2 Beta-diversity:
Beta-diversity is often defined as multidimensional between sample community characteristic, for more information please refer to slides 63, 80 to 92 of Mahendra Mariadassou’s Lecture. This type of data is multidimensional and we need specific multidimensional approaches to visualize analyze such data. Now we have generated different metrics, we can visualize and explore associations with the metadata.
5.2.6.2.1 PCoA ordination:
Principal Coordinate analysis is and unconstrained ordination aiming at exploring the main variation in the data. It is similar as a PCA but can deal with any type of metric.
Below are the commands to generate a PCoA using the unweighted-unifrac
metric we generated using the qiime diversity core-metrics-phylogenetic
with the data rarefied at 3500
sequences per sample.
qiime emperor plot \
--i-pcoa CORE-METRICS-RESULTS/unweighted_unifrac_pcoa_results.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization CORE-METRICS-RESULTS/unweighted-unifrac-emperor.qzv
We can then visualize the 3 first axes of the PCoA using the following command:
An example of ordination plot can be found here
5.2.7 Taxonomic visualisation:
The diversity analyses we have explored are not incorporating any taxonomic information and are considering ASV (i.e., features in qiime2
). We can visualize community composition in terms of taxa proportion among samples using qiime taxa barplot
:
qiime taxa barplot \
--i-table qiime2_otu.qza \
--i-taxonomy qiime2_taxonomy.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--o-visualization tax-bar-plots.qzv
The function generates a qzv
plot we can explore using qiime tools view
:
An example of the output can be found here
5.2.8 Filtering data:
There are many ways to filter the data: the samples, the ASV based on metadata, proportions, taxonomic information… Below an example on sample filtering based on metadata.
5.2.8.1 Based on metadata:
qiime feature-table filter-samples \
--i-table qiime2_otu.qza \
--m-metadata-file qiime2_mapping_file_2.txt \
--p-where "[group]='B'" \
--o-filtered-table qiime2_otu_groupB.qza
Analyzing the filtered dataset would require to start again with the beginning qiime diversity core-metrics-phylogenetic
which will lead to loads of folders, outputs and makes it difficult to keep track of the code and analyses when things are getting a bit complex.