Reproducibilty is an essential aspect of data analysis. Especially for sequencing data for microbial community profilling. There are lack of standards to analysing and reporting data. This is mostly because there are simply some aspects of the data that we do not control. In such cases it is important to have rational decision making and documenting it. In almost all microbiome profilling data choices regarding filtering parameters (OTU count, OTU prevalences), normalisation, transformation are routine practice. Pre-processing is an essential step and it is crucial that it is documented. There are some common approaches to look at the data.
1. How many reads/samples? if there is large difference this may affect the choices you make for alpha diversity, betadiversity and differential abundance testing. 2. How many reads/OTU ? this is aslo important to see if there are lot of OTUs with small mean & large coefficient of variation (C.V.).
This tempate is for microbiome data analysis, handling and visualisation. This makes it easy to follow some standard procedures for looking at your data and reporting your results. This first-hand look at your data will aid in making different choices for more thorough and in-dept investigation.
Check this link on Organize your data and code.
You can find the source code at microbiomeutilities github repo.
A step wise guide can be found at microbiomeutilities wiki.
Fill the details accordingly.
Author : Sudarshan
Project : Mock
#generate a random number
n <- runif(1, 0, 10^6) # keep this number stored as below instead of XXX when re running the codes.
message("random number set.seed is below")
## random number set.seed is below
print(n)
## [1] 657260.8
set.seed(n)
Check the code chunk next to this of left side to see folder names where files are stored.
message("Working directory is ", work_dir)
## Working directory is F:/R_studio_pacakes/MICROBIOMEUTILITIES/microbiomeutilities_2018/microbiomeutilities
if(file.exists("QC")) {
message("QC folder already exists, data will be overwritten")
} else{
message("QC folder will be created in ", out_dir)
dir.create(paste0(out_dir, "/QC"))
}
## QC folder already exists, data will be overwritten
if(file.exists("AlphaDiversity")) {
message("AlphaDiversity folder already exists, data will be overwritten")
} else{
message("AlphaDiversity folder will be created in ", out_dir)
dir.create(paste0(out_dir, "/AlphaDiversity"))
}
## AlphaDiversity folder already exists, data will be overwritten
if(file.exists("BetaDiversity")) {
message("BetaDiversity folder already exists, data will be overwritten")
} else{
message("BetaDiversity folder will be created in ", out_dir)
dir.create(paste0(out_dir, "/BetaDiversity"))
}
## BetaDiversity folder already exists, data will be overwritten
if(file.exists("Others")) {
message("Others folder already exists, data will be overwritten")
} else{
message("Others folder will be created in ", out_dir)
dir.create(paste0(out_dir, "/Others"))
}
## Others folder already exists, data will be overwritten
if(file.exists("PhyloseqObjects")) {
message("PhyloseqObjects folder already exists, data will be overwritten")
} else{
message("PhyloseqObjects folder will be created in ", out_dir)
dir.create(paste0(out_dir, "/PhyloseqObjects"))
}
## PhyloseqObjects folder already exists, data will be overwritten
We have to load libraries click on the code tab on the left hand die of this doc for looking at the libraries that are installed. It is good to install these packages before.
library(microbiome)
library(microbiomeutilities)
# install.packages("picante",repos="http://R-Forge.R-project.org")
# library(picante)
library(picante)
library(data.table)
library(DT)
library(RColorBrewer)
library(phyloseq)
library(tibble)
library(ggpubr)
This code chunk reads the input files and creates a phyloseq object.
ps0 <- read_phyloseq(otu.file = otufile,
metadata.file = mapping,
taxonomy.file = taxonomy,
type = "biom")
## Time to complete depends on OTU file size
if (!is.na(treefilename)){
tree <- read.tree(treefilename)
ps0 <- merge_phyloseq(ps0, tree)
} else{
message("No tree available")
}
saveRDS(ps0, paste0(out_dir, "/PhyloseqObjects/ps_raw.rds"))
message("Raw phyloseq object, confirm the number of samples and variables (as in columns of mapping file)")
## Raw phyloseq object, confirm the number of samples and variables (as in columns of mapping file)
message("Below is the content of raw phyloseqobject stored as ps_raw.rds")
## Below is the content of raw phyloseqobject stored as ps_raw.rds
print(ps0)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 466 taxa and 57 samples ]
## sample_data() Sample Data: [ 57 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 466 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 466 tips and 464 internal nodes ]
print("Below is the summary of the phyloseq object")
## [1] "Below is the summary of the phyloseq object"
summarize_phyloseq(ps0)
## Compositional = NO
## 1] Min. number of reads = 1867
## 2] Max. number of reads = 315845
## 3] Total number of reads = 5251399
## 4] Average number of reads = 92129.8070175439
## 5] Median number of reads = 77281
## 7] Sparsity = 0.856938483547926
## 6] Any OTU sum to 1 or less? NO
## 8] Number of singletons = 0
## 9] Percent of OTUs that are singletons 0
## 10] Number of sample variables are: 14
## BarcodeSequence
## LibraryNumber
## Direction
## LibraryName
## Facet
## MC_type1
## MC_type2
## ProjectName
## Region
## Mock_type
## Mock_replicate
## BarcodeNumber
## Description
## MockTypeGroup
Check the library sizes for each of the samples within the main variable. Do you think it is okay or there is difference in library sizes and will this affect you downstream statistical analysis?
SeqDepth <- colSums(otu_table(ps0))
sample_data(ps0)$SeqDepth <- SeqDepth
meta.df <- meta(ps0)
qc_plot1 <- plot_read_distribution(ps0, groups= VariableA, plot.type= 'density')
## [1] "Done plotting"
print(qc_plot1)
ggsave(paste0(out_dir,"/QC/ReadDistribution_density.pdf"))
## Saving 7 x 5 in image
message("QC plots for Read Distribution stored in QC folder as ReadDistribution.pdf")
## QC plots for Read Distribution stored in QC folder as ReadDistribution.pdf
message("Investigating library sizes")
## Investigating library sizes
SeqDepth <- colSums(otu_table(ps0))
sample_data(ps0)$SeqDepth <- SeqDepth
meta.df <- meta(ps0)
lib.hist <- ggplot(meta.df, aes(x = SeqDepth)) + geom_histogram() +
facet_wrap(~meta.df[,VariableA]) +
xlab("Library size")
print(lib.hist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave(paste0(out_dir,"/QC/ReadDistribution_density_hist.pdf"))
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
message("QC plots for library sizes stored in QC folder as ReadDistribution_density_hist.pdf")
## QC plots for library sizes stored in QC folder as ReadDistribution_density_hist.pdf
If any sample has less than 2000 reads, it will be removed
if(min(sample_sums(ps0)) < 2000){
print("There are sample(s) less than 2000 reads, these will be removed")
ps0 = prune_samples(sample_sums(ps0)>=2000, ps0)
} else {
print("No samples below 2000 reads")
print(ps0)
}
## [1] "There are sample(s) less than 2000 reads, these will be removed"
message("Investigating OTU counts distribution")
## Investigating OTU counts distribution
taxasums = rowSums(otu_table(ps0))
taxatable <- as.data.frame.matrix(tax_table(ps0))
tax_plot1 <- ggplot(taxatable, aes(x = taxasums, color = taxatable[,"Phylum"])) +
geom_line(size = 1.5, stat = "density") +
xlab("OTU Counts") + theme_bw() + scale_x_log10()
print(tax_plot1)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16 rows containing non-finite values (stat_density).
ggsave(paste0(out_dir,"/QC/Distribution_OTU_Counts_by_phyla.pdf"), height = 6, width = 14)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16 rows containing non-finite values (stat_density).
message("Investigating OTU counts distribution")
## Investigating OTU counts distribution
tax_plot2 <- ggplot(taxatable, aes(x = taxasums, fill = taxatable[,"Phylum"])) +
geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
xlab("OTU Counts") + theme_bw() + scale_x_log10()
print(tax_plot2)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16 rows containing non-finite values (stat_bin).
ggsave(paste0(out_dir,"/QC/Distribution_OTU_Counts_by_phyla_hist.pdf"), height = 6, width = 10)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16 rows containing non-finite values (stat_bin).
message("QC plots for library sizes stored in QC folder as Distribution_OTU_Counts.pdf")
## QC plots for library sizes stored in QC folder as Distribution_OTU_Counts.pdf
Check which of the OTUs are present in low abundance and low prevalence. You might want to remove them depending on the research question.
# for sanity
prev.plot <- plot_taxa_prevalence(ps0, "Phylum")
prev.plot
ggsave(paste0(out_dir,"/QC/OTU_prevalence_phyla.pdf"), height = 8, width = 16)
# for sanity
ps1 <- prune_taxa(taxa_sums(ps0) > 0, ps0)
This is variance for all OTU counts without filtering for min number of reads/OTU and prevalence.
Variance.plot.a <- qplot(log10(apply(otu_table(ps1), 1, var)),
xlab = "log10(variance)",
main = "Variance in OTUs") +
ggtitle("before filtering") + theme_minimal()
print(Variance.plot.a)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave(paste0(out_dir, "/QC/Variance before filtering.pdf"))
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
message("QC plots for OTU variance stored in QC folder as Variance before filtering.pdf")
## QC plots for OTU variance stored in QC folder as Variance before filtering.pdf
This is variance for all OTU counts after filtering for min number of reads/OTU and prevalence.
if (filterpseq == TRUE) {
message(paste0("Filtering OTUs with less than ", filterCount, " counts"))
message(paste0("in at least ", filterPrev*100, " % of the samples "))
ps2 = filter_taxa(ps1, function(x) sum(x > filterCount) > (filterPrev * length(x)), TRUE)
message("Saving the transformed phyloseq object as ps_filtered.rds")
saveRDS(ps2, paste0(out_dir,"/PhyloseqObjects/ps_filtered.rds"))
message("Below is the content of filtered phyloseqobject (based on filterCount and filterPrev) stored as ps_filtered.rds")
print(ps2)
Variance.plot.b <- qplot(log10(apply(otu_table(ps2), 1, var)),
xlab = "log10(variance)",
main = "Variance in OTUs") +
ggtitle("after filtering") +
theme_minimal()
print(Variance.plot.b)
ggsave(paste0(out_dir,"/QC/Variance After filtering.pdf"))
} else
{
message("filterpseq was false. Did not filter and hence will not save the filtered phyloseq")
ps2 <- ps1
}
## Filtering OTUs with less than 4 counts
## in at least 1 % of the samples
## Saving the transformed phyloseq object as ps_filtered.rds
## Below is the content of filtered phyloseqobject (based on filterCount and filterPrev) stored as ps_filtered.rds
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 401 taxa and 56 samples ]
## sample_data() Sample Data: [ 56 samples by 15 sample variables ]
## tax_table() Taxonomy Table: [ 401 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 401 tips and 400 internal nodes ]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is coefficient of variation for all OTU counts without filtering for min number of reads/OTU and prevalence.
cv_plot <- plot_taxa_cv(ps1, "hist")
print(cv_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave(paste0(out_dir, "/QC/Coefficient of variation before filtering.pdf"))
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
message("QC plots for OTU variance stored in QC folder as Variance before filtering.pdf")
## QC plots for OTU variance stored in QC folder as Variance before filtering.pdf
This is coefficient of variation for all OTU counts after filtering for min number of reads/OTU and prevalence.
if (filterpseq == TRUE) {
message(paste0("Filtering OTUs with less than ", filterCount, " counts"))
message(paste0("in at least ", filterPrev*100, " % of the samples "))
ps2 = filter_taxa(ps1, function(x) sum(x > filterCount) > (filterPrev * length(x)), TRUE)
message("Using the filtered phyloseq object as ps_filtered.rds")
message("Below is the content of filtered phyloseqobject (based on filterCount and filterPrev) stored as ps_filtered.rds")
print(ps2)
cv_plot <- plot_taxa_cv(ps2, "hist")
print(cv_plot)
ggsave(paste0(out_dir,"/QC/Coefficient of variation After filtering.pdf"))
} else
{
message("filterpseq was false. Did not filter and hence will not save the filtered phyloseq")
ps2 <- ps1
}
## Filtering OTUs with less than 4 counts
## in at least 1 % of the samples
## Using the filtered phyloseq object as ps_filtered.rds
## Below is the content of filtered phyloseqobject (based on filterCount and filterPrev) stored as ps_filtered.rds
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 401 taxa and 56 samples ]
## sample_data() Sample Data: [ 56 samples by 15 sample variables ]
## tax_table() Taxonomy Table: [ 401 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 401 tips and 400 internal nodes ]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Kurtosis is essentially a measure of how much weight is at the tails of the distribution relative to the weight around the location.
If you have large differences in your library sizes then you may have to normalise your data.
message("Using the raw phyloseq to check for kurtosis in library size")
## Using the raw phyloseq to check for kurtosis in library size
df <- data.table(NumberReads = sample_sums(ps0), SampleID = sample_names(ps0))
require(moments)
n <- kurtosis(df$NumberReads)
if (n > 3) {
message("Your library size is heavily tailed, considering normalising them for further analysis")
} else {
message("The variation in library sizes is below kurtosis value of 3 may indicate no need for rarefying")
}
## Your library size is heavily tailed, considering normalising them for further analysis
Alpha diversity measures are standard calculations in microbial ecology. The differences in richness and eveness between groups may have importance to understanding the ecology. There are numerous measures we use the defaults from microbiome
R package and also the phylogenetic diversity from picante
R package.
The caculations can be done on rarefied or non-rarefied data which can be specified by the samsize
option in “Set project attributes” option above.
Below you can find
For more on this check Microbiome:Diversities
if (!is.na(samsize)) {
ps3 <- rarefy_even_depth(ps2, sample.size = samsize)
saveRDS(ps3, paste0(out_dir, "/phyloseqObjects/ps_rarefyied.rds"))
} else{
ps3 <- ps2
}
metadf <- meta(ps3)
metadf$sam_rep_nw <- rownames(metadf)
adiv.meta <- estimate_richness(ps3)
## Warning in estimate_richness(ps3): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
colnames(adiv.meta)
## [1] "Observed" "Chao1" "se.chao1" "ACE" "se.ACE"
## [6] "Shannon" "Simpson" "InvSimpson" "Fisher"
adiv.meta$sam_rep_nw <- rownames(adiv.meta)
adiv.nw <- reshape2::melt(adiv.meta)
## Using sam_rep_nw as id variables
colnames(adiv.nw) <- c("sam_rep_nw","Diversity","div.val")
meta_df_nw <- reshape2::melt(metadf)
## Using BarcodeSequence, Direction, LibraryName, Facet, MC_type1, MC_type2, ProjectName, Region, Mock_type, Mock_replicate, Description, MockTypeGroup, sam_rep_nw as id variables
meta_adiv <- merge.data.frame(meta_df_nw, adiv.nw, by = "sam_rep_nw")
colnames(meta_adiv)
## [1] "sam_rep_nw" "BarcodeSequence" "Direction"
## [4] "LibraryName" "Facet" "MC_type1"
## [7] "MC_type2" "ProjectName" "Region"
## [10] "Mock_type" "Mock_replicate" "Description"
## [13] "MockTypeGroup" "variable" "value"
## [16] "Diversity" "div.val"
p <- ggqqplot(meta_adiv, "div.val",
facet.by = c("Diversity", VariableA),
color = VariableA)
p <- facet(p , facet.by = c("Diversity", VariableA), scales = "free")
print(p)
## Warning: Removed 294 rows containing non-finite values (stat_qq).
## Warning: Removed 294 rows containing non-finite values (stat_qq_line).
## Warning: Removed 294 rows containing non-finite values (stat_qq_line).
#Create 2x2 plot environment
ggsave(paste0(out_dir,"/AlphaDiversity/Non-phylogenetic_alpha_diversity_qqnorm.pdf"), height = 8, width= 12)
## Warning: Removed 294 rows containing non-finite values (stat_qq).
## Warning: Removed 294 rows containing non-finite values (stat_qq_line).
## Warning: Removed 294 rows containing non-finite values (stat_qq_line).
#shapiro.test
shapiro.test(adiv.meta$Observed)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$Observed
## W = 0.94917, p-value = 0.01953
shapiro.test(adiv.meta$Chao1)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$Chao1
## W = 0.94917, p-value = 0.01953
shapiro.test(adiv.meta$ACE)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$ACE
## W = 0.9755, p-value = 0.935
shapiro.test(adiv.meta$Shannon)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$Shannon
## W = 0.91243, p-value = 0.0006159
shapiro.test(adiv.meta$Simpson)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$Simpson
## W = 0.88701, p-value = 7.887e-05
shapiro.test(adiv.meta$InvSimpson)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$InvSimpson
## W = 0.96334, p-value = 0.08657
shapiro.test(adiv.meta$Fisher)
##
## Shapiro-Wilk normality test
##
## data: adiv.meta$Fisher
## W = 0.77883, p-value = 8.925e-08
alpha_div <- plot_richness(ps3, x = VariableA,
color = VariableB,
measures = c("Observed",
"Chao1",
"Shannon",
"InvSimpson")) + geom_boxplot()
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
alpha_div <- alpha_div + theme_bw() + geom_point(size = 2) + ggtitle("Non phylogenetic diversity") + scale_color_brewer(palette = col.palette) + rotate_x_text()
print(alpha_div)
## Warning: Removed 168 rows containing missing values (geom_errorbar).
ggsave(paste0(out_dir,"/AlphaDiversity/Non-phylogenetic_alpha_diversity.pdf"),
height = 6, width = 18)
## Warning: Removed 168 rows containing missing values (geom_errorbar).
if (!is.na(samsize)){
message("Non-phylogenetic_alpha_diversity on RAREFIED data stored in AlphaDiversity folder")
message("Non-phylogenetic_alpha_diversity.pdf")
} else{
message("Non-phylogenetic_alpha_diversity on NON-RAREFIED data stored in AlphaDiversity folder")
message("Non-phylogenetic_alpha_diversity.pdf")
}
## Non-phylogenetic_alpha_diversity on NON-RAREFIED data stored in AlphaDiversity folder
## Non-phylogenetic_alpha_diversity.pdf
For more on this check Picante.
if (!is.na(treefilename)){
message("If sam.size was provided then rarefyied phyloseq object will be used to calculate PD")
print(ps3)
otu_table_ps3 <- as.data.frame(ps3@otu_table)
metadata_table_ps3 <- meta(ps3)
message("include.root in pd is set to FALSE by default")
df.pd <- pd(t(otu_table_ps3), tree,include.root=F) # t(ou_table) transposes the table for use in picante and the tre file comes from the first code chunck we used to read tree file (see making a phyloseq object section).
datatable(df.pd)
# now we need to plot PD
# check above how to get the metadata file from phyloseq object.
# We will add the results of PD to this file and then plot.
select.meta <- metadata_table_ps3[,c(VariableA,VariableB)] #, "Phyogenetic_diversity"]
select.meta$Phyogenetic_diversity <- df.pd$PD
colnames(select.meta) <- c("VariableA", "VariableB", "Phyogenetic_diversity")
shapiro.test(select.meta$Phyogenetic_diversity)
qqnorm(select.meta$Phyogenetic_diversity)
plot.pd <- ggplot(select.meta, aes(VariableA, Phyogenetic_diversity)) +
geom_boxplot(aes(fill = VariableB)) + geom_point(size = 2) +
theme(axis.text.x = element_text(size=14, angle = 90)) +
theme_bw() + scale_fill_brewer(palette = col.palette)
print(plot.pd)
ggsave(paste0(out_dir, "/AlphaDiversity/Phylogenetic_diversityon_nonRafrefied_data.pdf"),
plot = plot.pd,
height = 6, width = 18)
} else{
message("No tree supplied, PD cannot be calculated")
}
## If sam.size was provided then rarefyied phyloseq object will be used to calculate PD
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 401 taxa and 56 samples ]
## sample_data() Sample Data: [ 56 samples by 15 sample variables ]
## tax_table() Taxonomy Table: [ 401 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 401 tips and 400 internal nodes ]
## include.root in pd is set to FALSE by default
Having a look at the phyloegnetic composition of you data is useful for many reasons. For this purpose we have two Phylum and Family level plots.
ps3.com <- ps1 # create a new pseq object
# We need to set Palette
taxic <- as.data.frame(ps3.com@tax_table) # this will help in setting large color options
colourCount = length(unique(taxic$Phylum)) #define number of variable colors based on number of Family (change the level accordingly to phylum/class/order)
getPalette = colorRampPalette(brewer.pal(12, col.palette)) # change the palette as well as the number of colors will change according to palette.
# now edit the unclassified taxa
tax_table(ps3.com)[tax_table(ps3.com)[, "Phylum"] == "f__", "Phylum"] <- "f__Unclassified Phylum"
# We will also remove the 'f__' patterns for cleaner labels
tax_table(ps3.com)[, colnames(tax_table(ps3.com))] <- gsub(tax_table(ps3.com)[,
colnames(tax_table(ps3.com))], pattern = "[a-z]__", replacement = "")
otu.df <- as.data.frame(otu_table(ps3.com)) # make a dataframe for OTU information.
# head(otu.df) # check the rows and columns
taxic$OTU <- row.names.data.frame(otu.df) # Add the OTU ids from OTU table into the taxa table at the end.
colnames(taxic) # You can see that we now have extra taxonomy levels.
## [1] "Domain" "Phylum" "Class" "Order" "Family" "Genus" "OTU"
taxmat <- as.matrix(taxic) # convert it into a matrix.
new.tax <- tax_table(taxmat) # convert into phyloseq compaitble file.
tax_table(ps3.com) <- new.tax # incroporate into phyloseq Object
# it would be nice to have the Taxonomic names in italics.
# for that we set this
guide_italics <- guides(fill = guide_legend(label.theme = element_text(size = 15,
face = "italic", colour = "Black", angle = 0)))
## Now we need to plot at family level, We can do it as follows:
# first remove the phy_tree
ps3.com@phy_tree <- NULL
lev0 = "Phylum"
tax_table(ps3.com)[,lev0][is.na(tax_table(ps3.com)[,lev0])] <- paste0(tolower(substring(lev0, 1, 1)), "__")
ps3.com.phy <- aggregate_taxa(ps3.com, "Phylum")
ps3.com.phy.rel <- microbiome::transform(ps3.com.phy, "compositional")
plot.composition.relAbun.phy <- plot_composition(ps3.com.phy.rel) + theme(legend.position = "bottom") +
scale_fill_manual(values = getPalette(colourCount)) + theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Relative abundance Phylum level") + guide_italics
plot.composition.relAbun.phy
if (nrow(metadf) > 30) {
ggsave(paste0(out_dir, "/Others/compositionbarplot_Phylum.pdf"), plot = plot.composition.relAbun.phy, height = 8, width = 28)
} else {
ggsave(paste0(out_dir, "/Others/compositionbarplot_Phylum.pdf"), plot = plot.composition.relAbun.phy,
height = 8, width = 18)
}
Top 5 Phyla are shown below:
pn0 <- plot_taxa_boxplot(ps3.com,
taxonomic.level = "Phylum",
top.otu = 6, VariableA,
title = "Relative abundance Phylum level", color = "Paired")
## The phy_tree slot is empty, easy to make the plot
pn0
ggsave(paste0(out_dir,"/Others/compositionboxplot_Phylum.pdf"), height = 6, width = 12)
colourCount = length(unique(taxic$Family)) #define number of variable colors based on number of Family (change the level accordingly to phylum/class/order)
getPalette = colorRampPalette(brewer.pal(12, col.palette)) # change the palette as well as the number of colors will change according to palette.
lev = "Family"
tax_table(ps3.com)[,lev][is.na(tax_table(ps3.com)[,lev])] <- paste0(tolower(substring(lev, 1, 1)), "__")
ps3.com.fam <- aggregate_taxa(ps3.com, "Family")
ps3.com.fam.rel <- microbiome::transform(ps3.com.fam, "compositional")
plot.composition.relAbun.fam <- plot_composition(ps3.com.fam.rel) + theme(legend.position = "bottom") +
scale_fill_manual(values = getPalette(colourCount)) + theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Relative abundance Family level") + guide_italics
plot.composition.relAbun.fam
if (nrow(metadf) > 30) {
ggsave(paste0(out_dir,"/Others/compositionbarplot_Family.pdf"), plot = plot.composition.relAbun.fam, height = 8, width = 28)
} else {
ggsave(paste0(out_dir,"/Others/compositionbarplot_Family.pdf"), plot = plot.composition.relAbun.fam,
height = 8, width = 18)
}
Top 10 Families are shown below:
pn1 <- plot_taxa_boxplot(ps3.com,
taxonomic.level = "Family",
top.otu = 10, VariableA,
title = "Relative abundance Family level", color = "Paired")
## The phy_tree slot is empty, easy to make the plot
pn1
ggsave(paste0(out_dir,"/Others/compositionboxplot_Family.pdf"), height = 6, width = 16)
The counts are compositionally tranformed and then used for ordinations.
ps3.rel <- microbiome::transform(ps3, "compositional")
bc.pcoa <- phyloseq::ordinate(ps3.rel, method = "PCoA", distance = "bray")
bc.pcoa.plot <- plot_ordination(ps3.rel, bc.pcoa,
type = "split", axes = 1:2,
color = VariableA, shape = VariableB,
label = NULL,
title = "Bray-Curtis distance PCoA",
justDF = FALSE)
bc.pcoa.plot <- bc.pcoa.plot + theme_bw() + geom_point(size = 2)
print(bc.pcoa.plot)
ggsave(paste0(out_dir,"/BetaDiversity/Bray-Curtis distance PCoA.pdf"), plot = bc.pcoa.plot,
height = 6, width = 10)
# Calculate bray curtis distance matrix
ps3_bray <- phyloseq::distance(ps3.rel, method = "bray")
# use meta data from phylogenetic div code chunk.
metadata_table_ps3 <- meta(ps3.rel)
select.meta2 <- metadata_table_ps3[,c(VariableA,VariableB)] #, "Phyogenetic_diversity"]
colnames(select.meta2) <- c("VariableA", "VariableB")
# Adonis test
adonis(ps3_bray ~ VariableA, data = select.meta2)
##
## Call:
## adonis(formula = ps3_bray ~ VariableA, data = select.meta2)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## VariableA 11 5.6279 0.51163 2.7966 0.41147 0.001 ***
## Residuals 44 8.0495 0.18294 0.58853
## Total 55 13.6774 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Homogeneity of dispersion test
beta.bray <- betadisper(ps3_bray, select.meta2$VariableA)
permutest(beta.bray)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 11 1.0569 0.09608 0.8646 999 0.548
## Residuals 44 4.8897 0.11113
The counts are converted to relative abundance and then used for ordinations.
if (!is.na(treefilename)){
wunifrac.pcoa <- ordinate(ps3.rel, method = "PCoA", distance = "wunifrac")
wunifrac.pcoa.plot <- plot_ordination(ps3.rel, wunifrac.pcoa,
type = "split", axes = 1:2,
color = VariableA, shape = VariableB,
label = NULL,
title = "Weighted Unifrac distance PCoA",
justDF = FALSE)
wunifrac.pcoa.plot <- wunifrac.pcoa.plot + theme_bw() + geom_point(size = 2)
print(wunifrac.pcoa.plot)
ggsave(paste0(out_dir,"/BetaDiversity/Weighted Unifrac distance PCoA.pdf"), plot = wunifrac.pcoa.plot,
height = 6, width = 10)
# Calculate bray curtis distance matrix
ps3_wunifrac <- phyloseq::distance(ps3.rel, method = "wunifrac")
# use meta data from phylogenetic div code chunk.
# Adonis test
adonis(ps3_wunifrac ~ VariableA, data = select.meta2)
# Homogeneity of dispersion test
beta.wunifrac <- betadisper(ps3_wunifrac, select.meta2$VariableA)
permutest(beta.wunifrac)
} else {
message("No tree supplied, cannot calculate unifrac distances")
}
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 11 0.024573 0.0022339 2.1048 999 0.09 .
## Residuals 44 0.046700 0.0010614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The counts are converted to relative abundnces and then used for ordinations.
if (!is.na(treefilename)){
unifrac.pcoa <- ordinate(ps3.rel, method = "PCoA", distance = "unifrac")
unifrac.pcoa.plot <- plot_ordination(ps3.rel, unifrac.pcoa,
type = "split", axes = 1:2,
color = VariableA, shape = VariableB,
label = NULL,
title = "Unweighted Unifrac distance PCoA",
justDF = FALSE)
unifrac.pcoa.plot <- unifrac.pcoa.plot + theme_bw() + geom_point(size = 2)
print(unifrac.pcoa.plot)
ggsave(paste0(out_dir,"/BetaDiversity/Unweighted Unifrac distance PCoA.pdf"), plot = unifrac.pcoa.plot,
height = 6, width = 10)
message("Unweighted Unifrac distance gives negative values and standard anova and anoism cannot be used")
} else {
message("No tree supplied, cannot calculate unifrac distances")
}
## Unweighted Unifrac distance gives negative values and standard anova and anoism cannot be used
if (heatmap == TRUE) {
heat.sample <- plot_taxa_heatmap(ps3, subset.top = 50,
VariableA = VariableA,
heatcolors =brewer.pal(9, "Blues"),
transformation = "compositional",
file = "./Others/Heatmap rel abun top 50 OTUs.tiff",
height = 9, width = 10)
print(paste0("heatmap saved in ", out_dir))
# Duplicate for printing in the HTML file
heat.sample.dup <- plot_taxa_heatmap(ps3, subset.top = 50,
VariableA = VariableA,
heatcolors =brewer.pal(9, "Blues"),
transformation = "compositional")
}
## Top 50 OTUs selected
## First converted to compositional
## then top taxa were selected
## [1] "heatmap saved in ."
## Top 50 OTUs selected
## First converted to compositional
## then top taxa were selected
The versions of the R software and Bioconductor packages used for this analysis are listed below. It is important to save them if one wants to re-perform the analysis in the same conditions.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.5.1 viridisLite_0.3.0
## [3] bindrcpp_0.2.2 moments_0.14
## [5] ggpubr_0.1.6 magrittr_1.5
## [7] tibble_1.4.2 RColorBrewer_1.1-2
## [9] DT_0.4 data.table_1.10.4-3
## [11] picante_1.6-2 nlme_3.1-137
## [13] vegan_2.4-4 lattice_0.20-35
## [15] permute_0.9-4 ape_5.1
## [17] microbiomeutilities_0.99.0 microbiome_1.1.10012
## [19] ggplot2_2.2.1 phyloseq_1.23.1
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.36.2 tidyr_0.8.0 jsonlite_1.5
## [4] splines_3.4.1 foreach_1.4.4 shiny_1.0.5
## [7] assertthat_0.2.0 stats4_3.4.1 yaml_2.1.18
## [10] ggrepel_0.7.0 pillar_1.2.1 backports_1.1.2
## [13] glue_1.2.0 digest_0.6.15 XVector_0.16.0
## [16] colorspace_1.3-2 htmltools_0.3.6 httpuv_1.3.6.2
## [19] Matrix_1.2-14 plyr_1.8.4 pkgconfig_2.0.1
## [22] pheatmap_1.0.8 zlibbioc_1.22.0 xtable_1.8-2
## [25] purrr_0.2.4 scales_0.5.0 Rtsne_0.13
## [28] mgcv_1.8-23 IRanges_2.10.5 BiocGenerics_0.22.1
## [31] lazyeval_0.2.1 mime_0.5 survival_2.41-3
## [34] evaluate_0.10.1 MASS_7.3-49 tools_3.4.1
## [37] stringr_1.3.0 S4Vectors_0.14.7 munsell_0.4.3
## [40] cluster_2.0.7-1 Biostrings_2.44.2 ade4_1.7-11
## [43] compiler_3.4.1 rlang_0.2.0.9001 rhdf5_2.20.0
## [46] grid_3.4.1 iterators_1.0.9 biomformat_1.4.0
## [49] htmlwidgets_1.0 crosstalk_1.0.0 igraph_1.2.1
## [52] labeling_0.3 rmarkdown_1.9 gtable_0.2.0
## [55] codetools_0.2-15 multtest_2.32.0 reshape2_1.4.3
## [58] R6_2.2.2 gridExtra_2.3 knitr_1.20
## [61] dplyr_0.7.4 bindr_0.1.1 rprojroot_1.3-2
## [64] stringi_1.1.7 parallel_3.4.1 Rcpp_0.12.16
## [67] tidyselect_0.2.4