Investigating differential abundance methods in microbiome data: a benchmark study

1. University of Padova

This is the repository containing the results shown in Cappellato M., Baruzzo G., Di Camillo B. "Investigating differential abundance methods in microbiome data: a benchmark study." (2021).

The GitLab repository here hosts the R package metaBenchDA, which was used to perform the benchmark study cited above. The package contains the data and the code to run the simulation framework, to run all the differential abundance analysis methods, and to assess the methods' performance. The GitLab repository also contains the Docker image metabenchda:2.0.0, which bundles the R package metaBenchDA and the code to reproduce the results shown in the paper.

Alternatively, to reproduce the results in the main manuscript, follow the instructions here.

The results folder here on Zenodo contains all the files needed to run the code at any step of the framework. The results folder contains:

lib_unbalanced: results in the scenario with DA features, setting both the intensity threshold \(\theta = 1 \cdot f_{2}\) and an unbalanced sequencing depth between conditions.
metagenomeSeq_res: results of running the two models implemented in metagenomeSeq, namely the zero-inflated Gaussian (ZIG) model and the zero-inflated Log-Gaussian (ZILG) mixture model, in the scenario with DA features and the intensity threshold set to \(\theta = 1 \cdot f_{2}\).
new_datasets: results in the scenario with DA features and the intensity threshold set to \(\theta = 1 \cdot f_{2}\), for the AnimalGut and Soil datasets.
NOth: results in the scenario with DA features and without setting the intensity threshold (\(\theta = 1 \cdot 0\)).
NULL: results in the scenario without DA features.
WITHth: results in the scenario with DA features and the intensity threshold set to \(\theta = 1 \cdot f_{2}\).
WITHth_GMPR: results in the scenario with DA features and the intensity threshold set to \(\theta = 1 \cdot f_{2}\). The methods are tested with GMPR normalisation.
WITHth_HALFvar: results in the scenario with DA features and the variability parameter halved, \(\varphi = \dfrac{\varphi}{2}\).

Inside these folders, the results that count structural zeros as true negatives (TN) are in the reassign folder, while the results that take the output of the methods as is are in the methodoutput folder.

Inside each folder there are:

simulation: all the simulated datasets, i.e. all results from SECTION 1: GENERATE DATASET in the script file. In metagenomeSeq_res and WITHth_GMPR this folder is absent, because the methods are run on the WITHth simulation scenario.
- SSXX_PPYY_FCZ1-Z2 where XX is the sample size, YY is the percentage of DA features, Z1-Z2 is the Fold Change limit. In the NULL configuration PPYY is not present.
  - WW_SSXX_PPYY_FCZ1-Z2 where WW is the name of the dataset from which the simulation parameters are estimated.
    - NAMECONF_WW_SSXX_PPYY_FCZ1-Z2_simN.RData where NAMECONF is the name of the configuration (e.g. NULL, NOth ...), N is the number of the simulation.
methods: the output of each method involved in the comparison, i.e. all results from SECTION 2: RUN DA METHODS in the script file.
- SSXX_PPYY_FCZ1-Z2 where XX is the sample size, YY is the percentage of DA features, Z1-Z2 is the Fold Change limit. In the NULL configuration PPYY is not present.
  - WW_SSXX_PPYY_FCZ1-Z2 where WW is the name of the dataset from which the simulation parameters are estimated.
    - METHODNAME: one folder for each method. In the WITHth_GMPR folder each method is labelled with _gmpr.
      - NAMECONF_WW_SSXX_PPYY_FCZ1-Z2_simN_METHODNAME.RData where NAMECONF is the name of the configuration (e.g. NULL, NOth ...), N is the number of the simulation.
metrics: the performance evaluation, i.e. all results from SECTION 3: RUN METRICS in the script file.
- methodoutput/reassign where methodoutput means that the output of the methods has been considered, while reassign means that the structural zeros are counted as TN
  - SSXX_PPYY_FCZ1-Z2 where XX is the sample size, YY is the percentage of DA features, Z1-Z2 is the Fold Change limit. In the NULL configuration PPYY is not present.
    - WW_SSXX_PPYY_FCZ1-Z2 where WW is the name of the dataset from which the simulation parameters are estimated.
      - NAMECONF_methodoutput/reassign_WW_SSXX_PPYY_FCZ1-Z2_NAMEMETRIC.RData where NAMECONF is the name of the configuration (e.g. NULL, NOth ...), NAMEMETRIC is the name of the metric.
figure: all the figures in the manuscript, generated by the script file Figures.R.
- methodoutput/reassign where methodoutput means that the output of the methods has been considered, while reassign means that the structural zeros are counted as TN
  - NAMECONF_methodoutput/reassign_NAMEFIGURE.jpeg where NAMECONF is the name of the configuration (e.g. NULL, NOth ...) and NAMEFIGURE is the name of the figure (e.g. RECALL_boxplot shows the recall values of each method in each configuration of SS, PP and dataset).
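
The naming scheme above can be turned into a small path builder. The following Python sketch is only an illustration and is not part of metaBenchDA: the function names are hypothetical, the example values (sample size 20, 10% DA features, fold change range 2-5, simulation 1, method label edgeR) are made up, and the exact formatting of XX, YY and Z1-Z2 (for instance zero padding) is an assumption that should be checked against the unzipped archive. Paths are relative to one configuration folder.

```python
from pathlib import PurePosixPath


def scenario_dir(sample_size, fc_range, pct_da=None):
    """Build SSXX_PPYY_FCZ1-Z2; the PPYY part is dropped when pct_da is None,
    as described for the NULL configuration."""
    parts = [f"SS{sample_size}"]
    if pct_da is not None:
        parts.append(f"PP{pct_da}")
    parts.append(f"FC{fc_range}")
    return "_".join(parts)


def simulation_path(conf, dataset, sample_size, fc_range, sim, pct_da=None):
    """Relative path of one simulated dataset inside the simulation folder."""
    scen = scenario_dir(sample_size, fc_range, pct_da)
    fname = f"{conf}_{dataset}_{scen}_sim{sim}.RData"
    return PurePosixPath("simulation", scen, f"{dataset}_{scen}", fname)


def method_path(conf, dataset, sample_size, fc_range, sim, method, pct_da=None):
    """Relative path of one method's output inside the methods folder."""
    scen = scenario_dir(sample_size, fc_range, pct_da)
    fname = f"{conf}_{dataset}_{scen}_sim{sim}_{method}.RData"
    return PurePosixPath("methods", scen, f"{dataset}_{scen}", method, fname)


# Hypothetical example values; only the AnimalGut dataset name and the NOth
# configuration name come from the description above.
example_sim = simulation_path("NOth", "AnimalGut", 20, "2-5", 1, pct_da=10)
example_method = method_path("NOth", "AnimalGut", 20, "2-5", 1, "edgeR", pct_da=10)
```

The files themselves are R workspace files, so they would normally be opened in R with load() once the path is known.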

Files

Name	Size
results.zip md5:f912cdf33e625e350fda7c835dd496a7	3.0 GB
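
Since the archive is large, it may be worth checking the download against the MD5 checksum listed for results.zip. The sketch below is one possible way to do this in Python with the standard library; the function names and the local file path are hypothetical, and only the checksum value is taken from this record.

```python
import hashlib

# Checksum listed for results.zip in this record.
EXPECTED_MD5 = "f912cdf33e625e350fda7c835dd496a7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Calling verify_download("results.zip") on the downloaded file (the path is an assumption about where it was saved) returns False if the archive is corrupted or incomplete.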


Resource type: Other

Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 30, 2022
Modified: May 30, 2022
