1. Associated works

This dataset is associated with the manuscript:

    Accounting for 16S rRNA copy number prediction uncertainty and its implications in bacterial diversity analyses

It can also be used as example data for the following GitHub repository:

    https://github.com/wu-lab-uva/RasperGade16S

2. Folders and files

Once decompressed, there should be 8 folders:

2.1 Reference/

The Reference/ folder contains the data used to generate a reference for predicting 16S rRNA gene copy number (GCN).
It contains the following files:

reference.fna
    the raw reference 16S rRNA gene sequences
reference.trimmed.afa
    the trimmed alignment of the reference 16S rRNA genes
reference.tre
    the inferred phylogeny in Newick format
taxids.RDS
    a binary file storing the NCBI taxid (at species level) of each reference genome
16S_GCN.txt
    the 16S GCN annotated in each reference genome
binary_partition.RDS
    a binary file storing the results of partitioning the phylogeny
homogeneous_data.RDS
    a binary file for easy loading of the phylogeny and the 16S GCN in R
homogeneous_model.RDS
    a binary file containing the fitted BM and PE models
lineage_table.RDS
    a binary file containing the full lineage information for each species taxid
prepared_reference.RDS
    a binary file containing the final reference dataset that can be readily used to predict 16S GCN
rescaled_data_model.RDS
    a binary file containing the rescaled phylogeny
RAxML_bestTree.barnnap.tre
    the phylogeny, in Newick format, built using HMM profiles from Barrnap

2.2 CV/

The CV/ folder contains the binary files from the cross-validation (CV) process.
It contains the following files:

CV.NSTD.RDS
    the NSTD for each test tip in CV of empirical GCN
GCN.BM.CV.RDS
    predicted GCN in CV with BM for empirical GCN
GCN.PE.CV.RDS
    predicted GCN in CV with PE for empirical GCN
GCN.MP_EMP.CV.RDS
    predicted GCN in CV with PICRUSt2 for empirical GCN
GCN.testset.RDS
    the tips in each test set in CV of empirical GCN

2.3 Sim/

The Sim/ folder contains the binary files from the evaluation of 16S GCN correction on microbial composition analyses.
It contains the following files:

Sim_2env_100otu_2f_turnover_PERMANOVA.RDS
    the PERMANOVA result for 100 signature OTUs per environment using Bray-Curtis distance
Sim_2env_100otu_2f_turnover_Aitchison_PERMANOVA.RDS
    the PERMANOVA result for 100 signature OTUs per environment using Aitchison distance
Sim_2env_100otu_2f_turnover_UniFrac_PERMANOVA.RDS
    the PERMANOVA result for 100 signature OTUs per environment using weighted UniFrac distance
Sim_2env_100otu_2f_turnover_beta.RDS
    the raw beta diversity result for 100 signature OTUs per environment using Bray-Curtis distance
Sim_2env_100otu_2f_turnover_Aitchison_beta.RDS
    the raw beta diversity result for 100 signature OTUs per environment using Aitchison distance
Sim_2env_100otu_2f_turnover_UniFrac_beta.RDS
    the raw beta diversity result for 100 signature OTUs per environment using weighted UniFrac distance
Sim_2env_100otu_2f_turnover_rf_test.RDS
    the random forest result for 100 signature OTUs per environment

Sim_2env_20otu_2f_turnover_PERMANOVA.RDS
    the PERMANOVA result for 20 signature OTUs per environment using Bray-Curtis distance
Sim_2env_20otu_2f_turnover_Aitchison_PERMANOVA.RDS
    the PERMANOVA result for 20 signature OTUs per environment using Aitchison distance
Sim_2env_20otu_2f_turnover_UniFrac_PERMANOVA.RDS
    the PERMANOVA result for 20 signature OTUs per environment using weighted UniFrac distance
Sim_2env_20otu_2f_turnover_beta.RDS
    the raw beta diversity result for 20 signature OTUs per environment using Bray-Curtis distance
Sim_2env_20otu_2f_turnover_Aitchison_beta.RDS
    the raw beta diversity result for 20 signature OTUs per environment using Aitchison distance
Sim_2env_20otu_2f_turnover_UniFrac_beta.RDS
    the raw beta diversity result for 20 signature OTUs per environment using weighted UniFrac distance
Sim_2env_20otu_2f_turnover_rf_test.RDS
    the random forest result for 20 signature OTUs per environment

Sim_2env_5otu_2f_turnover_PERMANOVA.RDS
    the PERMANOVA result for 5 signature OTUs per environment using Bray-Curtis distance
Sim_2env_5otu_2f_turnover_Aitchison_PERMANOVA.RDS
    the PERMANOVA result for 5 signature OTUs per environment using Aitchison distance
Sim_2env_5otu_2f_turnover_UniFrac_PERMANOVA.RDS
    the PERMANOVA result for 5 signature OTUs per environment using weighted UniFrac distance
Sim_2env_5otu_2f_turnover_beta.RDS
    the raw beta diversity result for 5 signature OTUs per environment using Bray-Curtis distance
Sim_2env_5otu_2f_turnover_Aitchison_beta.RDS
    the raw beta diversity result for 5 signature OTUs per environment using Aitchison distance
Sim_2env_5otu_2f_turnover_UniFrac_beta.RDS
    the raw beta diversity result for 5 signature OTUs per environment using weighted UniFrac distance
Sim_2env_5otu_2f_turnover_rf_test.RDS
    the random forest result for 5 signature OTUs per environment

Sim_2env_turnover_diff.RDS
    the compiled abundance difference result
Sim_GCN_from_CV.RDS
    the simulated GCN prediction for the analyses
Sim_GCN_from_trait.RDS
    the simulated GCN for the analyses
Sim_S2000_RA.RDS
    the relative abundance analysis result

sim_2env_100otu_2f_turnover_data.RDS
    the simulated communities for beta-diversity analyses with 100 signature OTUs per environment
sim_2env_100otu_2f_turnover_meta.RDS
    the metadata for communities in beta-diversity analyses with 100 signature OTUs per environment
sim_2env_20otu_2f_turnover_data.RDS
    the simulated communities for beta-diversity analyses with 20 signature OTUs per environment
sim_2env_20otu_2f_turnover_meta.RDS
    the metadata for communities in beta-diversity analyses with 20 signature OTUs per environment
sim_2env_5otu_2f_turnover_data.RDS
    the simulated communities for beta-diversity analyses with 5 signature OTUs per environment
sim_2env_5otu_2f_turnover_meta.RDS
    the metadata for communities in beta-diversity analyses with 5 signature OTUs per environment
sim_S2000_RA_data.RDS
    the simulated communities for relative abundance analyses
Sim_tree_from_trait.RDS
    the phylogeny for calculating UniFrac distance in the simulation study

2.4 HMP/

The HMP/ folder contains the results on the HMP1 dataset.
It contains the following files:

HMP.PERMANOVA.RDS
    the PERMANOVA results (binary file)
HMP_v13.pred.GCN.RDS.RDS
    the predicted GCN for HMP (binary file)
HMP.beta.RDS
    the raw beta diversity results (binary file)
HMP_v13.pred.GCN.RDS.locations.RDS
    the insertion locations of 16S rRNA genes (binary file)
HMP.diff.RDS
    the abundance difference results (binary file)
HMP_v13_HQ.RAD.RDS
    the relative abundance results (binary file)
HMP.rf.RDS
    the random forest results (binary file)
HMP_v13_HQ.abundance.RDS
    the gene abundance in the HMP dataset (binary file)
HMP_epa_result.jplace
    the raw output from EPA in .jplace format
HMP_v13_HQ.meta.RDS
    the metadata for the HMP dataset (binary file)
HMP_v13.NSTD.RDS
    the NSTD for the HMP dataset

2.5 SILVA/

The SILVA/ folder contains the results on the SILVA dataset.
It contains the following files:

SILVA_predicted_GCN.txt
    the table of predicted 16S rRNA GCN and the associated confidence
SILVA.NSTD.RDS
    the NSTD for SILVA (binary file)
SILVA132NR99.sorted.id.3level.txt
    the taxonomic information for SILVA (text file)
SILVA.pred.GCN.RDS.RDS
    the predicted GCN for SILVA (binary file)
SILVA_epa_result.jplace
    the insertion locations of 16S rRNA genes in .jplace format
SILVA.pred.GCN.RDS.locations.RDS
    the insertion locations of 16S rRNA genes (binary file)

2.6 EBI/

The EBI/ folder contains the results on EBI (MGnify).
It contains one file:

EBI.adj.NSTI.txt
    the accession number, biome type and adjusted NSTI for 113842 communities from EBI (MGnify)

2.7 EMP/

The EMP/ folder contains the results on the EMP dataset.
All files are binary files:

EMP.Animal.PERMANOVA.RDS
    the PERMANOVA results within animal-associated microbiomes
EMP.Animal.diff.RDS
    the abundance difference results within animal-associated microbiomes
EMP.Animal.rf.RDS
    the random forest results within animal-associated microbiomes
EMP.Saline.PERMANOVA.RDS
    the PERMANOVA results within saline microbiomes
EMP.Saline.diff.RDS
    the abundance difference results within saline microbiomes
EMP.Saline.rf.RDS
    the random forest results within saline microbiomes
EMP.NonSaline.diff.RDS
    the abundance difference results within non-saline microbiomes
EMP.NonSaline.PERMANOVA.RDS
    the PERMANOVA results within non-saline microbiomes
EMP.NonSaline.rf.RDS
    the random forest results within non-saline microbiomes
EMP.Plant.diff.RDS
    the abundance difference results within plant-associated microbiomes
EMP.Plant.PERMANOVA.RDS
    the PERMANOVA results within plant-associated microbiomes
EMP.Plant.rf.RDS
    the random forest results within plant-associated microbiomes
EMP_Deblur_abundances.RDS
    the gene abundance in the EMP dataset (Deblur version)
EMP_Deblur_sub2k.RAD.RDS
    the relative abundance results for the EMP dataset

2.8 Scripts/

The Scripts/ folder contains the R scripts for reproducing key statistics and figures.
You can source the scripts in RStudio with the working directory set to Scripts/.
It contains the following files:

Figure_1.R
    the script for reproducing Figure 1
Figure_2_GCN_Classification.R
    the script for reproducing Figure 2
Figure_3_Abundance.R
    the script for reproducing Figure 3
Figure_4_S2_Beta_diversity.R
    the script for reproducing Figure 4 and Figure S2
Figure_5_Table_S5_EBI_NSTI_Biomes.R
    the script for reproducing Figure 5 and Table S5
Figure_S1_Reference_insertion.R
    the script for reproducing Figure S1
Summary_HMP_EMP_beta.R
    the script for summarizing beta diversity analyses in HMP and EMP
Table_1_S3_Model_selection_PPIC.R
    the script for reproducing Table 1 and Table S3
Table_S1.R
    the script for reproducing Table S1
Table_S2_PE_HMM_comparison.R
    the script for reproducing Table S2
Table_S4_Beta_diversity.R
    the script for reproducing Table S4
Table_S6_SILVA_summary.R
    the script for reproducing Table S6
added_utils.R
    the script containing useful functions not yet included in RasperGade16S
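
Section 2 states that the decompressed archive should contain 8 folders (Reference/, CV/, Sim/, HMP/, SILVA/, EBI/, EMP/ and Scripts/). The following is a minimal sketch, not part of the dataset or of RasperGade16S, for checking that all of them are present before sourcing the scripts. It is written in Python with the standard library only; the names `EXPECTED_FOLDERS` and `missing_folders` are hypothetical and chosen for illustration, and the sketch assumes the folders sit directly under the directory you pass in.

```python
from pathlib import Path

# Folder names as listed in sections 2.1 to 2.8 of this README.
# (Illustrative constant, not something shipped with the dataset.)
EXPECTED_FOLDERS = [
    "Reference", "CV", "Sim", "HMP", "SILVA", "EBI", "EMP", "Scripts",
]


def missing_folders(root):
    """Return the expected folders that are absent directly under `root`.

    Hypothetical helper: it only checks that each name exists as a
    directory and does not inspect the files inside.
    """
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
```

For example, calling `missing_folders(".")` from the directory where the archive was decompressed should return an empty list if all 8 folders are in place; any names it returns point to folders that are missing or were extracted elsewhere.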