Published April 10, 2020 | Version v1
Dataset Open

Bioinformatic pipeline: Vast differences in strain-level diversity in the gut microbiota of two closely related honey bee species

  • 1. Lausanne University
  • 2. National Institute of Advanced Industrial Science and Technology

Description

This dataset contains the full bioinformatic pipeline used to analyze metagenomic samples in the study "Vast differences in strain-level diversity in the gut microbiota of two closely related honey bee species" (Ellegaard et al. 2020, Current Biology).

New metagenomic samples were generated for the study. The raw data are available from the NCBI Sequence Read Archive under accession PRJNA59809.

This submission consists of 9 tarballs, described below. Download and unpack each one to view its contents (tar -zxvf filename.tar.gz). Within each tarball, every directory contains a README.txt file describing the contents of that directory. Due to size constraints, some intermediate files have been omitted, and some workflows are demonstrated on a subset of the data. However, the full analysis can be reproduced from the raw data using the provided scripts.
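As a minimal sketch of the download-and-unpack step, the function below checks a tarball against an md5 checksum before extracting it. The function name verify_and_unpack and the destination argument are illustrative and not part of the pipeline. It assumes the GNU md5sum tool is installed (on macOS a different command would be needed).

```shell
# Verify a downloaded tarball against an md5 checksum, then unpack it.
# Usage: verify_and_unpack <tarball> <expected_md5> [destination_dir]
# NOTE: hypothetical helper, not one of the scripts shipped in the tarballs.
verify_and_unpack() {
    tarball="$1"
    expected="${2#md5:}"   # accept the checksum with or without an "md5:" prefix
    dest="${3:-.}"

    # md5sum prints "<hash>  <filename>"; keep only the hash.
    actual="$(md5sum "$tarball" 2>/dev/null | awk '{print $1}')"

    if [ -z "$actual" ] || [ "$actual" != "$expected" ]; then
        echo "Checksum mismatch or unreadable file: $tarball" >&2
        return 1
    fi

    # Extract only after the checksum has been confirmed.
    mkdir -p "$dest"
    tar -zxf "$tarball" -C "$dest"
}
```

For example, verify_and_unpack databases.tar.gz "md5:<checksum>" unpacked/ would extract the archive into unpacked/ only if the checksum matches, where <checksum> is the value listed for that file under Files.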

All scripts are included within the directories where they were applied. Perl-scripts contain documentation, which can be viewed by typing: "perl script_name.pl -h". For R scripts, the usage is indicated as a comment in the top lines of each script. Note that many of the scripts require specific input-files to be present in the run-directory. Their usage is demonstrated within the workflow directories in bash-scripts (*.sh). Commands used for generating plots and some statistics are given within workflow directories in text-files "R.commands" when applicable.
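To get oriented in an unpacked tarball, it can help to list the documentation and workflow files first. The sketch below does this with find. The function name list_pipeline_docs is illustrative. The file names it searches for (README.txt, *.sh, R.commands) are the ones mentioned in the description above.

```shell
# List README files, bash workflow scripts, and R command files below a directory.
# Usage: list_pipeline_docs [directory]
# NOTE: hypothetical helper, not one of the scripts shipped in the tarballs.
list_pipeline_docs() {
    root="${1:-.}"
    # LC_ALL=C gives a stable, byte-wise sort order regardless of locale.
    find "$root" -type f \
        \( -name 'README.txt' -o -name '*.sh' -o -name 'R.commands' \) \
        | LC_ALL=C sort
}
```

After reading the relevant README.txt, the help message of an individual Perl script can be viewed as described above, with perl script_name.pl -h run from the directory that contains the script.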

Aside from custom code, the pipeline also uses various open-source software packages, which are detailed in the file "software_dependencies.txt". Note that while many of the scripts run quickly on any computer, some steps of the pipeline are computationally demanding and require significant computing time and storage space. When a script is known to be time-consuming, this is indicated in its help message.
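Before running a workflow, one may want to confirm that the required programs are on the PATH. The following is a minimal sketch. The function name check_tools is illustrative, and the tool names passed to it are supplied by the user (for instance after reading software_dependencies.txt). The sketch does not parse that file, since its format is not described here.

```shell
# Report which of the named programs are available on the PATH.
# Usage: check_tools <tool> [<tool> ...]
# Returns 0 if all tools were found, 1 if at least one is missing.
# NOTE: hypothetical helper, not one of the scripts shipped in the tarballs.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools perl Rscript tar covers the interpreters and the archive tool mentioned in this description. Any further packages would need to be added by hand.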

Description of tarballs.

raw_data_processing.tar.gz: Describes the quality-control and trimming of raw data, and includes info on the sequencing run.

databases.tar.gz: Contains all databases used for analysis, in addition to relevant meta-data.

mapping_stats.tar.gz: Contains a file with the number of reads mapped to the honey bee gut microbiota database and the host genomes, for each sample. Bash-scripts are provided, detailing how the mapping was done and quantified.

orthologs_phylogenies.tar.gz: Contains the pipeline for inferring orthologous gene-families and core genome phylogenies, as well as scripts for filtering of single-copy core gene families.

assemblies.tar.gz: Contains the final de novo metagenome assembly files (contig fasta-files), generated for both complete and rarefied read subsets. Bash-scripts detailing the assembly commands are also provided.

SDP_validation.tar.gz: Contains the pipeline for metagenomic validation of candidate SDPs. Final output-files, containing the percentage identity of recruited metagenomic ORFs to database core genes, are provided for each candidate SDP. Additionally, a small example dataset is provided, where the intermediate result-files can be viewed.

community_profiling.tar.gz: Contains the pipeline for community profiling, i.e. the quantification of individual community members (SDPs) across samples. Final output files are provided, including mapped read coverage on core gene families and corresponding plots. A small bam-file (containing data from a single subset sample), is also provided, in order to demonstrate the pipeline, together with all scripts used.

snv_profiling.tar.gz: Contains the pipeline used for SNV profiling, including filtering and analysis. Final filtered vcf-files are provided for each SDP. Analytical output files are also provided, including data on shared SNV fractions, distance matrices, and cumulative curves.

metagenomic_ORF_analyses.tar.gz: Contains the pipeline for analysis of metagenomic ORFs. This includes prediction of ORFs, clustering, annotation and functional characterization. ORF sequences, annotation files, and cluster-files are provided.

Notes

The study was supported by the Human Frontier Science Program (HFSP) Young Investigator grant RGY0077/2016, the European Research Council ERC-StG 'MicroBeeOme' (714804), and the Swiss National Science Foundation (SNSF) project grant 31003A_179487.

Files

software_dependencies.txt

Files (8.9 GB)

MD5 checksum                              Size
md5:2bb292263fc1576c2778061952b7fb53      2.6 GB
md5:66328e6806a7dd0a2b5ebcd057c7cf3c      381.8 MB
md5:8236dcd9cd69084d19c9518cb6602f0b      2.2 GB
md5:15f0b2eec51973235f9fc6212f8cdf6f      5.0 kB
md5:00dad44494ab7e1a011869512e4d4d4a      316.1 MB
md5:dc44efdbf96ae6f0d60ce540a277bef2      772.7 MB
md5:56561e3902fd2b4944a0f19c45baa03e      2.5 kB
md5:c48cd7fbe41ceb9fd6d58bd1fb82e1a8      12.9 MB
md5:d7d01c864b1a994e7e3f0d243f9fef0b      2.6 GB
md5:20f250714796903257a157973808f656      1.1 kB

Additional details

Funding

MicroBeeOme – Evolution of the honey bee gut microbiome through bacterial diversification 714804
European Commission
Molecular crosstalk underlying symbiotic interactions in the honey bee gut microbiota 31003A_179487
Swiss National Science Foundation