The honey bee gut microbiota genomic database
Description
This data repository contains the latest version of the "honey bee gut microbiota genomic database" and two example data-sets that can be used to run the following pipelines:
- Community_profiling (Github)
- Species_validation (Github)
For previous publications using these pipelines, see 10.1038/s41467-019-08303-0, 10.1016/j.cub.2020.04.070
The genomic database contains data from a total of 198 bacterial genomes, as detailed in the database metafile (see the file descriptions below). It has been tested on the Western and Eastern honey bee (Apis mellifera, Apis cerana), for which it has been shown to recruit about 90% of all reads in most metagenomic samples (excluding host-derived reads). The database also contains genomes derived from other bee species, such as bumble bees, but it has not yet been tested with metagenomic data from these species. Most species in the database are represented by multiple genomes, but no two genomes in the database share more than 98.5% gANI (genomic average nucleotide identity). As a result, several published genomes isolated from social bees are not included.
FILE DESCRIPTIONS
genome_db_metafile_210402.txt
Plain-text file with identifiers for the genomes in the database.
- Tab1 contains locus-tags (derived from the gene-ids of the annotation files), which are used as main identifiers for the genomes in all database files.
- Tab2 contains the genome phylotype-affiliation (> 97% 16S rRNA identity).
- Tab3 contains the genome SDP-affiliation ("Sequence-discrete populations"), as determined with genomic and metagenomic data (largely corresponding to the currently named species of the honey bee gut microbiota).
- Tab4 indicates whether the genome was chosen as reference for plotting core gene-family coverage (Community profiling pipeline).
- Tab5 contains accession numbers of genomes in public repositories (Genbank Assembly accession/IMG accession).
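As a rough illustration of how the metafile could be read, the sketch below assumes that "Tab1" to "Tab5" denote five tab-separated columns in the order listed above. The field names, the handling of comment lines, and the example rows are all hypothetical and not taken from the actual file; check the file itself (for instance, whether it has a header line) before relying on this.

```python
import csv
import io
from typing import NamedTuple


class GenomeRecord(NamedTuple):
    # Field names are illustrative; the column order is assumed from the description above.
    locus_tag: str
    phylotype: str
    sdp: str
    is_reference: str
    accession: str


def read_metafile(handle):
    """Read tab-separated rows with at least five columns into GenomeRecord tuples."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or short lines, and lines starting with '#' (assumed comment marker).
        if len(row) < 5 or row[0].startswith("#"):
            continue
        records.append(GenomeRecord(*(field.strip() for field in row[:5])))
    return records


def group_by_sdp(records):
    """Map each SDP label to the list of locus-tags assigned to it."""
    groups = {}
    for rec in records:
        groups.setdefault(rec.sdp, []).append(rec.locus_tag)
    return groups


# Made-up rows for demonstration only; real identifiers will differ.
example = io.StringIO(
    "TAG_A\tphylo_1\tsdp_1\tRef\tACC_0001\n"
    "TAG_B\tphylo_1\tsdp_1\tNA\tACC_0002\n"
    "TAG_C\tphylo_2\tsdp_2\tRef\tACC_0003\n"
)
records = read_metafile(example)
groups = group_by_sdp(records)
print(groups)
```

To use it on the real file, pass an open file handle for genome_db_metafile_210402.txt instead of the in-memory example.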
genome_db_210402.tar.gz
Contains the genomic database, with all files required for running the "Community profiling" pipeline:
- "genome_db_210402": fasta-file with genome sequences of bacteria included in the database. For draft genomes (the majority), the contigs have been concatenated into a single contig per genome, to facilitate downstream processing with bioinformatic pipelines.
- "genome_db_metafile_210402.txt": meta-data for genomes, see the detailed description above
- "faa_files": directory containing the amino-acid sequences of genes for all genomes
- "ffn_files": directory containing the nucleotide sequences of genes for all genomes
- "bed_files": directory containing bed-files, specifying the location of genes on the concatenated contigs.
- "gff_files": directory containing gff-files with annotations of genes for all genomes
- "Orthofinder": directory containing files with filtered single-copy orthologous gene-families (estimated with "Orthofinder"), for quantifying the abundance of community members based on core gene family coverage.
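Because the contigs of draft genomes are concatenated, the bed-files are what tie each gene back to a position on the concatenated sequence. The sketch below shows how a gene sequence might be sliced out using such coordinates. It assumes the standard BED convention (0-based start, end-exclusive) with the gene identifier in the fourth column, which has not been verified against the files in this archive. Strand is ignored, and the sequence and identifiers are invented.

```python
def parse_bed_line(line):
    """Split one BED line into (sequence name, start, end, feature name)."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    name = rest[0] if rest else ""
    return chrom, int(start), int(end), name


def extract_gene(genome_seq, start, end):
    """Return the subsequence for a 0-based, end-exclusive BED interval."""
    return genome_seq[start:end]


# Toy concatenated genome and BED records (hypothetical values).
genome = "AAAACCCCGGGGTTTT"
bed_text = "TAG_A\t4\t8\tTAG_A_00001\nTAG_A\t8\t12\tTAG_A_00002\n"

genes = {}
for line in bed_text.splitlines():
    chrom, start, end, name = parse_bed_line(line)
    genes[name] = extract_gene(genome, start, end)

print(genes)
```

For genes on the reverse strand, the extracted slice would additionally need to be reverse-complemented.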
species_validation.tar.gz
Example data-set for running the "Species_validation" pipeline. Contains nucleotide sequences of ORFs (open reading-frames) predicted on two assembled metagenomes derived from the gut microbiota of Apis mellifera (ORFs denoted "AmAi03" in the fasta-headers) and Apis cerana (denoted "AcCh03" in the fasta-headers). The samples were previously analyzed in https://doi.org/10.1016/j.cub.2020.04.070 as part of a much larger data-set.
Additionally, it contains amino-acid and nucleotide sequences of genomes included in the honey bee gut microbiota genomic database, which are required for generating the core gene alignments used in the validation.
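Since the two metagenomes are distinguished by the tags "AmAi03" and "AcCh03" in the fasta-headers, the ORFs can be separated by sample with a simple header check. The following is a minimal sketch that assumes the tag appears somewhere in each header line. The exact header layout is not specified here, and the example sequences and ORF names are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_sample(handle, sample_tags=("AmAi03", "AcCh03")):
    """Group FASTA records by the first sample tag found in their header."""
    by_sample = {tag: [] for tag in sample_tags}
    for header, seq in read_fasta(handle):
        for tag in sample_tags:
            if tag in header:
                by_sample[tag].append((header, seq))
                break
    return by_sample


# Invented records for demonstration; real headers may be structured differently.
example = io.StringIO(
    ">AmAi03_orf_1\nATGAAA\nTAA\n"
    ">AcCh03_orf_1\nATGCCCTAG\n"
    ">AmAi03_orf_2\nATGTGA\n"
)
orfs = split_by_sample(example)
print({tag: len(records) for tag, records in orfs.items()})
```

Records whose header contains neither tag are silently dropped in this sketch, which may or may not be the desired behaviour.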
metagenomic_reads.fastq.tar.gz
Example data-set for running the "Community profiling" pipeline. Contains metagenomic reads from two samples derived from the gut microbiota of Apis mellifera, previously analyzed in https://doi.org/10.1016/j.cub.2020.04.070 as part of a much larger data-set. To further reduce the file-size, the data were subset to reads mapping to the phylotype Lactobacillus Firm5. The complete data are publicly available on NCBI: PRJNA598094.
Additional details
Related works
- Cites
- Journal article: 10.1038/s41467-019-08303-0 (DOI)
- Journal article: 10.1016/j.cub.2020.04.070 (DOI)