Zenodo

The large volume of publicly available honey bee (Apis mellifera) drone sequence data provides a unique opportunity to develop a HapMap for the species. The aim of this community is to develop and grow such a resource.

We have processed Illumina whole-genome sequence data for 1407 drones, comprising a mix of related and unrelated samples. These samples represent 19 countries and include various hybrid strains in addition to 8 subspecies: A. m. capensis, A. m. scutellata and A. m. unicolor from the African (A) lineage; A. m. carnica and A. m. ligustica from the central and southern European (C) lineage; A. m. iberiensis and A. m. mellifera from the northern and western European (M) lineage; and A. m. caucasia from the eastern European (O) lineage.

This resource includes project-specific datasets containing per-sample gVCF files, project-specific unfiltered VCF files based on joint calling across all AmelHap samples, and a quality-filtered VCF file.

Downloading data from a Zenodo DOI

We recommend using a batch download tool (e.g. download_zenodo) for bulk download of files from a single DOI, as some datasets comprise hundreds of samples. Note that some Project Accession numbers are split over multiple Zenodo datasets due to archiving restrictions. In these instances the dataset will include the 'partial' tag, and the dataset title will indicate the part, for instance 'part 1'.

Generating a gVCF file

We hope to integrate community-generated gVCF files uploaded to Zenodo into future releases of AmelHap. Before processing any publicly available sequence data, please check that the Run Accession has not already been processed by referring to the metadata.
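The check described above can be scripted. The following is a minimal sketch only: it assumes the metadata has been saved as a comma-separated file with a header row and the Run Accession in the first column, and that candidate accessions are listed one per line. The file names, the column position and the function name filter_new_runs are all hypothetical and should be adapted to the actual metadata layout.

```shell
# Print the candidate Run Accessions that do NOT yet appear in the metadata.
# Assumptions (hypothetical): metadata is a CSV with a header row and the
# Run Accession in column 1; candidates has one accession per line.
filter_new_runs() {
    candidates="$1"
    metadata="$2"
    # First file (metadata): remember every accession, skipping the header.
    # Second file (candidates): print only lines not seen in the metadata.
    awk -F',' 'NR == FNR { if (FNR > 1) seen[$1] = 1; next } !($1 in seen)' \
        "$metadata" "$candidates"
}

# Example usage (file names are placeholders):
#   filter_new_runs candidates.txt metadata.csv > new_runs.txt
```

Only the accessions printed by the function would then need to be downloaded and processed.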

To generate drone gVCF files we recommend using Nextflow (version 3). Sequence data can be downloaded from the SRA/ENA with nf-core/fetchngs. We recommend providing a list of Run Accession numbers as the input. Once the data have been downloaded, the samplesheet can be parsed to generate an input sheet for nf-core/sarek, e.g.:

awk '{FS=","}{print $1,"NA",0,$5,1,$2,$3};' ./results/samplesheet/samplesheet.csv \
| sed 's/"//g' | sed 's/ /,/g' | sed '1d' | \
sed '1ipatient,sex,status,sample,lane,fastq_1,fastq_2' > ./samples.csv
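To make the transformation easier to follow, here is a sketch of the same field selection written as a single awk call wrapped in a function. It is intended to give the same result as the command above for file paths without spaces, and it avoids the GNU-specific sed '1i' form. The function name to_sarek_sheet is hypothetical. The sketch takes the field positions (1, 5, 2, 3) from the command above and assumes the samplesheet has a header row; which column sits in position 5 depends on the samplesheet layout and should be checked against your own file.

```shell
# Convert a fetchngs-style samplesheet (quoted CSV with a header row) into
# the seven-column layout used above for nf-core/sarek.
to_sarek_sheet() {
    # Header expected by the sarek input sheet, as in the command above.
    echo 'patient,sex,status,sample,lane,fastq_1,fastq_2'
    # Skip the input header, strip double quotes, then emit the fields:
    # patient=$1, sex=NA, status=0, sample=$5, lane=1, fastq_1=$2, fastq_2=$3.
    awk -F',' -v OFS=',' \
        'NR > 1 { gsub(/"/, ""); print $1, "NA", 0, $5, 1, $2, $3 }' "$1"
}

# Example usage:
#   to_sarek_sheet ./results/samplesheet/samplesheet.csv > ./samples.csv
```

Each drone is written with sex "NA", status 0 and lane 1, matching the constants in the original command.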

Sequence data can be processed and aligned with nf-core/sarek (we recommend version 3.1.2, selected with -r 3.1.2) and variants called with HaplotypeCaller. To apply haploid calling you will need to pass additional arguments to HaplotypeCaller in a config file, e.g.:

process {
    withName: GATK4_HAPLOTYPECALLER {
        ext.args = '--sample-ploidy 1 --emit-ref-confidence GVCF -A AlleleFraction'
    }
}
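As an illustration of how the pieces might fit together, the sketch below writes the config shown above to a file and outlines a possible launch command in comments. The function name write_haploid_config, the file name haploid.config, the output directory, the container profile and the reference FASTA path are all placeholders, not values given by this resource.

```shell
# Write the haploid-calling config shown above to the file given as $1.
write_haploid_config() {
    cat > "$1" <<'EOF'
process {
    withName: GATK4_HAPLOTYPECALLER {
        ext.args = '--sample-ploidy 1 --emit-ref-confidence GVCF -A AlleleFraction'
    }
}
EOF
}

# Example launch (not executed here; paths and profile are placeholders):
#   write_haploid_config haploid.config
#   nextflow run nf-core/sarek -r 3.1.2 -profile docker \
#       --input ./samples.csv --outdir ./sarek_results \
#       --fasta /path/to/reference.fasta \
#       --tools haplotypecaller -c haploid.config
```

Other pipeline options, such as how the reference genome and base recalibration are configured, depend on your setup and are not specified here.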