Pangenome-based Genome Inference
Creators
- 1. Institute for Medical Biometry and Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- 2. New York Genome Center, New York, New York, USA
- 3. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany; European Molecular Biology Laboratory (EMBL), GeneCore, Heidelberg, Germany
- 4. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
- 5. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- 6. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- 7. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA; Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA
- 8. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Description
This record provides four archives: the haplotype-resolved assemblies ("haplotype-resolved-assemblies.tar.gz"); the variant callset and pangenome graph produced from these assemblies ("callset-and.graph.tar.gz"); the callsets and graphs used for the "leave-one-out" evaluation ("leave-one-out.tar.gz"); and the PanGenie genotypes for 300 samples (100 trios) selected from the 1000 Genomes Project ("cohort-genotypes.tar.gz").
Abstract:
Typical analysis workflows map reads to a reference genome in order to genotype genetic variants. Generating such alignments introduces reference bias and comes with a substantial computational burden.
In contrast, recent k-mer based genotypers are fast, but struggle in repetitive or duplicated genomic regions.
We propose a novel algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference in conjunction with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process we refer to as genome inference.
Compared to mapping-based approaches, PanGenie is more than 4x faster at 30x coverage and reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (≥50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the growing number of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while remaining scalable to thousands of genotyped samples.
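To make the notion of "k-mer counts from short-read sequencing data" concrete, the following is a minimal Python sketch of canonical k-mer counting. It is not PanGenie's implementation: the function names `canonical` and `count_kmers`, the treatment of a k-mer and its reverse complement as one key, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int) -> Counter:
    """Count canonical k-mers over an iterable of read sequences.

    Assumption: k-mers containing characters other than A, C, G, T
    (for example N) are skipped rather than counted.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    return counts
```

Because both strands map to the same key, counts do not depend on which strand a read was sequenced from. A genotyper could then compare such counts against the k-mers that distinguish the alleles in a pangenome graph, though how PanGenie does so is not described in this record.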
Files
(33.1 GB)
MD5 checksum | Size |
---|---|
md5:2fa13171466fdc30deab4b87891b8475 | 881.4 MB |
md5:1f73e91b8d392abae2a06ecc82f1aa82 | 6.5 GB |
md5:e9afe54ad29bedf5659a2d587304e7b4 | 24.8 GB |
md5:f77bfc530febe994b39a2d5491850c15 | 965.6 MB |
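Since the archives are large, it may be worth checking a download against its listed MD5 checksum before unpacking. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that multi-gigabyte archives are not loaded into memory at once. The function names and the chunk size are illustrative assumptions; the listing here does not state which checksum belongs to which archive, so the expected value has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    The expected value may carry the "md5:" prefix used in the listing.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:...")` returns `True` only if the downloaded file is byte-for-byte identical to the one the checksum was computed from; `local_path` stands for wherever the archive was saved.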