Pangenome-based Genome Inference

Ebler, Jana; Ebert, Peter; Clarke, Wayne E.; Rausch, Tobias; Audano, Peter A.; Houwaart, Torsten; Mao, Yafei; Korbel, Jan O.; Eichler, Evan E.; Zody, Michael C.; Dilthey, Alexander T.; Marschall, Tobias

doi:10.5281/zenodo.5119259

Published July 23, 2021 | Version v2

Dataset Open

1. Institute for Medical Biometry and Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
2. New York Genome Center, New York, New York, USA
3. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany; European Molecular Biology Laboratory (EMBL), GeneCore, Heidelberg, Germany
4. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
5. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
6. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
7. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA; Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA
8. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany

Haplotype-resolved assemblies ("haplotype-resolved-assemblies.tar.gz"), the variant callset and pangenome graph ("callset-and-graph.tar.gz") produced from these assemblies, callsets and graphs used for the "leave-one-out" evaluation ("leave-one-out.tar.gz"), and PanGenie genotypes ("cohort-genotyping.tar.gz") for 300 samples (100 trios) selected from the 1000 Genomes Project samples.

Abstract:

Typical analysis workflows map reads to a reference genome in order to genotype genetic variants. Generating such alignments introduces reference biases and comes with a substantial computational burden.
In contrast, recent k-mer based genotypers are fast, but struggle in repetitive or duplicated genomic regions.
We propose a novel algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference in conjunction with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation - a process we refer to as genome inference.
Compared to mapping-based approaches, PanGenie is more than 4x faster at 30x coverage and reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (>= 50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing number of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being scalable to thousands of genotyped samples.

Files (33.1 GB)

Name	MD5	Size
callset-and-graph.tar.gz	2fa13171466fdc30deab4b87891b8475	881.4 MB
cohort-genotyping.tar.gz	1f73e91b8d392abae2a06ecc82f1aa82	6.5 GB
haplotype-resolved-assemblies.tar.gz	e9afe54ad29bedf5659a2d587304e7b4	24.8 GB
leave-one-out.tar.gz	f77bfc530febe994b39a2d5491850c15	965.6 MB
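
Because the archives are large, it can be worth checking each download against the MD5 checksums listed in the file table above. The following is a minimal sketch, not part of the record itself: it assumes the archives were saved under their listed names in a single directory, and the helper names md5_of_file and verify_downloads are hypothetical, introduced only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file table of this record.
EXPECTED_MD5 = {
    "callset-and-graph.tar.gz": "2fa13171466fdc30deab4b87891b8475",
    "cohort-genotyping.tar.gz": "1f73e91b8d392abae2a06ecc82f1aa82",
    "haplotype-resolved-assemblies.tar.gz": "e9afe54ad29bedf5659a2d587304e7b4",
    "leave-one-out.tar.gz": "f77bfc530febe994b39a2d5491850c15",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hypothetical helper: MD5 hex digest of a file, read in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Hypothetical helper: map each archive name to True (checksum matches),
    False (checksum differs), or None (file not found in `directory`)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Run from the download directory, the script prints one status line per archive; a MISMATCH usually indicates an incomplete or corrupted download, in which case the file should be fetched again.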
