There is a newer version of the record available.

Published July 23, 2021 | Version v2
Dataset Open

Pangenome-based Genome Inference

  • 1. Institute for Medical Biometry and Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  • 2. New York Genome Center, New York, New York, USA
  • 3. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany; European Molecular Biology Laboratory (EMBL), GeneCore, Heidelberg, Germany
  • 4. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
  • 5. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  • 6. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
  • 7. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA; Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA
  • 8. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany

Description

This record contains haplotype-resolved assemblies ("haplotype-resolved-assemblies.tar.gz"), the variant callset and pangenome graph produced from these assemblies ("callset-and.graph.tar.gz"), the callsets and graphs used for the "leave-one-out" evaluation ("leave-one-out.tar.gz"), and PanGenie genotypes ("cohort-genotypes.tar.gz") for 300 samples (100 trios) selected from the 1000 Genomes Project samples.
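Because the archives are large, it can help to look at an archive's contents before extracting it. The following is a minimal sketch using Python's standard `tarfile` module. The function name `list_archive` and the `limit` parameter are hypothetical, and no member names are assumed because the internal layout of the archives is not described here.

```python
import tarfile


def list_archive(path, limit=20):
    """Return the names of up to `limit` members of a .tar.gz archive.

    Nothing is extracted; the archive is read member by member and
    reading stops as soon as `limit` names have been collected.
    """
    names = []
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, `list_archive("leave-one-out.tar.gz")` would return the first 20 member names of that archive, assuming it has been downloaded to the working directory.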

Abstract:

Typical analysis workflows map reads to a reference genome in order to genotype genetic variants. Generating such alignments introduces reference biases and comes with a substantial computational burden.
In contrast, recent k-mer-based genotypers are fast, but struggle in repetitive or duplicated genomic regions.
We propose a novel algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference in conjunction with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process we refer to as genome inference.
Compared to mapping-based approaches, PanGenie is more than 4x faster at 30x coverage and reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (>= 50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing number of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being scalable to thousands of genotyped samples.

Files

Files (33.1 GB)

Checksum                                Size
md5:2fa13171466fdc30deab4b87891b8475    881.4 MB
md5:1f73e91b8d392abae2a06ecc82f1aa82    6.5 GB
md5:e9afe54ad29bedf5659a2d587304e7b4    24.8 GB
md5:f77bfc530febe994b39a2d5491850c15    965.6 MB
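
The md5 values listed above can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks so that the multi-gigabyte archives do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be given with or without the "md5:" prefix
    used in the file listing.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

The listing here does not show which checksum belongs to which archive, so the expected value should be taken from the entry shown next to the corresponding file on the record page.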