Contributor: HHU Düsseldorf / Marschall Lab
Contact: Jana Ebler [jana.ebler@hhu.de] / Tobias Marschall [tobias.marschall@hhu.de]

Genotyping results produced for the Minigraph-Cactus paper based on GRCh38 and CHM13. PanGenie v2.1.0 was used.
Pipelines used: https://bitbucket.org/jana_ebler/hprc-experiments/src/master/ (GRCh38) and https://bitbucket.org/jana_ebler/hprc-experiments/src/chm13-based-pipeline/ (CHM13)


GRCh38-based results:


1.) grch38_all-samples_bi_all.vcf.gz: 

PanGenie genotypes across all 300 pilot samples and the panel samples in bi-allelic representation. Contains
the unfiltered set (= all variants). To obtain the final, filtered genotypes, extract all variants with confidence_level >= 1 as defined in the file grch38_bi_all_filters.tsv.gz.
This can be done with the provided script select_ids.py using the following command (note that the command refers to the filter table by its uncompressed name, grch38_bi_all_filters.tsv):

zcat grch38_all-samples_bi_all.vcf.gz | python3 select_ids.py grch38_bi_all_filters.tsv filtered | bgzip -c > grch38_all-samples_bi_filtered.vcf.gz
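
For readers who want to see what this filtering step amounts to, the following is a minimal sketch and not the provided select_ids.py. It assumes that the filter table has a header row, that the variant IDs sit in a column named "variant_id" (a hypothetical name; only "confidence_level" is documented here), and that the same IDs appear in the ID column (third column) of the VCF.

```python
import csv


def load_filtered_ids(tsv_handle, id_column="variant_id", min_level=1):
    """Collect the IDs of all variants with confidence_level >= min_level.

    id_column is a hypothetical column name; adjust it to the actual header
    of the filter table.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {
        row[id_column]
        for row in reader
        if int(row["confidence_level"]) >= min_level
    }


def filter_vcf(vcf_lines, keep_ids):
    """Yield all header lines plus the records whose ID is in keep_ids."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are passed through.
            yield line
            continue
        # Assumption: the variant ID is stored in the third VCF column.
        fields = line.split("\t", 3)
        if fields[2] in keep_ids:
            yield line
```

Both functions work on open text handles or any iterable of lines, so they can be combined with gzip.open(path, "rt") for the compressed files. The output of such a sketch is plain text and would still need to be compressed with bgzip.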


2.) grch38_bi_all_filters.tsv.gz: 

Filters computed across genotypes. The column "confidence_level" defines which variants belong to the unfiltered, positive and filtered sets of variants.
- unfiltered set (= all variants): confidence_level >= 0
- positive set: confidence_level = 4
- final filtered set: confidence_level >= 1
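
The set definitions above can be written as a small helper. This is only an illustrative sketch: the function name is hypothetical, and it assumes that confidence_level is stored as an integer.

```python
def variant_sets(confidence_level):
    """Return the names of the variant sets a confidence_level falls into."""
    sets = set()
    if confidence_level >= 0:
        sets.add("unfiltered")  # all variants
    if confidence_level >= 1:
        sets.add("filtered")    # final filtered set
    if confidence_level == 4:
        sets.add("positive")    # positive set
    return sets
```

With these definitions the sets are nested: every variant in the positive set is also in the filtered set, and every filtered variant is in the unfiltered set.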



CHM13-based results:

1.) cactus_filtered_ids_chm13.vcf.gz:

Input VCF used for PanGenie. Filtered and preprocessed version of the Minigraph-Cactus VCF for CHM13.


2.) chm13_all-samples_bi_all.vcf.gz: 

PanGenie genotypes across all 300 pilot samples and the panel samples in bi-allelic representation. Contains
the unfiltered set (= all variants). To obtain the final, filtered genotypes, extract all variants with confidence_level >= 1 as defined in the file chm13_bi_all_filters.tsv.gz.
This can be done with the provided script select_ids.py using the following command (note that the command refers to the filter table by its uncompressed name, chm13_bi_all_filters.tsv):

zcat chm13_all-samples_bi_all.vcf.gz | python3 select_ids.py chm13_bi_all_filters.tsv filtered | bgzip -c > chm13_all-samples_bi_filtered.vcf.gz


3.) chm13_bi_all_filters.tsv.gz: 

Filters computed across genotypes. The column "confidence_level" defines which variants belong to the unfiltered, positive and filtered sets of variants.
- unfiltered set (= all variants): confidence_level >= 0
- positive set: confidence_level = 4
- final filtered set: confidence_level >= 1



HGSVC GRCh38-based results:


1.) hgsvc_all-samples_bi_all.vcf.gz: 

PanGenie genotypes across all 300 pilot samples and the panel samples in bi-allelic representation. Contains
the unfiltered set (= all variants). To obtain the final, filtered genotypes, extract all variants with confidence_level >= 1 as defined in the file hgsvc_bi_all_filters.tsv.gz.
This can be done with the provided script select_ids.py using the following command (note that the command refers to the filter table by its uncompressed name, hgsvc_bi_all_filters.tsv):

zcat hgsvc_all-samples_bi_all.vcf.gz | python3 select_ids.py hgsvc_bi_all_filters.tsv filtered | bgzip -c > hgsvc_all-samples_bi_filtered.vcf.gz


2.) hgsvc_bi_all_filters.tsv.gz: 

Filters computed across genotypes. The column "confidence_level" defines which variants belong to the unfiltered, positive and filtered sets of variants.
- unfiltered set (= all variants): confidence_level >= 0
- positive set: confidence_level = 4
- final filtered set: confidence_level >= 1