Published December 21, 2016 | Version v1
Dataset Open

Raw BRCA1/2 variants in breast cancer patients and healthy relatives produced with GATK.

Creators

  • 1. Institute of Molecular Biology NAS RA

Description

Aligned sequencing data are available in the NCBI Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra/) under accession SRP095082. Variants were called using GATK HaplotypeCaller (version 3.6). After joint genotyping, a multi-sample VCF file was generated. Next, SNPs and indels were extracted into two separate VCF files, and a specific set of hard filters was applied to each.
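The SNP/indel separation described above can be illustrated with a short Python sketch. It is not taken from the scripts in this record; the function names classify_alleles and split_vcf_lines are hypothetical, and the classification rule (single-base REF and ALT alleles are SNPs, length-changing alleles are indels) is a simplifying assumption. In GATK 3.x this kind of split is commonly done with SelectVariants and its -selectType option; the scripts in the archive remain the authoritative record of what was actually run.

```python
def classify_alleles(ref, alts):
    """Classify a VCF record as 'SNP', 'INDEL', or 'OTHER' from its alleles.

    Simplifying assumption: a record is a SNP when REF and every ALT are one
    base long, and an indel when every ALT differs in length from REF.
    Symbolic or spanning-deletion alleles and mixed sites are 'OTHER'.
    """
    if not alts or any(a.startswith("<") or a in ("*", ".") for a in alts):
        return "OTHER"
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    if all(len(a) != len(ref) for a in alts):
        return "INDEL"
    return "OTHER"


def split_vcf_lines(lines):
    """Split VCF text lines into (header, snp_records, indel_records)."""
    header, snps, indels = [], [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")  # REF and ALT columns
        kind = classify_alleles(ref, alts)
        if kind == "SNP":
            snps.append(line)
        elif kind == "INDEL":
            indels.append(line)
    return header, snps, indels
```

Records classified as OTHER are dropped from both outputs in this sketch; whether the original pipeline kept or discarded such sites is not stated in the description.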

 

File descriptions

Datasets

BRCA_SNVs.vcf - This file contains SNPs called with GATK, with hard filters applied. The following filter expressions were used: "QD < 2.0", "FS > 60.0", "MQ < 40.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0", "SB < -0.10", "DP < 10", "GQ < 30", and "SOR > 3.0".

BRCA_indels.vcf - This file contains indels called with GATK, with hard filters applied. The following filter expressions were used: "QD < 2.0", "FS > 200.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", and "SOR > 10.0".
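As a rough illustration of how the two hard-filter sets listed above behave, the following Python sketch evaluates them against a dictionary of annotation values. The thresholds are copied from the file descriptions; everything else is an assumption made for illustration: the names SNP_FILTERS, INDEL_FILTERS and failed_filters are hypothetical, all annotations are treated as one flat mapping (the description does not say whether DP and GQ were applied per site or per genotype), and annotations missing from a record are simply skipped.

```python
import operator

# Thresholds as listed in the file descriptions: (annotation, operator, value).
SNP_FILTERS = [
    ("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0),
    ("SB", "<", -0.10), ("DP", "<", 10), ("GQ", "<", 30), ("SOR", ">", 3.0),
]
INDEL_FILTERS = [
    ("QD", "<", 2.0), ("FS", ">", 200.0), ("ReadPosRankSum", "<", -20.0),
    ("InbreedingCoeff", "<", -0.8), ("SOR", ">", 10.0),
]

_OPS = {"<": operator.lt, ">": operator.gt}


def failed_filters(annotations, filters):
    """Return the filter expressions that a record's annotations trigger.

    Assumption: an annotation absent from `annotations` is skipped rather
    than counted as a failure.
    """
    failed = []
    for name, op, threshold in filters:
        value = annotations.get(name)
        if value is None:
            continue
        if _OPS[op](float(value), threshold):
            failed.append(f"{name} {op} {threshold}")
    return failed
```

A record passes when the returned list is empty. For example, a record with FS = 100.0 would trigger "FS > 60.0" under the SNP set but nothing under the indel set, whose FS threshold is 200.0.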

 

Scripts package (scritps.zip)

The scritps.zip archive contains scripts and supporting files for genotype calling and filtering.

raw.variant.caling.sh – BAM file preprocessing, alignment refinement, and raw genotype calling with HaplotypeCaller.

genotyping_and_filtering.sh – joint genotyping, variant hard filtering and callset refinement.

LIST.txt – supporting file listing the names of the BAM files that contain the aligned reads.

sample_order.txt – supporting file for sample renaming.

 

Reference files (hg19) used in variant calling scripts

Reference files can be downloaded from the GATK bundle website at https://software.broadinstitute.org/gatk/download/bundle.

ucsc.hg19.fasta - human genome assembly;

Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz – set of known indels to be used for local realignment;

1000G_phase1.indels.hg19.sites.vcf.gz – set of known indels to be used for local realignment;

dbsnp_138.hg19.vcf.gz – a recent dbSNP release (build 138); 

1000G_phase3_v4_20130502.hg19.lifted.sites.vcf – the latest set from 1000G phase 3 (v4) for genotype refinement.

 

Files

scritps.zip

Files (255.0 kB)

MD5 checksum Size
md5:a186bc3435a898cd6e169194aab6d621 58.1 kB
md5:fc1c2a2b52578630166b557846c0a8ba 193.5 kB
md5:9607c82e3990cf245ea3b75b53587f0f 3.4 kB