######################################################
# README for Local Ancestry Data from 563 Cabo Verdean individuals
#
# The following steps were performed in the Goldberg Lab at Duke University by Katharine Korunes (contact: kkorunes@gmail.com)
# Local ancestry calling was performed in March 2020. README last edited 01 September 2020.
#
######################################################

This directory contains local ancestry calls based on SNP array data described in Beleza et al (2013), prepared as described below.

# DATA AND PRE-PROCESSING STEPS:
The starting dataset included 564 admixed individuals from Cabo Verde. We removed individuals with >5% missing calls overall or >10% missing calls on any single chromosome. This removed 1 individual with high missingness on chromosome 14 (11.36%).

For the remaining 563 individuals, we merged the data with genotypes from 107 IBS (Iberian Population in Spain) and 107 GWD (Gambian in Western Division - Mandinka) samples from high-coverage resequencing data released through the International Genome Sample Resource (see Clarke et al 2017 and Fairley et al 2020). We selected biallelic SNPs occurring in both the Cabo Verde samples and the reference samples. The merged dataset contained 884,656 autosomal SNPs and 20,967 X chromosome SNPs shared between the Cabo Verde samples and the reference samples, with average missingness by SNP of 0.0017 for autosomes and 0.0024 for the X chromosome.

We then performed phasing with SHAPEIT2 using the Phase 3, NCBI build 37 (hg19) reference panel of haplotypes and associated genetic map in IMPUTE2 format (Delaneau et al 2013). We first ran SHAPEIT -check to exclude sites not contained within the reference map, followed by SHAPEIT phasing to yield phased genotypes at 881,279 autosomal SNPs and 20,793 X chromosome SNPs. These phased samples were provided to RFMix, as described under LOCAL ANCESTRY INFERENCE.
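
The individual-level missingness filter described above (>5% missing overall, or >10% missing on any single chromosome) can be sketched as follows. This is a minimal illustration, not the script used to prepare this release: the function name, the input layout (per-chromosome missing and total call counts for each individual), and the sample IDs are all hypothetical.

```python
# Illustrative sketch only; names and input layout are assumptions,
# not the pipeline used to prepare this release.
OVERALL_MAX = 0.05    # remove if >5% of calls are missing overall
PER_CHROM_MAX = 0.10  # remove if >10% of calls are missing on any one chromosome


def passes_missingness(counts):
    """counts maps chromosome -> (n_missing, n_total) for one individual."""
    total_missing = sum(m for m, _ in counts.values())
    total_calls = sum(t for _, t in counts.values())
    if total_missing / total_calls > OVERALL_MAX:
        return False
    # Every chromosome must be at or below the per-chromosome threshold.
    return all(m / t <= PER_CHROM_MAX for m, t in counts.values())


# Toy data with hypothetical IDs: ind_B is below 5% overall but has
# 11.36% missing calls on chromosome 14, so it is removed.
toy = {
    "ind_A": {"13": (40, 40000), "14": (20, 10000)},
    "ind_B": {"13": (40, 40000), "14": (1136, 10000)},
}
kept = [ind for ind, counts in toy.items() if passes_missingness(counts)]
print(kept)
```

The two thresholds are checked separately because an individual can pass the overall threshold while still failing on one chromosome, as was the case for the individual removed here.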

# LOCAL ANCESTRY INFERENCE:
We ran RFMix v1.5.4 (Maples et al 2013) on the phased samples using a two-way admixture model. We used the RFMix PopPhased program with default window size, the --use-reference-panels-in-EM option, -e = 2 (2 EM iterations), and --forward-backward.

# FILES:
- Files with the *.Viterbi.txt suffix are the RFMix output. See the RFMix v1.5.4 Manual for details about RFMix and its file formats. Briefly, the Viterbi files contain one row per SNP and one column per haplotype. Thus, there are 2 columns per sample, in the order described below. Note that since RFMix was run with --use-reference-panels-in-EM, the reference population haplotypes are included in the Viterbi files.
- The CaboVerde_LocalAncestryRelease_RFMixOrder_BlindIDs.txt file lists the samples in the same order as the RFMix output. This file matches the haplotypes to their corresponding island within Cabo Verde.
- Files with the *.map suffix contain details about the SNPs, in order along each chromosome. The 3 columns are: physical position, map position (in cM), and rsID.

# CITATIONS:
- S. Beleza, et al., Genetic Architecture of Skin and Eye Color in an African-European Admixed Population. PLOS Genetics 9, e1003372 (2013).
- L. Clarke, et al., The international Genome sample resource (IGSR): A worldwide collection of genome variation incorporating the 1000 Genomes Project data. Nucleic Acids Research 45, D854–D859 (2017).
- S. Fairley, E. Lowy-Gallego, E. Perry, P. Flicek, The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Research 48, D941–D947 (2020).
- O. Delaneau, J.-F. Zagury, J. Marchini, Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10, 5–6 (2013).
- B. K. Maples, S. Gravel, E. E. Kenny, C. D. Bustamante, RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. The American Journal of Human Genetics 93, 278–288 (2013).
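
# EXAMPLE (ILLUSTRATIVE SKETCH):
The Viterbi file layout described under FILES (one row per SNP, one column per haplotype, 2 consecutive columns per sample) could be summarized per haplotype roughly as below. This is a sketch under assumptions, not code from this release: it assumes the rows are whitespace-delimited ancestry labels, the function name and toy labels are hypothetical, and this README does not state which label corresponds to which reference population, so check the RFMix v1.5.4 Manual before interpreting the labels.

```python
# Illustrative sketch only; the function name and toy labels are assumptions.
from collections import Counter


def haplotype_ancestry_fractions(lines):
    """For each haplotype column, return a dict: label -> fraction of SNPs."""
    counts = []  # one Counter per haplotype column
    n_snps = 0
    for line in lines:
        labels = line.split()
        if not labels:
            continue  # skip blank lines
        if not counts:
            counts = [Counter() for _ in labels]
        for counter, label in zip(counts, labels):
            counter[label] += 1
        n_snps += 1
    return [{lab: n / n_snps for lab, n in c.items()} for c in counts]


# Toy input: 4 SNPs (rows) x 4 haplotypes (columns), i.e. 2 samples.
toy_rows = ["1 1 2 1", "1 2 2 1", "1 2 2 1", "2 2 2 1"]
fractions = haplotype_ancestry_fractions(toy_rows)
# Pair consecutive columns: columns 2i and 2i+1 belong to sample i.
per_sample = [fractions[i:i + 2] for i in range(0, len(fractions), 2)]
print(per_sample)
```

For a real file, an open file handle can be passed in place of the toy list, since the function only iterates over lines. Note that this counts SNPs, not genetic or physical distance; weighting by the positions in the corresponding *.map file would be needed for length-based ancestry proportions.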