Published November 9, 2023 | Version 1.0
Dataset Open

Simulated Ancient Genomic Kinship Dataset: BAM (5x run7-12) Files for Related (including inbred) Pairs

Description

Simulated Ancient Genomic Kinship Dataset: VCF and BAM (5x run7-12) Files for Related (including inbred) Pairs

Description:

This dataset comprises simulated pedigrees (VCF files containing 8,677,101 autosomal biallelic and 298,625 X-chromosomal SNP positions) generated with Ped-sim (v1.3). The pedigrees cover pairs of diverse familial relationship types up to the third degree. The first-degree relationships are parent-offspring and siblings; the second-degree relationships are half-siblings, grandparent-grandchild, and avuncular pairs; and the third-degree relationships are first cousins, great-grandparent-great-grandchild, and grand-avuncular pairs. For each of these 8 relationship types, the dataset includes 48 pairs of individuals. It also contains unrelated pairs. In addition, the dataset includes first- and second-degree relatives with inbreeding: parent-offspring pairs in which the offspring's parents are first cousins, and grandparent-grandchild pairs in which the grandchild is the offspring of first cousins. The simulations cover all combinations of sexes for each kinship type.

Ancient DNA-like sequencing data (5x and 1x BAM files) were then simulated for the Ped-sim individuals using gargammel, following procedures akin to standard paleogenomic sequencing libraries. Note that the BAM files contain only 200K randomly chosen autosomal SNP positions, which are listed in the "200K_positions" file. Details can be found in Aktürk, Mapelli and Güler et al. 2023.

Data Sources and Generation:

Founder genotypes for pedigree simulation were created from SNPs of the Tuscan (TSI) population in the 1000 Genomes Dataset v3. Notably, the founder genotypes lack background relatedness and runs of homozygosity (ROH).

Description of File Naming Conventions:

The naming conventions of the BAM files in this dataset are designed to convey key information regarding the specifics of each file.

cov1x or cov5x: This segment denotes the coverage level of the BAM files, indicating whether the sequencing coverage for the individuals in the files is 1x or 5x.

run_*: Signifies the particular batch from which the pedigree and individuals are derived. This name segment also applies to VCF files.

parent-offspring_* or similar identifiers: Reflects the origin of the individual from the corresponding VCF file. For instance, "parent-offspring_1" corresponds to the individuals present in the "run_*_parent-offspring_1.vcf" file.

parent-offspring* or similar identifiers: Indicates the set within the VCF file from which the individual originates. For example, "parent-offspring1" signifies the first set of parent-offspring pedigrees within the VCF file. Note that the parent-offspring, grandparent-grandchild, great-grandparent-great-grandchild, and inbreeding VCFs contain only one set, so this identifier is always 1 for them. For the rest of the pedigrees, the identifier can be 1 or 2, as those VCF files contain two sets of related pairs.
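The set-numbering rule above can be written down as a small lookup, which may help when validating file names in bulk. This is only a sketch: apart from "parent-offspring", the label strings below are assumed spellings that have not been checked against the actual file names, and the function name and the "inbred" flag are hypothetical.

```python
# Relationship types whose VCFs hold a single set, per the description above.
# Only "parent-offspring" is a confirmed label; the other spellings are assumptions.
SINGLE_SET_TYPES = frozenset({
    "parent-offspring",
    "grandparent-grandchild",
    "great-grandparent-great-grandchild",
})


def allowed_set_indices(relationship: str, inbred: bool = False) -> frozenset:
    """Return the set identifiers that may follow the relationship label."""
    # Inbreeding VCFs and the three lineal types contain one set only.
    if inbred or relationship in SINGLE_SET_TYPES:
        return frozenset({1})
    # All other pedigree VCFs contain two sets of related pairs.
    return frozenset({1, 2})
```

A set identifier outside the returned values would suggest either a typo or a label spelling that differs from the assumed one.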

_g*-b*-: Provides information about the individual's generational level within the VCF. This follows the Ped-sim syntax. For example, for parent-offspring type, "_g1-b1-" indicates the first parent (generation 1) within a specific pedigree, and "_g1-b2-" indicates the second parent (generation 1) while "_g2-b1-" represents the offspring (generation 2).

Example Naming Structure:

For instance, the file "cov1x_run1_parent-offspring_1_parent-offspring1_g1-b1-i1.all.hs37d5.cons.90perc.trimBAM.bam" is a BAM file with 1x coverage, originating from "run1", containing an individual from the "run_*_parent-offspring_1.vcf" file (first set of parent-offspring pairs), where "_g1-b1-" designates the first parent in the first generation. The trailing part of the name, "hs37d5.cons.90perc.trimBAM.bam", is the same across all files.
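The naming scheme can be unpacked programmatically. The sketch below is a minimal parser whose pattern is inferred from the single example name above, so it is an assumption that every file in the archive follows exactly this layout; the function and field names are hypothetical.

```python
import re

# Pattern inferred from the one example file name in this description.
# Other files may deviate, so treat the pattern as an assumption.
BAM_NAME = re.compile(
    r"^cov(?P<coverage>\d+)x"                    # cov1x or cov5x
    r"_run_?(?P<run>\d+)"                        # run1 (also tolerates run_1)
    r"_(?P<vcf_label>.+)_(?P<vcf_index>\d+)"     # e.g. parent-offspring_1
    r"_(?P<set_label>.+?)(?P<set_index>\d+)"     # e.g. parent-offspring1
    r"_g(?P<generation>\d+)-b(?P<branch>\d+)-(?P<member>[a-z]+\d+)"
    r"\.(?P<suffix>.+\.bam)$"                    # shared trailing part
)

INT_FIELDS = ("coverage", "run", "vcf_index", "set_index", "generation", "branch")


def parse_bam_name(name: str) -> dict:
    """Split a BAM file name into its naming-convention segments."""
    match = BAM_NAME.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed convention: {name}")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


example = (
    "cov1x_run1_parent-offspring_1_parent-offspring1_g1-b1-i1"
    ".all.hs37d5.cons.90perc.trimBAM.bam"
)
info = parse_bam_name(example)
print(info["coverage"], info["run"], info["vcf_label"], info["vcf_index"])
print(info["set_label"], info["set_index"], info["generation"], info["branch"])
```

Raising an error on unmatched names, rather than returning partial fields, makes it easier to spot files that do not fit the assumed pattern.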

Note1: Segments such as parent-offspring*_g*-b*- can also be tracked in the naming of the genotype columns in the VCF.

Note2: The sex of each individual in the VCF files can be determined from the genotypes at X chromosome positions. Individuals carrying two alleles per X chromosome genotype are female, while those carrying a single allele are male.
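As an illustration of Note2, the sketch below infers sex from the genotype strings of one individual at X chromosome positions. It assumes that male genotypes are written as haploid VCF GT values (a single allele such as "0" or "1") and female genotypes as diploid values (such as "0/1" or "1|0"); the actual encoding should be checked in the files, and the function names are hypothetical.

```python
import re


def ploidy(gt: str) -> int:
    """Count the alleles in a VCF GT string such as '0', '1', '0/1' or '1|0'."""
    return len(re.split(r"[/|]", gt))


def infer_sex(x_genotypes) -> str:
    """Classify one individual from its GT strings at X chromosome positions."""
    ploidies = {ploidy(gt) for gt in x_genotypes}
    if ploidies == {1}:
        return "male"       # a single allele at every X position
    if ploidies == {2}:
        return "female"     # two alleles at every X position
    return "ambiguous"      # mixed ploidy or no genotypes supplied


print(infer_sex(["0", "1", "1"]))
print(infer_sex(["0|1", "1|1", "0/0"]))
```

Returning "ambiguous" for mixed ploidy avoids silently assigning a sex when the input does not match the assumed encoding.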

Note3: Some of the individuals from distinct pedigrees may, in fact, be related due to shared ancestry through common founders. To suit specific research objectives, researchers may need to identify and exclude such relatives if the full dataset is used for kinship estimation.

For more details about how the dataset was generated, its characteristics, or any other specific questions, please contact us. We welcome inquiries and are glad to provide additional information that helps researchers use this dataset effectively.

This repository contains only cov5x BAM files (run7-12). The rest of the files can be found at 10.5281/zenodo.10079685 and 10.5281/zenodo.10070958.

 

Files

cov5x_bams_run7-12.zip (30.0 GB)
md5:5e31c5cb4c6e3c8be93e1e640e32aa12
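Because the archive is large, it may be worth checking the download against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file path are hypothetical, and the file is read in chunks so that the 30 GB archive does not need to fit in memory.

```python
import hashlib

# Checksum as listed for cov5x_bams_run7-12.zip in this record.
EXPECTED_MD5 = "5e31c5cb4c6e3c8be93e1e640e32aa12"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_record("cov5x_bams_run7-12.zip") would be expected to return True for an intact download stored under that (assumed) local name.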

Additional details

References

  • Caballero, M., Seidman, D. N., Qiao, Y., Sannerud, J., Dyer, T. D., Lehman, D. M., Curran, J. E., Duggirala, R., Blangero, J., & Carmi, S. (2019). Crossover interference and sex-specific genetic maps shape identical by descent sharing in close relatives. PLoS Genetics, 15(12), e1007979.
  • Renaud, G., Hanghøj, K., Willerslev, E., & Orlando, L. (2017). gargammel: a sequence simulator for ancient DNA. Bioinformatics, 33(4), 577–579. https://doi.org/10.1093/BIOINFORMATICS/BTW670
  • The 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. https://doi.org/10.1038/nature15393
  • Bherer, C., Campbell, C. L., & Auton, A. (2017). Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nature Communications, 8(1), 1–9. https://doi.org/10.1038/ncomms14994
  • Housworth, E. A., & Stahl, F. W. (2003). Crossover interference in humans. American Journal of Human Genetics, 73(1), 188–197. https://doi.org/10.1086/376610
  • Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., & Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158.
  • Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. https://doi.org/10.1093/BIOINFORMATICS/BTQ033
  • Briggs, A. W., Stenzel, U., Johnson, P. L. F., Green, R. E., Kelso, J., Prüfer, K., Meyer, M., Krause, J., Ronan, M. T., Lachmann, M., & Pääbo, S. (2007). Patterns of damage in genomic DNA sequences from a Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 104(37), 14616–14621. https://doi.org/10.1073/PNAS.0704665104
  • Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: Rapid adapter trimming, identification, and read merging. BMC Research Notes, 9(1). https://doi.org/10.1186/s13104-016-1900-2
  • Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/BIOINFORMATICS/BTP324
  • Jun, G., Wing, M. K., Abecasis, G. R., & Kang, H. M. (2015). An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Research, 25(6), 918–925. https://doi.org/10.1101/gr.176552.114