Published September 8, 2021 | Version 1.0.0
Dataset Open

A unified genealogy of modern and ancient genomes: Unified, inferred tree sequences of 1000 Genomes, Human Genome Diversity, and Simons Genome Diversity Projects

  • 1. Broad Institute of MIT and Harvard, Cambridge, MA, USA
  • 2. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  • 3. Harvard Medical School Department of Genetics, Boston, MA, USA
  • 4. Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria

Description

Unified, inferred tree sequences built from the 1000 Genomes phase 3, Human Genome Diversity, and Simons Genome Diversity Projects. Each tree sequence covers one arm of an autosome (the short arms of acrocentric chromosomes are not included). Tree sequences were inferred using tsinfer version 0.2.1, dated using tsdate version 0.1.4, and compressed using tszip. All data use GRCh38 coordinates.

The full data pipeline used to generate these tree sequences and associated metadata is available on GitHub. A description can be found in the Supplementary Material of Wohns et al. (2021).

Tree sequences can be decompressed with the tsunzip command, which is installed with tszip:

$ tsunzip hgdp_tgp_sgdp_chr1_p.dated.trees.tsz

Once decompressed, the .trees files can be loaded and processed in Python using tskit:

import tskit
ts = tskit.load("hgdp_tgp_sgdp_chr1_p.dated.trees")
# ts is an instance of tskit.TreeSequence 
print("The short arm of chromosome 1 contains {} trees".format(ts.num_trees))

Node metadata contains the mean and variance of tsdate's posterior distribution on node time. To access these values, we can use:

import json
node = ts.node(10000)
metadata_dict = json.loads(node.metadata)
print("The mean of the posterior distribution on the age of node 10000 is {} generations".format(metadata_dict["mn"]))
print("The variance of the posterior distribution on the age of node 10000 is {} generations squared".format(metadata_dict["vr"]))
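
Because the variance is not in the same units as the mean, it can be convenient to report a standard deviation (in generations) instead. The following is a minimal sketch, assuming the node metadata is JSON with the "mn" and "vr" keys used above. The helper name posterior_summary and the example values are hypothetical and are not part of tskit or tsdate.

```python
import json
import math


def posterior_summary(raw_metadata):
    """Return (mean, standard deviation) of a node's age posterior, in generations.

    raw_metadata is the JSON-encoded node metadata (bytes or str) holding
    the "mn" (mean) and "vr" (variance) keys.
    """
    md = json.loads(raw_metadata)
    return md["mn"], math.sqrt(md["vr"])


# Made-up metadata for illustration only; with a loaded tree sequence one
# would pass ts.node(10000).metadata instead.
example = b'{"mn": 120.0, "vr": 400.0}'
mean, sd = posterior_summary(example)
print("mean = {} generations, sd = {} generations".format(mean, sd))
```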

Age estimates for each variant site can be derived from the mean of the age estimates of the upper and lower bounding nodes of the oldest mutation associated with a site. tsdate includes a function to find the age estimates of all sites in the tree sequence:

import tsdate
site_times = tsdate.sites_time_from_ts(ts, node_selection='arithmetic')

This returns a NumPy array with one entry per site.
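
To make the averaging step concrete, the calculation for a single site can be sketched in plain Python. This is an illustration of the idea only, not tsdate's implementation: the function name site_age is hypothetical, each mutation is represented as a (child node time, parent node time) pair, and taking the oldest mutation to be the one with the oldest child node is an assumption of this sketch.

```python
import math


def site_age(mutation_bounds, node_selection="arithmetic"):
    """Estimate the age of one site from its mutations' bounding node times.

    mutation_bounds is a list of (child_time, parent_time) pairs, one per
    mutation at the site, in generations.
    """
    # Assumption: the oldest mutation is the one whose lower bounding
    # (child) node is oldest.
    child_time, parent_time = max(mutation_bounds, key=lambda b: b[0])
    if node_selection == "arithmetic":
        return (child_time + parent_time) / 2
    if node_selection == "geometric":
        return math.sqrt(child_time * parent_time)
    raise ValueError("unknown node_selection: {}".format(node_selection))


# Two mutations at one site; the first is the older, so it determines the age.
print(site_age([(100.0, 300.0), (50.0, 80.0)]))
```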

Accessing a variant site in the tree sequence provides its position and ID:

site = ts.site(1000)
site_metadata = json.loads(site.metadata)
print("The position of site 1000 is {} and its ID is {}.".format(site.position, site_metadata["ID"]))

Metadata associated with individuals and populations was derived from the original sources (TGP, HGDP, and SGDP) and converted to JSON form. For example, to access individual metadata we can use:

ind = ts.individual(0)
metadata_dict = json.loads(ind.metadata)

The metadata_dict variable will now contain all the metadata for the individual with ID 0 as a dictionary. Metadata associated with populations can be found in a similar way. Population IDs are associated with individuals via their constituent nodes. For example:

pop_metadata = [json.loads(pop.metadata) for pop in ts.populations()]
ind_node = ts.node(ind.nodes[0])
ind_pop_metadata = pop_metadata[ind_node.population]

After this, the ind_pop_metadata variable will contain the population level metadata for individual ID 0.
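
The lookup above uses only the first node of the individual. A slightly more defensive sketch checks that all of an individual's nodes agree on the population before returning its metadata. The helper name individual_population_metadata is hypothetical; it relies only on the individual(), node(), and population() accessors of a tskit.TreeSequence, and assumes population metadata is JSON as described above.

```python
import json


def individual_population_metadata(ts, individual_id):
    """Return the population metadata dict for one individual.

    ts is expected to behave like a tskit.TreeSequence. A ValueError is
    raised if the individual's nodes do not all share one population.
    """
    ind = ts.individual(individual_id)
    pop_ids = {int(ts.node(n).population) for n in ind.nodes}
    if len(pop_ids) != 1:
        raise ValueError(
            "individual {} has nodes in populations {}".format(
                individual_id, sorted(pop_ids)
            )
        )
    return json.loads(ts.population(pop_ids.pop()).metadata)


# Usage with a loaded tree sequence (not executed here):
#   ind_pop_metadata = individual_population_metadata(ts, 0)
```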

Files (3.5 GB)

MD5 checksum                          Size
md5:b3df74072a42a87856f5a8e5042a1b84  55.9 MB
md5:ced040878fd4186d485e5b54b3cd7838  115.8 MB
md5:955dcb7a4ce6a6fa87cbf8cb32409378  63.7 MB
md5:738018952cd2bc20adc59503256b3e87  100.2 MB
md5:b3c6965fbeee8d03d2a37d35c5ddd696  46.3 MB
md5:d2550904e2dff8361f837c137a446e6f  119.0 MB
md5:57955f6bdc38818fdad086e987ed4d2a  124.3 MB
md5:48a8e488cc97c3402cb32ffd21f28d36  114.4 MB
md5:ec313f1181c1c804f40ebe13aa269cfc  110.0 MB
md5:4a70525d9cd4a51c499ce758e2bebf5b  49.7 MB
md5:e0cb87835020b4fc299f4479d699ccaf  64.5 MB
md5:37f83d76d69bfee68df4b75bb84abf95  32.1 MB
md5:bbeeb5adf7960039aa9407d995361315  71.9 MB
md5:88f2b5aeb8a18a70daae604be6c24cc3  23.3 MB
md5:d3e3b2f071df7755cec9bf843d93ffd1  78.3 MB
md5:77bd1f82d4c7663d06b67d6da07d2685  32.7 MB
md5:7ec1cee67d663cedef01c3926147bdce  42.9 MB
md5:c1ba6c6bc79f255dd7b9287bf30ccda8  149.9 MB
md5:046762467192138819f697057b82de86  126.1 MB
md5:6072f1a49d6f583a2d3cbf3c3dce2c3b  36.6 MB
md5:4af161fbcc77bb7541c08175ffb6129f  47.5 MB
md5:b1871ddd6929207dee57b1830fdc89a2  48.8 MB
md5:d2cc814908ae7c4581b4c99b18506889  49.2 MB
md5:a9ea82b6ef6bc98e69473aacaf98a8cd  125.7 MB
md5:dddaf015108cffc81e2bf2b3a932d06e  175.6 MB
md5:9c08067ba83a31fcc4962603f6cf9c66  117.3 MB
md5:21ae421f9b54b308346635441c00e9af  120.6 MB
md5:7e650222389b59cf7252602fdb2e11d1  67.8 MB
md5:4badf0d134f0b2ecd2d96a31e6c2ffd4  170.1 MB
md5:10446bb1c06e062b115263475d3550a6  60.2 MB
md5:5171cf1b52e21d64f7f03a4551bf4f81  155.8 MB
md5:e59db9eb5dc8a45defef64b2989a9947  75.2 MB
md5:2cca92fe097be304f0c4e406f4afd69d  132.7 MB
md5:9fc69715d2e06036bb5af8b0b05e01ca  80.9 MB
md5:184fc5f53c2ee0d65115c16dfabb02a0  115.7 MB
md5:d89551ec5b1bf785c6eaddbadcf087c5  72.9 MB
md5:b58486a0ccd96681cd2739c5ab47448a  119.4 MB
md5:1b696dbe706a3919c2d0688f56656edd  67.5 MB
md5:337c0f2b0d8eda2a02d3ee502056f8a0  94.1 MB
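
The md5 values in this listing can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names md5_of_file and matches_listing are hypothetical, and the path and checksum in the usage comment are placeholders to be replaced with a downloaded file and its own entry from the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a listed checksum such as "md5:<hex digest>"."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholders, not executed here):
#   matches_listing("downloaded_file.trees.tsz", "md5:<value from the listing>")
```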
