A unified genealogy of modern and ancient genomes: Unified, inferred tree sequences of 1000 Genomes, Human Genome Diversity, and Simons Genome Diversity Projects
Authors/Creators
- 1. Broad Institute of MIT and Harvard, Cambridge, MA, USA
- 2. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- 3. Harvard Medical School Department of Genetics, Boston, MA, USA
- 4. Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
Description
Unified, inferred tree sequences built from the 1000 Genomes phase 3, Human Genome Diversity, and Simons Genome Diversity Projects. Each tree sequence covers one arm of an autosome (the short arms of acrocentric chromosomes are not included). Tree sequences were inferred using tsinfer version 0.2.1, dated using tsdate version 0.1.4, and compressed using tszip. All data use GRCh38 coordinates.
The full data pipeline used to generate these tree sequences and associated metadata is available on GitHub. A description can be found in the Supplementary Material of Wohns et al. (2021).
Tree sequences can be decompressed as follows:
$ tsunzip hgdp_tgp_sgdp_chr1_p.dated.trees.tsz
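Because each file holds a single chromosome arm, it can be convenient to enumerate the file names before decompressing them in a loop. The sketch below does this under a loud assumption: the naming pattern is extrapolated from the one example file name above and is not confirmed for the other files. The function name and its parameters are hypothetical. It skips the short arms of the acrocentric autosomes (13, 14, 15, 21, and 22), which yields 39 names.

```python
# Acrocentric human autosomes; their short (p) arms are not in the dataset.
ACROCENTRIC = {13, 14, 15, 21, 22}


def expected_arm_files(prefix="hgdp_tgp_sgdp", suffix=".dated.trees.tsz"):
    """List the compressed file names expected for each included autosome arm.

    ASSUMPTION: the pattern <prefix>_chr<N>_<arm><suffix> is extrapolated from
    the single example file name; check it against the actual downloads.
    """
    names = []
    for chrom in range(1, 23):
        for arm in ("p", "q"):
            if arm == "p" and chrom in ACROCENTRIC:
                continue  # short arms of acrocentric chromosomes are excluded
            names.append(f"{prefix}_chr{chrom}_{arm}{suffix}")
    return names


files = expected_arm_files()
print(len(files), "files, starting with", files[0])
```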
Once decompressed, the .trees files can be loaded and processed in Python using tskit.
import tskit
ts = tskit.load("hgdp_tgp_sgdp_chr1_p.dated.trees")
# ts is an instance of tskit.TreeSequence
print("The short arm of chromosome 1 contains {} trees".format(ts.num_trees))
Metadata associated with nodes contains the mean and variance of tsdate's posterior distribution on node time. To access these values, we can use:
import json
node = ts.node(10000)
metadata_dict = json.loads(node.metadata)
print("The mean of the posterior distribution on the age of node 10000 is {} generations".format(metadata_dict["mn"]))
print("The variance of the posterior distribution on the age of node 10000 is {} generations squared".format(metadata_dict["vr"]))
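The variance is easier to interpret once converted to a standard deviation, which is in generations like the mean. The helper below is a minimal sketch, not part of tskit or tsdate: the function name is hypothetical, and it assumes the "mn" and "vr" keys shown above. It also assumes that some nodes may carry empty metadata or lack those keys, in which case it returns None.

```python
import json
import math


def node_age_summary(raw_metadata):
    """Return (mean, standard deviation) of a node's posterior age in generations.

    ASSUMPTION: metadata is JSON (bytes or str) with keys "mn" and "vr".
    An already-decoded dict is accepted too. Returns None if the metadata
    is empty or either key is missing.
    """
    if not raw_metadata:
        return None
    if isinstance(raw_metadata, dict):
        metadata = raw_metadata
    else:
        metadata = json.loads(raw_metadata)
    if "mn" not in metadata or "vr" not in metadata:
        return None
    return metadata["mn"], math.sqrt(metadata["vr"])


# Stand-in for ts.node(10000).metadata; these numbers are made up.
example = b'{"mn": 1200.0, "vr": 2500.0}'
mean, sd = node_age_summary(example)
print(f"mean = {mean} generations, sd = {sd} generations")
```

With a loaded tree sequence, the same call would be node_age_summary(ts.node(10000).metadata).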
Age estimates for each variant site can be derived from the mean of the age estimates of the upper and lower bounding nodes of the oldest mutation associated with a site. tsdate includes a function to find the age estimates of all sites in the tree sequence:
import tsdate
site_times = tsdate.sites_time_from_ts(ts, node_selection='arithmetic')
This returns a NumPy array whose length equals the number of sites.
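To make the 'arithmetic' rule concrete, here is a small worked sketch of the calculation described above. It is an illustration of the description, not tsdate's implementation: the function name and the (child_age, parent_age) input format are assumptions made for this example. Each mutation is dated to the midpoint of the ages of the nodes bounding its branch, and the site takes the age of its oldest mutation.

```python
def site_age_arithmetic(mutations):
    """Estimate a site's age from the mutations that occur at it.

    `mutations` is a list of (child_age, parent_age) pairs in generations:
    the ages of the nodes below and above the branch carrying each mutation.
    Returns None for a site with no mutations.
    """
    if not mutations:
        return None
    # Date each mutation to the midpoint of its branch, then keep the oldest.
    return max((child_age + parent_age) / 2 for child_age, parent_age in mutations)


# One mutation on a branch spanning 100 to 300 generations.
print(site_age_arithmetic([(100.0, 300.0)]))
# Two mutations at the same site; the older midpoint is used.
print(site_age_arithmetic([(100.0, 300.0), (50.0, 700.0)]))
```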
Each site in the tree sequence records the position of its variant, and the site metadata holds the variant ID:
site = ts.site(1000)
site_metadata = json.loads(site.metadata)
print("The position of site 1000 is {} and its ID is {}.".format(site.position, site_metadata["ID"]))
Metadata associated with individuals and populations was derived from the original sources (TGP, HGDP, and SGDP) and converted to JSON form. For example, to access individual metadata we can use:
ind = ts.individual(0)
metadata_dict = json.loads(ind.metadata)
The metadata_dict variable will now contain all the metadata for the individual with ID 0 as a dictionary. Metadata associated with populations can be found in a similar way. Population IDs are associated with individuals via their constituent nodes. For example,
pop_metadata = [json.loads(pop.metadata) for pop in ts.populations()]
ind_node = ts.node(ind.nodes[0])
ind_pop_metadata = pop_metadata[ind_node.population]
After this, the ind_pop_metadata variable will contain the population level metadata for individual ID 0.
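The lookup above uses only the first node of an individual. A slightly more defensive sketch is given below; the function name is hypothetical, and it assumes JSON-encoded population metadata as described above. It checks that all of an individual's nodes point to the same population before returning that population's metadata. It relies only on the ts.individual, ts.node, and ts.population accessors.

```python
import json


def individual_population(ts, individual_id):
    """Return the population metadata dict for one individual.

    `ts` is expected to behave like a tskit.TreeSequence. Raises ValueError
    if the individual's nodes do not all belong to exactly one population.
    """
    ind = ts.individual(individual_id)
    populations = {ts.node(node_id).population for node_id in ind.nodes}
    if len(populations) != 1:
        raise ValueError(
            f"individual {individual_id} maps to populations {sorted(populations)}"
        )
    (pop_id,) = populations
    return json.loads(ts.population(pop_id).metadata)
```

With a loaded tree sequence, individual_population(ts, 0) should give the same dictionary as ind_pop_metadata in the example above.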
Files
Total size: 3.5 GB
| MD5 checksum | Size |
|---|---|
| b3df74072a42a87856f5a8e5042a1b84 | 55.9 MB |
| ced040878fd4186d485e5b54b3cd7838 | 115.8 MB |
| 955dcb7a4ce6a6fa87cbf8cb32409378 | 63.7 MB |
| 738018952cd2bc20adc59503256b3e87 | 100.2 MB |
| b3c6965fbeee8d03d2a37d35c5ddd696 | 46.3 MB |
| d2550904e2dff8361f837c137a446e6f | 119.0 MB |
| 57955f6bdc38818fdad086e987ed4d2a | 124.3 MB |
| 48a8e488cc97c3402cb32ffd21f28d36 | 114.4 MB |
| ec313f1181c1c804f40ebe13aa269cfc | 110.0 MB |
| 4a70525d9cd4a51c499ce758e2bebf5b | 49.7 MB |
| e0cb87835020b4fc299f4479d699ccaf | 64.5 MB |
| 37f83d76d69bfee68df4b75bb84abf95 | 32.1 MB |
| bbeeb5adf7960039aa9407d995361315 | 71.9 MB |
| 88f2b5aeb8a18a70daae604be6c24cc3 | 23.3 MB |
| d3e3b2f071df7755cec9bf843d93ffd1 | 78.3 MB |
| 77bd1f82d4c7663d06b67d6da07d2685 | 32.7 MB |
| 7ec1cee67d663cedef01c3926147bdce | 42.9 MB |
| c1ba6c6bc79f255dd7b9287bf30ccda8 | 149.9 MB |
| 046762467192138819f697057b82de86 | 126.1 MB |
| 6072f1a49d6f583a2d3cbf3c3dce2c3b | 36.6 MB |
| 4af161fbcc77bb7541c08175ffb6129f | 47.5 MB |
| b1871ddd6929207dee57b1830fdc89a2 | 48.8 MB |
| d2cc814908ae7c4581b4c99b18506889 | 49.2 MB |
| a9ea82b6ef6bc98e69473aacaf98a8cd | 125.7 MB |
| dddaf015108cffc81e2bf2b3a932d06e | 175.6 MB |
| 9c08067ba83a31fcc4962603f6cf9c66 | 117.3 MB |
| 21ae421f9b54b308346635441c00e9af | 120.6 MB |
| 7e650222389b59cf7252602fdb2e11d1 | 67.8 MB |
| 4badf0d134f0b2ecd2d96a31e6c2ffd4 | 170.1 MB |
| 10446bb1c06e062b115263475d3550a6 | 60.2 MB |
| 5171cf1b52e21d64f7f03a4551bf4f81 | 155.8 MB |
| e59db9eb5dc8a45defef64b2989a9947 | 75.2 MB |
| 2cca92fe097be304f0c4e406f4afd69d | 132.7 MB |
| 9fc69715d2e06036bb5af8b0b05e01ca | 80.9 MB |
| 184fc5f53c2ee0d65115c16dfabb02a0 | 115.7 MB |
| d89551ec5b1bf785c6eaddbadcf087c5 | 72.9 MB |
| b58486a0ccd96681cd2739c5ab47448a | 119.4 MB |
| 1b696dbe706a3919c2d0688f56656edd | 67.5 MB |
| 337c0f2b0d8eda2a02d3ee502056f8a0 | 94.1 MB |
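Since an MD5 checksum is listed for each file, a download can be verified before decompression. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the caller must supply the checksum that the record page lists for the specific file being checked.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be a bare hex digest or carry the "md5:" prefix
    used in the file listing; case is ignored.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low for files of the sizes listed here.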
Additional details
References
- Wohns et al. (2021). A unified genealogy of modern and ancient genomes. doi: https://doi.org/10.1101/2021.02.16.431497