Published May 20, 2019 | Version 1.0.0
Dataset Open

Inferring whole-genome histories in large population datasets: inferred tree sequences for Simons Genome Diversity Project

  • 1. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford

Description

Tree sequences were inferred for the SGDP autosomes using tsinfer version 0.1.4 and compressed using tszip. They can be decompressed as follows:

$ tsunzip sgdp_chr1.trees.tsz

Once decompressed, the .trees files can be loaded and processed using tskit:

import tskit
ts = tskit.load("sgdp_chr1.trees")
# ts is an instance of tskit.TreeSequence 
print("Chromosome 1 contains {} trees".format(ts.num_trees))
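As an alternative to running tsunzip first, a compressed file can also be read directly from Python with the tszip package. The following is a minimal sketch, assuming both tszip and tskit are installed; the helper name load_tree_sequence is hypothetical and is not part of either library.

```python
from pathlib import Path


def load_tree_sequence(path):
    """Load a tree sequence from a .trees file or a tszip-compressed .tsz file.

    Hypothetical convenience helper: it dispatches on the file suffix only.
    """
    path = Path(path)
    if path.suffix == ".tsz":
        # tszip.decompress reads the compressed file and returns a
        # tskit.TreeSequence without writing a .trees file to disk.
        import tszip

        return tszip.decompress(str(path))
    # Anything else is assumed to be an uncompressed tree sequence file.
    import tskit

    return tskit.load(str(path))
```

For example, load_tree_sequence("sgdp_chr1.trees.tsz") should give the same tree sequence as decompressing with tsunzip and then calling tskit.load("sgdp_chr1.trees").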

Metadata associated with individuals and populations was derived from the original source and converted to JSON form. For example, to access individual metadata we can use:

import tskit
import json
ts = tskit.load("sgdp_chr1.trees")
ind = ts.individual(0)
# Individual metadata is stored as JSON, so decode it into a dictionary
metadata_dict = json.loads(ind.metadata)

The metadata_dict variable will now contain all the metadata for the individual with ID 0 as a dictionary. Metadata associated with populations can be found in a similar way. Population IDs are associated with individuals via their constituent nodes. For example,

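# Decode the metadata of every population; list position equals population ID
pop_metadata = [json.loads(pop.metadata) for pop in ts.populations()]
# Use the individual's first node to find the population it belongs to
ind_node = ts.node(ind.nodes[0])
ind_pop_metadata = pop_metadata[ind_node.population]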

After this, the ind_pop_metadata variable will contain the population level metadata for individual ID 0.
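The same lookup can be applied to every individual at once. The sketch below groups individual IDs by the population ID of their first node. The function names decode_metadata and individuals_by_population are hypothetical helpers, and the code assumes that all nodes of an individual share one population, which is not checked here.

```python
import json
from collections import defaultdict


def decode_metadata(raw):
    """Decode JSON metadata; return an empty dict when no metadata is stored."""
    return json.loads(raw) if raw else {}


def individuals_by_population(ts):
    """Map each population ID to the list of individual IDs assigned to it.

    `ts` is expected to behave like a tskit.TreeSequence, i.e. to provide
    ts.individuals() and ts.node(node_id).
    """
    groups = defaultdict(list)
    for ind in ts.individuals():
        if len(ind.nodes) == 0:
            # An individual without nodes has no population to look up.
            continue
        # Assumption: the first node is representative of the individual.
        pop_id = ts.node(ind.nodes[0]).population
        groups[pop_id].append(ind.id)
    return dict(groups)
```

With ts loaded as above, individuals_by_population(ts) returns a dictionary whose keys index into ts.populations(), so the population metadata for a group can be obtained with decode_metadata(ts.population(pop_id).metadata).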

The full data pipeline used to generate these tree sequences and associated metadata is available on GitHub.

Notes

AWW and CF would like to thank the Rhodes Trust for their generous support.

Files

Files (472.1 MB)

Checksum                                Size
md5:f8d1a636b8fca006146ec298c6fc8535    36.6 MB
md5:1be38f5f2f04c8192a0342e3d8dc0b8e    23.4 MB
md5:8bb9781ecb36e9ec3189b04fb043a99b    22.6 MB
md5:a9813059fedae11bff601eb662c1662e    22.1 MB
md5:0906f923207bd73e79e08c5a5390f43f    17.2 MB
md5:083088cb1edb3042700af2c4326d5953    15.3 MB
md5:c772c731253bf306f3029426ae3d9ad8    14.8 MB
md5:bf3c9bb93fd2710de4c876d73be56282    15.9 MB
md5:2f01e6fd030f5014db9efb176e8e0709    13.0 MB
md5:10768c90dabbf4224011ab5ab7e52720    14.6 MB
md5:d7d871320eb43515ba7860d52c1525ab    7.5 MB
md5:369a4a2686d8d63fc09c5d5e9efc5c61    41.0 MB
md5:17152a89f855f30a3c149305b7ff7168    11.3 MB
md5:33e75ebc359f91f5011ebc49802655c9    7.1 MB
md5:e4dc74481fe4681846488c8faacb1d23    6.2 MB
md5:58bf9282e13c899045b6c47dd98c0701    35.4 MB
md5:25aec6168a9818bd48821cef34881938    33.0 MB
md5:fa2d5baf6d770666a5c9cca6054ce6ce    30.8 MB
md5:b6378eeffd02d2b784ffed6a90dc3ed0    29.2 MB
md5:56d723cb1e92170ca15e80dd912392a2    26.5 MB
md5:84b88219eda5a244eaca0e23a7f175e2    27.0 MB
md5:063b07e7d0c42659d231f5c12bcdb7b0    21.6 MB

Additional details

Funding

Wellcome Trust
The Genetic Analysis of Populations. 100956
