Published October 2, 2024 | Version v1
Dataset Open

Twigstats scripts and example dataset

Description

This repository provides all scripts needed to run Relate and Twigstats on imputed ancient genomes. We also provide a complete, self-contained example dataset, and the same scripts should work unchanged on your own datasets.

Installation

Download

To run the scripts on your own dataset, please download scripts.tgz and Relate_input_files.tgz.

To run the provided example, please additionally download example_data_chr1.tgz or example_data.tgz.

All output files that are generated by run_wg.sh are stored under results/.

Running the scripts

Please extract the tarballs, e.g. using tar -xzvf scripts.tgz.
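
As a minimal sketch, the archives named in the Download section could be unpacked in one pass. The function name extract_archives is hypothetical and not part of this repository; it extracts into the current directory and skips any archive that has not been downloaded.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): extract each archive
# given as an argument into the current directory, skipping missing ones.
extract_archives() {
  local archive
  for archive in "$@"; do
    if [ -f "$archive" ]; then
      tar -xzf "$archive"
      echo "extracted: $archive"
    else
      echo "skipped (not found): $archive" >&2
    fi
  done
}

# Archive names as listed in the Download section.
extract_archives scripts.tgz Relate_input_files.tgz example_data_chr1.tgz
```

Replace example_data_chr1.tgz with example_data.tgz to unpack the whole-genome example instead.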

The script run.sh shows how to run everything 'in order' for chromosome 1. The script run_wg.sh runs everything for the whole genome.
The individual scripts that they call can be found under scripts/.

Input files

The directory example_data_chr1 stores files for only chromosome 1, whereas example_data stores files for the whole genome.

Under example_data/ and example_data_chr1/ you will find the following files:

  • GLIMPSE-imputed VCF, here named ancients_glimpse2_chr1.bcf.
  • Modern VCF (e.g. 1000G), here named 1000GP_sub_chr1.bcf.
  • A poplabels file listing population labels for each individual. Individuals have to appear in the same order as in the merged VCF file. The file should contain four columns: ID POP GROUP SEX. The second column is used for population assignment.
  • A second poplabels file used for the MDS analysis. The second column should now list the IDs of all individuals plotted in the MDS (i.e. it should be identical to the first column). The outgroup should be grouped together into one population.
  • A file containing sample ages in generations, with two lines per sample (diploid), e.g. for 3 samples of ages 0, 10, and 100 generations:
    0
    0
    10
    10
    100
    100
  • We provide all the other required Relate input files under Relate_input_files/. You can reuse these in your analysis.
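
The file formats described in the list above can be produced and sanity-checked with a few lines of shell. This is only a sketch: the function names make_sample_ages, check_sample_ages, and check_poplabels are hypothetical and not part of this repository, and the checks cover only what is stated above (two identical age lines per sample, four columns in the poplabels file).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository).

# Print each sample age twice (two lines per diploid sample).
# Usage: make_sample_ages 0 10 100 > sample_ages.txt
make_sample_ages() {
  local age
  for age in "$@"; do
    printf '%s\n%s\n' "$age" "$age"
  done
}

# Check that an ages file has an even number of lines and that the two
# lines belonging to each sample carry the same age.
check_sample_ages() {
  awk '
    NR % 2 == 1 { first = $1 }
    NR % 2 == 0 && $1 != first {
      print "lines " (NR - 1) " and " NR " differ"; bad = 1
    }
    END {
      if (NR % 2 == 1) { print "odd number of lines"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}

# Check that every line of a poplabels file has exactly four
# whitespace-separated columns (ID POP GROUP SEX).
check_poplabels() {
  awk '
    NF != 4 { print "line " NR ": expected 4 columns, found " NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Reproduces the six-line example above (ages 0, 10, and 100 generations).
make_sample_ages 0 10 100
```

Both check functions return a non-zero exit status when a problem is found, so they can be used as guards before starting the pipeline.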

In this example, we use data from the 1000 Genomes Project dataset (Nature 2015). We additionally use low-coverage shotgun genomes from Anglo-Saxon contexts, the British Iron/Roman Age, the Irish Bronze Age, and the Scandinavian Early Iron Age (Cassidy et al., PNAS 2016; Martiniano et al., Nature Communications 2016; Anastasiadou et al., Communications Biology 2023; Schiffels et al., Nature Communications 2016; Gretzinger et al., Nature 2022; Rodriguez-Varela et al., Cell 2023). These were imputed using GLIMPSE (https://odelaneau.github.io/GLIMPSE).

Step by step guide

Please follow run.sh (chromosome 1 only). The script run_wg.sh will run the whole genome.

These scripts will:

  1. Run scripts/1_prep_vcf.sh to filter the imputed genotypes.
  2. Run scripts/2_prep_Relate.sh to prepare the Relate input files.
  3. Run scripts/3_run_Relate.sh to estimate genealogies.

You can then use the resulting Relate files for various analyses:

  • You can run Twigstats and infer admixture proportions using Rscript scripts/4_run_Twigstats.R.
  • You can estimate coalescence rates and population sizes using Rscript scripts/5_plot_popsize.R.
  • You can run an MDS using Rscript scripts/6_plot_MDS.R.

To see the arguments required in each script, you can execute the script without arguments, e.g. by executing scripts/1_prep_vcf.sh or Rscript scripts/4_run_Twigstats.R.
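
To list the arguments of every script at once, something like the following sketch could be used. The function name show_script_usage is hypothetical and not part of this repository. It assumes that the .sh files are bash scripts and that Rscript is on your PATH, and it relies on the behaviour described above, namely that each script prints its required arguments when run without any.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): run every script in a
# directory without arguments so that each prints its own usage message.
show_script_usage() {
  local dir="${1:-scripts}" script
  for script in "$dir"/*.sh "$dir"/*.R; do
    [ -f "$script" ] || continue   # skip patterns that matched no file
    echo "== $script =="
    case "$script" in
      *.sh) bash "$script" || true ;;     # assumption: the .sh files are bash scripts
      *.R)  Rscript "$script" || true ;;  # a usage message may come with a non-zero exit
    esac
  done
}

show_script_usage scripts
```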

The expected output is shown in the attached PDF (Fig1.pdf).

Files

Fig1.pdf

Files (7.5 GB)

Checksum                               Size
md5:d577433626ee4923c4b3dcdf133515dc   3.4 GB
md5:8e7e3f23e95115f4cdfeb9c8f7865270   273.4 MB
md5:86600acea139d78206757bb8eb1a55a8   152.0 kB
md5:f5082440da3424f061143cd3fd2f674e   2.3 GB
md5:bd17f09a941a750ce7f3fe45072f3baf   1.5 GB
md5:72834206b3171c9203d3c88984cbd6e3   1.8 kB
md5:eb6c325edfcfe702123c01af6da4d62c   1.9 kB
md5:d6e09afa397f60d1ac103420f7feb820   8.5 kB

Additional details