Dataset Open Access

Discovery of tandem and interspersed segmental duplications using high throughput sequencing

Soylev, Arda; Le, Thong; Amini, Hajar; Alkan, Can; Hormozdiari, Fereydoun

We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short-read whole genome sequencing data sets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real data sets. In the simulation experiments, at 30x coverage TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than for state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate for our predictions of tandem, direct and inverted interspersed segmental duplications on CHM1 (less than 5% for the top 50 predictions).

Here we deposit current versions of TARDIS (1.0.2) and CNVSim, and all predictions, truth sets, and the CRAM files for the simulation data.
Files (42.0 GB)
Name                          Size      MD5
all_predictions.tar.gz        28.4 MB   c663e530cea2659a379d37b40d3c3cba
CNVSim.zip                    4.5 kB    593af83a02225d01d0997a61e52ae874
simulation-10x.cram           3.4 GB    9377091f45f0a14c3cee8ee891d93539
simulation-20x.cram           6.7 GB    711d7cb52288c2dca59306b68603e3c2
simulation-30x.cram           10.2 GB   4cf53f7999a3e0730867535cbe6f0e29
simulation-30x_Y.cram         1.2 GB    f4be2d2b334b1251cfa923e40b7e2696
simulation-60x.cram           20.5 GB   a7a134a243372c8456f2b7115f99fa53
tardis-1.0.2.tar.gz           20.8 MB   71889ccba4c3ec306c247bde37232079
true_calls_realdata.tar.gz    115.9 kB  461146745122ed9a460a3acd3f9a8aae
true_calls_simulation.tar.gz  223.5 kB  37865fd65134525395b6aa3d6f1611ab
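
Because several of the deposited files are multiple gigabytes, it may be worth checking each download against the MD5 digests listed above before use. The following is a minimal sketch, not part of the deposit: the helper names `md5_of` and `verify` are hypothetical, and it assumes the files were saved under their listed names in a single directory. The digests are copied from the file listing.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the file listing of this record.
EXPECTED_MD5 = {
    "all_predictions.tar.gz": "c663e530cea2659a379d37b40d3c3cba",
    "CNVSim.zip": "593af83a02225d01d0997a61e52ae874",
    "simulation-10x.cram": "9377091f45f0a14c3cee8ee891d93539",
    "simulation-20x.cram": "711d7cb52288c2dca59306b68603e3c2",
    "simulation-30x.cram": "4cf53f7999a3e0730867535cbe6f0e29",
    "simulation-30x_Y.cram": "f4be2d2b334b1251cfa923e40b7e2696",
    "simulation-60x.cram": "a7a134a243372c8456f2b7115f99fa53",
    "tardis-1.0.2.tar.gz": "71889ccba4c3ec306c247bde37232079",
    "true_calls_realdata.tar.gz": "461146745122ed9a460a3acd3f9a8aae",
    "true_calls_simulation.tar.gz": "37865fd65134525395b6aa3d6f1611ab",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file found in `directory` to True (match) or False (mismatch).

    Files that are not present are skipped, so partial downloads of the
    record can still be checked.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Run from the download directory, the script prints one line per file it finds. A mismatch most likely indicates a truncated or corrupted download, in which case the file should be fetched again.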
