Published May 14, 2020 | Version v0.2.0
Dataset Open

1,500 simulated transcriptomic variants for MINTIE paper

  • 1. Peter MacCallum Cancer Centre
  • 2. Walter and Eliza Hall Institute

Description

Contains an RNA-seq data set of 1,500 simulated heterozygous transcriptomic variants (500 fusions, 500 splice variants and 500 transcribed structural variants) used in the MINTIE paper. An additional 100 unmodified background genes were added. The control set contains unmodified sequences of all variant genes included in the case sample. Variant information and paired-end reads, as well as the FASTA files from which they were generated, are provided.

Code used to generate these samples can be found under https://github.com/Oshlack/MINTIE/tree/master/simu.

Simulations were generated by extracting sequence from the transcripts listed in the hg38 UCSC RefSeq reference, and simulating reads from the resulting sequence. 100 variants were generated for each of 15 variant types: five fusion types (canonical, extended exon, novel exon, with insertion and unpartnered), five transcribed structural variant (TSV) types (insertions, deletions, ITDs, PTDs and inversions) and five novel splice variant types (extended exons, novel exons, truncated exons, skipped exons and retained introns). Only transcripts from genes that did not overlap any other genes were used in the simulation. Additionally, each transcript had to have at least 3 exons to be considered as a simulation transcript.
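
The variant breakdown and the two transcript eligibility rules can be sketched in Python as follows. This is only an illustration: the `Transcript` record and its `gene_overlaps_other` field are hypothetical names, not part of the MINTIE simulation code linked above.

```python
from dataclasses import dataclass

# Variant classes and types as listed in the description (five types per class).
VARIANT_TYPES = {
    "fusion": ["canonical", "extended exon", "novel exon",
               "with insertion", "unpartnered"],
    "TSV": ["insertion", "deletion", "ITD", "PTD", "inversion"],
    "splice variant": ["extended exon", "novel exon", "truncated exon",
                       "skipped exon", "retained intron"],
}
VARIANTS_PER_TYPE = 100


@dataclass
class Transcript:
    """Hypothetical minimal transcript record (not the MINTIE data model)."""
    name: str
    n_exons: int
    gene_overlaps_other: bool  # True if the parent gene overlaps any other gene


def is_eligible(tx: Transcript, min_exons: int = 3) -> bool:
    """Apply the two stated filters: non-overlapping gene, at least 3 exons."""
    return (not tx.gene_overlaps_other) and tx.n_exons >= min_exons


# 15 variant types x 100 variants each
total = sum(len(types) * VARIANTS_PER_TYPE for types in VARIANT_TYPES.values())
print(total)
```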

All fusions were simulated by selecting the first two and the last two exons, respectively, from two random transcripts from different genes, and inserting any intervening sequence between them. Canonical fusions contained no intervening sequence, while fusions with extended exons included 30-200bp of intronic sequence starting at the end of the second exon of the first transcript. Similarly, fusions with novel exons contained 30-200bp of intronic sequence located 30-200bp downstream. Non-canonical fusions with insertions were generated by inserting 7-50bp of randomly generated sequence between the two fusion transcripts.
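
A minimal Python sketch of the fusion construction, under one reading of the description: the first two exons come from the first transcript and the last two from the second. The function names and the toy exon sequences are hypothetical and are not taken from the MINTIE code.

```python
import random


def random_sequence(length, rng):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_fusion(exons_a, exons_b, intervening=""):
    """Join the first two exons of transcript A, an optional intervening
    sequence, and the last two exons of transcript B (assumed layout)."""
    if len(exons_a) < 2 or len(exons_b) < 2:
        raise ValueError("each transcript needs at least two exons")
    return "".join(exons_a[:2]) + intervening + "".join(exons_b[-2:])


# Toy exon sequences, for illustration only.
exons_a = ["AAAA", "CCCC", "GGGG"]
exons_b = ["TTTT", "ACGT", "TGCA"]

# Canonical fusion: no intervening sequence.
canonical = build_fusion(exons_a, exons_b)

# Fusion with insertion: 7-50bp of random sequence between the two partners.
rng = random.Random(0)
insertion = random_sequence(rng.randint(7, 50), rng)
with_insertion = build_fusion(exons_a, exons_b, intervening=insertion)
```

For the extended exon and novel exon fusion types, `intervening` would instead hold intronic sequence taken from the first transcript.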

Small TSVs were generated by inserting, duplicating or deleting sequence within randomly selected exons from randomly selected transcripts. These small variants were between 7 and 50bp in size and had to reside at least 10bp within the exon. Inversions and partial tandem duplications (PTDs) were generated by selecting 1-3 random exons within a transcript and either inverting their sequence or duplicating it in tandem. Lastly, splice variants were generated. Extended and novel exon variants were created by extending a randomly selected exon or placing a novel exon downstream of it. To ensure that novel or extended exons did not overlap exons from other transcripts (or downstream exons of the same transcript), each candidate exon was checked for these potential overlaps, which would otherwise obscure the variant or create the wrong variant type. Novel junction variants were created by selecting a random pair of exons and checking whether a junction already existed between them (creating a transcript with this junction if not). Truncated exon variants were created by truncating two randomly selected neighbouring exons at their facing ends (end and start respectively) by 30-200bp. Retained intron variants included a random intronic sequence of >30bp from a given transcript. The presence of correct splicing motifs was not considered for the simulation.
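
The size and placement constraints on small TSVs can be illustrated with the Python sketch below. It assumes that "at least 10bp within the exon" means the affected span stays 10bp away from both exon ends, and that inserted sequence is random; neither detail is confirmed by the description, and all names are hypothetical.

```python
import random

MIN_SIZE, MAX_SIZE = 7, 50   # stated variant size range (bp)
MARGIN = 10                  # assumed minimum distance from each exon end (bp)


def pick_span(exon_len, rng):
    """Pick (start, size) so the affected span keeps MARGIN bp on both sides."""
    max_size = min(MAX_SIZE, exon_len - 2 * MARGIN)
    if max_size < MIN_SIZE:
        raise ValueError("exon too short for a small TSV")
    size = rng.randint(MIN_SIZE, max_size)
    start = rng.randint(MARGIN, exon_len - MARGIN - size)
    return start, size


def apply_small_tsv(exon, kind, rng):
    """Return the exon sequence with one small TSV applied."""
    start, size = pick_span(len(exon), rng)
    end = start + size
    if kind == "deletion":
        return exon[:start] + exon[end:]
    if kind == "ITD":
        # Duplicate the selected segment in tandem.
        return exon[:end] + exon[start:end] + exon[end:]
    if kind == "insertion":
        # Assumption: the inserted sequence is randomly generated.
        inserted = "".join(rng.choice("ACGT") for _ in range(size))
        return exon[:start] + inserted + exon[start:]
    raise ValueError(f"unknown variant kind: {kind}")


rng = random.Random(1)
exon = "ACGT" * 25  # toy 100bp exon
deleted = apply_small_tsv(exon, "deletion", rng)
duplicated = apply_small_tsv(exon, "ITD", rng)
inserted = apply_small_tsv(exon, "insertion", rng)
```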

In addition to each variant gene, the sequence of the unaltered wild-type gene was added to the simulated case sample’s reference. An additional 100 unaltered background genes were also added to the case sample. A control sample reference was also generated, which included only the unaltered wild-type sequence for all simulated transcripts. ART-illumina (doi: 10.1093/bioinformatics/btr708) v2.5.8 was run on the corresponding references to simulate 100bp paired-end reads with a fragment size of 300bp and coverage of 50x (transcripts should thus have an effective coverage of 100x, given the bi-allelic reference containing variant and wild-type transcripts).
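
The coverage reasoning can be checked with a short calculation. The Python sketch below uses the usual definition of fold coverage (total read bases divided by sequence length); the function names are hypothetical and the read count is approximate, since ART's exact output may differ.

```python
READ_LENGTH = 100    # bp, as stated
FOLD_COVERAGE = 50   # per reference sequence, as stated


def approx_read_pairs(seq_len, fold=FOLD_COVERAGE, read_len=READ_LENGTH):
    """Approximate read pairs for a sequence:
    coverage = pairs * 2 * read_len / seq_len."""
    return round(fold * seq_len / (2 * read_len))


def effective_locus_coverage(fold=FOLD_COVERAGE, copies=2):
    """Coverage over a locus present as `copies` sequences in the reference
    (variant plus wild-type in the case sample)."""
    return fold * copies


print(approx_read_pairs(2000))      # read pairs for a toy 2,000bp transcript
print(effective_locus_coverage())   # variant + wild-type copies combined
```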

We also include three down-sampled versions of the simulation files (40x, 20x and 10x) used in the MINTIE paper. Note that the variant coverage will be half the sequence coverage. These were down-sampled using seqtk v1.0 (https://github.com/lh3/seqtk).
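
When down-sampling paired-end data, both mates of a pair must be kept or dropped together; with seqtk this is normally done by passing the same seed (`-s`) to `seqtk sample` for both read files. The Python sketch below is a stand-in that illustrates the idea on in-memory lists. It is not seqtk's algorithm, and the function names and `fraction` parameter are hypothetical.

```python
import random


def variant_coverage(sequence_coverage):
    """Variant coverage is half the sequence coverage, since each locus is
    represented by one variant and one wild-type copy."""
    return sequence_coverage / 2


def subsample_pairs(reads1, reads2, fraction, seed=0):
    """Keep each read pair with probability `fraction`, making the same
    keep/drop decision for both mates so the files stay in sync."""
    if len(reads1) != len(reads2):
        raise ValueError("paired read lists must have the same length")
    rng = random.Random(seed)
    kept1, kept2 = [], []
    for mate1, mate2 in zip(reads1, reads2):
        if rng.random() < fraction:
            kept1.append(mate1)
            kept2.append(mate2)
    return kept1, kept2


# Variant coverage for the three down-sampled sets.
for cov in (40, 20, 10):
    print(cov, variant_coverage(cov))
```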

Files

Total size: 798.8 MB

Checksum                              Size
md5:605b6e6515336cf14b9b9de9f70d685f  14.9 MB
md5:ea4f7bf2d7eac145306159a9b83936d5  15.4 MB
md5:ee3a19d69d41295faac1a2357dd5d2b8  28.4 MB
md5:8f2aecf6600f549f2cb02911ae42a5fb  29.5 MB
md5:1b3fd33fcde19eb50204f349464a2012  55.0 MB
md5:f31a68c588953720eec5710e50004ba3  57.2 MB
md5:b4191e77396966cd27c5f800c1594a3f  11.4 MB
md5:fb81ba4fdba785e6808134bf98f0859e  134.3 MB
md5:ec8fb70eff51bd998c1d00db163799a3  139.8 MB
md5:3798e7859d4a5ce2bbc56fe8bf1509ab  6.4 MB
md5:40f1af2d28dd10ccd3b736c4d9030081  150.1 MB
md5:647209271133505773b1e1b8fb1a2c5b  156.4 MB
md5:2a4df09db1b85ebd7569dc251895819f  59.9 kB
md5:e93ab60c02614cbcc17f04aa8affb03d  56.6 kB