Dataset Open Access

Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods

Criscuolo Alexis


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.4034462", 
  "author": [
    {
      "family": "Criscuolo Alexis"
    }
  ], 
  "issued": {
    "date-parts": [
      [
        2020, 
        9, 
        17
      ]
    ]
  }, 
  "abstract": "<p>This repository contains 24,000 pairs of nucleotide sequences (and associated parameters) that have been simulated for testing alignment-free genome distance estimates. Given an evolutionary distance <em>d</em> varying from 0.05 to 1.00 nucleotide substitutions per character (step = 0.05), the program <a href=\"http://abacus.gene.ucl.ac.uk/software/indelible/\"><em>INDELible</em></a> was used to simulate the evolution of 200 nucleotide sequence pairs with <em>d</em> substitution events per character under the models GTR and GTR+&Gamma;. Each model was adjusted with three different equilibrium frequencies:</p>\n\n<ul>\n\t<li><em>f</em><sub>1</sub>: equal frequencies, i.e. freq(A) = freq(C) = freq(G) = freq(T) = 0.25,</li>\n\t<li><em>f</em><sub>2</sub>: GC-rich, i.e. freq(A) = 0.1, freq(C) = 0.3, freq(G) = 0.4, freq(T) = 0.2,</li>\n\t<li><em>f</em><sub>3</sub>: AT-rich, i.e. freq(A) = freq(T) = 0.4, freq(C) = freq(G) = 0.1.</li>\n</ul>\n\n<p>For each simulated sequence pair, model parameters (i.e. GTR: six relative rates of nucleotide substitution; GTR+&Gamma;: six rates and one &Gamma; shape parameter) were randomly drawn from 142 sets of parameters derived from real-case data (see file <a href=\"https://zenodo.org/record/4034261/files/GTR.params.trees.tsv?download=1\">GTR.params.trees.tsv</a> at <a href=\"https://zenodo.org/record/4034261\">https://zenodo.org/record/4034261</a>). 
Initial sequence length was 5 Mbs, and an indel rate of 0.01 was set with indel length drawn from [1, 50000] according to a Zipf distribution with parameter 1.5 (see <em>INDELible</em> <a href=\"http://abacus.gene.ucl.ac.uk/software/indelible/manual/model.shtml\">manual</a>).</p>\n\n<p>&nbsp;</p>\n\n<p>For each of the 20 evolutionary distances <em>d</em> = 0.05, 0.10, ..., 1.00, six XZ-compressed files containing 200 simulation data are available:</p>\n\n<ul>\n\t<li><code>data-d-f1-nogam.tsv.xz</code> &nbsp; data simulated under the model GTR with equilibrium frequencies <em>f</em><sub>1</sub></li>\n\t<li><code>data-d-f1-gamma.tsv.xz</code> &nbsp; data simulated under the model GTR+&Gamma; with equilibrium frequencies <em>f</em><sub>1</sub></li>\n\t<li><code>data-d-f2-nogam.tsv.xz</code> &nbsp; data simulated under the model GTR with equilibrium frequencies <em>f</em><sub>2</sub></li>\n\t<li><code>data-d-f2-gamma.tsv.xz</code> &nbsp; data simulated under the model GTR+&Gamma; with equilibrium frequencies <em>f</em><sub>2</sub></li>\n\t<li><code>data-d-f3-nogam.tsv.xz</code> &nbsp; data simulated under the model GTR with equilibrium frequencies <em>f</em><sub>3</sub></li>\n\t<li><code>data-d-f3-gamma.tsv.xz</code> &nbsp; data simulated under the model GTR+&Gamma; with equilibrium frequencies <em>f</em><sub>3</sub></li>\n</ul>\n\n<p>&nbsp;</p>\n\n<p>Each file is tab-delimited and contains the 18 following fields:</p>\n\n<ul>\n\t<li><code>[1]&nbsp; &nbsp;</code>&nbsp;&nbsp; integer <em>seed</em> value specified to <em>INDELible</em>,</li>\n\t<li><code>[2-5]&nbsp;</code> &nbsp; frequencies of T, C, A, G, respectively, specified to <em>INDELible</em>,</li>\n\t<li><code>[6-10]&nbsp;</code> C-T, A-T, G-T, A-C, C-G rate parameters, respectively (normalized such that A-G rate = 1), specified to <em>INDELible</em>,</li>\n\t<li><code>[11] &nbsp; </code> &nbsp; &Gamma; shape parameter <em>alpha</em> (= 0 in the <code>nogam</code> files, i.e. 
GTR substitution model without &Gamma;) specified to <em>INDELible</em>,</li>\n\t<li><code>[12] &nbsp; </code> &nbsp; length <em>lgt1</em> of the first sequence <em>seq1</em> (i.e. no. A, C, G, T in <em>seq1</em>),</li>\n\t<li><code>[13] &nbsp; </code> &nbsp; length <em>lgt2</em> of the second sequence <em>seq2</em> (i.e. no. A, C, G, T in <em>seq2</em>),</li>\n\t<li><code>[14] &nbsp; </code> &nbsp; no. <em>sites</em> in aligned sequences <em>seq1</em> and <em>seq2</em> (i.e. no. A, C, G, T and gap character states in <em>seq1</em> or <em>seq2</em>),</li>\n\t<li><code>[15] &nbsp; </code> &nbsp; no. non-gapped sites (<em>core</em> sites) in aligned sequences <em>seq1</em> and <em>seq2</em>,</li>\n\t<li><code>[16] &nbsp; </code> &nbsp; observed <em>p-distance</em> between aligned sequences <em>seq1</em> and <em>seq2</em> (i.e. no. nucleotide mismatches divided by no. <em>core</em> sites),</li>\n\t<li><code>[17] &nbsp; </code> &nbsp; aligned <em>seq1</em> (containing indel gaps),</li>\n\t<li><code>[18] &nbsp; </code> &nbsp; aligned <em>seq2</em> (containing indel gaps).</li>\n</ul>\n\n<p>Of note, <em>seq1</em> and <em>seq2</em> (fields <code>[17-18]</code>) being aligned, these two entries are two strings with identical no. <em>sites</em> (field <code>[14]</code>). Gap character states (<code>-</code>) should be removed from <em>seq1</em> and <em>seq2</em> to obtain the unaligned sequences.</p>\n\n<p>_____</p>\n\n<p>Criscuolo A (2020) <em>On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference</em>. F1000Research, 9:1309. <a href=\"https://doi.org/10.12688/f1000research.26930.1\">doi:10.12688/f1000research.26930.1</a></p>", 
  "title": "Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods", 
  "type": "dataset", 
  "id": "4034462"
}
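
The 18-field record layout described in the abstract can be read with a few lines of Python. The following is a minimal sketch and is not part of the dataset: the names `SimulatedPair`, `parse_record`, `count_core_and_mismatches`, `ungap` and `read_pairs` are hypothetical, and the code assumes that the files contain no header line, one simulated pair per line, and `-` as the only gap character.

```python
import lzma
from dataclasses import dataclass


@dataclass
class SimulatedPair:
    seed: int          # field 1
    freqs: dict        # fields 2-5, keyed by nucleotide (T, C, A, G)
    rates: dict        # fields 6-10, relative to an A-G rate of 1
    alpha: float       # field 11, 0 in the nogam files
    lgt1: int          # field 12
    lgt2: int          # field 13
    sites: int         # field 14
    core: int          # field 15
    p_distance: float  # field 16
    seq1: str          # field 17, aligned (with gaps)
    seq2: str          # field 18, aligned (with gaps)


def parse_record(line):
    """Split one tab-delimited line into the 18 documented fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 18:
        raise ValueError(f"expected 18 fields, found {len(f)}")
    return SimulatedPair(
        seed=int(f[0]),
        freqs=dict(zip("TCAG", map(float, f[1:5]))),
        rates=dict(zip(("C-T", "A-T", "G-T", "A-C", "C-G"), map(float, f[5:10]))),
        alpha=float(f[10]),
        lgt1=int(f[11]),
        lgt2=int(f[12]),
        sites=int(f[13]),
        core=int(f[14]),
        p_distance=float(f[15]),
        seq1=f[16],
        seq2=f[17],
    )


def count_core_and_mismatches(seq1, seq2):
    """Return (core sites, mismatches) over sites where neither sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    core = mismatches = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue
        core += 1
        if a != b:
            mismatches += 1
    return core, mismatches


def ungap(seq):
    """Remove gap characters to recover the unaligned sequence."""
    return seq.replace("-", "")


def read_pairs(path):
    """Lazily yield one SimulatedPair per non-empty line of an XZ-compressed file."""
    with lzma.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)
```

With these helpers, the p-distance of field `[16]` can be recomputed as mismatches divided by core sites and compared with the stored value, and `ungap` gives the unaligned input that an alignment-free method would receive. Because each aligned sequence is several Mb long, the pure-Python loop in `count_core_and_mismatches` is slow; it is meant as a readable check, not an optimized implementation.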

Cite as