Dataset Open Access

Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods

Criscuolo Alexis


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.4034462</identifier>
  <creators>
    <creator>
      <creatorName>Criscuolo Alexis</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-8212-5215</nameIdentifier>
      <affiliation>Institut Pasteur</affiliation>
    </creator>
  </creators>
  <titles>
    <title>Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2020</publicationYear>
  <subjects>
    <subject>simulation</subject>
    <subject>genomes</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2020-09-17</date>
  </dates>
  <resourceType resourceTypeGeneral="Dataset"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/4034462</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.4034461</relatedIdentifier>
  </relatedIdentifiers>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This repository contains 24,000 pairs of nucleotide sequences (and associated parameters) that have been simulated for testing alignment-free genome distance estimates. Given an evolutionary distance &lt;em&gt;d&lt;/em&gt; varying from 0.05 to 1.00 nucleotide substitutions per character (step = 0.05), the program &lt;a href="http://abacus.gene.ucl.ac.uk/software/indelible/"&gt;&lt;em&gt;INDELible&lt;/em&gt;&lt;/a&gt; was used to simulate the evolution of 200 nucleotide sequence pairs with &lt;em&gt;d&lt;/em&gt; substitution events per character under the models GTR and GTR+&amp;Gamma;. Each model was adjusted with three different equilibrium frequencies:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;: equal frequencies, i.e. freq(A) = freq(C) = freq(G) = freq(T) = 0.25,&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;: GC-rich, i.e. freq(A) = 0.1, freq(C) = 0.3, freq(G) = 0.4, freq(T) = 0.2,&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;f&lt;/em&gt;&lt;sub&gt;3&lt;/sub&gt;: AT-rich, i.e. freq(A) = freq(T) = 0.4, freq(C) = freq(G) = 0.1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each simulated sequence pair, model parameters (i.e. GTR: six relative rates of nucleotide substitution; GTR+&amp;Gamma;: six rates and one &amp;Gamma; shape parameter) were randomly drawn from 142 sets of parameters derived from real-case data (see file &lt;a href="https://zenodo.org/record/4034261/files/GTR.params.trees.tsv?download=1"&gt;GTR.params.trees.tsv&lt;/a&gt; at &lt;a href="https://zenodo.org/record/4034261"&gt;https://zenodo.org/record/4034261&lt;/a&gt;). The initial sequence length was 5 Mb, and the indel rate was set to 0.01, with indel lengths drawn from [1, 50000] according to a Zipf distribution with parameter 1.5 (see &lt;em&gt;INDELible&lt;/em&gt; &lt;a href="http://abacus.gene.ucl.ac.uk/software/indelible/manual/model.shtml"&gt;manual&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;For each of the 20 evolutionary distances &lt;em&gt;d&lt;/em&gt; = 0.05, 0.10, ..., 1.00, six XZ-compressed files are available, each containing 200 simulated sequence pairs:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;code&gt;data-d-f1-nogam.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;data-d-f1-gamma.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR+&amp;Gamma; with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;data-d-f2-nogam.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;data-d-f2-gamma.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR+&amp;Gamma; with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;data-d-f3-nogam.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;3&lt;/sub&gt;&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;data-d-f3-gamma.tsv.xz&lt;/code&gt; &amp;nbsp; data simulated under the model GTR+&amp;Gamma; with equilibrium frequencies &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;3&lt;/sub&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Each file is tab-delimited and contains the following 18 fields:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;code&gt;[1]&amp;nbsp; &amp;nbsp;&lt;/code&gt;&amp;nbsp;&amp;nbsp; integer &lt;em&gt;seed&lt;/em&gt; value specified to &lt;em&gt;INDELible&lt;/em&gt;,&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[2-5]&amp;nbsp;&lt;/code&gt; &amp;nbsp; frequencies of T, C, A, G, respectively, specified to &lt;em&gt;INDELible&lt;/em&gt;,&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[6-10]&amp;nbsp;&lt;/code&gt; C-T, A-T, G-T, A-C, C-G rate parameters, respectively (normalized such that A-G rate = 1), specified to &lt;em&gt;INDELible&lt;/em&gt;,&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[11] &amp;nbsp; &lt;/code&gt; &amp;nbsp; &amp;Gamma; shape parameter &lt;em&gt;alpha&lt;/em&gt; (= 0 in the &lt;code&gt;nogam&lt;/code&gt; files, i.e. GTR substitution model without &amp;Gamma;) specified to &lt;em&gt;INDELible&lt;/em&gt;,&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[12] &amp;nbsp; &lt;/code&gt; &amp;nbsp; length &lt;em&gt;lgt1&lt;/em&gt; of the first sequence &lt;em&gt;seq1&lt;/em&gt; (i.e. no. A, C, G, T in &lt;em&gt;seq1&lt;/em&gt;),&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[13] &amp;nbsp; &lt;/code&gt; &amp;nbsp; length &lt;em&gt;lgt2&lt;/em&gt; of the second sequence &lt;em&gt;seq2&lt;/em&gt; (i.e. no. A, C, G, T in &lt;em&gt;seq2&lt;/em&gt;),&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[14] &amp;nbsp; &lt;/code&gt; &amp;nbsp; no. &lt;em&gt;sites&lt;/em&gt; in aligned sequences &lt;em&gt;seq1&lt;/em&gt; and &lt;em&gt;seq2&lt;/em&gt; (i.e. no. A, C, G, T and gap character states in &lt;em&gt;seq1&lt;/em&gt; or &lt;em&gt;seq2&lt;/em&gt;),&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[15] &amp;nbsp; &lt;/code&gt; &amp;nbsp; no. non-gapped sites (&lt;em&gt;core&lt;/em&gt; sites) in aligned sequences &lt;em&gt;seq1&lt;/em&gt; and &lt;em&gt;seq2&lt;/em&gt;,&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[16] &amp;nbsp; &lt;/code&gt; &amp;nbsp; observed &lt;em&gt;p-distance&lt;/em&gt; between aligned sequences &lt;em&gt;seq1&lt;/em&gt; and &lt;em&gt;seq2&lt;/em&gt; (i.e. no. nucleotide mismatches divided by no. &lt;em&gt;core&lt;/em&gt; sites),&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[17] &amp;nbsp; &lt;/code&gt; &amp;nbsp; aligned &lt;em&gt;seq1&lt;/em&gt; (containing indel gaps),&lt;/li&gt;
	&lt;li&gt;&lt;code&gt;[18] &amp;nbsp; &lt;/code&gt; &amp;nbsp; aligned &lt;em&gt;seq2&lt;/em&gt; (containing indel gaps).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that because &lt;em&gt;seq1&lt;/em&gt; and &lt;em&gt;seq2&lt;/em&gt; (fields &lt;code&gt;[17-18]&lt;/code&gt;) are aligned, both strings have the same length, equal to the no. &lt;em&gt;sites&lt;/em&gt; (field &lt;code&gt;[14]&lt;/code&gt;). To obtain the unaligned sequences, remove the gap character states (&lt;code&gt;-&lt;/code&gt;) from &lt;em&gt;seq1&lt;/em&gt; and &lt;em&gt;seq2&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;_____&lt;/p&gt;

&lt;p&gt;Criscuolo A (2020) &lt;em&gt;On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference&lt;/em&gt;. F1000Research, 9:1309. &lt;a href="https://doi.org/10.12688/f1000research.26930.1"&gt;doi:10.12688/f1000research.26930.1&lt;/a&gt;&lt;/p&gt;</description>
  </descriptions>
</resource>
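
The 18-field record layout described in the abstract can be read with a short script. The following is a minimal Python sketch, not part of the dataset: the names `SimulatedPair`, `parse_record`, `ungap`, `observed_p_distance` and `read_pairs` are hypothetical, and it assumes that the files contain no header line and one sequence pair per line, which the record description does not state explicitly. It parses one record, strips gap characters to recover the unaligned sequences, and recomputes the p-distance of field [16] from the aligned sequences of fields [17-18].

```python
import lzma
from typing import Iterator, NamedTuple, Tuple


class SimulatedPair(NamedTuple):
    """One tab-delimited record (hypothetical container for the 18 fields)."""
    seed: int                 # field [1]
    freqs: Tuple[float, ...]  # fields [2-5]: frequencies of T, C, A, G
    rates: Tuple[float, ...]  # fields [6-10]: C-T, A-T, G-T, A-C, C-G (A-G = 1)
    alpha: float              # field [11]: Gamma shape (0 in nogam files)
    lgt1: int                 # field [12]
    lgt2: int                 # field [13]
    sites: int                # field [14]
    core: int                 # field [15]
    p_distance: float         # field [16]
    seq1: str                 # field [17], aligned, with '-' gaps
    seq2: str                 # field [18], aligned, with '-' gaps


def parse_record(line: str) -> SimulatedPair:
    """Split one line into the 18 documented fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 18:
        raise ValueError(f"expected 18 fields, found {len(f)}")
    return SimulatedPair(
        seed=int(f[0]),
        freqs=tuple(float(x) for x in f[1:5]),
        rates=tuple(float(x) for x in f[5:10]),
        alpha=float(f[10]),
        lgt1=int(f[11]), lgt2=int(f[12]),
        sites=int(f[13]), core=int(f[14]),
        p_distance=float(f[15]),
        seq1=f[16], seq2=f[17],
    )


def ungap(seq: str) -> str:
    """Remove gap characters to obtain the unaligned sequence."""
    return seq.replace("-", "")


def observed_p_distance(seq1: str, seq2: str) -> Tuple[int, float]:
    """Return (no. core sites, mismatches / core sites) for two aligned strings."""
    core = mismatches = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # gapped site: not a core site
        core += 1
        if a != b:
            mismatches += 1
    return core, (mismatches / core if core else float("nan"))


def read_pairs(path: str) -> Iterator[SimulatedPair]:
    """Stream records from an XZ-compressed file (assumes no header line)."""
    with lzma.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)
```

For any record `rec` yielded by `read_pairs`, `len(ungap(rec.seq1))` can be compared with `rec.lgt1`, and the output of `observed_p_distance(rec.seq1, rec.seq2)` with `rec.core` and `rec.p_distance`. Streaming one record at a time keeps memory use bounded, since each record holds two sequences of about 5 Mb; the pure-Python loop over sites is slow at that length and is meant as an illustration only.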