Dataset Open Access

Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods

Criscuolo Alexis


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Criscuolo Alexis</dc:creator>
  <dc:date>2020-09-17</dc:date>
  <dc:description>This repository contains 24,000 pairs of nucleotide sequences (and associated parameters) that have been simulated for testing alignment-free genome distance estimates. Given an evolutionary distance d varying from 0.05 to 1.00 nucleotide substitutions per character (step = 0.05), the program INDELible was used to simulate the evolution of 200 nucleotide sequence pairs with d substitution events per character under the models GTR and GTR+Γ. Each model was adjusted with three different equilibrium frequencies:


	f1: equal frequencies, i.e. freq(A) = freq(C) = freq(G) = freq(T) = 0.25,
	f2: GC-rich, i.e. freq(A) = 0.1, freq(C) = 0.3, freq(G) = 0.4, freq(T) = 0.2,
	f3: AT-rich, i.e. freq(A) = freq(T) = 0.4, freq(C) = freq(G) = 0.1.


For each simulated sequence pair, model parameters (i.e. GTR: six relative rates of nucleotide substitution; GTR+Γ: six rates and one Γ shape parameter) were randomly drawn from 142 sets of parameters derived from real-case data (see file GTR.params.trees.tsv at https://zenodo.org/record/4034261). The initial sequence length was 5 Mb, and the indel rate was set to 0.01, with indel lengths drawn from [1, 50000] according to a Zipf distribution with parameter 1.5 (see the INDELible manual).

 

For each of the 20 evolutionary distances d = 0.05, 0.10, ..., 1.00, six XZ-compressed files are available, each containing 200 simulated sequence pairs:


	data-d-f1-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f1
	data-d-f1-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f1
	data-d-f2-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f2
	data-d-f2-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f2
	data-d-f3-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f3
	data-d-f3-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f3


 

Each file is tab-delimited and contains the 18 following fields:


	[1]      integer seed value specified to INDELible,
	[2-5]    frequencies of T, C, A, G, respectively, specified to INDELible,
	[6-10]  C-T, A-T, G-T, A-C, C-G rate parameters, respectively (normalized such that A-G rate = 1), specified to INDELible,
	[11]      Γ shape parameter alpha (= 0 in the nogam files, i.e. GTR substitution model without Γ) specified to INDELible,
	[12]      length lgt1 of the first sequence seq1 (i.e. no. A, C, G, T in seq1),
	[13]      length lgt2 of the second sequence seq2 (i.e. no. A, C, G, T in seq2),
	[14]      no. sites in aligned sequences seq1 and seq2 (i.e. no. A, C, G, T and gap character states in seq1 or seq2),
	[15]      no. non-gapped sites (core sites) in aligned sequences seq1 and seq2,
	[16]      observed p-distance between aligned sequences seq1 and seq2 (i.e. no. nucleotide mismatches divided by no. core sites),
	[17]      aligned seq1 (containing indel gaps),
	[18]      aligned seq2 (containing indel gaps).


Note that, because seq1 and seq2 (fields [17-18]) are aligned, these two entries are strings with the same no. sites (field [14]). Gap character states (-) should be removed from seq1 and seq2 to obtain the unaligned sequences.

_____

Criscuolo A (2020) On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Research, 9:1309. doi:10.12688/f1000research.26930.1</dc:description>
  <dc:identifier>https://zenodo.org/record/4034462</dc:identifier>
  <dc:identifier>10.5281/zenodo.4034462</dc:identifier>
  <dc:identifier>oai:zenodo.org:4034462</dc:identifier>
  <dc:relation>doi:10.5281/zenodo.4034461</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>simulation</dc:subject>
  <dc:subject>genomes</dc:subject>
  <dc:title>Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
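
As a minimal sketch of how the 18-field records described above might be read, the following Python code parses one tab-delimited line, removes gap characters to recover the unaligned sequences, and recomputes the number of core sites and mismatches behind the p-distance of field [16]. The names SimulatedPair, parse_record, ungap, core_stats and read_pairs are hypothetical helpers, not part of the dataset. The sketch assumes that the files have no header line, that each line holds exactly one record, and that the sequences are plain ASCII; none of this is stated in the record, so it should be checked against the actual files.

```python
import lzma
from dataclasses import dataclass

N_FIELDS = 18  # number of tab-separated fields per record, as listed in the description


@dataclass
class SimulatedPair:  # hypothetical container, not defined by the dataset
    seed: int          # field 1
    freqs: dict        # fields 2-5, keyed T, C, A, G
    rates: dict        # fields 6-10, keyed CT, AT, GT, AC, CG (A-G rate = 1)
    alpha: float       # field 11, 0 in the nogam files
    lgt1: int          # field 12
    lgt2: int          # field 13
    n_sites: int       # field 14
    n_core: int        # field 15
    p_distance: float  # field 16
    aln1: str          # field 17, aligned seq1
    aln2: str          # field 18, aligned seq2


def parse_record(line):
    """Split one tab-delimited line into the 18 documented fields."""
    c = line.rstrip("\r\n").split("\t")
    if len(c) != N_FIELDS:
        raise ValueError(f"expected {N_FIELDS} fields, got {len(c)}")
    return SimulatedPair(
        seed=int(c[0]),
        freqs=dict(zip("TCAG", map(float, c[1:5]))),
        rates=dict(zip(("CT", "AT", "GT", "AC", "CG"), map(float, c[5:10]))),
        alpha=float(c[10]),
        lgt1=int(c[11]),
        lgt2=int(c[12]),
        n_sites=int(c[13]),
        n_core=int(c[14]),
        p_distance=float(c[15]),
        aln1=c[16],
        aln2=c[17],
    )


def ungap(aligned):
    """Remove gap characters (-) to obtain the unaligned sequence."""
    return aligned.replace("-", "")


def core_stats(aln1, aln2):
    """Return (no. core sites, no. mismatches) for two aligned sequences."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    core = mismatches = 0
    for x, y in zip(aln1.upper(), aln2.upper()):
        if x == "-" or y == "-":
            continue  # gapped site: not a core site
        core += 1
        if x != y:
            mismatches += 1
    return core, mismatches


def read_pairs(path):
    """Yield one SimulatedPair per non-empty line of an XZ-compressed file."""
    # Assumption: no header line and ASCII content.
    with lzma.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)
```

For a parsed record rec, core_stats(rec.aln1, rec.aln2) returns the two counts whose ratio (mismatches divided by core sites) can be compared with rec.p_distance, and len(ungap(rec.aln1)) can be compared with rec.lgt1. Since each aligned sequence is several megabases long, iterating lazily with read_pairs avoids holding a whole file in memory.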