Published December 29, 2020 | Version v1
Dataset Open

Data from: In search of an optimal DNA diagnosis for taxonomic descriptions with MOLD, a novel tool to identify diagnostic nucleotide characters

  • 1. Severtsov Institute of Ecology and Evolution

Description

While DNA characters are increasingly used for phylogenetic inference, taxon delimitation and identification, their use in the formal description of taxa remains scarce and inconsistent. Until recently, the major impediment was the lack of a suitable algorithm to identify signature DNA characters. The years 2019-2020, however, were marked by the almost simultaneous release of three software tools, simple to run and designed specifically for taxonomists. There is, nevertheless, a major concern as to whether taxonomy will benefit from wide application of these or any of the previously available tools. The reluctance to use DNA data in taxonomy is partly due to concerns about the insufficient reliability of DNA characters, as the robustness of DNA-based diagnoses as a function of the sampled fraction of species diversity has not thus far been assessed.

We propose a novel program, named MOLD, that recovers diagnostic nucleotide combinations (DNCs) for selected taxa with DNA sequences available. We carried out random iterated haplotype subsampling on species in six published DNA data sets of varying complexity, providing a diagnosis for each subsample to evaluate how the robustness of a DNA-based diagnosis changes depending on the sampled fraction of the taxon's diversity. We demonstrate that the currently used diagnostic DNA characters, or combinations thereof (DNCs), often do not exist for a particular species in a particular data set, or are not sufficiently reliable. We propose a new type of DNA diagnosis, termed herein rDNCs, which is compiled to meet pre-defined criteria of reliability and is implemented in MOLD. We demonstrate that rDNCs can be successfully identified even in data sets comprising hundreds of species, and allow for notably more reliable diagnoses than the currently used diagnostic DNA characters. MOLD recovers reliable and reproducible diagnoses in traditionally problematic cases, such as cryptic species or species with pronounced genetic structure, and shows unparalleled efficiency in large DNA data sets, making it a valuable complement to the currently existing toolkit.
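The two ideas in the description, a diagnostic nucleotide combination and haplotype subsampling, can be illustrated with a minimal Python sketch. This is not MOLD's code or its search strategy, which this record does not describe. It assumes that a combination of alignment sites is diagnostic when all sequences of the focal taxon share one nucleotide pattern at those sites and no other sequence shows that pattern, and it swaps in a plain brute-force search. All function names, parameters, and the toy alignment are hypothetical.

```python
import random
from itertools import combinations


def conserved_sites(focal):
    """Alignment columns where all focal sequences share one nucleotide."""
    return [i for i in range(len(focal[0])) if len({s[i] for s in focal}) == 1]


def is_diagnostic(sites, focal, others):
    """True if focal sequences share one pattern at `sites` that no other sequence has."""
    patterns = {tuple(s[i] for i in sites) for s in focal}
    if len(patterns) != 1:
        return False
    pattern = next(iter(patterns))
    return all(tuple(s[i] for i in sites) != pattern for s in others)


def find_dnc(focal, others, max_size=3):
    """Brute-force stand-in (an assumption, not MOLD's algorithm):
    return the first smallest diagnostic site combination, or None."""
    candidates = conserved_sites(focal)
    for size in range(1, max_size + 1):
        for sites in combinations(candidates, size):
            if is_diagnostic(sites, focal, others):
                return sites
    return None


def subsample_robustness(focal, others, fraction, iterations=100, seed=0):
    """Share of random haplotype subsamples whose diagnosis still holds
    when checked against every focal sequence."""
    rng = random.Random(seed)
    k = min(len(focal), max(1, round(fraction * len(focal))))
    hits = 0
    for _ in range(iterations):
        sample = rng.sample(focal, k)
        sites = find_dnc(sample, others)
        if sites is not None and is_diagnostic(sites, focal, others):
            hits += 1
    return hits / iterations


# Toy aligned haplotypes (hypothetical data, 0-based site indices).
focal = ["ACGTA", "ACGTC"]
others = ["ATGTA", "ACCTA", "ACGAA"]

if __name__ == "__main__":
    print(find_dnc(focal, others))
    print(subsample_robustness(focal, others, fraction=0.5))
```

In this toy alignment no single site separates the focal taxon from the others, so the search has to fall back on a combination of sites. A diagnosis derived from a subsample can also rely on a site that varies in the unsampled haplotypes, which is the kind of unreliability the subsampling is meant to expose.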

Notes

Seven published DNA datasets were analysed using MOLD - a novel software tool to recover diagnostic DNA characters for taxonomy.

Funding provided by: Russian Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100006769
Award Number: 19-74-10020

Files

mDNC_h-resampling.zip

Files (722.6 kB)

Name Size
md5:1d660951363d9af06bcb99337ad01a4e 126.9 kB
md5:d64a84bcd7e9427cba1c34d94599435a 18.4 kB
md5:fa78e8c781876e3eff9ffdc0b05e886b 14.3 kB
md5:56cc53e593f07cfe20546c849ae43cb2 399.5 kB
md5:9d0befb48bb34ca35d8e81a354e6599c 61.9 kB
md5:54a7e4b63ddafc2d68ed1c4528f962d3 37.1 kB
md5:99ea6dbb0ee530f612030c8f812cdeab 62.5 kB
md5:06fd6d46761d85ee37f89fe213e483f0 2.1 kB
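
The md5 values listed for the files can be used to check a downloaded copy. The sketch below uses only Python's standard `hashlib` module. The function names and the file path in the usage note are hypothetical, since this record does not state which checksum belongs to which file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_record("downloaded_file.zip", "md5:1d660951363d9af06bcb99337ad01a4e")` would return True only if the local file, whose name here is a placeholder, has the first checksum in the list.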