There is a newer version of the record available.

Published October 24, 2021 | Version v1
Software Open

Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies

  • 1. University of Milan

Description

Clustering approaches are pivotal for handling the many sequence variants obtained in DNA metabarcoding datasets and have therefore become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to group sequences into Molecular Operational Taxonomic Units (MOTUs) that ideally each represent a homogeneous taxonomic entity, e.g. a species or a genus. However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging is tested even less often. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limiting MOTU over-merging, limiting MOTU over-splitting, and a trade-off between over-merging and over-splitting. We extracted sequences from a public database for eight markers, ranging from generalist markers targeting Bacteria or Eukaryota to more specific markers targeting a class or a subclass (e.g. Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera, and on the rates of over-splitting and over-merging across different clustering thresholds, we propose threshold values that minimize the risk of over-splitting, minimize the risk of over-merging, or offer a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold (e.g. 0.97). Instead, we advocate a careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
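
The trade-off between over-splitting and over-merging described above can be illustrated with a minimal Python sketch. It is not the code distributed with this record. It assumes one plausible reading of the two rates: a within-species pair whose similarity falls below the threshold counts as over-split, and a pair of different species whose similarity reaches the threshold counts as over-merged. The function names (`error_rates`, `best_tradeoff`), the equal weighting of the two rates, and the toy similarity values are all hypothetical.

```python
from typing import List, Tuple


def error_rates(within_species: List[float],
                between_species: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (over_splitting_rate, over_merging_rate) for one threshold.

    Assumed definitions (not taken from the record):
    - over-splitting: same-species pairs with similarity below the threshold
    - over-merging: different-species pairs with similarity at or above it
    """
    over_split = sum(s < threshold for s in within_species) / len(within_species)
    over_merge = sum(s >= threshold for s in between_species) / len(between_species)
    return over_split, over_merge


def best_tradeoff(within_species: List[float],
                  between_species: List[float],
                  thresholds: List[float]) -> float:
    """Pick the threshold with the smallest sum of the two rates.

    Equal weighting is an assumption; a study that fears over-merging more
    than over-splitting would weight the two terms differently.
    """
    return min(thresholds,
               key=lambda t: sum(error_rates(within_species, between_species, t)))


# Hypothetical toy pairwise similarities, for illustration only.
within_species = [0.99, 0.98, 0.97, 0.95]
between_species = [0.94, 0.93, 0.90, 0.88]
thresholds = [0.90, 0.93, 0.95, 0.97, 0.99]

for t in thresholds:
    split, merge = error_rates(within_species, between_species, t)
    print(f"threshold={t:.2f}  over-splitting={split:.2f}  over-merging={merge:.2f}")

print("trade-off threshold:", best_tradeoff(within_species, between_species, thresholds))
```

With these toy values, raising the threshold lowers the over-merging rate and raises the over-splitting rate, which is why no single fixed value suits every marker.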

Notes

Funding provided by: European Research Council
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100000781
Award Number: 772284

Files

Read_Me.txt

Files (19.4 kB)

MD5 checksum                            Size
md5:4c9dd3cc0651333aa61642fca8e94da8    1.2 kB
md5:1632b8f0b1b248d56337bedb41fba629    3.7 kB
md5:733245bc6b943615acceab8c18c54289    1.9 kB
md5:6dfbc220614cd5869b2ee1de251400b5    4.2 kB
md5:6dcee5e6655792b8787ff5d2238079e2    2.2 kB
md5:e8f29b0855b84515d06a70fd52905122    6.2 kB

Additional details

Related works

Is source of
10.5061/dryad.crjdfn353 (DOI)