Simple approaches for evaluation of OTU quality based on dissimilarity arrays
Authors/Creators
- 1. Université de Toulouse, Auzeville-Tolosane, France
- 2. Université de Bordeaux, Talence, France|Université de Bordeaux, Cestas, France
- 3. Université de Bordeaux, Cestas, France|Université de Bordeaux, Talence, France
Description
An accurate and complete taxonomic description of the diversity present in an environmental sample is currently out of reach. Metabarcoding is used instead, on the expectation that OTUs form a category relevant to molecular-based biodiversity inventories. However, artefacts can arise at different stages of OTU production and may affect ecological conclusions. We propose to evaluate the quality of the OTUs in a sample by characterising how far each OTU's dissimilarity array deviates from that of an ideal OTU, in which all sequences lie at distances smaller than the barcoding gap. We consider two deviations: composed OTUs, which result from the artificial merging of several OTUs, and noisy OTUs, which contain some sequences that are only loosely associated with the core sequence of the OTU and do not form a compact subgroup. We propose a simple, automatic two-step method that first categorises each OTU of a sample as composed or single and then identifies the OTUs with noise amongst the single ones. The associated code is available at https://forgemia.inra.fr/alain.franc/otu_shape. We applied the method to 32 samples of diatoms from Arcachon Bay (France), representing contrasting environmental conditions, and obtained good agreement with expert categorisation of the OTUs. We suggest that single OTUs without noise can be used as such in further ecological studies, whereas composed OTUs should be post-processed with classical clustering or community detection tools. The quality of single OTUs with noise remains to be tested in further studies on a wider range of organisms.
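To make the three categories concrete, here is a minimal Python sketch of the idea of comparing an OTU's dissimilarity array with that of an ideal OTU. It is not the authors' implementation (their code is at the URL above), and it swaps in a deliberately naive technique: sequences are linked when their dissimilarity is below the gap, and the resulting connected groups are counted. The function names, the `gap` threshold, the `min_group_size` parameter and the toy matrices are all assumptions made for illustration.

```python
from itertools import combinations


def deviation_from_ideal(dis, gap):
    """Fraction of sequence pairs at dissimilarity >= gap (0.0 for an ideal OTU)."""
    pairs = list(combinations(range(len(dis)), 2))
    if not pairs:
        return 0.0
    too_far = sum(1 for i, j in pairs if dis[i][j] >= gap)
    return too_far / len(pairs)


def compact_groups(dis, gap):
    """Connected groups of sequences, linking two sequences when dis < gap."""
    unseen = set(range(len(dis)))
    groups = []
    while unseen:
        start = unseen.pop()
        group, stack = {start}, [start]
        while stack:
            i = stack.pop()
            near = {j for j in unseen if dis[i][j] < gap}
            unseen -= near
            group |= near
            stack.extend(near)
        groups.append(sorted(group))
    return sorted(groups)


def categorise(dis, gap, min_group_size=2):
    """Toy two-step rule (an assumption, not the published method):
    step 1 flags an OTU as composed when it holds several compact groups,
    step 2 flags the remaining single OTUs as noisy when some pairs exceed the gap."""
    big = [g for g in compact_groups(dis, gap) if len(g) >= min_group_size]
    if len(big) >= 2:
        return "composed"
    if deviation_from_ideal(dis, gap) > 0:
        return "single with noise"
    return "single without noise"


# Hypothetical dissimilarity arrays (symmetric, zero diagonal).
ideal = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
composed = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
noisy = [[0, 1, 1, 5], [1, 0, 1, 5], [1, 1, 0, 5], [5, 5, 5, 0]]

for name, dis in [("ideal", ideal), ("composed", composed), ("noisy", noisy)]:
    print(name, "->", categorise(dis, gap=3))
```

In this sketch a lone distant sequence counts as noise because it does not form a group of at least `min_group_size` sequences, whereas two such groups are read as a merge of OTUs. A real analysis would need a data-driven choice of the gap and a more robust grouping criterion.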
Files
| Name | Size |
|---|---|
| MBMG_article_108649.pdf (md5:6070e4372026f1624ff84da347302a50) | 1.2 MB |