Published February 4, 2025 | Version v4
Dataset Open

Script and data from: The best of two worlds: toward large-scale monitoring of biodiversity combining metabarcoding and optimised parataxonomic validation.

  • 1. National Research Institute For Agriculture, Food And Environment
  • 2. French Agricultural Research Centre for International Development

Description

Zenodo record linked to: Penel, B., Meynard, C.N., Benoit, L., Bourdonné, A., Clamens, A., Soldati, L., Migeon, A., Chapuis, M.-P., Piry, S., Kergoat, G. and Haran, J. (2025), The best of two worlds: toward large-scale monitoring of biodiversity combining COI metabarcoding and optimized parataxonomic validation. Ecography, 2025: e07699. https://doi.org/10.1111/ecog.07699

Publication abstract 

In a context of unprecedented insect decline, it is critical to have reliable monitoring tools to measure species diversity and their dynamic at large-scales. High-throughput DNA-based identification methods, and particularly metabarcoding, were proposed as an effective way to reach this aim. However, these identification methods are subject to multiple technical limitations, resulting in unavoidable false-positive and false-negative species detection. Moreover, metabarcoding does not allow a reliable estimation of species abundance in a given sample, which is key to document and detect population declines or range shifts at large scales. To overcome these obstacles, we propose here a Human-Assisted Molecular Identification (HAMI) approach, a framework based on a combination of metabarcoding and image-based parataxonomic validation of outputs and recording of abundance. We assessed the advantages of using HAMI over the exclusive use of a metabarcoding approach by examining 492 mixed beetle samples from a biodiversity monitoring initiative conducted throughout France. On average, 23% of the species are missed when relying exclusively on metabarcoding, this percent being consistently higher in species-rich samples. Importantly, on average, 20% of the species identified by molecular-only approaches correspond to false positives linked to cross-sample contaminations or mis-identified barcode sequences in databases. The combination of molecular methodologies and parataxonomic validation in HAMI significantly reduces the intrinsic biases of metabarcoding and recovers reliable abundance data. This approach also enables users to engage in a virtuous circle of database improvement through the identification of specimens associated with missing or incorrectly assigned barcodes. As such, HAMI fills an important gap in the toolbox available for fast and reliable biodiversity monitoring at large scales.

File description: 

MiSeq raw sequences of the COI barcode from 492 Coleoptera field samples:

The Raw_sequencage_data zip directory contains the FASTQ files of the paired-end reads (R1: reads 1; R2: reads 2) produced in duplicate for each Coleoptera field sample using the MiSeq platform GenSeq (ISEM - University of Montpellier).
 
The HAMI_data_script_results_R zip directory contains the Rmarkdown script files (.Rmd and .html) and the associated data used to analyse the systemic errors of the metabarcoding approach (N = 492 Coleoptera field samples).
 
The HAMI_pipeline zip directory contains all the code associated with the HAMI pipeline, as well as a ReadMe file and a test data set.
 
The Residual_chimera.zip directory contains lists of MOTUs associated with residual chimeric sequences that were not filtered out by the FROGS pipeline but were subsequently detected with the de novo approach implemented in the HAMI pipeline, using the ‘isBimeraDenovo’ R function from DADA2 v1.28.0. It contains two files, one for each sequencing run.
 
The NUMTS_filtered.zip directory contains lists of MOTUs that were excluded from the final dataset by the NUMTS filtering. File xxx_pseudogene_f1_deteled.csv lists the MOTUs excluded by the first filtering step, based on DNA sequencing. File xxx_pseudogene_f2_deteled.csv lists the merged MOTUs excluded by the second filter, based on occurrence and percentage of identity. The folder contains these files for both sequencing runs.
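
The exclusion lists described above (residual chimeras and NUMTS) can be applied to a MOTU table by simple set filtering. The following is a minimal sketch, not the HAMI pipeline's own code: the column name `motu_id`, the comma delimiter and the toy identifiers are assumptions for illustration, and the real CSV layout should be checked in the archives before use.

```python
import csv
import io


def read_ids(handle, column, delimiter=","):
    """Collect the MOTU identifiers found in one column of an exclusion list.

    `column` and `delimiter` are assumptions; adapt them to the real files.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[column] for row in reader if row.get(column)}


def drop_excluded(rows, excluded, column):
    """Return the rows of a MOTU table whose identifier is not excluded."""
    return [row for row in rows if row[column] not in excluded]


# Toy stand-ins for a chimera list, a NUMTS list and a MOTU table.
chimera_list = io.StringIO("motu_id\nMOTU_2\n")
numts_list = io.StringIO("motu_id\nMOTU_3\n")
motu_table = [
    {"motu_id": "MOTU_1", "reads": "120"},
    {"motu_id": "MOTU_2", "reads": "15"},
    {"motu_id": "MOTU_3", "reads": "8"},
]

excluded = read_ids(chimera_list, "motu_id") | read_ids(numts_list, "motu_id")
kept = drop_excluded(motu_table, excluded, "motu_id")
print([row["motu_id"] for row in kept])
```

With real data, the `io.StringIO` objects would be replaced by files opened from the extracted archives, one pair of lists per sequencing run.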

Files

Files (4.8 GB)

Checksum                                Size
md5:4a9c848f30bd0458d54701ac0cac34e5    2.1 MB
md5:f3fe71cfb553a9b8c65dd63fe4e8ee41    32.2 MB
md5:c4ae1ce4d4cf871bdf110ee05cec1c19    251.2 kB
md5:3ee5752fc75ae58fdce88f61503af1cb    4.8 GB
md5:0bf855449797ec7292cab63fe8680f25    2.7 MB
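
The md5 checksums listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of` and `matches` are illustrative, and the file path passed to them is whatever local path the archive was saved to.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected_hex
```

For example, `matches("downloaded_archive.zip", "md5:3ee5752fc75ae58fdce88f61503af1cb")` returns `True` only if the local file has exactly that checksum; the path here is a placeholder.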

Additional details

Funding

Agence Nationale de la Recherche
ANR Agribidiov ANR-21-CE32-006-01
Ministère de l'Agriculture et de la Souveraineté alimentaire
Ecophyto II+ project: GTP 500 ENI OFB-21-1642

Software

Repository URL
https://github.com/BenoitPenel/PEWO-1
Programming language
Python, R, Snakemake
Development Status
Active