Published May 10, 2023 | Version v1.0.0
Software Open

metacleaner: Automated curation of barcode sequence databases for metabarcoding and metagenomics

  • 1. @GrozingerLab

Description

DNA barcode reference databases generated by tools like MetaCurator, which operate on sequences retrieved from NCBI, sometimes contain mislabeled accessions (e.g., see: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13314). In particular, we found that pollen metabarcoding experiments using plant ITS1 and ITS2 region databases would yield many reads matching random sequences in fungi, ITS1/ITS2 sequences in plants of the wrong genus, or sequences from multiple genera at the same % identity and query coverage. You can read a detailed report of this problem using an example dataset here.

metacleaner takes a .fasta file as input and searches its sequences for hits against databases of known desired ("good") and known undesired ("bad") sequences, so that undesired accessions can be filtered out before downstream use.

1) Query sequences are searched for hits against known undesired sequences using blastn; query sequences with hits above user-defined thresholds for percentage identity (pident) and query coverage (qcovs) are flagged as mislabeled.

2) Query sequences are searched for hits against known desired sequences using blastn. Query sequences with no hits above user-defined thresholds for percentage identity (pident) and query coverage (qcovs) are flagged as potentially mislabeled, while query sequences with hits above these thresholds are flagged as candidate clean sequences.

3) Taxonomy info for the top hits against desired sequences is retrieved using taxonomizr, for both the candidate clean sequences and the potentially mislabeled sequences, and compared against the taxonomy info of the query sequence. If the taxonomy info of the query and the subject (or subjects, when there are multiple hits) does not match at a user-defined level of taxonomy (one of superkingdom, phylum, class, order, family, genus, or species), the query sequence is flagged as mislabeled. An additional inclusion filter can be set: all hits at the pident and qcovs thresholds must have this information, or they will be flagged as mislabeled.

4) Flagged sequences are filtered from the query database.
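
The four steps above can be sketched in Python as follows. This is only an illustrative outline of the flagging logic, not metacleaner's actual code: the `Hit` class, the `flag_mislabeled` function, and the in-memory `taxonomy` dictionary are hypothetical names introduced here, and the blastn and taxonomizr calls are replaced by pre-parsed data. The sketch also makes three assumptions that the description does not spell out: thresholds are treated as inclusive, a query with no hit above the thresholds is compared against its single best hit, and a query with no hits at all against the desired sequences is flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One blastn hit (hypothetical container, not metacleaner's format)."""
    query: str     # query accession
    subject: str   # subject accession
    pident: float  # percentage identity
    qcovs: float   # query coverage


def passes(hit, min_pident, min_qcovs):
    # Assumption: thresholds are inclusive.
    return hit.pident >= min_pident and hit.qcovs >= min_qcovs


def flag_mislabeled(queries, bad_hits, good_hits, taxonomy, level,
                    min_pident, min_qcovs):
    """Return the set of query accessions flagged as mislabeled."""
    flagged = set()

    # Step 1: any passing hit against known undesired sequences flags the query.
    for hit in bad_hits:
        if passes(hit, min_pident, min_qcovs):
            flagged.add(hit.query)

    # Group hits against known desired sequences by query.
    good_by_query = {}
    for hit in good_hits:
        good_by_query.setdefault(hit.query, []).append(hit)

    for query in queries:
        if query in flagged:
            continue
        hits = good_by_query.get(query, [])
        if not hits:
            # Assumption: no hit at all against desired sequences -> flagged.
            flagged.add(query)
            continue

        # Step 2: passing hits make the query a candidate clean sequence;
        # otherwise it is potentially mislabeled and (by assumption) its
        # single best hit is used for the taxonomy comparison.
        top = [h for h in hits if passes(h, min_pident, min_qcovs)]
        if not top:
            top = [max(hits, key=lambda h: (h.pident, h.qcovs))]

        # Step 3: every top hit must share the query's taxon at `level`.
        # Missing taxonomy info (None) counts as a mismatch.
        query_taxon = taxonomy.get(query, {}).get(level)
        subject_taxa = {taxonomy.get(h.subject, {}).get(level) for h in top}
        if query_taxon is None or subject_taxa != {query_taxon}:
            flagged.add(query)

    return flagged


# Made-up example data, for illustration only.
queries = ["q1", "q2", "q3", "q4", "q5"]
bad_hits = [Hit("q1", "fungus1", 99.0, 100.0)]
good_hits = [
    Hit("q2", "g1", 98.0, 100.0), Hit("q2", "g2", 97.5, 99.0),
    Hit("q3", "g1", 98.0, 100.0), Hit("q3", "g3", 98.0, 100.0),
    Hit("q4", "g1", 80.0, 50.0),
]
taxonomy = {
    "q2": {"genus": "Acer"}, "q3": {"genus": "Acer"}, "q4": {"genus": "Acer"},
    "g1": {"genus": "Acer"}, "g2": {"genus": "Acer"}, "g3": {"genus": "Quercus"},
}

flagged = flag_mislabeled(queries, bad_hits, good_hits, taxonomy,
                          level="genus", min_pident=95.0, min_qcovs=90.0)

# Step 4: drop flagged sequences from the query set.
kept = [q for q in queries if q not in flagged]
print(sorted(flagged))  # ['q1', 'q3', 'q5']
print(kept)             # ['q2', 'q4']
```

In this toy run, q1 is removed for matching an undesired sequence, q3 for hitting two genera equally well, and q5 for having no hits. q4 has no hit above the thresholds but is kept because its best hit agrees at the genus level, which follows from the assumptions stated above rather than from confirmed metacleaner behavior.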

If you use metacleaner in your research, please cite:

Crone, M., Boyle, N., Bresnahan, S.T., Biddinger, D., Grozinger, C.M. (2023). More than mesolectic: Characterizing the nutritional niche of Osmia cornifrons. Ecology and Evolution 13, e10640. https://doi.org/10.1002/ece3.10640

For more info, see: https://github.com/sbresnahan/metacleaner/tree/main

Full Changelog: https://github.com/sbresnahan/metacleaner/commits/v1.0.0

Files

sbresnahan/metacleaner-v1.0.0.zip

md5:f9ea4ec03873f24cb49d6e610a1ebf03
22.0 kB

Additional details

Related works

Is published in
Journal article: 10.1002/ece3.10640 (DOI)