Published January 3, 2024 | Version 1.0.0
Dataset Open

COI rCRUX filtered metabarcoding reference database and naive-bayes classifier

  • 1. Scripps Institution of Oceanography
  • 2. National Oceanic and Atmospheric Administration
  • 3. Southern California Coastal Water Research Project
  • 4. Northern Gulf Institute
  • 5. NOAA Atlantic Oceanographic and Meteorological Laboratories

Description

COI metabarcoding reference database and naive-bayes classifier in QIIME2 .qza format, with Insecta and Amphibia sequences removed. The original database, downloaded from here, was built with rCRUX using the Leray CO1 primers.

rCRUX details

The rCRUX database was generated by combining and de-replicating the following databases:

Leray CO1-ncbi-mitochondrial (https://doi.org/10.5281/zenodo.8407603)

Leray CO1-embl (https://doi.org/10.5281/zenodo.8407606)

Leray CO1-searchterm (https://doi.org/10.5281/zenodo.8407620)

 

Primer Name:  Leray CO1
Gene:   CO1
Length of Target:    ~313 bp
Forward Sequence (5'-3'):   GGWACWGGWTGAACWGTWTAYCCYCC
Reverse Sequence (5'-3'):    TANACYTCnGGRTGNCCRAARAAYCA
Reference:   Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in Zoology, 10(1), 34.
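
The primer sequences above use IUPAC degenerate codes (W, Y, R, N). As a minimal illustration of what those codes stand for, the sketch below expands each primer into a regular expression and counts how many concrete sequences it represents. The function names are hypothetical, the sequences are copied verbatim from the table (including the lowercase n, which is treated as N), and none of this is part of rCRUX.

```python
import re

# IUPAC nucleotide codes that occur in the two primers (N = any base).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "Y": "CT", "R": "AG", "N": "ACGT",
}

# Copied verbatim from the primer table above.
FORWARD = "GGWACWGGWTGAACWGTWTAYCCYCC"
REVERSE = "TANACYTCnGGRTGNCCRAARAAYCA"


def primer_to_regex(primer):
    """Compile a regex that matches every concrete variant of a degenerate primer."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return re.compile("".join(parts))


def degeneracy(primer):
    """Count the distinct concrete sequences a degenerate primer represents."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total


if __name__ == "__main__":
    for name, primer in (("forward", FORWARD), ("reverse", REVERSE)):
        print(name, len(primer), degeneracy(primer), primer_to_regex(primer).pattern)
```

Both primers are 26 bases long; the regular expressions are only meant for locating or checking primer sites in plain-text sequences and do not model mismatches.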

 

Steps to filter the database and train the classifier:

1. Pull out all taxonomy entries matching the terms 'Insecta' or 'Amphibia' using grep:
grep 'Insecta' CO1_combined_derep_and_clean_taxonomy.txt > CO1_combined_derep_and_clean_taxonomy-Insecta.txt

grep 'Amphibia' CO1_combined_derep_and_clean_taxonomy.txt > CO1_combined_derep_and_clean_taxonomy-Amphibia.txt

cat CO1_combined_derep_and_clean_taxonomy-Insecta.txt CO1_combined_derep_and_clean_taxonomy-Amphibia.txt > CO1_combined_derep_and_clean_taxonomy-Insecta-Amphibia.txt
 
2. Use this list of taxonomy entries to remove those two groups from the taxonomy file (Python script "grep-vf_Python.py"; attached here)
 
3. From the output file of grep-vf_Python.py, only the first column is the actual fasta header, so extract that column with awk:
awk '{print $1}' CO1_combined_derep_and_clean_taxonomy-noInsectaAmphibia.txt > filtered_taxa_toextract.txt

4. Use this list of non-Amphibia, non-Insecta sequence IDs to filter the original COI database fasta:
seqkit grep -f filtered_taxa_toextract.txt CO1_combined_derep_and_clean.fasta > CO1_combined_derep_and_clean-noInsectaAmphibia.fasta
 
5. Convert the filtered fasta and taxonomy files to QIIME2 .qza format:
qiime tools import --type 'FeatureData[Sequence]' \
--input-path CO1_combined_derep_and_clean-noInsectaAmphibia.fasta \
--output-path COI_rCRUX_filt_20231110.qza

qiime tools import --type 'FeatureData[Taxonomy]' \
--input-path CO1_combined_derep_and_clean_taxonomy-noInsectaAmphibia.txt \
--output-path COI_rCRUX_taxonomy_filt_20231110.qza \
--input-format 'HeaderlessTSVTaxonomyFormat'
 
6. Finally, train the classifier:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads COI_rCRUX_filt_20231110.qza \
--i-reference-taxonomy COI_rCRUX_taxonomy_filt_20231110.qza \
--p-classify--chunk-size 5000 \
--o-classifier COI_rCRUX_filt_20231110-classifier.qza
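
For readers without grep, awk, or seqkit at hand, steps 1 to 4 can be approximated in plain Python. The sketch below is an assumption-laden illustration, not a reproduction of grep-vf_Python.py (whose contents are not shown in this record): it drops taxonomy lines containing 'Insecta' or 'Amphibia', takes the first whitespace-delimited column as the sequence ID, keeps only the matching fasta records, and checks that both outputs list the same IDs before they would be imported into QIIME2. All function names are hypothetical.

```python
EXCLUDED_TERMS = ("Insecta", "Amphibia")


def filter_taxonomy(lines, terms=EXCLUDED_TERMS):
    """Steps 1-2: keep non-empty taxonomy lines that contain none of the terms."""
    return [ln.rstrip("\n") for ln in lines
            if ln.strip() and not any(term in ln for term in terms)]


def ids_to_keep(taxonomy_lines):
    """Step 3: first whitespace-delimited column of each line (like awk '{print $1}')."""
    return {ln.split()[0] for ln in taxonomy_lines if ln.strip()}


def filter_fasta(fasta_lines, keep):
    """Step 4: return (header, sequence) pairs whose ID (first word) is in keep."""
    records = []
    header, chunks = None, []
    for ln in fasta_lines:
        ln = ln.rstrip("\n")
        if ln.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = ln[1:], []
        elif header is not None:
            chunks.append(ln.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return [(h, s) for h, s in records if h.split()[0] in keep]


def ids_agree(taxonomy_lines, fasta_records):
    """Sanity check: both filtered outputs should cover exactly the same IDs."""
    return ids_to_keep(taxonomy_lines) == {h.split()[0] for h, _ in fasta_records}


if __name__ == "__main__":
    # Toy data with made-up IDs, only to show the call pattern.
    taxonomy = [
        "seqA\tEukaryota;Arthropoda;Insecta;Diptera",
        "seqB\tEukaryota;Chordata;Actinopteri;Perciformes",
    ]
    fasta = [">seqA", "ACGT", ">seqB", "TTGA", "CC"]
    kept_tax = filter_taxonomy(taxonomy)
    kept_seqs = filter_fasta(fasta, ids_to_keep(kept_tax))
    print(kept_tax, kept_seqs, ids_agree(kept_tax, kept_seqs))
```

Like the grep commands in step 1, this matches the terms as substrings anywhere in the line, so any lineage that merely contains one of the strings is also removed.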
 

 

Files

CO1_combined_derep_and_clean_taxonomy-noAmphibiaInsecta.txt

Files (877.3 MB)

Checksum  Size
md5:77815c4040ed90e75338fcc4c43eeff8  288.4 MB
md5:85f1a3d632a64bdfc97ac957faf8c3b4  94.4 MB
md5:48195f688d050f7407ce1c486d9d7a3c  451.1 MB
md5:4b407995951f2ca77b196a708ac1e088  31.8 MB
md5:9d3e9ea8bbceeb6347b1a450d6f859ed  11.6 MB
md5:99ea7c0b2efffce3c6d6b92b847b98fd  894 Bytes
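
After downloading, each file can be checked against the md5 values listed above. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and because the listing does not pair checksums with file names, the local path in the usage note is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:77815c40...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("downloaded_file.qza", "md5:77815c4040ed90e75338fcc4c43eeff8")` returns True only if the placeholder file `downloaded_file.qza` is byte-identical to the 288.4 MB file in this record.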

Additional details

Related works

Is variant form of
Dataset: 10.5281/zenodo.8407631 (DOI)

References

  • Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, Huttley GA, Caporaso JG. 2018. Optimizing taxonomic classification of marker gene sequences. Microbiome 6(1): 90. doi: https://doi.org/10.1186/s40168-018-0470-z.
  • Curd, E. E., Gal, L., Gallego, R., Silliman, K., Nielsen, S., & Gold, Z. (2023). rCRUX: A rapid and versatile tool for generating metabarcoding reference libraries in R. Environmental DNA (Hoboken, N.J.). https://doi.org/10.1002/edn3.489
  • Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in Zoology, 10(1), 34.