CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data

DOI: 10.5281/zenodo.3668497
URL: https://zenodo.org/records/3668497
OAI identifier: oai:zenodo.org:3668497

Authors:
  Marcelino, V. R. (The University of Sydney)
  Clausen, P.T.C.L (Technical University of Denmark)
  Buchmann, J.P. (The University of Sydney)
  Wille, M. (The Peter Doherty Institute for Infection and Immunity)
  Iredell, J.R. (The University of Sydney)
  Meyer, W. (The University of Sydney)
  Lund, O. (Technical University of Denmark)
  Sorrell, T.C. (The University of Sydney)
  Holmes, E.C. (The University of Sydney)

Publisher: Zenodo
Publication year: 2019
Keywords: metagenomics, microbiome
Dates: 2019-05-18, 2020-02-15
Language: eng
Related identifiers: 10.1101/641332, 10.5281/zenodo.3668496
Version: 1.0.0
License: Creative Commons Attribution 4.0 International

CCMetagen is software for identifying taxa in metagenome data. This repository contains CCMetagen version 1.0.0, which was benchmarked against other software in the original CCMetagen publication.

High-throughput sequencing of DNA and RNA from environmental and host-associated samples (metagenomics and metatranscriptomics) is a powerful tool to assess which organisms are present in a sample. Taxonomic identification software usually aligns individual short sequence reads to a reference database, sometimes containing taxa with complete genomes only. This is a challenging task given that different species can share identical sequence regions and complete genome sequences are only available for a fraction of organisms. A recently developed approach to map sequence reads to reference databases involves weighing all high-scoring read mappings to the database as a whole to produce better-informed alignments. We used this novel concept in read mapping to develop a highly accurate metagenomic classification pipeline named CCMetagen.
Our pipeline substantially outperforms other commonly used software in identifying bacteria and fungi, and can efficiently use the entire NCBI nucleotide collection as a reference to detect species with incomplete genome data from all biological kingdoms. CCMetagen is user-friendly and the results can be easily integrated into microbial community analysis software for streamlined and automated microbiome studies.
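
The idea of weighing all high-scoring read mappings against the database as a whole can be illustrated with a toy two-pass scheme. The sketch below is not CCMetagen's or its underlying mapper's actual implementation; the function name `resolve_ambiguous_mappings`, the input layout, and the scores are hypothetical and chosen only for illustration. In the first pass each reference template accumulates the scores of the reads for which it is a top hit; in the second pass a read that ties between several templates is assigned to the one with the most database-wide support.

```python
from collections import defaultdict


def resolve_ambiguous_mappings(read_hits):
    """Toy illustration only (hypothetical helper, not CCMetagen code).

    read_hits maps read_id -> {template_name: alignment_score}.
    Returns read_id -> the template chosen for that read.
    """
    template_support = defaultdict(int)
    best_hits = {}

    # Pass 1: each template collects the score of every read
    # for which it is one of the top-scoring hits.
    for read_id, hits in read_hits.items():
        if not hits:
            continue  # reads without any hit stay unassigned
        top_score = max(hits.values())
        tied = sorted(t for t, s in hits.items() if s == top_score)
        best_hits[read_id] = tied
        for template in tied:
            template_support[template] += top_score

    # Pass 2: ties are broken by database-wide support; remaining
    # ties fall back to alphabetical order because `tied` is sorted.
    return {
        read_id: max(tied, key=lambda t: template_support[t])
        for read_id, tied in best_hits.items()
    }


# Hypothetical scores: r1 and r3 match species_A and species_B equally well.
example = {
    "r1": {"species_A": 50, "species_B": 50},
    "r2": {"species_A": 48},
    "r3": {"species_A": 50, "species_B": 50},
    "r4": {"species_B": 30, "species_C": 45},
}
assignment = resolve_ambiguous_mappings(example)
```

In this example the ambiguous reads r1 and r3 are assigned to species_A, because the unambiguous read r2 gives species_A more overall support than species_B. A read-by-read classifier that ignores the other reads would have no principled way to choose between the two.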