CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities


	Implementation

The pipeline takes as input a folder with paired-end reads in the form of fastq files and a database of annotated microbial 16S sequences (for instance, Greengenes [5] or SILVA [6]) in the form of fasta files. If the quality of the reads was assessed beforehand, the lengths of the “good” parts of the R1 and R2 reads can be provided as input parameters. The appropriate lengths depend on the overall quality profile of the reads and on their number, but in general the quality score within the “good” part should exceed 25 with no sudden drops. Only these parts of the reads are considered for sequence clustering. The default lengths of the “good” parts for R1 and R2 are 200 and 180 bp, respectively.

Reads are then placed into a separate folder for each pair, and a sample file listing the read names for all pairs is created. The 16S-ref-db-PE-splice.pl utility is then used to cut the database 16S sequences into fragments corresponding in length and quantity to a randomly selected sample. Read sequences are then filtered by quality using Trimmomatic [7].

Filtered sequences are clustered at 99% similarity using the cd-hit-est utility in order to discard chimeric reads. An important feature of CD-HIT-OTU-MiSeq is that the R1 reads of all pairs are clustered together, separately from the R2 reads, and the R2 reads are then clustered with the same parameters. Thus the reads of a pair do not need to be merged or concatenated. A chimeric read can be detected when the two reads of a single pair vote for non-matching clusters of the two clusterings (and these clusters are large enough). The remaining reads and the fragments from the annotated database are then clustered at 97% similarity to yield annotated clusters, commonly named Operational Taxonomic Units, or OTUs. Clustering 16S reads at 97% similarity results in read groups corresponding to species or close taxonomic levels. Clusters that match sequences from the annotated database receive the corresponding annotation, which is written into the output OTU file.
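The pair-voting idea behind chimera detection can be illustrated with a short Python sketch. It is not the CD-HIT-OTU-MiSeq implementation: the function name, the input dictionaries, the min_cluster_size threshold and the rule that two clusters "match" when one is the most frequent partner of the other are all assumptions made for illustration.

```python
from collections import Counter


def flag_chimeric_pairs(r1_cluster, r2_cluster, min_cluster_size=3):
    """Return IDs of pairs whose R1 and R2 reads vote for non-matching clusters.

    r1_cluster, r2_cluster: dicts mapping a pair ID to the cluster ID assigned
    to its R1 (resp. R2) read by two independent clusterings.
    min_cluster_size: hypothetical threshold for "large enough" clusters.
    """
    pair_ids = r1_cluster.keys() & r2_cluster.keys()
    r1_sizes = Counter(r1_cluster[p] for p in pair_ids)
    r2_sizes = Counter(r2_cluster[p] for p in pair_ids)
    combo_counts = Counter((r1_cluster[p], r2_cluster[p]) for p in pair_ids)

    # Assumed matching rule: each R1 cluster is matched to the R2 cluster
    # that most of its pairs fall into, and vice versa.
    best_r2, best_r1 = {}, {}
    for (c1, c2), n in sorted(combo_counts.items()):  # sorted for determinism
        if n > combo_counts.get((c1, best_r2.get(c1)), 0):
            best_r2[c1] = c2
        if n > combo_counts.get((best_r1.get(c2), c2), 0):
            best_r1[c2] = c1

    chimeric = set()
    for p in pair_ids:
        c1, c2 = r1_cluster[p], r2_cluster[p]
        large = (r1_sizes[c1] >= min_cluster_size
                 and r2_sizes[c2] >= min_cluster_size)
        mismatched = best_r2[c1] != c2 and best_r1[c2] != c1
        if large and mismatched:
            chimeric.add(p)
    return chimeric


# Toy data: four pairs agree on (A, X), four on (B, Y), one mixes A with Y.
r1 = {f"p{i}": "A" for i in range(1, 5)}
r1.update({f"p{i}": "B" for i in range(5, 9)})
r1["p9"] = "A"
r2 = {f"p{i}": "X" for i in range(1, 5)}
r2.update({f"p{i}": "Y" for i in range(5, 9)})
r2["p9"] = "Y"

print(flag_chimeric_pairs(r1, r2))
```

In this toy data the pair p9 combines an R1 read from cluster A with an R2 read from cluster Y, although A is predominantly paired with X and Y with B, so it is the only pair flagged. Raising min_cluster_size above the cluster sizes suppresses the call, mirroring the requirement that the conflicting clusters be large enough.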
