rDNA 18S V4 ASV tables (Swarm) for Tara Oceans Expedition (2009-2013), including Tara Polar Circle Expedition (2013)
Authors/Creators
- 1. CIRAD, UMR PHIM, F-34398, Montpellier, France
- 2. CNRS, FR 2424, ABiMS Platform, Station Biologique de Roscoff, Sorbonne Université, Roscoff, France
- 3. Sorbonne Université, CNRS, Station Biologique de Roscoff, AD2M, UMR 7144, 29680 Roscoff, France
Description
Reads were grouped into OTUs using the following swarm-based pipeline: paired-end reads were merged with vsearch’s --fastq_mergepairs command (version 2.15.1, allowing for staggered reads; Rognes et al., 2016) and trimmed with cutadapt (version 3.0; Martin, 2011), keeping only reads containing both forward and reverse primers. After trimming, the expected error per read was estimated with vsearch’s --fastq_filter command and its --eeout option. Each sample was then dereplicated (i.e., strictly identical reads were merged) using vsearch’s --derep_fulllength command, and converted into fasta format. Clustering was first performed at the sample level with swarm 3.0 using default parameters (Mahé et al., 2015). Prior to global clustering, the individual fasta files (one per sample) were pooled and dereplicated again with vsearch. Files containing per-read expected error values were also dereplicated, retaining only the lowest expected error for each unique sequence. Global clustering was then performed with swarm using the fastidious option. Cluster representative sequences were searched for chimeras with vsearch’s --uchime_denovo command using default parameters (Edgar et al., 2011).
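The pooling step described above, in which per-sample dereplicated sequences are combined while only the lowest expected error is kept for each unique sequence, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `expected_error` and `pool_and_dereplicate` and the input layout are hypothetical, and the expected error is computed with the standard definition (sum of per-base error probabilities derived from Phred scores) rather than by calling vsearch.

```python
def expected_error(phred_scores):
    """Expected number of errors in a read: sum of per-base error
    probabilities, where p = 10 ** (-Q / 10) for a Phred score Q."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def pool_and_dereplicate(records):
    """Pool (sequence, abundance, expected_error) records from all samples.

    Strictly identical sequences are merged: their abundances are summed
    and only the lowest expected error observed is retained.
    Returns a dict mapping sequence -> (total_abundance, lowest_expected_error).
    """
    pooled = {}
    for sequence, abundance, ee in records:
        if sequence in pooled:
            total, best_ee = pooled[sequence]
            pooled[sequence] = (total + abundance, min(best_ee, ee))
        else:
            pooled[sequence] = (abundance, ee)
    return pooled


# Hypothetical records, as if read from two per-sample files.
records = [
    ("ACGTACGT", 3, 0.010),  # sample A
    ("TTGACCAA", 1, 0.500),  # sample A
    ("ACGTACGT", 2, 0.002),  # sample B, same sequence with a lower expected error
]
pooled = pool_and_dereplicate(records)
```

In this toy input, the sequence seen in both samples ends up with a pooled abundance of 5 and the lower of its two expected error values.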
Clustering results, expected error values, taxonomic assignments, and chimera detection results were used to build a “raw” occurrence table. Reads without primers, reads shorter than 32 nucleotides, and reads with uncalled bases (“N”) were discarded. For the “filtered” occurrence table, only non-chimeric sequences with an expected error per nucleotide below 0.0002, belonging to clusters containing at least 2 reads, were retained. Since primer trimming is not perfect, some sequences can still contain primer fragments or be excessively trimmed. These sub- or super-sequences were identified using vsearch and merged with their closest, most abundant perfectly trimmed sequence. Finally, occurrence patterns across the sample collection were used to further refine the occurrence table. Clusters that contained sub-clusters differing by only a single nucleotide but showing different ecological patterns (defined here as uncorrelated abundance values in at least 5% of the samples) were split into distinct clusters (https://github.com/frederic-mahe/fred-metabarcoding-pipeline). Conversely, clusters with similar sequences that had correlated abundance values in at least 95% of the samples were merged using a re-implementation of lulu's method (Frøslev et al., 2017; https://github.com/frederic-mahe/mumu).
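The three criteria of the “filtered” table can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the `Cluster` record and its field names are hypothetical, the three criteria are assumed to apply jointly, and the expected error per nucleotide is assumed to be the expected error of the representative sequence divided by its length.

```python
from dataclasses import dataclass

MAX_EE_PER_NUCLEOTIDE = 0.0002  # threshold stated in the description
MIN_CLUSTER_READS = 2           # clusters must contain at least 2 reads


@dataclass
class Cluster:
    """Hypothetical record for one row of the raw occurrence table."""
    representative: str    # cluster representative sequence
    reads: int             # total number of reads in the cluster
    expected_error: float  # lowest expected error of the representative
    is_chimera: bool       # result of de novo chimera detection


def passes_filter(cluster):
    """Return True if the cluster satisfies all three filtering criteria."""
    ee_per_nucleotide = cluster.expected_error / len(cluster.representative)
    return (
        not cluster.is_chimera
        and ee_per_nucleotide < MAX_EE_PER_NUCLEOTIDE
        and cluster.reads >= MIN_CLUSTER_READS
    )


# Toy clusters with 100-nucleotide representatives.
seq = "ACGT" * 25
raw_table = [
    Cluster(seq, 10, 0.01, False),  # kept: 0.0001 errors per nucleotide
    Cluster(seq, 10, 0.05, False),  # dropped: 0.0005 errors per nucleotide
    Cluster(seq, 10, 0.01, True),   # dropped: flagged as chimeric
    Cluster(seq, 1, 0.01, False),   # dropped: only one read
]
filtered_table = [c for c in raw_table if passes_filter(c)]
```

Only the first toy cluster survives the filter; the other three each fail exactly one criterion.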