Global rRNA Universal Metabarcoding of Plankton database (GRUMP)

Williams, Nathan; McNichol, Jesse; Raut, Yubin; Fuhrman, Jed

doi:10.5281/zenodo.15586714

Published June 3, 2025 | Version 1.3.5

Dataset Open

Global rRNA Universal Metabarcoding of Plankton database (GRUMP)

1. USC

We introduce the Global rRNA Universal Metabarcoding Plankton database (McNichol and Williams et al., 2025), 1194 samples covering extensive latitudinal and longitudinal transects, including depth profiles, in all major ocean basins from 2003-2020. Unfractionated (>0.2 µm) seawater DNA samples were amplified using the 515Y/926R universal 3-domain rRNA primers, quantifying the relative abundance of amplicon sequencing variants (ASVs) from Bacteria, Archaea, and Eukaryotes with one denominator. Thus, the ratio of any organism (or group) to any other in a sample is directly comparable to the ratio in any other sample within the dataset, irrespective of gene copy number differences. This obviates a problem in prior global studies that used size-fractionation and different primers for prokaryotes and eukaryotes, precluding comparisons between abundances across size fractions or domains.

Sample Collection Samples were collected by multiple collaborations, which used slightly different sample collection techniques. These collection techniques will be outlined by individual cruise.

For the Atlantic Meridional Transects (AMT 19 and AMT 20), 5–10 L of whole seawater was collected from the sea surface using a Niskin bottle and was then filtered onto 0.22 µm Sterivex Durapore filters (Millipore Sigma, Burlington, MA, USA). Samples were collected by Stephanie Sargeant and Andy Rees, Plymouth Marine Laboratory (PML) as part of the Atlantic Meridional Transect (AMT) (Rees, Smyth and Brotas, 2024) research cruises 19 (2009) and 20 (2010) onboard UK research vessel RRS James Cook (JC039 and JC053 – (Rees, 2010a, 2010b). Sterivex filters were capped and stored in RNAlater® (ThermoFisher) at -80 °C until analysis.

For samples taken in the FRAM Strait (Wietz et al., 2021), whole seawater was collected using Remote Access Samplers (RAS; McLane) on seafloor moorings F4-S-1, HG-IV-S-1, Fevi-34, and EGC-5. Moorings were operated within the FRAM / HAUSGARTEN Observatory covering the West Spitsbergen Current, central Fram Strait, and East Greenland Current as well as the Marginal Ice Zone. RAS performed continuous, autonomous sampling from July 2016 – August 2017 in programmed intervals (weekly to monthly). Nominal deployment depths were 30 m (F4, HG-IV), 67 m (Fevi), and 80 m (EGC). However, vertical movements in the water column resulted in variable actual sampling depths, ranging from 25 to 150 m. Per sampling event, two lots of 500 mL of whole seawater was pumped into bags containing mercuric chloride for fixation. After RAS recovery, the two samples per sampling event were pooled, and approximately 700 mL of pooled water was filtered onto 0.22 µm Sterivex cartridges. Filtered samples were stored at -20°C until DNA extraction. For MOSAiC whole seawater was collected from the upper water via a rosette sampler equipped with Niskin bottles through a hole in the sea ice next to the RV Polarstern. If possible, duplicate samples, with two Niskins per depth were collected during the up-casts near the surface (~5 m), 10 m, chlorophyll max (~20–40 m), 50 m, and 100m. in these Niskins and. 1-4 litres was filtered on to Sterivex-filters (0.22 µm pore size) using a peristaltic pump in a temperature controlled lab at 1°C in the dark, only using red light. The number of Sterivex-filters used per sampling event varied between two during Polar Night and 3-4 during Polar day, depending on the biomass found in the samples. Sterivex filters with were stored at -80°C until further processing took place in the laboratory.

GEOTRACES cruises (Anderson et al., 2014), including transects GA02, GA03, GA10, and GP13, collected whole seawater using a Niskin bottle, filtering 100 mL of whole seawater between the surface and 5601 m onto 0.2 µm 25 mm polycarbonate filters. After filtration, 3 mL of sterile preservation solution (10 mM Tris, pH 8.0; 100 mM EDTA; 0.5 M NaCl) was added, and samples were stored in cryovials at -80°C until DNA extraction.

During the 2017 and 2019 SCOPE (Simons Collaboration on Ocean Processes and Ecology) - Gradients cruises, 0.7-4 L of whole seawater was collected at sea using the ships underway system, which is approximately 7 m below the surface, as well as the rosette sampler for depths between 15 – 125 m by Mary R. Gradoville, Brittany Stewart, and Esther Wing (Zehr lab) (Gradoville et al., 2020). This water was filtered onto 0.22 µm 25 mm Supor membrane filters (Pall Corporation, New York) and stored at -80°C until DNA extraction.

The collection of Southern Ocean transects include the 1) IND-2017 dataset, which were taken during the Totten Glacier-Sabrina Coast voyage in 2017 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2017_V01, 2) the Kerguelen-Axis Marine Science program (K-AXIS) in 2016 on the Australian Antarctic Division RV Aurora Australis 2015/16 voyage 3, 3) Global Ocean Ship-based Hydrographic Investigations Program (GO-SHIP) P15S cruise in 2016 as a part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V03, and 4) the Heard Earth-Ocean-Biosphere Interactions (HEOBI) voyage in 2016 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V01. For these cruises, 2 L of whole seawater was filtered onto 0.22 µm Sterivex-GP polyethersulfone membrane filters (Millipore). This water was collected from the ships underway during IND-2017 by Amaranta Focardi (Paulsen Lab, Macquarie University), from between 5 and 4625 m during K-AXIS by Bruce Deagle and Lawrence Clarke (Australian Antarctic Division), from between 5 and 6015 m during GO-SHIP P15S by Eric J. Raes, Swan LS Sow and Gabriela Paniagua Cabarrus (Environmental Genomics Team, CSIRO Environment), Nicole Hellessey (University of Tasmania) and Bernhard Tschitschko (University of New South Wales), and from between 7-3579 m during HEOBI by Thomas Trull (CSIRO Environment). After filtration, samples were stored at -80°C until analysis.

As part of GO-SHIP, there were several additional transects (i.e., I08S, I09N, P16 S/N), including some that also traversed into the Southern Ocean (i.e., I08S, P16S) or Arctic Ocean (P16N). For I08S and I09N, 2 L of whole seawater was filtered onto 0.22 µm 25 mm filters (Supor® hydrophilic polyethersulfone membrane) by Norm Nelson (I08S) and Elisa Halewood (I09N), UCSB, as part of the U.S. Global Ocean Ship-based Hydrographic Investigations Program aboard the R/V Roger Revelle during the cruises in 2007. Sucrose lysis buffer was added to filters, which were then stored at -80°C until DNA extraction. For P16N and P16S, samples were collected at various depths by Elisa Halewood and Meredith Meyers (Carlson Lab, UCSB) onto 0.22 µm 25 mm Supor filters during two latitudinal transects of the Pacific Ocean in 2005 and 2006 as part of the GO-SHIP repeat hydrography program (then known as CLIVAR). Samples were stored as partially extracted lysates in sucrose lysis buffer at -80°C until DNA extraction.

Finally, for samples from the Production Observations Through Another Trans-Latitudinal Oceanic Expedition (POTATOE) cruise, 20 L of whole seawater was collected from the sea surface between 1-2 m and filtered onto 0.22 µm Sterivex® filters during a “ship of opportunity” cruise on the RVIB Nathaniel B Palmer in 2003 (Baldwin et al., 2005). Sterivex filters were stored dry at -80°C until DNA extraction.

All datasets had corresponding environmental data. We included date, time, latitude, longitude, depth, temperature, salinity, oxygen for all transects, and nutrient data where available. However, some cruises have other environmental data which can be found at the British Oceanographic Data Centre (https://www.bodc.ac.uk/) for both AMT cruises, at the CSIRO National Collections and Marine Infrastructure Data Trawler (https://www.cmar.csiro.au/data/trawler/survey_details.cfm?survey=IN2016_V01) for IND-2017 and HEOBI, at the CLIVAR and Carbon Hydrographic Data Office (https://cchdo.ucsd.edu/) for GO-SHIP P15S, P16N and P16S, at the Australian Antarctic Division Datacenter (https://data.aad.gov.au/aadc/voyages/) for the K-AXIS cruise, at (https://doi.org/10.6075/J0CCHLY9https://doi.org/10.6075/J0CCHLY9) for the I08S and I09N cruises, at the MGDS (Marine Geoscience Data System: https://www.marine-geo.org) for POTATOE, at (https://scope.soest.hawaii.edu/data/gradients/documents/) for both SCOPE-Gradients cruises, and at PANGAEA (https://www.pangaea.de/) for FRAM Strait and MOSAiC. Finally, we have also used satellite data to estimate the euphotic zone depth where photosynthetic available radiation (PAR) is 1% of its surface value (Lee et al., 2007; Kirk, 2010). We approximated the euphotic zone depth using the light attenuation at 490nm (Kd 490) product and the relationship Z eu(1%) = 4.6/Kd 490. We also used the script Longhurst-Province-Finder (https://github.com/thechisholmlab/Longhurst-Province-Finder) to assign each sample to the Longhurst Province in which it was sampled in, another useful column to help subset data and investigate specific regions of the ocean.

DNA Extraction For AMT cruises, DNA was isolated using the Qiagen AllPrep DNA/RNA Mini kit (Hilden, Germany) with modifications to be compatible with RNAlater® and to disrupt cell membranes (Varaljay et al., 2015). Briefly, the filter was removed from the Sterivex housing and immersed in RLT+ buffer that had been amended with 10 µl 1N NaOH per 1ml buffer, followed by a 2 minute agitation in a Mini-Beadbeater-96 (Biospec Inc., Bartlesville, OK, USA) with 0.1- and 0.5 mm sterile glass beads (BioSpec).

DNA was isolated from the FRAM Strait and MOSAiC samples using a PowerWater kit (QIAGEN, Germany).

DNA from GEOTRACES was isolated and purified using a phenol/chloroform-based method by Steven Biller, Paul Berube, Keven Dooley, and Medline Williams (Chisholm lab, MIT) (Biller et al., 2018).

DNA from the SCOPE-Gradients cruises was isolated using the extraction method described by (Gradoville et al., 2020).

DNA from IND-2017, K-AXIS, HEOBI, and P15S was isolated and purified in 2017-2018, using a modified phenol:chloroform:isoamyl based extraction protocol of the DNeasy PowerWater Sterivex Kit (Brown et al., 2018; Raes et al., 2018). HEOBI samples were extracted by Swan LS Sow and Jodie van de Kamp in 2017 (Environmental Genomics Team, CSIRO Environment). P15S samples were extracted by DNA was isolated and purified by Swan LS Sow, Eric J. Raes, Bronwyn Holmes and Jodie van de Kamp in 2017-2018 (Environmental Genomics Team, CSIRO Environment). K-AXIS samples were extracted by Swan LS Sow and Bronwyn Holmes in 2017 (Environmental Genomics Team, CSIRO Environment). IND-2017 samples were extracted by Swan LS Sow (Environmental Genomics Team, CSIRO Environment).

For I08S, I09N, P16N, P16S, and POTATOE, the DNA was isolated and purified using the method described in (Boström et al., 2004), with modifications detailed in (Manganelli et al., 2009; Signori et al., 2014). An additional bead-beating step was added using 0.1mm glass beads (Biospec 11079101) for two minutes at maximum speed on a VWR Analog Vortex Mixer (58816-121). DNA was precipitated in low-EDTA buffer to protect from degradation and quantified using PicoGreen dsDNA assay (Thermo Fisher Scientific). A detailed extraction protocol is available at https://osf.io/k4c5u. To minimize potential contamination, all DNA extracts were done in molecular biology-grade plasticware with filter tips, and all PCR steps were conducted in plasticware that had been exposed to UV radiation for at least 10 minutes to inactivate any contaminating DNA templates.

Sequencing methods and pipeline Relative abundance information was generated by PCR amplification of extracted DNA using the 515Y/926R universal primer pair (Parada, Needham and Fuhrman, 2016) which amplify both 16S and 18S simultaneously. All samples except for FRAM, for which raw sequences were supplied, were amplified using the protocol available at (https://www.protocols.io/view/fuhrman-lab-515f-926r-16s-and-18s-rrna-gene-sequen-vb7e2rn). Briefly, PCR reaction reagent master mix was made by adding 12 µl of PCR water (VWR), 10 µl of 5’ Master Mix (0.5 U taq, 45 mM Kcl, 2.5 mM Mg2+, 200 µM each dNTP), 1.5 µl of 1:1, 515F:926R barcoded primer mix (0.3 mM each primer) and 1 µl of DNA per reaction for a final volume of 25 µl. For the samples taken on the I8S, I9N, P16N, P16S, AMT-19, AMT-20, POTATOE, GA02, GA03, GA10, GP13, GRADIENTS 2, and GRADIENTS 3 cruises, 5' master mix (Quantabio) was used, while for samples taken on cruises P15S, K-AXIS, IND-2017, and HEOBI, GoTaq master mix (Promega) was used and finally, for samples taken during the FRAM Strait and MOSAiC cruises U Phusion Polymerase (Thermo Fisher, Germany) was used. PCR was performed under the following conditions: initial denaturing at 95°C for 120 seconds, followed by 30 cycles of 95°C for 45 seconds, 50°C for 45 seconds, and then 68 °C for 90 seconds, and then a final elongation step at 68°C for 300 seconds. PCR products were then stored at 4 °C before being cleaned using the Agencourt AMPure XP PCR purification protocol, and then DNA concentration was quantified using the Pico-Green dsDNA Quant-iT Assay Kit. Given we used barcoded primers, samples were then pooled, and then this pool was cleaned and concentrated with SPRIselect (Beckman Coulter B23317) beads, and then this cleaned pool was quantified using the Qubit dsDNA HS Assay Kit. Finally, the ratio of 16S:18S was calculated with the Bioanalyzer Chip: High Sensitivity DNA Kit.

Most sequencing was done at Tufts Medical School using HiSeq Rapid Run technology (2 x 250 bp), except for FRAM and MOSAiC which were sequenced using MiSeq technology (2 x 300 bp) and GA03/GP13 which were sequenced at USC’s genome sequencing center Molecular Genomics Core, with the same parameters as Tufts. Sequences were demultiplexed manually (https://github.com/jcmcnch/demux-notes) and denoised to amplicon sequence variants (ASVs) with a custom analysis pipeline based on QIIME2 and DADA2 (Callahan et al., 2016; Bolyen et al., 2019) which can be found on GitHub (https://github.com/jcmcnch/eASV-pipeline-for-515Y-926R). The main difference between this pipeline and standard workflows is that it contains an initial 16S/18S splitting step, which is accomplished using bbsplit against curated 16S and 18S databases derived from SILVA and PR2. This results in two ''pools'' of data (16S and 18S rRNA) that are then denoised separately, and later merged if desired. Note that standard pipelines require a merging of forward and reverse reads where they overlap, which for these primers leads to removal of all 18S sequences because their forward and reverse 18S reads do not overlap. We used a universal trim length of 220 for forward and 180 for reverse 18S reads so that ASVs could be directly compared to each other across cruises (also because forward and reverse 18S reads do not overlap, so we needed a single overall length and concatenated forward + reverse reads to create intercomparable ASVs). The 16S forward and reverse do overlap, so trimming of them is not necessary, and 16S amplicons fully span the amplified region. Two corrections were made to both 16S and 18S ASV tables before they were merged. The first adjusts for a known bias against longer 18S sequences during Illumina sequencing (Yeh et al., 2021) and the second adjusts for random variations in sample quality using the DADA2 output statistics. Briefly, bioanalyzer traces were taken for all samples except the FRAM Strait and MOSAiC samples, to determine the 16S and 18S bioanalyzer fractions prior to sequencing, which was calculated by:

Bioanalyzer 16S fraction= (16S bioanalyzer (nmol/L))/(16S+18S bioanalyzer (nmol/L) )

Bioanalyzer 18S fraction= (18S bioanalyzer (nmol/L))/(16S+18S bioanalyzer (nmol/L) )

and this was considered the “expected” outcome from the sequencing run. The same fractions were then calculated after the bbsplit step (which sorts 16S and 18S sequences into separate bins) in the analysis pipeline (Figure 1), using the number of 16S and 18S sequences measured by the sequencer.

These ratios were calculated by:

16S sequence fraction= (Total 16S sequences)/(Total 16S+total 18S sequences)

18S sequence fraction= (Total 18S sequences)/(Total 16S+total 18S sequences)

Once both the bioanalyzer ratios and the sequence ratios for 16S and 18S were obtained, we then calculated the correction factor for each sequence type by:

16S correction factor= (16S bioanalyzer fraction)/(16S sequence fraction)

18S correction factor= (18S bioanalyzer fraction)/(18S sequence fraction)

Finally, both the 16S and 18S data were multiplied by their respective correction factor using this script (https://github.com/Nwilliams96/Project-2-Universal-Primer-Pipeline), and for FRAM Strait and MOSAiC data, which did not have bioanalyzer traces, we used the same correction factor used for the Southern Ocean samples, as it was predicted they would have a similar level of Eukaryotic bias. The final dataset contains merged abundance information from both 16S and 18S ASVs using the same denominator. Correction factors and calculations are available in Supplementary Table 1 (McNichol and Williams et al., 2025).

In the final dataset, meta barcode information was derived from organisms that range from small Bacteria and Archaea (16S) to protists (photosynthetic and non-photosynthetic; 16S chloroplasts & nuclear 18S) as well as larger planktonic organisms such as arthropods, salps, and jellyfish (nuclear 18S) which are derived from tissue fragments, larvae, eggs, etc. that were captured on 0.2 µm filters. Data users should note that most photosynthetic protists (with the exception of dinoflagellates) will be represented by both chloroplast 16S and nuclear 18S. The taxonomy levels are joined from three databases: PR2 (Vaulot, 2023), SILVA (Quast et al., 2012) and ProPortal (Kelly et al., 2012). The levels are as follows: Domain – SILVA Level 1 Taxonomy and PR2 Level 1 Taxonomy, Supergroup – PR2 Level 2 Taxonomy, Division – PR2 Level 3 Taxonomy, Phylum – SILVA Level 2 Taxonomy, Class - SILVA Level 3 Taxonomy and PR2 Level 4 Taxonomy, Order - SILVA Level 4 Taxonomy and PR2 Level 5 Taxonomy, Family - SILVA Level 5 Taxonomy and PR2 Level 6 Taxonomy, Genus - SILVA Level 6 Taxonomy and PR2 Level 7 Taxonomy, Species - SILVA Level 7 Taxonomy and PR2 Level 8 Taxonomy, Proportal_ASV_Ecotype – ProPortal assignment.

Scripts necessary to exactly reproduce the analysis are available at https://github.com/jcmcnch/Global-rRNA-Univeral-Metabarcoding-of-Plankton, with modifications to the merge script which can be found at https://github.com/Nwilliams96/Project-2-Universal-Primer-Pipeline-Modifications.

Files

grump_asv_long_version-1.3.5.csv

Files (1.3 GB)

Name	Size	Download all
grump_asv_long_version-1.3.5.csv md5:46935caefc9b643d62adc5f5c2b61b8c	1.3 GB	Preview Download

Additional details

U.S. National Science Foundation
EF-2125142
Simons Foundation
Computational Biogeochemical Modeling of Marine Ecosystems (CBIOMES) 549943

Programming language: Python , R

Anderson, R.F. et al. (2014) 'GEOTRACES: Changing the Way We Explore Ocean Chemistry', Oceanography, 27(1), pp. 50–61.
Baldwin, A. et al. (2005) 'Microbial diversity in a Pacific Ocean transect from the Arctic to Antarctic circles', Aquatic Microbial Ecology, 41, pp. 91–102. Available at: https://doi.org/10.3354/ame041091.
Biller, S.J. et al. (2018) 'Marine microbial metagenomes sampled across space and time', Scientific Data, 5(1), p. 180176. Available at: https://doi.org/10.1038/sdata.2018.176.
Bolyen, E. et al. (2019) 'Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2', Nature Biotechnology, 37(8), pp. 852–857. Available at: https://doi.org/10.1038/s41587-019-0209-9.
Boström, K.H. et al. (2004) 'Optimization of DNA extraction for quantitative marine bacterioplankton community analysis', Limnology and Oceanography: Methods, 2(11), pp. 365–373. Available at: https://doi.org/10.4319/lom.2004.2.365.
Brown, M.V. et al. (2018) 'Systematic, continental scale temporal monitoring of marine pelagic microbiota by the Australian Marine Microbial Biodiversity Initiative', Scientific Data, 5(1), p. 180130. Available at: https://doi.org/10.1038/sdata.2018.130.
Callahan, B.J. et al. (2016) 'DADA2: High-resolution sample inference from Illumina amplicon data', Nature Methods, 13(7), pp. 581–583. Available at: https://doi.org/10.1038/nmeth.3869.
Gradoville, M.R. et al. (2020) 'Latitudinal constraints on the abundance and activity of the cyanobacterium UCYN-A and other marine diazotrophs in the North Pacific', Limnology and Oceanography, 65(8), pp. 1858–1875. Available at: https://doi.org/10.1002/lno.11423.
Raes, E.J. et al. (2018) 'Marine bacterial richness increases towards higher latitudes in the eastern Indian Ocean', Limnology and Oceanography Letters, 3(1), pp. 10–19. Available at: https://doi.org/10.1002/lol2.10058.
Rees, A.P. (2010a) AMT19 Cruise Report. Plymouth Marine Laboratory. Available at: https://www.plymsea.ac.uk/id/eprint/9514/ (Accessed: 26 November 2024).
Rees, A.P. (2010b) AMT20 Cruise Report. Plymouth Marine Laboratory. Available at: https://www.plymsea.ac.uk/id/eprint/9515/ (Accessed: 26 November 2024).
Rees, A.P., Smyth, T.J. and Brotas, V. (2024) 'Editorial: The Atlantic Meridional Transect programme (1995-2023)', Frontiers in Marine Science, 11. Available at: https://doi.org/10.3389/fmars.2024.1358174.
Varaljay, V.A. et al. (2015) 'Single-taxon field measurements of bacterial gene regulation controlling DMSP fate', The ISME Journal, 9(7), pp. 1677–1686. Available at: https://doi.org/10.1038/ismej.2015.23.
Wietz, M. et al. (2021) 'The polar night shift: seasonal dynamics and drivers of Arctic Ocean microbiomes revealed by autonomous sampling', ISME Communications, 1(1), p. 76. Available at: https://doi.org/10.1038/s43705-021-00074-4.
Yeh, Y. et al. (2021) 'Comprehensive single-PCR 16S and 18S rRNA community analysis validated with mock communities, and estimation of sequencing bias against 18S', Environmental Microbiology, 23(6), pp. 3240–3250. Available at: https://doi.org/10.1111/1462-2920.15553.
McNichol, J and Williams, NLR. et al. (2025) 'Characterizing organisms from three domains of life with universal primers from throughout the global ocean', bioRxiv, p. 2025.02.19.638942. Available at: https://doi.org/10.1101/2025.02.19.638942.

	All versions	This version
Views	562	234
Downloads	386	104
Data volume	553.4 GB	141.9 GB

grump_asv_long_version-1.3.5.csv

Files (1.3 GB)

Funding

Software

References

Global rRNA Universal Metabarcoding of Plankton database (GRUMP)

Authors/Creators

Description

Files

grump_asv_long_version-1.3.5.csv

Files (1.3 GB)

Additional details

Funding

Software

References