Schmid, Moritz S
Daprano, Dominic
Jacobson, Kyler M
Sullivan, Christopher
Briseño-Avena, Christian
Luo, Jessica Y
Cowen, Robert K
2021-05-06
<p><strong>Abstract</strong></p>
<p>Scientific imaging (e.g., satellite observation of ocean color, medical imaging) can produce vast quantities of data that need to be processed on time frames similar to those of data collection. While satellite imaging has many advantages, satellite sensors cannot penetrate the ocean’s surface by more than a few meters. To that end, underwater imaging systems have been developed over the last 40+ years that can image organisms in situ in hundreds of meters of water. Underwater imaging systems include those designed for benthic studies (e.g., corals) as well as instruments that document the pelagic realm (e.g., plankton and fish). As an example, we use the In-situ Ichthyoplankton Imaging System (ISIIS), which collects upwards of 14 million images per hour of deployment; in highly productive waters this number can increase up to tenfold. A typical cruise consisting of 70 hours of ISIIS deployment can yield upwards of 1 billion images of plankton and particles. This big data problem can only be solved with a high-throughput processing pipeline that can be scaled up or down depending on the available resources. Thus, we designed a modular Python-based pipeline that can be deployed on local high-performance computing (HPC) infrastructure, such as a university’s HPC, as well as on cloud providers. The code provided with this documentation was optimized for Oregon State University’s Center for Genomic Research and Biocomputing (CGRB) as well as for the National Science Foundation’s Extreme Science and Engineering Discovery Environment (XSEDE), but can easily be adapted to the user’s needs. This code and documentation enable 1) training a sparse Convolutional Neural Network (sCNN) and 2) applying the sCNN in a processing pipeline to classify all remaining images in an automated fashion. Standard size measurements of the plankton and particles in the segmented images are also taken as part of the pipeline.
The pipeline is optimized for speed and can classify upwards of 30 million images per hour on XSEDE Comet GPU compute nodes. End-to-end processing of 1 hour’s worth of raw imagery data (ca. 14 million images) using XSEDE CPU and GPU nodes takes ca. 2.4 hours, including data upload, segmentation, classification, and obtaining standard length measurements. This enables us to process a typical cruise of ten 7-hour transects (70 hours of imagery, ca. 168 hours of processing) in about a week. A training library of images as well as a video test dataset are supplied with the code. While the pipeline was built for ISIIS images, it can also be used with imagery from other underwater systems and from other areas of science.</p>
<p><strong>Cite as</strong></p>
<p>Schmid MS, Daprano D, Jacobson KM, Sullivan CM, Briseño-Avena C, Luo JY, Cowen RK. 2021. A Convolutional Neural Network based high-throughput image classification pipeline - code and documentation to process plankton underwater imagery using local HPC infrastructure and NSF’s XSEDE. [Software]. Zenodo. <a href="http://dx.doi.org/10.5281/zenodo.4641158">http://dx.doi.org/10.5281/zenodo.4641158</a></p>
This project was funded by the National Science Foundation under grant numbers OCE-1737399 and OCE-1419987, the National Aeronautics and Space Administration under grant number 80NSSC20M0008, the Belmont Forum (through NSF grant number 1927710), as well as the Extreme Science and Engineering Discovery Environment (XSEDE) under grant number OCE170012.
https://doi.org/10.5281/zenodo.4641158
oai:zenodo.org:4641158
eng
Zenodo
https://doi.org/10.4319/LOM.2008.6.126
https://doi.org/10.1002/lom3.10285
https://doi.org/10.1038/s41598-020-57879-x
https://zenodo.org/communities/ai_ml
https://zenodo.org/communities/biologicaloceanography
https://doi.org/10.5281/zenodo.4641157
info:eu-repo/semantics/openAccess
GNU General Public License v2.0 or later
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
biological oceanography
plankton ecology
ichthyoplankton
data science
machine learning
convolutional neural net
deep learning
A Convolutional Neural Network based high-throughput image classification pipeline - code and documentation to process plankton underwater imagery using local HPC infrastructure and NSF's XSEDE
info:eu-repo/semantics/other