Published July 3, 2018 | Version v1
Journal article Open

A Pipeline for Deep Learning with Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens

  • 1. University of Florida, Gainesville, United States of America|iDigBio, Gainesville, United States of America
  • 2. University of Florida, Gainesville, United States of America
  • 3. Brigham Young University, Provo, United States of America
  • 4. Smithsonian Institution, Washington, United States of America

Description

iDigBio (Matsunaga et al. 2013) currently references over 22 million media files and stores approximately 120 terabytes of those files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge: transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively.
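To make the "simple image transformations" mentioned above concrete, the following is a minimal sketch of an intensity histogram and a nearest-neighbour resize. It assumes a grayscale image held as a nested list of 0-255 integers; the function names, the tiny sample image, and the bin count are illustrative assumptions, not part of the iDigBio tooling.

```python
def intensity_histogram(pixels, bins=4, max_value=255):
    """Count pixels falling into equal-width intensity bins."""
    bin_width = (max_value + 1) / bins
    counts = [0] * bins
    for row in pixels:
        for value in row:
            # Clamp so that max_value lands in the last bin.
            index = min(int(value / bin_width), bins - 1)
            counts[index] += 1
    return counts


def resize_nearest(pixels, new_height, new_width):
    """Resize by nearest-neighbour sampling (no interpolation)."""
    old_height = len(pixels)
    old_width = len(pixels[0])
    return [
        [
            pixels[r * old_height // new_height][c * old_width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


# Hypothetical 4x4 grayscale image used only for illustration.
image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [10, 70, 130, 250],
    [10, 70, 130, 250],
]

print(intensity_histogram(image))   # [4, 4, 4, 4]
print(resize_nearest(image, 2, 2))  # [[0, 128], [10, 130]]
```

Operations of this kind run comfortably on a laptop; in practice an imaging library would replace the hand-written loops.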

Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. In front of this architecture we have placed a Jupyter notebook server, which gives end users an easy environment, with deep learning libraries for Python already loaded, in which to write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub.
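The idea of applying user-defined processing to a subset of images can be sketched roughly as follows. This is not the GUODA code: the record fields (`uuid`, `family`, `image_bytes`), the helper names, and the sample records are hypothetical. The Spark variant uses only the standard `SparkSession` and RDD `parallelize`/`map`/`collect` calls and assumes a configured Spark installation, so a plain local runner is included for trying the same function without a cluster.

```python
def select_subset(records, predicate):
    """Keep only the records the user is interested in."""
    return [record for record in records if predicate(record)]


def describe_media(record):
    """Hypothetical user-defined step: report the size of each image."""
    return {"uuid": record["uuid"], "size_bytes": len(record["image_bytes"])}


def run_local(records, fn):
    """Apply fn to every record in the current process."""
    return [fn(record) for record in records]


def run_on_spark(records, fn, app_name="image-pipeline-sketch"):
    """Apply fn to every record in parallel (assumes pyspark is installed)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    try:
        return spark.sparkContext.parallelize(records).map(fn).collect()
    finally:
        spark.stop()


# Hypothetical records standing in for media entries and their bytes.
records = [
    {"uuid": "a1", "family": "Rosaceae", "image_bytes": b"\x00" * 10},
    {"uuid": "b2", "family": "Poaceae", "image_bytes": b"\x00" * 20},
    {"uuid": "c3", "family": "Rosaceae", "image_bytes": b"\x00" * 30},
]

subset = select_subset(records, lambda r: r["family"] == "Rosaceae")
results = run_local(subset, describe_media)
print(results)
```

Keeping the per-record function free of any Spark dependency means the same function can be tested in a notebook cell first and then handed to the cluster runner unchanged.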

As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to classify additional images in iDigBio, illustrating how these techniques can be applied to broad image corpora, potentially to notify other data publishers of contamination. We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
