Published August 10, 2023 | Version v1.1.0
Dataset Open

Data from: A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics

  • 1. National Ecological Observatory Network
  • 2. Duke University School of Medicine

Description

Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).

Fish-AIR:
This is the dataset downloaded from Fish-AIR, filtered for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about the fish images, their quality, and the paths for downloading them. The data download ARK ID is dtspz368c00q (2023-04-05). The following files are unaltered from the Fish-AIR download:

extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.

imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.

multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.

meta.xml: An XML file with metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
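
As an illustration of how column indices might be read from a file like meta.xml, here is a minimal Python sketch. It is not the fish-air.R script. It assumes a layout in which each table element lists its file location and a set of field elements carrying "index" and "term" attributes; the sample XML, the example.org URIs, and the helper names (local, column_indices) are all hypothetical, and the real meta.xml may be organized differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical meta.xml fragment (assumed layout, not copied from the dataset).
SAMPLE_META = """<archive xmlns="http://example.org/ns">
  <core>
    <files><location>multimedia.csv</location></files>
    <field index="0" term="http://example.org/terms/ARKID"/>
    <field index="1" term="http://example.org/terms/parentARKID"/>
    <field index="2" term="http://example.org/terms/accessURI"/>
  </core>
</archive>"""


def local(tag):
    """Strip an XML namespace prefix, e.g. '{uri}field' -> 'field'."""
    return tag.rsplit("}", 1)[-1]


def column_indices(meta_xml, file_name):
    """Map each column name (last segment of the term URI) to its index."""
    root = ET.fromstring(meta_xml)
    for table in root:
        locations = [el.text for el in table.iter() if local(el.tag) == "location"]
        if file_name in locations:
            return {
                el.get("term").rsplit("/", 1)[-1]: int(el.get("index"))
                for el in table.iter()
                if local(el.tag) == "field"
            }
    raise KeyError(file_name)


indices = column_indices(SAMPLE_META, "multimedia.csv")

# Use the extracted index to pull one column from a (hypothetical) CSV body.
sample_csv = "ARKID,parentARKID,accessURI\nabc123,par456,https://example.org/img.jpg\n"
rows = list(csv.reader(io.StringIO(sample_csv)))
access_uris = [row[indices["accessURI"]] for row in rows[1:]]
```

Looking columns up by index rather than by header text keeps the reading step tied to the persistent identifiers, which is useful if header labels change between downloads.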

The outputs from the Minnow_Segmented_Traits workflow are:

sampling.df.seg.csv: Table tallying the image data sampled per species during data cleaning and data analysis. This is used in Table S1 in Balk et al.

presence.absence.matrix.csv: The uncleaned presence-absence matrix from segmentation. It is the result of combining the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.
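
The combining step can be pictured with a short Python sketch. This is not the workflow's own code: the structure of presence.json is assumed here to be a flat mapping from trait name to a presence count, and the image identifiers, trait names, and the build_matrix helper are hypothetical.

```python
import csv
import io
import json

# Hypothetical per-image presence.json contents, keyed by image identifier.
presence_files = {
    "image_001": '{"dorsal_fin": 1, "eye": 1, "pelvic_fin": 0}',
    "image_002": '{"dorsal_fin": 1, "eye": 0, "pelvic_fin": 1}',
}


def build_matrix(files):
    """Combine per-image trait records into one presence-absence table."""
    parsed = {name: json.loads(text) for name, text in files.items()}
    # Union of all trait names, sorted for a stable column order.
    traits = sorted({trait for record in parsed.values() for trait in record})
    header = ["image"] + traits
    rows = [
        [name] + [int(bool(record.get(trait, 0))) for trait in traits]
        for name, record in sorted(parsed.items())
    ]
    return header, rows


header, rows = build_matrix(presence_files)

# Serialize the table as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
matrix_csv = buffer.getvalue()
```

A trait missing from a given record is treated as absent (0) in this sketch; whether the actual workflow does the same is not stated in this record.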

heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.
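
The two summary statistics behind such heatmaps can be sketched in Python as follows. The records, the grouping by species and trait, the pixel units, and the use of the sample standard deviation are all assumptions made for illustration; the values are invented and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (species label, trait, area of the largest blob in pixels).
records = [
    ("species_a", "dorsal_fin", 100.0),
    ("species_a", "dorsal_fin", 120.0),
    ("species_a", "eye", 30.0),
    ("species_b", "dorsal_fin", 90.0),
    ("species_b", "dorsal_fin", 110.0),
    ("species_b", "dorsal_fin", 100.0),
]

# Group areas by (species, trait).
areas = defaultdict(list)
for species, trait, area in records:
    areas[(species, trait)].append(area)

# Mean and sample standard deviation per group; the standard deviation is
# undefined for a single observation, so it is recorded as None there.
summary = {
    key: (mean(values), stdev(values) if len(values) > 1 else None)
    for key, values in areas.items()
}
```

Each (species, trait) pair would correspond to one heatmap cell, with the mean feeding one figure and the standard deviation the other.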

minnow.filtered.from.iqm.csv: Fish image data set after filtering (see the methods in Balk et al. for the filter categories).

burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.

Notes

All information needed to run this workflow and generate these files can be found in the README file of the Minnow_Segmented_Traits repository.

Files (22.2 MB)

Checksum Size
md5:8281fe4f1eec17216c0a1b5bf35aa98f 64.6 kB
md5:0c8f2c5b325a6f7127d396e1dd116d91 5.3 MB
md5:f30ada868c58e4b559b56fbbb2fc3186 169.4 kB
md5:8492e19d50bf2fd5fc46c6aa4eeb3fc2 145.4 kB
md5:10d8959d50848081846fe8ced56606fa 4.2 MB
md5:75468f2b0736ffb69ae34be3243e6383 5.3 kB
md5:e912e0a430e7e34551a5790d6bde0cef 482.6 kB
md5:c680fdcd015949ea36e7a8339baaedc9 11.7 MB
md5:6f330b759bc249a3eedc3db62b94d048 26.6 kB
md5:6bdabf87c2bbbe6fb07f3725fb69b8f1 80.0 kB
md5:f8ba1df24ef82054c4fe32a0e1eb1008 3.4 kB
md5:921ebeac4f51b420155f7745cd456a29 427 Bytes
md5:9c66303a30496021c2038216e33c0412 320 Bytes
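
The md5 values listed for each file can be used to check a download for corruption. Below is a minimal Python sketch using the standard hashlib module. The helper names (md5_of, matches) are hypothetical, and the demonstration runs on a small temporary file rather than on a file from this record.

```python
import hashlib
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, listed):
    """Compare a file's digest with a listed value such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected


# Demonstration on a temporary file whose content is the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as demo:
    demo.write(b"hello")
    demo_path = demo.name

ok = matches(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
```

For a real check, pass the path of a downloaded file together with the checksum listed for it. Reading in chunks keeps memory use low for the larger files.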

Additional details

Funding

U.S. National Science Foundation
HDR Institute: Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning 2118240
U.S. National Science Foundation
Collaborative Research: Biology-guided neural networks for discovering phenotypic traits 2022042