Published December 4, 2024 | Version v1
Dataset Open

Insect DNA Barcode and Image Dataset

Description

The data used in our experiments were obtained from the Barcode of Life Data System (BOLD), a cloud-based data storage and analysis platform developed at the Centre for Biodiversity Genomics in Canada. The file insect_dataset.mat consists of 32424 image samples of insect species from four Insecta orders (Diptera, Coleoptera, Lepidoptera and Hymenoptera), each associated with the DNA barcode sequence of that sample. The file unseen_insect_dataset.mat consists of 40050 image samples of insects from the same orders. None of these samples has a species assigned in the BOLD System, only a genus, so they represent species that were genuinely unclassified at the time the dataset was created. Each of these samples is likewise associated with its DNA barcode sequence.

 

Description of insect_dataset.mat:

# Insect Dataset

* all_images: vector containing the 32424 pre-normalized 64x64x3 (RGB) images of the insects
* all_dnas: vector containing the 32424 DNA barcodes in one-hot encoding 658x5
* all_labels: vector containing the species label for the corresponding DNA and image
* all_boldids: vector of strings containing the sample IDs from boldsystemsv3 (https://v3.boldsystems.org/); they can be used to download the original DNA barcodes, the full-size images and other data related to each sample from boldsystems
* train_loc: indices of the training samples in all_dnas, all_images, all_labels, all_boldids
* val_seen_loc: indices of the validation samples in all_dnas, all_images, all_labels, all_boldids that contain described (seen) species
* val_unseen_loc: indices of the validation samples in all_dnas, all_images, all_labels, all_boldids that contain undescribed (unseen) species
* test_seen_loc: indices of the test samples in all_dnas, all_images, all_labels, all_boldids that contain described (seen) species
* test_unseen_loc: indices of the test samples in all_dnas, all_images, all_labels, all_boldids that contain undescribed (unseen) species
* species2genus: vector whose entry at index i is the genus label of the species with label i (i.e. species i has genus species2genus[i])
* described_species_labels_train: vector containing the labels of species that appear in the training set
* described_species_labels_trainval: vector containing the labels of species that appear in the training set and/or the validation set
* all_dna_features_cnn_original: vector of features extracted from DNA nucleotides with the method of Badirli, S., Picard, C. J., Mohler, G., Richert, F., Akata, Z., & Dundar, M. (2023). Classifying the unknown: Insect identification with deep hierarchical Bayesian learning. Methods in Ecology and Evolution, 14, 1515-1530. https://doi.org/10.1111/2041-210X.14104
* all_image_features_resnet: vector of features extracted from the insect images with a pretrained ResNet-101, following the method of the same paper as all_dna_features_cnn_original
* all_dna_features_cnn_new: vector of features extracted from DNA nucleotides with our CNN
* all_image_features_gan: vector of features extracted from the insect images with our method using a ReACGAN
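
The loc fields above index into the all_* vectors, so selecting a split amounts to an index lookup. The following is a minimal sketch in Python, assuming the fields have already been loaded into plain sequences (the loading step is omitted) and that the stored indices are 1-based, as noted for this dataset. The helper select_split and the toy values are hypothetical and not part of the dataset.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_split(values: Sequence[T], loc_one_based: Sequence[int]) -> List[T]:
    """Return the entries of `values` addressed by 1-based indices.

    `values` stands in for a vector such as all_labels, and
    `loc_one_based` for an index vector such as train_loc.
    """
    selected = []
    for raw_index in loc_one_based:
        index = int(raw_index)  # indices may arrive as floats, e.g. 3.0
        if not 1 <= index <= len(values):
            raise IndexError(f"index {index} is outside 1..{len(values)}")
        selected.append(values[index - 1])  # shift to Python's 0-based indexing
    return selected


# Toy stand-ins for all_labels and train_loc (not real dataset values).
toy_labels = [7, 3, 3, 9, 1]
toy_train_loc = [1, 4, 5]
toy_train_labels = select_split(toy_labels, toy_train_loc)
```

The same lookup applies to every all_* vector, since the loc fields are described as indexing all of them in parallel.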

 

Description of unseen_insect_dataset.mat:

# Unseen Insect Dataset

* all_images: vector containing the 40050 pre-normalized 64x64x3 (RGB) images of the insects
* all_dnas: vector containing the 40050 DNA barcodes in one-hot encoding 658x5
* all_string_dnas: vector containing the 40050 DNA barcodes in string format
* all_genus_labels: vector containing the genus label for the corresponding DNA and image
* all_boldids: vector of strings containing the sample IDs from boldsystemsv3 (https://v3.boldsystems.org/); they can be used to download the original DNA barcodes, the full-size images and other data related to each sample from boldsystems
* all_dna_features_cnn_original: vector of features extracted from DNA nucleotides with the method of Badirli, S., Picard, C. J., Mohler, G., Richert, F., Akata, Z., & Dundar, M. (2023). Classifying the unknown: Insect identification with deep hierarchical Bayesian learning. Methods in Ecology and Evolution, 14, 1515-1530. https://doi.org/10.1111/2041-210X.14104
* all_image_features_resnet: vector of features extracted from the insect images with a pretrained ResNet-101, following the method of the same paper as all_dna_features_cnn_original
* all_dna_features_cnn_new: vector of features extracted from DNA nucleotides with our CNN
* all_image_features_gan: vector of features extracted from the insect images with our method using a ReACGAN
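
Each row of a 658x5 one-hot barcode marks one sequence position, so a barcode can be decoded back into a string. The sketch below illustrates this under one important assumption: the order of the five columns is not stated here, and the alphabet "ACGTN" is only a guess. Comparing a decoded barcode with the matching all_string_dnas entry would confirm or correct that order. The name decode_one_hot and the toy matrix are hypothetical.

```python
from typing import Sequence

# ASSUMPTION: the column order of the one-hot encoding is not documented;
# "ACGTN" is a placeholder to be checked against all_string_dnas.
ASSUMED_ALPHABET = "ACGTN"


def decode_one_hot(barcode: Sequence[Sequence[float]],
                   alphabet: str = ASSUMED_ALPHABET) -> str:
    """Turn a (positions x symbols) one-hot matrix into a string."""
    symbols = []
    for row in barcode:
        if len(row) != len(alphabet):
            raise ValueError(
                f"row has {len(row)} columns, expected {len(alphabet)}"
            )
        # Position of the largest value in the row (the "hot" column).
        hot_column = max(range(len(row)), key=lambda j: row[j])
        symbols.append(alphabet[hot_column])
    return "".join(symbols)


# Toy 4x5 matrix instead of a real 658x5 barcode.
toy_barcode = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
toy_sequence = decode_one_hot(toy_barcode)
```

Passing a different alphabet string is enough to adapt the sketch once the actual column order is known.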

 

Note: all arrays and loc index vectors are 1-indexed, as in MATLAB.
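
In a 0-indexed language, the 1-based convention means every stored index needs an offset of one before use. As an illustration, here is a minimal Python sketch of the species2genus lookup, assuming species labels are themselves 1-based positions into that vector (an interpretation of the note above, not something verified against the files). The helper genus_of and the toy values are hypothetical.

```python
from typing import Sequence


def genus_of(species_label: int, species2genus: Sequence[int]) -> int:
    """Look up the genus label of a species, given a 1-based species label."""
    label = int(species_label)  # labels may arrive as floats, e.g. 3.0
    if not 1 <= label <= len(species2genus):
        raise IndexError(f"label {label} is outside 1..{len(species2genus)}")
    return species2genus[label - 1]  # shift to Python's 0-based indexing


# Toy mapping (not real dataset values): species 1 and 2 share genus 10.
toy_species2genus = [10, 10, 11, 12]
toy_genus = genus_of(3, toy_species2genus)
```

Forgetting the offset would silently return the genus of the neighbouring species, so an explicit range check like the one above can help catch off-by-one mistakes.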

Note: the features were extracted with the same model for both the insect dataset and the unseen insect dataset.

Files

Files (8.6 GB)

md5:56032e507f4dda38d505f8fbcdf52936 (4.3 GB)
md5:6bc96e3b890bd3ec46fb2abaeb4746bb (4.3 GB)