Published June 14, 2024 | Version 1.0.0
Dataset Open

BIOSCAN-5M

Description

Overview

As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, we present the BIOSCAN-5M Insect dataset to the machine learning community. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and specimen size.

Every record has both image and DNA data. Each record of the BIOSCAN-5M dataset contains six primary attributes:

  • RGB image
  • DNA barcode sequence
  • Barcode Index Number (BIN)
  • Biological taxonomic classification
  • Geographical information
  • Specimen size
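The six attributes above can be pictured as one record per specimen. The following is only a sketch: the class name `SpecimenRecord`, every field name, and the placeholder values are hypothetical and are not taken from the dataset's metadata files or the BIOSCAN-5M code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SpecimenRecord:
    """Hypothetical container for one record; all field names are illustrative."""

    image_path: str                       # RGB image (path to the image file)
    dna_barcode: str                      # DNA barcode sequence (nucleotide string)
    bin_id: Optional[str] = None          # Barcode Index Number (BIN), if assigned
    taxonomy: Dict[str, str] = field(default_factory=dict)   # rank -> label
    geography: Dict[str, str] = field(default_factory=dict)  # e.g. region fields
    size: Optional[float] = None          # specimen size; unit not stated here

    def has_image_and_dna(self) -> bool:
        """True when both the image and the DNA barcode are present."""
        return bool(self.image_path) and bool(self.dna_barcode)


# Placeholder values only, not real dataset content.
example = SpecimenRecord(
    image_path="example.jpg",
    dna_barcode="ACGT",
    taxonomy={"class": "Insecta"},
)
```

Keeping taxonomy and geography as dictionaries avoids committing to a fixed set of ranks or location fields, which this page does not enumerate.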

Technical info

Additional BIOSCAN-5M packages are available through the Google Drive folder, including:

  • BIOSCAN_5M_original_full: The raw images of the dataset.
  • BIOSCAN_5M_cropped: Images after cropping with our cropping tool introduced in BIOSCAN-1M.
  • BIOSCAN_5M_original_256: Original images resized to 256 pixels on their shorter side.
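Resizing "on the shorter side" means scaling an image so that its smaller dimension becomes 256 while the aspect ratio is preserved. A minimal sketch of the size calculation follows. The function name `shorter_side_dims` is hypothetical, and the rounding rule is an assumption; the interpolation method and exact rounding used to build the packages are not stated here.

```python
def shorter_side_dims(width, height, target=256):
    """Return (new_width, new_height) with the shorter side equal to target.

    Assumption: the longer side is rounded to the nearest integer.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width <= height:
        # Width is the shorter side: pin it to target, scale the height.
        return target, max(target, round(height * target / width))
    # Height is the shorter side: pin it to target, scale the width.
    return max(target, round(width * target / height)), target


print(shorter_side_dims(1024, 768))  # (341, 256)
print(shorter_side_dims(768, 1024))  # (256, 341)
```

The resulting pair can be passed to an image library's resize call; the function itself only handles the arithmetic.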

Files

BIOSCAN_5M_cropped_256_eval.zip

Files (41.4 GB)

Checksum                                Size
md5:16a997f72a8cd08cbcf7becafe2dda50    1.5 GB
md5:0e1cfa86dc7fa4d9c10036990992a2dd    17.9 GB
md5:ca3191d307957732e8108121b20a2059    17.7 GB
md5:3f170db5d95610644883dafc73389049    2.2 GB
md5:ac381b69fafdbaedc2f9cfb89e3571f7    2.1 GB
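The MD5 values listed above can be used to confirm that a download completed without corruption. The sketch below uses only Python's standard `hashlib`. The helper names `md5_of_file` and `matches_listing` and the path `downloaded_archive.zip` are hypothetical, and which checksum belongs to which archive must be read from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Hypothetical usage; substitute the real archive path and its listed checksum:
# ok = matches_listing("downloaded_archive.zip", "md5:<checksum from the listing>")
```

Reading in chunks keeps memory use flat, which matters for archives in the 17 GB range.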

Additional details

Related works

Is supplement to
Dataset: 10.5281/zenodo.8030065 (DOI)

Funding

Government of Canada
New Frontiers in Research Fund (NFRF) NFRFT-2020-00073

Dates

Available
2024-06-24

Software

Repository URL
https://github.com/zahrag/BIOSCAN-5M
Programming language
Python
Development Status
Active

References

  • @misc{gharaee2024bioscan5m,
      title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
      author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor and Paul Fieguth and Angel X. Chang},
      year={2024},
      eprint={2406.12723},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      doi={10.48550/arxiv.2406.12723},
    }