BIOSCAN-5M
Creators
- Gharaee, Zahra (Project leader)1
- Lowe, Scott C. (Editor)2
- Gong, ZeMing (Editor)3
- Millan Arias, Pablo (Editor)1
- Pellegrino, Nicholas (Project member)1
- Wang, Austin T. (Project member)3
- Bruslund Haurum, Joakim (Project member)4, 5
- Zarubiieva, Iuliia (Project member)2
- Kari, Lila (Supervisor)1
- Steinke, Dirk (Supervisor)6, 7
- Taylor, Graham W. (Supervisor)6, 2
- Fieguth, Paul (Supervisor)1
- Chang, Angel X. (Supervisor)3, 8
Contributors
Hosting institution:
Description
Overview
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, we present the BIOSCAN-5M Insect dataset to the machine learning community. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and specimen size.
Every record has both image and DNA data. Each record of the BIOSCAN-5M dataset contains six primary attributes:
- RGB image
- DNA barcode sequence
- Barcode Index Number (BIN)
- Biological taxonomic classification
- Geographical information
- Specimen size
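A record with these six attributes could be modelled roughly as follows. This is a minimal sketch, not the dataset's actual schema: the class name `SpecimenRecord`, every field name and type, and the placeholder values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SpecimenRecord:
    """Hypothetical container for one record; all field names are assumptions."""

    image_path: str                       # RGB image, referenced by file path
    dna_barcode: str                      # raw nucleotide barcode sequence
    bin_id: Optional[str] = None          # Barcode Index Number (BIN)
    taxonomy: Dict[str, str] = field(default_factory=dict)   # rank -> label
    geography: Dict[str, str] = field(default_factory=dict)  # e.g. key -> place
    specimen_size: Optional[float] = None  # unit deliberately left unspecified

    def has_image_and_dna(self) -> bool:
        """True when both the image reference and the barcode are non-empty."""
        return bool(self.image_path) and bool(self.dna_barcode)


# Placeholder values only; they do not come from the dataset.
record = SpecimenRecord(
    image_path="example.jpg",
    dna_barcode="ACGT",
    taxonomy={"class": "Insecta"},
)
```

Keeping taxonomy and geography as dictionaries avoids committing to a fixed set of ranks or location fields, which the description above does not enumerate.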
Technical info
Additional BIOSCAN-5M packages are available through the Google Drive folder, including:
- BIOSCAN_5M_original_full: The raw images of the dataset.
- BIOSCAN_5M_cropped: Images after cropping with our cropping tool introduced in BIOSCAN-1M.
- BIOSCAN_5M_original_256: Original images resized to 256 on their shorter side.
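To make "resized to 256 on their shorter side" concrete, the sketch below computes the output dimensions for such a resize while preserving the aspect ratio. The function name `resize_shorter_side` and the rounding convention are assumptions; the dataset's own tooling may round differently.

```python
from typing import Tuple


def resize_shorter_side(width: int, height: int, target: int = 256) -> Tuple[int, int]:
    """Return (width, height) after scaling so the shorter side equals `target`.

    The aspect ratio is preserved; the longer side is rounded to the
    nearest integer (an assumed convention).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / min(width, height)
    if width <= height:
        return target, max(target, round(height * scale))
    return max(target, round(width * scale)), target
```

For example, a 1024 x 768 image has a shorter side of 768, so the scale factor is 256 / 768 and the longer side becomes roughly 341 pixels.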
Files
BIOSCAN_5M_cropped_256_eval.zip
(41.4 GB)
MD5 checksum | Size |
---|---|
16a997f72a8cd08cbcf7becafe2dda50 | 1.5 GB |
0e1cfa86dc7fa4d9c10036990992a2dd | 17.9 GB |
ca3191d307957732e8108121b20a2059 | 17.7 GB |
3f170db5d95610644883dafc73389049 | 2.2 GB |
ac381b69fafdbaedc2f9cfb89e3571f7 | 2.1 GB |
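The MD5 checksums listed for each file can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_md5` are illustrative, and the file name in the usage comment is hypothetical because the listing above does not pair names with checksums.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value.

    Accepts the value with or without a leading "md5:" prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local file name):
# ok = verify_md5("downloaded_archive.zip", "md5:16a997f72a8cd08cbcf7becafe2dda50")
```

MD5 is suitable here only as an integrity check against transfer corruption, not as a defence against deliberate tampering.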
Additional details
Identifiers
Related works
- Is supplement to
- Dataset: 10.5281/zenodo.8030065 (DOI)
Funding
- Government of Canada
- New Frontiers in Research Fund (NFRF) NFRFT-2020-00073
Dates
- Available
- 2024-06-24
Software
- Repository URL
- https://github.com/zahrag/BIOSCAN-5M
- Programming language
- Python
- Development Status
- Active
References
- @misc{gharaee2024bioscan5m,
    title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
    author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor and Paul Fieguth and Angel X. Chang},
    year={2024},
    eprint={2406.12723},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    doi={10.48550/arxiv.2406.12723},
  }