Published March 18, 2026 | Version v1
Dataset Open

BioDCASE 2026 Challenge: Cross-Domain Mosquito Species Classification

Description

The development dataset is released for the BioDCASE 2026 Cross-Domain Mosquito Species Classification task to support model development, validation, and transparent baseline reproduction. Full task information, including the challenge overview, timeline, and evaluation setting, is provided on the official task page. The fully open baseline implementation, including code and released resources, is provided through the official GitHub repository.

The released development dataset contains 271,380 audio clips in total, corresponding to 218,388.40 seconds (60.66 hours) of mosquito flight sound recordings. It covers 9 target species across 5 domains and is intended to support research on mosquito species classification under domain shift. 
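As a quick sanity check on the figures above, the total duration and clip count can be turned into hours and a mean clip length. This is a minimal sketch using only the two numbers stated in the description; the variable names are illustrative.

```python
# Figures taken from the dataset description.
total_clips = 271_380
total_seconds = 218_388.40

# Convert the total duration to hours and derive the mean clip length.
total_hours = total_seconds / 3600
mean_clip_seconds = total_seconds / total_clips

print(f"total duration: {total_hours:.2f} h")
print(f"mean clip length: {mean_clip_seconds:.3f} s")
```

The mean says nothing about the spread of clip lengths, which would need to be measured from the audio files themselves.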

The 9 target species are:
Ae. aegypti, Ae. albopictus, Cx. quinquefasciatus, An. gambiae, An. arabiensis, An. dirus, Cx. pipiens, An. minimus, and An. stephensi.

The number of clips for each species in the released development dataset is:
Ae. aegypti: 81,587
Ae. albopictus: 18,517
Cx. quinquefasciatus: 72,056
An. gambiae: 46,998
An. arabiensis: 21,117
An. dirus: 127
Cx. pipiens: 29,754
An. minimus: 550
An. stephensi: 674

The dataset spans 5 domains, with the following clip counts:
D1: 4,065
D2: 784
D3: 679
D4: 200
D5: 265,652
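
The counts listed above can be checked against the stated total and converted to shares, which makes the imbalance explicit. The sketch below hard-codes the published counts; the dictionary keys are labels chosen for readability and are not necessarily the IDs used in the file names.

```python
# Clip counts as listed in the dataset description.
species_counts = {
    "Ae. aegypti": 81_587, "Ae. albopictus": 18_517,
    "Cx. quinquefasciatus": 72_056, "An. gambiae": 46_998,
    "An. arabiensis": 21_117, "An. dirus": 127,
    "Cx. pipiens": 29_754, "An. minimus": 550, "An. stephensi": 674,
}
domain_counts = {"D1": 4_065, "D2": 784, "D3": 679, "D4": 200, "D5": 265_652}

total = sum(species_counts.values())

# Fraction of all clips contributed by each species and each domain.
species_share = {name: n / total for name, n in species_counts.items()}
domain_share = {name: n / total for name, n in domain_counts.items()}

for name, share in sorted(domain_share.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.2%}")
```

Both sets of counts sum to 271,380. D5 alone accounts for roughly 97.9% of the clips, so pooled metrics are dominated by that domain.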

Each audio file follows the naming format S_<speciesID>_D_<domainID>_<clipIndex>, so both species identity and domain identity are directly accessible from the audio ID. This makes the released dataset fully transparent and easy to inspect. Participants can directly analyse species-domain distributions, reproduce the released baseline setting, or construct alternative development splits when needed.
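
A minimal sketch of how the naming format could be parsed to build a species-domain table. It assumes that the species and domain IDs contain no underscores and that any file extension can be dropped; the example file name in the usage note is hypothetical, and the helper names are not part of the released baseline.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: neither <speciesID> nor <domainID> contains an underscore.
AUDIO_ID = re.compile(r"^S_(?P<species>[^_]+)_D_(?P<domain>[^_]+)_(?P<clip>.+)$")

def parse_audio_id(name):
    """Return (species_id, domain_id, clip_index) as strings."""
    stem = Path(name).stem  # drops an extension such as ".wav" if present
    match = AUDIO_ID.match(stem)
    if match is None:
        raise ValueError(f"unexpected audio ID: {name!r}")
    return match["species"], match["domain"], match["clip"]

def species_domain_counts(names):
    """Count clips per (species_id, domain_id) pair."""
    return Counter(parse_audio_id(name)[:2] for name in names)
```

For example, `parse_audio_id("S_1_D_5_000042.wav")` returns `("1", "5", "000042")` under these assumptions. Passing a list of all file names to `species_domain_counts` gives the species-domain distribution directly.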

For the released baseline, the development dataset is divided into a trainval pool of 244,163 clips and a test set of 27,217 clips. A validation set is then derived from the trainval pool by random species-stratified sampling, yielding 213,647 training clips and 30,516 validation clips. This released split is intended as a simple and reproducible reference setup for the BioDCASE 2026 Cross-Domain Mosquito Species Classification task.
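
The split sizes above imply a test share of about 10% of the dataset and a validation share of about 12.5% of the trainval pool. The following is a sketch of random species-stratified sampling in general, not the baseline's actual code: the function name, the seed, and the rounding rule are assumptions, so it will not reproduce the released split exactly.

```python
import random
from collections import defaultdict

def stratified_split(items, label_of, holdout_fraction, seed=0):
    """Split items into (train, holdout), sampling the holdout within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, holdout = [], []
    for label in sorted(by_label):  # fixed order keeps the split reproducible
        group = by_label[label]
        rng.shuffle(group)
        n_holdout = round(len(group) * holdout_fraction)
        holdout.extend(group[:n_holdout])
        train.extend(group[n_holdout:])
    return train, holdout
```

Here `label_of` would typically map an audio ID to its species ID. For very rare species, rounding can leave the holdout with zero clips of that species, so the per-species counts are worth checking after the split.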

The species-domain distribution is highly uneven across the development dataset. Some species-domain combinations are well represented, while others are sparse. Participants are therefore encouraged to look beyond pooled accuracy and to consider both class balance and domain balance during development.
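
One way to look beyond pooled accuracy is balanced accuracy (the mean of per-class recall), reported separately for each domain. The sketch below is illustrative only and is not necessarily the official challenge metric; see the task page for the evaluation setting.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)

def per_domain_balanced_accuracy(y_true, y_pred, domains):
    """Balanced accuracy computed separately within each domain."""
    groups = defaultdict(lambda: ([], []))
    for truth, pred, domain in zip(y_true, y_pred, domains):
        groups[domain][0].append(truth)
        groups[domain][1].append(pred)
    return {d: balanced_accuracy(ts, ps) for d, (ts, ps) in groups.items()}
```

A classifier that always predicts the majority species can score high pooled accuracy while its balanced accuracy stays near one divided by the number of classes, which is why the two are worth reporting side by side.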

Participants may use the species and domain information encoded in the audio IDs to construct alternative domain-aware development splits. This can help local validation better reflect the cross-domain objective of the task.
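
A domain-aware split could, for instance, hold out every clip from one domain for validation. The sketch below assumes the ID fields contain no underscores, so the domain ID is the fourth underscore-separated field; the function names and example IDs are hypothetical.

```python
def domain_of(audio_id):
    """Extract <domainID> from S_<speciesID>_D_<domainID>_<clipIndex>."""
    return audio_id.split("_")[3]

def species_of(audio_id):
    """Extract <speciesID> from S_<speciesID>_D_<domainID>_<clipIndex>."""
    return audio_id.split("_")[1]

def leave_one_domain_out(audio_ids, held_out_domain):
    """Return (train_ids, val_ids) with one whole domain reserved for validation."""
    train = [a for a in audio_ids if domain_of(a) != held_out_domain]
    val = [a for a in audio_ids if domain_of(a) == held_out_domain]
    return train, val

def unseen_species(train_ids, val_ids):
    """Species that appear in validation but never in training."""
    return {species_of(a) for a in val_ids} - {species_of(a) for a in train_ids}
```

Because some species-domain combinations are sparse, `unseen_species` is a useful check: a held-out domain may contain species that do not occur in the remaining domains at all.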

The evaluation dataset will be released according to the challenge timeline.

Links

Task page: https://biodcase.github.io/challenge2026/task5

Baseline repository: https://github.com/Yuanbo2020/CD-MSC

If you use the development dataset or refer to the BioDCASE 2026 Cross-Domain Mosquito Species Classification task, please cite the following paper.

BioDCASE 2026 CD-MSC Baseline: https://arxiv.org/abs/2603.20118

@misc{hou2026biodcase2026challengebaseline,
      title={BioDCASE 2026 Challenge Baseline for Cross-Domain Mosquito Species Classification}, 
      author={Yuanbo Hou and Vanja Zdravkovic and Marianne Sinka and Yunpeng Li and Wenwu Wang and Mark D. Plumbley and Kathy Willis and Stephen Roberts},
      year={2026},
      eprint={2603.20118},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.20118}, 
}

Files

Development_data.zip (3.8 GB)
md5:dbf35b9066e82cdee5461fca138f4986
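
The listed MD5 checksum can be used to confirm that the archive downloaded intact. A minimal sketch using Python's standard `hashlib`; the helper names are illustrative, and the usage note assumes the archive was saved under its original name in the working directory.

```python
import hashlib

# Checksum listed for Development_data.zip on this page.
EXPECTED_MD5 = "dbf35b9066e82cdee5461fca138f4986"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `verify_download("Development_data.zip")` returns `True` when the file matches the published checksum; a `False` result usually indicates an incomplete or corrupted download.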

Additional details

Software

Repository URL: https://github.com/Yuanbo2020/CD-MSC
Programming language: Python
Development status: Active