Published November 3, 2025 | Version v1
Dataset Open

COCO-Formatted Bounding Box Annotations for the NIH Thin Blood Smear Malaria Dataset

  • 1. MIRA vision microscopy GmbH
  • 2. Friedrich-Alexander-Universität Erlangen-Nürnberg
  • 3. Friedrich-Alexander-Universität Erlangen-Nürnberg - Technische Fakultät

Description

This repository provides bounding box annotations in COCO format for the publicly available NIH-NLM Thin Blood Smear Dataset for Plasmodium falciparum detection. The annotations were generated by combining automated instance segmentation with Cellpose and targeted manual correction. The dataset curation is described in detail in the following manuscript:

Wilm, Frauke, et al. "A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears." arXiv preprint arXiv:2507.18483 (2025), doi: 10.48550/arXiv.2507.18483

Dataset Structure

The repository contains the following files:

  • nih_polys_coco.json: COCO annotations for the polygon-labeled subset
  • nih_points_coco.json: COCO annotations for the point-labeled subset (converted to boxes)
  • categories.json: Class label definitions

All annotation files follow the COCO format. The annotations include bounding boxes for three classes: non-infected red blood cells, infected red blood cells, and white blood cells. Each of the two annotation files corresponds to one of the original NIH subsets:

  • nih_polys_coco.json: derived from the 165 polygon-labeled images.
  • nih_points_coco.json: derived from the 800 point-annotated images.
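As a minimal sketch of how these files might be inspected, the snippet below counts bounding boxes per class using only the Python standard library. It assumes the standard COCO layout with top-level "images", "annotations", and "categories" keys. The category names in the synthetic example are placeholders, not the names stored in categories.json, and the helper names are hypothetical.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO annotation file (e.g. nih_polys_coco.json) into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_boxes_per_category(coco):
    """Return {category name: number of annotations} for a parsed COCO dict.

    Categories without any annotation are omitted from the result.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(ann["category_id"] for ann in coco["annotations"])
    return {id_to_name[cat_id]: n for cat_id, n in counts.items()}


# Tiny synthetic example; names and ids are placeholders (assumption).
example = {
    "images": [{"id": 1, "file_name": "example.png", "width": 100, "height": 100}],
    "categories": [
        {"id": 1, "name": "uninfected_rbc"},
        {"id": 2, "name": "infected_rbc"},
        {"id": 3, "name": "wbc"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 20]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [40, 10, 22, 21]},
        {"id": 3, "image_id": 1, "category_id": 2, "bbox": [60, 60, 19, 23]},
        {"id": 4, "image_id": 1, "category_id": 3, "bbox": [10, 70, 25, 25]},
    ],
}

example_counts = count_boxes_per_category(example)
print(example_counts)
```

For the real files, `count_boxes_per_category(load_coco("nih_polys_coco.json"))` would be the corresponding call, provided the file has been downloaded to the working directory.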

Modalities and Tasks

Modality: Brightfield microscopy images of Giemsa-stained thin blood smears.

Task: Object detection of individual red and white blood cells, with classification into:

  • Non-infected red blood cells
  • Infected red blood cells
  • White blood cells

Patient Information Fields

The annotation files do not include identifiable patient information. However, each image in the original NIH dataset is associated with one of 193 patients. The original dataset includes:

  • 148 infected patients
  • 45 uninfected patients
  • 5 images per patient
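These figures are consistent with the subset sizes given under Dataset Structure. The short check below spells out the arithmetic, assuming every patient contributes exactly 5 images.

```python
infected_patients = 148
uninfected_patients = 45
images_per_patient = 5  # assumption: holds for every patient

total_patients = infected_patients + uninfected_patients
total_images = total_patients * images_per_patient

polygon_images = 165  # polygon-labeled subset
point_images = 800    # point-annotated subset

# Patient counts add up, and the image total matches the two subsets combined.
assert total_patients == 193
assert total_images == polygon_images + point_images

print(total_patients, total_images)  # 193 965
```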

Please refer to the original dataset publication for additional metadata:  

Kassim, Yasmin M., et al. "Clustering-based dual deep learning architecture for detecting red blood cells in malaria diagnostic smears." IEEE Journal of Biomedical and Health Informatics 25.5 (2020): 1735-1746.

Link to dataset

Licensing and Usage

The original NIH dataset is provided under the following license terms:

  • The data may be used, modified, and redistributed for commercial and non-commercial purposes.
  • Attribution must be provided: “Courtesy of the U.S. National Library of Medicine.”
  • Please cite the original dataset publication.

The annotation files in this repository are released under the same terms as the original dataset. If you use these annotations, please cite the accompanying publication, which describes the annotation process and evaluation.

Contact

For questions or feedback, please contact: Frauke Wilm

Files

Total size: 37.8 MB

  • md5:be2fce36e7bfc2e7cb0c32e6dd612990 (122 Bytes)
  • md5:86d3f3a95c324c9479bd8986968f4327 (11.4 kB)
  • md5:52b3c27369575036678cb14dca8d51eb (20.6 MB)
  • md5:98813625042d2a6765bf1e549330a6bb (17.2 MB)
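The md5 checksums listed above can be used to verify a download. The following is a minimal sketch using Python's hashlib. The helper names are hypothetical, and because the listing does not pair checksums with file names, matching each file to its checksum is left to the reader.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:be2f...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Self-contained demo on a temporary file (not part of the dataset).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
demo_ok = matches_listing(demo_path, "md5:" + hashlib.md5(b"hello").hexdigest())
os.remove(demo_path)
print(demo_digest, demo_ok)
```

For a downloaded file, a call such as `matches_listing("categories.json", "md5:...")` with the appropriate listed value would return True when the file is intact.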