Published July 11, 2025 | Version v1
Dataset Open

The Amphan Dataset: Humanitarian Classification of Bengali Disaster-Related Tweets

  • 1. IBM India Pvt. Ltd., Kolkata, 700156, West Bengal, India
  • 2. Department of Mathematics, Indian Institute of Technology Kharagpur, Kharagpur, 721302, West Bengal, India
  • 3. Institute of Computing and Analytics, NSHM Knowledge Campus, Kolkata, 700053, West Bengal, India
  • 4. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, 721302, West Bengal, India

Description

The Amphan Dataset (Bengali Disaster-Related Tweets)

This dataset accompanies the paper "Humanitarian Classification of Crisis-related Microblogs in Bengali: A Comparison of Multilingual Pre-trained Language Models", which has been accepted for publication in the International Journal of Disaster Risk Reduction (IJDRR), Elsevier. For more details about the dataset, please refer to the paper.

Overview

The Amphan dataset is a collection of 2,400 Bengali-language tweets related to Cyclone Amphan, which struck the Bay of Bengal region in May 2020. The dataset aims to support the classification of disaster-related social media content in low-resource languages such as Bengali, with a focus on humanitarian aid and crisis response. 

To our knowledge, this is the first publicly available dataset of disaster-related social media posts in Bengali. Although it was developed in the context of Cyclone Amphan, the dataset has broader relevance because cyclonic storms frequently affect Bangladesh and eastern India.

Dataset structure

The Amphan dataset in this repository is provided in Excel (.xlsx) format and contains 2,400 tweets (microblogs). The dataset includes the following columns:

  • ID: Unique identifier for each tweet
  • Tweet_in_Bengali: Original tweet text written in Bengali
  • Tweet_in_EN_IndicTrans: English translation generated via a neural machine translation model, IndicTrans
  • Label: One or more humanitarian class labels assigned to the tweet
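
A minimal sketch of loading the spreadsheet with pandas is given here. It assumes the file has been downloaded locally, that the column headers match the names listed above exactly, and that the `openpyxl` package is installed (pandas needs it to read `.xlsx` files). The function names and the file name used in the usage note are hypothetical.

```python
# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["ID", "Tweet_in_Bengali", "Tweet_in_EN_IndicTrans", "Label"]


def missing_columns(columns):
    """Return the expected columns that are absent from `columns`."""
    present = {str(c).strip() for c in columns}
    return [c for c in EXPECTED_COLUMNS if c not in present]


def load_amphan(path):
    """Load a local copy of the .xlsx file and check its columns.

    Assumption: the headers in the file match EXPECTED_COLUMNS.
    """
    import pandas as pd  # imported here so missing_columns works without pandas

    df = pd.read_excel(path)  # relies on openpyxl for .xlsx files
    df.columns = [str(c).strip() for c in df.columns]  # tolerate stray spaces
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

For example, `df = load_amphan("amphan_dataset.xlsx")` (a placeholder file name) should return a table with 2,400 rows if the file matches this description.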

Annotation and Class Labels

The tweets were annotated with a set of humanitarian class labels based on the type of information conveyed. The dataset is multi-label, meaning a tweet can belong to more than one class. The labels are as follows:

  • affected_individual
  • caution_advice_updates
  • displaced_and_evacuations
  • donations_and_volunteering
  • infrastructure_and_utilities_damage
  • injured_or_dead_people
  • missing_and_found_people
  • requests_or_needs
  • response_efforts
  • sympathy_and_support
  • not_humanitarian
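
Because a tweet can carry several labels, a common preprocessing step is to turn each Label cell into a binary indicator vector over the 11 classes. The sketch below assumes that multiple labels in one cell are separated by commas; the separator is not stated in this description, so it should be checked against the actual file. The function names are illustrative.

```python
# The 11 humanitarian classes, in the order listed in this description.
LABELS = [
    "affected_individual",
    "caution_advice_updates",
    "displaced_and_evacuations",
    "donations_and_volunteering",
    "infrastructure_and_utilities_damage",
    "injured_or_dead_people",
    "missing_and_found_people",
    "requests_or_needs",
    "response_efforts",
    "sympathy_and_support",
    "not_humanitarian",
]


def parse_labels(cell, sep=","):
    """Split one Label cell into class names (assumed separator: `sep`)."""
    names = [part.strip() for part in str(cell).split(sep) if part.strip()]
    unknown = [name for name in names if name not in LABELS]
    if unknown:
        raise ValueError(f"Unknown label(s): {unknown}")
    return names


def to_multi_hot(cell, sep=","):
    """Encode one Label cell as a 0/1 vector aligned with LABELS."""
    names = set(parse_labels(cell, sep))
    return [1 if label in names else 0 for label in LABELS]
```

Applied to every row, `to_multi_hot` yields a 2,400 × 11 target matrix of the kind typically used to train a multi-label classifier.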

License

The Amphan dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

This dataset is released publicly for non-commercial research and academic purposes. Please cite our paper when using the data in any publication or derived work.

Files

Files (395.0 kB)

md5:0f4b24dff176d8262e5b1fc692970ded (395.0 kB)
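
After downloading, the file can be checked against the MD5 checksum in the file listing. The sketch below uses only the Python standard library; the function names and the local file path are assumptions.

```python
import hashlib

# Checksum copied from the file listing on this page.
EXPECTED_MD5 = "0f4b24dff176d8262e5b1fc692970ded"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `is_intact` on the downloaded file should return `True`; a `False` result suggests an incomplete or altered download.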