WhatsApp and Instagram chat message metadata (WICM)
Authors/Creators
Contributors
Data collector (4):
Description
Please note that the dataset partially overlaps with two previously published datasets (1, 2) and is uploaded accompanying a paper submission.
Collection & data description: The data were collected using the Dona platform. Participants donated either all their chats from Instagram (up to a year) or a number of chats of their choosing from WhatsApp (to limit participant burden, 5 to 7 were recommended). 129 people donated their WhatsApp chats and 66 their Instagram chats. Nobody donated both. This results in 762 WhatsApp and 6,285 Instagram chats, totalling 6,529,297 messages. The earliest message dates back to November 2012, while the latest was sent in June 2025.
The demographic data stem from surveys that participants filled out before the donation. They contain age, gender, highest current education, employment status, selected survey language, and familiarity with the survey language. As none of the fields were mandatory, not all donations have complete demographic information.
Format: The data consist of a CSV file containing the survey responses and a Parquet file containing a pandas DataFrame in which each row is a message, with the following columns:
- conversation_id: A UUID matching all messages of the same conversation/chat.
- sender_id: A UUID matching the sender of a message across all chats of a donation.
- datetime: The timestamp of the message being sent. Subsecond-level resolution for Instagram and WhatsApp for iOS, minute-level resolution for WhatsApp for Android.
- word_count: The number of words in the message, identified as blocks of characters separated by at least one whitespace character.
- data_source_id: 2 for WhatsApp and 3 for Instagram.
- donor_id: A UUID for each donor. This is also the key for the demographic data.
- id: A UUID for each message.
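The column layout above can be explored along the following lines. This is a minimal sketch, not part of the dataset's official tooling: the helper names (count_words, summarize_by_source, attach_demographics), the file names in the comments, and the assumption that the survey CSV exposes the donor key in a column called donor_id are all hypothetical. A tiny synthetic table stands in for the real files, which are only available on request.

```python
import pandas as pd

# Mapping taken from the data_source_id description above.
SOURCE_NAMES = {2: "WhatsApp", 3: "Instagram"}


def count_words(text: str) -> int:
    """Count blocks of characters separated by whitespace,
    mirroring the word_count definition (illustrative only)."""
    return len(text.split())


def summarize_by_source(messages: pd.DataFrame) -> pd.DataFrame:
    """Return donors, chats, messages and words per platform."""
    labelled = messages.assign(
        source=messages["data_source_id"].map(SOURCE_NAMES)
    )
    return labelled.groupby("source").agg(
        donors=("donor_id", "nunique"),
        chats=("conversation_id", "nunique"),
        messages=("id", "count"),
        words=("word_count", "sum"),
    )


def attach_demographics(messages: pd.DataFrame, survey: pd.DataFrame) -> pd.DataFrame:
    """Left-join survey answers onto messages.

    Assumes the survey table has a donor_id column (hypothetical name).
    Donors without survey answers keep their messages, with missing values.
    """
    return messages.merge(survey, on="donor_id", how="left")


# With access to the real files, loading might look like this
# (file names are placeholders, not the actual ones):
#   messages = pd.read_parquet("messages.parquet")
#   survey = pd.read_csv("survey.csv")

# Synthetic stand-in with the documented columns.
toy_messages = pd.DataFrame(
    {
        "id": ["m1", "m2", "m3", "m4"],
        "donor_id": ["d1", "d1", "d1", "d2"],
        "conversation_id": ["c1", "c1", "c2", "c3"],
        "sender_id": ["s1", "s2", "s1", "s3"],
        "data_source_id": [2, 2, 2, 3],
        "word_count": [3, 1, 5, 2],
        "datetime": pd.to_datetime(
            [
                "2024-01-01 10:00:00",
                "2024-01-01 10:01:00",
                "2024-01-02 09:30:00",
                "2024-02-03 18:45:12",
            ]
        ),
    }
)
# The "age" column name is an assumption about the survey CSV.
toy_survey = pd.DataFrame({"donor_id": ["d1"], "age": [30]})

summary = summarize_by_source(toy_messages)
merged = attach_demographics(toy_messages, toy_survey)
print(summary)
```

A left join is used so that messages from donors with incomplete or missing survey answers are retained, which matches the note that none of the demographic fields were mandatory.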
Requesting access
If you would like to access these files, please reach out to Florian Martin at hcai-datasets+dona-rt2026@techfak.de. You need to satisfy these conditions in order for this request to be accepted:
- To use the data set, you must hold an academic affiliation.
- Further, you have to download and fill out the End User License Agreement (EULA) (Password: 78wH9CTCTr) and submit it to us (using the address provided above).
Files
Additional details
Related works
- Is described by
- Preprint: arXiv:2605.03687 (arXiv)
Funding
- Federal Ministry of Education and Research
- Empathische Künstliche Intelligenz 01IS20046
Software
- Repository URL
- https://github.com/mbp-lab/dona-rt2026
- Programming language
- Python
- Development Status
- Suspended