A Federated Approach to Predicting Emojis in Hindi Tweets
Creators
- 1. University of Alberta
- 2. Georgia Institute of Technology
- 3. Stanford University
- 4. Santa Clara University
- 5. Dwarkadas J. Sanghvi College of Engineering
- 6. Simon Fraser University
Description
This dataset for emoji topic prediction was collected by scraping ~1M tweets. We kept only the 24,794 tweets that are written in Hindi and contain at least one emoji. For each tweet containing multiple emojis, we created one copy per emoji and assigned a single emoji to each copy, which resulted in the final dataset of 118,030 tweets with 700 unique emojis.
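The duplication step can be illustrated with a minimal sketch. It is not the authors' code: the function name `expand_by_emoji`, the rough emoji regex, the choice to strip emojis from the copied text, and the choice to count every occurrence (rather than every distinct emoji) are all assumptions made for illustration.

```python
import re

# Assumption: a rough approximation of emoji code points. A real pipeline
# would use a complete emoji list and handle multi-code-point sequences
# (skin tones, ZWJ sequences), which this pattern splits into parts.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def expand_by_emoji(tweet):
    """Return one (text, emoji) pair per emoji occurrence in the tweet."""
    emojis = EMOJI_RE.findall(tweet)
    # Assumption: emojis are removed from the text so that each copy
    # carries its emoji only as the label.
    text = " ".join(EMOJI_RE.sub(" ", tweet).split())
    return [(text, emoji) for emoji in emojis]
```

Under these assumptions, a tweet with two emojis yields two rows with identical text and different labels, and a tweet without emojis yields no rows.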
Because the distribution of emojis in our dataset is imbalanced, we assign emojis to 10 coarse-grained categories. These reductions, i.e., from multi-label to multi-class and from unique emojis to categories, risk losing the semantic meaning of emojis. Our decision is motivated by how challenging emoji prediction is without such reductions.
We pre-processed our data to limit the risk of over-fitting to rare and platform-specific tokens. For instance, we lowercased all text and removed numbers, punctuation, and retweet markers. We replaced mentions, URLs, and hashtags with dedicated placeholder tokens to avoid over-fitting to them.
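A minimal sketch of such a pre-processing pipeline is given here. It is an illustration under stated assumptions, not the authors' implementation: the placeholder strings `URL`, `MENTION`, and `HASHTAG`, the retweet-marker pattern, the order of the steps, and the punctuation set (ASCII punctuation plus the Devanagari danda) are all hypothetical choices.

```python
import re
import string

# Assumption: placeholder strings are illustrative; the dataset may use others.
URL_TOKEN, MENTION_TOKEN, HASHTAG_TOKEN = "URL", "MENTION", "HASHTAG"

# Assumption: "punctuation" means ASCII punctuation plus the Devanagari danda.
# A Unicode-aware \w is avoided on purpose, because Python's re module does not
# treat Devanagari vowel signs as word characters and would strip them.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "\u0964")

def preprocess(text):
    """Normalize one tweet: lowercase, replace platform tokens, drop noise."""
    text = text.lower()                                     # no-op for Devanagari
    text = re.sub(r"^rt\s+", "", text)                      # leading retweet marker
    text = re.sub(r"https?://\S+", f" {URL_TOKEN} ", text)
    text = re.sub(r"@[a-z0-9_]+", f" {MENTION_TOKEN} ", text)
    text = re.sub(r"#\S+", f" {HASHTAG_TOKEN} ", text)
    text = re.sub(r"\d+", " ", text)                        # ASCII and Devanagari digits
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())                           # collapse whitespace
```

Placeholders are inserted before punctuation is stripped, because URLs, mentions, and hashtags would otherwise be broken apart by the punctuation step.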
More information about the dataset is available in our paper.