A Federated Approach to Predicting Emojis in Hindi Tweets

Gandhi, Deep; Mehta, Jash; Parekh, Nirali; Waghela, Karan; D'Mello, Lynette; Talat, Zeerak

doi:10.5281/zenodo.5559434

Published October 21, 2022 | Version v1

Journal article Restricted

1. University of Alberta
2. Georgia Institute of Technology
3. Stanford University
4. Santa Clara University
5. Dwarkadas J. Sanghvi College of Engineering
6. Simon Fraser University

This dataset for emoji topic prediction was collected by scraping ~1M tweets. We kept only the 24,794 tweets that are written in Hindi and contain at least one emoji. We duplicated each tweet that contains multiple emojis once per emoji, assigning a single emoji to each copy, which resulted in a final dataset of 118,030 tweets with 700 unique emojis.
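The duplication step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `explode_by_emoji` and the regular expression used to find emojis are assumptions made for illustration, and the expression covers only a rough subset of emoji code points. The sketch also assumes each copy keeps the original tweet text unchanged, which the description above does not specify.

```python
import re

# Rough approximation of emoji code points (an assumption for this sketch;
# it misses some emojis, e.g. flags, and treats ZWJ sequences as separate parts).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def explode_by_emoji(tweets):
    """Return one (text, emoji) pair per emoji occurrence in each tweet.

    Tweets without any matched emoji produce no rows, mirroring the filter
    that keeps only tweets containing at least one emoji.
    """
    rows = []
    for text in tweets:
        for emoji in EMOJI_RE.findall(text):
            # Each copy keeps the full text and gets a single emoji label.
            rows.append((text, emoji))
    return rows


if __name__ == "__main__":
    sample = ["नमस्ते 😀🙏", "कोई इमोजी नहीं"]
    for text, label in explode_by_emoji(sample):
        print(label, "|", text)
```

Under these assumptions, a tweet with two emojis contributes two rows with the same text and different labels, which is how 24,794 tweets can grow to 118,030 examples.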

Due to the imbalanced distribution of emojis in our dataset, we assign emojis to 10 coarse-grained categories. This reduction, i.e., from multi-label to multi-class and from unique emojis to categories, risks losing the semantic meaning of individual emojis. Our decision is motivated by how challenging emoji prediction is without such reductions.

We pre-processed our data to limit the risk of over-fitting to rare tokens and platform-specific tokens. For instance, we lowercased all text and removed numbers, punctuation, and retweet markers. We replaced mentions, URLs, and hashtags with specific tokens to avoid over-fitting to these.
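The pre-processing steps above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function name `preprocess`, the placeholder tokens (`urltoken`, `mentiontoken`, `hashtagtoken`), the order of the steps, and the choice to strip only ASCII punctuation are all hypothetical.

```python
import re
import string

# Placeholder tokens are hypothetical; the dataset's actual tokens are not stated here.
URL_TOKEN = " urltoken "
MENTION_TOKEN = " mentiontoken "
HASHTAG_TOKEN = " hashtagtoken "

# Translation table that deletes ASCII punctuation only (an assumption;
# Devanagari punctuation such as the danda is left untouched).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Normalize one tweet following the steps described above."""
    text = text.lower()
    # Drop a leading retweet marker such as "rt ".
    text = re.sub(r"^rt\s+", "", text)
    # Replace URLs, mentions, and hashtags before punctuation is stripped.
    text = re.sub(r"https?://\S+", URL_TOKEN, text)
    text = re.sub(r"@\w+", MENTION_TOKEN, text)
    # \S+ rather than \w+ so Devanagari vowel signs stay inside the hashtag match.
    text = re.sub(r"#\S+", HASHTAG_TOKEN, text)
    # Remove numbers (Unicode digits, including Devanagari digits).
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess("RT @user: नमस्ते! 123 https://t.co/abc #खुशी"))
```

Replacing URLs, mentions, and hashtags before stripping punctuation matters in this sketch, because the `@`, `#`, and `://` characters that identify them would otherwise be removed first.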

More information about the dataset is available in our paper.

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

If you would like to request access to these files, please fill out the form below.

1. The data should be used only for non-commercial research.
2. The data should not be shared outside of the research team.
3. The data should not be used for user profiling.
4. Requesters should have received institutional approval (e.g. the requesters have been granted IRB approval for their project).
5. Requesters should provide a research statement that details potential harms.

Statistic	All versions	This version
Views	294	291
Downloads	20	20
Data volume	49.1 MB	49.1 MB
