Published May 15, 2023 | Version 1.0.0
Dataset Open

DWAEF_Sarc_Dataset & simile_non_simile_shuffled

Description

'DWAEF_Sarc_Dataset' was compiled for the deep weighted average ensemble-based framework (DWAEF). It contains 2,891 English sentences sourced from Twitter, the News Headlines dataset, and the SARC dataset. Of these, 1,538 are sarcastic: i) 520 sentences were extracted from Twitter using the hashtags #sarcasm, #not, #sarcastic, #irony, and #satire between June 2022 and October 2022; ii) 520 were taken from the News Headlines dataset; and iii) the remaining 498 were taken from the SARC dataset. The 1,353 non-sarcastic sentences were compiled from Twitter and the News Headlines dataset. These sources do not belong to the same domain or topic. Twitter data, for instance, encompasses a broad range of subjects, whereas News Headlines may concentrate on specific areas, such as politics, sports, or entertainment.
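
The composition described above can be tallied as a quick consistency check. The following is a minimal sketch in Python; the dictionary keys and variable names are illustrative labels, not column names or identifiers from the dataset files.

```python
# Sarcastic sentence counts per source, as stated in the description.
# Keys are illustrative labels, not identifiers from the dataset files.
sarcastic_by_source = {
    "Twitter": 520,
    "News Headlines": 520,
    "SARC": 498,
}

# Non-sarcastic count as stated in the description (Twitter + News Headlines).
non_sarcastic_total = 1353

sarcastic_total = sum(sarcastic_by_source.values())
dataset_total = sarcastic_total + non_sarcastic_total

print(sarcastic_total)  # 1538
print(dataset_total)    # 2891
```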

To pretrain the GNN-based framework for simile detection, the present study curated a second dataset, 'simile_non_simile_shuffled', comprising approximately 3,512 English sentences divided into two groups: sentences that contain similes and sentences that do not. The sentences were collected from online sources and double-annotated to ensure accuracy and reliability. This dataset supports future research on the role of similes in both written and spoken language, contributing to a better understanding of figurative language in general.

Files

Files (679.0 kB)

Checksum Size
md5:319cc1e7ce1001d0ccd62a1cf8f07e8d 258.1 kB
md5:5764e7cd9c91b12b5c47efbdd5de450b 420.9 kB
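
The MD5 digests listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch in Python using only the standard library; the file names are not shown in this record, so the path is supplied by the user on the command line, and the helper names `md5_of_file` and `is_listed_file` are hypothetical.

```python
import hashlib
import sys

# MD5 digests as listed in the Files section of this record.
EXPECTED_MD5 = {
    "319cc1e7ce1001d0ccd62a1cf8f07e8d",
    "5764e7cd9c91b12b5c47efbdd5de450b",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_file(path):
    """Return True if the file's MD5 matches one of the listed digests."""
    return md5_of_file(path) in EXPECTED_MD5


if __name__ == "__main__":
    # Usage: python verify_md5.py <downloaded_file> [<downloaded_file> ...]
    for file_path in sys.argv[1:]:
        status = "OK" if is_listed_file(file_path) else "MISMATCH"
        print(f"{file_path}: {status}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the file again is the simplest remedy.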