SlangTrack Dataset

Aloraini, Afnan

doi:10.5281/zenodo.13934495

Published October 15, 2024 | Version v1

Dataset Restricted

The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang senses, so that a binary classifier can be trained to distinguish the two usages. The dataset provides examples of each usage for every target word, supporting fine-grained linguistic and computational analysis by both researchers and practitioners in NLP.

Key Features:

Unique Words: 48,508
Total Tokens: 310,170
Average Post Length: 34.6 words
Average Sentences per Post: 3.74

These figures indicate that each post supplies several sentences of context around a target word, which supports slang detection and semantic analysis.
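
As a rough illustration, statistics of this kind could be computed from a list of posts as sketched below. The record does not say how the dataset was tokenized or split into sentences, so the regular-expression tokenizer, the punctuation-based sentence splitter, and the function name corpus_stats are all assumptions made for this example.

```python
import re


def corpus_stats(posts):
    """Compute simple corpus statistics for a list of post strings.

    Assumptions (not taken from the dataset record): tokens are runs of
    letters/apostrophes after lowercasing, and sentences end at '.', '!' or '?'.
    """
    vocab = set()
    total_tokens = 0
    total_sentences = 0
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        vocab.update(tokens)
        total_tokens += len(tokens)
        # Count non-empty chunks between sentence-final punctuation marks.
        total_sentences += sum(1 for s in re.split(r"[.!?]+", post) if s.strip())
    n = len(posts)
    return {
        "unique_words": len(vocab),
        "total_tokens": total_tokens,
        "avg_post_length": total_tokens / n if n else 0.0,
        "avg_sentences_per_post": total_sentences / n if n else 0.0,
    }
```

Calling corpus_stats on the dataset's posts would return the four quantities listed above; the exact values depend on the tokenization choices, so they need not match the published figures.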

Target Word Selection:

The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:

Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
Was cross-referenced using trusted resources such as:
- Green's Dictionary of Slang
- Urban Dictionary
- Online Slang Dictionary
- Oxford English Dictionary
Features at least one slang and one dominant non-slang sense.
Is not a proper noun, to maintain linguistic relevance and focus.
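
The selection criteria above can be read as a filter over candidate words. The sketch below is one possible encoding, not the author's actual procedure: the Sense class, the is_eligible function, and the set-based wordlists are hypothetical, and the "dominant" non-slang requirement is simplified to "at least one non-slang sense" because the record does not define how dominance was measured.

```python
from dataclasses import dataclass


@dataclass
class Sense:
    gloss: str       # short definition, e.g. taken from a dictionary entry
    is_slang: bool   # True if the sense is marked as slang


def is_eligible(word, senses, slang_wordlist, coha_vocab, proper_nouns=frozenset()):
    """Return True if a candidate word satisfies the listed criteria.

    Hypothetical inputs: `slang_wordlist` and `coha_vocab` are sets of words,
    `senses` is the list of Sense objects compiled from the dictionaries.
    """
    # Must occur in both the slang wordlist and COHA.
    if word not in slang_wordlist or word not in coha_vocab:
        return False
    # Must have between 2 and 8 distinct senses.
    if not 2 <= len(senses) <= 8:
        return False
    # Must have at least one slang and one non-slang sense
    # (simplification of the "dominant non-slang sense" criterion).
    has_slang = any(s.is_slang for s in senses)
    has_non_slang = any(not s.is_slang for s in senses)
    if not (has_slang and has_non_slang):
        return False
    # Proper nouns are excluded.
    return word not in proper_nouns
```

For example, a hypothetical candidate "salty" with one literal sense and one slang sense passes the filter if it is present in both wordlists, while a word with only slang senses is rejected.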

Data Sources and Collection:

1. Corpus of Historical American English (COHA):

Historical examples were extracted from the cleaned version of COHA (CCOHA).
Data spans the years 1980–2010, capturing the evolution of target words over time.

2. Twitter:

Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.
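
A minimal sketch of this kind of collection step, restricting examples to a year range and capping the number kept per target word, might look as follows. The record layout (dictionaries with "word", "year", and "text" keys) and the function name collect_examples are hypothetical; the record does not describe the actual collection pipeline.

```python
from collections import defaultdict


def collect_examples(records, target_words, start_year, end_year, per_word=1000):
    """Group example texts by target word, keeping at most `per_word` each.

    `records` is an iterable of dicts with hypothetical keys
    "word", "year" and "text"; years are inclusive on both ends.
    """
    targets = set(target_words)
    buckets = defaultdict(list)
    for rec in records:
        word = rec["word"]
        in_range = start_year <= rec["year"] <= end_year
        if word in targets and in_range and len(buckets[word]) < per_word:
            buckets[word].append(rec["text"])
    return dict(buckets)
```

With start_year=2010 and end_year=2020 this mirrors the Twitter window described above, and with 1980 and 2010 it mirrors the COHA window.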

Dataset Scope:

The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:

Demonstrates semantic diversity, balancing slang and non-slang senses.
Offers robust representation across both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset is a public resource that fosters research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a comprehensive platform for exploring the nuances of slang in natural language.

Files

Restricted

The record is publicly accessible, but files are restricted.

Additional details

Created: 2024-10-15

