SlangTrack Dataset
Authors/Creators
Description
The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang senses, so that a binary classifier can be trained to distinguish the two. Each target word comes with examples of both usages, which supports fine-grained linguistic and computational analysis by NLP researchers and practitioners.
Key Features:
- Unique Words: 48,508
- Total Tokens: 310,170
- Average Post Length: 34.6 words
- Average Sentences per Post: 3.74
These statistics describe posts long enough to give each target word the surrounding context needed for slang detection and semantic analysis.
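As an illustration of how summary statistics like those above could be computed, here is a minimal sketch. The tokenizer (a lowercase regex) and the sentence splitter (a split on `.`, `!`, `?`) are assumptions made for this example; the dataset description does not say which tools produced the published figures, so the numbers from this sketch may not match them exactly. The function name `corpus_stats` is hypothetical.

```python
import re

def corpus_stats(posts):
    """Return summary statistics for a list of post strings.

    Assumption: tokens are lowercase runs of letters, digits, and apostrophes,
    and sentences end at '.', '!' or '?'. The dataset's own tooling may differ.
    """
    tokens_per_post = [re.findall(r"[a-z0-9']+", post.lower()) for post in posts]
    all_tokens = [tok for toks in tokens_per_post for tok in toks]
    sentences_per_post = [
        len([s for s in re.split(r"[.!?]+", post) if s.strip()])
        for post in posts
    ]
    n = len(posts)
    return {
        "unique_words": len(set(all_tokens)),
        "total_tokens": len(all_tokens),
        "avg_post_length": len(all_tokens) / n if n else 0.0,
        "avg_sentences_per_post": sum(sentences_per_post) / n if n else 0.0,
    }

# Two made-up posts, not taken from the dataset.
example_posts = ["That party was lit. Truly lit!", "He lit the candle."]
stats = corpus_stats(example_posts)
print(stats)
```

A real pipeline would likely use a tokenizer that handles hashtags, mentions, and emoji, which matter for Twitter text.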
Target Word Selection:
The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:
- Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
- Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
- Was cross-referenced against trusted resources such as Green's Dictionary of Slang, Urban Dictionary, the Online Slang Dictionary, and the Oxford English Dictionary.
- Features at least one slang and one dominant non-slang sense.
- Excludes proper nouns to maintain linguistic relevance and focus.
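The selection criteria above can be read as a filter over candidate words. The sketch below is one way to express that filter; it is not the authors' code. The `Sense` and `Candidate` classes, their field names, and the function `is_eligible` are hypothetical, and the example entries are invented for illustration rather than taken from the dataset.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    gloss: str
    is_slang: bool
    is_dominant: bool = False  # assumption: dominance is annotated per sense

@dataclass
class Candidate:
    word: str
    in_slang_wordlist: bool
    in_coha: bool
    is_proper_noun: bool
    senses: list = field(default_factory=list)

def is_eligible(c):
    """Apply the listed criteria to one candidate word."""
    if not (c.in_slang_wordlist and c.in_coha):
        return False  # must appear in both resources
    if c.is_proper_noun:
        return False  # proper nouns are excluded
    if not 2 <= len(c.senses) <= 8:
        return False  # between 2 and 8 distinct senses
    has_slang = any(s.is_slang for s in c.senses)
    has_dominant_non_slang = any(
        (not s.is_slang) and s.is_dominant for s in c.senses
    )
    return has_slang and has_dominant_non_slang

# Invented entries for illustration only.
kept = Candidate("wordA", True, True, False,
                 [Sense("slang meaning", True),
                  Sense("literal meaning", False, is_dominant=True)])
dropped = Candidate("wordB", True, True, False,
                    [Sense("slang meaning", True)])  # only one sense
print(is_eligible(kept), is_eligible(dropped))
```

The cross-referencing step against the dictionaries is not modeled here; in this sketch it is assumed to have already produced the sense list.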
Data Sources and Collection:
1. Corpus of Historical American English (COHA):
- Historical examples were extracted from the cleaned version of COHA (CCOHA).
- Data spans the years 1980–2010, capturing the evolution of target words over time.
2. Twitter:
- Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
- For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.
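The collection rules above (a year range per source and a fixed number of tweets per target word) could be applied to raw records roughly as follows. This is a sketch under assumptions: the record layout (`source`, `year`, `target` keys), the function name `select_examples`, and the treatment of both end years as inclusive are not specified in the description.

```python
from collections import defaultdict

# Year ranges per source as stated above; inclusive bounds are an assumption.
YEAR_RANGES = {"coha": (1980, 2010), "twitter": (2010, 2020)}
TWEETS_PER_WORD = 1000

def select_examples(records, tweets_per_word=TWEETS_PER_WORD):
    """Keep records inside their source's year range, capping tweets per word."""
    kept = []
    tweet_counts = defaultdict(int)
    for rec in records:
        source = rec["source"]
        if source not in YEAR_RANGES:
            continue  # unknown source
        first, last = YEAR_RANGES[source]
        if not first <= rec["year"] <= last:
            continue  # outside the collection window
        if source == "twitter":
            if tweet_counts[rec["target"]] >= tweets_per_word:
                continue  # cap already reached for this target word
            tweet_counts[rec["target"]] += 1
        kept.append(rec)
    return kept

# Invented records for illustration only.
sample = [
    {"source": "coha", "year": 1975, "target": "wordA"},     # too early
    {"source": "coha", "year": 1995, "target": "wordA"},
    {"source": "twitter", "year": 2015, "target": "wordA"},
    {"source": "twitter", "year": 2016, "target": "wordA"},  # over the cap below
]
selected = select_examples(sample, tweets_per_word=1)
print(len(selected))
```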
Dataset Scope:
The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:
- Demonstrates semantic diversity, balancing slang and non-slang senses.
- Offers robust representation across both historical (COHA) and modern (Twitter) contexts.
The SlangTrack Dataset is a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.
Additional details
Dates
- Created: 2024-10-15