There is a newer version of the record available.

Published October 15, 2024 | Version v1
Dataset Restricted

SlangTrack Dataset

Description

The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang senses, so that a binary classifier can be trained to distinguish the two usages. By providing examples of each usage, the dataset supports fine-grained linguistic and computational analysis for both researchers and practitioners in NLP.

Key Features:

  • Unique Words: 48,508
  • Total Tokens: 310,170
  • Average Post Length: 34.6 words
  • Average Sentences per Post: 3.74

These corpus statistics indicate how much surrounding context each example offers for slang detection and semantic analysis.
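Statistics of this kind can be recomputed from the raw posts. The sketch below is a minimal illustration, not the authors' procedure: the function name `corpus_stats`, the regex tokenizer, and the punctuation-based sentence splitter are all assumptions, and different tokenization choices will give different counts.

```python
import re

def corpus_stats(posts):
    """Compute basic corpus statistics for a list of post strings.

    Assumption: tokens are runs of letters/apostrophes after lowercasing,
    and sentences are split on runs of '.', '!' or '?'.
    """
    tokenized = [re.findall(r"[a-z']+", post.lower()) for post in posts]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    unique_words = len({tok for tokens in tokenized for tok in tokens})
    # Count non-empty fragments left after splitting on sentence punctuation.
    sentence_counts = [
        len([s for s in re.split(r"[.!?]+", post) if s.strip()])
        for post in posts
    ]
    n = len(posts)
    return {
        "unique_words": unique_words,
        "total_tokens": total_tokens,
        "avg_post_length": total_tokens / n if n else 0.0,
        "avg_sentences_per_post": sum(sentence_counts) / n if n else 0.0,
    }

# Two made-up posts for illustration only (not taken from the dataset).
sample_posts = [
    "That movie was sick. I loved it!",
    "He stayed home because he was sick.",
]
stats = corpus_stats(sample_posts)
```

Here `stats` holds the four figures for the two sample posts; run on the full corpus, the same function would yield values comparable to the ones listed above only if the tokenization matches the one the dataset creators used.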

Target Word Selection:

The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:

  • Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
  • Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
  • Was cross-referenced using trusted resources such as:
    • Green's Dictionary of Slang
    • Urban Dictionary
    • Online Slang Dictionary
    • Oxford English Dictionary
  • Features at least one slang and one dominant non-slang sense.
  • Excludes proper nouns to maintain linguistic relevance and focus.
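The selection criteria above can be read as a filter over candidate words. The following is a rough sketch under stated assumptions: the `Sense` class, the `is_eligible` function, the two vocabulary sets, and the capitalization-based proper-noun check are hypothetical stand-ins, and the example word is illustrative rather than a confirmed target word. Dominance of the non-slang sense and the dictionary cross-referencing are not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Sense:
    gloss: str
    is_slang: bool

def is_eligible(word, senses, slang_wordlist, coha_vocab):
    """Return True if a candidate word satisfies the listed criteria."""
    # Must appear in both the slang wordlist and the COHA vocabulary.
    if word not in slang_wordlist or word not in coha_vocab:
        return False
    # Crude proper-noun check (assumption): reject capitalized entries.
    if word[:1].isupper():
        return False
    # Between 2 and 8 distinct senses.
    if not 2 <= len(senses) <= 8:
        return False
    # At least one slang sense and at least one non-slang sense.
    has_slang = any(s.is_slang for s in senses)
    has_non_slang = any(not s.is_slang for s in senses)
    return has_slang and has_non_slang

# Hypothetical inputs for illustration only.
slang_wordlist = {"sick", "Paris"}
coha_vocab = {"sick", "Paris", "table"}
sick_senses = [Sense("ill, unwell", False), Sense("excellent", True)]

eligible = is_eligible("sick", sick_senses, slang_wordlist, coha_vocab)
```

In this toy setup `eligible` is true for "sick", while a word missing from either list, a capitalized entry, or a word with only non-slang senses would be rejected.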

Data Sources and Collection:

1. Corpus of Historical American English (COHA):

  • Historical examples were extracted from the cleaned version of COHA (CCOHA).
  • Data spans the years 1980–2010, capturing the evolution of target words over time.

2. Twitter:

  • Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
  • For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.
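A per-word collection step with a year window and a cap on the number of examples might look like the sketch below. It is an assumption-laden illustration, not the collection pipeline used for the dataset: the record layout (`text`, `year` keys), the function name `sample_examples`, and the random down-sampling are all hypothetical.

```python
import random
import re

def sample_examples(records, target_word, start_year, end_year, limit=1000, seed=0):
    """Select up to `limit` records that contain `target_word` within a year range.

    Assumption: each record is a dict with a 'text' string and an integer 'year'.
    """
    matches = [
        r for r in records
        if start_year <= r["year"] <= end_year
        and target_word in re.findall(r"[a-z']+", r["text"].lower())
    ]
    # Down-sample reproducibly if more matches exist than the cap allows.
    if len(matches) > limit:
        matches = random.Random(seed).sample(matches, limit)
    return matches

# Made-up records for illustration only.
records = [
    {"text": "That trick was sick!", "year": 2015},
    {"text": "Feeling sick today.", "year": 2009},
    {"text": "Nothing to see here.", "year": 2018},
    {"text": "So sick of this weather", "year": 2020},
]
selected = sample_examples(records, "sick", 2010, 2020, limit=1000)
```

With these toy records, `selected` contains the 2015 and 2020 posts; the 2009 post falls outside the window and the 2018 post does not contain the target word.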

Dataset Scope:

The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:

  • Demonstrates semantic diversity, balancing slang and non-slang senses.
  • Offers robust representation across both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset is a public resource intended to foster research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a platform for exploring the nuances of slang in natural language.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Dates

Created
2024-10-15