There is a newer version of the record available.

Published October 15, 2024 | Version v1
Dataset Restricted

SlangTrack Dataset

Authors/Creators

Description

The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that are used in both slang and non-slang senses, so that a binary classifier can be trained to distinguish the two. Each usage is illustrated with examples, which supports fine-grained linguistic and computational analysis by NLP researchers and practitioners.

Key Features:

  • Unique Words: 48,508
  • Total Tokens: 310,170
  • Average Post Length: 34.6 words
  • Average Sentences per Post: 3.74

These figures indicate that a typical post supplies several sentences of context around a target word, which supports slang detection and semantic analysis.
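As a rough illustration, statistics of this kind can be computed from a list of posts as in the sketch below. The tokenization used for the published figures is not stated in this record, so lowercase whitespace tokenization and punctuation-based sentence splitting are assumptions here, and `corpus_stats` and `sample_posts` are hypothetical names.

```python
import re


def corpus_stats(posts):
    """Return vocabulary size, token count, and per-post averages for a list of strings."""
    # Assumption: tokens are lowercase, whitespace-separated strings.
    tokenized = [post.lower().split() for post in posts]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    unique_words = len({tok for tokens in tokenized for tok in tokens})

    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentence_counts = [
        len([s for s in re.split(r"[.!?]+", post) if s.strip()])
        for post in posts
    ]

    n = len(posts)
    return {
        "unique_words": unique_words,
        "total_tokens": total_tokens,
        "avg_post_length": total_tokens / n if n else 0.0,
        "avg_sentences_per_post": sum(sentence_counts) / n if n else 0.0,
    }


# Toy posts for illustration only; they are not taken from the dataset.
sample_posts = [
    "That concert was sick. I loved it!",
    "He stayed home because he was sick.",
]
stats = corpus_stats(sample_posts)
print(stats)
```

A different tokenizer or sentence splitter would give somewhat different counts, so numbers produced this way should not be expected to match the published ones exactly.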

Target Word Selection:

The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:

  • Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
  • Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
  • Was cross-referenced using trusted resources such as:
    • Green's Dictionary of Slang
    • Urban Dictionary
    • Online Slang Dictionary
    • Oxford English Dictionary
  • Features at least one slang and one dominant non-slang sense.
  • Excludes proper nouns to maintain linguistic relevance and focus.
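The selection criteria above can be read as a single filter over candidate words. The sketch below is one possible encoding, not the authors' actual pipeline: the `Sense` and `Candidate` records and the `is_eligible` function are hypothetical, and the dictionary cross-referencing step is assumed to have already produced the sense labels.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str
    is_slang: bool
    is_dominant: bool = False  # hypothetical flag for the dominant sense


@dataclass
class Candidate:
    word: str
    in_slang_wordlist: bool  # present in the slang wordlist
    in_coha: bool            # present in COHA
    is_proper_noun: bool
    senses: list = field(default_factory=list)


def is_eligible(candidate):
    """Apply the listed criteria to one candidate word."""
    has_slang = any(s.is_slang for s in candidate.senses)
    has_dominant_non_slang = any(
        (not s.is_slang) and s.is_dominant for s in candidate.senses
    )
    return (
        candidate.in_slang_wordlist
        and candidate.in_coha
        and 2 <= len(candidate.senses) <= 8
        and has_slang
        and has_dominant_non_slang
        and not candidate.is_proper_noun
    )


# Toy candidate for illustration only; not necessarily a word in the dataset.
example = Candidate(
    word="salty",
    in_slang_wordlist=True,
    in_coha=True,
    is_proper_noun=False,
    senses=[
        Sense("containing salt", is_slang=False, is_dominant=True),
        Sense("bitter or resentful", is_slang=True),
    ],
)
print(is_eligible(example))
```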

Data Sources and Collection:

1. Corpus of Historical American English (COHA):

  • Historical examples were extracted from the cleaned version of COHA (CCOHA).
  • Data spans the years 1980–2010, capturing the evolution of target words over time.

2. Twitter:

  • Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
  • For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.
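A collection step of this shape, filtering by year range and capping the number of examples per target word, could look like the sketch below. The record layout (dicts with `year` and `text` keys), the whole-word matching rule, and the `collect_examples` function are assumptions for illustration; how the examples were actually retrieved or sampled is not described in this record.

```python
import random
import re


def collect_examples(posts, target, start_year, end_year, limit=1000, seed=0):
    """Return up to `limit` posts from [start_year, end_year] that contain `target`."""
    # Assumption: a case-insensitive whole-word match counts as an occurrence.
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    matches = [
        post
        for post in posts
        if start_year <= post["year"] <= end_year and pattern.search(post["text"])
    ]
    if len(matches) <= limit:
        return matches
    # Assumption: a seeded random sample is drawn when there are too many matches.
    return random.Random(seed).sample(matches, limit)


# Toy records for illustration only; they are not taken from the dataset.
posts = [
    {"year": 2012, "text": "That drop was sick"},
    {"year": 2015, "text": "Feeling sick today"},
    {"year": 2009, "text": "sick of this"},
    {"year": 2018, "text": "sickness is going around"},
    {"year": 2019, "text": "So SICK of waiting"},
]
modern = collect_examples(posts, "sick", 2010, 2020)
print(len(modern))
```

The same function would cover the historical source by passing 1980 and 2010 as the year bounds.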

Dataset Scope:

The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:

  • Demonstrates semantic diversity, balancing slang and non-slang senses.
  • Offers robust representation across both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset is a public resource intended to foster research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a platform for exploring the nuances of slang in natural language.

Files

Restricted

The record is publicly accessible, but files are restricted.

Additional details

Dates

Created
2024-10-15