GWSD: A Graded Word Sense Disambiguation Dataset

Cassotti, Pierluigi; Tahmasebi, Nina

doi:10.5281/zenodo.14974455

Published March 5, 2025 | Version v1

Dataset Open

1. University of Gothenburg

The GWSD Dataset (Graded Word Sense Disambiguation Dataset) is a sense-annotated dataset designed for studying diachronic word usage and semantic change. It contains:

- 2584 word usages from the Oxford English Dictionary (OED) and

- 2584 automatically generated word usage examples.

In particular, the automatically generated word usages are obtained using Janus, a fine-tuned language model trained on the Oxford English Dictionary (OED), allowing for temporally aligned and sense-specific word usage examples spanning historical periods from 1700 to 2010.

Each usage is paired with a sense definition, and human annotators rated how well the definition expresses the meaning of the word in that particular usage (0: Cannot decide, 1: Unrelated, 2: Distantly Related, 3: Closely Related, 4: Identical).
We used Amazon Mechanical Turk to collect annotations from crowd workers based in the United States, Canada, the United Kingdom, or Australia.
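
As a minimal sketch of how the graded judgments could be aggregated, the snippet below maps the five scores to their labels and averages one item's annotations while setting aside 0 (Cannot decide). Treating 0 as missing rather than as the low end of the scale is an assumption made here for illustration, not something the dataset description prescribes; the names SCALE and mean_judgment are hypothetical.

```python
from statistics import mean

# Label for each score, as listed in the dataset description.
SCALE = {
    0: "Cannot decide",
    1: "Unrelated",
    2: "Distantly Related",
    3: "Closely Related",
    4: "Identical",
}


def mean_judgment(scores):
    """Average the graded scores, ignoring 0 (Cannot decide).

    Returns None when no annotator gave a usable score.
    Dropping 0 is an illustrative assumption, not the authors' stated method.
    """
    usable = [s for s in scores if s != 0]
    return mean(usable) if usable else None


# Illustrative scores only; these are not taken from the dataset.
print(mean_judgment([4, 3, 0, 4]))
print(mean_judgment([0, 0]))
```

Other aggregation choices (median, majority label, keeping 0 as its own category) may suit a given study better.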

The dataset is particularly useful for word sense disambiguation (WSD), historical linguistics, lexical semantic change detection (LSCD), and diachronic NLP tasks.

Dataset Content
Each entry in the dataset corresponds to a word sense usage example, structured as follows:

Text: The full sentence containing the target word.
Start, End: Character indices marking the position of the target word in the sentence.
Lemma: The base (root) form of the target word.
POS Tag: Part-of-speech tag (e.g., "nn" for nouns, "vb" for verbs, "jj" for adjectives).
Sense Definition: The dictionary-provided meaning of the word in this context.
Text Year: The historical year in which the usage originated (OED) or for which it was generated (Janus).
Text Source: The model/source from which the sentence was generated (i.e. OED/Janus).
OED Ground Truth: The reference sense label from the Oxford English Dictionary (scale 1–4).
Annotators: The list of human annotators who evaluated the sense correctness.
Annotations: Scores provided by annotators, typically on a 0–4 scale.
Annotation Time: The time (in seconds) taken by each annotator to assess the sentence.
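
The fields above can be read with a few lines of Python. This is a rough sketch under stated assumptions: the key names ("text", "start", "end", and so on) are guesses at how the fields are spelled in the JSONL file, the End index is assumed to be exclusive as in Python slicing, and the sample record is invented for illustration. The helpers read_jsonl and target_span are hypothetical names.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def target_span(record):
    """Return the target word, assuming 'end' is exclusive (Python slicing)."""
    return record["text"][record["start"]:record["end"]]


# A made-up record for illustration only; it is not taken from the dataset,
# and the key names are assumptions about how the fields are spelled.
sample_line = json.dumps({
    "text": "She sat on the bank of the river.",
    "start": 15,
    "end": 19,
    "lemma": "bank",
    "pos": "nn",
    "annotations": [4, 4, 3],
})

record = json.loads(sample_line)
print(target_span(record))  # prints: bank
```

With the real file, `read_jsonl("gwsd_dataset.jsonl")` would return one dict per usage; inspecting `records[0].keys()` first is a quick way to confirm the actual key names and whether End is inclusive.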

Citation
If you use this dataset in your research, please cite the following paper:

Pierluigi Cassotti, Nina Tahmasebi; Sense-specific Historical Word Usage Generation. Transactions of the Association for Computational Linguistics 2025; 13: 690–708. doi: https://doi.org/10.1162/tacl_a_00761

Files

Files (2.6 MB)

Name	Size
gwsd_dataset.jsonl (md5:ec8810035991657642e46d984969c144)	2.6 MB
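
The MD5 checksum listed above can be used to confirm that a download is intact. The following is a small sketch using Python's standard hashlib module; the local path passed to is_intact is an assumption about where the file was saved, and the function names are hypothetical.

```python
import hashlib

# Checksum as given in the file listing for gwsd_dataset.jsonl.
EXPECTED_MD5 = "ec8810035991657642e46d984969c144"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

For example, `is_intact("gwsd_dataset.jsonl")` should return True for an unmodified copy saved in the current directory.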
