Published March 5, 2025 | Version v1

GWSD: A Graded Word Sense Disambiguation Dataset

  • 1. University of Gothenburg

Description

The GWSD Dataset (Graded Word Sense Disambiguation Dataset) is a sense-annotated dataset designed for studying diachronic word usage and semantic change. It contains:

- 2584 word usages from the Oxford English Dictionary (OED), and

- 2584 automatically generated word usage examples.

In particular, the automatically generated word usages were obtained using Janus, a language model fine-tuned on the Oxford English Dictionary (OED), which allows for temporally aligned and sense-specific word usage examples spanning the period 1700–2010.

Each usage is paired with a sense definition, and human annotators rated how well the definition expresses the meaning of the word in that particular usage (0: Cannot decide, 1: Unrelated, 2: Distantly Related, 3: Closely Related, 4: Identical).
We used Amazon Mechanical Turk to collect annotations from crowd workers based in the United States, Canada, the United Kingdom, or Australia.
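
As a rough illustration of how the graded judgments might be consumed, the sketch below maps the five labels to their names and averages the scores collected for one usage, leaving out 0 (Cannot decide). Dropping the 0 judgments and taking the mean are assumptions made here for illustration, not necessarily the aggregation used by the dataset authors; `LABELS` and `mean_judgment` are hypothetical names.

```python
from statistics import mean

# Label names follow the annotation scale described above.
LABELS = {
    0: "Cannot decide",
    1: "Unrelated",
    2: "Distantly Related",
    3: "Closely Related",
    4: "Identical",
}


def mean_judgment(scores):
    """Average the graded judgments for one usage.

    Assumption: 0 ("Cannot decide") carries no relatedness
    information, so it is excluded before averaging.
    Returns None when no informative judgment remains.
    """
    for score in scores:
        if score not in LABELS:
            raise ValueError(f"unexpected label: {score!r}")
    informative = [score for score in scores if score != 0]
    return mean(informative) if informative else None
```

For example, `mean_judgment([4, 3, 0])` averages only the 4 and the 3, while a usage judged 0 by every annotator yields `None` rather than a misleading score.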

The dataset is particularly useful for word sense disambiguation (WSD), historical linguistics, lexical semantic change detection (LSCD), and diachronic NLP tasks.

Dataset Content
Each entry in the dataset corresponds to a word sense usage example, structured as follows:

Text: The full sentence containing the target word.
Start, End: Character indices marking the position of the target word in the sentence.
Lemma: The base (root) form of the target word.
POS Tag: Part-of-speech tag (e.g., "nn" for nouns, "vb" for verbs, "jj" for adjectives).
Sense Definition: The dictionary-provided meaning of the word in this context.
Text Year: The year in which the usage originated (OED) or for which it was generated (Janus).
Text Source: The source of the sentence, either OED or Janus.
OED Ground Truth: The reference sense label from the Oxford English Dictionary (scale 1–4).
Annotators: The list of human annotators who evaluated the sense correctness.
Annotations: Scores provided by annotators, typically on a 0–4 scale.
Annotation Time: The time (in seconds) taken by each annotator to assess the sentence.
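
The field list above can be pictured as one record per usage. The sketch below is a minimal illustration, not the actual file layout: the dictionary keys, the example sentence, and its values are all invented here, and it assumes that Start and End are 0-based character offsets with an exclusive End (the convention of Python slicing). Check these assumptions against the downloaded file before relying on them.

```python
# Hypothetical record: keys and values are invented for illustration
# and are not taken from the dataset.
example = {
    "text": "The bank of the river was steep.",
    "start": 4,   # assumed 0-based offset of the first character
    "end": 8,     # assumed exclusive end offset
    "lemma": "bank",
    "pos": "nn",
}


def target_span(record):
    """Return the surface form of the target word in the sentence."""
    return record["text"][record["start"]:record["end"]]


def mark_target(record, left="[", right="]"):
    """Return the sentence with the target word wrapped in markers."""
    text, start, end = record["text"], record["start"], record["end"]
    return text[:start] + left + text[start:end] + right + text[end:]
```

Comparing `target_span(record)` with the Lemma field on a few real entries is a quick way to confirm whether the offsets are in fact 0-based and end-exclusive.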

 

Citation
If you use this dataset in your research, please cite the following paper:

Pierluigi Cassotti, Nina Tahmasebi; Sense-specific Historical Word Usage Generation. Transactions of the Association for Computational Linguistics 2025; 13: 690–708. doi: https://doi.org/10.1162/tacl_a_00761

Files

Files (2.6 MB)

md5:ec8810035991657642e46d984969c144 (2.6 MB)