Amharic WSD Dataset: Advancing Word Sense Disambiguation in Amharic
Creators
- Addis Ababa Institute of Technology (AAiT)
Description
This dataset is designed for the Word Sense Disambiguation (WSD) task in Amharic and consists of 50,415 annotated sentences. Each sentence is annotated with the correct sense of one of 200 ambiguous words. The words were chosen on the basis of homonymy, where a single word form has multiple meanings depending on its context.
The ambiguous words were selected to capture the nuances of Amharic vocabulary and draw on diverse textual sources such as news articles, literature, and social media. This breadth gives a representative range of usage across contexts, which makes the dataset valuable for Amharic NLP research. Potential applications include machine translation, sentiment analysis, and other semantic processing tasks in Amharic.
The dataset is organized in a structured format. Each entry contains fields for sentence, ambiguous word, sense, gloss, and sense label, so it can be fed to machine learning models with little preprocessing.
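The entry structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the description does not specify the file format or the exact column names, so the CSV layout, the names `sentence`, `ambiguous_word`, `sense`, `gloss`, and `sense_label`, and the helpers `read_entries` and `senses_per_word` are all hypothetical. The sample rows are placeholders, not real dataset content.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WSDEntry:
    """One annotated example; field names are assumed, not confirmed."""
    sentence: str
    ambiguous_word: str
    sense: str
    gloss: str
    sense_label: str


def read_entries(handle):
    """Parse CSV rows with a header line into WSDEntry objects."""
    reader = csv.DictReader(handle)
    return [
        WSDEntry(
            sentence=row["sentence"],
            ambiguous_word=row["ambiguous_word"],
            sense=row["sense"],
            gloss=row["gloss"],
            sense_label=row["sense_label"],
        )
        for row in reader
    ]


def senses_per_word(entries):
    """Count how many examples each sense label has, per ambiguous word."""
    counts = {}
    for entry in entries:
        counts.setdefault(entry.ambiguous_word, Counter())[entry.sense_label] += 1
    return counts


# Placeholder rows standing in for real data (assumed CSV layout).
SAMPLE = (
    "sentence,ambiguous_word,sense,gloss,sense_label\n"
    "placeholder sentence 1,word_a,sense_1,placeholder gloss 1,0\n"
    "placeholder sentence 2,word_a,sense_2,placeholder gloss 2,1\n"
    "placeholder sentence 3,word_a,sense_1,placeholder gloss 1,0\n"
)

entries = read_entries(io.StringIO(SAMPLE))
distribution = senses_per_word(entries)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by a file opened with `encoding="utf-8"` so that Amharic text is read correctly. Checking the per-word sense distribution is a useful first step, since imbalanced senses affect how a WSD model should be evaluated.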