Word Sense Change Testset
Creators
- 1. Sprakbanken, University of Gothenburg, Sweden
- 2. L3S Research Center, Leibniz Universität Hannover, Germany
Description
Overview
This testset consists of 23 terms that have experienced word sense change during the past centuries. The main changes for each term were found using Wikipedia, dictionary.com and the Oxford English Dictionary. We consider major changes in usage as well as changes in sense. In cases where multiple (fine-grained) senses were available, we opted to accept the widest sense. For example, for the term rock we consider a single music sense without distinguishing between different types of rock music, because our dataset is unlikely to contain fine-grained sense differentiations. If a clear time point for a change cannot be pinpointed, we choose the earliest possible one. For comparison purposes we also chose a set of 11 terms that have experienced minimal change during the investigated period, i.e., stable terms.
Supplementary material
1. testset.txt
Contains a list of all terms and the different change types for each term with a short description of the sense and change.
2. Files of the kind "TERM.txt"
The header states the term and which clustering coefficient, similarity threshold and similarity measure were used.
A path starts with "Path:".
A unit starts with "UNIT:". The numbers that follow give, first, the number of years that the unit spans and, then, a list of all years that the internal clusters stem from.
E.g., UNIT: 83 1785, 1787, 1790, 1793, 1798, 1801, 1823, 1867, spans 83 years and consists of clusters from the years 1785, 1787, 1790 etc.
Indentation shows the tree structure: more indentation means a lower-level branch in the tree.
As an example, in AEROPLANE.txt unit UNIT: 23 1908, 1909, 1910, 1911, 1914, 1918, 1930, 1908 is the root node and the unit is related to UNIT: 27 1916, 1919, 1924, 1932, 1942, 1916.
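The unit format described above can be read with a few lines of Python. This is a minimal sketch, not the tooling used to produce the files: the names `Unit` and `parse_unit_line` are hypothetical, and it assumes that each unit sits on a single line and that the number of leading whitespace characters reflects the depth in the tree. In the examples quoted in this description, the span equals the last year minus the first year plus one (e.g. 1867 - 1785 + 1 = 83), which can serve as a sanity check.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed layout: optional indentation, "UNIT:", the span, then the years.
UNIT_RE = re.compile(r"^(\s*)UNIT:\s*(\d+)\s*(.*)$")


@dataclass
class Unit:
    depth: int        # leading whitespace width (assumed to encode tree depth)
    span: int         # number of years the unit spans
    years: List[int]  # years the internal clusters stem from


def parse_unit_line(line: str) -> Optional[Unit]:
    """Parse one 'UNIT:' line; return None for any other kind of line."""
    match = UNIT_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    indent, span, rest = match.groups()
    # Collect every four-digit year listed after the span.
    years = [int(y) for y in re.findall(r"\d{4}", rest)]
    return Unit(depth=len(indent.expandtabs(4)), span=int(span), years=years)


unit = parse_unit_line("UNIT: 83 1785, 1787, 1790, 1793, 1798, 1801, 1823, 1867,")
print(unit.span, unit.years[0], unit.years[-1])  # 83 1785 1867
```

Lines that start with "Path:" or belong to the header are returned as `None` here and would need their own handling.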
Interesting findings
The longest units and paths are found for stable terms, e.g., newspaper. These are statistically significantly longer than the average units and paths for terms that later evolve.
Newspaper has a unit that spans 145 years, and its first path spans from 1852 to 2007.
FLIGHT.txt
For the term flight we find that the first unit captures a name: Flight & Robson, who were organ builders.
The second unit (in its own path) represents the flight over a hurdle: UNIT: 28 1868, 1869, 1870, 1877, 1885, 1889, 1890, 1892, 1893, 1894, 1895
There is a unit (in its own path) that represents the flight of a cricket ball: UNIT: 29 1938, 1957, 1966
Finally, the last path represents flight as in a means of transportation, in particular for holidays, starting with UNIT: 19 1962, 1970, 1973, 1980
TAPE.txt
The first path for tape is related to sewing tape.
Then there is a second path, starting with UNIT: 38 1970, 1974, 2007, that takes up the musical tape.
The last path ends in the same units as the second path and is also related to the musical tape.
The musical tape and the sewing tape should be related because of their shape, but we cannot find any relation as there are few or no overlapping terms.
Files
WSE-testset.zip (16.1 kB, md5:546883de341b6ba97796910465aae57f)