Published June 21, 2017 | Version v1
Dataset Open

Word Sense Change Testset

  • 1. Sprakbanken, University of Gothenburg, Sweden
  • 2. L3S Research Center, Leibniz Universität Hannover, Germany

Description

Overview

This testset consists of 23 terms which have experienced word sense change during the past centuries. The main changes for each term were found using Wikipedia, dictionary.com and the Oxford English Dictionary. We consider major changes in usage as well as changes in sense. In cases where multiple (fine-grained) senses were available, we opted to accept the widest sense. For example, for the term rock we consider a single music sense, without distinguishing between different types of rock music, because our dataset is unlikely to have fine-grained sense differentiations. If a clear time point cannot be pinpointed, we choose the earliest possible one. For comparison purposes we also chose a set of 11 terms that have experienced minimal change during the investigated period, i.e., stable terms.

 

Supplementary material

1. testset.txt

Contains a list of all terms and the different change types for each term with a short description of the sense and change.

 

2. Files of the kind "TERM.txt"

The header states the term, the clustering coefficient, the similarity threshold, and the similarity measure that were used.

A path starts with "Path:". 

A unit starts with "UNIT:". The numbers that follow give, first, the number of years that the unit spans and, second, a list of all years that the internal clusters stem from.

E.g., UNIT: 83 1785, 1787, 1790, 1793, 1798, 1801, 1823, 1867, spans 83 years and consists of clusters from the years 1785, 1787, 1790, etc.
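A minimal sketch of reading one such unit line, assuming the layout shown in the example above (the prefix "UNIT:", then the span, then the years, separated by spaces or commas). The function names parse_unit and inclusive_span are illustrative and not part of the testset. The inclusive span (last year minus first year plus one) matches the span value in the examples quoted in this description, but that relation is an observation, not a documented rule of the file format.

```python
import re


def parse_unit(line):
    """Split a 'UNIT:' line into (span, years). Hypothetical helper."""
    text = line.strip()
    if not text.startswith("UNIT:"):
        raise ValueError("not a UNIT line: %r" % line)
    # Collect every integer after the prefix; commas and spaces are ignored.
    numbers = [int(n) for n in re.findall(r"\d+", text[len("UNIT:"):])]
    if len(numbers) < 2:
        raise ValueError("expected a span and at least one year: %r" % line)
    return numbers[0], numbers[1:]


def inclusive_span(years):
    """Number of calendar years from the earliest to the latest year, inclusive."""
    return max(years) - min(years) + 1


span, years = parse_unit("UNIT: 83 1785, 1787, 1790, 1793, 1798, 1801, 1823, 1867,")
print(span, min(years), max(years), inclusive_span(years) == span)
# prints: 83 1785 1867 True
```

Comparing the stated span with inclusive_span(years) is a cheap sanity check when reading the files, under the assumption above.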

Indentation shows the tree structure: more indentation means a lower-level branch in the tree.

As an example, in AEROPLANE.txt the unit UNIT: 23 1908, 1909, 1910, 1911, 1914, 1918, 1930, 1908 is the root node, and it is related to the unit UNIT: 27 1916, 1919, 1924, 1932, 1942, 1916.
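The indentation rule can be turned into a tree with a small stack-based reader. The sketch below is an assumption about how the files might be processed, not the authors' tooling: it treats the number of leading whitespace characters as the depth and attaches each line to the nearest preceding line with less indentation. The function name build_tree and the two-space indentation in the sample are hypothetical; the actual files may use tabs or a different width.

```python
def build_tree(lines):
    """Nest lines by leading-whitespace width. Hypothetical helper."""
    root = {"text": None, "children": []}
    stack = [(-1, root)]  # (indent width, node); the sentinel is never popped
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines
        indent = len(raw) - len(raw.lstrip())
        node = {"text": raw.strip(), "children": []}
        # Drop nodes at the same or deeper level; the parent is what remains on top.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root["children"]


# The two AEROPLANE units quoted above, with assumed two-space indentation.
sample = [
    "UNIT: 23 1908, 1909, 1910, 1911, 1914, 1918, 1930, 1908",
    "  UNIT: 27 1916, 1919, 1924, 1932, 1942, 1916",
]
tree = build_tree(sample)
print(len(tree), len(tree[0]["children"]))
# prints: 1 1
```

Because depth is measured in characters, a file that mixes tabs and spaces would need its indentation normalized before using this approach.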

 

Interesting findings

The longest units and paths are found for stable terms, e.g., newspaper. These are statistically significantly longer than the average units and paths for terms that later evolve.

Newspaper has a unit that spans 145 years, and its first path spans 1852 to 2007.

 

FLIGHT.txt

For the term flight we find that the first unit captures a name: Flight & Robson, who were organ builders.

The second unit (in its own path) represents the flight over a hurdle: UNIT: 28 1868, 1869, 1870, 1877, 1885, 1889, 1890, 1892, 1893, 1894, 1895

There is a unit (in its own path) that represents the flight of a cricket ball: UNIT: 29 1938, 1957, 1966

Finally, the last path represents flight as a means of transportation, in particular for holidays, starting with UNIT: 19 1962, 1970, 1973, 1980

 

TAPE.txt

The first path for tape is related to sewing tape.

Then there is a second path, starting with UNIT: 38 1970, 1974, 2007, that takes up the musical tape.

The last path ends in the same units as the second path, and is also related to the musical tape.

The music tape and the sewing tape should be related because of their shape, but we cannot find any relation as there are few or no overlapping terms.

 

Notes

This work has been funded in part by the project "Towards a knowledge-based culturomics" supported by a framework grant from the Swedish Research Council (2012-2016; dnr 2012-5738). This work is also funded in part by the European Research Council under Alexandria (ERC 339233) and the European Community's H2020 Program under SoBigData (RIA 654024).

Files

WSE-testset.zip (16.1 kB)
md5:546883de341b6ba97796910465aae57f

Additional details

Funding

SoBigData – SoBigData Research Infrastructure 654024
European Commission
ALEXANDRIA – Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives 339233
European Commission