Published March 22, 2018 | Version 2.1
Dataset Open

JeSemE models for lexical semantic change

  • 1. Friedrich-Schiller-Universität Jena, Germany

Contributors

  • 1. Friedrich-Schiller-Universität Jena, Germany

Description

Models for diachronic lexical semantics used by the Jena Semantic Explorer (JeSemE) website, which is described in our COLING 2018 paper "JeSemE: A Website for Exploring Diachronic Changes in Word Meaning and Emotion".

The models are also described and applied in Johannes Hellrich's Ph.D. thesis "Word Embeddings: Reliability & Semantic Change". Johannes Hellrich was funded by the Deutsche Forschungsgemeinschaft (DFG) within the graduate school "The Romantic Model" (GRK 2041/1).

One ZIP file per corpus, each containing several CSV files:

  • CHI.csv with χ² word association values (structure: word-id, word-id, time, value)
  • EMBEDDING.csv with SVD-PPMI word embeddings (aligned; structure: word-id, time, values)
  • EMOTION.csv with VAD word emotion values (structure: word-id, time, values)
  • FREQUENCY.csv with relative word frequency values (structure: word-id, time, value)
  • PPMI.csv with PPMI word association values (structure: word-id, word-id, time, value)
  • SIMILARITY.csv with word embedding derived word similarity values (structure: word-id, word-id, time, value)
  • WORDIDS.csv mapping words to their corpus specific IDs
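As a rough illustration of how the column structures listed above fit together, the following Python sketch joins FREQUENCY.csv-style rows (word-id, time, value) with a WORDIDS.csv-style mapping. Several details are assumptions not stated in this description: that the files are comma-separated, that they have no header row, that WORDIDS.csv rows are ordered (word, word-id), and that time is an integer year. The function names and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def read_word_ids(handle, word_first=True):
    """Return {word_id: word} from a WORDIDS.csv-like file handle.

    Assumption: two columns per row; set word_first=False if the
    actual files turn out to store (word-id, word) instead.
    """
    mapping = {}
    for row in csv.reader(handle):
        if not row:
            continue
        word, word_id = (row[0], row[1]) if word_first else (row[1], row[0])
        mapping[word_id] = word  # IDs are kept as strings
    return mapping


def read_frequency(handle, id_to_word):
    """Return {word: {time: value}} from rows of (word-id, time, value)."""
    series = defaultdict(dict)
    for row in csv.reader(handle):
        if not row:
            continue
        word_id, time, value = row[0], int(row[1]), float(row[2])
        series[id_to_word[word_id]][time] = value
    return dict(series)


# Tiny made-up sample standing in for the real files.
sample_ids = io.StringIO("heart,0\nengine,1\n")
sample_freq = io.StringIO("0,1900,0.0002\n0,1910,0.0003\n1,1900,0.0001\n")

id_to_word = read_word_ids(sample_ids)
frequencies = read_frequency(sample_freq, id_to_word)
print(frequencies["heart"][1910])
```

To apply this to a downloaded archive, the standard library's `zipfile.ZipFile` can open a member as a binary stream, which `io.TextIOWrapper` turns into a text handle for `csv.reader`. The member paths inside each ZIP are not given here, so inspecting `ZipFile.namelist()` first is advisable.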

Corpora are:

  • coha: Corpus of Historical American English

  • dta: Deutsches Textarchiv 'German Text Archive'

  • google_fiction: Google Books N-Gram corpus, English fiction subcorpus

  • google_german: Google Books N-Gram corpus, German subcorpus

  • rsc: Royal Society Corpus 

Files

Files (5.1 GB)

Checksum                              Size
md5:54f3eac9bc0c7a467c1fbc2fc98fb923  1.7 GB
md5:e832b83159f52cf7576fe7a515e45183  426.7 MB
md5:32ba9a865c3be70ed6e28592b2aa379c  2.0 GB
md5:e084dc9d97d9c3ea345f39d31095ce12  810.8 MB
md5:4114177f3fca62ae375295e0d4dd724c  111.9 MB
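
The md5 values in the file listing can be used to check a download for corruption. The sketch below uses only Python's standard `hashlib` module. The helper names are hypothetical, and which checksum belongs to which ZIP file is not recoverable from the listing here, so the local path and expected value in the commented usage line are placeholders to fill in.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listing entry such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example call (placeholders, not real values):
# matches_listing("path/to/downloaded.zip", "md5:<value from the listing>")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.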

Additional details