Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation

Panchenko, Alexander; Ruppert, Eugen; Faralli, Stefano; Ponzetto, Simone Paolo; Biemann, Chris

doi:10.5281/zenodo.485151

Published April 1, 2017 | Version v1

Dataset Open

1. Universität Hamburg
2. University of Mannheim

This dataset contains the models for interpretable Word Sense Disambiguation (WSD) used in Panchenko et al. (2017). The paper is available at https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/EACL_Interpretability___FINAL__1_.pdf.

The files were computed from a 2015 dump of the English Wikipedia. Their contents:

Induced Sense Inventories: wp_stanford_sense_inventories.tar.gz
This file contains three inventories (coarse, medium, fine)
Language Model (3-gram): wiki_text.3.arpa.gz
This file contains all n-grams up to n=3 and can be loaded into an index
Weighted Dependency Features: wp_stanford_lemma_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000.gz
This file contains weighted word--context-feature combinations, each with its count and an LMI significance score
Distributional Thesaurus (DT) of Dependency Features: wp_stanford_lemma_BIM_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000_simsortlimit200_feature expansion.gz
This file contains a DT of context features. The context feature similarities can be used for context expansion
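
Before loading the 9.5 GB language model into an index, it can be useful to read only its header. The sketch below assumes that wiki_text.3.arpa.gz follows the standard ARPA layout (a `\data\` section with `ngram N=COUNT` lines, followed by `\1-grams:` and so on); this is inferred from the file extension and is not confirmed by the dataset description. The function names are hypothetical.

```python
import gzip
import io
import re

# Matches header lines such as "ngram 3=12345" (standard ARPA layout, assumed).
NGRAM_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(stream):
    """Return {n: count} parsed from the \\data\\ header of an ARPA stream."""
    counts = {}
    in_data = False
    for raw in stream:
        line = raw.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if not in_data:
            continue  # skip any preamble before the header
        match = NGRAM_LINE.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # first n-gram section reached, e.g. \1-grams:
    return counts


def read_arpa_counts_from_gz(path):
    """Stream a gzipped ARPA file and stop as soon as the header ends."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return read_arpa_counts(handle)


# Toy input standing in for the real file (illustrative data only).
toy = io.StringIO(
    "\\data\\\nngram 1=3\nngram 2=2\nngram 3=1\n\n\\1-grams:\n-1.0\ta\t-0.5\n"
)
print(read_arpa_counts(toy))
```

For the real file, `read_arpa_counts_from_gz("wiki_text.3.arpa.gz")` would decompress only the first few lines, so it returns quickly despite the file size.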

For further information, consult the paper and the companion page: http://jobimtext.org/wsd/

Panchenko A., Ruppert E., Faralli S., Ponzetto S. P., and Biemann C. (2017): Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL'2017). Valencia, Spain. Association for Computational Linguistics.

Files (10.6 GB)

Name	MD5	Size
wiki_text.3.arpa.gz	42c76c58f8df910a109ade319ea42d7e	9.5 GB
wp_stanford_lemma_BIM_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000_simsortlimit200_feature expansion.gz	350f35fab640f87c716fd8fb48872563	344.1 MB
wp_stanford_lemma_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000.gz	3d60d4b4e98b24c282a13a9968f93b63	404.5 MB
wp_stanford_sense_inventories.tar.gz	c73849732658d0baa54b29102c0e018e	342.8 MB
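
Given the size of these downloads, it is worth checking each file against its listed MD5 checksum. The following is a minimal sketch using Python's standard `hashlib`; the checksums are copied from the file listing, while the helper names (`md5_of_file`, `verify`) and the assumption that the files sit in the current directory under their listed names are illustrative only.

```python
import hashlib

# Checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "wiki_text.3.arpa.gz": "42c76c58f8df910a109ade319ea42d7e",
    "wp_stanford_lemma_BIM_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000_simsortlimit200_feature expansion.gz": "350f35fab640f87c716fd8fb48872563",
    "wp_stanford_lemma_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000.gz": "3d60d4b4e98b24c282a13a9968f93b63",
    "wp_stanford_sense_inventories.tar.gz": "c73849732658d0baa54b29102c0e018e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

Calling `verify(name, EXPECTED_MD5[name])` for each downloaded file returns `False` for a truncated or corrupted download. Reading in chunks keeps memory use constant, even for the 9.5 GB language model.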
