Published May 29, 2019 | Version v1
Dataset Open

Hypernyms extracted from a large text corpus using Hearst lexical-syntactic patterns

  • 1. University of Hamburg


The list of hyponym-hypernym pairs was obtained by applying the lexical-syntactic patterns described in Hearst (1992) to the corpus prepared by Panchenko et al. (2016). This corpus is a concatenation of the English Wikipedia (2016 dump), Gigaword, ukWaC, and the English news corpora from the Leipzig Corpora Collection. The patterns, proposed by Marti Hearst (1992) and further extended and implemented as FSTs by Panchenko et al. (2012) for extracting (noisy) hyponym-hypernym pairs, are as follows: (i) such NP as NP, NP[,] and/or NP; (ii) NP such as NP, NP[,] and/or NP; (iii) NP, NP[,] or other NP; (iv) NP, NP[,] and other NP; (v) NP, including NP, NP[,] and/or NP; (vi) NP, especially NP, NP[,] and/or NP. Applying the patterns to the corpus yields a list of 27.6 million hyponym-hypernym pairs, each with the frequency of its occurrence in the corpus.
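To illustrate how a pattern such as (ii) turns a sentence into counted hyponym-hypernym pairs, here is a minimal Python sketch. It is not the FST implementation used to build this dataset: it assumes each noun phrase (NP) is a single alphabetic word, uses a plain regular expression in place of a proper NP chunker, and the names `extract_pairs` and `count_pairs` are hypothetical helpers introduced only for this example.

```python
import re
from collections import Counter

# Assumption: an NP is approximated by one alphabetic word that is not
# the conjunction "and" or "or". Real extraction would use an NP chunker.
NP = r"(?!(?:and|or)\b)[A-Za-z]+"

# Pattern (ii): NP such as NP, NP[,] and/or NP
SUCH_AS = re.compile(
    rf"\b({NP}) such as ({NP}(?:, {NP})*(?:,? (?:and|or) {NP})?)"
)

def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1).lower()
        # Split the enumeration on ", ", " and ", " or ", ", and ", ", or ".
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((h.lower(), hypernym) for h in hyponyms if h)
    return pairs

def count_pairs(sentences):
    """Count how often each (hyponym, hypernym) pair occurs."""
    counts = Counter()
    for sentence in sentences:
        counts.update(extract_pairs(sentence))
    return counts

sentences = [
    "We eat fruits such as apples, pears, and plums.",
    "Fruits such as apples or oranges are sweet.",
]
for (hyponym, hypernym), freq in count_pairs(sentences).most_common():
    print(hyponym, hypernym, freq)
```

Because the sketch matches on surface words alone, it also produces spurious pairs on real text, which is one reason the extracted pairs are described as noisy.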


Files (213.2 MB)
