Published June 21, 2021 | Version 2.0.0
Dataset Open

PrevDistro - Preverb Distributions in Hungarian

  • 1. Hungarian Research Centre for Linguistics

Description

PrevDistro (Preverb Distributions) is an open-source dataset containing 41.5 million corpus occurrences of 49 preverb-verb construction types. It consists of the following columns:

  • 1 sid: ID
  • 2 constype: construction type
  • 3 subtype: construction subtype
  • 4 prevpos: preverb position
  • 5 prev: preverb
  • 6 verb: verb lemma
  • 7 intervening: intervening words (as lemmas)
  • 8 actform: actual form (the same content as column 10, kwic, but lowercased)
  • 9 left: left context
  • 10 kwic: keyword in context
  • 11 right: right context
  • 12 docid: document ID from the Hungarian Gigaword Corpus
  • 13 title: document title
  • 14 style: document style (e.g. official, press, ...)
  • 15 region: document region (e.g. Transylvania, Subcarpathia, ...)
  • 16 year: year of publication (sometimes several years can be found in one document)

The first row is the header. Unspecified cell values are marked with an underscore (_).
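As a rough illustration of the layout described above, the following Python sketch streams rows and maps underscore cells to None. It rests on assumptions the description does not confirm: that the file is plain delimiter-separated text with a tab delimiter and no quoting. The helper names read_rows and count_by are hypothetical, and the sample row is invented for the example, not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Column names in the order listed in the description.
COLUMNS = [
    "sid", "constype", "subtype", "prevpos", "prev", "verb",
    "intervening", "actform", "left", "kwic", "right",
    "docid", "title", "style", "region", "year",
]


def read_rows(handle, delimiter="\t"):
    """Yield one dict per data row, turning '_' cells into None.

    ASSUMPTION: tab-delimited text without quoting; the dataset
    description does not state the actual delimiter.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)  # the first row is the header
    for cells in reader:
        yield {
            name: (None if cell == "_" else cell)
            for name, cell in zip(header, cells)
        }


def count_by(rows, column):
    """Count how often each value of `column` occurs (streams the rows)."""
    return Counter(row[column] for row in rows)


# Invented sample (header plus one row), purely for illustration;
# this is NOT real PrevDistro data.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "1", "_", "_", "_", "meg", "csinál", "_", "megcsinál",
    "_", "Megcsinál", "_", "_", "_", "_", "_", "_",
]) + "\n"

rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0]["prev"], rows[0]["verb"], rows[0]["style"])
```

For the real file, the same generator could be fed an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where both the path and the UTF-8 encoding are assumptions. Because the file is 13.2 GB, iterating row by row is likely safer than loading everything into memory.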

Notes

PrevDistro 1.0.0 (deprecated) can be found at https://science-data.hu/dataset.xhtml?persistentId=doi:10.5072/FK2/TRSD50. PrevDistro 2.0.0 adds several new columns and fixes some errors in the existing data.

Files

Files (13.2 GB)

Size: 13.2 GB
md5:686521c26f1fbbc473e210946a4ab0cb
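
After downloading, the file can be checked against the MD5 checksum listed for it. The sketch below is one possible way to do that in Python, reading the file in chunks so that 13.2 GB never has to fit in memory. The function names and the chunk size are illustrative choices, and the local file path is left for the caller to supply.

```python
import hashlib

# Checksum as listed on this page for the 13.2 GB file.
EXPECTED_MD5 = "686521c26f1fbbc473e210946a4ab0cb"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected(path)` with the location of the downloaded file returns False if the download is truncated or corrupted, in which case downloading it again is the usual remedy.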

Additional details

Related works

Is new version of
Thesis: 10.15774/PPKE.BTK.2021.019 (DOI)