Published August 29, 2024 | Version v1
Dataset Open

PluG: A Corpus of Pre-Modern Ukrainian Texts

  • 1. National Technical University "Kharkiv Polytechnic Institute"

Description

The PluG (Pluperfect GRAC) corpus is a collection of Ukrainian texts from the General Regionally Annotated Corpus of Ukrainian (GRAC: uacorpus.org). It covers texts from 1816 to 1954, including various types such as fiction, news articles, and other writings. The corpus focuses on works from before the mid-20th century and contains texts by 7,590 unique authors and 44 unique translators.

The corpus features 42,000 files with 58,676,313 tokens (about 109M tokens when counted with the Gemma tokenizer). It consists of copyright-free classic literature and other old texts suitable for LLM training, computational linguistics studies and education. The texts of the corpus were extracted from printed sources using OCR and corrected manually. The corpus includes some texts written in old orthographical systems (Kulishivka, Zhelekhivka, Skrypnykivka). The texts come from various regions of Ukraine, with many from cities like Kyiv, Lviv, and Kharkiv. PluG includes both original Ukrainian works and translations from other languages.

PluG2 is an expanded version of the PluG corpus that contains a larger collection of Western Ukrainian texts from the 1880s to the 1920s written using the orthography system of the time (Zhelekhivka). PluG2 features 73,900,596 tokens. The added texts represent not only a distinctive orthographic system, but also a separate historical variant of literary Ukrainian, which has numerous peculiar grammatical and lexical features and can cause complications when training models oriented to the modern standard.

The corpus is available under a CC-BY license. It is designed as a dataset for applied linguistic studies, providing a valuable resource for research on Ukrainian literature, language development, and cultural history of the 19th and early 20th centuries. The corpus provides a wide range of metadata for each text, including information about authors, translators, years of publication, genres, styles, and locations.

The full tagset used in the meta-annotation is available on the GRAC website: https://uacorpus.org/rozmitka-tekstiv/stili-tematika-i-zhanri

The corpus is planned to be updated yearly so that it remains current and useful for researchers.

Acknowledgements: A large part of the collection was sourced from open digital libraries, most notably the collection of Western Ukrainian newspapers assembled by Orest Drul (https://zbruc.eu/). We are grateful to Orest Drul, Maksym Bystrytskyi, Mykola Zharkykh, Mykhailo Nazarenko, Nataliia Mykhailivska, and all those who create and maintain open digital libraries.

Files

PluG2_extended_metadata.xml

Files (559.3 MB)

Checksum (md5) Size
md5:0922f67d8d99e0ae00428b363e546b12 69.5 MB
md5:8a008c4d933b664fdca9fe929a2aa276 15.6 MB
md5:442f59a78c83f774702bc8f853c0bdbe 232.2 MB
md5:1c34e8f49cb525f36a6de9134ec4002c 52.3 MB
md5:dc5c3b7c1dbdd549ff9a9948115fd790 11.7 MB
md5:4bcac86db317e113fc980aac2a95dad8 178.0 MB
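
The md5 checksums listed for the files above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_md5` and the local path `downloaded_file.bin` are hypothetical and are not part of the dataset. The listing does not show which checksum belongs to which file name, so the pairing in the usage example is an assumption for illustration only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks
    so that files of a few hundred MB do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a local file against a checksum written as 'md5:<hex>'
    (the form used in the file listing) or as a bare hex string."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Hypothetical local path; replace it with the file actually downloaded.
    local = Path("downloaded_file.bin")
    # One of the checksums from the listing; which file it belongs to
    # is an assumption here.
    listed = "md5:0922f67d8d99e0ae00428b363e546b12"
    if local.exists():
        print("checksum OK" if matches_listed_md5(local, listed) else "checksum MISMATCH")
```

Reading in chunks keeps memory use flat regardless of file size, which matters for the larger files in the listing.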

Additional details

Dates

Created
1816/1954
Collected
2016/2024