# Ukrainian Epigraphic Corpus

## Description

This corpus contains epigraphic texts collected to support the development of a Simple Knowledge Organization System (SKOS) vocabulary for **Ukrainian epigraphy**. The corpus is designed to facilitate term extraction, analysis of synonymous sets, hierarchical organization of terms, and linguistic studies focused on Ukrainian inscriptions.

## Corpus Structure

- **epigraphic_corpus.txt**: The main file containing annotated texts from various sources, including academic publications and web-based resources.

## Methodology

### Corpus Collection

Texts were collected from both academic and digital sources, ensuring comprehensive representation of Ukrainian epigraphic terminology.

1. **Language Criteria**:
   - Only texts in **Ukrainian** were included.
   - Russian-language publications were excluded to ensure linguistic and cultural autonomy.
2. **Authorship Diversity**:
   - The corpus includes contributions from scholars in archaeology, history, linguistics, and cultural studies.
3. **Genre Balancing**:
   - Books, monographs, academic articles, conference proceedings, and web sources were included to provide a balanced perspective.
4. **Regional and Temporal Coverage**:
   - The corpus covers a broad geographical range within **Ukraine**, including regions such as **Kyiv**, **Halychyna**, and **Chernihiv**.
   - Publications from the second half of the **20th century** to **2024** were included.
5. **Format and Medium**:
   - Both digital and non-digital publications were included to reflect the transition of Ukrainian epigraphy into the digital age.

### Data Processing

The collected texts were uploaded into **Sketch Engine** for further processing:

1. **Tokenization**: Breaking down texts into individual words or terms.
2. **Lemmatization**: Reducing words to their base forms for consistency in term analysis.
3. **Part-of-Speech Tagging**: Annotating tokens with grammatical information to facilitate accurate term identification.

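
To make the three processing steps more concrete, here is a minimal Python sketch of what tokenization, lemmatization, and part-of-speech tagging produce for a short sentence. It is an illustration only and not the Sketch Engine pipeline: the mini-lexicon `TOY_LEXICON`, the tag labels, the sample sentence, and the function names are all assumptions made for this example. The word/token distinction in the sketch (words are tokens that begin with a letter) follows how Sketch Engine generally describes its counts, which may help explain why token and word totals for a corpus differ.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: surface form -> (lemma, POS tag).
# Illustration only; the corpus itself was lemmatized and tagged in Sketch Engine.
TOY_LEXICON = {
    "написи": ("напис", "NOUN"),
    "напису": ("напис", "NOUN"),
    "графіті": ("графіті", "NOUN"),
    "київські": ("київський", "ADJ"),
}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(text):
    """Return (token, lemma, tag) triples for every token in the text.

    Punctuation is tagged 'PUNCT'; words missing from the toy lexicon
    keep their lowercase form as lemma and get the placeholder tag 'X'.
    """
    triples = []
    for token in tokenize(text):
        if not token[0].isalnum():
            triples.append((token, token, "PUNCT"))
        else:
            lemma, tag = TOY_LEXICON.get(token.lower(), (token.lower(), "X"))
            triples.append((token, lemma, tag))
    return triples


def count_words(triples):
    """Count tokens that begin with a letter (i.e. words, not punctuation or digits)."""
    return sum(1 for token, _, _ in triples if token[0].isalpha())


def lemma_frequencies(triples):
    """Count lemmas, ignoring punctuation tokens."""
    return Counter(lemma for _, lemma, tag in triples if tag != "PUNCT")


sample = "Київські написи і графіті. Датування напису."
triples = annotate(sample)
for token, lemma, tag in triples:
    print(f"{token}\t{lemma}\t{tag}")
print("tokens:", len(triples), "words:", count_words(triples))
```

Grouping by lemma rather than by surface form is what lets inflected variants such as "написи" and "напису" count toward a single candidate term during term extraction.
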
## Corpus Statistics

- **Total Tokens**: 1,293,226
- **Total Words**: 778,104
- **Total Documents**: 292

### Sub-Corpora

1. **Academic Publications**:
   - **Tokens**: 1,080,109 (83.52%)
   - **Documents**: 18
2. **Web Epigraphy**:
   - **Tokens**: 214,347 (16.57%)
   - **Documents**: 274

## Usage License

This corpus is shared under the **CC BY 4.0** license. You are free to use, share, and adapt the material provided proper attribution is given to the creators.

## Citation

If you use this corpus, please cite it as follows:

> Ukrainian Epigraphic Corpus (2024). Available at Zenodo: [Insert DOI Here]
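
For readers who want to check the figures under Corpus Statistics, the following Python sketch recomputes each sub-corpus share from the token counts listed there. The variable and function names are assumptions made for this example; the numbers are copied from the statistics in this README.

```python
# Figures copied from the Corpus Statistics section of this README.
TOTAL_TOKENS = 1_293_226
TOTAL_DOCUMENTS = 292

# Sub-corpus name -> (token count, document count)
SUBCORPORA = {
    "Academic Publications": (1_080_109, 18),
    "Web Epigraphy": (214_347, 274),
}


def share(tokens, total=TOTAL_TOKENS):
    """Return a token count as a percentage of the total, rounded to two decimals."""
    return round(100 * tokens / total, 2)


shares = {name: share(tokens) for name, (tokens, _) in SUBCORPORA.items()}
token_sum = sum(tokens for tokens, _ in SUBCORPORA.values())
document_sum = sum(docs for _, docs in SUBCORPORA.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print("sum of sub-corpus tokens:", token_sum)
print("difference from stated total:", token_sum - TOTAL_TOKENS)
print("sum of sub-corpus documents:", document_sum)
```

Both percentages are computed against the stated total of 1,293,226 tokens, and the document counts add up to 292. The two sub-corpus token counts add up to 1,294,456, which is 1,230 more than the stated total, so the shares sum to slightly more than 100%; this README does not explain the difference.
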