
# Ukrainian Epigraphic Corpus

## Description
This corpus contains epigraphic texts collected to support the development of a Simple Knowledge Organization System (SKOS) vocabulary for **Ukrainian epigraphy**. It is designed to facilitate term extraction, the analysis of synonym sets, the hierarchical organization of terms, and linguistic studies of Ukrainian inscriptions.

## Corpus Structure
- **epigraphic_corpus.txt**: The main file containing annotated texts from various sources, including academic publications and web-based resources.

## Methodology
### Corpus Collection
Texts were collected from both academic publications and web sources, with the aim of representing Ukrainian epigraphic terminology as broadly as possible. The following selection criteria were applied:

1. **Language Criteria**:
   - Only texts in **Ukrainian** were included.
   - Russian-language publications were excluded to ensure linguistic and cultural autonomy.

2. **Authorship Diversity**:
   - The corpus includes contributions from scholars in archaeology, history, linguistics, and cultural studies.

3. **Genre Balancing**:
   - Books, monographs, academic articles, conference proceedings, and web sources were included to provide a balanced perspective.

4. **Regional and Temporal Coverage**:
   - The corpus covers a broad geographical range within **Ukraine**, including regions such as **Kyiv**, **Halychyna**, and **Chernihiv**.
   - Publications from the second half of the **20th century** to **2024** were included.

5. **Format and Medium**:
   - Both digital and non-digital publications were included to reflect the transition of Ukrainian epigraphy into the digital age.

### Data Processing
The collected texts were uploaded into **Sketch Engine** and processed in three steps:

1. **Tokenization**: Breaking down texts into individual words or terms.
2. **Lemmatization**: Reducing words to their base forms for consistency in term analysis.
3. **Part-of-Speech Tagging**: Annotating tokens with grammatical information to facilitate accurate term identification.
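
The three annotation layers above can be consumed programmatically. The following is a minimal sketch, not the corpus's documented format: it assumes the annotated file uses a tab-separated "vertical" layout (one token per line with the columns word, tag, lemma, plus structural lines such as `<doc>` and `<s>`), which is a common export layout for Sketch Engine corpora. The column order, the tag labels in the sample, and the helper names `parse_vertical` and `lemma_frequencies` are all hypothetical and should be checked against the actual `epigraphic_corpus.txt`.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str   # surface form produced by tokenization
    tag: str    # part-of-speech tag (tagset is an assumption here)
    lemma: str  # base form produced by lemmatization


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines, skipping structural markup."""
    for raw in lines:
        line = raw.rstrip("\n")
        # Structural lines such as <doc ...>, <s>, </s> carry no token.
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a fully annotated token line
        yield Token(fields[0], fields[1], fields[2])


def lemma_frequencies(tokens: Iterable[Token], tag_prefix: str) -> Counter:
    """Count lemmas of tokens whose tag starts with tag_prefix."""
    return Counter(t.lemma for t in tokens if t.tag.startswith(tag_prefix))


# Invented sample lines for illustration only; not taken from the corpus.
sample = [
    '<doc id="example">\n',
    "<s>\n",
    "написи\tNOUN\tнапис\n",
    "на\tADP\tна\n",
    "стінах\tNOUN\tстіна\n",
    "</s>\n",
    "<s>\n",
    "напис\tNOUN\tнапис\n",
    "</s>\n",
    "</doc>\n",
]

tokens = list(parse_vertical(sample))
noun_lemmas = lemma_frequencies(tokens, tag_prefix="NOUN")
print(noun_lemmas.most_common())
```

Counting lemmas rather than surface forms groups inflected variants under one candidate term, which is the point of the lemmatization step for term extraction.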

## Corpus Statistics
- **Total Tokens**: 1,293,226
- **Total Words**: 778,104
- **Total Documents**: 292

### Sub-Corpora
Token percentages are given relative to the total token count of the corpus.
1. **Academic Publications**:
   - **Tokens**: 1,080,109 (83.52%)
   - **Documents**: 18
2. **Web Epigraphy**:
   - **Tokens**: 214,347 (16.57%)
   - **Documents**: 274
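
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the variable names are illustrative. It recomputes each sub-corpus share against the total token count and compares the sub-corpus sums with the reported totals. Under this check the document counts add up to 292, while the two token counts add up to 1,294,456, which is 1,230 more than the reported total of 1,293,226. This README does not state the reason for that difference.

```python
# Figures copied from the Corpus Statistics section of this README.
total_tokens = 1_293_226
total_documents = 292
sub_corpora = {
    "Academic Publications": {"tokens": 1_080_109, "documents": 18},
    "Web Epigraphy": {"tokens": 214_347, "documents": 274},
}

# Share of each sub-corpus relative to the reported total token count.
shares = {
    name: round(100 * stats["tokens"] / total_tokens, 2)
    for name, stats in sub_corpora.items()
}

# Sums across sub-corpora, to compare against the reported totals.
token_sum = sum(stats["tokens"] for stats in sub_corpora.values())
document_sum = sum(stats["documents"] for stats in sub_corpora.values())
token_gap = token_sum - total_tokens

for name, share in shares.items():
    print(f"{name}: {share}% of total tokens")
print(f"Sub-corpus tokens: {token_sum} (reported total: {total_tokens}, gap: {token_gap})")
print(f"Sub-corpus documents: {document_sum} (reported total: {total_documents})")
```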

## Usage License
This corpus is shared under the **CC BY 4.0** license. You are free to use, share, and adapt the material, provided that proper attribution is given to the creators.

## Citation
If you use this corpus, please cite it as follows:
> Ukrainian Epigraphic Corpus (2024). Available at Zenodo: [Insert DOI Here]
