ParaFarm: English-Ukrainian Multiple-Translation Corpus
Description
Annotation
ParaFarm: English-Ukrainian Multiple-Translation Corpus is a parallel corpus designed to facilitate the study of translation variation and linguistic diversity in Ukrainian. The corpus comprises 1,367 English segments extracted from George Orwell's Animal Farm, aligned with their corresponding translations from seven published Ukrainian editions of the novel and three AI-generated translations. This resource enables researchers to explore multiple translation choices for identical source material, offering valuable insights into Ukrainian language variability, translator decision-making, and the properties of neural machine translation systems. The corpus is distributed in TMX format.
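Since the corpus is distributed in TMX, a reader could load it with a standard XML parser. The following is a minimal sketch, not the authors' tooling. It assumes the usual TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child) and assumes that several Ukrainian variants may share one translation unit; the actual layout of the ParaFarm files should be checked before relying on this. The inline sample is made up for illustration and is not corpus data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# xml:lang belongs to the reserved XML namespace, so ElementTree exposes it under this key
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up TMX fragment standing in for a corpus file (assumption: NOT real ParaFarm data)
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Source sentence.</seg></tuv>
      <tuv xml:lang="uk"><seg>Варіант перекладу один.</seg></tuv>
      <tuv xml:lang="uk"><seg>Варіант перекладу два.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(root):
    """Return one dict per translation unit, mapping language code to a list of segments."""
    units = []
    for tu in root.iter("tu"):
        by_lang = defaultdict(list)
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup elements
                text = "".join(seg.itertext()).strip()
                by_lang[tuv.get(XML_LANG, "")].append(text)
        units.append(dict(by_lang))
    return units


units = read_units(ET.fromstring(SAMPLE_TMX))
print(units[0]["en"], len(units[0]["uk"]))
```

For a downloaded file, `ET.parse(path).getroot()` can be passed to `read_units` instead, where `path` is whatever local name the file was saved under.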
Version 1.2 — What's New
Three AI-generated Ukrainian translations have been added to the corpus, contributed by Ivan Kulynych and used in a construct validity study of neural MT metrics (Chaplynskyi et al., forthcoming):
GPT-5.2 — a general-purpose large language model (OpenAI)
DeepL — a commercial neural machine translation system
Lapa (v0.1.2-instruct) — a 12B-parameter LLM fine-tuned for Ukrainian (Paniv et al., 2025)
The three systems represent architecturally distinct approaches to machine translation, enabling direct comparison between human and AI translation strategies across the same source material.
Applications
Translation Studies: comparative analysis of human and AI translation strategies and decision-making processes
Ukrainian Language Variation: investigation of lexical, morphological, and grammatical diversity in Ukrainian, including features such as discourse particles and diminutive morphology
Corpus Linguistics: quantitative analysis of translation patterns and linguistic phenomena
Machine Translation Evaluation: reference corpus for assessing MT system output quality and construct validity of neural MT metrics
Stylometry: analysis of translatorial voice and stylistic distinctiveness across human and AI systems
Paraphrase Generation: training data for neural paraphrase generation models
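As an illustration of the quantitative analyses listed above, the sketch below scores lexical overlap between every pair of translations of one segment using Jaccard similarity over word types. This is one simple measure chosen for the example, not a method prescribed by the corpus authors, and the function names are hypothetical. The placeholder strings are not corpus data.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of word types; \\w+ is Unicode-aware for str patterns in Python 3."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Share of word types common to both texts (1.0 = identical vocabulary)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty segments are treated as identical
    return len(ta & tb) / len(ta | tb)


def pairwise_overlap(variants):
    """Map each index pair (i, j) with i < j to the Jaccard score of those two variants."""
    return {
        (i, j): jaccard(variants[i], variants[j])
        for i, j in combinations(range(len(variants)), 2)
    }


# Placeholder variants of a single segment (assumption: illustrative strings only)
variants = ["the cat sat", "the cat slept", "a dog sat"]
scores = pairwise_overlap(variants)
print(scores)
```

Low scores across many segments would point to translators choosing different wording for the same source. Note that this tokenizer splits Ukrainian words at apostrophes, so a morphology-aware tokenizer or lemmatizer would be a better fit for serious work.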
Ethical Considerations
This corpus was created exclusively for academic research purposes under the principles of fair use in scholarly analysis. The source material and translations are used in a transformative manner for linguistic research, with proper attribution to the original translators. The AI translations were generated using publicly available systems for research purposes only and are included to enable replication and further research building on Chaplynskyi et al. (forthcoming).
Related publications
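Kalashnyk, V. (2025). Створення поліваріантного паралельного корпусу українських перекладів повісті Джорджа Орвелла Animal Farm та його використання для дослідження варіантності української мови [Creating a multi-variant parallel corpus of Ukrainian translations of George Orwell's novella Animal Farm and using it to study variation in the Ukrainian language]. In S. I. Kuranova (organizing committee chair) et al. (Eds.), Мовний простір сучасного світу: Тези доповідей ІХ Всеукраїнської наукової конференції студентів, аспірантів і молодих учених (Київ, 30 травня 2025 р.) [The Language Space of the Modern World: Abstracts of the 9th All-Ukrainian Scientific Conference of Students, Postgraduates and Young Scholars (Kyiv, 30 May 2025)] (pp. 113–119). НаУКМА. https://ekmair.ukma.edu.ua/items/e5e87391-55e6-49f9-b28c-52549a07864a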
Chaplynskyi, D., Kulynych, I., Shvedova, M., & Ivashkevych, L. (forthcoming). Semantic Fidelity Versus Literary Quality: A Construct Validity Study of Neural Machine Translation Metrics. In print.
Files
Total size: 7.9 MB
| MD5 checksum | Size |
|---|---|
| 69cca47808809e52202610b00748fb96 | 124.0 kB |
| b4ca27ef692ee64d3781a3ac94eb33b3 | 3.2 MB |
| 64ef5d93ff37e9d715d7d40d420b05b6 | 4.6 MB |
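The MD5 checksums listed for the files can be used to confirm that a download is intact. The sketch below is one way to do that with Python's standard library. The helper name `md5_of_file` is hypothetical, and the demo runs on a throwaway temporary file rather than a corpus file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file with known content (assumption: stands in for a downloaded file)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_md5 = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_md5)
```

To check a real download, compare `md5_of_file(local_path)` with the value listed for that file, for example `69cca47808809e52202610b00748fb96` for the 124.0 kB file. A mismatch suggests an incomplete or corrupted download.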