Published May 25, 2026 | Version v3

ParaFarm: English-Ukrainian Multiple-Translation Corpus

  • 1. National Technical University "Kharkiv Polytechnic Institute"
  • 2. Friedrich-Schiller-Universität Jena
  • 3. Grammarly

Description

Annotation

ParaFarm: English-Ukrainian Multiple-Translation Corpus is a parallel corpus designed to facilitate the study of translation variation and linguistic diversity in Ukrainian. The corpus comprises 1,367 English segments extracted from George Orwell's Animal Farm, aligned with their corresponding translations from seven published Ukrainian editions of the novel and three AI-generated translations. This resource enables researchers to explore multiple translation choices for identical source material, offering valuable insights into Ukrainian language variability, translator decision-making, and the properties of neural machine translation systems. The corpus is distributed in TMX format.
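Because the corpus is distributed as TMX, it can be read with a standard XML parser. The following is a minimal sketch, assuming the files follow the usual TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The language codes `en` and `uk`, the function name `read_tmx`, and the sample text are illustrative assumptions, not taken from the corpus; how the individual Ukrainian translations are labelled in the actual files should be checked against the data.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# xml:lang belongs to the reserved XML namespace, so ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Parse a TMX file (path or binary file object).

    Returns a list of translation units; each unit is a dict that maps a
    language code to the list of segment texts found for that language.
    A list is used because several variants may share one language code.
    """
    root = ET.parse(source).getroot()
    units = []
    for tu in root.iter("tu"):
        variants = defaultdict(list)
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or "und"
            seg = tuv.find("seg")
            if seg is not None:
                variants[lang].append("".join(seg.itertext()).strip())
        units.append(dict(variants))
    return units


# Invented sample for illustration only; these sentences are NOT from the corpus.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello, world.</seg></tuv>
      <tuv xml:lang="uk"><seg>Привіт, світе.</seg></tuv>
      <tuv xml:lang="uk"><seg>Вітаю, світе.</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = read_tmx(io.BytesIO(SAMPLE.encode("utf-8")))
for unit in units:
    for lang, segments in unit.items():
        print(lang, segments)
```

To read a downloaded file, pass its path to `read_tmx` instead of the in-memory sample.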

Version 1.2 — What's New

Three AI-generated Ukrainian translations have been added to the corpus, contributed by Ivan Kulynych and used in a construct validity study of neural MT metrics (Chaplynskyi et al., forthcoming):

GPT-5.2 — a general-purpose large language model (OpenAI)
DeepL — a commercial neural machine translation system
Lapa (v0.1.2-instruct) — a 12B-parameter LLM fine-tuned for Ukrainian (Paniv et al., 2025)

The three systems represent architecturally distinct approaches to machine translation, enabling direct comparison between human and AI translation strategies across the same source material.

Applications

Translation Studies: comparative analysis of human and AI translation strategies and decision-making processes
Ukrainian Language Variation: investigation of lexical, morphological, and grammatical diversity in Ukrainian, including features such as discourse particles and diminutive morphology
Corpus Linguistics: quantitative analysis of translation patterns and linguistic phenomena
Machine Translation Evaluation: reference corpus for assessing MT system output quality and construct validity of neural MT metrics
Stylometry: analysis of translatorial voice and stylistic distinctiveness across human and AI systems
Paraphrase Generation: training data for neural paraphrase generation models
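As one illustration of the quantitative analyses listed above, lexical variation between translations of the same source segment can be approximated by token-level Jaccard overlap. This is a minimal sketch and not the method used in the related publications; the labels `t1`, `t2`, `t3` and the example sentences are hypothetical and not drawn from the corpus.

```python
import re
from itertools import combinations


def tokens(text):
    # Lowercased word tokens. \w matches Cyrillic letters in Python 3 string patterns.
    # Note: words containing an apostrophe are split at the apostrophe by this simple pattern.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pairwise_overlap(variants):
    """variants: dict mapping a translation label to its rendering of one source segment.

    Returns a dict mapping each unordered pair of labels to their Jaccard similarity.
    """
    return {
        (x, y): jaccard(variants[x], variants[y])
        for x, y in combinations(sorted(variants), 2)
    }


# Hypothetical renderings of a single segment (not corpus data).
example = {
    "t1": "кіт спить",
    "t2": "кіт їсть",
    "t3": "кіт спить",
}
for pair, score in pairwise_overlap(example).items():
    print(pair, round(score, 3))
```

Lower scores indicate that two translations share fewer word forms for the same source segment. Because Ukrainian is highly inflected, a lemmatized comparison would likely be more informative than this surface-form measure.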

Ethical Considerations

This corpus was created exclusively for academic research purposes under the principles of fair use in scholarly analysis. The source material and translations are used in a transformative manner for linguistic research, with proper attribution to the original translators. The AI translations were generated using publicly available systems for research purposes only and are included to enable replication and further research building on Chaplynskyi et al. (forthcoming).

Related publications

Kalashnyk, V. (2025). Створення поліваріантного паралельного корпусу українських перекладів повісті Джорджа Орвелла Animal Farm та його використання для дослідження варіантності української мови [Creation of a multi-variant parallel corpus of Ukrainian translations of George Orwell's novella Animal Farm and its use for studying variation in the Ukrainian language]. In S. I. Kuranova (chair of the organizing committee) et al. (Eds.), Мовний простір сучасного світу [The Language Space of the Modern World]: Abstracts of the 9th All-Ukrainian Scientific Conference of Students, Postgraduates and Young Scholars (Kyiv, May 30, 2025) (pp. 113–119). NaUKMA. https://ekmair.ukma.edu.ua/items/e5e87391-55e6-49f9-b28c-52549a07864a

Chaplynskyi, D., Kulynych, I., Shvedova, M., & Ivashkevych, L. (forthcoming). Semantic Fidelity Versus Literary Quality: A Construct Validity Study of Neural Machine Translation Metrics. In print.

Files (7.9 MB)

Checksum Size
md5:69cca47808809e52202610b00748fb96 124.0 kB
md5:b4ca27ef692ee64d3781a3ac94eb33b3 3.2 MB
md5:64ef5d93ff37e9d715d7d40d420b05b6 4.6 MB