Published March 12, 2020 | Version 1.0
Dataset Open

TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

  • 1. University of Helsinki


This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
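The graph construction and traversal described above can be illustrated with a minimal sketch. It assumes that equivalence links are treated as transitive, so that paraphrase sets correspond to same-language sentences within one connected component. The description above does not specify the actual traversal, filters, or pruning steps, so this is only one plausible reading. The function name `paraphrase_sets`, the sentence IDs, and the toy sentences are hypothetical and are not taken from Tatoeba or from the corpus files.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (id_a, id_b) pairs marking "means the same thing"

    Assumption (not confirmed by the abstract): links are transitive, so
    every connected component is treated as one meaning cluster.
    """
    # Union-find over sentence ids to compute connected components.
    parent = {sid: sid for sid in sentences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Within each component, collect the distinct texts per language.
    groups = defaultdict(lambda: defaultdict(set))
    for sid, (lang, text) in sentences.items():
        groups[find(sid)][lang].add(text)

    # Keep only languages with at least two distinct sentences in a component.
    result = []
    for by_lang in groups.values():
        for lang, texts in by_lang.items():
            if len(texts) > 1:
                result.append((lang, sorted(texts)))
    return sorted(result)


# Hypothetical toy data: two English sentences share a French translation.
sentences = {
    1: ("en", "I am tired."),
    2: ("en", "I'm tired."),
    3: ("fr", "Je suis fatigué."),
    4: ("en", "It is raining."),
    5: ("fr", "Il pleut."),
}
links = [(1, 3), (2, 3), (4, 5)]

for lang, texts in paraphrase_sets(sentences, links):
    print(lang, texts)
```

In this toy graph only the two English sentences linked through the same French translation form a paraphrase set. The pair they form is of the "correct but trivial" kind mentioned above, which is the sort of case that additional filtering steps would need to handle.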



Files (32.5 MB)

Name Size
240.9 kB
32.2 MB

Additional details


European Commission
FoTran – Found in Translation – Natural Language Understanding with Cross-Lingual Grounding (grant no. 771113)