Dataset Open Access

TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Scherrer, Yves

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
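The graph construction and traversal described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: it assumes that equivalence links are undirected, that a paraphrase set is the group of same-language sentences inside one connected component, and it omits the filtering and pruning steps. The function name `paraphrase_sets`, the input layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)  (assumed layout)
    links: iterable of (id_a, id_b) pairs meaning "same meaning"    (assumed layout)
    """
    # Union-find over sentence ids: each connected component is one meaning cluster.
    parent = {sid: sid for sid in sentences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        if a in parent and b in parent:  # ignore links to unknown sentences
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Within each component, sentences sharing a language are candidate paraphrases.
    groups = defaultdict(list)
    for sid, (lang, text) in sentences.items():
        groups[(find(sid), lang)].append(text)

    # A set needs at least two sentences to contain a paraphrase pair.
    return [(lang, sorted(texts)) for (_, lang), texts in groups.items() if len(texts) > 1]


# Toy data, invented for illustration only.
toy_sentences = {
    1: ("en", "I am hungry."),
    2: ("en", "I'm hungry."),
    3: ("fr", "J'ai faim."),
    4: ("en", "It is raining."),
    5: ("de", "Es regnet."),
}
toy_links = [(1, 3), (2, 3), (4, 5)]

print(paraphrase_sets(toy_sentences, toy_links))
# [('en', ['I am hungry.', "I'm hungry."])]
```

In this toy example the two English sentences are never linked directly; they end up in the same set only because both are linked to the same French sentence, which is the kind of inferred paraphrase the manual evaluation in the abstract refers to.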

Files (32.5 MB)
Size
240.9 kB
32.2 MB
                  All versions   This version
Views             1,303          1,303
Downloads         4,540          4,540
Data volume       110.8 GB       110.8 GB
Unique views      1,167          1,167
Unique downloads  2,723          2,723

