Dataset Open Access
This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of the inferred paraphrases are correct, and that most of the remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
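The graph-based extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that sentences are nodes, that equivalence links are undirected edges, and that a paraphrase set consists of the same-language sentences within one connected component. The function name `paraphrase_sets`, the input layout, and the toy sentences are hypothetical, and the filtering and pruning steps are omitted.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (id_a, id_b) pairs meaning "these two mean the same thing"
    Hypothetical layout, chosen for illustration only.
    """
    # Union-find over sentence ids: each connected component is one "meaning".
    parent = {sid: sid for sid in sentences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Within each component, sentences sharing a language are paraphrase candidates.
    groups = defaultdict(list)
    for sid, (lang, text) in sentences.items():
        groups[(find(sid), lang)].append(text)

    # A paraphrase set needs at least two sentences.
    return sorted(
        (lang, sorted(texts))
        for (_, lang), texts in groups.items()
        if len(texts) > 1
    )


# Toy data, made up for this example (not taken from the corpus).
sentences = {
    1: ("en", "I'm hungry."),
    2: ("fr", "J'ai faim."),
    3: ("en", "I am hungry."),
    4: ("en", "It is raining."),
    5: ("fr", "Il pleut."),
}
links = [(1, 2), (2, 3), (4, 5)]
result = paraphrase_sets(sentences, links)
```

In this toy graph, the two English sentences about hunger are connected only through their shared French translation, so they end up in the same set, while the sentences about rain have no same-language partner and are dropped.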
| Name | MD5 checksum | Size |
|---|---|---|
| tapaco_lrec2020.pdf | dc4e7ba1109936403e7f914bd6f01795 | 240.9 kB |
| tapaco_v1.0.zip | c2673250380a9399129e759507580020 | 32.2 MB |
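The MD5 checksums listed with the files can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected):
    """Compare a file's digest with a listed value such as 'md5:abc...'."""
    # Accept the value with or without the 'md5:' prefix.
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_listing("tapaco_v1.0.zip", "md5:c2673250380a9399129e759507580020")` should return `True` for an intact copy of the archive.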
| | All versions | This version |
|---|---|---|
| Views | 1,531 | 1,531 |
| Downloads | 5,825 | 5,825 |
| Data volume | 146.4 GB | 146.4 GB |
| Unique views | 1,379 | 1,379 |
| Unique downloads | 3,340 | 3,340 |