TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Scherrer, Yves

doi:10.5281/zenodo.3707949

Published March 12, 2020 | Version 1.0

Dataset Open

Scherrer, Yves¹

1. University of Helsinki

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of the inferred paraphrases are correct, and that most of the remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
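The graph construction and traversal described above can be illustrated with a minimal sketch. It is not the author's implementation: it assumes that paraphrase sets correspond to connected components of the link graph, grouped by language, and it omits the filters and pruning steps mentioned in the description. The function name `paraphrase_sets`, the input layout, and the toy sentences are all hypothetical.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)
    links:     iterable of (id_a, id_b) pairs meaning "same meaning"
    Assumption: equivalence is treated as transitive (connected components).
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every linked pair of known sentences.
    for a, b in links:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    # Collect the texts of each component, split by language.
    groups = defaultdict(lambda: defaultdict(list))
    for sid, (lang, text) in sentences.items():
        groups[find(sid)][lang].append(text)

    # A paraphrase set needs at least two distinct sentences in one language.
    result = []
    for by_lang in groups.values():
        for lang, texts in by_lang.items():
            unique = sorted(set(texts))
            if len(unique) >= 2:
                result.append((lang, unique))
    return sorted(result)


# Toy data, invented for illustration only (not taken from the corpus).
toy_sentences = {
    1: ("en", "I'm hungry."),
    2: ("de", "Ich habe Hunger."),
    3: ("en", "I am hungry."),
    4: ("de", "Ich bin hungrig."),
    5: ("en", "It is raining."),
}
toy_links = [(1, 2), (2, 3), (3, 4)]

for lang, texts in paraphrase_sets(toy_sentences, toy_links):
    print(lang, texts)
```

In this sketch, sentence 5 has no links and therefore ends up in no paraphrase set, while the four linked sentences yield one German and one English set.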

Files (32.5 MB)

Name	MD5	Size
tapaco_lrec2020.pdf	dc4e7ba1109936403e7f914bd6f01795	240.9 kB
tapaco_v1.0.zip	c2673250380a9399129e759507580020	32.2 MB
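The MD5 checksums listed above can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and it assumes the files have been saved locally under their original names.

```python
import hashlib

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "tapaco_lrec2020.pdf": "dc4e7ba1109936403e7f914bd6f01795",
    "tapaco_v1.0.zip": "c2673250380a9399129e759507580020",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify("tapaco_v1.0.zip", EXPECTED_MD5["tapaco_v1.0.zip"])` should return `True` for an uncorrupted copy of the archive.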

Additional details

Funding: European Commission
FoTran - Found in Translation – Natural Language Understanding with Cross-Lingual Grounding (grant 771113)
