Published May 22, 2017 | Version v1
Dataset Open

Toward a Comparable Corpus of Latvian, Russian and English Tweets

Description

Twitter has become a rich source of linguistic data. This study investigates the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia. Such a corpus, once constructed, could serve multiple purposes, such as training machine translation models, examining cross-lingual phenomena, and studying the population of Riga. The feasibility of such a resource is demonstrated by building and analysing a pilot corpus, which is made publicly available and can be used to construct a large comparable corpus.

Files

tweets.csv (2.6 MB)
md5:75d4fd3ed752cbc0d1eaad2fa84ae6a9
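
A downloaded copy of tweets.csv can be checked against the MD5 checksum listed for this record. The following is a minimal sketch, not part of the dataset: the helper names `md5_of_file` and `matches_listing` are hypothetical, and the script assumes the file was saved as `tweets.csv` in the current working directory.

```python
import hashlib
import os

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "75d4fd3ed752cbc0d1eaad2fa84ae6a9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    local_copy = "tweets.csv"  # assumed download location
    if os.path.exists(local_copy):
        print("checksum OK" if matches_listing(local_copy) else "checksum MISMATCH")
    else:
        print("tweets.csv not found in the current directory")
```

A mismatch usually indicates an incomplete or altered download, in which case the file should be fetched again before use.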

Additional details

Related works