Published May 22, 2017 | Version v1
Dataset
Open
Toward a Comparable Corpus of Latvian, Russian and English Tweets
Creators
Description
Twitter has become a rich source of linguistic data. Here, the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia is investigated. Such a corpus, once constructed, might be of great use for multiple purposes, such as training machine translation models, examining cross-lingual phenomena and studying the population of Riga. By building and analysing a pilot corpus, this pilot study shows that it is feasible to build such a resource. The pilot corpus is made publicly available and can be used to construct a large comparable corpus.
Files
Name | Size |
---|---|
tweets.csv (md5:75d4fd3ed752cbc0d1eaad2fa84ae6a9) | 2.6 MB |
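The MD5 checksum listed for tweets.csv can be used to confirm that a downloaded copy is intact. The following is a minimal sketch, assuming the file has been saved locally; the function names `md5_of_file` and `matches_listed_checksum` are illustrative and not part of the dataset or its repository.

```python
import hashlib

# Checksum listed on this record for tweets.csv.
EXPECTED_MD5 = "75d4fd3ed752cbc0d1eaad2fa84ae6a9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the listed checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_listed_checksum("tweets.csv")` on a local copy (the path is an assumption about where the file was saved) should return `True` if the download matches the listed checksum.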
Additional details
Related works
- Is supplement to: https://github.com/dimazest/2017-lv-corpus (URL)