Toward a Comparable Corpus of Latvian, Russian and English Tweets

doi:10.5281/zenodo.582300

Published May 22, 2017 | Version v1

Dataset Open

Dmitrijs Milajevs

Twitter has become a rich source of linguistic data. This work investigates the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia. Such a corpus, once constructed, could serve several purposes, such as training machine translation models, examining cross-lingual phenomena, and studying the population of Riga. The pilot study demonstrates feasibility by building and analysing a pilot corpus, which is made publicly available and can be used to construct a large comparable corpus.

Files

Files (2.6 MB)

Name	Size	Checksum
tweets.csv	2.6 MB	md5:75d4fd3ed752cbc0d1eaad2fa84ae6a9
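
The published MD5 checksum can be used to confirm that a downloaded copy of tweets.csv is intact. The following is a minimal sketch, not part of the original record: it assumes the file has been saved as `tweets.csv` in the current directory, and the helper names `md5_of_file` and `verify_download` are hypothetical, chosen for illustration only.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in the file table of this record.
EXPECTED_MD5 = "75d4fd3ed752cbc0d1eaad2fa84ae6a9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was saved.
    local_copy = Path("tweets.csv")
    if not local_copy.exists():
        print("tweets.csv not found; download it from the record first")
    elif verify_download(local_copy):
        print("checksum OK")
    else:
        print("checksum mismatch; the download may be incomplete or altered")
```

A mismatch most often indicates a truncated download, so fetching the file again is a reasonable first step.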

Additional details

Is supplement to: https://github.com/dimazest/2017-lv-corpus (URL)

	All versions	This version
Views	244	244
Downloads	41	41
Data volume	108.6 MB	108.6 MB

Resource type: Dataset
Publisher: Zenodo

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 22, 2017
Modified: January 24, 2020
