Conference paper Open Access

Language Use in a Multilingual Tweet Corpus

Dmitrijs Milajevs

A trilingual Latvian-Russian-English corpus of tweets is presented, together with an analysis of its users, languages and topics. The corpus consists of 1.4 million tweets covering the period from April 2017 to July 2018. The language analysis reveals that most users tweet predominantly in one language. Across topics, the share of Latvian content is higher than in the collection as a whole. Among many potential use cases, the corpus can be used, for example, to study the public engagement of major Latvian media outlets and public figures, or the factors that determine the language choice and content of a tweet.

Files (99.0 MB)

Name                      Size      Checksum
2018-lv2-bhlt.zip         30.1 MB   md5:7371dac47a85ebea9eef99f5e88ccac3
collected_tweets.csv      37.4 MB   md5:8c3fb27ad84daee00ac5ab9d15de8d4d
lv.cfg                    10.8 kB   md5:d85a07f1f8ad0afb74f11c619e47c7d1
paper.pdf                 1.3 MB    md5:49cbe6fcc79bb725650f3f73e6f15749
rehydrated_tweets.csv     27.0 MB   md5:907b1ead3854224d8d164426f1e986d5
relevance_judgments.csv   3.2 MB    md5:04cdbc58c62c874655794c85e9042f2e
topics.json.txt           7.5 kB    md5:f108cad407c7940fd106c75936ba84a1
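
Since every file in the listing is published with an MD5 checksum, a download can be checked for corruption before use. The following is a minimal sketch, not part of the record itself: it assumes the files were saved under their listed names in a single local directory, and the helper names `md5_of_file` and `verify_downloads` are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "2018-lv2-bhlt.zip": "7371dac47a85ebea9eef99f5e88ccac3",
    "collected_tweets.csv": "8c3fb27ad84daee00ac5ab9d15de8d4d",
    "lv.cfg": "d85a07f1f8ad0afb74f11c619e47c7d1",
    "paper.pdf": "49cbe6fcc79bb725650f3f73e6f15749",
    "rehydrated_tweets.csv": "907b1ead3854224d8d164426f1e986d5",
    "relevance_judgments.csv": "04cdbc58c62c874655794c85e9042f2e",
    "topics.json.txt": "f108cad407c7940fd106c75936ba84a1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use flat for the larger CSV files. MD5 is used here only because it is the checksum the listing provides; it detects accidental corruption, not deliberate tampering.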
