Published July 19, 2018 | Version v0.4
Conference paper Open

Language Use in a Multilingual Tweet Corpus

  • 1. NIST

Description

A trilingual Latvian-Russian-English corpus of tweets is presented, together with an analysis of its users, languages and topics. The corpus consists of 1.4 million tweets covering the period from April 2017 to July 2018. The language analysis shows that most users tweet predominantly in a single language. Across topics, the proportion of Latvian content is higher than in the collection as a whole. The corpus can be used, for example, to study the public engagement of major Latvian media outlets and public figures, or the factors that determine the language choice and content of a tweet.

Files

2018-lv2-bhlt.zip

Files (99.0 MB)

Checksum                               Size
md5:7371dac47a85ebea9eef99f5e88ccac3   30.1 MB
md5:8c3fb27ad84daee00ac5ab9d15de8d4d   37.4 MB
md5:d85a07f1f8ad0afb74f11c619e47c7d1   10.8 kB
md5:49cbe6fcc79bb725650f3f73e6f15749   1.3 MB
md5:907b1ead3854224d8d164426f1e986d5   27.0 MB
md5:04cdbc58c62c874655794c85e9042f2e   3.2 MB
md5:f108cad407c7940fd106c75936ba84a1   7.5 kB
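
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal Python sketch using only the standard library. The helper names `md5_of_file` and `matches_listing` are illustrative and not part of the record, and because the listing does not show file names, any path passed in is an assumption on the reader's side.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    # Drop an optional 'md5:' prefix, as used in the file listing.
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_file", "md5:7371dac47a85ebea9eef99f5e88ccac3")` would return `True` only if the local file (hypothetical name) has that digest. MD5 is adequate for detecting accidental corruption during transfer, but it is not a safeguard against deliberate tampering.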

Additional details