Language Use in a Multilingual Tweet Corpus

doi:10.5281/zenodo.1317574

Published July 19, 2018 | Version v0.4

Conference paper | Open access

Dmitrijs Milajevs¹

1. NIST

This paper presents a trilingual Latvian-Russian-English corpus of tweets, together with an analysis of its users, languages and topics. The corpus consists of 1.4 million tweets covering the period from April 2017 to July 2018. The language analysis reveals that most users tweet predominantly in one language. Across topics, the share of Latvian content is higher than in the collection as a whole. Among many potential use cases, the corpus can be used to study the public engagement of major Latvian media outlets and public figures, or the factors that determine the language choice and content of a tweet.

Files (99.0 MB)

Name	MD5	Size
2018-lv2-bhlt.zip	7371dac47a85ebea9eef99f5e88ccac3	30.1 MB
collected_tweets.csv	8c3fb27ad84daee00ac5ab9d15de8d4d	37.4 MB
lv.cfg	d85a07f1f8ad0afb74f11c619e47c7d1	10.8 kB
paper.pdf	49cbe6fcc79bb725650f3f73e6f15749	1.3 MB
rehydrated_tweets.csv	907b1ead3854224d8d164426f1e986d5	27.0 MB
relevance_judgments.csv	04cdbc58c62c874655794c85e9042f2e	3.2 MB
topics.json.txt	f108cad407c7940fd106c75936ba84a1	7.5 kB
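
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The following is a minimal sketch, assuming the files have been saved under their listed names in a single local directory; the helper names `md5_of` and `verify` are hypothetical and not part of the record or its accompanying repository.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "2018-lv2-bhlt.zip": "7371dac47a85ebea9eef99f5e88ccac3",
    "collected_tweets.csv": "8c3fb27ad84daee00ac5ab9d15de8d4d",
    "lv.cfg": "d85a07f1f8ad0afb74f11c619e47c7d1",
    "paper.pdf": "49cbe6fcc79bb725650f3f73e6f15749",
    "rehydrated_tweets.csv": "907b1ead3854224d8d164426f1e986d5",
    "relevance_judgments.csv": "04cdbc58c62c874655794c85e9042f2e",
    "topics.json.txt": "f108cad407c7940fd106c75936ba84a1",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "ok", False: "MISMATCH", None: "missing"}
    for name, status in verify().items():
        print(f"{name}: {labels[status]}")
```

Running the script in the download directory prints one status line per file. A mismatch most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.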

Additional details

Cites: https://zenodo.org/record/582300 (URL)
Is supplement to: https://github.com/dimazest/2018-lv2/tree/bhlt (URL)

	All versions	This version
Views	310	138
Downloads	186	128
Data volume	3.2 GB	1.7 GB
