Published June 25, 2025 | Version v1
Dataset Open

Monthly Twitter n-grams generated from a corpus of more than 2 billion English tweets (2013-2023)

  • 1. Humboldt-Universität zu Berlin
  • 1. L3S Research Center
  • 2. Heinrich-Heine-University Düsseldorf
  • 3. GESIS - Leibniz-Institute for the Social Sciences
  • 4. Humboldt-Universität zu Berlin

Description

This dataset contains 1-, 2-, and 3-grams generated from a corpus of more than two billion English tweets. The tweets were drawn from the random 1% sample of the Twitter streaming API and harvested between January 2013 and June 2023.

The tweets are a subset of the tweets that were used to generate TweetsKB. Starting from 3,716,933,904 English tweets, 1,606,843,572 retweets and 2,718,439 duplicates (due to redundant harvesting) were removed, resulting in 2,107,371,893 tweets. All URLs and @mentions of user names were removed from their textual content and the text was tokenised using twokenize.
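The preprocessing described above can be illustrated with a minimal sketch. It is not the pipeline that produced the dataset: the two regular expressions are assumed patterns, the `(tweet_id, is_retweet, text)` input shape is hypothetical, and a plain whitespace split stands in for twokenize, which treats punctuation and emoticons differently.

```python
import re
from collections import Counter

# Assumed patterns: the exact rules used for the dataset are not given above.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean(text):
    """Strip URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def tokenize(text):
    """Whitespace split; a stand-in for twokenize, NOT equivalent to it."""
    return text.split()


def count_ngrams(tweets, max_n=3):
    """Count 1- to max_n-grams over (tweet_id, is_retweet, text) triples."""
    counts = Counter()
    seen_ids = set()
    for tweet_id, is_retweet, text in tweets:
        if is_retweet or tweet_id in seen_ids:
            continue  # drop retweets and repeated harvests of the same tweet
        seen_ids.add(tweet_id)
        tokens = tokenize(clean(text))
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # every occurrence counts, so one tweet can add more than 1
                counts[(n, tuple(tokens[i:i + n]))] += 1
    return counts


# Hypothetical input: one tweet harvested twice, plus one retweet.
sample = [
    (1, False, "the cat and the cat https://example.org @someone"),
    (1, False, "the cat and the cat https://example.org @someone"),
    (2, True, "RT the cat"),
]
print(count_ngrams(sample)[(1, ("the",))])  # 2
```

Keying the counter by `(n, tokens)` keeps 1-, 2-, and 3-grams in one table, mirroring the single-file layout of the published TSVs.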

In one way or another, the following people contributed to the creation of this dataset: Erdal Baran, Ernesto Diaz-Aviles, Stefan Dietze, Dimitar Dimitrov, Elso Dittfeld, Pavlos Fafalios, Vasileios Iosifidis, Robert Jäschke, Sebastian Schellhammer, Sebastian Tiesler, Yudong Zhang, Asmelash Teka Hadgu, Ran Yu, Xiaofei Zhu, Matthäus Zloch.

The dataset consists of two parts: one in which the case of letters is preserved and one in which all letters are normalised to lower case (file name prefix lc_). Each part consists of eleven TAR files (one for each year), and each TAR file contains up to 12 gzip-compressed TSV files (one for each month of the year). Overall, there are 125 TSV files containing n-grams and their monthly frequencies from 01/2013 to 06/2023. Each line in a TSV file represents one n-gram, lines are sorted by descending frequency, and each line has the following columns:

  • ngram_type: 1, 2, or 3 (for 1-grams, 2-grams, and 3-grams, respectively)
  • count: the frequency of the n-gram, that is, the number of times it appears in that month (not the number of tweets, since an n-gram can occur several times in one tweet)
  • ngram: the n-gram itself

As an example, the first ten rows (plus header) of the file 2018-01.tsv.gz are:

ngram_type      count   ngram
1       195891  .
1       145089  the
1       142295  ,
1       134689  to
1       111336  a
1       106765  I
1       98259   …
1       85359   and
1       77553   you
1       73279   of
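A monthly file with this layout can be read with the Python standard library alone. The sketch below assumes that the file has already been extracted from its yearly TAR archive, that it is UTF-8 encoded, and that the three columns are tab-separated with a single header line; `read_ngrams` and `top_ngrams` are hypothetical helper names, not part of the dataset.

```python
import gzip
import heapq


def read_ngrams(path, ngram_type=None):
    """Yield (ngram_type, count, ngram) from one gzip-compressed monthly TSV."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            # maxsplit=2 keeps the n-gram intact even if it contained a tab
            parts = line.rstrip("\n").split("\t", 2)
            if len(parts) != 3:
                continue  # skip malformed lines rather than guess at them
            n, count, ngram = int(parts[0]), int(parts[1]), parts[2]
            if ngram_type is None or n == ngram_type:
                yield n, count, ngram


def top_ngrams(path, ngram_type, k=10):
    """Return the k most frequent n-grams of one type, without relying on file order."""
    rows = read_ngrams(path, ngram_type)
    return heapq.nlargest(k, rows, key=lambda row: row[1])
```

For example, `top_ngrams("2018-01.tsv.gz", 2, k=5)` would list the five most frequent 2-grams of January 2018. Splitting lines by hand instead of using a CSV parser avoids any quote handling, which matters because tokens such as `"` can themselves be n-grams.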

Files (13.2 GB)

Checksum                              Size
md5:fe2efb3f47f985594e8da4dd9e0190cb  894.3 MB
md5:e1f0bb7555a2d6f54a620bc40f26dc3e  875.1 MB
md5:9c8875872c9ec8d280695187dab5a6e7  784.0 MB
md5:91ea6cf4812050932e87abaa44b32549  697.4 MB
md5:646e29ae375342fd5ab4317733bb1ce3  594.0 MB
md5:08a08e096b8c6c32dab67528f93554c9  491.2 MB
md5:cd1081e67272609b20a510011280c767  508.4 MB
md5:e6a571403fafaaf85500791c4a94013e  664.6 MB
md5:aed539b75f4128aef5d893ea62046022  645.5 MB
md5:beaf7ed13dd77b15260ca5c0a5b871d1  588.9 MB
md5:767a889833df6500529064ef757cb60f  241.3 MB
md5:d82835af23e8414fdba2e4ca38bf235f  782.6 MB
md5:00e6503e82e9ad5a775e45290fa86d0d  767.8 MB
md5:c164dc9f0d97ac77fc5fca4a7a0594d9  688.3 MB
md5:3dcc27d5688a0b0154fe0bf8fce19c73  614.5 MB
md5:e032218318813e2b1001e3c3d3ccb3cd  524.3 MB
md5:76741a237fcb13030d04b9bafd17f2c4  435.0 MB
md5:69ecd8764a322227f5f15cec8c409343  451.6 MB
md5:a1da1000727af3605786c0c15a95718b  589.0 MB
md5:0a2a3b77debac67e6f0a4939a7bb54ec  573.3 MB
md5:8f396641a116991bc3fc294f7d2eae36  525.0 MB
md5:e2ab824903bfe2ff12a8d21c5e170671  217.7 MB
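
The md5 values listed above can be used to check a download for corruption. The following is a minimal sketch using Python's `hashlib`; the helper names are hypothetical, and the archive path passed in is whatever name the file was saved under, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:fe2e...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which is relevant here because the individual archives are several hundred megabytes each.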

Additional details

Related works

Is variant form of
Conference paper: 10.1007/978-3-319-93417-4_12 (DOI)

Dates

Start Date (Other): 2013-01-31
End Date (Other): 2023-06-09