This tool parses .json files containing Telegram chats exported with Telegram Desktop (v. 2.9.2.0). It takes a .json chat file as input and outputs:
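Loading such an export can be sketched as follows. The top-level `messages` list and the per-message `type` field follow the Telegram Desktop export format; the inline sample is a hypothetical stand-in for a real export file.

```python
import json

# Hypothetical miniature of a Telegram Desktop JSON export: service
# messages (joins, invites, removals) and regular messages share the
# same "messages" list and are told apart by "type".
sample = """
{"name": "test chat", "type": "private_supergroup",
 "messages": [{"id": 1, "type": "message", "text": "ciao"},
              {"id": 2, "type": "service", "action": "invite_members"}]}
"""

chat = json.loads(sample)
# Keep only regular messages; service messages are used separately
# for the user-count analysis.
messages = [m for m in chat["messages"] if m["type"] == "message"]
print(len(messages))
```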
Dependencies:
Here we use service messages from the chat to analyse the growth or decline of the total number of users.
Important: since the user count at the beginning of the chat is not available, the initial count is estimated from the current number of users and the recorded variation (joins, invitations, and removals). This proved correct in test chats, but can yield negative results in larger ones; we are still looking for a possible explanation.
In summary:
current users: 2
initial users: 2
users who joined: 0
invited users: 0
removed users: 0
date | joined | invited | removed | total (estimated)
---|---|---|---|---
2021-09-21 | 0 | 0 | 0 | 2 |
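The estimate described above can be sketched as follows; the function name is ours, not part of the tool.

```python
# Reverse the recorded variation out of the current member count:
# joins and invitations inflated it, removals deflated it.
def estimate_initial_users(current, joined, invited, removed):
    return current - joined - invited + removed

# The sample chat: 2 current users, no recorded variation.
print(estimate_initial_users(2, 0, 0, 0))  # 2
```

Because the variation is reconstructed only from service messages, a chat whose early history is missing can make this subtraction go negative, which matches the caveat above.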
The files provided with the notebook contain Italian names and toponyms (kudos to Phelipe de Sterlich and ISTAT). These files can easily be replaced if needed.
To achieve a higher degree of precision, the anonymization is performed with regex. It takes time; be patient.
IMPORTANT: surnames are not removed from messages. The reason is that people very seldom refer to other members of the chat, or to themselves, by surname. Surnames are more often used to refer to public figures or sources of information, and are thus valuable for the analysis. If needed, surname removal could easily be added by copy-pasting a few lines of code and using a list like this one.

Keeping this in mind, even if rather comprehensive and accurate, the anonymization process does not guarantee the absence of other identifiers in the text. It is therefore suggested to release datasets generated with this software as "available upon request".
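The regex-based replacement can be sketched as follows. The two-entry `first_names` list is a hypothetical stand-in for the name file shipped with the notebook; the `[name]` placeholder matches the one visible in the anonymized output below.

```python
import re

# Stand-in for the Italian first-name list provided with the notebook.
first_names = ["Mario", "Giulia"]

# One alternation over the whole list; \b keeps "Mario" from matching
# inside longer words, re.IGNORECASE catches lowercase mentions.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, first_names)) + r")\b",
    re.IGNORECASE,
)

def anonymize(text):
    return pattern.sub("[name]", text)

print(anonymize("Mario scrive a Giulia"))  # [name] scrive a [name]
```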
 | id | date | pseudonym | message
---|---|---|---|---
1 | 63213 | 2021-09-21 16:21:41 | dominica | All'inizio siamo partiti che dire che il proge... |
2 | 63214 | 2021-09-21 16:21:52 | dominica | [name] di [name] Torre, [name] Ciarrapico e [n... |
3 | 63215 | 2021-09-21 16:22:02 | dominica | [name] fa ridere e riflettere [name] stesso te... |
4 | 63216 | 2021-09-21 16:22:13 | dominica | Ciò che è successo il giorno della conferenza ... |
5 | 63217 | 2021-09-21 16:22:25 | dominica | Crediamo che sia meravigliosa la longevità di ... |
... | ... | ... | ... | ... |
167 | 63445 | 2021-09-24 10:13:22 | dominica | A causa delle loro dimensioni e della loro nat... |
168 | 63446 | 2021-09-24 10:13:29 | dominica | Se sei il [name] proprietario di un uccello o ... |
169 | 63447 | 2021-09-24 10:13:57 | dominica | Questa è la prima cosa. [name] il green pass p... |
170 | 63448 | 2021-09-24 10:14:17 | dominica | La molla del linciaggio è la ricerca del creti... |
171 | 63449 | 2021-09-24 10:14:46 | dominica | Due gioielli politici si sono opposti con tutt... |
171 rows × 4 columns
Here we calculate how many users are active (at least two messages sent) and how many are "very active" (arbitrarily defined as users whose message count is at or above the 75th percentile).
Important: every "very active" user is by definition also an "active" user. Hence, to plot them in a meaningful way, we calculate and plot the number of users who are "active" but not "very active".
Important: a former user may have been a very active user; hence the percentage is calculated on "total users", not on "total current users".
Total active users (at least 1 message, including former active users): 2
Interacting users (at least two messages): 2 (100.0%)
Very active users (message count in 75% quantile): 1 (50.0%)
pseudonym | count | frequency
---|---|---
dominica | 151 | 0.883041 |
rodrigo | 20 | 0.116959 |
count      2.000000
mean      85.500000
std       92.630988
min       20.000000
25%       52.750000
50%       85.500000
75%      118.250000
max      151.000000
Name: count, dtype: float64
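The activity thresholds above can be sketched with pandas, using the per-user message counts from the table.

```python
import pandas as pd

# Per-user message counts from the sample chat.
counts = pd.Series({"dominica": 151, "rodrigo": 20}, name="count")

active = counts[counts >= 2]        # at least two messages
threshold = counts.quantile(0.75)   # 118.25 for this sample
very_active = counts[counts >= threshold]

# "Active but not very active" is what gets plotted, since every
# very active user is also active.
active_only = active.index.difference(very_active.index)

print(len(active), len(very_active), list(active_only))
```

With only two users the 75th-percentile cutoff (118.25) leaves a single "very active" user, matching the 50.0% reported above.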
Here we calculate the number of messages written to the chat every day.
date | count
---|---
2021-09-21 | 56 |
2021-09-22 | 57 |
2021-09-23 | 2 |
2021-09-24 | 56 |
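The per-day count can be sketched as a groupby over the calendar date of each timestamp; the sample timestamps below are hypothetical.

```python
import pandas as pd

# Hypothetical message timestamps spanning two days.
df = pd.DataFrame({"date": pd.to_datetime([
    "2021-09-21 16:21:41",
    "2021-09-21 16:21:52",
    "2021-09-22 09:00:00",
])})

# Group by the calendar date (dropping the time of day) and count.
per_day = df.groupby(df["date"].dt.date).size().rename("count")
print(per_day.tolist())  # [2, 1]
```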
Here we use dictionary files to autocode entire messages. The assumption is that a message in a large group chat can be considered a minimal conceptual unit, i.e. a text in which a user develops one main argument or touches on one main topic. Hence, if one or more rules from a dict fire for a given message, that message is autocoded as belonging to that dict.
The dictionary files are plain text files stored in /dict; the name of the file is used as the name of the code defined by the rules contained in the file.
The rules are written in regex, e.g. 'vaccin.*' will capture 'vaccine', 'vaccines', 'vaccination', and so on.
Regex allows the definition of fairly complex rules. As an example:
(tesser.\sverd.?|pass\sverd.?|certifica\w*\sverd.?)
This rule will fire on "tessera verde", "tessere verdi", "pass verde", or "certificato verde", but not on "casa verde", "verderame", or "tessera del cinema".
For more details on regex and to develop and test new rules, check regex101.
The code has a weight system: if only one rule from the dict fires, the autocode has a weight of 1; if two rules fire, the weight is 2; and so on. MaxQDA does not (yet) support importing weighted codes, but it might in the future. Moreover, these values can be used for further analyses and are thus exported in the dataframe.
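The autocoding and weighting steps can be sketched as follows, reusing the 'green pass' rule from the text plus a 'vaccin.*' rule; the function name is ours.

```python
import re

# Each line of a dict file is one regex rule; here two rules stand in
# for the contents of a hypothetical dict.
rules = [
    r"vaccin.*",
    r"(tesser.\sverd.?|pass\sverd.?|certifica\w*\sverd.?)",
]

def autocode_weight(message, rules):
    # The weight is the number of rules that fire on the message.
    return sum(1 for rule in rules
               if re.search(rule, message, re.IGNORECASE))

print(autocode_weight("il certificato verde e il vaccino", rules))  # 2
print(autocode_weight("casa verde", rules))                         # 0
```

A weight of 0 means the message is not autocoded with that dict; any weight above 0 marks it as belonging to the code.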
 | id | date | pseudonym | message | covid-19 | freedom | green pass | links | parrot | university | vaccine
---|---|---|---|---|---|---|---|---|---|---|---
1 | 63213 | 2021-09-21 16:21:41 | dominica | All'inizio siamo partiti che dire che il proge... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 63214 | 2021-09-21 16:21:52 | dominica | [name] di [name] Torre, [name] Ciarrapico e [n... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 63215 | 2021-09-21 16:22:02 | dominica | [name] fa ridere e riflettere [name] stesso te... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 63216 | 2021-09-21 16:22:13 | dominica | Ciò che è successo il giorno della conferenza ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 63217 | 2021-09-21 16:22:25 | dominica | Crediamo che sia meravigliosa la longevità di ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
167 | 63445 | 2021-09-24 10:13:22 | dominica | A causa delle loro dimensioni e della loro nat... | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
168 | 63446 | 2021-09-24 10:13:29 | dominica | Se sei il [name] proprietario di un uccello o ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
169 | 63447 | 2021-09-24 10:13:57 | dominica | Questa è la prima cosa. [name] il green pass p... | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
170 | 63448 | 2021-09-24 10:14:17 | dominica | La molla del linciaggio è la ricerca del creti... | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
171 | 63449 | 2021-09-24 10:14:46 | dominica | Due gioielli politici si sono opposti con tutt... | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
171 rows × 11 columns
Here we calculate the prevalence of the codes. The value used as 'count' represents the weight of the code, i.e. the number of times each rule fired across all messages. Normalization is performed by dividing the 'count' value by the number of messages and multiplying by 100.
code | count | frequency
---|---|---
covid-19 | 38 | 22.222222 |
freedom | 8 | 4.678363 |
green pass | 3 | 1.754386 |
links | 4 | 2.339181 |
parrot | 13 | 7.602339 |
university | 3 | 1.754386 |
vaccine | 32 | 18.713450 |
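The normalization can be checked against the 'covid-19' row of the table:

```python
# frequency = count / number of messages * 100
n_messages = 171   # messages in the sample chat
count = 38         # total weight of the 'covid-19' code

frequency = count / n_messages * 100
print(round(frequency, 6))  # 22.222222
```

Note that because the weight counts every rule that fired, frequencies across codes can sum to more than 100%.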
Here we create a bag of words from the anonymized messages, lemmatize it with spaCy, and calculate the frequencies of the lemmas.
Important: remember to specify the correct language model and, if necessary, to add custom stopwords to the stoplist (first code cell of this notebook).
['text', 'type', 'e', 'essere', 'il', '', ' ', 'place', 'name', '\\n', '\\n\\n', 'o']
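The frequency step can be sketched as follows. In the notebook, the lemmas come from spaCy (with an Italian model) after stopword filtering; here a hypothetical lemma list stands in for that output.

```python
from collections import Counter

# Stand-in for spaCy's lemmatized output of the anonymized messages.
lemmas = ["protagonista", "voce", "protagonista", "il"]
stoplist = {"il", "essere", "e"}

# Drop stopwords, count lemmas, normalize to percentages.
counts = Counter(l for l in lemmas if l not in stoplist)
total = sum(counts.values())
freq = {lemma: round(n / total * 100, 4) for lemma, n in counts.items()}

print(freq)  # {'protagonista': 66.6667, 'voce': 33.3333}
```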
 | lemma | count | frequency
---|---|---|---
0 | protagonista | 44 | 2.5086 |
1 | voce | 40 | 2.2805 |
2 | campo | 38 | 2.1665 |
3 | sapere | 21 | 1.1973 |
4 | tyler | 21 | 1.1973 |
... | ... | ... | ... |
1749 | microonde | 1 | 0.0570 |
1750 | bleu | 1 | 0.0570 |
1751 | cordon | 1 | 0.0570 |
1752 | hobby | 1 | 0.0570 |
1753 | ricoverare | 1 | 0.0570 |
1754 rows × 3 columns
Same as above, but instead of using a single bag of words we create a bag of words for each code.
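The per-code split can be sketched as follows: a message enters the bag of words of a code whenever that code's weight is above zero. The rows below are hypothetical.

```python
# Stand-in for the autocoded dataframe: one dict per message, with the
# per-code weights alongside the text.
rows = [
    {"message": "virus globale",       "covid-19": 1, "parrot": 0},
    {"message": "pappagallo parlante", "covid-19": 0, "parrot": 1},
    {"message": "covid e pappagalli",  "covid-19": 1, "parrot": 1},
]

# One bag of messages per code; a message can belong to several bags.
bags = {code: [r["message"] for r in rows if r[code] > 0]
        for code in ("covid-19", "parrot")}

print(len(bags["covid-19"]), len(bags["parrot"]))  # 2 2
```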
 | lemma | count | frequency
---|---|---|---
0 | virus | 12 | 2.2222 |
1 | coronavirus | 10 | 1.8519 |
2 | globale | 8 | 1.4815 |
3 | covid | 8 | 1.4815 |
4 | covid-19 | 7 | 1.2963 |
... | ... | ... | ... |
535 | intero | 1 | 0.1852 |
536 | colpire | 1 | 0.1852 |
537 | rapidità | 1 | 0.1852 |
538 | inesorabile | 1 | 0.1852 |
539 | cacciatore | 1 | 0.1852 |
540 rows × 3 columns
 | lemma | count | frequency
---|---|---|---
0 | vaccinazione | 8 | 3.3195 |
1 | libertà | 7 | 2.9046 |
2 | credere | 6 | 2.4896 |
3 | raccontare | 4 | 1.6598 |
4 | fronte | 3 | 1.2448 |
... | ... | ... | ... |
236 | grande | 1 | 0.4149 |
237 | scienziato | 1 | 0.4149 |
238 | virus | 1 | 0.4149 |
239 | debole | 1 | 0.4149 |
240 | davvero | 1 | 0.4149 |
241 rows × 3 columns
 | lemma | count | frequency
---|---|---|---
0 | green | 4 | 6.6667 |
1 | pass | 4 | 6.6667 |
2 | cretino | 3 | 5.0000 |
3 | addossare | 3 | 5.0000 |
4 | dare | 3 | 5.0000 |
5 | aprire | 3 | 5.0000 |
6 | c' | 2 | 3.3333 |
7 | volere | 2 | 3.3333 |
8 | opporre | 2 | 3.3333 |
9 | politico | 2 | 3.3333 |
10 | condizione | 1 | 1.6667 |
11 | decisione | 1 | 1.6667 |
12 | esponente | 1 | 1.6667 |
13 | lockdown | 1 | 1.6667 |
14 | migliaio | 1 | 1.6667 |
15 | ricoverati | 1 | 1.6667 |
16 | morto | 1 | 1.6667 |
17 | organizzare | 1 | 1.6667 |
18 | corteo | 1 | 1.6667 |
19 | mascherina | 1 | 1.6667 |
20 | signore | 1 | 1.6667 |
21 | contenimento | 1 | 1.6667 |
22 | ospedale | 1 | 1.6667 |
23 | pericolo | 1 | 1.6667 |
24 | qual | 1 | 1.6667 |
25 | l’[name | 1 | 1.6667 |
26 | ginocchio | 1 | 1.6667 |
27 | scuola | 1 | 1.6667 |
28 | privo | 1 | 1.6667 |
29 | norma | 1 | 1.6667 |
30 | minimo | 1 | 1.6667 |
31 | sicurezza | 1 | 1.6667 |
32 | totale | 1 | 1.6667 |
33 | proposta | 1 | 1.6667 |
34 | salvini | 1 | 1.6667 |
35 | capitare | 1 | 1.6667 |
36 | giornata | 1 | 1.6667 |
37 | socializzazione | 1 | 1.6667 |
38 | molla | 1 | 1.6667 |
39 | linciaggio | 1 | 1.6667 |
40 | ricerca | 1 | 1.6667 |
41 | svegliare | 1 | 1.6667 |
42 | deputato | 1 | 1.6667 |
43 | covid | 1 | 1.6667 |
44 | dovere | 1 | 1.6667 |
45 | sieropositivo | 1 | 1.6667 |
46 | mezzo | 1 | 1.6667 |
47 | mattina | 1 | 1.6667 |
48 | sentire | 1 | 1.6667 |
49 | lo | 1 | 1.6667 |
50 | dire | 1 | 1.6667 |
51 | grosso | 1 | 1.6667 |
52 | accanito | 1 | 1.6667 |
53 | cacciatore | 1 | 1.6667 |
54 | pensare | 1 | 1.6667 |
55 | gioiello | 1 | 1.6667 |
56 | forza | 1 | 1.6667 |
57 | meloni | 1 | 1.6667 |
58 | cosa | 1 | 1.6667 |
59 | ricoverare | 1 | 1.6667 |
 | lemma | count | frequency
---|---|---|---
0 | link | 2 | 28.5714 |
1 | https://it.wikiquote.org/wiki/pandemia_di_covi... | 1 | 14.2857 |
2 | buttare | 1 | 14.2857 |
3 | mention_name | 1 | 14.2857 |
4 | user_id | 1 | 14.2857 |
5 | 1971944511 | 1 | 14.2857 |
6 | https://it.wikiquote.org/wiki/pandemia_di_covi... | 1 | 14.2857 |
 | lemma | count | frequency
---|---|---|---
0 | pappagallo | 8 | 4.7904 |
1 | cocorite | 6 | 3.5928 |
2 | parlare | 5 | 2.9940 |
3 | uccello | 5 | 2.9940 |
4 | animale | 5 | 2.9940 |
... | ... | ... | ... |
162 | intelligente | 1 | 0.5988 |
163 | incredibilmente | 1 | 0.5988 |
164 | vivace | 1 | 0.5988 |
165 | senso | 1 | 0.5988 |
166 | cocorita | 1 | 0.5988 |
167 rows × 3 columns
 | lemma | count | frequency
---|---|---|---
0 | arma | 4 | 2.7972 |
1 | biologico | 4 | 2.7972 |
2 | internazionale | 2 | 1.3986 |
3 | intendere | 2 | 1.3986 |
4 | contrario | 2 | 1.3986 |
... | ... | ... | ... |
138 | 500 | 1 | 0.6993 |
139 | vero | 1 | 0.6993 |
140 | equilibrio | 1 | 0.6993 |
141 | covid | 1 | 0.6993 |
142 | rafforzare | 1 | 0.6993 |
143 rows × 3 columns
 | lemma | count | frequency
---|---|---|---
0 | vaccino | 12 | 2.8369 |
1 | vaccinazione | 10 | 2.3641 |
2 | salute | 6 | 1.4184 |
3 | malattia | 5 | 1.1820 |
4 | questione | 4 | 0.9456 |
... | ... | ... | ... |
418 | popolazione | 1 | 0.2364 |
419 | gran | 1 | 0.2364 |
420 | annientare | 1 | 0.2364 |
421 | gates | 1 | 0.2364 |
422 | cacciatore | 1 | 0.2364 |
423 rows × 3 columns
Sentiment analysis calculates the probability of positive or negative sentiment for each message. This is performed using FEEL-IT: Emotion and Sentiment Classification for the Italian Language. Kudos to Federico Bianchi f.bianchi@unibocconi.it; Debora Nozza debora.nozza@unibocconi.it; and Dirk Hovy dirk.hovy@unibocconi.it for their amazing work.
To bin the sentiment probability and plot it, we label a message as "positive sentiment" when the relative probability of positive sentiment is > 0.75, and as "negative sentiment" when the relative probability of negative sentiment is > 0.75.
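The binning rule can be sketched as follows; the function name is ours, and the probabilities passed in would come from the FEEL-IT classifier.

```python
# A message gets a label only when one class's probability exceeds the
# threshold; borderline messages get no label at all.
def bin_sentiment(p_positive, threshold=0.75):
    if p_positive > threshold:
        return "positive"
    if 1 - p_positive > threshold:
        return "negative"
    return None

print(bin_sentiment(0.9), bin_sentiment(0.1), bin_sentiment(0.6))
# positive negative None
```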
date | positive | negative
---|---|---
2021-09-21 | 0.336329 | 0.663671 |
2021-09-22 | 0.158732 | 0.841268 |
2021-09-23 | 0.499950 | 0.500050 |
2021-09-24 | 0.141189 | 0.858811 |
code | positive | negative
---|---|---
covid-19 | 0.035784 | 0.964216 |
freedom | 0.375025 | 0.624975 |
green pass | 0.333367 | 0.666633 |
links | 0.000350 | 0.999650 |
parrot | 0.546236 | 0.453764 |
university | 0.000200 | 0.999800 |
vaccine | 0.055769 | 0.944231 |
Here we export the files to be used for further analyses:
Tabular data in .csv files:
Structured data in text files:
Exports of the notebook as .html files: