This tool parses .json files containing Telegram chats exported with Telegram Desktop (v. 2.9.2.0). It takes a .json chat file as input and outputs:
Dependencies:
available files: result_esercenti.json result_giardinaggio.json result_pappagalli.json result_rifiuto il green pass.json result_sapienza.json result_unipd.json
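Here we use service messages from the chat to analyse the growth or decline of the total number of users. Since the export does not include a user count for the beginning of the chat, we assume that the initial count is 2, i.e. the minimum number of users needed to create a group. The daily variation is the number of users who joined or were invited, minus the number of users removed.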
date | joined | invited | removed | daily variation | total |
---|---|---|---|---|---|
2021-08-08 | 442 | 1 | 1 | 442 | 444 |
2021-08-09 | 527 | 1 | 0 | 528 | 972 |
2021-08-10 | 278 | 14 | 6 | 286 | 1258 |
2021-08-11 | 203 | 52 | 78 | 177 | 1435 |
2021-08-12 | 126 | 26 | 42 | 110 | 1545 |
2021-08-13 | 143 | 0 | 0 | 143 | 1688 |
2021-08-14 | 117 | 0 | 0 | 117 | 1805 |
2021-08-15 | 87 | 0 | 0 | 87 | 1892 |
2021-08-16 | 0 | 0 | 0 | 0 | 1892 |
2021-08-17 | 0 | 0 | 0 | 0 | 1892 |
2021-08-18 | 1 | 0 | 0 | 1 | 1893 |
2021-08-19 | 1 | 1 | 0 | 2 | 1895 |
2021-08-20 | 2 | 0 | 0 | 2 | 1897 |
2021-08-21 | 0 | 0 | 0 | 0 | 1897 |
2021-08-22 | 0 | 0 | 0 | 0 | 1897 |
2021-08-23 | 0 | 0 | 0 | 0 | 1897 |
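The running total in the table can be reproduced with a few lines of code. The following is a minimal sketch, not the notebook's actual implementation: the function name `user_growth` and the tuple layout are assumptions made for illustration, and the sample rows are the first three days of the table above.

```python
# Daily counts of service events, copied from the first rows of the table:
# (date, joined, invited, removed)
daily_events = [
    ("2021-08-08", 442, 1, 1),
    ("2021-08-09", 527, 1, 0),
    ("2021-08-10", 278, 14, 6),
]

INITIAL_USERS = 2  # assumption stated in the text: minimum size of a group


def user_growth(events, initial=INITIAL_USERS):
    """Return (date, daily_variation, running_total) for each day."""
    rows, total = [], initial
    for date, joined, invited, removed in events:
        variation = joined + invited - removed
        total += variation
        rows.append((date, variation, total))
    return rows


growth = user_growth(daily_events)
for row in growth:
    print(row)
```

Because the starting value is an assumption, the absolute totals may be off by a constant, while the daily variations do not depend on it.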
The files provided with the notebook contain Italian names and toponyms (kudos to Phelipe de Sterlich and ISTAT). These files can easily be replaced if need be.
IMPORTANT: Surnames are not removed from messages. The reason is that people very seldom refer to other members of the chat, or to themselves, by surname. Surnames are more often used to refer to public figures or sources of information, and are thus valuable information for the analysis. Removing them could easily be done, if need be, by copy-pasting a few lines of code and using a list of surnames.
Keep in mind that the anonymization process, even if rather comprehensive and accurate, does not guarantee the absence of other identifiers in the text. Therefore, it is suggested to release datasets generated with this software as "available upon request".
Pseudonymization: replace each user id with a unique name
Anonymization: remove names and places from messages
Total messages: 4838
index | id | date | pseudonym | message |
---|---|---|---|---|
2 | 3 | 2021-08-08 13:00:34 | Pieruccia | [{'type': 'bot_command', 'text': '/start@Group... |
3 | 4 | 2021-08-08 13:00:34 | Leonarda | [{'type': 'bold', 'text': 'Grazie'}, ' per ave... |
4 | 5 | 2021-08-08 13:00:34 | Leonarda | ['Per configurarmi, usa ', {'type': 'code', 't... |
6 | 7 | 2021-08-08 13:07:22 | Micol | Ciao a tutti! |
11 | 12 | 2021-08-08 13:12:57 | Lello | [place] ciao |
... | ... | ... | ... | ... |
6988 | 9640 | 2021-08-24 18:19:37 | Eolo | Scusa, in una sede istituzionale non puo' asso... |
6989 | 9645 | 2021-08-24 18:20:37 | Cordelio | [{'type': 'mention_name', 'text': 'K', 'user_i... |
6990 | 9646 | 2021-08-24 18:20:38 | Leonarda | [{'type': 'mention_name', 'text': 'K', 'user_i... |
6991 | 9647 | 2021-08-24 18:33:13 | Fanino | ma no, una domanda non lede niente, casomai è ... |
6992 | 9648 | 2021-08-24 18:45:17 | Elisabetta | ["💫 PETIZIONE SCUOLA UNIVERSITÀ PROF GRANARA\n... |
4838 rows × 4 columns
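The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the lists `NAMES`, `PLACES` and `PSEUDONYM_POOL`, the user ids, and the `[name]` placeholder are hypothetical (only `[place]` appears in the output above), and the notebook may match names differently.

```python
import re

# Hypothetical inputs: in the notebook these lists would come from the
# bundled files of Italian names and toponyms.
NAMES = ["Marco", "Giulia"]
PLACES = ["Padova", "Roma"]
PSEUDONYM_POOL = ["Micol", "Lello", "Eolo"]


def build_pseudonyms(user_ids, pool):
    """Map each distinct user id to one pseudonym, in order of first appearance."""
    mapping = {}
    for uid in user_ids:
        if uid not in mapping:
            # Raises IndexError if the pool has fewer names than there are users
            mapping[uid] = pool[len(mapping)]
    return mapping


def compile_terms(terms):
    """Whole-word, case-insensitive pattern matching any of the given terms."""
    ordered = sorted(terms, key=len, reverse=True)  # try longer terms first
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


NAME_RE = compile_terms(NAMES)
PLACE_RE = compile_terms(PLACES)


def anonymize(text):
    """Replace known places and first names with placeholders."""
    text = PLACE_RE.sub("[place]", text)
    return NAME_RE.sub("[name]", text)


messages = [
    ("user101", "Padova ciao"),
    ("user202", "Ciao Marco, sono di Roma"),
    ("user101", "Ciao a tutti!"),
]
pseudonyms = build_pseudonyms((uid for uid, _ in messages), PSEUDONYM_POOL)
cleaned = [(pseudonyms[uid], anonymize(text)) for uid, text in messages]
```

A list-based approach like this only removes the terms it knows about, which is why identifiers such as surnames or nicknames can survive in the text.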
Here we calculate how many users are active (at least 1 message sent) and how many are "very active" (arbitrarily defined as users whose message count is at or above the 75% quantile of active users), each expressed also as a percentage of the total users. We also calculate each active user's share of all messages.
Total users: 1897
Active users (at least one message): 535 (28.2%)
Very active users (message count in 75% quantile): 144 (7.59%)
Frequencies of messages of active users:
pseudonym | count | frequency |
---|---|---|
Micol | 361 | 0.074618 |
Elisabetta | 224 | 0.046300 |
Erode | 186 | 0.038446 |
Osema | 153 | 0.031625 |
Amandino | 147 | 0.030384 |
... | ... | ... |
Trieste | 1 | 0.000207 |
Adalia | 1 | 0.000207 |
Catino | 1 | 0.000207 |
Bovo | 1 | 0.000207 |
Ombra | 1 | 0.000207 |
535 rows × 2 columns
statistic | count |
---|---|
count | 535.000000 |
mean | 9.042991 |
std | 25.750969 |
min | 1.000000 |
25% | 1.000000 |
50% | 3.000000 |
75% | 7.000000 |
max | 361.000000 |
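The activity figures can be derived from a per-user message count. The sketch below uses only the standard library and made-up data; the variable names are assumptions. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should correspond to the default behaviour of a pandas quantile.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical data: one pseudonym per message, plus the total user count
# obtained from the growth analysis.
authors = ["A"] * 8 + ["B"] * 4 + ["C"] * 2 + ["D", "E"]
total_users = 20

message_counts = Counter(authors)
active_users = len(message_counts)  # users with at least one message

# Third quartile (75% quantile) of messages per active user
threshold = quantiles(message_counts.values(), n=4, method="inclusive")[2]
very_active = sorted(u for u, c in message_counts.items() if c >= threshold)

# Share of all messages written by each active user, most active first
frequencies = {u: c / len(authors) for u, c in message_counts.most_common()}

print(f"Active users: {active_users} ({active_users / total_users:.1%})")
print(f"Very active users: {len(very_active)} ({len(very_active) / total_users:.1%})")
```

Because many users share the same message count, the "very active" group can be somewhat larger than a quarter of the active users, as in the output above (144 of 535).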
Here we calculate the number of messages written to the chat every day.
date | count |
---|---|
2021-08-08 | 954 |
2021-08-09 | 784 |
2021-08-10 | 478 |
2021-08-11 | 279 |
2021-08-12 | 173 |
2021-08-13 | 168 |
2021-08-14 | 189 |
2021-08-15 | 104 |
2021-08-16 | 183 |
2021-08-17 | 196 |
2021-08-18 | 153 |
2021-08-19 | 216 |
2021-08-20 | 136 |
2021-08-21 | 201 |
2021-08-22 | 120 |
2021-08-23 | 304 |
2021-08-24 | 200 |
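A daily count like the one above amounts to grouping messages by calendar day. The following sketch assumes timestamps in the `YYYY-MM-DD HH:MM:SS` format shown in the message table; the function name `messages_per_day` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps in the format shown in the message table
timestamps = [
    "2021-08-08 13:00:34",
    "2021-08-08 13:07:22",
    "2021-08-09 09:15:00",
]


def messages_per_day(stamps):
    """Count messages per calendar day, sorted chronologically."""
    days = (
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        for stamp in stamps
    )
    return dict(sorted(Counter(days).items()))


daily_counts = messages_per_day(timestamps)
print(daily_counts)
```

Days without any message do not appear in the result of this sketch; they would have to be filled in with 0 if a continuous time series is needed.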
Here we use dictionary files for autocoding entire messages. The assumption is that a message in a large group chat can be considered a minimal conceptual unit, i.e. a text in which a user develops one main argument or touches on one main topic. Hence, if one or more rules from a dict fire for a given message, that message is autocoded as belonging to that dict.
The dictionary files are plain text files stored in /dict; the name of the file is used as the name of the code defined by the rules contained in the file.
The rules are written in regex, e.g. 'vaccin.*' will capture 'vaccine', 'vaccines', 'vaccination', and so on.
Regex allows the definition of fairly complex rules. As an example:
\d* ?((second)|(seconds)|(minute)|(minutes)|(hour)|(hours))
This rule fires every time a run of digits (0-9, possibly empty) is followed by an optional space and then by a time unit (second, minute, hour). Used on a recipe, it would identify the passages in which a precise amount of time is expressed.
For more details on regex and to develop and test new rules, check regex101.
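The behaviour of the rule above can be checked directly with Python's `re` module. The recipe sentence is made up for illustration, and `STRICT_RULE` is a hypothetical tightened variant, not a rule shipped with the tool.

```python
import re

# The rule from the text, unchanged
TIME_RULE = re.compile(r"\d* ?((second)|(seconds)|(minute)|(minutes)|(hour)|(hours))")

recipe = "Boil for 10 minutes, then rest 1 hour."
matches = [m.group(0) for m in TIME_RULE.finditer(recipe)]

# The singular alternative is tried first, so the plural "s" is left out of
# the matched text; the rule still fires, which is what matters for autocoding.
print(matches)

# Since \d* also matches zero digits, a bare unit fires the rule as well
bare = TIME_RULE.search("wait an hour").group(0)

# Hypothetical stricter variant: at least one digit, optional plural "s"
STRICT_RULE = re.compile(r"\d+ ?(?:seconds?|minutes?|hours?)")
strict_matches = STRICT_RULE.findall(recipe)
strict_bare = STRICT_RULE.search("wait an hour")
```

Whether the looser or the stricter form is preferable depends on the code being defined: for autocoding only the fact that a rule fires is recorded, so the exact extent of the match rarely matters.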
Each autocode carries a weight: if only one rule from the dict fires, the weight is 1; if two rules fire, the weight is 2; and so on. MaxQDA does not (yet) support the import of weighted codes, but it might in the future. In any case, these values can be used for further analyses and are therefore exported in the dataframe.
index | id | date | pseudonym | message | covid-19 | freedom | green pass | links | university | vaccine |
---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | 2021-08-08 13:00:34 | Pieruccia | [{'type': 'bot_command', 'text': '/start@Group... | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 4 | 2021-08-08 13:00:34 | Leonarda | [{'type': 'bold', 'text': 'Grazie'}, ' per ave... | 0 | 0 | 0 | 2 | 0 | 0 |
4 | 5 | 2021-08-08 13:00:34 | Leonarda | ['Per configurarmi, usa ', {'type': 'code', 't... | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 7 | 2021-08-08 13:07:22 | Micol | Ciao a tutti! | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 12 | 2021-08-08 13:12:57 | Lello | [place] ciao | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6988 | 9640 | 2021-08-24 18:19:37 | Eolo | Scusa, in una sede istituzionale non puo' asso... | 0 | 0 | 0 | 0 | 0 | 0 |
6989 | 9645 | 2021-08-24 18:20:37 | Cordelio | [{'type': 'mention_name', 'text': 'K', 'user_i... | 0 | 0 | 0 | 0 | 0 | 0 |
6990 | 9646 | 2021-08-24 18:20:38 | Leonarda | [{'type': 'mention_name', 'text': 'K', 'user_i... | 0 | 0 | 1 | 2 | 1 | 0 |
6991 | 9647 | 2021-08-24 18:33:13 | Fanino | ma no, una domanda non lede niente, casomai è ... | 0 | 0 | 0 | 0 | 0 | 0 |
6992 | 9648 | 2021-08-24 18:45:17 | Elisabetta | ["💫 PETIZIONE SCUOLA UNIVERSITÀ PROF GRANARA\n... | 1 | 1 | 0 | 2 | 2 | 0 |
4838 rows × 10 columns
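The weighted autocoding described above can be sketched as follows. The dictionaries, their rules and the case-insensitive matching are assumptions made for illustration; in the tool each dictionary would be a plain text file in /dict, and the weight here is taken to be the number of distinct rules that fire on a message.

```python
import re

# Hypothetical dictionaries: code name -> list of regex rules
DICTIONARIES = {
    "vaccine": ["vaccin.*", "dose"],
    "green pass": ["green ?pass", "certificat.*"],
}


def compile_dictionaries(dictionaries):
    """Compile every rule once; case-insensitive matching is an assumption."""
    return {
        code: [re.compile(rule, re.IGNORECASE) for rule in rules]
        for code, rules in dictionaries.items()
    }


def autocode(message, compiled):
    """Return {code: weight}, where weight is the number of rules that fired."""
    return {
        code: sum(1 for rule in rules if rule.search(message))
        for code, rules in compiled.items()
    }


compiled = compile_dictionaries(DICTIONARIES)
first = autocode("Second dose of the vaccine done", compiled)
second = autocode("No Green Pass here", compiled)
print(first, second)
```

A weight of 0 means the message is not assigned to that code, which matches the columns of zeros in the table above.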
Here we calculate the prevalence of the codes. The 'count' value is the sum of the weights of a code, i.e. the total number of rule firings for that code across all messages. Normalization is performed by dividing the 'count' value by the number of messages and multiplying by 100.
code | count | frequency |
---|---|---|
covid-19 | 215 | 4.443985 |
freedom | 242 | 5.002067 |
green pass | 548 | 11.326995 |
links | 1107 | 22.881356 |
university | 1235 | 25.527077 |
vaccine | 1220 | 25.217032 |
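The normalization can be written out in a few lines. This is a sketch with made-up weights; the function name `code_prevalence` and the input layout (one `{code: weight}` dict per message) are assumptions.

```python
def code_prevalence(coded_messages):
    """Return {code: (count, frequency)} from per-message {code: weight} dicts."""
    n_messages = len(coded_messages)
    totals = {}
    for weights in coded_messages:
        for code, weight in weights.items():
            totals[code] = totals.get(code, 0) + weight
    # frequency = summed weight per 100 messages
    return {code: (count, count / n_messages * 100) for code, count in totals.items()}


# Hypothetical weights for four messages
coded = [
    {"vaccine": 2, "links": 0},
    {"vaccine": 0, "links": 1},
    {"vaccine": 1, "links": 0},
    {"vaccine": 0, "links": 0},
]
prevalence = code_prevalence(coded)
print(prevalence)
```

With the figures in the table, the covid-19 code gives 215 / 4838 * 100, i.e. about 4.44. Since 'count' sums weights rather than counting coded messages, the frequency is a number of rule firings per 100 messages and is not a percentage of messages in the strict sense.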
Here we export the files to be used for further analyses:
All good, files exported!