JSON parser and autocoder for Telegram chats

This tool parses .json files containing Telegram chats exported with Telegram Desktop (v. 2.9.2.0). It takes a .json chat file as input and outputs:

Dependencies:

User count

Here we use service messages from the chat to analyse the growth or decline of the total number of users.

Important: since the export does not include the user count at the beginning of the chat, the initial count is reconstructed from the current number of users and the variation recorded in service messages. This proved correct in test chats, but can yield negative results in larger ones; a possible explanation is still being investigated.
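
A minimal sketch of the reconstruction logic, assuming the service messages have already been reduced to a chronological list of +1/-1 membership changes; the names below are illustrative, not the notebook's actual variables.

```python
# Minimal sketch: reconstruct the initial member count from the current one.
# Assumes `deltas` is a chronological list of +1/-1 changes extracted from
# service messages (joins/invites vs. removals/leaves); names are illustrative.
def initial_member_count(current_members: int, deltas: list) -> int:
    # The net variation over the whole chat history...
    net_change = sum(deltas)
    # ...is subtracted from the current count to estimate the starting point.
    # A negative result would signal membership changes not captured by
    # service messages, which may explain the anomaly seen in larger chats.
    return current_members - net_change

print(initial_member_count(150, [+1] * 60 + [-1] * 10))  # -> 100
```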

In summary:

Anonymization and pseudonymization

The files provided with the notebook contain Italian names and toponyms (kudos to Phelipe de Sterlich and ISTAT). These files can easily be replaced if needed.

To achieve a higher degree of precision, this is done with regex. It takes time, so be patient.

IMPORTANT: Surnames are not removed from messages. The reason is that people very seldom refer to other members of the chat, or to themselves, by surname. Surnames are more often used to refer to public figures or sources of information, and are thus valuable information for the analysis. If needed, surname removal could easily be added by copy-pasting a few lines of code and using a list like this one. Keeping this in mind, even if rather comprehensive and accurate, the anonymization process does not guarantee the absence of other identifiers in the text. It is therefore suggested to release datasets generated with this software as "available upon request".
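
A minimal sketch of the regex-based pseudonymization, assuming the name and toponym files have been loaded into Python lists of strings; the variable names and placeholder tokens are illustrative, not necessarily the notebook's actual ones.

```python
# Minimal sketch of regex-based pseudonymization; lists and placeholder
# tokens are illustrative.
import re

first_names = ["Mario", "Giulia", "Luca"]   # e.g. loaded from the names file
toponyms = ["Roma", "Milano", "Napoli"]     # e.g. loaded from the toponyms file

# Compile one case-insensitive pattern per list, matching whole words only.
name_re = re.compile(r"\b(" + "|".join(map(re.escape, first_names)) + r")\b", re.IGNORECASE)
topo_re = re.compile(r"\b(" + "|".join(map(re.escape, toponyms)) + r")\b", re.IGNORECASE)

def pseudonymize(text: str) -> str:
    text = name_re.sub("[NAME]", text)
    text = topo_re.sub("[PLACE]", text)
    return text

print(pseudonymize("Giulia si è trasferita a Milano"))
# -> "[NAME] si è trasferita a [PLACE]"
```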

User activity

Here we calculate how many users are active (at least 2 messages sent), and how many are "very active" (arbitrarily defined as users at or above the 75th percentile of messages sent).

Important: every "very active user" is by definition also an "active user". Hence, to plot them in a meaningful way, we calculate and plot the number of users who are "active" but not "very active".

Important: a former user may have been a very active user; hence, the percentage is calculated on the "total users", not on the "total current users".
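
A minimal sketch of the activity classification, assuming a pandas DataFrame with one row per message and a column identifying the sender; column and variable names are illustrative.

```python
# Minimal sketch of the activity classification; `from_id` identifies the sender.
import pandas as pd

df = pd.DataFrame({"from_id": ["u1", "u1", "u2", "u3", "u3", "u3", "u3"]})

msg_counts = df["from_id"].value_counts()          # messages sent per user
active = msg_counts[msg_counts >= 2]               # "active": at least 2 messages
threshold = msg_counts.quantile(0.75)              # 75th percentile of messages sent
very_active = msg_counts[msg_counts >= threshold]  # "very active": top quartile

# "Active but not very active" is what actually gets plotted,
# since every very active user is also active.
active_only = active.index.difference(very_active.index)
print(len(active), len(very_active), len(active_only))
```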

Messages per day

Here we calculate the number of messages written in the chat every day.
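
A minimal sketch of the daily count, assuming the message timestamps are in a `date` column as exported by Telegram Desktop; names are illustrative.

```python
# Minimal sketch: count messages per calendar day.
import pandas as pd

df = pd.DataFrame({"date": ["2021-08-01T10:00:00",
                            "2021-08-01T12:30:00",
                            "2021-08-02T09:15:00"]})
df["date"] = pd.to_datetime(df["date"])

messages_per_day = df.groupby(df["date"].dt.date).size()
print(messages_per_day)
```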

Autocoding

Here we use dictionary files for autocoding entire messages. The assumption is that a message in a large group chat can be considered a minimal conceptual unit, i.e. a text in which a user develops one main argument or touches on one main topic. Hence, if one or more rules from a dict fire for a given message, that message is autocoded as belonging to that dict.

The dictionary files are plain text files stored in /dict; the name of the file is used as the name of the code defined by the rules contained in the file.

The rules are written in regex, e.g. 'vaccin.*' will capture 'vaccine', 'vaccines', 'vaccination', and so on.

Regex allows the definition of fairly complex rules. As an example:

(tesser.\sverd.?|pass\sverd.?|certifica\w*\sverd.?)

This rule will fire on "tessera verde" or "tessere verdi" or "pass verde" or "certificato verde", but not on "casa verde" or "verderame" or "tessera del cinema".
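
As a quick check, the rule above can be tested directly in Python before moving it to a dictionary file (a minimal sketch; the example messages are illustrative):

```python
import re

rule = re.compile(r"(tesser.\sverd.?|pass\sverd.?|certifica\w*\sverd.?)", re.IGNORECASE)

for msg in ["Hai già la tessera verde?", "Le tessere verdi scadono domani",
            "Serve il certificato verde", "Abito in una casa verde"]:
    print(msg, "->", bool(rule.search(msg)))
# The first three messages fire the rule, the last one does not.
```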

For more details on regex and to develop and test new rules, check regex101.

The code has a weight system: if only one rule from the dict fires, the autocode has a weight of 1; if two rules fire, the weight is 2; and so on. MaxQDA does not (yet) support importing weighted codes, but it might in the future. Moreover, these values can be used for further analyses and are therefore exported in the dataframe.
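
A minimal sketch of the autocoding and weighting step, assuming each dictionary file has been read into a mapping from code name to its list of regex rules; the dictionaries and names below are illustrative, not the notebook's actual ones.

```python
# Minimal sketch of autocoding with weights; dictionaries are illustrative.
import re

dicts = {
    "green_pass": [r"tesser.\sverd.?", r"pass\sverd.?", r"certifica\w*\sverd.?"],
    "vaccines":   [r"vaccin\w*"],
}

def autocode(message: str) -> dict:
    # Weight = number of rules from the dict that fire on the message.
    weights = {}
    for code, rules in dicts.items():
        hits = sum(1 for rule in rules if re.search(rule, message, re.IGNORECASE))
        if hits:
            weights[code] = hits
    return weights

print(autocode("Il pass verde serve anche ai vaccinati"))
# -> {'green_pass': 1, 'vaccines': 1}
```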

Prevalence of codes

Here we calculate the prevalence of the codes. The value used as 'count' represents the weight of the code, i.e. the number of times the rules fired across the messages. Normalization is performed by dividing the 'count' value by the number of messages and multiplying by 100.
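
A minimal sketch of the normalization, assuming the summed weights per code are already available; the numbers and names are illustrative.

```python
# Minimal sketch: normalize code counts to a prevalence per 100 messages.
code_counts = {"green_pass": 240, "vaccines": 610}  # summed weights over all messages
n_messages = 5000

prevalence = {code: count / n_messages * 100 for code, count in code_counts.items()}
print(prevalence)  # -> {'green_pass': 4.8, 'vaccines': 12.2}
```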

Frequency of lemmas

Here we create a bag of words using the anonymized messages, lemmatize with spaCy, and calculate the frequencies of lemmas.

Important: remember to specify the correct linguistic model and, if needed, to add custom stopwords to the stoplist (first code cell of this notebook).
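
A minimal sketch of the lemma frequency count, assuming the Italian model it_core_news_sm is installed (python -m spacy download it_core_news_sm); the stoplist and example messages are illustrative.

```python
# Minimal sketch: lemmatize anonymized messages and count lemma frequencies.
from collections import Counter
import spacy

nlp = spacy.load("it_core_news_sm")   # pick the model matching the chat language
custom_stopwords = {"ciao"}           # extend the stoplist as needed

messages = ["I vaccini funzionano", "Ciao a tutti, parliamo di vaccini"]

lemmas = Counter()
for doc in nlp.pipe(messages):
    for token in doc:
        if token.is_alpha and not token.is_stop and token.lemma_.lower() not in custom_stopwords:
            lemmas[token.lemma_.lower()] += 1

print(lemmas.most_common(5))
```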

Frequency of lemmas in coded messages

Same as above, but instead of using a single bag of words we create a bag of words for each code.

Sentiment analysis

Sentiment analysis calculates the probability of positive or negative sentiment for each message. This is performed using FEEL-IT: Emotion and Sentiment Classification for the Italian Language. Kudos to Federico Bianchi f.bianchi@unibocconi.it, Debora Nozza debora.nozza@unibocconi.it, and Dirk Hovy dirk.hovy@unibocconi.it for their amazing work.

In order to bin the sentiment probabilities and plot them, we define "positive sentiment" when the relative probability of positive sentiment is > 0.75, and "negative sentiment" when the relative probability of negative sentiment is > 0.75.
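
A minimal sketch of the binning rule, assuming FEEL-IT is accessed through the Hugging Face transformers pipeline (the notebook may load it differently); the model name is the one published by the FEEL-IT authors on the Hugging Face Hub, and the label names are assumed to be "positive" and "negative".

```python
# Minimal sketch: classify a message with FEEL-IT and bin the probabilities.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="MilaNLProc/feel-it-italian-sentiment",
                      top_k=None)

def bin_sentiment(message: str) -> str:
    # Collect the per-label probabilities for this message.
    scores = {item["label"]: item["score"] for item in classifier([message])[0]}
    if scores.get("positive", 0) > 0.75:
        return "positive sentiment"
    if scores.get("negative", 0) > 0.75:
        return "negative sentiment"
    return "not binned"  # neither probability exceeds the 0.75 threshold

print(bin_sentiment("Che bella notizia, finalmente!"))
```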

Export

Here we export the files to be used for further analyses:

Tabular data in .csv files:

Structured data in text files:

Exports of the notebook as .html files:
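
A minimal sketch of the export step, assuming the processed messages live in a pandas DataFrame; the file names are illustrative.

```python
# Minimal sketch: export the coded messages to .csv for further analyses.
import pandas as pd

df = pd.DataFrame({"text": ["[NAME] parla del pass verde"], "green_pass": [1]})
df.to_csv("messages_coded.csv", index=False)

# The notebook itself can be exported to .html with nbconvert, e.g.:
#   jupyter nbconvert --to html notebook.ipynb
```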