Json parser and autocoder for Telegram chats

This tool parses .json files containing Telegram chats exported with Telegram Desktop (v. 2.9.2.0). It takes a .json chat file as input, and outputs:

Dependencies:

User count

Here we use the service messages in the chat to analyse the growth or decline of the total number of users. Since there is no user count at the beginning of the chat, we assume that the initial count is 2, i.e. the minimum number of users needed to create a group.
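As a minimal sketch, the running count could be derived from membership-change service messages. The field names below ("type", "action", "members", "date") reflect the Telegram Desktop export format as an assumption, not a guarantee:

```python
# Sketch: running user count from service messages.
# Assumes export messages with a "type"/"action"/"members" layout
# (field names are assumptions about the Telegram Desktop format).

def user_counts(messages, initial=2):
    """Return a list of (date, count) after each membership change."""
    count = initial  # no count at chat start: assume the 2 users needed to create a group
    counts = []
    for msg in messages:
        if msg.get("type") != "service":
            continue
        members = msg.get("members") or []
        if msg.get("action") in ("invite_members", "join_group_by_link"):
            count += max(len(members), 1)  # joins via link may list no members
        elif msg.get("action") == "remove_members":
            count -= max(len(members), 1)
        counts.append((msg.get("date"), count))
    return counts
```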

Anonymization and pseudonymization

The files provided with the notebook contain Italian names and toponyms (kudos to Phelipe de Sterlich and ISTAT). These files can easily be replaced if need be.

IMPORTANT: Surnames are not removed from messages. The reason is that people very seldom refer to other members of the chat, or to themselves, by surname. Surnames are more often used to refer to public figures or sources of information, and are thus valuable information for the analysis. If need be, surname removal could easily be added by copy-pasting a few lines of code and using a list like this one. Keeping this in mind, the anonymization process, even if rather comprehensive and accurate, does not guarantee the absence of other identifiers in the text. Therefore, it is suggested to release datasets generated with this software as "available upon request".
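A pseudonymization pass along these lines could be sketched as follows. The name list and placeholder scheme here are illustrative only; the notebook ships its own Italian name and toponym files:

```python
import re

# Illustrative name list; the real tool loads Italian names/toponyms from files.
NAMES = ["Mario", "Giulia", "Luca"]

def pseudonymize(text, names=NAMES):
    """Replace each known name with a stable placeholder like NAME_1."""
    mapping = {name: f"NAME_{i + 1}" for i, name in enumerate(names)}
    # \b word boundaries avoid replacing substrings inside longer words
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

Using stable placeholders (rather than deleting names) keeps the conversation readable and lets the same person be tracked across messages.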

User activity

Here we calculate how many users are active (at least 1 message sent), how many are "very active" (arbitrarily defined as users above the 75% quantile of messages sent), and their relative frequency, expressed as a percentage.
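With pandas this split could be sketched as below; the column names ("from" for the sender, "text" for the message) are assumptions about the parsed dataframe:

```python
import pandas as pd

def activity_stats(df):
    """Count active users and 'very active' users (above the 75% quantile)."""
    per_user = df.groupby("from")["text"].count()   # messages sent per user
    active = (per_user >= 1).sum()                  # users with at least 1 message
    threshold = per_user.quantile(0.75)             # arbitrary 'very active' cutoff
    very_active = (per_user > threshold).sum()
    return {
        "active": int(active),
        "very_active": int(very_active),
        "very_active_pct": round(100 * very_active / active, 2),
    }
```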

Messages per day

Here we calculate the number of messages written to the chat each day.
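A pandas sketch, assuming a "date" column parseable into datetimes:

```python
import pandas as pd

def messages_per_day(df):
    """Count messages per calendar day from a 'date' column."""
    dates = pd.to_datetime(df["date"])
    return df.groupby(dates.dt.date).size()
```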

Autocoding

Here we use dictionary files for autocoding entire messages. The assumption is that a message in a large group chat can be considered a minimal conceptual unit, i.e. a text in which a user develops one main argument or touches on one main topic. Hence, if one or more rules from a dict fire for a given message, that message is autocoded as belonging to that dict.

The dictionary files are plain text files stored in /dict; the name of the file is used as the name of the code defined by the rules contained in the file.

The rules are written as regular expressions (regex), e.g. 'vaccin.*' will capture 'vaccine', 'vaccines', 'vaccination', and so on.

Regex allows the definition of fairly complex rules. As an example:

\d* ?((second)|(seconds)|(minute)|(minutes)|(hour)|(hours))

This rule will fire every time zero or more digits (0-9) are followed by an optional space, and then by a time unit (second, minute, hour). Used on a recipe, it would identify the passages in which a precise amount of time is expressed.
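For instance, with Python's re module the rule can be tried against a recipe-like sentence:

```python
import re

RULE = r"\d* ?((second)|(seconds)|(minute)|(minutes)|(hour)|(hours))"

match = re.search(RULE, "simmer the sauce for 30 minutes on low heat")
if match:
    print(match.group(0))  # the matched time expression
```

Note that regex alternatives are tried left to right, so here 'minute' matches before 'minutes' is even considered, and the captured span is '30 minute'; ordering plural forms first (or writing 'minutes?') captures the full word.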

For more details on regex and to develop and test new rules, check regex101.

The code has a weight system: if only one rule from the dict fires, the autocode has a weight of 1; if 2 rules fire, the weight is 2, and so on. MaxQDA does not (yet) support importing weighted codes, but it might in the future. Moreover, these values can be used for further analyses and are thus exported in the dataframe.
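Putting the pieces together, dictionary loading and weighting could be sketched as below. The .txt extension and helper names are assumptions; what matters is that the file stem becomes the code name and the weight counts how many rules fired:

```python
import re
from pathlib import Path

def load_dicts(dict_dir="dict"):
    """Map each code name (file stem) to its compiled regex rules."""
    dicts = {}
    for path in Path(dict_dir).glob("*.txt"):
        rules = [line.strip() for line in path.read_text().splitlines() if line.strip()]
        dicts[path.stem] = [re.compile(rule, re.IGNORECASE) for rule in rules]
    return dicts

def autocode(message, dicts):
    """Return {code: weight}, where weight = number of rules that fired."""
    weights = {}
    for code, rules in dicts.items():
        weight = sum(1 for rule in rules if rule.search(message))
        if weight:
            weights[code] = weight
    return weights
```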

Prevalence of codes

Here we calculate the prevalence of the codes. The value used as 'count' represents the weight of the code, i.e. the total number of times the rules fired across the messages. Normalization is performed by dividing the 'count' value by the number of messages and multiplying by 100.
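In other words, prevalence is expressed as rule fires per 100 messages; a minimal sketch:

```python
def prevalence(counts, n_messages):
    """Normalize total code weights to 'fires per 100 messages'."""
    return {code: 100 * count / n_messages for code, count in counts.items()}
```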

Export

Here we export the files to be used for further analyses:

test code and notes below