<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Giovanni Spitale</dc:creator>
  <dc:creator>Federico Germani</dc:creator>
  <dc:creator>Nikola Biller-Andorno</dc:creator>
  <dc:date>2021-09-28</dc:date>
  <dc:description>&amp;lt;p&amp;gt;This tool performs NLP analysis on Telegram chats, which can be exported as .json files from the official client, Telegram Desktop (v. 2.9.2.0).&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The files are parsed and their content is used to populate a message dataframe, which is then anonymized.&amp;lt;/p&amp;gt;
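
&amp;lt;p&amp;gt;The parsing and anonymization step can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the field names (messages, type, from_id, text) are assumptions about the Telegram Desktop export layout, and load_messages, anonymize and the pseudonym format are hypothetical. The resulting rows could be passed to pandas.DataFrame.&amp;lt;/p&amp;gt;

```python
import json
import random


def load_messages(export_text):
    """Parse a Telegram Desktop JSON export into a list of message rows."""
    export = json.loads(export_text)
    rows = []
    for item in export.get("messages", []):
        if item.get("type") != "message":
            continue  # skip service entries such as joins and leaves
        text = item.get("text", "")
        if isinstance(text, list):
            # Assumption: formatted messages are a list of strings and dicts.
            text = "".join(
                part if isinstance(part, str) else part.get("text", "")
                for part in text
            )
        rows.append({
            "id": item["id"],
            "date": item["date"],
            "sender": item.get("from_id"),
            "text": text,
        })
    return rows


def anonymize(rows, seed=None):
    """Replace each sender with a random pseudonym that is unique per sender."""
    rng = random.Random(seed)
    senders = sorted({row["sender"] for row in rows}, key=str)
    # Sampling without replacement guarantees distinct pseudonyms.
    codes = rng.sample(range(100000, 1000000), len(senders))
    mapping = {sender: f"user_{code}" for sender, code in zip(senders, codes)}
    return [dict(row, sender=mapping[row["sender"]]) for row in rows]


# Hypothetical export with two users and one service entry.
sample_export = json.dumps({"messages": [
    {"id": 1, "type": "message", "date": "2021-09-01T10:00:00",
     "from_id": "user1", "text": "ciao"},
    {"id": 2, "type": "service", "date": "2021-09-01T10:05:00",
     "action": "invite_members"},
    {"id": 3, "type": "message", "date": "2021-09-01T10:10:00",
     "from_id": "user2",
     "text": ["vedi ", {"type": "link", "text": "https://example.org"}]},
    {"id": 4, "type": "message", "date": "2021-09-02T08:00:00",
     "from_id": "user1", "text": "ok"},
]})
rows = anonymize(load_messages(sample_export), seed=42)
```

&amp;lt;p&amp;gt;In this sketch the mapping from real sender to pseudonym is discarded after use, so the output cannot be linked back to the original accounts.&amp;lt;/p&amp;gt;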

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;The software calculates and displays the following information:&amp;lt;/strong&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;user count (n of users, new users per day, removed users per day);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;message count (n and relative frequency of messages, messages per day);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;autocoded messages (anonymized message dataframe with code weights assigned to each message based on a customizable set of regex rules);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;prevalence of codes (n and relative frequency);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;prevalence of lemmas&amp;nbsp;(n and relative frequency);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;prevalence of lemmas segmented by autocode (n and relative frequency);&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;mean sentiment per day;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;mean sentiment&amp;nbsp;segmented by autocode.&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
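
&amp;lt;p&amp;gt;As an illustration of regex-based autocoding and code prevalence, the sketch below assigns each message a weight per code equal to the number of regex matches. The rule set, the weighting scheme and the function names are assumptions made for this example; the notebook's own rules and weights may differ.&amp;lt;/p&amp;gt;

```python
import re
from collections import Counter

# Hypothetical rule set: code name mapped to a regular expression.
RULES = {
    "vaccine": r"\bvaccin\w*",
    "green_pass": r"\bgreen\s?pass\b",
    "freedom": r"\blibert\w*",
}


def autocode(text, rules=RULES):
    """Return the weight of each code: the number of rule matches in the text."""
    weights = {}
    for code, pattern in rules.items():
        weights[code] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return weights


def code_prevalence(messages, rules=RULES):
    """Return (n, relative frequency) of messages carrying each code."""
    counts = Counter()
    for text in messages:
        for code, weight in autocode(text, rules).items():
            if weight:
                counts[code] += 1
    total = len(messages)
    return {
        code: (counts[code], counts[code] / total if total else 0.0)
        for code in rules
    }


messages = [
    "Il vaccino e il green pass",
    "Libertà! Libertà di scelta",
    "Buongiorno a tutti",
    "Vaccini e vaccinati",
]
prevalence = code_prevalence(messages)
```

&amp;lt;p&amp;gt;Keeping the rules in a plain dictionary makes the set easy to customize: adding a code means adding one name and one pattern.&amp;lt;/p&amp;gt;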

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;The software outputs:&amp;lt;/strong&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;messages_df_anon.csv - an anonymized file containing the sequential ID of each message, the date, the unique pseudonym of the sender, and the text;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;usercount_df.csv - user count dataframe;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;user_activity_df.csv - user activity dataframe;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;messagecount_df.csv - message count dataframe;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;messages_df_anon_coded.csv - an anonymized file containing the sequential ID of each message, the date, the unique pseudonym of the sender, the text, the codes, and the sentiment;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;autocode_freq_df.csv - general prevalence of codes;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;lemma_df.csv - lemma frequency;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;autocode_freq_df_[rule_name].csv - lemma frequency in coded messages, one file per rule;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;daily_sentiment_df.csv - daily sentiment;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;sentiment_by_code_df.csv - sentiment segmented by code;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;messages_anon.txt - anonymized text file generated from the message data frame, for easy import in other software for text mining or qualitative analysis;&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;messages_anon_MaxQDA.txt - anonymized text file generated from the message data frame, formatted specifically for MaxQDA (to track speakers and codes).&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;Dependencies:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;pandas (1.2.1)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;json&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;random&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;os&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;re&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;tqdm (4.62.2)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;datetime (4.3)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;matplotlib (3.4.3)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;spaCy (3.1.2) + it_core_news_md&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;wordcloud (1.8.1)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Counter (from collections)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;feel_it (1.0.3)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;torch (1.9.0)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;numpy (1.21.1)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;transformers (4.3.3)&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;This code is optimized for Italian.&amp;nbsp;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Lemma analysis is based on spaCy, which provides models for many other languages (&amp;lt;a href="https://spacy.io/models"&amp;gt;https://spacy.io/models&amp;lt;/a&amp;gt;), so the code can easily be adapted.&amp;lt;/p&amp;gt;
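
&amp;lt;p&amp;gt;One way to keep the lemma count independent of the language model is to pass the lemmatizer in as a function. The sketch below is an assumption about how this could be organized, not the notebook's code; lemma_frequencies and toy_lemmatize are hypothetical names, and the toy lemmatizer only lowercases and splits on whitespace.&amp;lt;/p&amp;gt;

```python
from collections import Counter


def lemma_frequencies(texts, lemmatize):
    """Return (lemma, n, relative frequency) rows, most common first."""
    counts = Counter()
    for text in texts:
        counts.update(lemmatize(text))
    total = sum(counts.values())
    return [(lemma, n, n / total) for lemma, n in counts.most_common()]


def toy_lemmatize(text):
    """Stand-in lemmatizer for illustration only: lowercase and split."""
    return text.lower().split()


lemma_rows = lemma_frequencies(["Green pass", "No green pass"], toy_lemmatize)
```

&amp;lt;p&amp;gt;With spaCy, the stand-in could be replaced by a function that loads a model with spacy.load("it_core_news_md") and returns token.lemma_ for each token; switching language then only means loading a different model name.&amp;lt;/p&amp;gt;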

&amp;lt;p&amp;gt;Sentiment analysis is performed using &amp;lt;a href="https://github.com/MilaNLProc/feel-it"&amp;gt;FEEL-IT: Emotion and Sentiment Classification for the Italian Language&amp;lt;/a&amp;gt;&amp;nbsp;(kudos to Federico Bianchi &amp;lt;f.bianchi@unibocconi.it&amp;gt;, Debora Nozza &amp;lt;debora.nozza@unibocconi.it&amp;gt;, and Dirk Hovy &amp;lt;dirk.hovy@unibocconi.it&amp;gt;). Their work is specific to Italian; to perform sentiment analysis in other languages, one could consider nltk.sentiment.&amp;lt;/p&amp;gt;
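
&amp;lt;p&amp;gt;The daily mean sentiment could be derived along these lines, assuming each message has already been labelled positive or negative by a classifier. The numeric encoding (+1 and -1) and the function name are assumptions made for this sketch; the notebook may aggregate sentiment differently.&amp;lt;/p&amp;gt;

```python
from collections import defaultdict

# Assumed numeric encoding of the classifier labels.
SCORES = {"positive": 1, "negative": -1}


def mean_sentiment_per_day(dates, labels):
    """Average the numeric sentiment of messages grouped by calendar day."""
    by_day = defaultdict(list)
    for date, label in zip(dates, labels):
        day = date[:10]  # ISO timestamps start with YYYY-MM-DD
        by_day[day].append(SCORES[label])
    return {
        day: sum(values) / len(values)
        for day, values in sorted(by_day.items())
    }


daily = mean_sentiment_per_day(
    ["2021-09-01T10:00:00", "2021-09-01T11:00:00", "2021-09-02T09:00:00"],
    ["positive", "negative", "negative"],
)
```

&amp;lt;p&amp;gt;The same grouping applied to codes instead of days would give the mean sentiment segmented by autocode.&amp;lt;/p&amp;gt;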

&amp;lt;p&amp;gt;The code is structured as a JupyterLab notebook and is heavily commented for future reference.&amp;lt;/p&amp;gt;</dc:description>
  <dc:identifier>https://doi.org/10.5281/zenodo.5533907</dc:identifier>
  <dc:identifier>oai:zenodo.org:5533907</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:publisher>Zenodo</dc:publisher>
  <dc:relation>https://doi.org/10.5281/zenodo.5533906</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>natural language processing</dc:subject>
  <dc:subject>NLP</dc:subject>
  <dc:subject>telegram</dc:subject>
  <dc:subject>covid-19</dc:subject>
  <dc:subject>social listening</dc:subject>
  <dc:subject>green pass</dc:subject>
  <dc:subject>vaccine</dc:subject>
  <dc:subject>freedom</dc:subject>
  <dc:subject>ethics</dc:subject>
  <dc:title>The TSL machine: parser, lemma analysis, sentiment analysis and autocoding for Telegram chats</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
</oai_dc:dc>