Software Open Access

The TSL machine: parser, lemma analysis, sentiment analysis and autocoding for Telegram chats

Giovanni Spitale; Federico Germani; Nikola Biller-Andorno


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.5534045</identifier>
  <creators>
    <creator>
      <creatorName>Giovanni Spitale</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6812-0979</nameIdentifier>
      <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation>
    </creator>
    <creator>
      <creatorName>Federico Germani</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-5604-0437</nameIdentifier>
      <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation>
    </creator>
    <creator>
      <creatorName>Nikola Biller-Andorno</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0001-7661-1324</nameIdentifier>
      <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation>
    </creator>
  </creators>
  <titles>
    <title>The TSL machine: parser, lemma analysis, sentiment analysis and autocoding for Telegram chats</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2021</publicationYear>
  <subjects>
    <subject>natural language processing</subject>
    <subject>NLP</subject>
    <subject>telegram</subject>
    <subject>covid-19</subject>
    <subject>social listening</subject>
    <subject>green pass</subject>
    <subject>vaccine</subject>
    <subject>freedom</subject>
    <subject>ethics</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2021-09-28</date>
  </dates>
  <language>en</language>
  <resourceType resourceTypeGeneral="Software"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/5534045</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5533906</relatedIdentifier>
  </relatedIdentifiers>
  <version>1.0.1</version>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;The purpose of this tool is to perform NLP analysis on Telegram chats. Chats can be exported as .json files from the official desktop client, Telegram Desktop (v. 2.9.2.0).&lt;/p&gt;

&lt;p&gt;The files are parsed, and their content is used to populate a message dataframe, which is then anonymized.&lt;/p&gt;
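
&lt;p&gt;A minimal sketch of this parsing and pseudonymization step, assuming the standard Telegram Desktop JSON export structure; the function name and the counter-based pseudonym scheme are illustrative, not the actual code of the notebook:&lt;/p&gt;

```python
import json

def anonymize_messages(export_path):
    """Parse a Telegram Desktop JSON export and pseudonymize senders."""
    with open(export_path, encoding="utf-8") as f:
        chat = json.load(f)

    pseudonyms = {}  # maps real sender ids to stable pseudonyms
    rows = []
    for msg in chat.get("messages", []):
        if msg.get("type") != "message":
            continue  # skip service messages (joins, pins, ...)
        sender = msg.get("from_id", "unknown")
        if sender not in pseudonyms:
            pseudonyms[sender] = f"user_{len(pseudonyms) + 1}"
        rows.append({
            "id": msg["id"],
            "date": msg["date"],
            "sender": pseudonyms[sender],
            "text": msg.get("text", ""),
        })
    return rows
```

&lt;p&gt;The pseudonym map is built on the fly, so the same sender always receives the same label while the real id never reaches the output rows.&lt;/p&gt;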

&lt;p&gt;&lt;strong&gt;The software calculates and displays the following information:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;user count (n of users, new users per day, removed users per day);&lt;/li&gt;
	&lt;li&gt;message count (n and relative frequency of messages, messages per day);&lt;/li&gt;
	&lt;li&gt;autocoded messages (anonymized message dataframe with code weights assigned to each message based on a customizable set of regex rules);&lt;/li&gt;
	&lt;li&gt;prevalence of codes (n and relative frequency);&lt;/li&gt;
	&lt;li&gt;prevalence of lemmas&amp;nbsp;(n and relative frequency);&lt;/li&gt;
	&lt;li&gt;prevalence of lemmas segmented by autocode (n and relative frequency);&lt;/li&gt;
	&lt;li&gt;mean sentiment per day;&lt;/li&gt;
	&lt;li&gt;mean sentiment&amp;nbsp;segmented by autocode.&lt;/li&gt;
&lt;/ul&gt;
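
&lt;p&gt;The regex-based autocoding can be sketched as follows; the rule set and the match-count weighting below are illustrative assumptions, since the actual rules ship with the notebook as a customizable set:&lt;/p&gt;

```python
import re

# Illustrative rule set: code name -> list of regex patterns (Italian keywords)
AUTOCODE_RULES = {
    "vaccine": [r"\bvaccin\w*", r"\bdos[ei]\b"],
    "green_pass": [r"\bgreen\s*pass\b", r"\bcertificat\w*"],
}

def autocode(text, rules=AUTOCODE_RULES):
    """Return a weight per code: the number of pattern matches in the text."""
    weights = {}
    for code, patterns in rules.items():
        hits = sum(len(re.findall(p, text, flags=re.IGNORECASE))
                   for p in patterns)
        if hits:
            weights[code] = hits
    return weights
```

&lt;p&gt;Counting matches rather than testing for mere presence gives each message a graded weight per code, which is what the prevalence and segmentation steps then aggregate.&lt;/p&gt;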

&lt;p&gt;&lt;strong&gt;The software outputs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;messages_df_anon.csv - an anonymized file containing the sequential id of each message, the date, the unique pseudonym of the sender, and the text;&lt;/li&gt;
	&lt;li&gt;usercount_df.csv - user count dataframe;&lt;/li&gt;
	&lt;li&gt;user_activity_df.csv - user activity dataframe;&lt;/li&gt;
	&lt;li&gt;messagecount_df.csv - message count dataframe;&lt;/li&gt;
	&lt;li&gt;messages_df_anon_coded.csv - an anonymized file containing the sequential id of each message, the date, the unique pseudonym of the sender, the text, the codes, and the sentiment;&lt;/li&gt;
	&lt;li&gt;autocode_freq_df.csv - general prevalence of codes;&lt;/li&gt;
	&lt;li&gt;lemma_df.csv - lemma frequency;&lt;/li&gt;
	&lt;li&gt;autocode_freq_df_[rule_name].csv - lemma frequency in coded messages, one file per rule;&lt;/li&gt;
	&lt;li&gt;daily_sentiment_df.csv - daily sentiment;&lt;/li&gt;
	&lt;li&gt;sentiment_by_code_df.csv - sentiment segmented by code;&lt;/li&gt;
	&lt;li&gt;messages_anon.txt - anonymized text file generated from the message dataframe, for easy import into other software for text mining or qualitative analysis;&lt;/li&gt;
	&lt;li&gt;messages_anon_MaxQDA.txt - anonymized text file generated from the message dataframe, formatted specifically for MAXQDA (to track speakers and codes).&lt;/li&gt;
&lt;/ul&gt;
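
&lt;p&gt;As an example of the aggregation behind these outputs, a daily sentiment table can be derived from the coded dataframe roughly like this, assuming sentiment labels have already been mapped to numeric scores (e.g. positive = 1, negative = 0) and the column names shown here:&lt;/p&gt;

```python
import pandas as pd

def daily_sentiment(messages_df):
    """Aggregate per-message sentiment scores into a mean score per day."""
    df = messages_df.copy()
    df["day"] = pd.to_datetime(df["date"]).dt.date
    return df.groupby("day")["sentiment"].mean().reset_index()
```

&lt;p&gt;The same groupby pattern, keyed on code instead of day, would produce the sentiment-by-code table.&lt;/p&gt;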

&lt;p&gt;&lt;strong&gt;Dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;pandas (1.2.1)&lt;/li&gt;
	&lt;li&gt;json&lt;/li&gt;
	&lt;li&gt;random&lt;/li&gt;
	&lt;li&gt;os&lt;/li&gt;
	&lt;li&gt;re&lt;/li&gt;
	&lt;li&gt;tqdm (4.62.2)&lt;/li&gt;
	&lt;li&gt;datetime (4.3)&lt;/li&gt;
	&lt;li&gt;matplotlib (3.4.3)&lt;/li&gt;
	&lt;li&gt;spaCy (3.1.2) + it_core_news_md&lt;/li&gt;
	&lt;li&gt;wordcloud (1.8.1)&lt;/li&gt;
	&lt;li&gt;Counter (collections)&lt;/li&gt;
	&lt;li&gt;feel_it (1.0.3)&lt;/li&gt;
	&lt;li&gt;torch (1.9.0)&lt;/li&gt;
	&lt;li&gt;numpy (1.21.1)&lt;/li&gt;
	&lt;li&gt;transformers (4.3.3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This code is optimized for Italian, however:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Lemma analysis is based on spaCy, which provides models for several other languages (&lt;a href="https://spacy.io/models"&gt;https://spacy.io/models&lt;/a&gt;), so it can easily be adapted.&lt;/li&gt;
	&lt;li&gt;Sentiment analysis is performed using &lt;a href="https://github.com/MilaNLProc/feel-it"&gt;FEEL-IT: Emotion and Sentiment Classification for the Italian Language&lt;/a&gt; (kudos to Federico Bianchi &amp;lt;f.bianchi@unibocconi.it&amp;gt;, Debora Nozza &amp;lt;debora.nozza@unibocconi.it&amp;gt;, and Dirk Hovy &amp;lt;dirk.hovy@unibocconi.it&amp;gt;). Their work is specific to Italian; to perform sentiment analysis in other languages, one could consider nltk.sentiment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is structured as a JupyterLab notebook, heavily commented for future reference.&lt;/p&gt;

&lt;p&gt;The software comes with a toy dataset consisting of Wikiquote excerpts pasted into a chat created by the research group. Have fun exploring it.&lt;/p&gt;</description>
    <description descriptionType="Other">{"references": ["Bianchi F, Nozza D, Hovy D. FEEL-IT: Emotion and Sentiment Classification for the Italian Language. In: Proceedings of the 11th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics; 2021. https://github.com/MilaNLProc/feel-it"]}</description>
  </descriptions>
</resource>