Software Open Access
Giovanni Spitale;
Federico Germani;
Nikola Biller - Andorno
<?xml version='1.0' encoding='utf-8'?> <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd"> <identifier identifierType="DOI">10.5281/zenodo.5533907</identifier> <creators> <creator> <creatorName>Giovanni Spitale</creatorName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6812-0979</nameIdentifier> <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation> </creator> <creator> <creatorName>Federico Germani</creatorName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-5604-0437</nameIdentifier> <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation> </creator> <creator> <creatorName>Nikola Biller - Andorno</creatorName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0001-7661-1324</nameIdentifier> <affiliation>University of Zurich - Institute of Biomedical Ethics and History of Medicine</affiliation> </creator> </creators> <titles> <title>The TSL machine: parser, lemma analysis, sentiment analysis and autocoding for Telegram chats</title> </titles> <publisher>Zenodo</publisher> <publicationYear>2021</publicationYear> <subjects> <subject>natural language processing</subject> <subject>NLP</subject> <subject>telegram</subject> <subject>covid-19</subject> <subject>social listening</subject> <subject>green pass</subject> <subject>vaccine</subject> <subject>freedom</subject> <subject>ethics</subject> </subjects> <dates> <date dateType="Issued">2021-09-28</date> </dates> <language>en</language> <resourceType resourceTypeGeneral="Software"/> <alternateIdentifiers> <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/5533907</alternateIdentifier> </alternateIdentifiers> 
<relatedIdentifiers> <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5533906</relatedIdentifier> </relatedIdentifiers> <version>1.0.0</version> <rightsList> <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights> <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights> </rightsList> <descriptions> <description descriptionType="Abstract"><p>This tool performs NLP analysis on Telegram chats, which can be exported as .json files from the official client, Telegram Desktop (v. 2.9.2.0).</p> <p>The files are parsed and their content is used to populate a message dataframe, which is then anonymized.</p> <p><strong>The software calculates and displays the following information:</strong></p> <ul> <li>user count (n of users, new users per day, removed users per day);</li> <li>message count (n and relative frequency of messages, messages per day);</li> <li>autocoded messages (anonymized message dataframe with code weights assigned to each message based on a customizable set of regex rules);</li> <li>prevalence of codes (n and relative frequency);</li> <li>prevalence of lemmas (n and relative frequency);</li> <li>prevalence of lemmas segmented by autocode (n and relative frequency);</li> <li>mean sentiment per day;</li> <li>mean sentiment segmented by autocode.</li> </ul> <p><strong>The software outputs:</strong></p> <ul> <li>messages_df_anon.csv - an anonymized file containing the sequential id of the message, the date, the unique pseudonym of the sender, and the text;</li> <li>usercount_df.csv - user count dataframe;</li> <li>user_activity_df.csv - user activity dataframe;</li> <li>messagecount_df.csv - message count dataframe;</li> <li>messages_df_anon_coded.csv - an anonymized file containing the sequential id of the message, the date, the unique pseudonym of the sender, the text,
the codes, and the sentiment;</li> <li>autocode_freq_df.csv - general prevalence of codes;</li> <li>lemma_df.csv - lemma frequency;</li> <li>autocode_freq_df_[rule_name].csv - lemma frequency in coded messages, one file per rule;</li> <li>daily_sentiment_df.csv - daily sentiment;</li> <li>sentiment_by_code_df.csv - sentiment segmented by code;</li> <li>messages_anon.txt - anonymized text file generated from the message dataframe, for easy import into other software for text mining or qualitative analysis;</li> <li>messages_anon_MaxQDA.txt - anonymized text file generated from the message dataframe, formatted specifically for MaxQDA (to track speakers and codes).</li> </ul> <p>Dependencies:</p> <ul> <li>pandas (1.2.1)</li> <li>json</li> <li>random</li> <li>os</li> <li>re</li> <li>tqdm (4.62.2)</li> <li>datetime (4.3)</li> <li>matplotlib (3.4.3)</li> <li>spaCy (3.1.2) + it_core_news_md</li> <li>wordcloud (1.8.1)</li> <li>Counter</li> <li>feel_it (1.0.3)</li> <li>torch (1.9.0)</li> <li>numpy (1.21.1)</li> <li>transformers (4.3.3)</li> </ul> <p>This code is optimized for Italian.</p> <p>Lemma analysis is based on spaCy, which provides models for several other languages (<a href="https://spacy.io/models">https://spacy.io/models</a>), so it can easily be adapted.</p> <p>Sentiment analysis is performed using <a href="https://github.com/MilaNLProc/feel-it">FEEL-IT: Emotion and Sentiment Classification for the Italian Language</a> (kudos to Federico Bianchi &lt;f.bianchi@unibocconi.it&gt;, Debora Nozza &lt;debora.nozza@unibocconi.it&gt;, and Dirk Hovy &lt;dirk.hovy@unibocconi.it&gt;). Their work is specific to Italian. To perform sentiment analysis in other languages, one could consider nltk.sentiment.</p> <p>The code is structured as a JupyterLab notebook, heavily commented for future reference.</p></description> <description descriptionType="Other">{"references": ["Bianchi F, Nozza D, Hovy D. 
FEEL-IT: Emotion and Sentiment Classification for the Italian Language. In: Proceedings of the 11th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics; 2021. https://github.com/MilaNLProc/feel-it"]}</description> </descriptions> </resource>
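The abstract above describes two steps that a short example may make more concrete: replacing each sender with a unique pseudonym, and assigning code weights to each message from a customizable set of regex rules. The following is a minimal sketch of that idea using only the Python standard library, not the notebook's actual implementation. The rule names and patterns, the `user_NNNN` pseudonym format, the message field names, and the choice of "number of regex matches" as the weight are all assumptions made for illustration.

```python
import re

# Hypothetical rule set (code name -> regex); not the authors' actual rules.
RULES = {
    "vaccine": r"\bvaccin\w*",
    "green_pass": r"\bgreen\s*pass\b",
    "freedom": r"\blibert\w*",
}


def pseudonymize(messages):
    """Replace each sender with a stable pseudonym (same sender, same pseudonym)."""
    pseudonyms = {}
    result = []
    for msg in messages:
        sender = msg["sender"]
        if sender not in pseudonyms:
            # Assumed format: sequential label in order of first appearance.
            pseudonyms[sender] = f"user_{len(pseudonyms) + 1:04d}"
        result.append({**msg, "sender": pseudonyms[sender]})
    return result


def autocode(text, rules=RULES):
    """Return {code: weight}; weight is assumed to be the count of regex matches."""
    return {
        code: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for code, pattern in rules.items()
    }


# Toy messages standing in for a parsed Telegram export (field names are assumed).
messages = [
    {"id": 1, "sender": "Alice", "text": "Il green pass limita la libertà"},
    {"id": 2, "sender": "Bob", "text": "Vaccino e vaccini: parliamone"},
    {"id": 3, "sender": "Alice", "text": "Buongiorno a tutti"},
]

# Anonymize first, then attach code weights to every message.
coded = [{**msg, "codes": autocode(msg["text"])} for msg in pseudonymize(messages)]

for row in coded:
    print(row["id"], row["sender"], row["codes"])
```

Keeping the rules in a plain dictionary means a new code can be added by editing data rather than logic, which is one way to read "customizable" in the abstract. A sequential pseudonym is easy to follow in a sketch, but it reveals the order in which users first posted, so a real pipeline might prefer randomly generated pseudonyms.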
Metric | All versions | This version |
---|---|---|
Views | 290 | 15 |
Downloads | 17 | 2 |
Data volume | 63.4 MB | 11.8 MB |
Unique views | 265 | 15 |
Unique downloads | 15 | 2 |