Structural invariants and semantic fingerprints in the "ego network" of words
Description
The plos_one folder contains the code and datasets needed to reproduce the results obtained in the paper. The code is embedded in a Jupyter notebook (PLOS_Open_Data.ipynb).
```
plos_one/
├─ bert_vec/
├─ botometer/
├─ egonets/
├─ soft_cluster/
├─ tokens/
├─ topics_distrib/
├─ topics_merge_steps/
├─ tweets/
├─ tweets_tokens/
├─ PLOS_Open_Data.ipynb
├─ tokenize_tools.py
└─ README.md
```
Four datasets were extracted from Twitter:
- Journalists from the NYT
- Science writers
- Random users #1
- Random users #2
Data files with content derived from these four datasets are generally prefixed with nyt, science, random1, and random2, respectively.
As Twitter requires that datasets extracted from its platform be anonymised, we only provide the IDs of the tweets and of the users who wrote them. It is up to the reader to rehydrate the datasets with the original content of the tweets.
E.g. tweets/nyt_twitter_anon.csv
| user_id | id |
|---|---|
| 9283938 | 6828728732837 |
| 9283938 | 239238209382093 |
| ... | ... |
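A file in this format can be loaded and summarised with the standard library alone. The sketch below uses an inline sample standing in for the real CSV; the header names `user_id` and `id` are taken from the table above, and everything else is illustrative.

```python
import csv
import io
from collections import Counter

# Inline stand-in for a file such as tweets/nyt_twitter_anon.csv.
sample = """user_id,id
9283938,6828728732837
9283938,239238209382093
1234567,982734982734982
"""

# Count how many (anonymised) tweets each user wrote.
tweets_per_user = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    tweets_per_user[row["user_id"]] += 1

print(tweets_per_user["9283938"])  # 2
```

To work with the real files, replace `io.StringIO(sample)` with `open("tweets/nyt_twitter_anon.csv")`.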
As we filter out many users and tweets (bots, retweets, low-activity users, etc.), we also provide the reduced list of tweets that were kept for later use.
E.g. tweets/nyt_filtered_twitter_anon.csv
Bot detection is performed with the online platform Botometer. The web service provides several kinds of probabilities that a given account is a bot. This information is stored in the directory botometer/.
E.g. botometer/botometer_random1.csv
| id | all_features | cap |
|---|---|---|
| 9283938 | .3 | .1 |
| 9283938 | .8 | .3 |
| ... | ... | ... |
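One typical use of these scores is to discard likely bots before further analysis. A minimal sketch, again with an inline sample in the format above; the 0.5 cut-off on the CAP (Complete Automation Probability) column is an illustrative assumption, not the threshold used in the paper.

```python
import csv
import io

# Inline stand-in for a file such as botometer/botometer_random1.csv.
sample = """id,all_features,cap
9283938,.3,.1
5550001,.8,.7
"""

# Keep accounts whose CAP is below an (assumed) 0.5 threshold.
humans = [
    row["id"]
    for row in csv.DictReader(io.StringIO(sample))
    if float(row["cap"]) < 0.5
]

print(humans)  # ['9283938']
```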
The directories egonets, soft_cluster, tokens, topics_merge_steps and tweets_tokens contain intermediate results, whose formatting is explained in the notebook.
The directory bert_vec/ contains the output of BERT (the embeddings) and the output of UMAP (the same embeddings after dimension reduction). Only the latter embeddings are provided.
The directory topics_distrib/ contains the dataframes where the topic distribution vectors are stored, e.g.:
topics_distrib/nyt_topics.csv
| ego_id | nb_rings | ring | 0 | 1 | ... | 99 |
|---|---|---|---|---|---|---|
| 9283938 | 6 | 1 | 0 | 0 | ... | .1 |
| 9283938 | 6 | 2 | .2 | .01 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
topics_distrib/nyt_topics_ego.csv
| ego_id | nb_rings | 0 | 1 | ... | 99 |
|---|---|---|---|---|---|
| 9283938 | 6 | 0 | .1 | ... | .2 |
| 3093939 | 6 | .2 | .002 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
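The per-ring rows of the first table relate to the per-ego vector of the second. A minimal sketch of collapsing one ego's ring-level distributions into a single ego-level vector, assuming a plain uniform average over rings (the actual aggregation is defined in the notebook, and the vectors are truncated to 3 topics here instead of 100):

```python
# Two rings of one ego, each a (truncated) topic distribution vector.
rings = [
    [0.0, 0.0, 0.1],   # ring 1
    [0.2, 0.01, 0.0],  # ring 2
]

# Uniform average over the rings, topic by topic (assumed aggregation).
n = len(rings)
ego_vector = [sum(col) / n for col in zip(*rings)]

print(ego_vector)  # [0.1, 0.005, 0.05]
```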
Figures are exported in both PDF and TIFF formats to the directory figures_export/.
The file tokenize_tools.py contains a class of tools for extracting the lemmatized tokens from the tweets. This class is kept separate from the notebook in order to keep the latter as readable as possible.
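To give an idea of the kind of helper such a module provides, here is a hedged sketch of a tweet-tokenizing class: lowercase the text, strip URLs and mentions, tokenize, and map tokens to lemmas. The class name, the regexes and the toy lemma table are illustrative assumptions; the real tokenize_tools.py may rely on a full lemmatizer (e.g. spaCy or NLTK).

```python
import re

class TokenizeTools:
    # Illustrative patterns for cleaning raw tweet text.
    URL_RE = re.compile(r"https?://\S+")
    MENTION_RE = re.compile(r"@\w+")
    TOKEN_RE = re.compile(r"[a-z']+")

    def __init__(self, lemma_table=None):
        # Tiny stand-in lemma table; a real pipeline would use a lemmatizer.
        self.lemma_table = lemma_table or {"wrote": "write", "tweets": "tweet"}

    def tokens(self, text):
        text = self.URL_RE.sub(" ", text.lower())
        text = self.MENTION_RE.sub(" ", text)
        return [self.lemma_table.get(t, t) for t in self.TOKEN_RE.findall(text)]

tools = TokenizeTools()
print(tools.tokens("@nyt Journalists wrote tweets https://t.co/x"))
# ['journalists', 'write', 'tweet']
```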