Published January 31, 2022 | Version 1.0
Journal article Restricted

Structural invariants and semantic fingerprints in the "ego network" of words

Description

The plos_one folder contains code and datasets necessary to reproduce the results obtained in the paper. The code is embedded in a Jupyter notebook (PLOS_Open_Data.ipynb).

plos_one/
├─ bert_vec/
├─ botometer/
├─ egonets/
├─ soft_cluster/
├─ tokens/
├─ topics_distrib/
├─ topics_merge_steps/
├─ tweets/
├─ tweets_tokens/
├─ PLOS_Open_Data.ipynb
├─ tokenize_tools.py
├─ README.md

Four datasets were extracted from Twitter:
- Journalists from the NYT
- Science writers
- Random users #1
- Random users #2
Data files with content derived from these four datasets are generally prefixed with nyt, science, random1 and random2, respectively.

As Twitter requires that datasets extracted from its platform be anonymised, we only provide the IDs of the tweets and of the users who wrote them. Readers must rehydrate the datasets themselves to recover the original content of the tweets.

E.g.: tweets/nyt_twitter_anon.csv

user_id id
9283938 6828728732837
9283938 239238209382093
... ...
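Rehydration is not part of the provided code; as a minimal sketch (the batch size reflects the Twitter API v2 tweet-lookup limit of 100 IDs per request, and the tweepy call mentioned in the comment is only one possible client):

```python
# Hypothetical sketch, not from the paper's code: rehydrating anonymised
# tweet ids is typically done in batches of at most 100 ids per API call.
def batch(ids, size=100):
    """Yield successive chunks of tweet ids."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

tweet_ids = [str(n) for n in range(250)]  # placeholder ids
print([len(chunk) for chunk in batch(tweet_ids)])  # → [100, 100, 50]

# With e.g. tweepy (>= 4) and a bearer token, each chunk could then be
# fetched with client.get_tweets(ids=chunk).
```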

As we filter out many users and tweets (bots, retweets, low-activity users, etc.), we also provide the reduced list of tweets that survive this filtering, saved for later use.

E.g.: tweets/nyt_filtered_twitter_anon.csv

Bot detection is performed with the online platform Botometer. The web service provides several kinds of probabilities that a given account is a bot. This information is stored in the directory botometer/.

E.g.: botometer/botometer_random1.csv


id all_features cap
9283938 .3 .1
9283938 .8 .3 
... ... ...
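How these scores are thresholded is described in the notebook; as an illustration only (the threshold value and the example rows below are hypothetical, while the column names mirror the table above):

```python
import pandas as pd

# Toy Botometer-style scores; the "cap" column is the Complete Automation
# Probability reported by the service.
scores = pd.DataFrame({
    "id": [9283938, 1234567, 7654321],
    "all_features": [0.3, 0.8, 0.1],
    "cap": [0.1, 0.6, 0.05],
})

BOT_CAP_THRESHOLD = 0.5  # hypothetical cutoff, not the paper's value
humans = scores[scores["cap"] < BOT_CAP_THRESHOLD]
print(humans["id"].tolist())  # → [9283938, 7654321]
```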


The directories egonets/, soft_cluster/, tokens/, topics_merge_steps/ and tweets_tokens/ contain intermediate results, whose formatting is explained in the notebook.


The directory bert_vec/ contains the output of BERT (the embeddings) and the output of UMAP (the embeddings after dimensionality reduction); only the latter, dimension-reduced embeddings are provided.

The directory topics_distrib/ contains the dataframes in which the topic distribution vectors are stored, for example:

topics_distrib/nyt_topics.csv

ego_id nb_rings ring 0 1 ... 99
9283938 6 1 0 0 ... .1
9283938 6 2 .2 .01 ... 0
... ...   ... ... ... ...

topics_distrib/nyt_topics_ego.csv

ego_id nb_rings 0 1 ... 99
9283938 6 0 .1 ... .2
3093939 6 .2 .002 ... 0
... ... ... ... ... ...
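These dataframes load directly with pandas. As a toy illustration with 3 topics instead of 100 (whether the ego-level vectors in *_topics_ego.csv are actually obtained by averaging the ring-level rows is not stated here; the unweighted mean below is only one plausible aggregation):

```python
import pandas as pd

# Ring-level topic distributions, as in nyt_topics.csv (3 topics only).
rings = pd.DataFrame({
    "ego_id": [9283938, 9283938],
    "nb_rings": [2, 2],
    "ring": [1, 2],
    "0": [0.0, 0.2],
    "1": [0.0, 0.1],
    "2": [0.1, 0.0],
})

topic_cols = ["0", "1", "2"]
# Collapse the rings of each ego into a single ego-level vector.
ego = rings.groupby("ego_id")[topic_cols].mean()
print(ego.loc[9283938].tolist())  # → [0.1, 0.05, 0.05]
```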

Figures are output in both PDF and TIFF format in the directory figures_export/.

The file tokenize_tools.py contains a class of tools for extracting lemmatized tokens from the tweets. This class is kept separate from the notebook in order to keep the latter as readable as possible.
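The actual extraction (including lemmatisation) lives in tokenize_tools.py; a minimal, hypothetical sketch of the kind of tweet cleaning involved, without the lemmatisation step, might look like:

```python
import re

# Illustrative only: strip URLs and @mentions, lowercase, keep word tokens.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def extract_tokens(tweet):
    """Return lowercase word tokens from a raw tweet string."""
    text = URL_RE.sub("", tweet)
    text = MENTION_RE.sub("", text)
    return re.findall(r"[a-z']+", text.lower())

print(extract_tokens("Great read by @nytimes! https://t.co/x Science wins"))
# → ['great', 'read', 'by', 'science', 'wins']
```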

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.


Access requests are accepted under the following condition:

Code and data in this repository support the submission of the paper "Structural invariants and semantic fingerprints in the "ego network" of words". Files are freely available to reviewers only (no need to provide a name).


Additional details

Funding

European Commission
SoBigData-PlusPlus - SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics 871042