Structural invariants and semantic fingerprints in the "ego network" of words
Description
The plos_one folder contains the code and datasets needed to reproduce the results obtained in the paper. The code is embedded in a Jupyter notebook (PLOS_Open_Data.ipynb).
```
plos_one/
├─ bert_vec/
├─ botometer/
├─ egonets/
├─ soft_cluster/
├─ tokens/
├─ topics_distrib/
├─ topics_merge_steps/
├─ tweets/
├─ tweets_tokens/
├─ PLOS_Open_Data.ipynb
├─ tokenize_tools.py
└─ README.md
```
Four datasets were extracted from Twitter:
- Journalists from the NYT
- Science writers
- Random users #1
- Random users #2
Data files with content derived from these four datasets are generally prefixed with nyt, science, random1, and random2, respectively.
As Twitter requires that datasets extracted from its platform be anonymised, we only provide the IDs of the tweets and of the users who wrote them. It is up to the reader to rehydrate the datasets with the original content of the tweets.
E.g. tweets/nyt_twitter_anon.csv
| user_id | id |
|---|---|
| 9283938 | 6828728732837 |
| 9283938 | 239238209382093 |
| ... | ... |
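A file in this format can be loaded and summarised with the standard library alone. The sketch below uses an inline sample standing in for the real CSV; the header names `user_id` and `id` are taken from the table above, and everything else is illustrative.

```python
import csv
import io
from collections import Counter

# Inline stand-in for a file such as tweets/nyt_twitter_anon.csv.
sample = """user_id,id
9283938,6828728732837
9283938,239238209382093
1234567,982734982734982
"""

# Count how many (anonymised) tweets each user wrote.
tweets_per_user = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    tweets_per_user[row["user_id"]] += 1

print(tweets_per_user["9283938"])  # 2
```

To work with the real files, replace `io.StringIO(sample)` with `open("tweets/nyt_twitter_anon.csv")`.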
As we filter out many users and tweets (bots, retweets, low-activity users, etc.), we also provide the reduced list of tweets that were kept for later use.
E.g. tweets/nyt_filtered_twitter_anon.csv
Bot detection is performed with the online platform Botometer. The web service provides several kinds of probabilities that a given account is a bot. This information is stored in the directory botometer/.
E.g. botometer/botometer_random1.csv
| id | all_features | cap |
|---|---|---|
| 9283938 | .3 | .1 |
| 9283938 | .8 | .3 |
| ... | ... | ... |
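One typical use of these scores is to discard likely bots before further analysis. A minimal sketch, again with an inline sample in the format above; the 0.5 cut-off on the CAP (Complete Automation Probability) column is an illustrative assumption, not the threshold used in the paper.

```python
import csv
import io

# Inline stand-in for a file such as botometer/botometer_random1.csv.
sample = """id,all_features,cap
9283938,.3,.1
5550001,.8,.7
"""

# Keep accounts whose CAP is below an (assumed) 0.5 threshold.
humans = [
    row["id"]
    for row in csv.DictReader(io.StringIO(sample))
    if float(row["cap"]) < 0.5
]

print(humans)  # ['9283938']
```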
The directories egonets, soft_cluster, tokens, topics_merge_steps and tweets_tokens contain intermediate results, whose formatting is explained in the notebook.
The directory bert_vec/ contains the output of BERT (the embeddings) and the output of UMAP (the same embeddings after dimension reduction). Only the latter embeddings are provided.
The directory topics_distrib/ contains the dataframes where the topic distribution vectors are stored, e.g.:
topics_distrib/nyt_topics.csv
| ego_id | nb_rings | ring | 0 | 1 | ... | 99 |
|---|---|---|---|---|---|---|
| 9283938 | 6 | 1 | 0 | 0 | ... | .1 |
| 9283938 | 6 | 2 | .2 | .01 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
topics_distrib/nyt_topics_ego.csv
| ego_id | nb_rings | 0 | 1 | ... | 99 |
|---|---|---|---|---|---|
| 9283938 | 6 | 0 | .1 | ... | .2 |
| 3093939 | 6 | .2 | .002 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
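The per-ring rows of the first table relate to the per-ego vector of the second. A minimal sketch of collapsing one ego's ring-level distributions into a single ego-level vector, assuming a plain uniform average over rings (the actual aggregation is defined in the notebook, and the vectors are truncated to 3 topics here instead of 100):

```python
# Two rings of one ego, each a (truncated) topic distribution vector.
rings = [
    [0.0, 0.0, 0.1],   # ring 1
    [0.2, 0.01, 0.0],  # ring 2
]

# Uniform average over the rings, topic by topic (assumed aggregation).
n = len(rings)
ego_vector = [sum(col) / n for col in zip(*rings)]

print(ego_vector)  # [0.1, 0.005, 0.05]
```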
Figures are exported in both PDF and TIFF formats to the directory figures_export/.
The file tokenize_tools.py contains a class of tools for extracting the lemmatized tokens from the tweets. This class is kept separate from the notebook in order to keep the latter as readable as possible.
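To give an idea of the kind of helper such a module provides, here is a hedged sketch of a tweet-tokenizing class: lowercase the text, strip URLs and mentions, tokenize, and map tokens to lemmas. The class name, the regexes and the toy lemma table are illustrative assumptions; the real tokenize_tools.py may rely on a full lemmatizer (e.g. spaCy or NLTK).

```python
import re

class TokenizeTools:
    # Illustrative patterns for cleaning raw tweet text.
    URL_RE = re.compile(r"https?://\S+")
    MENTION_RE = re.compile(r"@\w+")
    TOKEN_RE = re.compile(r"[a-z']+")

    def __init__(self, lemma_table=None):
        # Tiny stand-in lemma table; a real pipeline would use a lemmatizer.
        self.lemma_table = lemma_table or {"wrote": "write", "tweets": "tweet"}

    def tokens(self, text):
        text = self.URL_RE.sub(" ", text.lower())
        text = self.MENTION_RE.sub(" ", text)
        return [self.lemma_table.get(t, t) for t in self.TOKEN_RE.findall(text)]

tools = TokenizeTools()
print(tools.tokens("@nyt Journalists wrote tweets https://t.co/x"))
# ['journalists', 'write', 'tweet']
```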