{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of English data\n",
"This program imports the files generated by the parser (divided by month to put less load on the memory) and analyses them. In it **not language agnostic:** correct linguistic settings must be specified in **\"setting up\", \"NLP\" and \"additional rules\".** \n",
"\n",
"First some additional rules for NER are defined. Some are general, some are language-specific, as specified in the relevant section. \n",
"\n",
"The files are opened and preprocessed, then lemma frequency and NER frequency are calculated per each month and in the whole corpus. **important:** in case of empty months (so, when analysing less than one year of data) **remember to exclude them from the mean,** otherwise the mean will be distorted by the empty months. \n",
"\n",
"All the dataframes are exported as CSV files for further analisys or for data visualization."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up\n",
"**Remember to check the folder paths.**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"from tqdm.notebook import tqdm as tqdm #for progress bars\n",
"tqdm().pandas()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import os\n",
"import pandas as pd\n",
"import spacy\n",
"from collections import Counter\n",
"from datetime import datetime\n",
"\n",
"# Measure execution time\n",
"start_time = datetime.now()\n",
"\n",
"# folder paths (1. containing a substet homogeneous by language and divided by date; 2. for exports)\n",
"folder = Path(\"C://Users/copam/Desktop/jupyter test/exports_parser/EN\")\n",
"export = Path(\"C://Users/copam/Desktop/jupyter test/exports_NLP/EN\")\n",
"\n",
"# month files (if need be, add other months here and in the list below).\n",
"january = open(os.path.join(folder, \"1.txt\"),encoding=\"utf8\").read()\n",
"february = open(os.path.join(folder, \"2.txt\"),encoding=\"utf8\").read()\n",
"march = open(os.path.join(folder, \"3.txt\"),encoding=\"utf8\").read()\n",
"april = open(os.path.join(folder, \"4.txt\"),encoding=\"utf8\").read()\n",
"may = open(os.path.join(folder, \"5.txt\"),encoding=\"utf8\").read()\n",
"june = open(os.path.join(folder, \"6.txt\"),encoding=\"utf8\").read()\n",
"july = open(os.path.join(folder, \"7.txt\"),encoding=\"utf8\").read()\n",
"august = open(os.path.join(folder, \"8.txt\"),encoding=\"utf8\").read()\n",
"september = open(os.path.join(folder, \"9.txt\"),encoding=\"utf8\").read()\n",
"october = open(os.path.join(folder, \"10.txt\"),encoding=\"utf8\").read()\n",
"november = open(os.path.join(folder, \"11.txt\"),encoding=\"utf8\").read()\n",
"december = open(os.path.join(folder, \"12.txt\"),encoding=\"utf8\").read()\n",
"\n",
"months = [january,february,march, april, may, june, july, august, september, october, november, december]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP\n",
"**Remember to check the language and the max_lenght.**\n",
"References on models here: https://spacy.io/models"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_core_web_md')\n",
"spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS\n",
"nlp.max_length = 100000000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Additional rules for COVID19 NER\n",
"**Remember to adapt for the specific language (below the comment).** References here: https://spacy.io/usage/rule-based-matching#models-rules"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from spacy.pipeline import EntityRuler\n",
"ruler = EntityRuler(nlp)\n",
"ruler.overwrite_ents = True\n",
"patterns = [{\"label\": \"COVID19\", \"pattern\": \"Covid\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"covid\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Covid19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"covid19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Covid 19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Covid-19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"covid-19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"covid 19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Corvid\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"corvid\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Corvid19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"corvid19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Corvid 19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"corvid 19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Coronavirus\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"coronavirus\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Corona virus\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"Corona Virus\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"corona virus\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"COVID\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"COVID19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"COVID 19\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"2019-nCoV\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"ncov\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"nCoV\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"sars\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"SARS\"},\n",
" {\"label\": \"COVID19\", \"pattern\": \"SARS-CoV2\"},\n",
"## language-specific rules\n",
"## consider adding rules for scarce resources allocation, anxiety, ...\n",
" {\"label\": \"COVID19r\", \"pattern\": \"Wuhan-virus\"},\n",
" {\"label\": \"COVID19r\", \"pattern\": \"Wuhan-Virus\"},\n",
" {\"label\": \"COVID19r\", \"pattern\": \"China-virus\"},\n",
" {\"label\": \"COVID19r\", \"pattern\": \"China-Virus\"},\n",
" {\"label\": \"COVID19r\", \"pattern\": \"chinesischer virus\"},\n",
" {\"label\": \"COVID19r\", \"pattern\": \"Chinesischer Virus\"}\n",
" ]\n",
"ruler.add_patterns(patterns)\n",
"nlp.add_pipe(ruler)"
]
},
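{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check for the ruler (illustrative sample sentence, not part of the corpus; labels other than COVID19 depend on the model):\n",
"\n",
"```python\n",
"doc = nlp(\"The corvid 19 outbreak was traced back to Wuhan.\")\n",
"print([(ent.text, ent.label_) for ent in doc.ents])  # 'corvid 19' should surface as COVID19\n",
"```"
]
},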
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0487ffd261174fd493d1ebf0d291e2b8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"file_doc = {}\n",
"for x in tqdm(months):\n",
" file_doc[x] = nlp(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Definition of the preprocessing functions\n",
"def is_token_allowed(token):\n",
" if (not token or not token.string.strip() or token.is_stop or token.is_punct or token in spacy_stopwords):\n",
" return False\n",
" return True\n",
"\n",
"def preprocess_token(token):\n",
" if is_token_allowed:\n",
" return token.lemma_.strip().lower()"
]
},
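{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (an illustrative sketch, assuming the cells above have run):\n",
"\n",
"```python\n",
"doc = nlp(\"The hospitals were overwhelmed by the new cases.\")\n",
"print([preprocess_token(t) for t in doc if is_token_allowed(t)])\n",
"# stop words and punctuation are dropped; the remaining tokens are lemmatised and lowercased\n",
"```"
]
},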
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d92b1c50e5e04de496bc81c1609c9ec3",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# Actual preprocessing\n",
"complete_filtered_tokens = {}\n",
"for x in tqdm(months):\n",
" complete_filtered_tokens[x] = [preprocess_token(token) for token in file_doc[x] if is_token_allowed(token)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lemma frequency\n",
"calculates and exports lemma frequency, in general and per month."
]
},
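{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Counter(...).most_common()` returns (lemma, count) pairs sorted by descending count, e.g.:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"print(Counter([\"virus\", \"case\", \"virus\"]).most_common())\n",
"# [('virus', 2), ('case', 1)]\n",
"```"
]
},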
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e3b36172f2644c4181f53d7cd73fdd6b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"lemmas_freq = {}\n",
"for x in tqdm(months):\n",
" lemmas_freq[x] = Counter(complete_filtered_tokens[x]).most_common()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"## january\n",
"lemmas_freq_january = lemmas_freq[january]\n",
"df_lemmas_freq_january = pd.DataFrame(lemmas_freq_january, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_january.index += 1 \n",
"df_lemmas_freq_january.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-1.csv\"))\n",
"\n",
"## february\n",
"lemmas_freq_february = lemmas_freq[february]\n",
"df_lemmas_freq_february = pd.DataFrame(lemmas_freq_february, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_february.index += 1 \n",
"df_lemmas_freq_february.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-2.csv\"))\n",
"\n",
"## march\n",
"lemmas_freq_march = lemmas_freq[march]\n",
"df_lemmas_freq_march = pd.DataFrame(lemmas_freq_march, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_march.index += 1 \n",
"df_lemmas_freq_march.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-3.csv\"))\n",
"\n",
"## april\n",
"lemmas_freq_april = lemmas_freq[april]\n",
"df_lemmas_freq_april = pd.DataFrame(lemmas_freq_april, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_april.index += 1 \n",
"df_lemmas_freq_april.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-4.csv\"))\n",
"\n",
"## may\n",
"lemmas_freq_may = lemmas_freq[may]\n",
"df_lemmas_freq_may = pd.DataFrame(lemmas_freq_may, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_may.index += 1 \n",
"df_lemmas_freq_may.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-5.csv\"))\n",
"\n",
"## june\n",
"lemmas_freq_june = lemmas_freq[june]\n",
"df_lemmas_freq_june = pd.DataFrame(lemmas_freq_june, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_june.index += 1 \n",
"df_lemmas_freq_june.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-6.csv\"))\n",
"\n",
"## july\n",
"lemmas_freq_july = lemmas_freq[july]\n",
"df_lemmas_freq_july = pd.DataFrame(lemmas_freq_july, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_july.index += 1 \n",
"df_lemmas_freq_july.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-7.csv\"))\n",
"\n",
"## august\n",
"lemmas_freq_august = lemmas_freq[august]\n",
"df_lemmas_freq_august = pd.DataFrame(lemmas_freq_august, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_august.index += 1 \n",
"df_lemmas_freq_august.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-8.csv\"))\n",
"\n",
"## september\n",
"lemmas_freq_september = lemmas_freq[september]\n",
"df_lemmas_freq_september = pd.DataFrame(lemmas_freq_september, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_september.index += 1 \n",
"df_lemmas_freq_september.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-9.csv\"))\n",
"\n",
"## october\n",
"lemmas_freq_october = lemmas_freq[october]\n",
"df_lemmas_freq_october = pd.DataFrame(lemmas_freq_october, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_october.index += 1 \n",
"df_lemmas_freq_october.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-10.csv\"))\n",
"\n",
"## november\n",
"lemmas_freq_november = lemmas_freq[november]\n",
"df_lemmas_freq_november = pd.DataFrame(lemmas_freq_november, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_november.index += 1 \n",
"df_lemmas_freq_november.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-11.csv\"))\n",
"\n",
"## december\n",
"lemmas_freq_december = lemmas_freq[december]\n",
"df_lemmas_freq_december = pd.DataFrame(lemmas_freq_december, columns={'Lemma':[1],'Count':[2]})\n",
"df_lemmas_freq_december.index += 1 \n",
"df_lemmas_freq_december.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-12.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trends of the lemmas per month\n",
"\"general\" takes the data from the whole corpus. \"mean\" is the mean of the months.\n",
"\n",
"**Important:** in case of empty months (so, when analysing less than one year of data) **remember to exclude them from the mean!**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" lemma | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" e | \n",
" 432 | \n",
" 23 | \n",
" 39 | \n",
" 83 | \n",
" 27 | \n",
" 39 | \n",
" 34 | \n",
" 17 | \n",
" 34 | \n",
" 34 | \n",
" 34 | \n",
" 34 | \n",
" 34 | \n",
" 36.00 | \n",
"
\n",
" \n",
" 2 | \n",
" essere | \n",
" 245 | \n",
" 16 | \n",
" 11 | \n",
" 34 | \n",
" 15 | \n",
" 35 | \n",
" 8 | \n",
" 6 | \n",
" 24 | \n",
" 24 | \n",
" 24 | \n",
" 24 | \n",
" 24 | \n",
" 20.42 | \n",
"
\n",
" \n",
" 3 | \n",
" il | \n",
" 131 | \n",
" 13 | \n",
" 9 | \n",
" 8 | \n",
" 21 | \n",
" 12 | \n",
" 17 | \n",
" 11 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 10.92 | \n",
"
\n",
" \n",
" 4 | \n",
" l’ | \n",
" 124 | \n",
" 7 | \n",
" 12 | \n",
" 23 | \n",
" 7 | \n",
" 36 | \n",
" 30 | \n",
" 9 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 10.33 | \n",
"
\n",
" \n",
" 5 | \n",
" lo | \n",
" 107 | \n",
" 4 | \n",
" 2 | \n",
" 7 | \n",
" 4 | \n",
" 4 | \n",
" 5 | \n",
" 11 | \n",
" 14 | \n",
" 14 | \n",
" 14 | \n",
" 14 | \n",
" 14 | \n",
" 8.92 | \n",
"
\n",
" \n",
" 6 | \n",
" oms | \n",
" 104 | \n",
" 7 | \n",
" 4 | \n",
" 1 | \n",
" 3 | \n",
" 26 | \n",
" 27 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 8.67 | \n",
"
\n",
" \n",
" 7 | \n",
" coronavirus | \n",
" 97 | \n",
" 8 | \n",
" 10 | \n",
" 6 | \n",
" 6 | \n",
" 10 | \n",
" 10 | \n",
" 7 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8.08 | \n",
"
\n",
" \n",
" 8 | \n",
" pandemia | \n",
" 72 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" 10 | \n",
" 4 | \n",
" 3 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 6.00 | \n",
"
\n",
" \n",
" 9 | \n",
" contagio | \n",
" 70 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 13 | \n",
" 1 | \n",
" 1 | \n",
" 4 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 5.83 | \n",
"
\n",
" \n",
" 10 | \n",
" dell’ | \n",
" 63 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 9 | \n",
" 21 | \n",
" 18 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 5.25 | \n",
"
\n",
" \n",
" 11 | \n",
" caso | \n",
" 59 | \n",
" 6 | \n",
" 1 | \n",
" 0 | \n",
" 24 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4.92 | \n",
"
\n",
" \n",
" 12 | \n",
" virus | \n",
" 50 | \n",
" 4 | \n",
" 4 | \n",
" 3 | \n",
" 0 | \n",
" 11 | \n",
" 17 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 4.17 | \n",
"
\n",
" \n",
" 13 | \n",
" nuovo | \n",
" 47 | \n",
" 5 | \n",
" 1 | \n",
" 10 | \n",
" 7 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 3.92 | \n",
"
\n",
" \n",
" 14 | \n",
" cina | \n",
" 45 | \n",
" 6 | \n",
" 5 | \n",
" 0 | \n",
" 5 | \n",
" 5 | \n",
" 14 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 3.75 | \n",
"
\n",
" \n",
" 15 | \n",
" vittima | \n",
" 44 | \n",
" 4 | \n",
" 0 | \n",
" 2 | \n",
" 8 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 3.67 | \n",
"
\n",
" \n",
" 16 | \n",
" morto | \n",
" 42 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 16 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 3.50 | \n",
"
\n",
" \n",
" 17 | \n",
" lungo | \n",
" 42 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 3.50 | \n",
"
\n",
" \n",
" 18 | \n",
" della | \n",
" 42 | \n",
" 6 | \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
" 3 | \n",
" 3 | \n",
" 5 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 3.50 | \n",
"
\n",
" \n",
" 19 | \n",
" vaccinare | \n",
" 41 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 3.42 | \n",
"
\n",
" \n",
" 20 | \n",
" totale | \n",
" 38 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 3.17 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" lemma total 1 2 3 4 5 6 7 8 9 10 11 12 mean\n",
"1 e 432 23 39 83 27 39 34 17 34 34 34 34 34 36.00\n",
"2 essere 245 16 11 34 15 35 8 6 24 24 24 24 24 20.42\n",
"3 il 131 13 9 8 21 12 17 11 8 8 8 8 8 10.92\n",
"4 l’ 124 7 12 23 7 36 30 9 0 0 0 0 0 10.33\n",
"5 lo 107 4 2 7 4 4 5 11 14 14 14 14 14 8.92\n",
"6 oms 104 7 4 1 3 26 27 6 6 6 6 6 6 8.67\n",
"7 coronavirus 97 8 10 6 6 10 10 7 8 8 8 8 8 8.08\n",
"8 pandemia 72 0 1 1 3 10 4 3 10 10 10 10 10 6.00\n",
"9 contagio 70 1 0 0 13 1 1 4 10 10 10 10 10 5.83\n",
"10 dell’ 63 1 2 7 9 21 18 5 0 0 0 0 0 5.25\n",
"11 caso 59 6 1 0 24 2 4 2 4 4 4 4 4 4.92\n",
"12 virus 50 4 4 3 0 11 17 1 2 2 2 2 2 4.17\n",
"13 nuovo 47 5 1 10 7 1 1 2 4 4 4 4 4 3.92\n",
"14 cina 45 6 5 0 5 5 14 0 2 2 2 2 2 3.75\n",
"15 vittima 44 4 0 2 8 0 0 0 6 6 6 6 6 3.67\n",
"16 morto 42 1 1 0 16 1 1 2 4 4 4 4 4 3.50\n",
"17 lungo 42 0 0 2 0 0 0 0 8 8 8 8 8 3.50\n",
"18 della 42 6 0 3 2 3 3 5 4 4 4 4 4 3.50\n",
"19 vaccinare 41 0 1 0 0 0 0 0 8 8 8 8 8 3.42\n",
"20 totale 38 1 0 1 4 1 0 1 6 6 6 6 6 3.17"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# List of all lemma dataframes\n",
"df_lemmas_freq_all = [df_lemmas_freq_january, \n",
" df_lemmas_freq_february, \n",
" df_lemmas_freq_march, \n",
" df_lemmas_freq_april, \n",
" df_lemmas_freq_may, \n",
" df_lemmas_freq_june, \n",
" df_lemmas_freq_july, \n",
" df_lemmas_freq_august,\n",
" df_lemmas_freq_september,\n",
" df_lemmas_freq_october,\n",
" df_lemmas_freq_november,\n",
" df_lemmas_freq_december]\n",
"\n",
"# Loop for index and series\n",
"L = []\n",
"for x in df_lemmas_freq_all:\n",
" x = x.set_index('Lemma')\n",
" L.append(pd.Series(x.values.tolist(), index=x.index))\n",
"\n",
"# All together \n",
"df_lemmas_freq_all = pd.concat(L, axis=1, keys=('1','2','3','4','5','6','7','8','9','10','11','12'))\n",
"for month in df_lemmas_freq_all:\n",
" df_lemmas_freq_all[month] = df_lemmas_freq_all[month].str[0]\n",
"\n",
"df_lemmas_freq_all = df_lemmas_freq_all.fillna(0)\n",
"df_lemmas_freq_all = df_lemmas_freq_all.astype('int')\n",
"\n",
"# Calculate the total\n",
"lemmasums = df_lemmas_freq_all.iloc[:, [0,1,2,3,4,5,6,7,8,9,10,11]].sum(axis=1)\n",
"df_lemmas_freq_all = pd.concat([df_lemmas_freq_all, lemmasums], axis = 1)\n",
"df_lemmas_freq_all = df_lemmas_freq_all.rename(columns={0: \"total\"})\n",
"\n",
"# Calculate the mean of the months\n",
"lemmameans = df_lemmas_freq_all.iloc[:, [0,1,2,3,4,5,6,7,8,9,10,11]].mean(axis=1) ## In case of empty months, exclude them from the mean here!\n",
"df_lemmas_freq_all = pd.concat([df_lemmas_freq_all, lemmameans], axis = 1)\n",
"df_lemmas_freq_all = df_lemmas_freq_all.rename(columns={0: \"mean\"})\n",
"df_lemmas_freq_all[\"mean\"] = (df_lemmas_freq_all[\"mean\"].astype('float')).round(2)\n",
"\n",
"# Reorder and reindex\n",
"total_col = df_lemmas_freq_all.pop(\"total\")\n",
"df_lemmas_freq_all.insert(0, \"total\", total_col)\n",
"df_lemmas_freq_all.reset_index(level=0, inplace=True)\n",
"df_lemmas_freq_all = df_lemmas_freq_all.sort_values(by=['total'], ascending=False)\n",
"df_lemmas_freq_all.index = pd.RangeIndex(len(df_lemmas_freq_all.index))\n",
"df_lemmas_freq_all.index += 1 \n",
"df_lemmas_freq_all[\"lemma\"] = df_lemmas_freq_all[\"index\"]\n",
"df_lemmas_freq_all = df_lemmas_freq_all[['lemma','total','1','2','3','4','5','6','7','8','9','10','11','12','mean']]\n",
"\n",
"\n",
"# Export and display\n",
"df_lemmas_freq_all.to_csv(os.path.join(export, \"lemmas\\lemmas-frequency-timeseries.csv\"))\n",
"display(df_lemmas_freq_all.head(20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NER\n",
"Calculates and exports named entity frequency, in general and per month. **Remember to check the export name.** References on NER tags here: https://spacy.io/api/annotation#named-entities"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3adf2cd39c0948679b145b4997ab842a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "33f1caf2dfff4bebb947032f8a9e8c08",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"entity_list = {}\n",
"for x in tqdm(months):\n",
" entity_list[x] = []\n",
" for ent in file_doc[x].ents:\n",
" entity_list[x].append((ent.text, ent.label_))\n",
"\n",
"entity_counts = {} \n",
"for x in tqdm(months):\n",
" entity_counts[x] = Counter(entity_list[x]).most_common()\n",
" enticat, count = zip(*entity_counts[x])\n",
" entity, category = zip(*enticat)\n",
" entity_counts[x] = tuple(zip(entity, category,count))\n",
"\n",
"## january\n",
"entity_counts_january = entity_counts[january]\n",
"df_entity_counts_january = pd.DataFrame(entity_counts_january, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_january.index += 1 \n",
"df_entity_counts_january.to_csv(os.path.join(export, \"entities\\entities-frequency-1.csv\"))\n",
"\n",
"## february\n",
"entity_counts_february = entity_counts[february]\n",
"df_entity_counts_february = pd.DataFrame(entity_counts_february, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_february.index += 1 \n",
"df_entity_counts_february.to_csv(os.path.join(export, \"entities\\entities-frequency-2.csv\"))\n",
"\n",
"## march\n",
"entity_counts_march = entity_counts[march]\n",
"df_entity_counts_march = pd.DataFrame(entity_counts_march, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_march.index += 1 \n",
"df_entity_counts_march.to_csv(os.path.join(export, \"entities\\entities-frequency-3.csv\"))\n",
"\n",
"## april\n",
"entity_counts_april = entity_counts[april]\n",
"df_entity_counts_april = pd.DataFrame(entity_counts_april, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_april.index += 1 \n",
"df_entity_counts_april.to_csv(os.path.join(export, \"entities\\entities-frequency-4.csv\"))\n",
"\n",
"## may\n",
"entity_counts_may = entity_counts[may]\n",
"df_entity_counts_may = pd.DataFrame(entity_counts_may, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_may.index += 1 \n",
"df_entity_counts_may.to_csv(os.path.join(export, \"entities\\entities-frequency-5.csv\"))\n",
"\n",
"## june\n",
"entity_counts_june = entity_counts[june]\n",
"df_entity_counts_june = pd.DataFrame(entity_counts_june, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_june.index += 1 \n",
"df_entity_counts_june.to_csv(os.path.join(export, \"entities\\entities-frequency-6.csv\"))\n",
"\n",
"## july\n",
"entity_counts_july = entity_counts[july]\n",
"df_entity_counts_july = pd.DataFrame(entity_counts_july, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_july.index += 1 \n",
"df_entity_counts_july.to_csv(os.path.join(export, \"entities\\entities-frequency-7.csv\"))\n",
"\n",
"## august\n",
"entity_counts_august = entity_counts[august]\n",
"df_entity_counts_august = pd.DataFrame(entity_counts_august, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_august.index += 1 \n",
"df_entity_counts_august.to_csv(os.path.join(export, \"entities\\entities-frequency-8.csv\"))\n",
"\n",
"## september\n",
"entity_counts_september = entity_counts[september]\n",
"df_entity_counts_september = pd.DataFrame(entity_counts_september, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_september.index += 1 \n",
"df_entity_counts_september.to_csv(os.path.join(export, \"entities\\entities-frequency-9.csv\"))\n",
"\n",
"## october\n",
"entity_counts_october = entity_counts[october]\n",
"df_entity_counts_october = pd.DataFrame(entity_counts_october, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_october.index += 1 \n",
"df_entity_counts_october.to_csv(os.path.join(export, \"entities\\entities-frequency-10.csv\"))\n",
"\n",
"## november\n",
"entity_counts_november = entity_counts[november]\n",
"df_entity_counts_november = pd.DataFrame(entity_counts_november, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_november.index += 1 \n",
"df_entity_counts_november.to_csv(os.path.join(export, \"entities\\entities-frequency-11.csv\"))\n",
"\n",
"## december\n",
"entity_counts_december = entity_counts[december]\n",
"df_entity_counts_december = pd.DataFrame(entity_counts_december, columns={'Entity':[1],'Category':[2],'Count':[3]})\n",
"df_entity_counts_december.index += 1 \n",
"df_entity_counts_december.to_csv(os.path.join(export, \"entities\\entities-frequency-12.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trends of the entities per month\n",
"\"general\" takes the data from the whole corpus. \"mean\" is the mean of the months.\n",
"\n",
"**Important:** in case of empty months (so, when analysing less than one year of data) **remember to exclude them from the mean!**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Oms | \n",
" ORG | \n",
" 96 | \n",
" 5 | \n",
" 4 | \n",
" 0 | \n",
" 1 | \n",
" 23 | \n",
" 27 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 8.00 | \n",
"
\n",
" \n",
" 2 | \n",
" coronavirus | \n",
" COVID19 | \n",
" 79 | \n",
" 7 | \n",
" 7 | \n",
" 3 | \n",
" 4 | \n",
" 7 | \n",
" 7 | \n",
" 4 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 6.58 | \n",
"
\n",
" \n",
" 3 | \n",
" ANSA | \n",
" ORG | \n",
" 44 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 3.67 | \n",
"
\n",
" \n",
" 4 | \n",
" Cina | \n",
" LOC | \n",
" 43 | \n",
" 6 | \n",
" 4 | \n",
" 0 | \n",
" 5 | \n",
" 4 | \n",
" 14 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 3.58 | \n",
"
\n",
" \n",
" 5 | \n",
" Germania | \n",
" LOC | \n",
" 37 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 3.08 | \n",
"
\n",
" \n",
" 6 | \n",
" Francia | \n",
" LOC | \n",
" 22 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.83 | \n",
"
\n",
" \n",
" 7 | \n",
" Wuhan | \n",
" LOC | \n",
" 21 | \n",
" 11 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.75 | \n",
"
\n",
" \n",
" 8 | \n",
" Stati Uniti | \n",
" LOC | \n",
" 21 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 5 | \n",
" 3 | \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.75 | \n",
"
\n",
" \n",
" 9 | \n",
" Berlino | \n",
" LOC | \n",
" 20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.67 | \n",
"
\n",
" \n",
" 10 | \n",
" Mosca | \n",
" LOC | \n",
" 20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.67 | \n",
"
\n",
" \n",
" 11 | \n",
" Paese | \n",
" LOC | \n",
" 18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.50 | \n",
"
\n",
" \n",
" 12 | \n",
" Pechino | \n",
" LOC | \n",
" 18 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.50 | \n",
"
\n",
" \n",
" 13 | \n",
" Italia | \n",
" LOC | \n",
" 16 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.33 | \n",
"
\n",
" \n",
" 14 | \n",
" Covid | \n",
" MISC | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.08 | \n",
"
\n",
" \n",
" 15 | \n",
" Parigi | \n",
" LOC | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.08 | \n",
"
\n",
" \n",
" 16 | \n",
" Russia | \n",
" LOC | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.08 | \n",
"
\n",
" \n",
" 17 | \n",
" Usa | \n",
" LOC | \n",
" 12 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.00 | \n",
"
\n",
" \n",
" 18 | \n",
" Ons | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 19 | \n",
" Organizzazione mondiale della sanità | \n",
" ORG | \n",
" 10 | \n",
" 1 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 20 | \n",
" Sun Chunlan | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 5 6 \\\n",
"1 Oms ORG 96 5 4 0 1 23 27 \n",
"2 coronavirus COVID19 79 7 7 3 4 7 7 \n",
"3 ANSA ORG 44 1 0 0 0 0 0 \n",
"4 Cina LOC 43 6 4 0 5 4 14 \n",
"5 Germania LOC 37 0 0 0 6 1 0 \n",
"6 Francia LOC 22 0 0 0 1 0 1 \n",
"7 Wuhan LOC 21 11 1 0 0 2 7 \n",
"8 Stati Uniti LOC 21 0 1 0 5 3 2 \n",
"9 Berlino LOC 20 0 0 0 0 0 0 \n",
"10 Mosca LOC 20 0 0 0 0 0 0 \n",
"11 Paese LOC 18 0 0 0 5 1 1 \n",
"12 Pechino LOC 18 4 0 0 2 2 10 \n",
"13 Italia LOC 16 1 1 0 3 1 0 \n",
"14 Covid MISC 13 0 0 0 1 1 1 \n",
"15 Parigi LOC 13 0 0 0 3 0 0 \n",
"16 Russia LOC 13 0 0 0 2 0 1 \n",
"17 Usa LOC 12 2 0 0 3 5 1 \n",
"18 Ons PER 10 0 0 0 0 0 0 \n",
"19 Organizzazione mondiale della sanità ORG 10 1 2 0 0 3 3 \n",
"20 Sun Chunlan PER 10 0 0 0 0 0 0 \n",
"\n",
" 7 8 9 10 11 12 mean \n",
"1 6 6 6 6 6 6 8.00 \n",
"2 4 8 8 8 8 8 6.58 \n",
"3 3 8 8 8 8 8 3.67 \n",
"4 0 2 2 2 2 2 3.58 \n",
"5 0 6 6 6 6 6 3.08 \n",
"6 0 4 4 4 4 4 1.83 \n",
"7 0 0 0 0 0 0 1.75 \n",
"8 0 2 2 2 2 2 1.75 \n",
"9 0 4 4 4 4 4 1.67 \n",
"10 0 4 4 4 4 4 1.67 \n",
"11 1 2 2 2 2 2 1.50 \n",
"12 0 0 0 0 0 0 1.50 \n",
"13 10 0 0 0 0 0 1.33 \n",
"14 0 2 2 2 2 2 1.08 \n",
"15 0 2 2 2 2 2 1.08 \n",
"16 0 2 2 2 2 2 1.08 \n",
"17 1 0 0 0 0 0 1.00 \n",
"18 0 2 2 2 2 2 0.83 \n",
"19 1 0 0 0 0 0 0.83 \n",
"20 0 2 2 2 2 2 0.83 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Merging entity and category (for better indexing)\n",
"df_entity_counts_january['Entity / Category'] = df_entity_counts_january['Entity'] + ' / ' + df_entity_counts_january['Category']\n",
"df1 = df_entity_counts_january[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_february['Entity / Category'] = df_entity_counts_february['Entity'] + ' / ' + df_entity_counts_february['Category']\n",
"df2 = df_entity_counts_february[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_march['Entity / Category'] = df_entity_counts_march['Entity'] + ' / ' + df_entity_counts_march['Category']\n",
"df3 = df_entity_counts_march[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_april['Entity / Category'] = df_entity_counts_april['Entity'] + ' / ' + df_entity_counts_april['Category']\n",
"df4 = df_entity_counts_april[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_may['Entity / Category'] = df_entity_counts_may['Entity'] + ' / ' + df_entity_counts_may['Category']\n",
"df5 = df_entity_counts_may[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_june['Entity / Category'] = df_entity_counts_june['Entity'] + ' / ' + df_entity_counts_june['Category']\n",
"df6 = df_entity_counts_june[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_july['Entity / Category'] = df_entity_counts_july['Entity'] + ' / ' + df_entity_counts_july['Category']\n",
"df7 = df_entity_counts_july[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_august['Entity / Category'] = df_entity_counts_august['Entity'] + ' / ' + df_entity_counts_august['Category']\n",
"df8 = df_entity_counts_august[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_september['Entity / Category'] = df_entity_counts_september['Entity'] + ' / ' + df_entity_counts_september['Category']\n",
"df9 = df_entity_counts_september[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_october['Entity / Category'] = df_entity_counts_october['Entity'] + ' / ' + df_entity_counts_october['Category']\n",
"df10 = df_entity_counts_october[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_november['Entity / Category'] = df_entity_counts_november['Entity'] + ' / ' + df_entity_counts_november['Category']\n",
"df11 = df_entity_counts_november[['Entity / Category', 'Count']]\n",
"\n",
"df_entity_counts_december['Entity / Category'] = df_entity_counts_december['Entity'] + ' / ' + df_entity_counts_december['Category']\n",
"df12 = df_entity_counts_december[['Entity / Category', 'Count']]\n",
"\n",
"\n",
"# List of all entity dataframes\n",
"df_ent_freq_all = [df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12]\n",
"\n",
"# Loop for index and series\n",
"L = []\n",
"for x in df_ent_freq_all:\n",
" x = x.set_index('Entity / Category')\n",
" L.append(pd.Series(x.values.tolist(), index=x.index))\n",
"\n",
"# All together \n",
"df_ent_freq_all = pd.concat(L, axis=1, keys=('1','2','3','4','5','6','7','8','9','10','11','12'))\n",
"for month in df_ent_freq_all:\n",
" df_ent_freq_all[month] = df_ent_freq_all[month].str[0]\n",
"\n",
"df_ent_freq_all = df_ent_freq_all.fillna(0)\n",
"df_ent_freq_all = df_ent_freq_all.astype('int')\n",
"\n",
"# Calculate the total\n",
"entysums = df_ent_freq_all.iloc[:, [0,1,2,3,4,5,6,7,8,9,10,11]].sum(axis=1)\n",
"df_ent_freq_all = pd.concat([df_ent_freq_all, entysums], axis = 1)\n",
"df_ent_freq_all = df_ent_freq_all.rename(columns={0: \"total\"})\n",
"\n",
"# Calculate the mean of the months\n",
"entymeans = df_ent_freq_all.iloc[:, [0,1,2,3,4,5,6,7,8,9,10,11]].mean(axis=1) ## In case of empty months, exclude them from the mean here!\n",
"df_ent_freq_all = pd.concat([df_ent_freq_all, entymeans], axis = 1)\n",
"df_ent_freq_all = df_ent_freq_all.rename(columns={0: \"mean\"})\n",
"df_ent_freq_all[\"mean\"] = (df_ent_freq_all[\"mean\"].astype('float')).round(2)\n",
"\n",
"# Reorder and reindex\n",
"total_col_e = df_ent_freq_all.pop(\"total\")\n",
"df_ent_freq_all.insert(0, \"total\", total_col_e)\n",
"df_ent_freq_all.reset_index(level=0, inplace=True)\n",
"df_ent_freq_all = df_ent_freq_all.rename(columns={\"index\": \"enticat\"})\n",
"df_ent_freq_all = df_ent_freq_all.sort_values(by=['total'], ascending=False)\n",
"df_ent_freq_all.index = pd.RangeIndex(len(df_ent_freq_all.index))\n",
"df_ent_freq_all.index += 1 \n",
"df_ent_freq_all[['entity','category']] = df_ent_freq_all.enticat.str.split(\" / \",expand=True,)\n",
"df_ent_freq_all = df_ent_freq_all[['entity','category','total','1','2','3','4','5','6','7','8','9','10','11','12','mean']]\n",
"\n",
"# Export and display\n",
"df_ent_freq_all.to_csv(os.path.join(export, \"entities\\entities-frequency-timeseries.csv\"))\n",
"display(df_ent_freq_all.head(20))"
]
},
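{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the corpus covers less than a full year, the empty monthly columns are all zeros and drag the mean down. A minimal sketch of how the mean above could be restricted to non-empty months (`active_months` and `month_cols` are illustrative names, not defined elsewhere in this notebook):\n",
"\n",
"```python\n",
"# Keep only the monthly columns that contain at least one count\n",
"month_cols = [str(m) for m in range(1, 13)]\n",
"active_months = [m for m in month_cols if df_ent_freq_all[m].sum() > 0]\n",
"df_ent_freq_all['mean'] = df_ent_freq_all[active_months].mean(axis=1).round(2)\n",
"```"
]
},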
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Locations\n",
"Remember to change the category according to the linguistic model!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Cina | \n",
" LOC | \n",
" 43 | \n",
" 6 | \n",
" 4 | \n",
" 0 | \n",
" 5 | \n",
" 4 | \n",
" 14 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 3.58 | \n",
"
\n",
" \n",
" 2 | \n",
" Germania | \n",
" LOC | \n",
" 37 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 3.08 | \n",
"
\n",
" \n",
" 3 | \n",
" Francia | \n",
" LOC | \n",
" 22 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.83 | \n",
"
\n",
" \n",
" 4 | \n",
" Wuhan | \n",
" LOC | \n",
" 21 | \n",
" 11 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.75 | \n",
"
\n",
" \n",
" 5 | \n",
" Stati Uniti | \n",
" LOC | \n",
" 21 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 5 | \n",
" 3 | \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.75 | \n",
"
\n",
" \n",
" 6 | \n",
" Berlino | \n",
" LOC | \n",
" 20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.67 | \n",
"
\n",
" \n",
" 7 | \n",
" Mosca | \n",
" LOC | \n",
" 20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" 1.67 | \n",
"
\n",
" \n",
" 8 | \n",
" Paese | \n",
" LOC | \n",
" 18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.50 | \n",
"
\n",
" \n",
" 9 | \n",
" Pechino | \n",
" LOC | \n",
" 18 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.50 | \n",
"
\n",
" \n",
" 10 | \n",
" Italia | \n",
" LOC | \n",
" 16 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.33 | \n",
"
\n",
" \n",
" 11 | \n",
" Parigi | \n",
" LOC | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.08 | \n",
"
\n",
" \n",
" 12 | \n",
" Russia | \n",
" LOC | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 1.08 | \n",
"
\n",
" \n",
" 13 | \n",
" Usa | \n",
" LOC | \n",
" 12 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1.00 | \n",
"
\n",
" \n",
" 14 | \n",
" Madrid | \n",
" LOC | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 15 | \n",
" Economia | \n",
" LOC | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 16 | \n",
" Allerta Francia | \n",
" LOC | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 17 | \n",
" Porta di Brandeburgo | \n",
" LOC | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 18 | \n",
" Georgia | \n",
" LOC | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 19 | \n",
" Ginevra | \n",
" LOC | \n",
" 10 | \n",
" 1 | \n",
" 0 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 20 | \n",
" Paesi | \n",
" LOC | \n",
" 8 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.67 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 5 6 7 8 9 10 \\\n",
"1 Cina LOC 43 6 4 0 5 4 14 0 2 2 2 \n",
"2 Germania LOC 37 0 0 0 6 1 0 0 6 6 6 \n",
"3 Francia LOC 22 0 0 0 1 0 1 0 4 4 4 \n",
"4 Wuhan LOC 21 11 1 0 0 2 7 0 0 0 0 \n",
"5 Stati Uniti LOC 21 0 1 0 5 3 2 0 2 2 2 \n",
"6 Berlino LOC 20 0 0 0 0 0 0 0 4 4 4 \n",
"7 Mosca LOC 20 0 0 0 0 0 0 0 4 4 4 \n",
"8 Paese LOC 18 0 0 0 5 1 1 1 2 2 2 \n",
"9 Pechino LOC 18 4 0 0 2 2 10 0 0 0 0 \n",
"10 Italia LOC 16 1 1 0 3 1 0 10 0 0 0 \n",
"11 Parigi LOC 13 0 0 0 3 0 0 0 2 2 2 \n",
"12 Russia LOC 13 0 0 0 2 0 1 0 2 2 2 \n",
"13 Usa LOC 12 2 0 0 3 5 1 1 0 0 0 \n",
"14 Madrid LOC 10 0 0 0 0 0 0 0 2 2 2 \n",
"15 Economia LOC 10 0 0 0 0 0 0 0 2 2 2 \n",
"16 Allerta Francia LOC 10 0 0 0 0 0 0 0 2 2 2 \n",
"17 Porta di Brandeburgo LOC 10 0 0 0 0 0 0 0 2 2 2 \n",
"18 Georgia LOC 10 0 0 0 0 0 0 0 2 2 2 \n",
"19 Ginevra LOC 10 1 0 4 1 1 0 3 0 0 0 \n",
"20 Paesi LOC 8 0 2 0 1 3 1 1 0 0 0 \n",
"\n",
" 11 12 mean \n",
"1 2 2 3.58 \n",
"2 6 6 3.08 \n",
"3 4 4 1.83 \n",
"4 0 0 1.75 \n",
"5 2 2 1.75 \n",
"6 4 4 1.67 \n",
"7 4 4 1.67 \n",
"8 2 2 1.50 \n",
"9 0 0 1.50 \n",
"10 0 0 1.33 \n",
"11 2 2 1.08 \n",
"12 2 2 1.08 \n",
"13 0 0 1.00 \n",
"14 2 2 0.83 \n",
"15 2 2 0.83 \n",
"16 2 2 0.83 \n",
"17 2 2 0.83 \n",
"18 2 2 0.83 \n",
"19 0 0 0.83 \n",
"20 0 0 0.67 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_entity_counts_location = df_ent_freq_all[df_ent_freq_all[\"category\"] == \"GPE\"]\n",
"df_entity_counts_location = df_entity_counts_location.reset_index(drop=True)\n",
"df_entity_counts_location.index += 1\n",
"df_entity_counts_location.to_csv(os.path.join(export, \"entities\\entities-frequency-0-general-locations.csv\"))\n",
"display(df_entity_counts_location.head(20))"
]
},
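{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-category cells in this section all repeat the same filter-and-export steps. They could be collapsed into a small helper along these lines (a sketch; `export_category` is a name introduced here for illustration, not used elsewhere in the notebook):\n",
"\n",
"```python\n",
"def export_category(category, filename):\n",
"    # Filter the global frequency table, re-rank from 1 and export as CSV\n",
"    df_cat = df_ent_freq_all[df_ent_freq_all['category'] == category]\n",
"    df_cat = df_cat.reset_index(drop=True)\n",
"    df_cat.index += 1\n",
"    df_cat.to_csv(os.path.join(export, 'entities', filename))\n",
"    return df_cat\n",
"\n",
"# e.g. df_entity_counts_person = export_category('PERSON', 'entities-frequency-0-general-persons.csv')\n",
"```"
]
},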
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Persons\n",
"Remember to change the category according to the linguistic model!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Ons | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 2 | \n",
" Sun Chunlan | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 3 | \n",
" Mikhail Murashko | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 4 | \n",
" Le Monde Jean-François Delfraissy | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 5 | \n",
" Paolo Gentiloni | \n",
" PER | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 6 | \n",
" Roche | \n",
" PER | \n",
" 8 | \n",
" 0 | \n",
" 8 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.67 | \n",
"
\n",
" \n",
" 7 | \n",
" Donald Trump | \n",
" PER | \n",
" 6 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.50 | \n",
"
\n",
" \n",
" 8 | \n",
" Xi Jinping | \n",
" PER | \n",
" 5 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.42 | \n",
"
\n",
" \n",
" 9 | \n",
" Nm | \n",
" PER | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.42 | \n",
"
\n",
" \n",
" 10 | \n",
" Tedros Adhanom Ghebreyesus | \n",
" PER | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.33 | \n",
"
\n",
" \n",
" 11 | \n",
" Mike Ryan | \n",
" PER | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.25 | \n",
"
\n",
" \n",
" 12 | \n",
" Remdesivir | \n",
" PER | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.25 | \n",
"
\n",
" \n",
" 13 | \n",
" Maria Van Kerkhove | \n",
" PER | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 14 | \n",
" « | \n",
" PER | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 15 | \n",
" Cornado | \n",
" PER | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 16 | \n",
" Severin Schwan | \n",
" PER | \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 17 | \n",
" Jair Bolsonaro | \n",
" PER | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 18 | \n",
" «O | \n",
" PER | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.08 | \n",
"
\n",
" \n",
" 19 | \n",
" «moderata | \n",
" PER | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.08 | \n",
"
\n",
" \n",
" 20 | \n",
" Li-Wenliang | \n",
" PER | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.08 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 5 6 7 8 \\\n",
"1 Ons PER 10 0 0 0 0 0 0 0 2 \n",
"2 Sun Chunlan PER 10 0 0 0 0 0 0 0 2 \n",
"3 Mikhail Murashko PER 10 0 0 0 0 0 0 0 2 \n",
"4 Le Monde Jean-François Delfraissy PER 10 0 0 0 0 0 0 0 2 \n",
"5 Paolo Gentiloni PER 10 0 0 0 0 0 0 0 2 \n",
"6 Roche PER 8 0 8 0 0 0 0 0 0 \n",
"7 Donald Trump PER 6 0 1 0 1 3 1 0 0 \n",
"8 Xi Jinping PER 5 1 0 0 0 1 3 0 0 \n",
"9 Nm PER 5 0 0 5 0 0 0 0 0 \n",
"10 Tedros Adhanom Ghebreyesus PER 4 0 0 0 0 1 2 1 0 \n",
"11 Mike Ryan PER 3 0 0 0 0 2 0 1 0 \n",
"12 Remdesivir PER 3 0 0 0 0 0 0 3 0 \n",
"13 Maria Van Kerkhove PER 2 0 0 0 1 1 0 0 0 \n",
"14 « PER 2 0 0 0 0 0 1 1 0 \n",
"15 Cornado PER 2 0 0 0 0 0 0 2 0 \n",
"16 Severin Schwan PER 2 0 2 0 0 0 0 0 0 \n",
"17 Jair Bolsonaro PER 2 0 0 0 1 0 0 1 0 \n",
"18 «O PER 1 0 0 0 0 1 0 0 0 \n",
"19 «moderata PER 1 0 0 0 0 1 0 0 0 \n",
"20 Li-Wenliang PER 1 0 0 0 0 1 0 0 0 \n",
"\n",
" 9 10 11 12 mean \n",
"1 2 2 2 2 0.83 \n",
"2 2 2 2 2 0.83 \n",
"3 2 2 2 2 0.83 \n",
"4 2 2 2 2 0.83 \n",
"5 2 2 2 2 0.83 \n",
"6 0 0 0 0 0.67 \n",
"7 0 0 0 0 0.50 \n",
"8 0 0 0 0 0.42 \n",
"9 0 0 0 0 0.42 \n",
"10 0 0 0 0 0.33 \n",
"11 0 0 0 0 0.25 \n",
"12 0 0 0 0 0.25 \n",
"13 0 0 0 0 0.17 \n",
"14 0 0 0 0 0.17 \n",
"15 0 0 0 0 0.17 \n",
"16 0 0 0 0 0.17 \n",
"17 0 0 0 0 0.17 \n",
"18 0 0 0 0 0.08 \n",
"19 0 0 0 0 0.08 \n",
"20 0 0 0 0 0.08 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_entity_counts_person = df_ent_freq_all[df_ent_freq_all[\"category\"] == \"PERSON\"]\n",
"df_entity_counts_person = df_entity_counts_person.reset_index(drop=True)\n",
"df_entity_counts_person.index += 1\n",
"df_entity_counts_person.to_csv(os.path.join(export, \"entities\\entities-frequency-0-general-persons.csv\"))\n",
"display(df_entity_counts_person.head(20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Organizations\n",
"Remember to change the category according to the linguistic model!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Oms | \n",
" ORG | \n",
" 96 | \n",
" 5 | \n",
" 4 | \n",
" 0 | \n",
" 1 | \n",
" 23 | \n",
" 27 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 8.00 | \n",
"
\n",
" \n",
" 2 | \n",
" ANSA | \n",
" ORG | \n",
" 44 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 3.67 | \n",
"
\n",
" \n",
" 3 | \n",
" Organizzazione mondiale della sanità | \n",
" ORG | \n",
" 10 | \n",
" 1 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 4 | \n",
" Comitato scientifico | \n",
" ORG | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 5 | \n",
" Comitato per le emergenze | \n",
" ORG | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.83 | \n",
"
\n",
" \n",
" 6 | \n",
" Onu | \n",
" ORG | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.33 | \n",
"
\n",
" \n",
" 7 | \n",
" Associated Press | \n",
" ORG | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.33 | \n",
"
\n",
" \n",
" 8 | \n",
" Corriere della Sera | \n",
" ORG | \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.25 | \n",
"
\n",
" \n",
" 9 | \n",
" Consiglio diritti umani | \n",
" ORG | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.25 | \n",
"
\n",
" \n",
" 10 | \n",
" RCS Mediagroup S.p.a | \n",
" ORG | \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.25 | \n",
"
\n",
" \n",
" 11 | \n",
" Renault | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 12 | \n",
" Bmw | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 13 | \n",
" GTI | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 14 | \n",
" Aston Martin Vantage Roadster | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 15 | \n",
" Hyundai i20 | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 16 | \n",
" Fiat | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 17 | \n",
" Audi | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 18 | \n",
" Parlamento | \n",
" ORG | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 19 | \n",
" Organizzazione Mondiale della Sanità | \n",
" ORG | \n",
" 2 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
" 20 | \n",
" Nuovo Manifesto Società Cooperativa Editrice | \n",
" ORG | \n",
" 2 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 \\\n",
"1 Oms ORG 96 5 4 0 1 \n",
"2 ANSA ORG 44 1 0 0 0 \n",
"3 Organizzazione mondiale della sanità ORG 10 1 2 0 0 \n",
"4 Comitato scientifico ORG 10 0 0 0 0 \n",
"5 Comitato per le emergenze ORG 10 0 0 0 0 \n",
"6 Onu ORG 4 0 0 0 1 \n",
"7 Associated Press ORG 4 0 0 0 0 \n",
"8 Corriere della Sera ORG 3 0 1 0 0 \n",
"9 Consiglio diritti umani ORG 3 0 0 0 0 \n",
"10 RCS Mediagroup S.p.a ORG 3 0 1 0 0 \n",
"11 Renault ORG 2 0 0 2 0 \n",
"12 Bmw ORG 2 0 0 2 0 \n",
"13 GTI ORG 2 0 0 2 0 \n",
"14 Aston Martin Vantage Roadster ORG 2 0 0 2 0 \n",
"15 Hyundai i20 ORG 2 0 0 2 0 \n",
"16 Fiat ORG 2 0 0 2 0 \n",
"17 Audi ORG 2 0 0 2 0 \n",
"18 Parlamento ORG 2 0 0 0 1 \n",
"19 Organizzazione Mondiale della Sanità ORG 2 2 0 0 0 \n",
"20 Nuovo Manifesto Società Cooperativa Editrice ORG 2 2 0 0 0 \n",
"\n",
" 5 6 7 8 9 10 11 12 mean \n",
"1 23 27 6 6 6 6 6 6 8.00 \n",
"2 0 0 3 8 8 8 8 8 3.67 \n",
"3 3 3 1 0 0 0 0 0 0.83 \n",
"4 0 0 0 2 2 2 2 2 0.83 \n",
"5 0 0 0 2 2 2 2 2 0.83 \n",
"6 1 2 0 0 0 0 0 0 0.33 \n",
"7 0 4 0 0 0 0 0 0 0.33 \n",
"8 1 1 0 0 0 0 0 0 0.25 \n",
"9 0 0 3 0 0 0 0 0 0.25 \n",
"10 1 1 0 0 0 0 0 0 0.25 \n",
"11 0 0 0 0 0 0 0 0 0.17 \n",
"12 0 0 0 0 0 0 0 0 0.17 \n",
"13 0 0 0 0 0 0 0 0 0.17 \n",
"14 0 0 0 0 0 0 0 0 0.17 \n",
"15 0 0 0 0 0 0 0 0 0.17 \n",
"16 0 0 0 0 0 0 0 0 0.17 \n",
"17 0 0 0 0 0 0 0 0 0.17 \n",
"18 1 0 0 0 0 0 0 0 0.17 \n",
"19 0 0 0 0 0 0 0 0 0.17 \n",
"20 0 0 0 0 0 0 0 0 0.17 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_entity_counts_organization = df_ent_freq_all[df_ent_freq_all[\"category\"] == \"ORG\"]\n",
"df_entity_counts_organization = df_entity_counts_organization.reset_index(drop=True)\n",
"df_entity_counts_organization.index += 1\n",
"df_entity_counts_organization.to_csv(os.path.join(export, \"entities\\entities-frequency-0-general-organizations.csv\"))\n",
"display(df_entity_counts_organization.head(20))"
]
},
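{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the table above counts \"Organizzazione mondiale della sanità\" and \"Organizzazione Mondiale della Sanità\" as two separate entities. A sketch of how such surface variants could be merged by case-folding before ranking (illustrative only, not part of the pipeline; the `mean` column is deliberately left out since it would not sum meaningfully):\n",
"\n",
"```python\n",
"month_cols = [str(m) for m in range(1, 13)]\n",
"folded = (\n",
"    df_entity_counts_organization\n",
"    .assign(entity=lambda d: d['entity'].str.casefold())\n",
"    .groupby('entity', as_index=False)[['total'] + month_cols]\n",
"    .sum()\n",
"    .sort_values('total', ascending=False)\n",
")\n",
"```"
]
},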
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### COVID19\n",
"Remember to change the category according to the linguistic model!"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" coronavirus | \n",
" COVID19 | \n",
" 79 | \n",
" 7 | \n",
" 7 | \n",
" 3 | \n",
" 4 | \n",
" 7 | \n",
" 7 | \n",
" 4 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 6.58 | \n",
"
\n",
" \n",
" 2 | \n",
" Coronavirus | \n",
" COVID19 | \n",
" 6 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.50 | \n",
"
\n",
" \n",
" 3 | \n",
" 2019-nCoV | \n",
" COVID19 | \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.17 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 5 6 7 8 9 10 11 12 mean\n",
"1 coronavirus COVID19 79 7 7 3 4 7 7 4 8 8 8 8 8 6.58\n",
"2 Coronavirus COVID19 6 1 1 1 1 1 1 0 0 0 0 0 0 0.50\n",
"3 2019-nCoV COVID19 2 0 2 0 0 0 0 0 0 0 0 0 0 0.17"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_entity_counts_COVID19 = df_ent_freq_all[df_ent_freq_all[\"category\"] == \"COVID19\"]\n",
"df_entity_counts_COVID19 = df_entity_counts_COVID19.reset_index(drop=True)\n",
"df_entity_counts_COVID19.index += 1\n",
"df_entity_counts_COVID19.to_csv(os.path.join(export, \"entities\\entities-frequency-0-general-COVID19.csv\"))\n",
"display(df_entity_counts_COVID19.head(20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### COVID19r\n",
"Remember to change the category according to the linguistic model!"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" category | \n",
" total | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" virus cinese | \n",
" COVID19r | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.08 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity category total 1 2 3 4 5 6 7 8 9 10 11 12 mean\n",
"1 virus cinese COVID19r 1 0 1 0 0 0 0 0 0 0 0 0 0 0.08"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_entity_counts_COVID19r = df_ent_freq_all[df_ent_freq_all[\"category\"] == \"COVID19r\"]\n",
"df_entity_counts_COVID19r = df_entity_counts_COVID19r.reset_index(drop=True)\n",
"df_entity_counts_COVID19r.index += 1\n",
"df_entity_counts_COVID19r.to_csv(os.path.join(export, \"entities\\entities-frequency-0-general-COVID19r.csv\"))\n",
"display(df_entity_counts_COVID19r.head(20))"
]
},
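{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `export` is a `pathlib.Path`, the CSV destinations can also be built with the `/` operator instead of `os.path.join`, which avoids hard-coded path separators and works on any operating system (a sketch):\n",
"\n",
"```python\n",
"out_path = export / 'entities' / 'entities-frequency-timeseries.csv'\n",
"df_ent_freq_all.to_csv(out_path)\n",
"```"
]
},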
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data elaborated in 0:00:57.258465\n"
]
}
],
"source": [
"end_time = datetime.now()\n",
"print('Data processed in {}'.format(end_time - start_time))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}