3991614
doi
10.5281/zenodo.3991614
oai:zenodo.org:3991614
user-covid-19
user-crises_resources
Factiva parser and NLP pipeline for news articles related to COVID-19
Giovanni Spitale
University of Zurich - Institute of Biomedical Ethics and History of Medicine
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
natural language processing
NLP
media analysis
factiva
<p>The COVID-19 pandemic has generated (and keeps generating) a huge corpus of news articles, easily retrievable from Factiva with targeted queries.</p>
<p>The aim of this software is to provide the means to analyze this material rapidly. </p>
<p>Data are retrieved from Factiva and downloaded by hand(...) in RTF. The RTF files are then converted to TXT with unoconv in a Unix environment.</p>
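<p>The conversion step could be scripted roughly as follows. This is a minimal sketch, not the author's procedure: it assumes unoconv is installed and on PATH, and the helper names <code>unoconv_command</code> and <code>convert_folder</code> are hypothetical.</p>

```python
import shutil
import subprocess
from pathlib import Path


def unoconv_command(rtf_path):
    """Build the unoconv call that converts one RTF file to plain text."""
    # "-f txt" selects the output format; by default unoconv writes the
    # result next to the input file, with the extension changed to .txt.
    return ["unoconv", "-f", "txt", str(rtf_path)]


def convert_folder(folder):
    """Convert every .rtf file in `folder`; return the expected .txt paths."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv was not found on PATH")
    converted = []
    for rtf_path in sorted(Path(folder).glob("*.rtf")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(unoconv_command(rtf_path), check=True)
        converted.append(rtf_path.with_suffix(".txt"))
    return converted
```

<p>Calling <code>convert_folder</code> on the download folder would leave one TXT file beside each RTF file, ready for the parser.</p>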
<p> </p>
<p><strong>Parser:</strong></p>
<p>Takes as input files numerically ordered in a folder. The ordering is not essential (for example, when the articles come from multiple Factiva retrievals), because the parser sorts the articles by date using the date field contained in each article. Nevertheless, <strong>it is important to reduce duplicates</strong> (they increase the computational time needed to process the corpus), so before adding new articles to the folder, make sure to retrieve them from a time point that does not overlap with the articles already retrieved.</p>
<p>In any case, in the last phase the dataframe is checked for duplicates, which are counted and removed. The duplicated articles are still processed by the parser before that point, however, and <strong>this takes computational time.</strong></p>
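<p>The date ordering and the final duplicate check can be sketched as below. The notebook works on a dataframe; this illustration uses plain Python records instead, and the field names <code>date</code>, <code>title</code> and <code>text</code> as well as the function name are assumptions made for the example.</p>

```python
from datetime import date


def sort_and_deduplicate(articles):
    """Sort article records by date, then drop exact repeats.

    Each record is a dict with "date", "title" and "text" keys
    (hypothetical field names). Returns the cleaned list and the
    number of duplicates that were removed.
    """
    seen = set()
    unique = []
    for article in sorted(articles, key=lambda a: a["date"]):
        key = (article["date"], article["title"], article["text"])
        if key in seen:
            continue  # same article retrieved twice: keep the first copy
        seen.add(key)
        unique.append(article)
    return unique, len(articles) - len(unique)


articles = [
    {"date": date(2020, 3, 2), "title": "B", "text": "second"},
    {"date": date(2020, 1, 15), "title": "A", "text": "first"},
    {"date": date(2020, 3, 2), "title": "B", "text": "second"},
]
unique, removed = sort_and_deduplicate(articles)
print(removed, [a["title"] for a in unique])
```

<p>Because the check runs on already parsed records, every duplicate has been parsed before it is discarded, which is why avoiding overlapping retrievals saves time.</p>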
<p>The parser removes search summaries, segments the text, and cleans it using regex rules. The resulting text is exported in a complete dataframe as a CSV file; a subset containing only title and text is exported as TXT, ready to be fed to the NLP pipeline.</p>
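<p>A rough sketch of the cleaning and export steps follows. The two cleaning rules are placeholders (the actual regex rules live in the notebook and are not reproduced in this description), and the column names and function names are assumptions.</p>

```python
import csv
import re

# Hypothetical cleaning rules; the notebook defines its own set.
CLEANING_RULES = [
    (re.compile(r"[ \t]+"), " "),     # collapse runs of spaces and tabs
    (re.compile(r"\n{3,}"), "\n\n"),  # collapse runs of blank lines
]


def clean(text):
    """Apply each regex rule in order and trim the result."""
    for pattern, replacement in CLEANING_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def export(articles, csv_path, txt_path):
    """Write all fields to CSV and only title plus text to TXT."""
    fields = ["date", "title", "text"]  # assumed column names
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(articles)
    with open(txt_path, "w", encoding="utf-8") as handle:
        for article in articles:
            handle.write(article["title"] + "\n" + article["text"] + "\n\n")
```

<p>Keeping the rules in a list makes it easy to add or reorder them without touching the cleaning function.</p>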
<p>The parser is language agnostic; just change the path to the folder containing the documents to parse. <strong>Important:</strong> one regex rule ("header_leftover") mentions languages: it lists EN, DE, FR and IT. If you need to work with another language, remember to update that rule.</p>
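<p>To illustrate what updating such a rule might involve, here is a hypothetical stand-in for "header_leftover". The real pattern in the notebook is not shown in this description and may well differ; the sketch only assumes that the rule matches a line consisting of a single language code.</p>

```python
import re

# Language codes the rule should recognise; add yours here (e.g. "ES").
LANGUAGE_CODES = ["EN", "DE", "FR", "IT"]

# Hypothetical stand-in for the notebook's "header_leftover" rule: it
# matches a line that consists of a single language code.
header_leftover = re.compile(
    r"^(?:" + "|".join(map(re.escape, LANGUAGE_CODES)) + r")$",
    flags=re.MULTILINE,
)

sample = "Some headline\nEN\nBody of the article"
print(header_leftover.sub("", sample))
```

<p>Building the alternation from a list means that supporting another language is a one-line change to <code>LANGUAGE_CODES</code>.</p>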
<p> </p>
<p><strong>NLP pipeline:</strong></p>
<p>The NLP pipeline imports the files generated by the parser (divided by month to reduce memory load) and analyzes them. It is <strong>not language agnostic:</strong> the correct linguistic settings must be specified in <strong>"setting up", "NLP" and "additional rules".</strong></p>
<p>First, some additional rules for named entity recognition (NER) are defined. Some are general, others are language-specific, as specified in the relevant section.</p>
<p>The files are opened and preprocessed, then lemma frequency and named entity (NE) frequency are calculated for each month and for the whole corpus. <strong>Important:</strong> in case of empty months (that is, when analyzing less than one year of data) <strong>remember to exclude them from the mean,</strong> otherwise the mean will be distorted by the empty months.</p>
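<p>The effect of empty months on the mean can be shown with a small sketch. It is not taken from the notebooks: the function name, the month keys and the counts are made up for illustration, and a month is treated as empty when it contains no lemmas at all.</p>

```python
from collections import Counter


def monthly_mean(counts_by_month, lemma):
    """Mean monthly frequency of `lemma`, ignoring months with no data."""
    # A month counts as empty when its Counter holds no lemmas at all.
    non_empty = [c for c in counts_by_month.values() if sum(c.values()) > 0]
    if not non_empty:
        return 0.0
    return sum(c[lemma] for c in non_empty) / len(non_empty)


# Illustrative counts only: two months with data, one empty month.
counts_by_month = {
    "2020-01": Counter(virus=10),
    "2020-02": Counter(virus=30, lockdown=5),
    "2020-03": Counter(),
}
print(monthly_mean(counts_by_month, "virus"))  # 20.0
```

<p>Dividing by all three months would give about 13.3 instead of 20.0, which is the distortion the warning above refers to.</p>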
<p>All the dataframes are exported as CSV files for further analysis or for data visualization.</p>
<p>This code is optimized for English, German, French and Italian. Nevertheless, being based on spaCy, which provides several other models ( <a href="https://spacy.io/models">https://spacy.io/models</a> ), it could easily be adapted to other languages.</p>
<p> </p>
<p>The whole software is structured as JupyterLab notebooks, heavily commented for future reference.</p>
Zenodo
2020-08-19
info:eu-repo/semantics/other
3991613
user-covid-19
user-crises_resources
1.0.0
1621964728.751896
112733
md5:7cc11885761417c2755c480afd8a2de2
https://zenodo.org/records/3991614/files/de-NLP.ipynb
112741
md5:33c09034e4fbaad4b790b37e59426b15
https://zenodo.org/records/3991614/files/fr-NLP.ipynb
112572
md5:0ae6b0d0f0fb950c04ee4b865f60720e
https://zenodo.org/records/3991614/files/it-NLP.ipynb
27146
md5:7a5b4e371c3feab73ff7a5f110f8f276
https://zenodo.org/records/3991614/files/Parser.ipynb
112736
md5:c2ad3c73f761563bd7416b8073b82272
https://zenodo.org/records/3991614/files/en-NLP.ipynb
public
10.5281/zenodo.3991613
isVersionOf
doi