Factiva parser and NLP pipeline for news articles related to COVID-19

doi:10.5281/zenodo.4792669

Published August 19, 2020 | Version 2.0.0

Software Open

Giovanni Spitale¹

1. University of Zurich - Institute of Biomedical Ethics and History of Medicine

Project leader:

Nikola Biller-Andorno¹

Project member:

Sonja Merten²

1. University of Zurich - Institute of Biomedical Ethics and History of Medicine
2. Swiss Tropical and Public Health Institute

Changelog v2.0.0 / what's new:

- RTF to TXT conversion and merging is now done in the notebook and does not depend on external software

- rewritten the parser due to changes in Factiva's output

- rewritten the NLP pipeline to process data with different temporal depth

- streamlined and optimized here and there :)

The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries.

The aim of this software is to provide the means to analyze this material rapidly.

Data are retrieved from Factiva and downloaded by hand(...) in RTF. The RTF files are then converted to TXT.

Parser:

Takes as input files numerically ordered in a folder. The ordering is not essential (for example, when the articles come from multiple Factiva retrievals), because the parser sorts the articles by date using the date field contained in each article. Nevertheless, it is important to limit duplicates, because they increase the computational time needed to process the corpus: before adding new articles to the folder, make sure they are retrieved from a time point that does not overlap with the articles already retrieved.
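The ordering logic described above can be sketched in a few lines of Python. This is only an illustration, not the notebook's actual code: the file names, the `date` key, the helper names and the date format `%d %B %Y` are all assumptions made for the example.

```python
import re
from datetime import datetime


def numeric_key(filename):
    """Sort key so that 'Factiva-10.txt' comes after 'Factiva-2.txt'."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


def sort_by_date(articles, date_format="%d %B %Y"):
    """Return the articles ordered by their own date field (assumed format)."""
    return sorted(
        articles,
        key=lambda article: datetime.strptime(article["date"], date_format),
    )


# Hypothetical file names: a plain string sort would put '10' before '2'.
files = sorted(["Factiva-10.txt", "Factiva-2.txt", "Factiva-1.txt"], key=numeric_key)

# Hypothetical parsed articles, deliberately out of chronological order.
articles = [
    {"title": "B", "date": "3 March 2020"},
    {"title": "A", "date": "28 February 2020"},
]
ordered = sort_by_date(articles)
```

Because the final order comes from the date field, the order in which files are read only matters for reproducibility, not for the result.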

In any case, in the final phase the dataframe is checked for duplicates, which are counted and removed. The duplicated articles are still processed by the parser before that point, however, and this takes computational time.
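A minimal sketch of such a duplicate check follows, using plain Python lists as a stand-in for the dataframe. The criterion used here (same title and same text) and the function name are assumptions; the notebook may define duplicates differently.

```python
def drop_duplicates(articles):
    """Keep the first occurrence of each article and count the removed ones.

    Two articles are treated as duplicates when their title (case-insensitive)
    and text match after trimming whitespace. This criterion is an assumption.
    """
    seen = set()
    unique = []
    for article in articles:
        key = (article["title"].strip().lower(), article["text"].strip())
        if key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique, len(articles) - len(unique)


# Hypothetical sample: the second entry repeats the first.
sample = [
    {"title": "Lockdown extended", "text": "Body one."},
    {"title": "lockdown extended ", "text": "Body one."},
    {"title": "Vaccine trial", "text": "Body two."},
]
unique_articles, removed = drop_duplicates(sample)
```

Since this check runs after parsing, every duplicate has already cost parsing time, which is why avoiding overlapping retrievals is still worthwhile.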

The parser removes search summaries, segments the text, and cleans it using regex rules. The result is exported as a complete dataframe in a CSV file; a subset containing only title and text is exported as TXT, ready to be fed to the NLP pipeline.
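The cleaning and export steps might look roughly like the sketch below. The two regex rules, the column names and the helper functions are hypothetical; the actual rules depend on Factiva's output and are defined in the notebook.

```python
import csv
import io
import re

# Hypothetical cleaning rules, applied in order.
CLEANING_RULES = [
    (re.compile(r"[ \t]+"), " "),     # collapse runs of spaces and tabs
    (re.compile(r"\n{3,}"), "\n\n"),  # collapse runs of blank lines
]


def clean(text):
    """Apply every cleaning rule, then trim leading and trailing whitespace."""
    for pattern, replacement in CLEANING_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def to_csv(rows, fieldnames):
    """Serialize the complete records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_txt(rows):
    """Serialize only title and text, one article per block."""
    return "\n\n".join(f"{row['title']}\n{row['text']}" for row in rows)


rows = [{"date": "2020-03-03", "title": "T", "text": clean("Body   text\n\n\n\nmore")}]
csv_output = to_csv(rows, ["date", "title", "text"])
txt_output = to_txt(rows)
```

Keeping the rules in a single list makes it easy to add or reorder patterns when the export format changes.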

The parser is language agnostic; just change the path to the folder containing the documents to parse.

NLP pipeline:

The NLP pipeline imports the files generated by the parser (divided by month to reduce the memory load) and analyses them. It is not language agnostic: the correct linguistic settings must be specified in "setting up", "NLP" and "additional rules".

First, some additional rules for NER are defined. Some are general, some are language-specific, as specified in the relevant section.

The files are opened and preprocessed, then lemma frequency and NE frequency are calculated for each month and for the whole corpus.
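The per-month and whole-corpus counts can be illustrated with the standard library alone. The sketch assumes the lemmas have already been extracted (in the notebooks this is done with spaCy); the month labels, the sample lemmas and the function name are made up for the example. The same approach would apply to named entities.

```python
from collections import Counter


def frequencies_by_month(lemmas_by_month):
    """Count lemmas for each month, then sum the counts for the whole corpus."""
    per_month = {
        month: Counter(lemmas) for month, lemmas in lemmas_by_month.items()
    }
    corpus = Counter()
    for counts in per_month.values():
        corpus.update(counts)
    return per_month, corpus


# Hypothetical, already lemmatized input.
lemmas_by_month = {
    "2020-03": ["virus", "lockdown", "virus"],
    "2020-04": ["mask", "virus"],
}
per_month, corpus = frequencies_by_month(lemmas_by_month)
```

Processing one month at a time and merging only the counters keeps the full text of the corpus out of memory.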

All the dataframes are exported as CSV files for further analysis or for data visualization.

This code is optimized for English, German, French and Italian. Nevertheless, because it is based on spaCy, which provides models for several other languages ( https://spacy.io/models ), it could easily be adapted to other languages.

The whole software is structured as JupyterLab notebooks, heavily commented for future reference.

This work is part of the PubliCo research project.

Notes

This work is part of the PubliCo research project, supported by the Swiss National Science Foundation (SNF). Project no. 31CA30_195905

Files

Files (1.4 MB)

Name	Size
Factiva parser and NLP.zip md5:35df725bdfedcb3f9b4bdfe25fdc7b90	1.4 MB

Additional details

Compiles: Dataset: 10.5281/zenodo.4036071 (DOI)

	All versions	This version
Views	1,735	626
Downloads	164	30
Data volume	59.0 MB	45.2 MB
