{ "access": { "embargo": { "active": false, "reason": null }, "files": "public", "record": "public", "status": "open" }, "created": "2021-05-25T17:45:28.539387+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "files": { "count": 1, "enabled": true, "entries": { "Factiva parser and NLP.zip": { "checksum": "md5:35df725bdfedcb3f9b4bdfe25fdc7b90", "ext": "zip", "id": "26091ad8-d122-4516-a73d-348b1eae9383", "key": "Factiva parser and NLP.zip", "metadata": null, "mimetype": "application/zip", "size": 1413920 } }, "order": [], "total_bytes": 1413920 }, "id": "4792669", "is_draft": false, "is_published": true, "links": { "access": "https://zenodo.org/api/records/4792669/access", "access_links": "https://zenodo.org/api/records/4792669/access/links", "access_request": "https://zenodo.org/api/records/4792669/access/request", "access_users": "https://zenodo.org/api/records/4792669/access/users", "archive": "https://zenodo.org/api/records/4792669/files-archive", "archive_media": "https://zenodo.org/api/records/4792669/media-files-archive", "communities": "https://zenodo.org/api/records/4792669/communities", "communities-suggestions": "https://zenodo.org/api/records/4792669/communities-suggestions", "doi": "https://doi.org/10.5281/zenodo.4792669", "draft": "https://zenodo.org/api/records/4792669/draft", "files": "https://zenodo.org/api/records/4792669/files", "latest": "https://zenodo.org/api/records/4792669/versions/latest", "latest_html": "https://zenodo.org/records/4792669/latest", "media_files": "https://zenodo.org/api/records/4792669/media-files", "parent": "https://zenodo.org/api/records/3991613", "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.3991613", "parent_html": "https://zenodo.org/records/3991613", "requests": "https://zenodo.org/api/records/4792669/requests", "reserve_doi": "https://zenodo.org/api/records/4792669/draft/pids/doi", "self": "https://zenodo.org/api/records/4792669", "self_doi": 
"https://zenodo.org/doi/10.5281/zenodo.4792669", "self_html": "https://zenodo.org/records/4792669", "self_iiif_manifest": "https://zenodo.org/api/iiif/record:4792669/manifest", "self_iiif_sequence": "https://zenodo.org/api/iiif/record:4792669/sequence/default", "versions": "https://zenodo.org/api/records/4792669/versions" }, "media_files": { "count": 0, "enabled": false, "entries": {}, "order": [], "total_bytes": 0 }, "metadata": { "additional_descriptions": [ { "description": "This work is part of the PubliCo research project, supported by the Swiss National Science Foundation (SNF). Project no. 31CA30_195905", "type": { "id": "notes", "title": { "de": "Anmerkungen", "en": "Notes" } } } ], "contributors": [ { "affiliations": [ { "name": "University of Zurich - Institute of Biomedical Ethics and History of Medicine" } ], "person_or_org": { "family_name": "Nikola Biller-Andorno", "identifiers": [ { "identifier": "0000-0001-7661-1324", "scheme": "orcid" } ], "name": "Nikola Biller-Andorno", "type": "personal" }, "role": { "id": "projectleader", "title": { "de": "ProjektleiterIn", "en": "Project leader" } } }, { "affiliations": [ { "name": "Swiss Tropical and Public Health Institute" } ], "person_or_org": { "family_name": "Sonja Merten", "identifiers": [ { "identifier": "0000-0003-4115-106X", "scheme": "orcid" } ], "name": "Sonja Merten", "type": "personal" }, "role": { "id": "projectmember", "title": { "de": "Projektmitglied", "en": "Project member" } } } ], "creators": [ { "affiliations": [ { "name": "University of Zurich - Institute of Biomedical Ethics and History of Medicine" } ], "person_or_org": { "family_name": "Giovanni Spitale", "identifiers": [ { "identifier": "0000-0002-6812-0979", "scheme": "orcid" } ], "name": "Giovanni Spitale", "type": "personal" } } ], "description": "
Changelog v2.0.0 / what's new:
\n\n- RTF-to-TXT conversion and merging are now done in the notebook and no longer depend on external software
\n\n- the parser has been rewritten to follow changes in Factiva's output
\n\n- the NLP pipeline has been rewritten to process data with different temporal depth
\n\n- streamlined and optimized here and there :)
\n\n\n\n
The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries.
\n\nThe aim of this software is to provide the means to analyze this material rapidly.
\n\nData are retrieved from Factiva and downloaded by hand(...) in RTF. The RTF files are then converted to TXT.
\n\n\n\n
Parser:
\n\nThe parser takes as input files numerically ordered in a folder. The ordering is not essential (for example, after multiple retrievals from Factiva), because the parser sorts the articles by date using the date field contained in each article. Nevertheless, it is important to limit duplicates, because they increase the computation time needed to process the corpus: before adding new articles to the folder, make sure to retrieve them from a time point that does not overlap with the articles already retrieved.
\n\nIn any case, in the last phase the dataframe is checked for duplicates, which are counted and removed. The duplicated articles are nevertheless processed by the parser first, and this takes computation time.
\n\nThe parser removes search summaries, segments the text, and cleans it using regex rules. The resulting text is exported in a complete dataframe as a CSV file; a subset containing only title and text is exported as TXT, ready to be fed to the NLP pipeline.
\n\nThe parser is language agnostic; just change the path to the folder containing the documents to parse.
\n\n\n\n
NLP pipeline:
\n\nThe NLP pipeline imports the files generated by the parser (divided by month to put less load on the memory) and analyses them. It is not language agnostic: correct linguistic settings must be specified in "setting up", "NLP" and "additional rules".
\n\nFirst some additional rules for NER are defined. Some are general, some are language-specific, as specified in the relevant section.
\n\nThe files are opened and preprocessed, then lemma frequency and named-entity (NE) frequency are calculated for each month and for the whole corpus.
\n\nAll the dataframes are exported as CSV files for further analysis or for data visualization.
\n\nThis code is optimized for English, German, French and Italian. Nevertheless, since it is based on spaCy, which provides several other models ( https://spacy.io/models ), it could easily be adapted to other languages.
\n\nThe whole software is structured in JupyterLab notebooks, heavily commented for future reference.
\n\n\n\n
This work is part of the PubliCo research project.
", "languages": [ { "id": "eng", "title": { "en": "English" } } ], "publication_date": "2020-08-19", "publisher": "Zenodo", "related_identifiers": [ { "identifier": "10.5281/zenodo.4036071", "relation_type": { "id": "compiles", "title": { "de": "Kompiliert", "en": "Compiles" } }, "resource_type": { "id": "dataset", "title": { "de": "Datensatz", "en": "Dataset" } }, "scheme": "doi" } ], "resource_type": { "id": "software", "title": { "de": "Software", "en": "Software" } }, "rights": [ { "description": { "en": "The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited." }, "icon": "cc-by-icon", "id": "cc-by-4.0", "props": { "scheme": "spdx", "url": "https://creativecommons.org/licenses/by/4.0/legalcode" }, "title": { "en": "Creative Commons Attribution 4.0 International" } } ], "subjects": [ { "subject": "natural language processing" }, { "subject": "NLP" }, { "subject": "media analysis" }, { "subject": "factiva" } ], "title": "Factiva parser and NLP pipeline for news articles related to COVID-19", "version": "2.0.0" }, "parent": { "access": { "owned_by": { "user": 45242 } }, "communities": { "entries": [ { "access": { "member_policy": "open", "members_visibility": "public", "record_policy": "open", "review_policy": "open", "visibility": "public" }, "children": { "allow": false }, "created": "2020-03-16T11:40:44.487619+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "id": "10f33f78-3f29-41b6-bb10-f757a8f03cb8", "links": {}, "metadata": { "curation_policy": "The Coronavirus Disease Research Community - COVID-19 is curated by a selected team of experts nominated by OpenAIRE* (see list below). Each time a Zenodo user wants to add a record into the community, an email is sent to the curators that will decide whether to include the record or not.
\r\n\r\nOnly records that may be relevant to the Corona Virus Disease (COVID-19) or the SARS-CoV-2 should be included in this community. The Community curators are not able to edit records, therefore they may ask the corresponding authors to modify the record metadata when necessary, to provide the readers/users with more detailed information according to the FAIR principle of Open Science.
\r\n\r\nIf after its acceptance, a record is subsequently found not to be compliant, we reserve the right to remove it from the community.
\r\n\r\nThe curation team is reachable through the following email address for further clarification or information: covid19@openaire.eu.
\r\n\r\nCurator List:
\r\n\r\n* OpenAIRE: open access and open science training and support since 2009. OpenAIRE is the largest aggregator of European Commission funded research outputs and beyond, also delivering on-demand services for research communities.
\r\n", "page": "This community collects research outputs that may be relevant to the Coronavirus Disease (COVID-19) or the SARS-CoV-2. Scientists are encouraged to upload their outcome in this collection to facilitate sharing and discovery of information. Although Open Access articles and datasets are recommended, also closed and restricted access material are accepted. All types of research outputs can be included in this Community (Publication, Poster, Presentation, Dataset, Image, Video/Audio, Software, Lesson, Other).
\r\n\r\nThe recent Corona Virus Disease (COVID-19) outbreak is requiring unseen efforts of collaboration of the scientific community that need to act fast and to share results in an unpredictable manner. In order to facilitate the Scientist efforts, this community was created to collect all research results that could be relevant for the scientific community working on the Corona Virus Disease (COVID-19) and SARS-CoV-2.
\r\n\r\nAlthough Open Access articles and datasets are recommended, also closed and restricted access material are accepted. All types of research outputs can be included in this Community (Publication, Poster, Presentation, Dataset, Image, Video/Audio, Software, Lesson, Other).
\r\n\r\nWhen depositing a resource that is linked to other resources (not limited to the records deposited in Zenodo but also in other repositories), please make sure that your record is linked to all the other related elements already available, in order to adhere to the FAIR principles of Open Science to maximise the reusability of research results.
\r\n\r\n", "title": "Coronavirus Disease Research Community - COVID-19" }, "revision_id": 0, "slug": "covid-19", "updated": "2021-02-23T14:39:53.029415+00:00" }, { "access": { "member_policy": "open", "members_visibility": "public", "record_policy": "open", "review_policy": "open", "visibility": "public" }, "children": { "allow": false }, "created": "2020-06-05T12:31:38.375426+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "id": "c4a5135d-8f2c-454b-bbc7-5015c28dbdc3", "links": {}, "metadata": { "curation_policy": "
The community welcomes labeled as well as raw/unlabeled datasets and resources. The datasets should be used for research and non-profit purposes.
\r\n", "description": "This community aims to collect and share public datasets and resources related to natural disasters, human-induced crises, health emergencies such as epidemics and pandemics like COVID-19.", "page": "", "title": "Crisis Resources" }, "revision_id": 0, "slug": "crises_resources", "updated": "2020-06-05T12:33:46.660605+00:00" } ], "ids": [ "10f33f78-3f29-41b6-bb10-f757a8f03cb8", "c4a5135d-8f2c-454b-bbc7-5015c28dbdc3" ] }, "id": "3991613", "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.3991613", "provider": "datacite" } } }, "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.4792669", "provider": "datacite" }, "oai": { "identifier": "oai:zenodo.org:4792669", "provider": "oai" } }, "revision_id": 2, "stats": { "all_versions": { "data_volume": 55867875.0, "downloads": 183, "unique_downloads": 143, "unique_views": 1679, "views": 1727 }, "this_version": { "data_volume": 43831520.0, "downloads": 31, "unique_downloads": 29, "unique_views": 596, "views": 615 } }, "status": "published", "updated": "2021-05-26T01:48:19.786661+00:00", "versions": { "index": 3, "is_latest": true } }