Using LDA and Jensen-Shannon Distance (JSD) to group similar newspaper articles

Sarah Oberbichler

doi:10.5281/zenodo.3876063

Published June 4, 2020 | Version v1.0

Resource type: Other | Access: Open

Sarah Oberbichler¹

1. University of Innsbruck

Many researchers find that their data sets or automated set annotations contain articles that are irrelevant to their research question. For example, a search for articles on return migration has to rely on ambiguous search terms. The German words "Heimkehr" (returning home) and "Rückkehr" (return) retrieve many articles that are relevant to the research question, but also articles that are not (e.g. the return from a mountain tour or from work). By using topic models and document similarity measures, this notebook makes it possible to exclude these irrelevant articles without combining ambiguous words like "Heimkehr" with other search terms. The same code can also be used to remove or prefer a certain genre, e.g. advertising or sports news.

The main purpose of this notebook is to take the context of articles into account in order to refine a search query automatically. Ambiguous words can therefore be used in a search without being combined with other words, which makes the search less dependent on the researcher's prior knowledge and helps avoid tunnel vision.
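The idea of comparing articles by their topic distributions can be sketched as follows. This is a minimal illustration and not code taken from the notebook: it assumes that an LDA model has already assigned each article a probability distribution over topics, and the topic values, article names, and the 0.3 threshold are all hypothetical. The Jensen-Shannon distance is computed here with the standard library only; each article is kept if its distance to a known relevant reference article falls below the threshold.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), a value in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding errors


# Hypothetical topic distributions (three topics), as an LDA model might produce.
# The reference is an article already known to be about return migration.
reference = [0.70, 0.20, 0.10]
articles = {
    "return_migration_report": [0.65, 0.25, 0.10],
    "mountain_tour_notice": [0.05, 0.10, 0.85],
    "emigrants_come_home": [0.60, 0.15, 0.25],
}

THRESHOLD = 0.3  # assumed cut-off; in practice it has to be tuned on the data

distances = {
    name: jensen_shannon_distance(reference, dist)
    for name, dist in articles.items()
}
relevant = sorted(name for name, d in distances.items() if d <= THRESHOLD)
print(relevant)
```

With these made-up values, the two migration-related articles lie close to the reference, while the mountain tour notice, which shares the search term but not the topical context, lies far from it and is excluded.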

Files

Name	Size	Checksum
soberbichler/Using-LDA-and-Jensen-Shannon-distance-to-separate-relevant-from-non-relevant-articles-v1.0.zip	334.0 kB	md5:6ccd2a64fb2659c05807aa7081c2d4fd

Additional details

Is supplement to: https://github.com/soberbichler/Using-LDA-and-Jensen-Shannon-distance-to-separate-relevant-from-non-relevant-articles/tree/v1.0 (URL)

Funding: European Commission, NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
