Exploratory Topic Modelling Using Python Dataset - EHRI-3

doi:10.5281/zenodo.6670104

Published June 20, 2022 | Version v1

Dataset Open

Dermentzi, Maria¹

1. King's College London

In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of 80,703 testimonies, 41,695 of which are available in English; 2,894 of those list a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we compiled a list of links to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each testimony, we assembled our dataset and curated it to remove unwanted entries. For example, we removed entries with restrictions on access or use, entries whose transcripts consisted only of automatically generated headers, and entries that turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial: a small but still decently sized dataset.
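The curation step described above can be illustrated with a minimal sketch. It is not the code from the accompanying notebook: the record fields (`language`, `restricted`, `transcript`) and the header marker are hypothetical placeholders chosen for illustration, and the real metadata may be shaped quite differently.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical shape of one scraped testimony record."""
    record_id: str
    language: str     # assumed metadata field
    restricted: bool  # assumed flag for restrictions on access or use
    transcript: str


# Placeholder marker for automatically generated header lines (an assumption).
HEADER_PREFIXES = ("[auto-generated]",)


def has_real_content(transcript: str) -> bool:
    """Return True if any non-empty line is not an assumed header line."""
    for line in transcript.splitlines():
        text = line.strip().lower()
        if text and not text.startswith(HEADER_PREFIXES):
            return True
    return False


def curate(records):
    """Keep unrestricted English records whose transcript has real content."""
    return [
        r for r in records
        if not r.restricted
        and r.language == "English"
        and has_real_content(r.transcript)
    ]


sample = [
    Record("a", "English", False, "Interviewer: Where were you born?"),
    Record("b", "English", True, "Some restricted text"),
    Record("c", "German", False, "Ich wurde in Wien geboren."),
    Record("d", "English", False, "[auto-generated] Oral history transcript\n\n"),
]
kept = curate(sample)
```

In this toy sample only record "a" survives all three filters. Detecting language from a metadata field is a simplification; entries that "turned out" to be non-English would likely need inspection of the transcript text itself.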

The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this GitHub repository.

Credits:

The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

Files (119.8 MB)

Name	Size	MD5
unrestricted_lemmatized_df.pkl	119.8 MB	94f48c3339940d130b2990635d3bdd55
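
Because the listing publishes an MD5 checksum for the file, a download can be checked before use. The following is a minimal sketch using only the Python standard library; the function names and the local file path are illustrative assumptions, not part of the dataset or its tutorial.

```python
import hashlib

# Checksum as published in the file listing for unrestricted_lemmatized_df.pkl.
EXPECTED_MD5 = "94f48c3339940d130b2990635d3bdd55"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Hypothetical usage, assuming the file was saved to the working directory:
# if verify_download("unrestricted_lemmatized_df.pkl"):
#     print("checksum matches")
```

The `.pkl` extension and `_df` suffix suggest a pickled pandas DataFrame, in which case `pandas.read_pickle` would presumably load it once the checksum matches; that is an inference from the file name rather than something the listing states. Pickle files can execute code when loaded, so they should only be opened when they come from a trusted source.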

Additional details

Funding: European Commission, EHRI-3 – European Holocaust Research Infrastructure (grant 871111)
