Exploratory Topic Modelling in Python Dataset - EHRI-3

Dermentzi, Maria

doi:10.5281/zenodo.6670234

Published June 20, 2022 | Version v2

Dataset Open

Dermentzi, Maria¹

1. King's College London

In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

We looked for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, a considerable number of which have textual transcripts. The museum’s total collection consists of 80,703 testimonies, 41,695 of which are available in English, and 2,894 of those list a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we compiled a list of links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata for each of these testimonies, we assembled our dataset and curated it to remove unwanted entries. For example, we removed entries with restrictions on access or use. We also removed entries whose transcripts consisted only of automatically generated headers, as well as entries that turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial: a small but still decently sized dataset.

The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this GitHub repository.

This Zenodo upload contains two files, each holding a pickled pandas DataFrame that was saved at a different stage of the tutorial:

"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use).
"unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas).

Instructions on their intended use can be found in the accompanying Jupyter Notebook.
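
As a quick orientation before opening the notebook, the following is a minimal sketch of how one of the files could be loaded and checked against the fields listed above. It assumes that the listed fields are stored as DataFrame columns and that the file sits in the current working directory; the helper name load_frame is hypothetical and does not come from the tutorial.

```python
from pathlib import Path

import pandas as pd

# Field names as listed in this record's description.
BASE_COLUMNS = [
    "RG_number",
    "text",
    "display_date",
    "conditions_access",
    "conditions_use",
]


def load_frame(path, require_lemmas=False):
    """Read a pickled DataFrame and check that the documented columns exist.

    Hypothetical helper for illustration only. Unpickling can execute
    arbitrary code, so only load files obtained from a trusted source.
    """
    df = pd.read_pickle(path)
    expected = BASE_COLUMNS + (["lemmas"] if require_lemmas else [])
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


if __name__ == "__main__":
    name = "unrestricted_lemmatized_df.pkl"  # assumed to be in the working directory
    if Path(name).exists():
        df = load_frame(name, require_lemmas=True)
        print(df.shape)   # the record describes 1,873 entries and six fields
        print(df.head())
    else:
        print(f"{name} not found; download it from this record first")
```

Pickle files can be sensitive to the pandas version used to write them, so if loading fails it may help to match the environment described in the notebook.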

Credits:

The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

Files (221.2 MB)

Name	MD5	Size
unrestricted_df.pkl	071a3ad0f007a35ab8a5971056a95d8b	101.4 MB
unrestricted_lemmatized_df.pkl	94f48c3339940d130b2990635d3bdd55	119.8 MB
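
The MD5 checksums listed for the two files can be used to confirm that a download completed intact. The sketch below uses only the Python standard library; the function names md5_of_file and verify_download are hypothetical, and the files are assumed to keep their original names.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the two files in this record.
EXPECTED_MD5 = {
    "unrestricted_df.pkl": "071a3ad0f007a35ab8a5971056a95d8b",
    "unrestricted_lemmatized_df.pkl": "94f48c3339940d130b2990635d3bdd55",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Compare a file's MD5 digest with the value listed for its file name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No checksum listed for {path.name}")
    return md5_of_file(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify_download(name) else "MISMATCH")
        else:
            print(name, "not found in the current directory")
```

Reading in chunks keeps memory use low, which matters little at roughly 100 MB per file but costs nothing. A mismatch usually points to an interrupted or corrupted download rather than a problem with the data itself.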

Additional details

Funding: European Commission, EHRI-3 – European Holocaust Research Infrastructure (grant 871111)

	All versions	This version
Views	443	320
Downloads	67	55
Data volume	8.8 GB	6.8 GB
