Dataset Open Access

Exploratory Topic Modelling in Python Dataset - EHRI-3

Dermentzi, Maria

In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

We looked for datasets that were easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, a considerable number of which have textual transcripts. The museum’s total collection consists of 80,703 testimonies, 41,695 of which are available in English; 2,894 of those list a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we compiled a list of links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata for each of these testimonies, we assembled our dataset and curated it to remove unwanted entries. For example, we removed entries with restrictions on access or use. We also removed entries whose transcripts consisted only of automatically generated headers, as well as entries that turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial: a small, but still decently sized dataset.
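The curation steps described above could be sketched roughly as follows. This is only an illustration, not the procedure from the accompanying notebook: the column names come from the file descriptions in this record, but the "no restrictions" wording and the 50-word threshold for spotting header-only transcripts are assumptions, and the removal of non-English transcripts is not shown.

```python
import pandas as pd

# Assumed wording for unrestricted entries; the actual USHMM metadata may differ.
UNRESTRICTED_PREFIX = "no restrictions"
# Hypothetical cut-off for transcripts that contain only generated headers.
MIN_WORDS = 50


def curate(df, min_words=MIN_WORDS):
    """Keep entries that are unrestricted and whose transcript is long enough."""
    access = df["conditions_access"].fillna("").str.strip().str.lower()
    use = df["conditions_use"].fillna("").str.strip().str.lower()
    access_ok = access.str.startswith(UNRESTRICTED_PREFIX)
    use_ok = use.str.startswith(UNRESTRICTED_PREFIX)

    # Count whitespace-separated tokens in each transcript.
    word_counts = df["text"].fillna("").str.split().str.len()
    long_enough = word_counts >= min_words

    return df[access_ok & use_ok & long_enough].reset_index(drop=True)
```

Calling `curate(df)` on a DataFrame with the fields listed in this record returns a filtered copy with a fresh index; the threshold can be adjusted through `min_words`.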

The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this GitHub repository.

This Zenodo upload contains two files, each containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:

"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)
"unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)

Instructions on their intended use can be found in the accompanying Jupyter Notebook.
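As a minimal sketch of how the two files might be opened outside the notebook, the snippet below loads a pickled DataFrame with `pandas.read_pickle` and checks that the fields listed above are present. The helper name `load_transcripts` and the assumption that the files sit in the working directory are illustrative only. Because unpickling can execute arbitrary code, the files should only be loaded from a source you trust.

```python
from pathlib import Path

import pandas as pd

# Fields listed in this record for each file.
EXPECTED_COLUMNS = {
    "unrestricted_df.pkl": [
        "RG_number", "text", "display_date",
        "conditions_access", "conditions_use",
    ],
    "unrestricted_lemmatized_df.pkl": [
        "RG_number", "text", "display_date",
        "conditions_access", "conditions_use", "lemmas",
    ],
}


def load_transcripts(path):
    """Load a pickled DataFrame and verify the documented fields exist."""
    path = Path(path)
    df = pd.read_pickle(path)
    expected = EXPECTED_COLUMNS.get(path.name, [])
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"{path.name} is missing expected fields: {missing}")
    return df


if __name__ == "__main__":
    # Assumes the file has been downloaded into the current directory.
    target = Path("unrestricted_lemmatized_df.pkl")
    if target.exists():
        transcripts = load_transcripts(target)
        print(transcripts.shape)
        print(transcripts.columns.tolist())
```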


The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

Files (221.2 MB)
File sizes: 101.4 MB, 119.8 MB