Published September 24, 2025 | Version v1
Dataset Open

TPT–PE Thematic Analysis Dataset

  • 1. University of Bologna
  • 2. University of Oslo

Description

Introduction

This dataset accompanies the article “Analyzing the history of physics education in the USA and Europe through natural language processing” by Martina Caramaschi and Tor Ole B. Odden (https://doi.org/10.1103/wfvw-hkyy).

The dataset contains the cleaned and processed text of articles from two major physics education journals:

  • The Physics Teacher (TPT): 7,203 articles (1963–2020)

  • Physics Education (PE): 6,445 articles (1966–2024)

The datasets

To prepare the data for analysis, we first reduced the number of scraped articles by applying several filtering and cleaning steps:

  • Removed very short articles (ads, announcements, etc.).

  • Removed documents without listed authors.

  • Corrected malformed titles and metadata issues.

  • Excluded duplicates, book reviews, and journal business (~7,399 removals).

  • Filtered out articles with specific headers (e.g., ANNOUNCEMENTS, BOOK REVIEWS, LETTERS TO THE EDITOR).

  • Excluded errata, corrections, replies, and other non-research content.

  • Discarded articles under 500 words.
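The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout (dicts with "title", "authors", "header", and "text" keys) and the function names are assumptions made for the example, and only the 500-word threshold and the header names come from the list above.

```python
# Hypothetical record layout: each article is a dict with
# "title", "authors", "header", and "text" keys (not the dataset's schema).
EXCLUDED_HEADERS = {"ANNOUNCEMENTS", "BOOK REVIEWS", "LETTERS TO THE EDITOR"}
MIN_WORDS = 500  # threshold stated in the list above


def keep_article(article):
    """Return True if the article passes the illustrative filters."""
    if not article.get("authors"):
        return False  # no listed authors
    if article.get("header", "").strip().upper() in EXCLUDED_HEADERS:
        return False  # journal business identified by its header
    if len(article.get("text", "").split()) < MIN_WORDS:
        return False  # shorter than the word threshold
    return True


def filter_articles(articles):
    """Drop rejected articles and exact duplicates (same title and text)."""
    seen = set()
    kept = []
    for article in articles:
        key = (article.get("title", "").strip().lower(), article.get("text", ""))
        if key in seen or not keep_article(article):
            continue
        seen.add(key)
        kept.append(article)
    return kept
```

Whitespace splitting is used here as a simple word count; the actual counting rule used for the dataset is not specified in this description.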

With the resulting dataset, we improved the correctness of the texts by removing unneeded material appearing before article titles and by cleaning the articles’ content of incorrect or irrelevant sections. After this preprocessing, we tokenized the cleaned texts and created bigrams to prepare the corpus for topic modeling with latent Dirichlet allocation (LDA).

After filtering, each document was transformed into a list of individual words (tokens). These tokenized representations were then collected and stored in Python pickle format.
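As a rough illustration of storing tokenized documents in pickle format, the sketch below writes a list of token lists to disk and reads it back with the standard library. The helper names and the file path are hypothetical; the dataset files themselves are pickled pandas dataframes, which need pandas installed to load.

```python
import pickle
from pathlib import Path


def save_tokens(tokenized_docs, path):
    """Write a list of token lists to a pickle file."""
    with Path(path).open("wb") as handle:
        pickle.dump(tokenized_docs, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_tokens(path):
    """Read the token lists back.

    Only unpickle files from sources you trust: loading a pickle
    can execute arbitrary code.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```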

Specific datasets included

The following files are included in this dataset:

  • 07_bigrams_combined_V2.pkl – combined dataset of all articles from Physics Education and The Physics Teacher. It is a dataframe built by merging the Physics Education dataframe and The Physics Teacher dataframe, so that a single dataframe contains the articles from both journals. The merged dataframe was then shuffled and re-indexed.

The file is stored as a pickled pandas dataframe containing a list of lists: each row corresponds to one article, and each article is represented as a list of sentences, where each sentence is itself a list of tokens (words).

During the cleaning process we removed stopwords (e.g., if, and, but), punctuation, numbers, and symbols, then lowercased all words, and merged frequent collocations into bigrams (e.g., high school → high_school).

An example representing three sentences from different articles contained in the dataframe is:

[['calibrate', 'laser', 'power', 'meter', 'holographic', 'work'],
 ['robot', 'scientist', 'develop', 'student', 'epistemic_insight', 'lesson'],
 ['new', 'free_fall', 'experiment', 'determine', 'acceleration_gravity', 'kit']]
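Token lists like these could be produced along the lines of the sketch below. It is a simplified stand-in, not the authors' code: the stopword list is a tiny illustrative sample, bigrams are selected by a plain co-occurrence count (the collocation criterion actually used is not stated here), and all function names are hypothetical. For simplicity the text is lowercased before stopwords are removed.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"if", "and", "but", "the", "a", "of", "in", "to", "is"}


def clean_tokens(text):
    """Lowercase, keep alphabetic words only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def frequent_bigrams(docs, min_count=2):
    """Collect adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join each selected adjacent pair into a single 'first_second' token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

The regular expression discards digits, punctuation, and symbols in one pass, which matches the cleaning described above only approximately (for instance, it also splits hyphenated words).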

  • matrix_paper_weights_comb_k20_928.pkl – metadata dataframe containing publication year, title, authors, DOI, and journal for each paper in the combined dataset, in the same order as the processed data (07_bigrams_combined_V2.pkl). In addition, this file includes the LDA topic weights used in the analysis (one column per topic, with per-paper weights summing to 1).
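One way to explore the topic weights is sketched below with pandas. The column names ("year", "topic_0", ...) are assumptions made for this example and should be checked against the real file's df.columns; the toy dataframe stands in for the actual data, which could be loaded with pd.read_pickle as shown in the comment.

```python
import pandas as pd

# With the real file (column names must be checked, they are assumed here):
# df = pd.read_pickle("matrix_paper_weights_comb_k20_928.pkl")


def topic_columns(df, prefix="topic_"):
    """Pick out the topic-weight columns by a hypothetical name prefix."""
    return [c for c in df.columns if str(c).startswith(prefix)]


def check_weights(df, cols, tol=1e-6):
    """Verify that each paper's topic weights sum to 1 within a tolerance."""
    return bool(((df[cols].sum(axis=1) - 1.0).abs() < tol).all())


def dominant_topic(df, cols):
    """Return the highest-weighted topic column for each paper."""
    return df[cols].idxmax(axis=1)


def mean_weight_by_year(df, cols, year_col="year"):
    """Average topic weights per publication year."""
    return df.groupby(year_col)[cols].mean()


# Toy stand-in data, not values from the dataset.
toy = pd.DataFrame(
    {
        "year": [1970, 1970, 2000],
        "topic_0": [0.7, 0.2, 0.5],
        "topic_1": [0.3, 0.8, 0.5],
    }
)
```

Because the metadata rows are in the same order as 07_bigrams_combined_V2.pkl, a row index found here can be used to look up the corresponding tokenized article.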

These files provide the processed data used for the LDA topic modeling and thematic analysis described in the associated article. A notebook that replicates our LDA analysis, with a written explanation of each step and suggestions on how to explore the results, is available in the corresponding public GitHub repository: https://github.com/martinacaramaschi/TPT-PE-thematic-analysis

Files

Files (140.6 MB)

Name Size
md5:516ae7e5c251833d49e6202df4a6386c 137.3 MB
md5:e7fb74687acb1fb93531dba6aea2ec37 3.4 MB

Additional details

Related works

References
Journal article: 10.1103/wfvw-hkyy (DOI)
