Published October 20, 2020 | Version v1
Dataset Open

Science Education Research Topic Modeling Dataset

  • 1. University of Oslo, Center for Computing in Science Education
  • 2. University of Wisconsin-Madison, Department of Curriculum and Instruction

Description

This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

  • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
  • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
  • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).
  • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
  • We removed all stop words, which are words without any semantic meaning on their own (“the”, “in”, “if”, “and”, “but”, etc.), and all single-letter words.
  • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
  • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.
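
To make the cleaning steps above more concrete, here is a minimal sketch of a few of them using only the Python standard library. It is not the authors' pipeline: the stop list, the `min_count` threshold, and the function names `clean_tokens` and `join_bigrams` are all assumptions made for illustration. Lemmatization and part-of-speech tagging are left out, and bi-grams are detected with a simple co-occurrence count rather than whatever scoring the original analysis used.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; the authors' actual list is not given).
STOP_WORDS = {"the", "in", "if", "and", "but", "a", "of", "to", "is"}


def clean_tokens(text):
    """Lowercase, reunite "per cent", drop non-letters, stop words, single letters."""
    text = text.lower().replace("per cent", "percent")
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, symbols, punctuation
    return [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]


def join_bigrams(docs, min_count=2):
    """Merge adjacent word pairs that occur at least min_count times overall.

    A plain count threshold stands in for a real bi-gram scoring method.
    """
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second word of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up example sentences, not text from the dataset.
raw_docs = [
    "Problem solving in the high school: 25 per cent of pupils improved.",
    "The high school course and problem solving.",
]
docs = join_bigrams([clean_tokens(d) for d in raw_docs])
print(docs)
```

On these two made-up sentences, the pairs "problem solving" and "high school" each occur twice, so they are merged into `problem_solving` and `high_school`, while the remaining words stay as single tokens.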

After filtering, each document was turned into a list of individual words (or tokens). These lists were then collected and saved, using the Python pickle format, into the file scied_words_bigrams_V5.pkl.
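
As a rough sketch of how the main data file might be opened, the snippet below loads it with the standard `pickle` module. It assumes the file unpickles to a list with one list of tokens per article, which is inferred from the description above rather than confirmed. The helper name `load_tokens` is hypothetical. Pickle files can execute code when loaded, so only open a copy obtained from a trusted source.

```python
import pickle
from pathlib import Path


def load_tokens(path):
    """Load the pickled corpus (assumed here to be a list of token lists)."""
    with open(path, "rb") as f:
        return pickle.load(f)


corpus_path = Path("scied_words_bigrams_V5.pkl")
if corpus_path.exists():
    docs = load_tokens(corpus_path)
    print(len(docs), "documents")
    print(docs[0][:10])  # first few tokens of the first article, if the assumption holds
```

The accompanying notebook remains the authoritative guide to the file's structure and to the LDA analysis itself.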

In addition to this file, we have also included the following files:

  1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data.
  2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
  3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook. 

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

Notes

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to "communicate TDM Output to third parties as part of original non-commercial research carried out by User, including in articles that describe, analyse and interpret research. Publications or analyses resulting from TDM of Wiley Content may include brief quotations from the original text as permitted under Section 107 or 108 of the 1976 United States Copyright Act in the United States, or as permitted by other applicable national copyright laws internationally. Any such extracts, as well as bibliographic metadata, must cite the original Wiley Content in the form of a DOI link. Permission to reproduce images shall be required in accordance with clause 5." The full text of the agreement is available here: https://olabout.wiley.com/WileyCDA/Section/id-826542.html

Files (218.9 MB)

  • md5:f32cd7d8b069c17add1fada7d3201080 (3.6 MB)
  • md5:d164378444b04cb2cf1ccb02f136a5e1 (3.6 MB)
  • md5:c9efa2b857aab4f7ba2b8285202c7b54 (24.3 kB)
  • md5:40cd7e6d1773c3e53d2aaa17c7ecb673 (2.2 kB)
  • md5:49bd692579ca268998a067b68bc290c2 (102 Bytes)
  • md5:2f73ad2889e449118eaad5adc8e2aefb (1.9 MB)
  • md5:56320f04a0591e134251841b15a21ec7 (203.4 MB)
  • md5:11c4f9046ad3718f5b0c6154e9c78ac9 (6.4 MB)
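
The md5 values listed above can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of` and `matches_listing` are hypothetical, and because the listing does not show which checksum belongs to which file, the pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected


# Example call (the local path is a placeholder, not a file name from the record):
# matches_listing("path/to/downloaded_file", "md5:<value from the listing>")
```

Reading in chunks keeps memory use low, which matters for the largest file in the record (203.4 MB).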

Additional details

Related works

References
Journal article: 10.1103/PhysRevPhysEducRes.16.010142 (DOI)