Science Education Research Topic Modeling Dataset
Authors/Creators
- 1. University of Oslo, Center for Computing in Science Education
- 2. University of Wisconsin-Madison, Department of Curriculum and Instruction
Description
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This left a total of 5577 articles in the dataset. The text of these articles was then cleaned as follows:
- We removed duplicated text from each article. Prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next shared the same page, so we developed an automated procedure that detects article beginnings and endings and removes the duplicated text.
- We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
- We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).
- We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
- We removed all stop words, which are words without any semantic meaning on their own (“the”, “in”, “if”, “and”, “but”, etc.), and all single-letter words.
- We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
- We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.
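Several of the cleaning steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code used to build the dataset: the stop-word list is a tiny hypothetical one, bi-grams are detected with a plain frequency threshold (`min_count` is an assumed parameter), and the lemmatization and part-of-speech tagging step is omitted because it requires an external NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list actually used is not given here.
STOP_WORDS = {"the", "in", "if", "and", "but", "of", "a", "to", "is"}

def clean_tokens(text):
    """Lowercase, keep letters only, drop stop words and single-letter words."""
    text = text.lower().replace("per cent", "percent")
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, symbols, punctuation
    return [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]

def find_bigrams(documents, min_count=2):
    """Return adjacent word pairs seen at least min_count times overall."""
    pair_counts = Counter()
    for tokens in documents:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}

def merge_bigrams(tokens, bigrams):
    """Join each detected pair into a single token with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Two made-up sentences standing in for article text.
raw_documents = [
    "Problem solving in the high school: 40 per cent of pupils improved.",
    "The high school course stresses problem solving and laboratory work.",
]
cleaned = [clean_tokens(text) for text in raw_documents]
bigrams = find_bigrams(cleaned)
documents = [merge_bigrams(tokens, bigrams) for tokens in cleaned]
print(documents)
```

In this toy input, only the pairs "problem solving" and "high school" occur twice, so only they are merged into single tokens.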
After filtering, each document was turned into a list of individual words (or tokens). These lists were then collected and saved (using the Python pickle format) in the file scied_words_bigrams_V5.pkl.
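A minimal sketch of reading this file is shown below. It assumes the pickle holds a sequence of token lists, one per article, as the description suggests; the helper name `load_documents` is hypothetical. Pickle files can execute arbitrary code when loaded, so only unpickle copies obtained from a trusted source.

```python
import pickle
from pathlib import Path

def load_documents(path):
    """Load a pickled object, assumed here to be a list of token lists."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Only attempt the load if the file has been downloaded to this folder.
data_path = Path("scied_words_bigrams_V5.pkl")
if data_path.exists():
    documents = load_documents(data_path)
    print(type(documents), len(documents))
```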
In addition to this file, we have also included the following files:
- SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data.
- Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
- Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.
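Because the metadata file pairs each paper's publication year with its LDA topic weights, one simple way to follow the rise and fall of topics is to average the weights per year. The sketch below illustrates the idea on hypothetical `(year, weights)` records; the actual layout of SciEd_paper_names_weights.pkl is not described here, so the record format and the function name are assumptions.

```python
from collections import defaultdict

# Hypothetical records: (publication year, topic weights for one paper).
records = [
    (1925, [0.75, 0.25]),
    (1925, [0.5, 0.5]),
    (1990, [0.125, 0.875]),
]

def mean_topic_weights_by_year(records):
    """Average the topic weights of all papers published in the same year."""
    sums = {}
    counts = defaultdict(int)
    for year, weights in records:
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        sums[year] = [s + w for s, w in zip(sums[year], weights)]
        counts[year] += 1
    return {
        year: [s / counts[year] for s in sums[year]]
        for year in sorted(sums)
    }

trends = mean_topic_weights_by_year(records)
for year, weights in trends.items():
    print(year, weights)
```

Plotting each topic's yearly average against the year then gives a trend line per topic.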
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Files
Total size: 218.9 MB

| MD5 checksum | Size |
|---|---|
| f32cd7d8b069c17add1fada7d3201080 | 3.6 MB |
| d164378444b04cb2cf1ccb02f136a5e1 | 3.6 MB |
| c9efa2b857aab4f7ba2b8285202c7b54 | 24.3 kB |
| 40cd7e6d1773c3e53d2aaa17c7ecb673 | 2.2 kB |
| 49bd692579ca268998a067b68bc290c2 | 102 Bytes |
| 2f73ad2889e449118eaad5adc8e2aefb | 1.9 MB |
| 56320f04a0591e134251841b15a21ec7 | 203.4 MB |
| 11c4f9046ad3718f5b0c6154e9c78ac9 | 6.4 MB |
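The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following sketch computes a file's MD5 digest with Python's standard `hashlib` module; the function names and the chunk size are illustrative choices.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

To check a download, call `matches_checksum` with the local file path and the checksum listed for that file.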
Additional details
Related works
- References (journal article): 10.1103/PhysRevPhysEducRes.16.010142 (DOI)