Published July 18, 2021 | Version 2.0.0
Dataset Open

Wilcoxon Rank Sum Test and Keyphrase Extraction Data Cited in "What Everyone Says: Public Perceptions of the Humanities in the Media"

Description

This repository contains the Wilcoxon rank sum test and keyphrase extraction data cited in the WhatEvery1Says (WE1S) Project's article "What Everyone Says: Public Perceptions of the Humanities in the Media". The organization of the materials is described below.

Wilcoxon Rank Sum Test

All data and results from Wilcoxon rank sum testing can be found in the wilcoxon-tests folder of the extracted zip file we1s_about_the_humanities.zip. The Wilcoxon rank sum test identifies specific words that appear significantly more often in one group of documents than in another, giving researchers a sense of which words are “distinctive” to each group. Further information on WE1S's use of Wilcoxon rank sum testing can be found at https://we1s.ucsb.edu/wp-content/uploads/M-15-Wilcoxon-Test.pdf.
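As a rough illustration of the kind of comparison the test makes, the sketch below computes a rank sum statistic for one word from two lists of per-document frequencies. It is not the WE1S implementation: the function names are hypothetical, the input format (one frequency per document) is an assumption, and the p-value uses the standard normal approximation without a tie correction.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(group1, group2):
    """Return (z, two-sided p) using the normal approximation.

    A positive z means values in group1 tend to rank higher than in group2.
    """
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    w1 = sum(ranks[:n1])                      # rank sum of the first group
    expected = n1 * (n1 + n2 + 1) / 2         # mean of w1 under the null
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w1 - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return z, p
```

Swapping the two groups flips the sign of the statistic, which is why the sign indicates which group a word is more strongly associated with.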

Each subdirectory in the wilcoxon-tests folder contains the data and results of a particular comparison experiment based on a metadata category, such as whether the data contained articles published by public or private institutions. Each data file is a .txt file representing a sample of the overall data from the collection. The README file provides information on the collection used, the sample size, and the nature of the comparison. The results of each test are in a file called results.csv.

The results.csv file for each test includes a row for each term included in the test. Each row displays the term, the term's raw count in each category compared (count 1 and count 2), the difference between the two counts (count 1 minus count 2), the percentage change in the counts, the Wilcoxon statistic, and the Wilcoxon p-value. Sorting the CSV by the Wilcoxon statistic from greatest to least brings the terms most strongly associated with category 1 to the top (category 1 is the category listed first in the title field of the README.md file for each test), while sorting it from least to greatest brings the terms most strongly associated with category 2 to the top (category 2 is the category listed second). The p-value column indicates how statistically significant each term's difference between the two categories is.
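The sorting described above can be sketched with Python's standard csv module. The column names used here ("term", "wilcoxon_stat", "p_value") are placeholders, as is the sample data; check the header row of the actual results.csv and adjust STAT_COLUMN accordingly.

```python
import csv
import io

# Hypothetical column name; replace with the header used in results.csv.
STAT_COLUMN = "wilcoxon_stat"


def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def top_terms(rows, category=1, n=10, stat_column=STAT_COLUMN):
    """Return the n rows most strongly associated with category 1 or 2.

    Category 1 sorts the statistic from greatest to least;
    category 2 sorts it from least to greatest.
    """
    ranked = sorted(
        rows,
        key=lambda row: float(row[stat_column]),
        reverse=(category == 1),
    )
    return ranked[:n]


# Made-up sample standing in for a real results.csv file.
SAMPLE = """term,wilcoxon_stat,p_value
alpha,4.2,0.001
beta,-3.1,0.002
gamma,0.3,0.760
"""

rows = load_rows(io.StringIO(SAMPLE))
for row in top_terms(rows, category=1, n=2):
    print(row["term"], row[STAT_COLUMN], row["p_value"])
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` to `load_rows` in place of the in-memory sample.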

Keyphrase Extraction

All data and results from keyphrase extraction can be found in the keyphrase-extraction folder of the extracted zip file we1s_about_the_humanities.zip. Keyphrase extraction generates a list of the most significant words or phrases (1-6 words long) within individual documents. WE1S takes the top ten keyphrases in each document and ranks them according to their frequency across the collection. WE1S uses the SGRank algorithm for keyphrase extraction, and because this algorithm is computationally intensive, WE1S limits keyphrases to lemmatized nouns and proper nouns within a window of 70 words to either side of candidate keyphrases. Further information on WE1S's use of keyphrase extraction can be found at https://we1s.ucsb.edu/wp-content/uploads/M-14-Keyphrase-Extraction.pdf.
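The aggregation step (top ten keyphrases per document, then ranking by frequency across the collection) might look like the sketch below. It does not run SGRank itself; it assumes each document already has a list of (keyphrase, score) pairs from some extractor. The function name and input format are hypothetical, and counting each keyphrase once per document is an assumption about how "frequency" is measured.

```python
from collections import Counter


def rank_keyphrases(scored_by_doc, top_k=10):
    """Rank keyphrases by how many documents list them in their top_k.

    scored_by_doc maps a document id to a list of (keyphrase, score) pairs.
    Returns (keyphrase, document_count) pairs, most frequent first.
    """
    counts = Counter()
    for scored in scored_by_doc.values():
        # Keep only the top_k highest-scoring keyphrases for this document.
        best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
        # A set ensures each keyphrase is counted once per document.
        counts.update({phrase for phrase, _score in best})
    return counts.most_common()
```

The resulting pairs have the same shape as the keyphrase and occurrence columns one would expect in a ranked keyphrase list.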

Each subdirectory in the keyphrase-extraction folder contains the data and results of keyphrase extraction on a particular collection. Details of the collection and resulting files can be found in each subdirectory. Each list of keyphrases is in a file called SGRank.csv, which lists the keyphrases and their number of occurrences in the collection. The article additionally cites keyphrases that are shared with the terms in the public topic model produced by Andrew Goldstone and Ted Underwood, “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” New Literary History 45, no. 3 (2014): 359–84, https://doi.org/10.1353/nlh.2014.0025. The list of terms is derived from the public visualization at https://www.sas.rutgers.edu/virtual/ag978/quiet/#/words. Keyphrases extracted from WE1S data were split into single-word terms and compared with the list of vocabulary in Goldstone and Underwood's word list (quiet_transformations_wordlist.txt) to compile lists of shared vocabulary. These lists are given in files called shared_terms.txt.
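The comparison of keyphrases against the word list can be approximated as a set intersection. This is a minimal sketch under stated assumptions: the function name is hypothetical, keyphrases are split on whitespace, the word list is taken to hold one term per line, and matching is case-insensitive, none of which is confirmed by the dataset documentation.

```python
def shared_terms(keyphrases, wordlist_lines):
    """Return the sorted single-word terms found in both inputs.

    keyphrases is an iterable of multi-word strings; wordlist_lines is an
    iterable of lines, each assumed to hold one vocabulary term.
    """
    vocabulary = {line.strip().lower() for line in wordlist_lines if line.strip()}
    # Split every keyphrase into its individual words.
    words = {word.lower() for phrase in keyphrases for word in phrase.split()}
    return sorted(words & vocabulary)
```

Because a file object iterates over its lines, an open handle on quiet_transformations_wordlist.txt can be passed directly as `wordlist_lines`.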

Note that keyphrases were extracted from corpora produced using the Python Textacy library. Because these corpora contain the full text of articles with intellectual property restrictions, they cannot be reproduced here.

Files

we1s_about_the_humanities.zip (20.8 MB)
md5:5add107997f98d0771d873af5d9d4c2b