# README

This repository contains Wilcoxon rank sum test and keyphrase extraction data cited in the WhatEvery1Says (WE1S) Project's article "What Everyone Says About the Humanities: The Challenge Posed by the Public Perception of the Humanities in the Media". The organization of the materials is discussed below.

## Wilcoxon Rank Sum Test

All data and results from Wilcoxon rank sum testing can be found in the `wilcoxon-tests` folder of the extracted zip file `we1s_about_the_humanities.zip`. The Wilcoxon rank sum test identifies specific words that appear significantly more often in one group of documents than in another, giving researchers a sense of which words are “distinctive” to each group. Further information on WE1S's use of Wilcoxon rank sum testing can be found at [https://we1s.ucsb.edu/wp-content/uploads/M-15-Wilcoxon-Test.pdf](https://we1s.ucsb.edu/wp-content/uploads/M-15-Wilcoxon-Test.pdf).
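To illustrate the idea behind the test, the following is a minimal sketch of a rank sum comparison for a single word, using the normal approximation and only the Python standard library. It is not the WE1S implementation: the function names are hypothetical, and the inputs are assumed to be one value per document (for example, the word's relative frequency in that document) for each of the two groups.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(group1, group2):
    """Return (z statistic, two-sided p-value) for a rank sum comparison.

    Uses the normal approximation without a tie correction to the variance.
    A positive statistic means values in group1 tend to be larger.
    """
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Under these assumptions, a word whose per-document values are consistently higher in the first group yields a large positive statistic and a small p-value. SciPy's `scipy.stats.ranksums` offers a maintained implementation of the same kind of test and is likely preferable for real analyses.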

Each subdirectory in the `wilcoxon-tests` folder contains the data and results of a particular comparison experiment based on a metadata category, such as whether the articles were published by public or private institutions. Each data file is a `.txt` file representing a sample of the overall data from the collection. The `README.md` file in each subdirectory provides information on the collection used, the sample size, and the nature of the comparison. The results of the test are in a file called `results.csv`.

The `results.csv` file for each test includes a row for each term included in the test. Each row displays the term, the term's raw count in each of the two categories compared (count 1 and count 2), the difference between the two counts (count 1 minus count 2), the percentage change in the counts, the Wilcoxon statistic, and the Wilcoxon p-value. Sorting the CSV by the Wilcoxon statistic from greatest to least brings the terms most strongly associated with category 1 to the top (category 1 is the category listed first in the title field of the `README.md` file for each test), while sorting from least to greatest brings the terms most strongly associated with category 2 to the top (category 2 is the category listed second). The p-value column indicates how statistically significant each term's difference is: the smaller the p-value, the stronger the evidence that the term's distribution differs between the two categories.
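The sorting described above can be sketched with Python's standard `csv` module. The column names used in the usage comment (`term` and `wilcoxon_stat`) are hypothetical placeholders, since the exact header names in `results.csv` are not listed in this README; check the header row of the file and substitute the real names.

```python
import csv


def top_terms(lines, stat_column, term_column, category=1, n=10):
    """Return the n terms most strongly associated with category 1 or 2.

    `lines` is any iterable of CSV lines, such as an open file handle.
    `stat_column` and `term_column` must match the file's actual headers.
    """
    rows = list(csv.DictReader(lines))
    # Greatest-to-least surfaces category 1; least-to-greatest surfaces category 2.
    rows.sort(key=lambda row: float(row[stat_column]), reverse=(category == 1))
    return [(row[term_column], float(row[stat_column])) for row in rows[:n]]


# Example usage (column names are assumptions, not confirmed headers):
# with open("results.csv", newline="", encoding="utf-8") as handle:
#     print(top_terms(handle, stat_column="wilcoxon_stat", term_column="term"))
```

Passing `category=2` reverses the sort so that the terms most strongly associated with the second category come first.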

## Keyphrase Extraction

All data and results from keyphrase extraction can be found in the `keyphrase-extraction` folder of the extracted zip file `we1s_about_the_humanities.zip`. Keyphrase extraction generates a list of the most significant words or phrases (1-6 words long) within individual documents. WE1S takes the top ten keyphrases in each document and ranks them according to their frequency across the collection. WE1S uses the SGRank algorithm for keyphrase extraction, and because this algorithm is computationally intensive, WE1S limits keyphrases to lemmatized nouns and proper nouns within a window of 70 words to either side of candidate keyphrases. Further information on WE1S's use of keyphrase extraction can be found at [https://we1s.ucsb.edu/wp-content/uploads/M-14-Keyphrase-Extraction.pdf](https://we1s.ucsb.edu/wp-content/uploads/M-14-Keyphrase-Extraction.pdf).
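The collection-level ranking step can be sketched as follows. This is only an illustration of the aggregation, not the WE1S code: the function name is hypothetical, it does not run SGRank itself, and it assumes each document's keyphrases have already been extracted and ordered from most to least significant.

```python
from collections import Counter


def rank_keyphrases(per_document_keyphrases, top_n=10):
    """Count how often each keyphrase appears among documents' top_n lists.

    `per_document_keyphrases` is an iterable of lists, one per document,
    each ordered from most to least significant keyphrase.
    Returns (keyphrase, count) pairs sorted from most to least frequent.
    """
    counts = Counter()
    for keyphrases in per_document_keyphrases:
        # Keep only the top_n keyphrases of each document before counting.
        counts.update(keyphrases[:top_n])
    return counts.most_common()
```

The resulting pairs have the same general shape as the keyphrase and occurrence columns described for `SGRank.csv`, though the exact layout of that file should be checked directly.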

Each subdirectory in the `keyphrase-extraction` folder contains the data and results of keyphrase extraction on a particular collection. Details of the collection and resulting files can be found in each subdirectory. Each list of keyphrases is in a file called `SGRank.csv`, which lists the keyphrases and their number of occurrences in the collection. The article additionally cites keyphrases that are shared with the terms in the public topic model produced by Andrew Goldstone and Ted Underwood, “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” _New Literary History_ 45, no. 3 (2014): 359–84, https://doi.org/10.1353/nlh.2014.0025. The list of terms is derived from the public visualization at [https://www.sas.rutgers.edu/virtual/ag978/quiet/#/words](https://www.sas.rutgers.edu/virtual/ag978/quiet/#/words). Keyphrases extracted from WE1S data were split into single-word terms and compared with Goldstone and Underwood's word list (`quiet_transformations_wordlist.txt`) to compile lists of shared vocabulary. These lists are given in files called `shared_terms.txt`.
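The comparison with the word list can be sketched as a set intersection. This is a minimal sketch under stated assumptions rather than the procedure actually used: the function name is hypothetical, keyphrases are split on whitespace and lowercased, and the word list is assumed to contain one term per line.

```python
def shared_terms(keyphrases, wordlist_lines):
    """Return the sorted single-word terms found in both sources.

    `keyphrases` is an iterable of keyphrase strings.
    `wordlist_lines` is an iterable of lines, assumed one term per line.
    """
    # Split every keyphrase into lowercase single-word terms.
    keyphrase_words = {
        word for phrase in keyphrases for word in phrase.lower().split()
    }
    # Normalize the word list the same way, skipping blank lines.
    vocabulary = {line.strip().lower() for line in wordlist_lines if line.strip()}
    return sorted(keyphrase_words & vocabulary)


# Example usage (the file format is an assumption):
# with open("quiet_transformations_wordlist.txt", encoding="utf-8") as handle:
#     print(shared_terms(["liberal arts", "higher education"], handle))
```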

Note that keyphrases were extracted from corpora produced using the Python [Textacy](https://textacy.readthedocs.io/en/latest/index.html) library. Because these corpora contain the full text of articles subject to intellectual property restrictions, they cannot be reproduced here.
