qss.Rmd
library("quanteda")
In this vignette we show how the quanteda package can be used to replicate the text analysis part (Chapter 5.1) from Kosuke Imai’s book Quantitative Social Science: An Introduction (Princeton: Princeton University Press, 2017).
To get the textual data, you first need to install the qss package that accompanies the book.
devtools::install_github("kosukeimai/qss-package", build_vignettes = TRUE)
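The corpus object corpus_raw used in the next step is not constructed in this excerpt. Below is a minimal sketch of how it might be built, assuming the 85 essays are available as plain-text files inside the installed qss package; the file location and naming, the use of the readtext package, and the exact authorship coding are all assumptions and may differ from the full analysis.
library("readtext")
# read the essays into a corpus; the path and file pattern are assumptions,
# and the files are assumed to sort in essay order (No. 1 through No. 85)
data_dir <- system.file("extdata", "federalist", package = "qss")
corpus_raw <- corpus(readtext(file.path(data_dir, "*.txt")))
docnames(corpus_raw) <- paste("No.", seq_len(ndoc(corpus_raw)))
# record authorship as a document variable, following the attributions used
# in the book; the 11 essays with disputed authorship are left as NA
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)
madison  <- c(10, 14, 37:48, 58)
jay      <- c(2:5, 64)
joint    <- 18:20  # essays written jointly by Hamilton and Madison
author <- rep(NA_character_, ndoc(corpus_raw))
author[hamilton] <- "hamilton"
author[madison]  <- "madison"
author[jay]      <- "jay"
author[joint]    <- "joint"
docvars(corpus_raw, "author") <- author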
Next, we transform the corpus into a document-feature matrix. dfm_prep (used in Sections 5.1.4 and 5.1.5) is a dfm in which numbers and punctuation have been removed and all terms have been converted to lowercase. In dfm_papers, the words have additionally been stemmed and a standard set of English stopwords removed.
# transform corpus to a document-feature matrix
dfm_prep <- dfm(corpus_raw, remove_numbers = TRUE, tolower = TRUE,
                remove_punct = TRUE, verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 85 documents, 8,630 features
## ... created a 85 x 8,630 sparse dfm
## ... complete.
## Elapsed time: 0.153 seconds.
# remove stop words and stem words
dfm_papers <- dfm(dfm_prep, stem = TRUE, remove = stopwords("english"))
# inspect
dfm_papers
## Document-feature matrix of: 85 documents, 4,859 features (89.1% sparse).
# sort into alphabetical order of features, to match book example
dfm_papers <- dfm_papers[, order(featnames(dfm_papers))]
# inspect some documents in the dfm
head(dfm_papers, nf = 8)
## Document-feature matrix of: 6 documents, 8 features (97.9% sparse).
## 6 x 8 sparse Matrix of class "dfm"
## features
## docs 1st 2d 3d 4th 5th abandon abat abb
## No. 1 0 0 0 0 0 0 0 0
## No. 2 0 0 0 0 0 0 0 0
## No. 3 0 0 0 0 0 0 0 0
## No. 4 0 0 0 0 0 0 0 0
## No. 5 1 0 0 0 0 0 0 0
## No. 6 0 0 0 0 0 0 0 0
The tm package considers features such as “1st” to be numbers (and removes them), whereas quanteda does not. We can remove these easily using a regular expression:
dfm_papers <- dfm_remove(dfm_papers, "[0-9]", valuetype = "regex", verbose = TRUE)
## removed 5 features
head(dfm_papers, nf = 8)
## Document-feature matrix of: 6 documents, 8 features (91.7% sparse).
## 6 x 8 sparse Matrix of class "dfm"
## features
## docs abandon abat abb abet abhorr abil abject abl
## No. 1 0 0 0 0 0 0 0 1
## No. 2 0 0 0 0 0 1 0 0
## No. 3 0 0 0 0 0 0 0 2
## No. 4 0 0 0 0 0 0 0 1
## No. 5 0 0 0 0 0 0 0 0
## No. 6 0 0 0 0 0 0 0 0
We can use the textplot_wordcloud()
function to plot word clouds of the most frequent words in Papers 12 and 24.
set.seed(100)
textplot_wordcloud(dfm_papers[c("No. 12", "No. 24"), ],
                   max.words = 50, comparison = TRUE)
Since quanteda cannot do stem completion, we will skip that part.
Next, we identify clusters of similar essays by weighting the dfm by term frequency-inverse document frequency (tf-idf) and applying the \(k\)-means algorithm to the weighted dfm.
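With dfm_tfidf()'s default weighting (raw term counts multiplied by inverse document frequency) and base = 2, the weight assigned to term \(j\) in document \(i\) is \(\mathrm{tf}_{ij} \times \log_2(N / \mathrm{df}_j)\), where \(N\) is the number of documents in the corpus and \(\mathrm{df}_j\) is the number of documents in which term \(j\) appears.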
# tf-idf calculation
dfm_papers_tfidf <- dfm_tfidf(dfm_papers, base = 2)
# 10 most important words for Paper No. 12
topfeatures(dfm_papers_tfidf[12, ], n = 10)
## revenu contraband patrol excis coast trade
## 19.42088 19.22817 19.22817 19.12214 16.22817 15.01500
## per tax cent gallon
## 14.47329 13.20080 12.81878 12.81878
# 10 most important words for Paper No. 24
topfeatures(dfm_papers_tfidf[24, ], n = 10)
## garrison dock-yard settlement spain armi frontier
## 24.524777 19.228173 16.228173 13.637564 12.770999 12.262389
## arsenal western post nearer
## 10.818782 10.806108 10.228173 9.648857
We can replicate the book's clustering analysis of the essays written by Hamilton as follows:
k <- 4 # number of clusters
# subset The Federalist papers written by Hamilton
dfm_papers_tfidf_hamilton <- dfm_subset(dfm_papers_tfidf, author == "hamilton")
# run k-means
km_out <- stats::kmeans(dfm_papers_tfidf_hamilton, centers = k)
km_out$iter # check the convergence; number of iterations may vary
## [1] 3
colnames(km_out$centers) <- featnames(dfm_papers_tfidf_hamilton)
for (i in 1:k) { # loop for each cluster
    cat("CLUSTER", i, "\n")
    cat("Top 10 words:\n") # 10 most important terms at the centroid
    print(head(sort(km_out$centers[i, ], decreasing = TRUE), n = 10))
    cat("\n")
    cat("Federalist Papers classified: \n") # extract essays classified
    print(docnames(dfm_papers_tfidf_hamilton)[km_out$cluster == i])
    cat("\n")
}
## CLUSTER 1
## Top 10 words:
## court appel jurisdict suprem juri tribun cogniz
## 69.68857 35.27513 25.46591 24.79126 22.16104 21.27125 19.12214
## inferior appeal re-examin
## 18.76875 16.21098 13.52348
##
## Federalist Papers classified:
## [1] "No. 81" "No. 82"
##
## CLUSTER 2
## Top 10 words:
## juri trial court crimin admiralti equiti
## 218.20102 84.74567 62.47940 42.06871 40.87463 38.24428
## chanceri common-law probat civil
## 37.86574 27.04695 27.04695 26.77843
##
## Federalist Papers classified:
## [1] "No. 83"
##
## CLUSTER 3
## Top 10 words:
## senat upon presid armi court claus militia offic
## 3.912991 3.640413 3.100660 2.919085 2.776862 2.636133 2.428567 2.374450
## governor appoint
## 2.294413 2.125990
##
## Federalist Papers classified:
## [1] "No. 1" "No. 6" "No. 7" "No. 8" "No. 9" "No. 13" "No. 15"
## [8] "No. 16" "No. 17" "No. 21" "No. 22" "No. 23" "No. 24" "No. 25"
## [15] "No. 26" "No. 27" "No. 28" "No. 29" "No. 30" "No. 31" "No. 32"
## [22] "No. 33" "No. 34" "No. 36" "No. 59" "No. 60" "No. 61" "No. 65"
## [29] "No. 66" "No. 67" "No. 68" "No. 69" "No. 70" "No. 71" "No. 72"
## [36] "No. 73" "No. 74" "No. 75" "No. 76" "No. 77" "No. 78" "No. 79"
## [43] "No. 80" "No. 84" "No. 85"
##
## CLUSTER 4
## Top 10 words:
## trade merchant manufactur market commerc navig
## 18.351669 18.010180 14.037686 13.637564 11.214252 10.816517
## revenu landhold navi ship
## 9.416185 8.545855 8.233234 8.040714
##
## Federalist Papers classified:
## [1] "No. 11" "No. 12" "No. 35"
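Note that kmeans() starts from a random initialization, so the cluster numbering (and the number of iterations reported above) can vary between runs. Setting a seed before the stats::kmeans() call makes the assignment reproducible, for example:
set.seed(1234) # any fixed value works; 1234 is arbitrary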
Next, following the book, we fit a linear regression of authorship (coded +1 for essays by Hamilton and -1 for essays by Madison) on the frequencies of the words upon, there, consequently, and whilst, and assess how well the model fits the data by classifying each essay according to the sign of its fitted value.
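The objects used in the remainder of this section (tfm, author_data, author_data_known, hm_fit, and hm_fitted) are not constructed in this excerpt. Below is a rough sketch of how they might be built, assuming the book's approach of measuring each predictor word's rate of occurrence per 1,000 words; the exact construction in the full analysis may differ.
# term-frequency dfm for the four predictor words, based on the unstemmed
# dfm_prep; the proportional weighting is an assumption
words <- c("upon", "there", "consequently", "whilst")
tfm <- dfm_select(dfm_weight(dfm_prep, scheme = "prop"), pattern = words)
# one row per essay: the authorship docvar plus the four word rates
word_rates <- convert(tfm, to = "data.frame")
word_rates[words] <- word_rates[words] * 1000  # rescale to rates per 1,000 words
author_data <- data.frame(author = docvars(corpus_raw, "author"), word_rates)
author_data$paper_numeric <- seq_len(nrow(author_data))  # essay number, used for plotting
# code authorship numerically: +1 for Hamilton, -1 for Madison, NA otherwise
author_data$author_numeric <- ifelse(author_data$author == "hamilton", 1,
                                     ifelse(author_data$author == "madison", -1, NA))
# essays whose authorship (Hamilton or Madison) is known
author_data_known <- subset(author_data, !is.na(author_numeric))
# linear regression of authorship on the four word rates
hm_fit <- lm(author_numeric ~ upon + there + consequently + whilst,
             data = author_data_known)
hm_fitted <- fitted(hm_fit)  # fitted values for the known essays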
# proportion of correctly classified essays by Hamilton
mean(hm_fitted[author_data_known$author == "hamilton"] > 0)
## [1] 1
# proportion of correctly classified essays by Madison
mean(hm_fitted[author_data_known$author == "madison"] < 0)
## [1] 1
Because in-sample fit can be overly optimistic, we also evaluate the model with leave-one-out cross-validation: each essay is removed in turn, the model is refit on the remaining essays, and the left-out essay's authorship is then predicted from the refitted model.
n <- nrow(author_data_known)
hm_classify <- rep(NA, n) # a container vector with missing values
for (i in 1:n) {
    # fit the model to the data after removing the ith observation
    sub_fit <- lm(author_numeric ~ upon + there + consequently + whilst,
                  data = author_data_known[-i, ]) # exclude ith row
    # predict the authorship for the ith observation
    hm_classify[i] <- predict(sub_fit, newdata = author_data_known[i, ])
}
# proportion of correctly classified essays by Hamilton
mean(hm_classify[author_data_known$author == "hamilton"] > 0)
## [1] 1
# proportion of correctly classified essays by Madison
mean(hm_classify[author_data_known$author == "madison"] < 0)
## [1] 1
We now turn to the 11 essays with disputed authorship and use the fitted model to predict the authorship of every essay.
disputed <- c(49, 50:57, 62, 63) # 11 essays with disputed authorship
tf_disputed <- dfm_subset(tfm, is.na(author)) %>%
    convert(to = "data.frame")
author_data$prediction <- predict(hm_fit, newdata = author_data)
author_data$prediction # predicted values
## [1] 0.73682138 -0.13810334 -1.35072048 -0.03823549 -0.27285552
## [6] 0.71537231 1.33178340 0.19414036 0.37516964 -0.20369832
## [11] 0.67584796 0.99478176 1.39348825 -1.18908379 1.20446180
## [16] 0.63999272 0.91390560 -0.38341254 -0.62471327 -0.12453888
## [21] 0.91552544 1.08097849 0.88528696 1.01192444 0.08226598
## [26] 0.72338300 1.08177034 0.71354396 1.47718825 1.53672871
## [31] 1.85572895 0.71920518 1.07664097 1.26163462 0.90800379
## [36] 1.06339810 -0.81826934 -0.56126822 -0.27285552 -0.10309170
## [41] -0.47343432 -0.17492007 -0.60654100 -0.86944061 -1.67765248
## [46] -0.91487748 -0.13268399 -0.13500254 -0.96784742 -0.06907196
## [51] -1.46790700 -0.27285552 -0.54229524 -0.54529610 0.04081584
## [56] -0.56418213 -1.18177840 -0.41902525 0.55200092 0.98676075
## [61] 0.59237415 -0.97795312 -0.21203157 -0.18295716 1.15640459
## [66] 1.12403237 0.78348571 0.11197467 1.12824159 0.70981781
## [71] 0.34751104 0.84257351 1.24182206 0.91612261 0.50276872
## [76] 1.05480578 1.05926761 0.90585578 0.54292995 0.64035410
## [81] 0.91773469 0.31189155 0.80869221 0.73649069 1.00895939
Finally, we plot the predicted values for each Federalist Paper using the ggplot2 package.
author_data$author_plot <- ifelse(is.na(author_data$author), "unknown",
                                  as.character(author_data$author))
library(ggplot2)
ggplot(data = author_data, aes(x = paper_numeric,
                               y = prediction,
                               shape = author_plot,
                               colour = author_plot)) +
    geom_point(size = 2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    labs(x = "Federalist Papers", y = "Predicted values") +
    theme_minimal() +
    theme(legend.title = element_blank())