phrase.Rmd
quanteda has the functionality to select, remove, or compound multi-word expressions, such as phrasal verbs ("try on", "wake up", etc.) and place names ("New York", "South Korea", etc.).
```r
library(quanteda)
toks <- tokens(data_corpus_inaugural)
```
Functions for tokens objects take a character vector, a dictionary, or collocations as `pattern`. All three can be used for multi-word expressions, but you should be aware of their differences.

The most basic way to define multi-word expressions is to separate the words by whitespace and wrap the character vector in `phrase()`.
```r
multiword <- c("United States", "New York")
```
`kwic()` is useful for finding multi-word expressions in tokens. If you are not sure whether "United" and "States" are separated, check their positions (e.g. "433:434").

```r
head(kwic(toks, pattern = phrase(multiword)))
```

```
##
##  [1789-Washington, 433:434]      of the people of the | United States |
##  [1789-Washington, 529:530]    more than those of the | United States |
##       [1797-Adams, 524:525] saw the Constitution of the | United States |
##     [1797-Adams, 1716:1717] to the Constitution of the | United States |
##     [1797-Adams, 2480:2481] support the Constitution of the | United States |
##   [1805-Jefferson, 441:442]  sees a taxgatherer of the | United States |
##
##  a Government instituted by themselves
##  . Every step by which
##  in a foreign country.
##  , and a conscientious determination
##  , I entertain no doubt
##  ? These contributions enable us
```
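To see why `phrase()` is needed here, compare a pattern with and without it (an illustrative aside, not part of the original vignette):

```r
# Without phrase(), "United States" is treated as one fixed pattern;
# it can never match, because tokens() has split the text into the
# separate tokens "United" and "States"
head(kwic(toks, pattern = multiword))

# With phrase(), each whitespace-separated element is matched as a
# sequence of consecutive tokens
head(kwic(toks, pattern = phrase(multiword)))
```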
Similarly, you can select or remove multi-word expressions using `tokens_select()`.

```r
head(tokens_select(toks, pattern = phrase(multiword)))
```
```
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "United" "States" "United" "States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United" "States" "United" "States" "United" "States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United" "States"
##
## 1809-Madison :
## [1] "United" "States" "United" "States"
```
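The removal case mentioned above works the same way via `tokens_remove()`, which is `tokens_select()` with `selection = "remove"`. A minimal sketch (not from the original vignette):

```r
# drop every occurrence of the multi-word patterns
toks_removed <- tokens_remove(toks, pattern = phrase(multiword))

# the sequence "United" "States" should no longer be found
head(kwic(toks_removed, pattern = phrase(multiword)))
```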
`tokens_compound()` joins the elements of multi-word expressions with an underscore, so they become "United_States" and "New_York".

```r
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))
head(tokens_select(comp_toks, pattern = c("United_States", "New_York")))
```
```
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "United_States" "United_States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United_States" "United_States" "United_States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United_States"
##
## 1809-Madison :
## [1] "United_States" "United_States"
```
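If the underscore is not suitable, `tokens_compound()` takes a `concatenator` argument; `"_"` is the default used throughout this vignette. A short sketch:

```r
# join the parts with "+" instead of the default "_"
comp_plus <- tokens_compound(toks, pattern = phrase(multiword),
                             concatenator = "+")
head(tokens_select(comp_plus, pattern = "United+States"))
```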
Elements of multi-word expressions should be separated by whitespace in a dictionary, but you do not use `phrase()` here.

```r
multiword_dict <- dictionary(list(country = "United States", city = "New York"))
head(tokens_lookup(toks, dictionary = multiword_dict))
```
```
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "country" "country"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "country" "country" "country"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "country"
##
## 1809-Madison :
## [1] "country" "country"
```
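Once the lookup has replaced phrases with dictionary keys, counting them per document is straightforward (an illustrative sketch, not part of the original vignette):

```r
# tabulate dictionary keys per document in a document-feature matrix
dfmat <- dfm(tokens_lookup(toks, dictionary = multiword_dict))

# frequencies of "country" and "city" in the first few speeches
head(dfmat)
```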
With `textstat_collocations()`, it is possible to discover multi-word expressions through statistical scoring of the associations of adjacent words. If `textstat_collocations()` is applied to a tokens object comprising only capitalized words, it usually returns multi-word proper names.

```r
col <- toks %>%
  tokens_remove(stopwords("en")) %>%
  tokens_select(pattern = "^[A-Z]", valuetype = "regex",
                case_insensitive = FALSE, padding = TRUE) %>%
  textstat_collocations(min_count = 5, tolower = FALSE)
head(col)
```
```
##          collocation count count_nested length   lambda        z
## 1      United States   157            0      2 8.643955 28.47365
## 2 Federal Government    32            0      2 5.575158 21.87650
## 3       Almighty God    15            0      2 7.104604 18.21044
## 4      Chief Justice    13            0      2 8.841726 18.11272
## 5 Constitution United    19            0      2 4.032441 15.99701
## 6        North South     8            0      2 8.151032 15.41085
```
Collocations are automatically recognized as multi-word expressions by `tokens_compound()` in case-sensitive fixed pattern matching. This is the fastest way to compound large numbers of multi-word expressions, but you must set `tolower = FALSE` in `textstat_collocations()` for this to work.

```r
comp_toks2 <- tokens_compound(toks, pattern = col)
head(kwic(comp_toks2, pattern = c("United_States", "New_York")))
```
```
##
##  [1789-Washington, 433]      of the people of the | United_States |
##  [1789-Washington, 528]    more than those of the | United_States |
##       [1797-Adams, 524] saw the Constitution of the | United_States |
##      [1797-Adams, 1715] to the Constitution of the | United_States |
##      [1797-Adams, 2478] support the Constitution of the | United_States |
##   [1805-Jefferson, 441]  sees a taxgatherer of the | United_States |
##
##  a Government instituted by themselves
##  . Every step by which
##  in a foreign country.
##  , and a conscientious determination
##  , I entertain no doubt
##  ? These contributions enable us
```
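In practice you may want to compound only the statistically strongest collocations. Since `textstat_collocations()` returns a data frame, ordinary subsetting works; a sketch, where the z cutoff of 10 is an arbitrary choice for illustration:

```r
# keep only collocations with a large z score before compounding
col_strong <- col[col$z > 10, ]
comp_strong <- tokens_compound(toks, pattern = col_strong)
```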
You can use `phrase()` on collocations if more flexibility is needed. This is usually the case when you compound tokens from a different corpus.

```r
comp_toks3 <- tokens_compound(toks, pattern = phrase(col$collocation))
head(kwic(comp_toks3, pattern = c("United_States", "New_York")))
```
```
##
##  [1789-Washington, 433]      of the people of the | United_States |
##  [1789-Washington, 528]    more than those of the | United_States |
##       [1797-Adams, 524] saw the Constitution of the | United_States |
##      [1797-Adams, 1715] to the Constitution of the | United_States |
##      [1797-Adams, 2478] support the Constitution of the | United_States |
##   [1805-Jefferson, 441]  sees a taxgatherer of the | United_States |
##
##  a Government instituted by themselves
##  . Every step by which
##  in a foreign country.
##  , and a conscientious determination
##  , I entertain no doubt
##  ? These contributions enable us
```