quanteda can select, remove or compound multi-word expressions, such as phrasal verbs (“try on”, “wake up”) and place names (“New York”, “South Korea”).
library(quanteda)
toks <- tokens(data_corpus_inaugural)
Functions for tokens objects take a character vector, a dictionary or collocations as pattern. All three can be used for multi-word expressions, but you need to be aware of their differences.
The most basic way to define multi-word expressions is to separate the words with whitespace and wrap the character vector in phrase().
multiword <- c('United States', 'New York')
kwic() is useful for finding multi-word expressions in tokens. If you are not sure whether ‘United’ and ‘States’ are separate tokens, check their positions (e.g. ‘434:435’).
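As a minimal sketch of this kind of search, the call might look like the following. It assumes the toks and multiword objects defined earlier, which are recreated here so that the snippet runs on its own; the name kw is introduced only for illustration.

```r
library(quanteda)

# Recreate the objects used in this tutorial so the snippet is self-contained
toks <- tokens(data_corpus_inaugural)
multiword <- c("United States", "New York")

# phrase() splits each element on whitespace, so "United States" is matched
# as the token "United" followed by the token "States"
kw <- kwic(toks, pattern = phrase(multiword))
head(kw)
```

Without phrase(), each element would be compared against single tokens, so a pattern containing a space would not match tokens that were split on whitespace.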
##
## [1789-Washington, 434:435] of the people of the |
## [1789-Washington, 530:531] more than those of the |
## [1797-Adams, 525:526] saw the Constitution of the |
## [1797-Adams, 1717:1718] to the Constitution of the |
## [1797-Adams, 2481:2482] support the Constitution of the |
## [1805-Jefferson, 441:442] sees a taxgatherer of the |
##
## United States | a Government instituted by themselves
## United States | . Every step by which
## United States | in a foreign country.
## United States | , and a conscientious determination
## United States | , I entertain no doubt
## United States | ? These contributions enable us
Similarly, you can select or remove multi-word expressions using tokens_select().
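A sketch of how such a selection might be written, again assuming the toks and multiword objects from earlier (recreated here). The names toks_kept and toks_dropped are illustrative only.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)
multiword <- c("United States", "New York")

# Keep only the tokens that belong to a matched sequence
toks_kept <- tokens_select(toks, pattern = phrase(multiword))
head(toks_kept)

# tokens_remove() is the counterpart that drops the matched sequences instead
toks_dropped <- tokens_remove(toks, pattern = phrase(multiword))
```

Documents without a match are kept as empty documents rather than dropped, which is why some entries print as character(0).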
## tokens from 6 documents.
## 1789-Washington :
## [1] "United" "States" "United" "States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United" "States" "United" "States" "United" "States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United" "States"
##
## 1809-Madison :
## [1] "United" "States" "United" "States"
tokens_compound() joins the elements of multi-word expressions with an underscore, so they become ‘United_States’ and ‘New_York’.
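A minimal sketch of compounding followed by a selection of the compounded tokens, assuming the same toks and multiword objects (recreated here). The name comp_toks is an assumption made for this example.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)
multiword <- c("United States", "New York")

# Join each matched sequence into a single token, using "_" by default
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))

# The compounds are now single tokens, so phrase() is no longer needed
head(tokens_select(comp_toks, pattern = c("United_States", "New_York")))
```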
## tokens from 6 documents.
## 1789-Washington :
## [1] "United_States" "United_States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United_States" "United_States" "United_States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United_States"
##
## 1809-Madison :
## [1] "United_States" "United_States"
Elements of multi-word expressions should be separated by whitespace in a dictionary, but you do not use phrase() here.
multiword_dict <- dictionary(list(country = 'United States',
                                  city = 'New York'))
head(tokens_lookup(toks, multiword_dict))
## tokens from 6 documents.
## 1789-Washington :
## [1] "country" "country"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "country" "country" "country"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "country"
##
## 1809-Madison :
## [1] "country" "country"
With textstat_collocations(), it is possible to discover multi-word expressions through statistical scoring of the associations of adjacent words.
If textstat_collocations() is applied to a tokens object comprised only of capitalized words, it usually returns multi-word proper names.
col <- toks %>%
head(col)
## collocation count count_nested length lambda z
## 1 United States 157 0 2 8.648770 28.48952
## 2 Federal Government 32 0 2 5.579973 21.89541
## 3 Almighty God 15 0 2 7.109404 18.22275
## 4 Chief Justice 13 0 2 8.846522 18.12255
## 5 Constitution United 19 0 2 4.037256 16.01611
## 6 North South 8 0 2 8.155828 15.41992
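The pipeline that builds col is only partly shown above. The following is a sketch of one way to obtain a comparable table, not necessarily the original call. It assumes that stopwords are removed first, that only tokens starting with an upper-case letter are kept (with padding, so that words that were not adjacent are not scored as collocations), and that min_count = 5. With quanteda 3.0 or later, textstat_collocations() is provided by the quanteda.textstats package. The exact counts and scores depend on these choices and on the version of the corpus.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3.0

toks <- tokens(data_corpus_inaugural)

# Assumption: drop English stopwords before scoring
toks_nostop <- tokens_remove(toks, pattern = stopwords("en"))

# Keep only capitalized tokens; padding = TRUE leaves empty placeholders
# so that removed words still separate the remaining ones
toks_cap <- tokens_select(toks_nostop, pattern = "^[A-Z]",
                          valuetype = "regex",
                          case_insensitive = FALSE,
                          padding = TRUE)

# tolower = FALSE preserves case, which matters for later compounding;
# min_count = 5 is an assumed threshold
col <- textstat_collocations(toks_cap, min_count = 5, tolower = FALSE)
head(col)
```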
Collocations are automatically recognized as multi-word expressions by tokens_compound() in case-sensitive fixed pattern matching. This is the fastest way to compound large numbers of multi-word expressions, but make sure to set tolower = FALSE in textstat_collocations() for this to work.
comp_toks2 <- tokens_compound(toks, col)
##
## [1789-Washington, 434] of the people of the | United_States |
## [1789-Washington, 529] more than those of the | United_States |
## [1797-Adams, 525] saw the Constitution of the | United_States |
## [1797-Adams, 1716] to the Constitution of the | United_States |
## [1797-Adams, 2479] support the Constitution of the | United_States |
## [1805-Jefferson, 441] sees a taxgatherer of the | United_States |
##
## a Government instituted by themselves
## . Every step by which
## in a foreign country.
## , and a conscientious determination
## , I entertain no doubt
## ? These contributions enable us
You can use phrase() on collocations if more flexibility is needed. This is usually the case if you compound tokens from a different corpus.
##
## [1789-Washington, 434] of the people of the | United_States |
## [1789-Washington, 529] more than those of the | United_States |
## [1797-Adams, 525] saw the Constitution of the | United_States |
## [1797-Adams, 1716] to the Constitution of the | United_States |
## [1797-Adams, 2479] support the Constitution of the | United_States |
## [1805-Jefferson, 441] sees a taxgatherer of the | United_States |
##
## a Government instituted by themselves
## . Every step by which
## in a foreign country.
## , and a conscientious determination
## , I entertain no doubt
## ? These contributions enable us
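To illustrate what the extra flexibility of phrase() can mean in practice, here is a small sketch on a made-up sentence. The text and the names toks_new, col_strings, comp_flex and comp_strict are hypothetical; col_strings stands in for the strings that would come from col$collocation.

```r
library(quanteda)

# Hypothetical tokens from another corpus, with inconsistent capitalization
toks_new <- tokens("The united states and the United States signed it.")

# Collocation strings, hard-coded here for illustration
col_strings <- c("United States", "Federal Government")

# With phrase(), the usual pattern options apply; the default matching
# (glob, case-insensitive) compounds both spellings
comp_flex <- tokens_compound(toks_new, pattern = phrase(col_strings))
as.list(comp_flex)

# Case-sensitive matching compounds only the capitalized form
comp_strict <- tokens_compound(toks_new, pattern = phrase(col_strings),
                               case_insensitive = FALSE)
as.list(comp_strict)
```

Passing the strings through phrase() lets you choose valuetype and case_insensitive yourself, at the cost of the speed that direct matching on a collocations object offers.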