Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
```r
textstat_collocations(x, method = "lambda", size = 2, min_count = 1,
                      smoothing = 0.5, tolower = TRUE, ...)

is.collocations(x)
```
| Argument | Description |
|---|---|
| `x` | a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with `padding = TRUE` (see Details). |
| `method` | association measure for detecting collocations; currently this is limited to `"lambda"` |
| `size` | integer; the length of the collocations to be scored |
| `min_count` | numeric; minimum frequency of collocations that will be scored |
| `smoothing` | numeric; a smoothing parameter added to the observed counts (default is 0.5) |
| `tolower` | logical; if `TRUE`, form collocations as lower-cased combinations |
| `...` | additional arguments passed to `tokens()` when `x` is a character or corpus object |
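For orientation, a minimal call varying the documented arguments might look like the following sketch (assuming quanteda is attached; `data_corpus_inaugural` ships with the package):

```r
library("quanteda")

# trigrams occurring at least three times, keeping original case;
# method = "lambda" is currently the only available measure
colls <- textstat_collocations(data_corpus_inaugural[1:5],
                               method = "lambda", size = 3,
                               min_count = 3, tolower = FALSE)
```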
`textstat_collocations` returns a data.frame of collocations and their scores and statistics. `is.collocations` returns `TRUE` if the object is of class `collocations`, `FALSE` otherwise.
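The returned object is an ordinary data.frame carrying an extra `collocations` class, so it can be inspected with the usual tools. A brief sketch (column names taken from the example output below):

```r
library("quanteda")

colls <- textstat_collocations(data_corpus_inaugural[1:2], size = 2)
names(colls)             # "collocation" "count" "length" "lambda" "z"
is.collocations(colls)   # TRUE: the result carries the "collocations" class
is.collocations(mtcars)  # FALSE for a plain data.frame
```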
Documents are grouped for the purposes of scoring, but collocations will not span sentences. If `x` is a tokens object and some tokens have been removed, this should be done using `tokens_remove(x, pattern, padding = TRUE)` so that counts will still be accurate, but the pads will prevent those collocations from being scored.
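To see the effect, the sketch below removes stopwords while leaving pads, so the remaining tokens keep their original positions and no collocation is scored across a removed token:

```r
library("quanteda")

toks <- tokens(data_corpus_inaugural[1:2])
# padding = TRUE leaves an empty pad where each stopword was, keeping
# counts accurate while blocking collocations that would span the gap
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)
head(textstat_collocations(toks, size = 2))
```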
The `lambda` computed for a `size = ` \(K\)-word target multi-word expression is the coefficient for the \(K\)-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The `z` is the Wald \(z\)-statistic, computed as the quotient of `lambda` and its standard error, as described below.
In detail: consider a \(K\)-word target expression \(x\), and let \(z\) be any \(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1}, \dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the \(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0 otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots, 2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1, \dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions \(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\), denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the smoothing constant `smoothing`. The \(n_{i}\) are the counts in a \(2^{K}\) contingency table whose dimensions are defined by the \(c_{i}\).
\(\lambda\): the \(K\)-way interaction parameter in the saturated log-linear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of the elements of \(c_{i}\) which are equal to 1.
\(z\): the Wald test \(z\)-statistic, calculated as
$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
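For \(K = 2\) the formulas reduce to the familiar log odds ratio and its Wald statistic. The sketch below is a hand-rolled illustration with hypothetical counts, not quanteda's internal implementation; the names `n11`, `n10`, `n01`, and `n00` are ours:

```r
# hypothetical counts n_i for a target bigram x = (w1, w2):
# n11 = (w1, w2); n10 = w1 followed by some other word;
# n01 = some other word followed by w2; n00 = neither position matches
smoothing <- 0.5
n <- c(n00 = 10000, n01 = 40, n10 = 30, n11 = 25) + smoothing

# b_i is the number of 1s in c_i, so the sign (-1)^(K - b_i) for K = 2 gives
# lambda = log n11 - log n10 - log n01 + log n00  (the log odds ratio)
lambda <- unname(log(n["n11"]) - log(n["n10"]) - log(n["n01"]) + log(n["n00"]))

# Wald z: lambda divided by its standard error, [sum(1 / n_i)]^(1/2)
z <- lambda / sqrt(sum(1 / n))
c(lambda = lambda, z = z)
```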
This function is under active development, with more measures to be added in the next release of quanteda.
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
```r
txts <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(txts, size = 2, min_count = 2), 10)
#>    collocation count length   lambda        z
#> 1    have been     5      2 5.706351 7.357294
#> 2     has been     3      2 5.567303 6.411742
#> 3       of the    24      2 1.675867 6.391585
#> 4       i have     5      2 3.745703 6.271872
#> 5      which i     6      2 3.174358 6.139303
#> 6      will be     4      2 3.870615 5.933394
#> 7  public good     2      2 6.281574 5.531515
#> 8    less than     2      2 6.281574 5.531515
#> 9     you will     2      2 4.919981 5.434062
#> 10      may be     3      2 4.192821 5.330726

head(cols <- textstat_collocations(txts, size = 3, min_count = 2), 10)
#>          collocation count length    lambda         z
#> 1       of which the     2      3 6.1223981 2.8301045
#> 2         in which i     3      3 2.1657555 1.1724743
#> 3          i have in     2      3 2.3777916 1.0604856
#> 4         and of the     2      3 0.8812063 0.7468802
#> 5          me by the     2      3 1.4693825 0.6546061
#> 6       to the great     2      3 1.2857928 0.5645410
#> 7        voice of my     2      3 1.2237754 0.5284241
#> 8     which ought to     2      3 1.4051321 0.5266355
#> 9  of the confidence     2      3 1.1186300 0.4933722
#> 10 the united states     2      3 1.2564735 0.4261125

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks2, stopwords("english"), padding = TRUE)
toks2 <- tokens_select(toks2, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
seqs <- textstat_collocations(toks2, size = 3, tolower = FALSE)
head(seqs, 10)
#>                        collocation count length     lambda           z
#> 1               New England States     1      3 -0.1881325 -0.07035895
#> 2           United States Congress     2      3 -2.1524035 -1.01462267
#> 3      Arlington National Cemetery     1      3 -6.7967700 -2.21834991
#> 4           United Nations Charter     1      3 -6.3698206 -2.40537228
#> 5             Senator John Stennis     1      3 -7.4633094 -2.42928003
#> 6                 Second World War     1      3 -5.8388183 -2.43007317
#> 7   Interstate Commerce Commission     1      3 -9.3091763 -2.82110557
#> 8              First Lady Michelle     1      3 -9.5604975 -2.88885728
#> 9        Franklin Delano Roosevelt     1      3 -9.3091897 -2.94382329
#> 10             Lady Michelle Obama     1      3 -9.8969764 -2.97504808
```