Detects collocations from texts or a corpus, returning a data.frame of
collocations and their scores, sorted in descending order of the association
measure. By default (spanPunct = FALSE), words separated by punctuation
delimiters are not counted as adjacent and hence are not eligible to be
collocations.
collocations2(x, method = c("lr", "chi2", "pmi", "dice"), features = "*", valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, min_count = 1, size = 2, ...)
x | |
---|---|
method | association measure for detecting collocations. Let \(i\) index documents and \(j\) index features; \(n_{ij}\) denotes the observed counts and \(m_{ij}\) the expected counts in a collocations frequency table of dimensions \((J - size + 1)^2\). Available measures are "lr" (likelihood ratio), "chi2" (chi-squared), "pmi" (pointwise mutual information), and "dice" (Dice coefficient). |
features | features to be selected for collocations |
valuetype | how to interpret keyword expressions: |
case_insensitive | ignore the case when matching features if |
min_count | exclude collocations below this count |
size | length of the collocation. Only bigram ( |
... | additional parameters passed to |
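
As a hedged sketch of how the four measures are conventionally defined for the bigram case (size = 2), consider a 2 × 2 contingency table for a candidate pair. The following are assumptions made here for illustration: \(i\) and \(j\) are read as the row and column of that table, \(n_{11}\) is the count of the pair itself, and the expected counts are \(m_{ij} = n_{i\cdot}\, n_{\cdot j} / n_{\cdot\cdot}\). These are textbook forms, not formulas confirmed by this page; the function's exact computation (log base, smoothing, constant factors) may differ.

\[
\begin{aligned}
G^2 &= 2 \sum_{i} \sum_{j} n_{ij} \log \frac{n_{ij}}{m_{ij}} && \text{("lr")} \\
\chi^2 &= \sum_{i} \sum_{j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}} && \text{("chi2")} \\
\mathrm{PMI} &= \log \frac{n_{11}}{m_{11}} && \text{("pmi")} \\
\mathrm{Dice} &= \frac{2\, n_{11}}{n_{1\cdot} + n_{\cdot 1}} && \text{("dice")}
\end{aligned}
\]
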
a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).
McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.