This function automatically identifies contiguous collocations: variable-length sequences of terms that co-occur more frequently than would be expected by chance. The algorithm is based on Blaheta and Johnson's (2001) "Unsupervised Learning of Multi-Word Verbs".
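
To give a feel for the statistic reported for each sequence, the following is a minimal sketch of the length-2 case only, not the package's internal code: under the Blaheta-Johnson framework, the bigram association parameter reduces to a log odds ratio (an assumption here), with a Wald standard error and z = lambda / sigma. The counts below are hypothetical.

# sketch of the length-2 statistic with hypothetical counts
n11 <- 155    # "United States": first and second word occur together
n10 <- 40     # "United" followed by some other token
n01 <- 25     # some other token followed by "States"
n00 <- 50000  # neither word in either position
lambda <- log(n11) - log(n10) - log(n01) + log(n00)  # log odds ratio
sigma  <- sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00)        # Wald standard error
z      <- lambda / sigma      # compared against the standard normal
p      <- 1 - pnorm(z)        # one-sided tail probability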

sequences_old(x, features = "*", valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, min_count = 2, max_size = 5, nested = TRUE,
  ordered = FALSE)

Arguments

x

a tokens object

features

a pattern for selecting the features to be located in sequences; how the pattern is interpreted depends on valuetype

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details; a short sketch contrasting the three modes follows the argument descriptions.

case_insensitive

ignore case when matching, if TRUE

min_count

minimum frequency of sequences for which parameters are estimated

max_size

maximum length of sequences to be collected

nested

if TRUE, collect all subsequences of a longer sequence as separate entities. For example, in the sequence of capitalized words "United States Congress", "States Congress" is collected as a subsequence, but "United States" is not, because it is immediately followed by "Congress".

ordered

if TRUE, use the Blaheta-Johnson method, which takes the order of words into account and tends to promote rare sequences.
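
To make the features/valuetype pairing concrete, the calls below target president-related types in three different ways. This is a sketch only, with toks as created in the Examples below; whether features accepts a vector of fixed patterns is an assumption here.

# three equivalent ways to restrict the candidate features
seqs_g <- sequences_old(toks, "Presiden*", valuetype = "glob",
                        case_insensitive = FALSE)   # wildcard pattern
seqs_r <- sequences_old(toks, "^Presiden", valuetype = "regex",
                        case_insensitive = FALSE)   # regular expression
seqs_f <- sequences_old(toks, c("President", "Presidents"),
                        valuetype = "fixed")        # exact matches only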

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL-EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

Examples

toks <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks <- tokens_select(toks, stopwords("english"), "remove", padding = TRUE)

# extracting multi-part proper nouns (capitalized terms)
seqs <- sequences_old(toks, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                      case_insensitive = FALSE)
head(seqs, 10)
#>             collocation count length   lambda      sigma         z p
#> 107       United States   155      2 20.56459 0.04324223 475.56728 0
#> 3    Federal Government    32      2 16.07807 0.09312539 172.64966 0
#> 207  General Government    24      2 14.50954 0.12914263 112.35286 0
#> 66  National Government    11      2 12.88826 0.15847837  81.32504 0
#> 175        Almighty God    15      2 13.10835 0.16301521  80.41181 0
#> 225       Chief Justice    11      2 12.17009 0.18109481  67.20288 0
#> 90       President Bush     6      2 11.87632 0.18741675  63.36850 0
#> 224    Chief Magistrate    10      2 11.37940 0.22063436  51.57583 0
#> 95    President Clinton     4      2 10.67352 0.22913765  46.58124 0
#> 105      United Nations     8      2 10.72087 0.24598156  43.58402 0
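
In the printed result, count is the raw frequency of the sequence, length its size in tokens, lambda the estimated association parameter with standard error sigma, and z = lambda / sigma, with p the corresponding tail probability. Assuming the returned object can be subset like a data frame (as the printed row names suggest), significant sequences can be kept with, e.g.:

# keep sequences significant at the 0.001 level (one-sided z test);
# assumes the result subsets like a data.frame
seqs_sig <- seqs[seqs$z > qnorm(0.999), ]
nrow(seqs_sig)
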

# more efficient when applied to the same tokens object
toks_comp <- tokens_compound(toks, seqs)
toks_comp_ir <- tokens_compound(tokens(data_corpus_irishbudget2010), phrase(seqs))

# types can be any words
seqs2 <- sequences_old(toks, "^([a-z]+)$", valuetype = "regex",
                       case_insensitive = FALSE, min_count = 2, ordered = TRUE)
head(seqs2, 10)
#>               collocation count length   lambda    sigma        z p
#> 5851      fellow citizens    56      2 54.85442 1.015456 54.01949 0
#> 13886           years ago    24      2 54.44628 1.035811 52.56393 0
#> 837            four years    17      2 51.65847 1.045185 49.42518 0
#> 4835          one another    14      2 50.55910 1.075006 47.03144 0
#> 1865           go forward     6      2 49.99335 1.090522 45.84352 0
#> 7722        every citizen    13      2 46.65390 1.033619 45.13647 0
#> 11056 executive department    6      2 51.60869 1.147562 44.97248 0
#> 8749           good faith     7      2 48.38682 1.080811 44.76900 0
#> 2498    religious liberty     6      2 49.84131 1.123135 44.37697 0
#> 4279         public money     8      2 46.57447 1.050330 44.34271 0
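
As in the first example, the detected sequences can be compounded back into the tokens object; a sketch assuming tokens_compound()'s default "_" concatenator:

# join the detected lower-case sequences into single tokens
toks_comp2 <- tokens_compound(toks, seqs2)
head(tokens_select(toks_comp2, "*_*"), 2)  # inspect some compounded tokens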