This function automatically identifies contiguous collocations consisting of variable-length term sequences whose frequency is unlikely to have occurred by chance. The algorithm is based on Blaheta and Johnson's (2001) "Unsupervised Learning of Multi-Word Verbs".
```r
sequences_old(x, features = "*", valuetype = c("glob", "regex", "fixed"),
              case_insensitive = TRUE, min_count = 2, max_size = 5,
              nested = TRUE, ordered = FALSE)
```
| Argument | Description |
|---|---|
| `x` | a tokens object |
| `features` | a regular expression for filtering the features to be located in sequences |
| `valuetype` | how to interpret keyword expressions: `"glob"` for "glob"-style wildcard patterns, `"regex"` for regular expressions, or `"fixed"` for exact matching |
| `case_insensitive` | if `TRUE`, ignore case when matching |
| `min_count` | minimum frequency of sequences for which parameters are estimated |
| `max_size` | maximum length of sequences which are collected |
| `nested` | if `TRUE`, collect sequences that are nested within longer sequences |
| `ordered` | if `TRUE`, use the Blaheta-Johnson method, which distinguishes between the order of words and tends to promote rare sequences |
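The `valuetype` conventions here are the ones shared across quanteda's pattern-matching functions. As a rough illustration of how the three pattern types differ, here is a minimal sketch using `tokens_select()`, which accepts the same `valuetype` and `case_insensitive` arguments, rather than `sequences_old()` itself:

```r
library(quanteda)

toks <- tokens("The United States Congress met in Washington")

# "glob": wildcard matching, so "Unit*" keeps any feature starting with "Unit"
tokens_select(toks, "Unit*", valuetype = "glob")

# "regex": full regular expressions, e.g. keep capitalized features only
tokens_select(toks, "^[A-Z]", valuetype = "regex", case_insensitive = FALSE)

# "fixed": exact (literal) matching of the whole feature
tokens_select(toks, "Congress", valuetype = "fixed")
```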
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
```r
toks <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks <- tokens_select(toks, stopwords("english"), "remove", padding = TRUE)

# extracting multi-part proper nouns (capitalized terms)
seqs <- sequences_old(toks, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                      case_insensitive = FALSE)
head(seqs, 10)
#>             collocation count length   lambda      sigma         z p
#> 107       United States   155      2 20.56459 0.04324223 475.56728 0
#> 3    Federal Government    32      2 16.07807 0.09312539 172.64966 0
#> 207  General Government    24      2 14.50954 0.12914263 112.35286 0
#> 66  National Government    11      2 12.88826 0.15847837  81.32504 0
#> 175        Almighty God    15      2 13.10835 0.16301521  80.41181 0
#> 225       Chief Justice    11      2 12.17009 0.18109481  67.20288 0
#> 90       President Bush     6      2 11.87632 0.18741675  63.36850 0
#> 224    Chief Magistrate    10      2 11.37940 0.22063436  51.57583 0
#> 95    President Clinton     4      2 10.67352 0.22913765  46.58124 0
#> 105      United Nations     8      2 10.72087 0.24598156  43.58402 0

# more efficient when applied to the same tokens object
toks_comp <- tokens_compound(toks, seqs)
toks_comp_ir <- tokens_compound(tokens(data_corpus_irishbudget2010), phrase(seqs))

# types can be any words
seqs2 <- sequences_old(toks, "^([a-z]+)$", valuetype = "regex",
                       case_insensitive = FALSE, min_count = 2, ordered = TRUE)
head(seqs2, 10)
#>                collocation count length   lambda    sigma        z p
#> 5851       fellow citizens    56      2 54.85442 1.015456 54.01949 0
#> 13886            years ago    24      2 54.44628 1.035811 52.56393 0
#> 837             four years    17      2 51.65847 1.045185 49.42518 0
#> 4835           one another    14      2 50.55910 1.075006 47.03144 0
#> 1865            go forward     6      2 49.99335 1.090522 45.84352 0
#> 7722         every citizen    13      2 46.65390 1.033619 45.13647 0
#> 11056 executive department     6      2 51.60869 1.147562 44.97248 0
#> 8749            good faith     7      2 48.38682 1.080811 44.76900 0
#> 2498     religious liberty     6      2 49.84131 1.123135 44.37697 0
#> 4279          public money     8      2 46.57447 1.050330 44.34271 0
```
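The `z` column in this output appears to be simply `lambda / sigma`, with `p` the corresponding upper-tail normal probability, which is how "unlikely to have occurred by chance" is quantified. This can be checked against the first row of the output above; a back-of-the-envelope sketch, not part of the package API:

```r
# Reproduce z and p for the "United States" row from its reported lambda and
# sigma: lambda / sigma matches the printed z of 475.56728, and the upper-tail
# normal probability is numerically zero at that magnitude.
lambda <- 20.56459
sigma  <- 0.04324223
z <- lambda / sigma
z
#> [1] 475.5673
p <- 1 - pnorm(z)
p
#> [1] 0
```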