Create a set of ngrams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skipgrams. Both the ngram length and the skip size accept integer vectors, so ngrams of multiple lengths or multiple skips can be formed in one pass. Implemented in C++ for efficiency.
tokens_ngrams(x, n = 2L, skip = 0L, concatenator = "_")

char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")

tokens_skipgrams(x, n, skip, concatenator = "_")
x: a tokens object, or a character vector, or a list of characters

n: integer vector specifying the number of elements to be concatenated in each ngram. Each element of this vector defines an \(n\) in the \(n\)-gram(s) that are produced.

skip: integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0 for only immediately neighbouring words.

concatenator: character for combining words; default is "_" (underscore).
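A minimal sketch of how the vector-valued n and skip arguments work in a single call (the input sentence and object names here are only illustrative):

library(quanteda)
toks <- tokens("one two three four five")
# a vector n forms ngrams of several lengths at once: bigrams and trigrams
tokens_ngrams(toks, n = 2:3)
# a vector skip forms ngrams at several skip distances at once:
# adjacent bigrams plus bigrams skipping one intervening token
tokens_ngrams(toks, n = 2, skip = 0:1)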
a tokens object consisting of a list of character vectors of ngrams, one list element per text, or a character vector if called on a simple character vector
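For example, a small illustration of the two return types described above (object names are only illustrative):

library(quanteda)
# a tokens object in gives a tokens object out, one element of ngrams per text
ngr_toks <- tokens_ngrams(tokens(c(d1 = "a b c", d2 = "b c d")))
is.tokens(ngr_toks)
# a plain character vector in gives a character vector of ngrams out
ngr_chr <- char_ngrams(c("a", "b", "c"))
is.character(ngr_chr)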
Normally, these functions will be called through tokens(x, ngrams = , ...), but they are provided in case a user wants to perform lower-level ngram construction on tokenized texts.
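A rough sketch of the two routes described above, assuming a quanteda version in which tokens() still accepts an ngrams argument as this page indicates; both calls below should yield the same bigrams:

library(quanteda)
txt <- c(d1 = "a b c d")
# higher-level route: form ngrams while tokenizing
toks_a <- tokens(txt, ngrams = 2)
# lower-level route: tokenize first, then construct the ngrams explicitly
toks_b <- tokens_ngrams(tokens(txt), n = 2)
# both should contain the bigrams "a_b", "b_c", "c_d"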
tokens_skipgrams is a wrapper to tokens_ngrams that requires arguments to be supplied for both n and skip. For \(k\)-skip skipgrams, set skip to 0:k, in order to conform to the definition of skip-grams found in Guthrie et al. (2006): a \(k\)-skip-gram is an ngram which is a superset of all ngrams and each \((k-i)\)-skip-gram until \((k-i) == 0\) (which includes 0-skip-grams).
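To illustrate the wrapper relationship and the 0:k convention stated above (here k = 2; the input sentence is only illustrative), the skipgram call and the equivalent direct call should return the same result:

library(quanteda)
toks <- tokens("a b c d e")
# 2-skip bigrams: skip = 0:2 covers the 0-, 1-, and 2-skip cases
tokens_skipgrams(toks, n = 2, skip = 0:2)
# the same arguments passed directly to tokens_ngrams() should match,
# since tokens_skipgrams() is a wrapper around it
tokens_ngrams(toks, n = 2, skip = 0:2)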
char_ngrams is a convenience wrapper for a (non-list) vector of characters, so named to be consistent with quanteda's naming scheme.
Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling."
# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
#> tokens from 2 documents.
#> text1 :
#> [1] "a_b"   "b_c"   "c_d"   "d_e"   "a_b_c" "b_c_d" "c_d_e"
#>
#> text2 :
#> [1] "c_d"   "d_e"   "e_f"   "f_g"   "c_d_e" "d_e_f" "e_f_g"
#>

toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
#> tokens from 1 document.
#> text1 :
#>  [1] "the"              "quick"            "brown"            "fox"
#>  [5] "jumped"           "over"             "the"              "lazy"
#>  [9] "dog"              "the_quick"        "quick_brown"      "brown_fox"
#> [13] "fox_jumped"       "jumped_over"      "over_the"         "the_lazy"
#> [17] "lazy_dog"         "the_quick_brown"  "quick_brown_fox"  "brown_fox_jumped"
#> [21] "fox_jumped_over"  "jumped_over_the"  "over_the_lazy"    "the_lazy_dog"
#>

tokens_ngrams(toks, n = c(2,4), concatenator = " ")
#> tokens from 1 document.
#> text1 :
#>  [1] "the quick"              "quick brown"            "brown fox"
#>  [4] "fox jumped"             "jumped over"            "over the"
#>  [7] "the lazy"               "lazy dog"               "the quick brown fox"
#> [10] "quick brown fox jumped" "brown fox jumped over"  "fox jumped over the"
#> [13] "jumped over the lazy"   "over the lazy dog"
#>

tokens_ngrams(toks, n = c(2,4), skip = 1, concatenator = " ")
#> tokens from 1 document.
#> text1 :
#>  [1] "the brown"            "quick fox"            "brown jumped"
#>  [4] "fox over"             "jumped the"           "over lazy"
#>  [7] "the dog"              "the brown jumped the" "quick fox over lazy"
#> [10] "brown jumped the dog"
#>

# on character
char_ngrams(letters[1:3], n = 1:3)
#> [1] "a"     "b"     "c"     "a_b"   "b_c"   "a_b_c"

# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
#> tokens from 1 document.
#> text1 :
#> [1] "insurgents killed" "insurgents in"     "killed in"
#> [4] "killed ongoing"    "in ongoing"        "in fighting"
#> [7] "ongoing fighting"
#>

tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
#> tokens from 1 document.
#> text1 :
#> [1] "insurgents killed"  "insurgents in"      "insurgents ongoing"
#> [4] "killed in"          "killed ongoing"     "killed fighting"
#> [7] "in ongoing"         "in fighting"        "ongoing fighting"
#>

tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")
#> tokens from 1 document.
#> text1 :
#>  [1] "insurgents killed in"        "insurgents killed ongoing"
#>  [3] "insurgents killed fighting"  "insurgents in ongoing"
#>  [5] "insurgents in fighting"      "insurgents ongoing fighting"
#>  [7] "killed in ongoing"           "killed in fighting"
#>  [9] "killed ongoing fighting"     "in ongoing fighting"
#>