These functions select or discard tokens from a tokens object. For
convenience, the functions `tokens_remove` and `tokens_keep` are defined as
shortcuts for `tokens_select(x, pattern, selection = "remove")` and
`tokens_select(x, pattern, selection = "keep")`, respectively. The most common
usage of `tokens_remove` will be to eliminate stop words from a text or
text-based object, while the most common use of `tokens_select` will be to
select only the tokens with positive pattern matches from a list of regular
expressions, including a dictionary.
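Because `tokens_remove` and `tokens_keep` are pure shortcuts, either spelling can be used interchangeably with the explicit `tokens_select` call. A minimal sketch (assuming quanteda is attached):

```r
library(quanteda)

toks <- tokens("The quick brown fox jumps over the lazy dog")

# the shortcut and the explicit selection call should give identical results
identical(
  tokens_remove(toks, "the"),
  tokens_select(toks, "the", selection = "remove")
)
```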
    tokens_select(
      x,
      pattern,
      selection = c("keep", "remove"),
      valuetype = c("glob", "regex", "fixed"),
      case_insensitive = TRUE,
      padding = FALSE,
      window = 0,
      min_nchar = 1L,
      max_nchar = 79L,
      verbose = quanteda_options("verbose")
    )

    tokens_remove(x, ...)

    tokens_keep(x, ...)
x | tokens object whose token elements will be removed or kept |
---|---|
pattern | a character vector, list of character vectors, dictionary, collocations, or dfm; see `pattern` for details |
selection | whether to `"keep"` or `"remove"` the tokens matching `pattern` |
valuetype | the type of pattern matching: `"glob"` for glob-style wildcard expressions, `"regex"` for regular expressions, or `"fixed"` for exact matching |
case_insensitive | ignore case when matching, if `TRUE` |
padding | if `TRUE`, leave an empty string where the removed tokens previously existed |
window | integer of length 1 or 2; the size of the window of tokens adjacent to `pattern` that will also be selected or removed. Terms from overlapping windows are never double-counted, but simply returned in the pattern match; this is because `tokens_select` never redefines the document units |
min_nchar, max_nchar | numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are 1 and 79. These limits are applied in addition to any selection based on pattern matches |
verbose | if `TRUE`, print messages about the selection |
... | additional arguments passed by `tokens_remove` and `tokens_keep` to `tokens_select` |
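Since the `min_nchar`/`max_nchar` bounds apply on top of any pattern match, they can do the filtering on their own when combined with a pattern that matches everything. A sketch, assuming the default `"glob"` valuetype (where `"*"` matches every token):

```r
library(quanteda)

toks <- tokens("a bb ccc dddd")

# "*" matches all tokens, so only the character-length bounds filter here:
# tokens shorter than 2 or longer than 3 characters are dropped
tokens_select(toks, "*", selection = "keep", min_nchar = 2, max_nchar = 3)
```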
Returns a tokens object with tokens selected or removed based on their
match to `pattern`.
    ## tokens_select with simple examples
    toks <- tokens(c("This is a sentence.", "This is a second sentence."),
                   remove_punct = TRUE)
    tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = FALSE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "This" "is"   "a"
    #>
    #> text2 :
    #> [1] "This" "is"   "a"
    #>
    tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = TRUE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "This" "is"   "a"    ""
    #>
    #> text2 :
    #> [1] "This" "is"   "a"    ""     ""
    #>
    tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = FALSE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "sentence"
    #>
    #> text2 :
    #> [1] "second"   "sentence"
    #>
    tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = TRUE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] ""         ""         ""         "sentence"
    #>
    #> text2 :
    #> [1] ""         ""         ""         "second"   "sentence"
    #>

    ## how case_insensitive works
    tokens_select(toks, c("is", "a", "this"), selection = "remove",
                  case_insensitive = TRUE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "sentence"
    #>
    #> text2 :
    #> [1] "second"   "sentence"
    #>
    tokens_select(toks, c("is", "a", "this"), selection = "remove",
                  case_insensitive = FALSE)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "This"     "sentence"
    #>
    #> text2 :
    #> [1] "This"     "second"   "sentence"
    #>

    ## use window
    tokens_select(toks, "second", selection = "keep", window = 1)
    #> tokens from 2 documents.
    #> text1 :
    #> character(0)
    #>
    #> text2 :
    #> [1] "a"        "second"   "sentence"
    #>
    tokens_select(toks, "second", selection = "remove", window = 1)
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "This"     "is"       "a"        "sentence"
    #>
    #> text2 :
    #> [1] "This" "is"
    #>
    tokens_remove(toks, "is", window = c(0, 1))
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "This"     "sentence"
    #>
    #> text2 :
    #> [1] "This"     "second"   "sentence"
    #>

    ## tokens_remove example: remove stopwords
    txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
             wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
    tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
    #> tokens from 2 documents.
    #> text1 :
    #>  [1] "Fellow"     "citizens"   "called"     "upon"       "voice"
    #>  [6] "country"    "execute"    "functions"  "Chief"      "Magistrate"
    #>
    #> text2 :
    #>  [1] "occasion"      "proper"        "shall"         "arrive"
    #>  [5] "shall"         "endeavor"      "express"       "high"
    #>  [9] "sense"         "entertain"     "distinguished" "honor"
    #>

    ## tokens_keep example: keep two-letter words
    tokens_keep(tokens(txt, remove_punct = TRUE), "??")
    #> tokens from 2 documents.
    #> text1 :
    #> [1] "am" "by" "of" "my" "to" "of"
    #>
    #> text2 :
    #> [1] "it" "to" "of"