This function selects or discards tokens from a tokens object. tokens_remove(x, pattern) is a shortcut for tokens_select(x, pattern, selection = "remove"). The most common use of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to keep only those tokens that match a list of patterns, such as regular expressions or the values of a dictionary.
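The shortcut relationship described above can be checked directly. The following is a minimal sketch, assuming quanteda is installed; the sample sentence and the variable names `removed_a` and `removed_b` are illustrative and not part of the original documentation.

```r
library(quanteda)

# Illustrative input (not from the original page)
toks <- tokens("The quick brown fox jumps over the lazy dog")

# tokens_remove() is documented as shorthand for selection = "remove"
removed_a <- tokens_remove(toks, c("the", "over"))
removed_b <- tokens_select(toks, c("the", "over"), selection = "remove")

# Compare the underlying token lists of the two results
identical(as.list(removed_a), as.list(removed_b))
```

Because case_insensitive defaults to TRUE, the pattern "the" also removes the capitalised "The" in both calls.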
tokens_select(x, pattern, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  padding = FALSE, verbose = quanteda_options("verbose"))

tokens_remove(x, pattern, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, padding = FALSE,
  verbose = quanteda_options("verbose"))
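Argument | Description |
---|---|
x | tokens object whose token elements will be selected |
pattern | a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details. |
selection | whether to "keep" or "remove" the tokens matching pattern |
valuetype | the type of pattern matching: "glob" for glob-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching |
case_insensitive | ignore case when matching, if TRUE |
padding | if TRUE, leave an empty string where the removed tokens previously existed |
verbose | if TRUE, print messages about how many tokens were selected or removed |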
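To illustrate how the valuetype argument changes which tokens match, here is a small sketch, assuming quanteda is installed. The sample text and the variable names `by_glob`, `by_regex`, and `by_fixed` are illustrative only.

```r
library(quanteda)

# Illustrative input (not from the original page)
toks <- tokens("tax taxes taxation syntax")

# "glob": "*" is a wildcard, so "tax*" matches whole tokens beginning with "tax"
by_glob <- tokens_select(toks, "tax*", valuetype = "glob")

# "regex": the pattern may match anywhere in a token unless anchored;
# "tax$" matches any token that ends in "tax"
by_regex <- tokens_select(toks, "tax$", valuetype = "regex")

# "fixed": the token must equal the pattern exactly
by_fixed <- tokens_select(toks, "tax", valuetype = "fixed")

as.list(by_glob)   # tax, taxes, taxation
as.list(by_regex)  # tax, syntax
as.list(by_fixed)  # tax
```

Glob patterns are usually the most convenient for word lists, while regex is worth the extra care when matches inside a token are needed.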
A tokens object with tokens selected or removed based on their match to pattern.
## tokens_select with simple examples
toks <- tokens(c("This is a sentence.", "This is a second sentence."),
               remove_punct = TRUE)

tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a"
#>
#> text2 :
#> [1] "This" "is" "a"
#>

tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a" ""
#>
#> text2 :
#> [1] "This" "is" "a" "" ""
#>

tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#>
#> text2 :
#> [1] "second" "sentence"
#>

tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "" "" "" "sentence"
#>
#> text2 :
#> [1] "" "" "" "second" "sentence"
#>

# how case_insensitive works
tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#>
#> text2 :
#> [1] "second" "sentence"
#>

tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "sentence"
#>
#> text2 :
#> [1] "This" "second" "sentence"
#>

## tokens_remove example
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
#> tokens from 2 documents.
#> text1 :
#> [1] "Fellow" "citizens" "called" "upon" "voice"
#> [6] "country" "execute" "functions" "Chief" "Magistrate"
#>
#> text2 :
#> [1] "occasion" "proper" "shall" "arrive"
#> [5] "shall" "endeavor" "express" "high"
#> [9] "sense" "entertain" "distinguished" "honor"
#>
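The examples above use character vectors as patterns. Since pattern may also be a dictionary, the following sketch shows that case, assuming quanteda is installed. The dictionary keys and values, the sample sentence, and the names `dict_toks`, `dict`, and `kept` are illustrative and not from the original page.

```r
library(quanteda)

# Illustrative input and dictionary (not from the original page)
dict_toks <- tokens("The quick brown fox jumps over the lazy dogs")
dict <- dictionary(list(animals = c("fox", "dog*"),
                        colours = "brown"))

# With a dictionary as the pattern, tokens matching any dictionary value
# are kept; the default valuetype = "glob" lets "dog*" match "dogs"
kept <- tokens_select(dict_toks, dict)

as.list(kept)  # brown, fox, dogs
```

The dictionary keys ("animals", "colours") play no role in the selection here; only the values are used as patterns.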