Segment tokens object by patterns

tokens_segment(x, what = c("sentences", "other"), delimiter = NULL,
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be segmented

what

unit of segmentation. Current options are "sentences" (default) and "other".

Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument.

delimiter

a pattern on which to segment the tokens, interpreted according to valuetype (the same as pattern in other quanteda functions). Required when what = "other".

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

verbose

if TRUE, print a message describing the result of the segmentation

Value

tokens_segment returns a tokens object of segmented texts

Examples

txts <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor."
toks <- tokens(txts)

# split into sentences
toks_sent <- tokens_segment(toks, what = "sentences")

# split by any punctuation
toks_punc <- tokens_segment(toks, what = "other", delimiter = "[\\p{P}]",
                            valuetype = "regex")
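As a further illustration, a fixed (exact-match) delimiter can also be used with what = "other". The sketch below assumes the quanteda package is attached and uses the signature documented above; the input text and object names are illustrative only.

```r
library(quanteda)

# a short text where "." appears as its own token after tokenization
txt <- "First point . Second point . Third point ."
toks <- tokens(txt)

# segment on the literal "." token, matched exactly rather than
# as a glob or regular expression
toks_fixed <- tokens_segment(toks, what = "other", delimiter = ".",
                             valuetype = "fixed")
```

Because valuetype = "fixed" disables wildcard and regex interpretation, the "." delimiter here matches only the period token itself.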