Segment tokens object by patterns
tokens_segment(x, what = c("sentences", "other"), delimiter = NULL,
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = quanteda_options("verbose"))
x | tokens object whose token elements will be segmented |
---|---|
what | unit of segmentation. Current options are "sentences" (the default) and "other". Segmenting on "other" splits the tokens on any user-defined pattern, and requires the delimiter argument to be supplied |
delimiter | the pattern on which the tokens will be segmented; only used when what = "other" |
valuetype | how to interpret the delimiter: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching |
case_insensitive | ignore case when matching, if TRUE |
verbose | if TRUE, print status messages while segmenting |
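As a sketch of how valuetype changes the interpretation of delimiter (assuming the quanteda version that provides the what/delimiter interface shown in the usage above), a "fixed" delimiter must match a token literally, while a "regex" delimiter is treated as a regular expression:

```r
library(quanteda)

toks <- tokens("one. two. three.")

# fixed: the delimiter "." matches only tokens that are exactly "."
tokens_segment(toks, what = "other", delimiter = ".", valuetype = "fixed")

# regex: "[.!?]" segments on any one of the three punctuation marks
tokens_segment(toks, what = "other", delimiter = "[.!?]", valuetype = "regex")
```

With the default valuetype = "glob", an unescaped "." is a literal character, so the glob and fixed calls behave alike here; regex is the most flexible choice for punctuation classes.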
@return tokens_segment returns a tokens object of segmented texts
txts <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor."
toks <- tokens(txts)

# split into sentences
toks_sent <- tokens_segment(toks, what = "sentences")

# split by any punctuation
toks_punc <- tokens_segment(toks, what = "other", delimiter = "[\\p{P}]",
                            valuetype = "regex")
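To confirm the split, one can compare the document counts before and after segmentation (a sketch; exact document names depend on the quanteda version):

```r
ndoc(toks)       # 1: the single input text
ndoc(toks_sent)  # one document per sentence of the input
docnames(toks_sent)
```

Each segment becomes its own document in the returned tokens object, so downstream functions such as dfm() treat the segments as separate texts.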