Segment tokens by splitting
on a pattern match. This is useful for breaking the tokenized texts into smaller
document units, based on a regular pattern or a user-supplied annotation.
While it normally makes more sense to do this at the corpus level (see corpus_segment),
tokens_segment provides the option to perform this operation on tokens.
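As a sketch of that distinction, the two routes below should produce the same split; this is an illustration only, assuming the quanteda package is attached, with an invented two-sentence text:

```r
library(quanteda)

corp <- corpus("First point. Second point.")

# corpus-level segmentation, usually the preferred route
corpus_segment(corp, pattern = ".", valuetype = "fixed",
               pattern_position = "after", extract_pattern = FALSE)

# the equivalent split applied after tokenization
tokens_segment(tokens(corp), ".", valuetype = "fixed",
               pattern_position = "after")
```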
    tokens_segment(
      x,
      pattern,
      valuetype = c("glob", "regex", "fixed"),
      case_insensitive = TRUE,
      extract_pattern = FALSE,
      pattern_position = c("before", "after"),
      use_docvars = TRUE
    )
| Argument | Description |
|---|---|
| x | tokens object whose token elements will be segmented |
| pattern | a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details. |
| valuetype | the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching |
| case_insensitive | ignore case when matching, if `TRUE` |
| extract_pattern | remove matched patterns from the texts and save in docvars, if `TRUE` |
| pattern_position | either `"before"` or `"after"`, depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters) |
| use_docvars | if `TRUE`, repeat the docvar values for each segmented text; if `FALSE`, drop the docvars from the segmented object |
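To show how extract_pattern and pattern_position interact, here is a minimal sketch, assuming quanteda is attached; that the extracted match lands in a docvar named "pattern" mirrors the documented behavior of corpus_segment and is not stated on this page:

```r
library(quanteda)

toks <- tokens("Yes. No! Maybe?")

# the delimiters follow the text they close, so pattern_position = "after";
# extract_pattern = TRUE removes each matched delimiter from the tokens
# and saves it alongside the corresponding segment
seg <- tokens_segment(toks, c(".", "!", "?"), valuetype = "fixed",
                      extract_pattern = TRUE, pattern_position = "after")
docvars(seg, "pattern")  # expected: "." "!" "?"
```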
tokens_segment returns a tokens object whose documents have been split by patterns.
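A short sketch of inspecting the returned object, assuming quanteda is attached; the `parent.segment` docname scheme shown in the comments is the expected naming, not something this page guarantees:

```r
library(quanteda)

toks <- tokens("One two. Three four.")
seg <- tokens_segment(toks, ".", valuetype = "fixed",
                      pattern_position = "after")

ndoc(seg)      # 2: one new document per segment
docnames(seg)  # names derive from the source document,
               # e.g. "text1.1" and "text1.2"
```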
txts <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor." toks <- tokens(txts) # split by any punctuation toks_punc <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed", pattern_position = "after") toks_punc <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex", extract_pattern = FALSE, pattern_position = "after")