Replace multi-word phrases in text(s) with a compound version of the phrases
concatenated with concatenator (by default, the "_" character) to
form a single token. This prevents tokenization of the phrases during
subsequent processing by eliminating the whitespace delimiter.
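The effect described above can be sketched in base R without quanteda. This is only an illustration: `compound_fixed` is a hypothetical helper, not a quanteda function, and it handles literal phrases only (no wildcards, no case folding, no word-boundary checks).

```r
# Hypothetical helper (an assumption, not part of quanteda): join the
# words of each literal phrase with a concatenator so that a later
# whitespace tokenizer keeps the phrase together as one token.
compound_fixed <- function(texts, phrases, concatenator = "_") {
  for (p in phrases) {
    # "income tax" -> "income_tax"
    joined <- gsub(" ", concatenator, p, fixed = TRUE)
    # replace every literal occurrence of the phrase in the texts
    texts <- gsub(p, joined, texts, fixed = TRUE)
  }
  texts
}

txt <- "An income tax and a sales tax."
out <- compound_fixed(txt, c("income tax", "sales tax"))
out
# [1] "An income_tax and a sales_tax."

# splitting on whitespace now leaves each phrase as a single token
tokens <- strsplit(out, " ", fixed = TRUE)[[1]]
```

Because the whitespace inside each phrase is gone, a whitespace-based tokenizer no longer has a delimiter at which to split it.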
phrasetotoken(object, phrases, ...)

# S4 method for corpus,ANY
phrasetotoken(object, phrases, ...)

# S4 method for textORtokens,dictionary
phrasetotoken(object, phrases, ...)

# S4 method for textORtokens,collocations
phrasetotoken(object, phrases, ...)

# S4 method for character,character
phrasetotoken(object, phrases, concatenator = "_",
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, ...)

# S4 method for tokenizedTexts,character
phrasetotoken(object, phrases, concatenator = "_",
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, ...)
phrases
    a dictionary object that contains some phrases, defined as multiple
    words delimited by whitespace, up to 9 words long; or a quanteda
    collocation object created by collocations

...
    arguments passed to the "character,character" method

concatenator
    the character used to join the words of each phrase; the default "_" is
    highly recommended since it will not be removed during normal cleaning
    and tokenization (while nearly all other punctuation characters, at
    least those in the Unicode punctuation class [P], will be removed)

valuetype
    "glob" for "glob"-style wildcard expressions; "regex" for regular
    expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive
    if TRUE, ignore case when matching

Returns a character or character vector of texts with phrases replaced by
compound "words" joined by the concatenator.
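To make the valuetype and case_insensitive options concrete, here is a rough base R sketch of how the three matching modes could be implemented. It is not quanteda's code, and its results need not agree with quanteda's output: `phrase_to_regex` and `compound_phrase` are hypothetical helpers, and treating `*` as "any run of word characters" and `?` as "one word character", with matches anchored at word boundaries, is an assumption made for the illustration.

```r
# Hypothetical helper (an assumption, not quanteda's implementation):
# translate a phrase into a regular expression according to valuetype.
phrase_to_regex <- function(phrase, valuetype = c("glob", "regex", "fixed")) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "regex") return(phrase)
  # backslash-escape regex metacharacters other than * and ?
  rx <- gsub("([.\\\\|()\\[\\]{}^$+])", "\\\\\\1", phrase, perl = TRUE)
  if (valuetype == "glob") {
    rx <- gsub("\\*", "\\\\w*", rx, perl = TRUE)  # * -> any word characters
    rx <- gsub("\\?", "\\\\w", rx, perl = TRUE)   # ? -> one word character
  } else {
    rx <- gsub("([*?])", "\\\\\\1", rx, perl = TRUE)  # fixed: * and ? literal
  }
  rx
}

# Hypothetical helper: replace whitespace inside each match of the phrase
# with the concatenator, leaving the rest of the text unchanged.
compound_phrase <- function(text, phrase, concatenator = "_",
                            valuetype = "glob", case_insensitive = TRUE) {
  rx <- paste0("\\b(?:", phrase_to_regex(phrase, valuetype), ")\\b")
  m <- gregexpr(rx, text, ignore.case = case_insensitive, perl = TRUE)
  regmatches(text, m) <- lapply(regmatches(text, m), function(hit) {
    gsub("\\s+", concatenator, hit, perl = TRUE)
  })
  text
}

txt <- "Income taxes rose; the income tax did not."
glob_out  <- compound_phrase(txt, "income tax*", valuetype = "glob")
fixed_out <- compound_phrase(txt, "income tax", valuetype = "fixed",
                             case_insensitive = FALSE)
```

In this sketch the glob pattern picks up both "Income taxes" and "income tax", while the case-sensitive fixed pattern compounds only the exact lower-case phrase.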
## Not run: ------------------------------------
# mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
#              "New York City has raised a taxes: an income tax and a sales tax.")
# mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
# (cw <- phrasetotoken(mytexts, mydict))
# dfm(cw, verbose=FALSE)
#
# # when used as a dictionary for dfm creation
# mydfm2 <- dfm(cw, dictionary = dictionary(lapply(mydict, function(x) gsub(" ", "_", x))))
# mydfm2
#
# # to pick up "taxes" in the second text, set valuetype = "regex"
# mydfm3 <- dfm(cw, dictionary = dictionary(lapply(mydict, phrasetotoken, mydict)),
#               valuetype = "regex")
# mydfm3
# ## one more token counted for "tax" than before
## ---------------------------------------------

# using a dictionary to pre-process multi-word expressions
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
txt <- c("I liked this, when we can use bad words, in awful text.",
         "Some damn good stuff, like the text, she likes that too.")
phrasetotoken(txt, myDict)
#> [1] "I liked this, when we can use bad words, in awful_text."
#> [2] "Some damn good_stuff, like the text, she likes that too."

# on simple text
phrasetotoken("This is a simpler version of multi word expressions.", "multi word expression*")
#> [1] "This is a simpler version of multi word expressions."

# on tokenized text
toks <- tokenize("Simon sez the multi word expression plural is multi word expressions, Simon sez.")
phrases <- c("multi word expression*", "Simon sez")
phrasetotoken(toks, phrases)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "Simon_sez"              "the"                    "multi_word_expression"
#> [4] "plural"                 "is"                     "multi_word_expressions"
#> [7] ","                      "Simon_sez"              "."
#>