tokens_compound.Rd
Replace multi-token sequences with a multi-word, or "compound" token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with concatenator (by default, the "_" character)
to form a single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a dfm.
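
As a rough illustration of what compounding does to a sequence of tokens, here is a minimal base R sketch. The helper `compound_fixed()` is hypothetical: it is not part of quanteda and is not how `tokens_compound()` is implemented. It assumes a single fixed, case-sensitive sequence and a plain character vector as input.

```r
# Hypothetical helper (not part of quanteda): join every occurrence of the
# fixed token sequence `target` in the character vector `toks` into a single
# compound token, using `concatenator` as the glue.
compound_fixed <- function(toks, target, concatenator = "_") {
  n <- length(target)
  stopifnot(n >= 1)
  out <- character(0)
  i <- 1
  while (i <= length(toks)) {
    last <- i + n - 1
    # Compare the window starting at position i with the target sequence
    if (last <= length(toks) && identical(toks[i:last], target)) {
      out <- c(out, paste(target, collapse = concatenator))
      i <- i + n  # skip past the tokens that were just joined
    } else {
      out <- c(out, toks[i])
      i <- i + 1
    }
  }
  out
}

toks <- c("a", "capital", "gains", "tax", "and", "an", "inheritance", "tax")
result <- compound_fixed(toks, c("capital", "gains", "tax"))
print(result)
```

Only the three-token sequence is joined here; "inheritance" and "tax" stay separate because they were not supplied as a pattern.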
tokens_compound(x, pattern, concatenator = "_", valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, join = TRUE)
Argument | Description |
---|---|
x | an input tokens object |
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
concatenator | the concatenation character that will connect the words making up the multi-word sequences. The default is "_". |
valuetype | the type of pattern matching: "glob", "regex", or "fixed" |
case_insensitive | logical; if TRUE, ignore case when matching |
join | logical; if TRUE, join overlapping compound tokens |
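
The following sketch shows how the `join`, `concatenator`, and `case_insensitive` arguments might be used together. It assumes the quanteda package is installed and that list patterns behave as in the examples on this page; the example sentence and object names are made up for illustration, and printed output is omitted because it can differ between quanteda versions.

```r
library(quanteda)

toks <- tokens("The capital gains tax was raised")

# Two overlapping two-word patterns: "capital gains" and "gains tax"
overlapping <- list(c("capital", "gains"), c("gains", "tax"))

# join = TRUE (the default): overlapping matches are merged into one compound
joined <- tokens_compound(toks, pattern = overlapping, join = TRUE)
print(joined)

# join = FALSE: each matched sequence is compounded separately
separate <- tokens_compound(toks, pattern = overlapping, join = FALSE)
print(separate)

# A different concatenator with case-sensitive matching: the capitalised
# pattern is not expected to match the lower-case tokens in `toks`
unmatched <- tokens_compound(toks, pattern = list(c("Capital", "Gains")),
                             concatenator = "+", case_insensitive = FALSE)
print(unmatched)
```

Leaving `join = TRUE` is usually the simpler choice when patterns come from a collocations object, since adjacent two-word collocations can then form a longer compound.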
a tokens object in which the token sequences matching pattern
have been replaced by compound "tokens" joined by the concatenator
txt <- c("The new law included a capital gains tax, and an inheritance tax.",
         "New York City has raised taxes: an income tax and inheritance taxes.")
toks1 <- tokens(txt, remove_punct = TRUE)

# for lists of sequence elements
myseqs <- list(c("tax"), c("income", "tax"),
               c("capital", "gains", "tax"), c("inheritance", "tax"))
(toks2 <- tokens_compound(toks1, pattern = myseqs))
#> tokens from 2 documents.
#> text1 :
#> [1] "The"               "new"               "law"
#> [4] "included"          "a"                 "capital_gains_tax"
#> [7] "and"               "an"                "inheritance_tax"
#>
#> text2 :
#>  [1] "New"         "York"        "City"        "has"         "raised"
#>  [6] "taxes"       "an"          "income_tax"  "and"         "inheritance"
#> [11] "taxes"
#>
dfm(toks2)
#> Document-feature matrix of: 2 documents, 16 features (40.6% sparse).
#> 2 x 16 sparse Matrix of class "dfm"
#>        features
#> docs    the new law included a and an inheritance york city has raised taxes
#>   text1   1   1   1        1 1   1  1           0    0    0   0      0     0
#>   text2   0   1   0        0 0   1  1           1    1    1   1      1     2
#>        features
#> docs    income_tax capital_gains_tax inheritance_tax
#>   text1          0                 1               1
#>   text2          1                 0               0

# when used as a dictionary for dfm creation
dict1 <- dictionary(list(tax = c("tax", "income tax", "capital gains tax",
                                 "inheritance tax*")))
(toks3 <- tokens_compound(toks1, pattern = dict1))
#> tokens from 2 documents.
#> text1 :
#> [1] "The"               "new"               "law"
#> [4] "included"          "a"                 "capital_gains_tax"
#> [7] "and"               "an"                "inheritance_tax"
#>
#> text2 :
#>  [1] "New"               "York"              "City"
#>  [4] "has"               "raised"            "taxes"
#>  [7] "an"                "income_tax"        "and"
#> [10] "inheritance_taxes"
#>

# to pick up "taxes" in the second text, set valuetype = "regex"
(toks4 <- tokens_compound(toks1, pattern = dict1, valuetype = "regex"))
#> tokens from 2 documents.
#> text1 :
#> [1] "The"               "new"               "law"
#> [4] "included"          "a"                 "capital_gains_tax"
#> [7] "and"               "an"                "inheritance_tax"
#>
#> text2 :
#>  [1] "New"               "York"              "City"
#>  [4] "has"               "raised"            "taxes"
#>  [7] "an"                "income_tax"        "and"
#> [10] "inheritance_taxes"
#>

# dictionaries w/glob matches
dict2 <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                         positive = c("good stuff", "like? th??")))
toks5 <- tokens(c(txt1 = "I liked this, when we can use bad words, in awful text.",
                  txt2 = "Some damn good stuff, like the text, she likes that too."))
tokens_compound(toks5, pattern = dict2)
#> tokens from 2 documents.
#> txt1 :
#>  [1] "I"          "liked_this" ","          "when"       "we"
#>  [6] "can"        "use"        "bad_words"  ","          "in"
#> [11] "awful_text" "."
#>
#> txt2 :
#>  [1] "Some"       "damn"       "good_stuff" ","          "like"
#>  [6] "the"        "text"       ","          "she"        "likes_that"
#> [11] "too"        "."
#>

# with collocations
tstat <- textstat_collocations(tokens("capital gains taxes are worse than inheritance taxes"),
                               size = 2, min_count = 1)
toks6 <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks6, pattern = tstat)
#> tokens from 1 document.
#> text1 :
#> [1] "The"                 "new"                 "law"
#> [4] "included"            "capital_gains_taxes" "and"
#> [7] "inheritance_taxes"   "."
#>