Apply a dictionary to a dfm by looking up all dfm features for matches in a
set of dictionary values, and replace those features with a count of
the dictionary's keys. If exclusive = FALSE, the dictionary is instead applied
as a "thesaurus": each matching value is replaced by its dictionary key,
converted to capitals if capkeys = TRUE (so that the replacements
are easily distinguished from features that were terms found originally in
the document).
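The difference between the two modes can be sketched in base R on a single named count vector. This is only an illustration of the behaviour described above, not how dfm_lookup is implemented; the counts, the dictionary, and every variable name are hypothetical, and only exact (fixed) matching is modelled.

```r
# Hypothetical feature counts for one document
counts <- c(christmas = 1, santa = 2, ruined = 1, tax = 1)

# Hypothetical dictionary: each key maps to a set of values
dict <- list(holiday = c("christmas", "santa"), fiscal = "tax")

# Map each feature to its key; unmatched features get NA
value_to_key <- setNames(rep(names(dict), lengths(dict)),
                         unlist(dict, use.names = FALSE))
keys <- unname(value_to_key[names(counts)])
matched <- !is.na(keys)

# exclusive = TRUE: only the key totals survive
exclusive_counts <- tapply(counts[matched], keys[matched], sum)

# exclusive = FALSE with capkeys = TRUE: matched features become
# capitalised keys, unmatched features are kept as they were
labels <- ifelse(matched, toupper(keys), names(counts))
thesaurus_counts <- tapply(counts, labels, sum)

print(exclusive_counts)
print(thesaurus_counts)
```

In the thesaurus result the capitalised keys sit alongside the untouched feature "ruined", which is the distinction that capkeys is meant to make visible.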
dfm_lookup(x, dictionary, levels = 1:5, exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  capkeys = !exclusive, nomatch = NULL,
  verbose = quanteda_options("verbose"))
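Argument | Description |
---|---|
x | the dfm to which the dictionary will be applied |
dictionary | a dictionary class object |
levels | levels of entries in a hierarchical dictionary that will be applied |
exclusive | if TRUE (the default), features are replaced by counts of the dictionary keys; if FALSE, the dictionary is applied as a thesaurus, replacing matched values with their keys and leaving other features in place |
valuetype | how to interpret keyword expressions: "glob" for glob-style wildcard expressions, "regex" for regular expressions, or "fixed" for exact matching |
case_insensitive | ignore the case of dictionary values if TRUE |
capkeys | if TRUE, convert dictionary keys to capitals so that they are distinguished from other features |
nomatch | an optional character naming a new feature that will contain the counts of features of x not matched to a dictionary key; if NULL (the default), unmatched features are not tabulated |
verbose | print status messages if TRUE |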
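The three valuetype settings can give different matches for the same pattern. The base R sketch below approximates each one on a few feature names; it uses utils::glob2rx and grepl rather than quanteda's own matching code, so treat it as an illustration under that assumption. The feature vector is hypothetical.

```r
# Hypothetical feature names, as they might appear in a dfm
feats <- c("tax", "taxation", "united_states", "sweden")
pattern <- "tax*"

# "glob": * is a wildcard, and the pattern must match from the start
glob_hits <- feats[grepl(glob2rx(pattern), feats, ignore.case = TRUE)]

# "regex": "tax*" means "ta" followed by zero or more "x",
# found anywhere in the feature
regex_hits <- feats[grepl(pattern, feats, ignore.case = TRUE)]

# "fixed": the pattern is compared literally, so * is not special
fixed_hits <- feats[tolower(feats) == tolower(pattern)]

print(glob_hits)
print(regex_hits)
print(fixed_hits)
```

Read as a regular expression, "tax*" also matches "united_states", because "states" contains "ta". This is why glob patterns should not be passed with valuetype = "regex" unchanged.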
If using dfm_lookup with dictionaries containing multi-word values, matches
will only occur if the features themselves are multi-word or formed from
ngrams. A better way to match dictionary values that include multi-word
patterns is to apply tokens_lookup to the tokens, and then construct the dfm.
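A minimal sketch of that tokens-first workflow follows, assuming quanteda is installed. The text, the document name d1, and the dictionary are hypothetical; tokens, tokens_lookup, dictionary, and dfm are the quanteda functions named on this page or used in its examples.

```r
library(quanteda)

# Hypothetical text and dictionary, chosen only for illustration
txt <- c(d1 = "The United States and Sweden both tax income.")
dict <- dictionary(list(country = c("United States", "Sweden")))

# Look up dictionary values on the tokens, where the two-token
# sequence "United States" can still be matched as a unit
toks <- tokens(txt)
toks_dict <- tokens_lookup(toks, dict)

# Then build the dfm from the looked-up tokens
dfm_dict <- dfm(toks_dict)
dfm_dict
```

Applying dfm_lookup to a dfm built directly from txt would see "united" and "states" as separate features, so the multi-word value could not match.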
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             remove = stopwords("english"), verbose = FALSE)
myDfm
#> Document-feature matrix of: 2 documents, 11 features (50% sparse).
#> 2 x 11 sparse Matrix of class "dfmSparse"
#>        features
#> docs    christmas ruined opposition tax plan . united_states sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

# glob format
dfm_lookup(myDfm, myDict, valuetype = "glob")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found

# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(myDfm, myDict, valuetype = "glob")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found

# fixed format: no pattern matching
dfm_lookup(myDfm, myDict, valuetype = "fixed")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found

# show unmatched tokens
dfm_lookup(myDfm, myDict, nomatch = "_UNMATCHED")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found