dfm_lookup.Rd
Apply a dictionary to a dfm by looking up all dfm features for matches in a set of dictionary values, and replace those features with a count of the dictionary's keys. If exclusive = FALSE, the behaviour is to apply a "thesaurus", where each value match is replaced by the dictionary key, converted to capitals if capkeys = TRUE (so that the replacements are easily distinguished from features that were terms found originally in the document).
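The "thesaurus" behaviour described above can be sketched as follows. This is a minimal illustration rather than part of the original examples: the object names `toks`, `dfmat_small`, and `dfmat_thes` are made up for the sketch, and the dfm is built from `tokens()` on the assumption that this works in the installed quanteda version.

```r
library(quanteda)

# A small dictionary with two keys (a subset of the one used in the examples)
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject")))

# Hypothetical one-document dfm, built via tokens()
toks <- tokens("My Christmas was ruined by your opposition tax plan.")
dfmat_small <- dfm(toks)

# exclusive = FALSE: matched features are replaced by their dictionary key,
# and unmatched features are kept. capkeys defaults to !exclusive, so the
# keys are converted to capitals here.
dfmat_thes <- dfm_lookup(dfmat_small, dict, exclusive = FALSE)
featnames(dfmat_thes)
```

Under these assumptions the matched features should appear under the capitalised keys, while features such as "tax" and "plan" remain as they were.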
dfm_lookup(x, dictionary, levels = 1:5, exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  capkeys = !exclusive, nomatch = NULL,
  verbose = quanteda_options("verbose"))
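Argument | Description |
---|---|
x | the dfm to which the dictionary will be applied |
dictionary | a dictionary class object |
levels | levels of entries in a hierarchical dictionary that will be applied |
exclusive | if TRUE, remove all features not in the dictionary; otherwise, replace values in the dictionary with keys while leaving other features unaffected |
valuetype | the type of pattern matching: "glob" for glob-style wildcard expressions, "regex" for regular expressions, or "fixed" for exact matching |
case_insensitive | ignore the case of dictionary values if TRUE |
capkeys | if TRUE, convert dictionary keys to uppercase to distinguish them from other features |
nomatch | an optional character naming a new feature that will contain the counts of features of x not matched to a dictionary key; if NULL (the default), unmatched features are not tabulated |
verbose | print status messages if TRUE |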
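To illustrate the levels argument, here is a small sketch using a nested (hierarchical) dictionary. The dictionary, text, and object names are invented for illustration, and the comments describe the behaviour I would expect from how levels is documented for dictionary lookups, not a verified output.

```r
library(quanteda)

# Hypothetical two-level dictionary: "politics" is level 1,
# "left" and "right" are level 2
dict_nested <- dictionary(list(
  politics = list(left = "labour", right = "tory")
))

dfmat_pol <- dfm(tokens("labour and tory and labour"))

# Apply only the top level: matches are counted under the level-1 key
dfmat_l1 <- dfm_lookup(dfmat_pol, dict_nested, levels = 1)

# Apply only the second level: matches are counted under the level-2 keys
dfmat_l2 <- dfm_lookup(dfmat_pol, dict_nested, levels = 2)

featnames(dfmat_l1)
featnames(dfmat_l2)
```

Choosing levels this way lets the same dictionary produce either coarse or fine-grained categories without being rewritten.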
If using dfm_lookup with dictionaries containing multi-word values, matches will only occur if the features themselves are multi-word or formed from ngrams. A better way to match dictionary values that include multi-word patterns is to apply tokens_lookup to the tokens, and then construct the dfm.
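The tokens-first approach recommended in this note might look like the sketch below. The dictionary `dict_mw` and the other object names are hypothetical and chosen for illustration; only `dictionary()`, `tokens()`, `tokens_lookup()`, `dfm()`, and `dfm_lookup()` are quanteda functions.

```r
library(quanteda)

# Hypothetical dictionary with a multi-word value ("United States")
dict_mw <- dictionary(list(country = c("United States", "Sweden")))

toks <- tokens("Does the United States or Sweden have more progressive taxation?")

# Look up on the tokens, where a sequence of adjacent tokens can be matched,
# and only then construct the dfm
toks_dict <- tokens_lookup(toks, dictionary = dict_mw)
dfmat_mw <- dfm(toks_dict)

# For contrast: looking up on a dfm of single-word features, where the
# multi-word value has no feature to match
dfmat_uni <- dfm_lookup(dfm(toks), dict_mw)

dfmat_mw
dfmat_uni
```

Comparing the two results should show the multi-word value being counted only in the tokens-based version.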
See also: dfm_replace
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject", "notincorpus"),
                        taxglob = "tax*",
                        taxregex = "tax.+$",
                        country = c("United_States", "Sweden")))
dfmat <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             remove = stopwords("english"), verbose = FALSE)
dfmat
#> Document-feature matrix of: 2 documents, 11 features (50.0% sparse).
#> 2 x 11 sparse Matrix of class "dfm"
#>        features
#> docs    christmas ruined opposition tax plan . united_states sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2

# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)
#> Document-feature matrix of: 2 documents, 5 features (40.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       2        1       2

# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 5 features (70.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (70.0% sparse).
#> 2 x 5 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2

# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    christmas opposition taxglob taxregex country _UNMATCHED
#>   text1         1          1       1        0       0          3
#>   text2         0          0       1        0       2          2