dfm_select.Rd
This function selects or removes features from a dfm or fcm, based on
feature name matches with pattern. The most common usages are to eliminate
features from a dfm already constructed, such as stopwords, or to select
only terms of interest from a dictionary.
dfm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_nchar = 1L, max_nchar = 79L, verbose = quanteda_options("verbose"))

dfm_remove(x, ...)

dfm_keep(x, ...)

fcm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = quanteda_options("verbose"), ...)

fcm_remove(x, pattern = NULL, ...)

fcm_keep(x, pattern = NULL, ...)
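Argument | Description |
---|---|
x | the dfm or fcm object whose features will be selected |
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection | whether to keep or remove the matched features: "keep" or "remove" |
valuetype | the type of pattern matching: "glob", "regex", or "fixed" |
case_insensitive | ignore the case of dictionary values if TRUE |
min_nchar, max_nchar | numerics specifying the minimum and maximum length in characters for features to be removed or kept; defaults are 1 and 79. |
verbose | if TRUE, print messages about the selection |
... | used only for passing arguments from the dfm_remove, dfm_keep, fcm_remove, and fcm_keep wrappers to dfm_select or fcm_select |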
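The three valuetype options differ in how a pattern is compared with a feature name. The base R sketch below illustrates the general idea only; match_features is a hypothetical helper written for this example, not a quanteda function, and it makes no claim to reproduce quanteda's internal matching.

```r
# Hypothetical feature names standing in for the feature labels of a dfm
feats <- c("tax", "taxation", "Tax", "plan", "a.b")

# Illustrative helper (NOT part of quanteda): returns the features that match
# any element of `pattern` under the chosen valuetype.
match_features <- function(feats, pattern,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  hit <- rep(FALSE, length(feats))
  for (p in pattern) {
    hit <- hit | switch(valuetype,
      # glob: shell-style wildcards, converted to an anchored regex
      glob  = grepl(utils::glob2rx(p), feats, ignore.case = case_insensitive),
      # regex: the pattern is used as a regular expression as-is
      regex = grepl(p, feats, ignore.case = case_insensitive),
      # fixed: exact comparison, no special characters
      fixed = if (case_insensitive) tolower(feats) == tolower(p) else feats == p
    )
  }
  feats[hit]
}

glob_hits  <- match_features(feats, "tax*", "glob")
regex_hits <- match_features(feats, "^tax", "regex", case_insensitive = FALSE)
fixed_hits <- match_features(feats, "a.b", "fixed")  # "." is literal here
```

With case_insensitive = TRUE the glob pattern also picks up "Tax", while the case-sensitive regex call does not.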
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a dfm object and
selection = "keep", then this will be equivalent to calling dfm_match. In
this case, the following settings are always used: case_insensitive = FALSE
and valuetype = "fixed". This functionality is deprecated, however, and you
should use dfm_match instead.
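As a minimal sketch of the recommended replacement, the example below conforms one dfm to the feature set of another with dfm_match. The object names and texts are made up for illustration, and the dfm(tokens(...)) construction is an assumption chosen because it works across quanteda versions.

```r
library(quanteda)

# Two small, made-up dfm objects
dfmat_ref <- dfm(tokens(c(d1 = "tax plan", d2 = "tax cut")))
dfmat_new <- dfm(tokens(c(d3 = "new tax law")))

# Conform dfmat_new to the features of dfmat_ref: shared features are kept,
# features absent from dfmat_new are added as all-zero columns, and features
# not in dfmat_ref are dropped.
dfmat_conf <- dfm_match(dfmat_new, features = featnames(dfmat_ref))
featnames(dfmat_conf)
```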
dfm_remove and fcm_remove are convenience wrappers that call dfm_select
and fcm_select with selection = "remove".
dfm_keep and fcm_keep are convenience wrappers that call dfm_select and
fcm_select with selection = "keep".
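The equivalence can be checked directly. The sketch below uses a short made-up text and a small hand-written stopword vector (stops is a hypothetical stand-in, not a quanteda object) and compares each wrapper with the matching dfm_select call.

```r
library(quanteda)

dfmat <- dfm(tokens("no ifs and or buts about the tax plan"))
stops <- c("no", "and", "or", "about", "the")  # hypothetical stopword list

# Wrapper versus explicit selection = "remove"
removed_a <- dfm_remove(dfmat, pattern = stops)
removed_b <- dfm_select(dfmat, pattern = stops, selection = "remove")

# Wrapper versus explicit selection = "keep"
kept_a <- dfm_keep(dfmat, pattern = stops)
kept_b <- dfm_select(dfmat, pattern = stops, selection = "keep")

identical(featnames(removed_a), featnames(removed_b))
identical(featnames(kept_a), featnames(kept_b))
```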
This function selects features based on their labels. To select features
based on the values of the document-feature matrix, use dfm_trim.
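To make the distinction concrete, the sketch below applies both approaches to the same made-up dfm. It assumes the min_termfreq argument of dfm_trim, which is available in recent quanteda releases; the texts and threshold are chosen only for illustration.

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "tax tax plan", d2 = "taxation plan")))

# Selection by label: keeps features whose names match the glob pattern
by_label <- dfm_select(dfmat, pattern = "tax*")

# Selection by value: keeps features whose total count is at least 2
by_value <- dfm_trim(dfmat, min_termfreq = 2)

featnames(by_label)
featnames(by_value)
```

Here "taxation" survives the label-based selection despite occurring once, while "plan" survives the value-based trim despite not matching the pattern.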
dfmat <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE, verbose = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.0% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>        features
#> docs    by
#>   text1  1
#>   text2  0

#> Document-feature matrix of: 2 documents, 6 features (50.0% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1

#> Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
#> 2 x 14 sparse Matrix of class "dfm"
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have more progressive
#>   text1      1    1          1   1    1 1   0  0      0    0    0           0
#>   text2      0    0          0   0    0 0   1  1      1    1    1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

#> Document-feature matrix of: 2 documents, 9 features (50.0% sparse).
#> 2 x 9 sparse Matrix of class "dfm"
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1

#> Document-feature matrix of: 2 documents, 11 features (50.0% sparse).
#> 2 x 11 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

# select based on character length
dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.0% sparse).
#> 2 x 7 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1

dfmat <- dfm(c("This is a document with lots of stopwords.",
               "No if, and, or but about it: lots of stopwords."),
             verbose = FALSE)
dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.9% sparse).
#> 2 x 18 sparse Matrix of class "dfm"
#>        features
#> docs    this is a document with lots of stopwords . no if , and or but about it
#>   text1    1  1 1        1    1    1  1         1 1  0  0 0   0  0   0     0  0
#>   text2    0  0 0        0    0    1  1         1 1  1  1 2   1  1   1     1  1
#>        features
#> docs    :
#>   text1 0
#>   text2 1

#> Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1

toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#> 12 x 12 sparse Matrix of class "fcm"
#>            features
#> features    this contains lots of stopwords no if and or but about it
#>   this         0        1    1  1         1  0  0   0  0   0     0  0
#>   contains     0        0    1  1         1  0  0   0  0   0     0  0
#>   lots         0        0    0  1         1  1  1   1  1   1     1  1
#>   of           0        0    0  0         1  0  0   0  0   0     0  0
#>   stopwords    0        0    0  0         0  0  0   0  0   0     0  0
#>   no           0        0    0  0         0  0  1   1  1   1     1  1
#>   if           0        0    0  0         0  0  0   1  1   1     1  1
#>   and          0        0    0  0         0  0  0   0  1   1     1  1
#>   or           0        0    0  0         0  0  0   0  0   1     1  1
#>   but          0        0    0  0         0  0  0   0  0   0     1  1
#>   about        0        0    0  0         0  0  0   0  0   0     0  1
#>   it           0        0    0  0         0  0  0   0  0   0     0  0

#> Feature co-occurrence matrix of: 3 by 3 features.
#> 3 x 3 sparse Matrix of class "fcm"
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0