This function selects or removes features from a dfm or fcm, based on
feature name matches with pattern. The most common usages are to eliminate
features from a dfm already constructed, such as stopwords, or to select
only terms of interest from a dictionary.
dfm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_nchar = 1L, max_nchar = 79L, verbose = quanteda_options("verbose"))

dfm_remove(x, ...)

dfm_keep(x, ...)

fcm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = quanteda_options("verbose"), ...)

fcm_remove(x, pattern = NULL, ...)

fcm_keep(x, pattern = NULL, ...)
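Argument | Description |
---|---|
x | the dfm or fcm object whose features will be selected |
pattern | a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details. |
selection | whether to keep or remove the features |
valuetype | the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. For dfm_select, if pattern is a dfm, then valuetype is ignored and set to "fixed". |
case_insensitive | ignore the case of dictionary values if TRUE |
min_nchar, max_nchar | numerics specifying the minimum and maximum length in characters for features to be removed or kept; defaults are 1 and 79. (Set max_nchar to NULL for no upper limit.) |
verbose | if TRUE, print a message reporting how many features were removed |
... | used only for passing arguments from dfm_remove or dfm_keep to dfm_select |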
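To make the three valuetype modes concrete, here is a minimal base-R sketch of how each could be applied to a vector of feature names. The helper match_features() is hypothetical and is not part of quanteda; it only illustrates the matching semantics under the assumption that glob patterns are translated to anchored regular expressions.

```r
# Hypothetical helper (not a quanteda function): returns TRUE for each
# feature name that matches at least one element of `pattern`.
match_features <- function(features, pattern,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  hit <- rep(FALSE, length(features))
  for (p in pattern) {
    if (valuetype == "fixed") {
      # exact match on the whole feature name
      if (case_insensitive) {
        hit <- hit | (tolower(features) == tolower(p))
      } else {
        hit <- hit | (features == p)
      }
    } else {
      # a glob such as "tax*" is translated to an anchored regex;
      # a regex pattern is used as given
      rx <- if (valuetype == "glob") utils::glob2rx(p) else p
      hit <- hit | grepl(rx, features, ignore.case = case_insensitive)
    }
  }
  hit
}

feats <- c("My", "Christmas", "was", "by", "tax", "taxation")
glob_hits  <- feats[match_features(feats, "tax*", "glob")]
regex_hits <- feats[match_features(feats, c("s$", ".y"), "regex")]
fixed_hits <- feats[match_features(feats, "my", "fixed",
                                   case_insensitive = FALSE)]
```

With these inputs, the glob pattern keeps "tax" and "taxation", the two regular expressions keep "My", "Christmas", "was" and "by", and the case-sensitive fixed match on "my" keeps nothing.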
A dfm or fcm object, after the feature selection has been applied.
When pattern is a dfm object, then the returned object will be identical
in its feature set to the dfm supplied as the pattern argument. This means
that any features in x not in the dfm provided as pattern will be
discarded, and that any features found in the dfm supplied as pattern but
not found in x will be added with all zero counts. Because selecting on a
dfm is designed to produce a selected dfm with an exact feature match,
when pattern is a dfm object, the following settings are always used:
case_insensitive = FALSE and valuetype = "fixed".
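The effect of selecting on a dfm can be sketched with ordinary base-R matrices. The helper conform_features() below is hypothetical and is not how quanteda implements the operation; it only shows the described behaviour of dropping features absent from the target set and padding missing ones with zeros.

```r
# Hypothetical helper (not a quanteda function): reshape the columns of `x`
# so that they are exactly `target_features`, in that order.
conform_features <- function(x, target_features) {
  # start from an all-zero matrix with the target feature set
  out <- matrix(0, nrow = nrow(x), ncol = length(target_features),
                dimnames = list(rownames(x), target_features))
  # copy over the columns present in both; everything else stays zero
  shared <- intersect(target_features, colnames(x))
  out[, shared] <- x[, shared]
  out
}

# counts for "This is text one" and "The second text" (filled column-wise)
x <- matrix(c(1, 0,  1, 0,  1, 1,  1, 0,  0, 1,  0, 1), nrow = 2,
            dimnames = list(c("text1", "text2"),
                            c("this", "is", "text", "one", "the", "second")))
target <- c("the", "second", "text", "this", "is", "three")
res <- conform_features(x, target)
```

Here the feature "one" is discarded because it is not in the target set, and "three" is added as a column of zeros, so res has exactly the columns listed in target.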
Selecting on a dfm is useful when you have trained a model on one
dfm, and need to project this onto a test set whose features must be
identical. It is also used in bootstrap_dfm. See examples.
dfm_remove and fcm_remove are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "keep".
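The wrapper pattern itself is easy to sketch in base R. The functions select_names(), remove_names() and keep_names() below are hypothetical stand-ins that operate on a plain character vector, not quanteda code; they show how a wrapper can forward its arguments through ... while fixing selection.

```r
# Hypothetical stand-in for a select function: keeps or removes the
# elements of `x` that appear in `pattern`.
select_names <- function(x, pattern, selection = c("keep", "remove")) {
  selection <- match.arg(selection)
  hit <- x %in% pattern
  if (selection == "keep") x[hit] else x[!hit]
}

# Wrappers forward everything in `...` and pin the `selection` argument.
remove_names <- function(x, ...) select_names(x, ..., selection = "remove")
keep_names   <- function(x, ...) select_names(x, ..., selection = "keep")

feats   <- c("the", "tax", "plan")
kept    <- keep_names(feats, "the")
removed <- remove_names(feats, "the")

# Passing `selection` through a wrapper collides with the pinned value.
clash <- tryCatch(remove_names(feats, "the", selection = "keep"),
                  error = function(e) e)
```

In this sketch, supplying selection to a wrapper raises an error because the argument would be matched twice, which is one reason a wrapper of this shape cannot accept selection through its dots.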
This function selects features based on their labels. To select
features based on the values of the document-feature matrix, use
dfm_trim.
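The difference between selecting on labels and selecting on values can be illustrated with a plain base-R count matrix. This is only a sketch: the matrix m and the threshold of 2 are made up for illustration, and the code does not call dfm_select or dfm_trim.

```r
# Toy count matrix (documents in rows, features in columns), filled column-wise
m <- matrix(c(1, 0,  2, 3,  0, 1), nrow = 2,
            dimnames = list(c("text1", "text2"), c("tax", "plan", "sweden")))

# Label-based selection: keep columns whose names are in a given set
by_label <- m[, colnames(m) %in% c("tax", "sweden"), drop = FALSE]

# Value-based selection: keep columns whose total count reaches a threshold
by_value <- m[, colSums(m) >= 2, drop = FALSE]
```

The first result depends only on the feature names, while the second depends only on the counts stored in the matrix.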
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
dfm_select(myDfm, mydict)
#> Document-feature matrix of: 2 documents, 4 features (50% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1

dfm_select(myDfm, mydict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>        features
#> docs    by
#>   text1  1
#>   text2  0

dfm_select(myDfm, c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1

dfm_select(myDfm, c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50% sparse).
#> 2 x 14 sparse Matrix of class "dfm"
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have more progressive
#>   text1      1    1          1   1    1 1   0  0      0    0    0           0
#>   text2      0    0          0   0    0 0   1  1      1    1    1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

#> Document-feature matrix of: 2 documents, 9 features (50% sparse).
#> 2 x 9 sparse Matrix of class "dfm"
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1

#> Document-feature matrix of: 2 documents, 11 features (50% sparse).
#> 2 x 11 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

# select based on character length
dfm_select(myDfm, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50% sparse).
#> 2 x 7 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1

# selecting on a dfm
txts <- c("This is text one", "The second text", "This is text three")
(dfm1 <- dfm(txts[1:2]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    this is text one the second
#>   text1    1  1    1   1   0      0
#>   text2    0  0    1   0   1      1

(dfm2 <- dfm(txts[2:3]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    the second text this is three
#>   text1   1      1    1    0  0     0
#>   text2   0      0    1    1  1     1

(dfm3 <- dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE))
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    the second text this is three
#>   text1   0      0    1    1  1     0
#>   text2   1      1    1    0  0     0

#> [1] TRUE

tmpdfm <- dfm(c("This is a document with lots of stopwords.",
                "No if, and, or but about it: lots of stopwords."),
              verbose = FALSE)
tmpdfm
#> Document-feature matrix of: 2 documents, 18 features (38.9% sparse).
#> 2 x 18 sparse Matrix of class "dfm"
#>        features
#> docs    this is a document with lots of stopwords . no if , and or but about it
#>   text1    1  1 1        1    1    1  1         1 1  0  0 0   0  0   0     0  0
#>   text2    0  0 0        0    0    1  1         1 1  1  1 2   1  1   1     1  1
#>        features
#> docs    :
#>   text1 0
#>   text2 1

#> Document-feature matrix of: 2 documents, 6 features (25% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1

toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
tmpfcm <- fcm(toks)
tmpfcm
#> Feature co-occurrence matrix of: 12 by 12 features.
#> 12 x 12 sparse Matrix of class "fcm"
#>            features
#> features    this contains lots of stopwords no if and or but about it
#>   this         0        1    1  1         1  0  0   0  0   0     0  0
#>   contains     0        0    1  1         1  0  0   0  0   0     0  0
#>   lots         0        0    0  1         1  1  1   1  1   1     1  1
#>   of           0        0    0  0         1  0  0   0  0   0     0  0
#>   stopwords    0        0    0  0         0  0  0   0  0   0     0  0
#>   no           0        0    0  0         0  0  1   1  1   1     1  1
#>   if           0        0    0  0         0  0  0   1  1   1     1  1
#>   and          0        0    0  0         0  0  0   0  1   1     1  1
#>   or           0        0    0  0         0  0  0   0  0   1     1  1
#>   but          0        0    0  0         0  0  0   0  0   0     1  1
#>   about        0        0    0  0         0  0  0   0  0   0     0  1
#>   it           0        0    0  0         0  0  0   0  0   0     0  0

#> Feature co-occurrence matrix of: 3 by 3 features.
#> 3 x 3 sparse Matrix of class "fcm"
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0