This function selects or discards features from a dfm or fcm, based on
feature name matches with pattern. The most common uses are to eliminate
features, such as stopwords, from a dfm that has already been constructed,
or to select only terms of interest from a dictionary.
dfm_select(x, pattern, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_nchar = 1L, max_nchar = 63L,
  verbose = quanteda_options("verbose"), ...)

dfm_remove(x, pattern, ...)

fcm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = TRUE, ...)

fcm_remove(x, pattern, ...)
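x | the dfm or fcm object whose features will be selected |
---|---|
pattern | a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details. |
selection | whether to keep or remove the features |
valuetype | the type of pattern matching: "glob" for glob-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching |
case_insensitive | ignore the case of dictionary values if TRUE |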
min_nchar, max_nchar | numerics specifying the minimum and maximum length in characters for features to be removed or kept; per the usage signature, the defaults are 1 and 63. |
verbose | if TRUE, print messages about the selection |
... | used only for passing arguments from dfm_remove and fcm_remove to dfm_select and fcm_select |
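The effect of valuetype and case_insensitive on name matching can be sketched in base R. This is an illustration of the matching semantics only, not quanteda's implementation: match_features is a hypothetical helper, and the feature names are taken from the examples on this page.

```r
# Hypothetical helper (not part of quanteda): flag the feature names that
# match at least one of the supplied patterns.
match_features <- function(feats, pattern,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  hits <- rep(FALSE, length(feats))
  for (p in pattern) {
    if (valuetype == "fixed") {
      # exact comparison of the whole feature name
      if (case_insensitive) {
        hits <- hits | (tolower(feats) == tolower(p))
      } else {
        hits <- hits | (feats == p)
      }
    } else {
      # a glob is translated to an anchored regular expression first
      rx <- if (valuetype == "glob") utils::glob2rx(p) else p
      hits <- hits | grepl(rx, feats, ignore.case = case_insensitive)
    }
  }
  hits
}

feats <- c("My", "Christmas", "was", "by", "taxation", "United_States")

fixed_ci  <- feats[match_features(feats, "my", "fixed")]
fixed_cs  <- feats[match_features(feats, "my", "fixed", case_insensitive = FALSE)]
glob_hit  <- feats[match_features(feats, "tax*", "glob")]
regex_hit <- feats[match_features(feats, c("s$", ".y"), "regex")]

# A length filter in the spirit of min_nchar = 5
long_feats <- feats[nchar(feats) >= 5]
```

With case-insensitive fixed matching, "my" selects "My"; with case-sensitive matching it selects nothing, which mirrors the difference between the first two dictionary examples on this page.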
A dfm or fcm object, after the feature selection has
been applied.
When pattern is a dfm object, the returned object will be identical in its
feature set to the dfm supplied as the pattern argument. This means that any
features in x not in the dfm provided as pattern will be discarded, and that
any features found in the dfm supplied as pattern but not found in x will be
added with all zero counts. Because selecting on a dfm is designed to produce
a selected dfm with an exact feature match, when pattern is a dfm object the
following settings are always used: case_insensitive = FALSE and
valuetype = "fixed".
Selecting on a dfm is useful when you have trained a model on one dfm and
need to project it onto a test set whose features must be identical. It is
also used in bootstrap_dfm. See examples.
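The drop-and-pad behaviour described above can be sketched with ordinary matrices in base R. This is a minimal illustration under stated assumptions, not quanteda's code: conform_features is a hypothetical helper, and the counts reproduce the "This is text one" / "The second text" documents from the examples.

```r
# Hypothetical helper (not part of quanteda): give x exactly the columns in
# `features`, in that order. Columns absent from x are added as zeros and
# columns of x that are not in `features` are dropped.
conform_features <- function(x, features) {
  out <- matrix(0, nrow = nrow(x), ncol = length(features),
                dimnames = list(rownames(x), features))
  shared <- intersect(features, colnames(x))
  out[, shared] <- x[, shared, drop = FALSE]
  out
}

# Counts for the first two example texts, as a plain numeric matrix
dfm1 <- rbind(
  text1 = c(this = 1, is = 1, text = 1, one = 1, the = 0, second = 0),
  text2 = c(this = 0, is = 0, text = 1, one = 0, the = 1, second = 1)
)

# Feature set of the dfm used as the pattern
target <- c("the", "second", "text", "this", "is", "three")

dfm3 <- conform_features(dfm1, target)
```

Here "one" is discarded because it is not in the target feature set, and "three" is added as a column of zeros, so a model fitted on the target features could be applied to dfm3 column for column.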
dfm_remove and fcm_remove are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "remove".
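The wrapper relationship can be sketched on a plain character vector of feature names. Both functions below are hypothetical stand-ins written for illustration; they use only fixed, case-insensitive matching and are not the package's definitions.

```r
# Hypothetical stand-in for a select function working on feature names
select_names <- function(feats, pattern, selection = c("keep", "remove"), ...) {
  selection <- match.arg(selection)
  hit <- tolower(feats) %in% tolower(pattern)  # fixed, case-insensitive match
  if (selection == "keep") feats[hit] else feats[!hit]
}

# The "remove" variant only fixes the selection argument and forwards the rest
remove_names <- function(feats, pattern, ...) {
  select_names(feats, pattern, selection = "remove", ...)
}

feats <- c("this", "is", "a", "document", "with", "lots", "of", "stopwords")
stops <- c("this", "is", "a", "with", "of")  # illustrative stop list

removed <- remove_names(feats, stops)
same    <- select_names(feats, stops, selection = "remove")
```

Because the wrapper already supplies selection, passing selection again through ... would be a duplicate argument.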
This function selects features based on their labels. To select features
based on the values of the document-feature matrix, use dfm_trim.
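The distinction between selecting on labels and selecting on values can be sketched with a plain count matrix. The matrix, the name list, and the frequency threshold below are all made up for illustration; the value-based line only imitates the kind of frequency trimming that dfm_trim performs and does not use its arguments.

```r
# Illustrative count matrix: documents in rows, features in columns
m <- rbind(
  text1 = c(tax = 1, plan = 1, the = 0, sweden = 0),
  text2 = c(tax = 2, plan = 0, the = 1, sweden = 1)
)

# Label-based selection: decided by the column names alone
by_name <- m[, colnames(m) %in% c("tax", "sweden"), drop = FALSE]

# Value-based selection: decided by the counts (total frequency of at least 2)
by_value <- m[, colSums(m) >= 2, drop = FALSE]
```

The first result depends only on what the features are called; the second depends only on how often they occur.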
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
dfm_select(myDfm, mydict)
#> Document-feature matrix of: 2 documents, 4 features (50% sparse).
#> 2 x 4 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(myDfm, mydict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50% sparse).
#> 2 x 1 sparse Matrix of class "dfmSparse"
#>        features
#> docs    by
#>   text1  1
#>   text2  0
dfm_select(myDfm, c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
dfm_select(myDfm, c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50% sparse).
#> 2 x 14 sparse Matrix of class "dfmSparse"
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have more progressive
#>   text1      1    1          1   1    1 1   0  0      0    0    0           0
#>   text2      0    0          0   0    0 0   1  1      1    1    1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
dfm_select(myDfm, stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50% sparse).
#> 2 x 9 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
dfm_select(myDfm, stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50% sparse).
#> 2 x 11 sparse Matrix of class "dfmSparse"
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1

# select based on character length
dfm_select(myDfm, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50% sparse).
#> 2 x 7 sparse Matrix of class "dfmSparse"
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1

# selecting on a dfm
txts <- c("This is text one", "The second text", "This is text three")
(dfm1 <- dfm(txts[1:2]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this is text one the second
#>   text1    1  1    1   1   0      0
#>   text2    0  0    1   0   1      1
(dfm2 <- dfm(txts[2:3]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    the second text this is three
#>   text1   1      1    1    0  0     0
#>   text2   0      0    1    1  1     1
(dfm3 <- dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE))
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    the second text this is three
#>   text1   0      0    1    1  1     0
#>   text2   1      1    1    0  0     0
setequal(featnames(dfm2), featnames(dfm3))
#> [1] TRUE

tmpdfm <- dfm(c("This is a document with lots of stopwords.",
                "No if, and, or but about it: lots of stopwords."),
              verbose = FALSE)
tmpdfm
#> Document-feature matrix of: 2 documents, 18 features (38.9% sparse).
#> 2 x 18 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this is a document with lots of stopwords . no if , and or but about it
#>   text1    1  1 1        1    1    1  1         1 1  0  0 0   0  0   0     0  0
#>   text2    0  0 0        0    0    1  1         1 1  1  1 2   1  1   1     1  1
#>        features
#> docs    :
#>   text1 0
#>   text2 1
dfm_remove(tmpdfm, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1

toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
tmpfcm <- fcm(toks)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
tmpfcm
#> Error in eval(expr, envir, enclos): object 'tmpfcm' not found
fcm_remove(tmpfcm, stopwords("english"))
#> Error in fcm_remove(tmpfcm, stopwords("english")): object 'tmpfcm' not found