Returns a document-feature matrix reduced in size based on document and term frequency, usually in terms of minimum frequencies, but optionally also in terms of maximum frequencies. Setting both a minimum and a maximum frequency selects features whose frequencies fall within that range.
dfm_trim(x, min_count = 1, min_docfreq = 1, max_count = NULL, max_docfreq = NULL, sparsity = NULL, verbose = quanteda_options("verbose"))
x | a dfm object |
---|---|
min_count, max_count | minimum/maximum count or fraction of features across all documents, below/above which features will be removed |
min_docfreq, max_docfreq | minimum/maximum number or fraction of documents in which a feature appears, below/above which features will be removed |
sparsity | equivalent to 1 - min_docfreq, included for comparison with tm |
verbose | print messages |
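The two kinds of threshold described above can be illustrated with a plain base R matrix. This is only a sketch of the idea, not quanteda's implementation: `trim_sketch` is a hypothetical helper, the toy counts are made up, and treating a `min_docfreq` below 1 as a fraction of the number of documents is an assumption based on the argument description.

```r
# Toy document-feature counts: 4 documents (rows) by 5 features (columns).
m <- matrix(
  c(3, 0, 1, 0, 2,
    4, 1, 0, 0, 2,
    5, 0, 0, 1, 2,
    2, 0, 0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("the", "rare", "once", "single", "common"))
)

# Hypothetical helper, not part of quanteda.
trim_sketch <- function(m, min_count = 1, min_docfreq = 1) {
  total_count <- colSums(m)      # occurrences of each feature across all documents
  doc_freq    <- colSums(m > 0)  # number of documents containing each feature
  # Assumption: a min_docfreq below 1 is a fraction of the number of documents.
  if (min_docfreq < 1) min_docfreq <- min_docfreq * nrow(m)
  keep <- total_count >= min_count & doc_freq >= min_docfreq
  m[, keep, drop = FALSE]        # documents are kept; only features are dropped
}

# Keep features occurring at least twice and in at least half of the documents.
trimmed <- trim_sketch(m, min_count = 2, min_docfreq = 0.5)
colnames(trimmed)
```

Only the feature columns are filtered here, which matches the documented behaviour that the result has the same number of documents as the input.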
A dfm reduced in features (with the same number of documents).
Trimming a dfm object is an operation based on the values in the document-feature matrix. To select subsets of a dfm based on the features themselves (meaning the feature labels from featnames), such as keeping those matching a regular expression or removing those matching a stopword list, use dfm_select.
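The difference between trimming on values and selecting on labels can be sketched in base R with an ordinary count matrix. The matrix and the thresholds below are illustrative assumptions, and the code deliberately avoids quanteda calls so that it does not depend on a particular package version.

```r
# Toy counts: 2 documents (rows) by 3 features (columns).
m <- matrix(
  c(2, 1, 0,
    3, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("the", "tax", "taxes"))
)

# Value-based filtering (the kind of operation trimming performs):
# keep features that occur at least twice across all documents.
by_value <- m[, colSums(m) >= 2, drop = FALSE]

# Label-based filtering (the kind of operation selection performs):
# keep features whose name matches a regular expression.
by_label <- m[, grepl("^tax", colnames(m)), drop = FALSE]

colnames(by_value)
colnames(by_label)
```

The first filter looks only at the counts and never at the names; the second looks only at the names and never at the counts.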
#> Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).

# keep only words occurring >= 10 times and in >= 2 documents
dfm_trim(myDfm, min_count = 10, min_docfreq = 2)

# keep only words occurring >= 10 times and in at least 0.4 of the documents
dfm_trim(myDfm, min_count = 10, min_docfreq = 0.4)

# keep only words occurring <= 10 times and in <= 2 documents
dfm_trim(myDfm, max_count = 10, max_docfreq = 2)

# keep only words occurring <= 10 times and in at most 3/4 of the documents
dfm_trim(myDfm, max_count = 10, max_docfreq = 0.75)

# keep only words meeting a fractional count threshold of 0.01 and occurring in >= 2 documents
dfm_trim(myDfm, min_count = .01, min_docfreq = 2)

# keep only words occurring 5 times in 1000, and in 2 of 5 of documents
dfm_trim(myDfm, min_docfreq = 0.4, min_count = 0.005)

not_run({
# compare to removeSparseTerms from the tm package
if (require(tm)) {
    (tmdtm <- convert(myDfm, "tm"))
    removeSparseTerms(tmdtm, 0.7)
    # sparsity = 0.7 is equivalent to min_docfreq = 1 - 0.7 = 0.3
    dfm_trim(myDfm, min_docfreq = 0.3)
    dfm_trim(myDfm, sparsity = 0.7)
}
})