Weight the feature frequencies in a dfm
dfm_weight(x, scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave"), weights = NULL, base = 10, K = 0.5)
dfm_smooth(x, smoothing = 1)
x | document-feature matrix created by dfm |
---|---|
scheme | a label of the weight type:
|
weights | if |
base | base for the logarithm when scheme is "logcount" or "logave" |
K | the K for the augmentation when scheme = "augmented" |
smoothing | constant added to the dfm cells for smoothing, default is 1 |
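To make the roles of base and K more concrete, here is a minimal base-R sketch of three of the schemes, applied to the counts of a single document. It does not call dfm_weight. The formulas are an assumption: they follow the conventional definitions in Manning, Raghavan, and Schütze (log-scaled counts as 1 + log(tf), augmented counts as K + (1 - K) * tf / max(tf)), and the package's own implementation may differ in detail. The variable names are illustrative only.

```r
# Term counts for one document (a plain named vector, not a dfm object).
tf <- c(apple = 1, better = 1, banana = 2, much = 1)

base <- 10   # logarithm base, mirroring the `base` argument
K <- 0.5     # augmentation constant, mirroring the `K` argument

# Assumed "logcount" definition: 1 + log(tf) for non-zero counts, else 0.
logcount <- ifelse(tf > 0, 1 + log(tf, base = base), 0)

# Assumed "augmented" definition: K + (1 - K) * tf / max(tf).
augmented <- K + (1 - K) * tf / max(tf)

# Assumed "boolean" definition: 1 if the feature occurs, else 0.
boolean <- as.numeric(tf > 0)

print(round(logcount, 3))
print(augmented)
print(boolean)
```

Under these assumed definitions, the most frequent feature in a document always receives an augmented weight of 1, and K sets the floor for features that occur at least once.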
dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".
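As a rough illustration of what "proportions of the feature counts within each document" means, the following base-R sketch divides each row of a small count matrix by its row total. The matrix is a plain stand-in for a dfm (its counts are copied from the two-document example on this page), and the code is not the package's implementation.

```r
# Plain base-R matrix standing in for an unweighted dfm; not a real dfm object.
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

# Within-document proportions: divide every row by that row's total count.
# R recycles the row totals down the columns, so cell [i, j] is divided by
# the total of row i.
prop <- counts / rowSums(counts)

print(prop)
print(rowSums(prop))  # each document's proportions sum to 1
```

Because every row is rescaled to sum to 1, documents of different lengths become directly comparable, which is the usual reason for choosing this scheme over raw counts.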
dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount to every cell. Note that this effectively converts the matrix from sparse to dense format, so it may exceed available memory, depending on the size of your input matrix.
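The memory warning follows from the arithmetic: a sparse format stores only non-zero cells, and adding a positive constant leaves no zero cells. A small base-R sketch of this, using a plain matrix as a stand-in for a dfm (counts copied from the two-document example on this page) rather than calling dfm_smooth itself:

```r
# Plain base-R matrix standing in for a dfm; not a real dfm object.
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

smoothing <- 0.5

# Additive smoothing: the constant is added to every cell, zeros included.
smoothed <- counts + smoothing

zeros_before <- sum(counts == 0)   # cells a sparse format would not store
zeros_after <- sum(smoothed == 0)  # no zero cells remain after smoothing

print(smoothed)
print(c(before = zeros_before, after = zeros_after))
```

For a real document-feature matrix with thousands of features and high sparsity, the same effect means that nearly every cell must be stored explicitly after smoothing.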
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
#>   the    of     ,   and     .    to    in     a   our  that
#> 10082  7103  7026  5310  4945  4526  2785  2246  2181  1789

#>       the         ,        of       and         .        to        in       our
#> 3.7910332 2.7639649 2.6821863 2.0782035 1.9594539 1.7643366 1.0695645 0.8731637
#>         a        we
#> 0.8593092 0.7726443

#>      the        ,       of      and        .       to       in      our
#> 55.13499 42.22681 39.34995 31.43686 30.76141 26.37869 16.08336 13.97242
#>        a       we
#> 13.38024 13.21974

#>      the        ,       of      and        .       to       in        a
#> 182.1856 174.3182 173.3837 167.1782 164.9945 163.2151 150.4070 143.6032
#>      our     that
#> 140.7424 138.9939

#>       the         ,        of       and         .        to        in         a
#> 121.98599 116.64229 116.06902 111.83340 110.34098 109.25338 100.55961  95.75088
#>       our      that
#>  93.81347  92.88474

# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dtm, scheme_tf = "logcount"))
#> Document-feature matrix of: 6 documents, 9,357 features (93.8% sparse).

# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(mydfm <- dfm(str, remove = stopwords("english")))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1

dfm_weight(mydfm, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1     5      1      3  0
#>   text2     5      1      6  0.5

# smooth the dfm
dfm_smooth(mydfm, 0.5)
#> Document-feature matrix of: 2 documents, 4 features (0% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    apple better banana much
#>   text1   1.5    1.5    1.5  0.5
#>   text2   1.5    1.5    2.5  1.5