Returns a document-feature matrix with the feature frequencies weighted according to one of several common methods. Shortcut functions that offer finer-grained control are:
tf: compute term frequency weights
tfidf: compute term frequency-inverse document frequency weights
docfreq: compute document frequencies of features
dfm_weight(x, type = c("frequency", "relfreq", "relmaxfreq", "logfreq", "tfidf"), weights = NULL)
dfm_smooth(x, smoothing = 1)
x | document-feature matrix created by dfm |
---|---|
type | a label of the weight type, one of "frequency", "relfreq", "relmaxfreq", "logfreq", or "tfidf" |
weights | if |
smoothing | constant added to the dfm cells for smoothing, default is 1 |
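The effect of the type and weights arguments can be illustrated with a minimal base R sketch on a plain count matrix. This is not quanteda's implementation: the reading of "relfreq" as within-document proportions is an assumption, and apply_weights is a hypothetical helper whose behaviour simply mirrors the named-weights example shown in the examples for this topic.

```r
# Toy counts mirroring the two-document example used on this page
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2"),
                                 features = c("apple", "better", "banana", "much")))

# Assumed meaning of "relfreq": each count divided by its document (row) total
relfreq <- counts / rowSums(counts)

# Hypothetical helper: multiply the columns named in `w` by their weight,
# leaving every other feature unchanged (weight 1)
apply_weights <- function(m, w) {
  scale <- rep(1, ncol(m))
  names(scale) <- colnames(m)
  scale[names(w)] <- w
  sweep(m, 2, scale, `*`)
}

weighted <- apply_weights(counts, c(apple = 5, banana = 3, much = 0.5))
print(relfreq)
print(weighted)
```

In this sketch, features without an entry in the weight vector (here "better") keep their original counts, which matches the printed result of the dfm_weight(mydfm, weights = ...) example.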
dfm_weight returns the dfm with weighted values.
dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts the matrix from sparse to dense format, so it may exceed available memory, depending on the size of your input matrix.
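Why smoothing removes sparsity can be seen in a small base R sketch. It uses an ordinary dense matrix as a stand-in for a dfm, and sparsity is a hypothetical helper defined here for illustration only; the point is just that adding a positive constant to every cell leaves no zero cells to store implicitly.

```r
# Toy counts: one zero cell out of eight
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE)

# Hypothetical helper: share of cells that are exactly zero
sparsity <- function(m) mean(m == 0)

# Smoothing adds the constant to every cell, including the zeros
smoothed <- counts + 0.5

sparsity(counts)    # 0.125, i.e. 12.5% sparse
sparsity(smoothed)  # 0, i.e. every cell now holds a value
```

For a large, highly sparse dfm the smoothed result needs storage for every document-feature pair, which is where the memory cost comes from.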
For finer-grained control, consider calling the convenience functions (tf, tfidf, docfreq) directly.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
normDtm <- dfm_weight(dtm, "relfreq")
topfeatures(normDtm)

# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(tfidf(dtm, scheme_tf = "log"))
#> Document-feature matrix of: 58 documents, 9,357 features (91.8% sparse).
#> (showing first 6 documents and first 6 features)
#>                  features
#> docs              fellow-citizens of the    senate and    house
#>   1789-Washington       0.4846744  0   0 0.8091855   0 1.119326
#>   1793-Washington       0.0000000  0   0 0.0000000   0 0.000000
#>   1797-Adams            0.7159228  0   0 0.8091855   0 0.000000
#>   1801-Jefferson        0.6305759  0   0 0.0000000   0 0.000000
#>   1805-Jefferson        0.0000000  0   0 0.0000000   0 0.000000
#>   1809-Madison          0.4846744  0   0 0.0000000   0 0.000000

# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(mydfm <- dfm(str, remove = stopwords("english")))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfmSparse"
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1

dfm_weight(mydfm, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#> 2 x 4 sparse Matrix of class "dfmSparse"
#>        features
#> docs    apple better banana much
#>   text1     5      1      3  0
#>   text2     5      1      6  0.5

# smooth the dfm
dfm_smooth(mydfm, 0.5)
#> Document-feature matrix of: 2 documents, 4 features (0% sparse).
#> 2 x 4 Matrix of class "dfmDense"
#>        features
#> docs    apple better banana much
#>   text1   1.5    1.5    1.5  0.5
#>   text2   1.5    1.5    2.5  1.5
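The log-scaled tf-idf values printed above are consistent with the weighting (1 + log10(tf)) * log10(N / df) from Section 6.4 of Introduction to Information Retrieval. The following base R sketch is one way to check that reading by hand. It is not quanteda's code: tfidf_log is a hypothetical helper, and the document frequency of 19 for "fellow-citizens" is an assumption inferred from the printed value 0.4846744 with N = 58, not a figure stated on this page.

```r
# Hypothetical helper: sublinear (log10) term frequency times log10 inverse
# document frequency; zero counts stay zero
tfidf_log <- function(count, docfreq, ndocs) {
  tf <- ifelse(count > 0, 1 + log10(count), 0)
  idf <- log10(ndocs / docfreq)
  tf * idf
}

# Assumed inputs: 58 documents, a feature occurring in 19 of them,
# with within-document counts of 0, 1, 2 and 3
scores <- tfidf_log(c(0, 1, 2, 3), docfreq = 19, ndocs = 58)
print(round(scores, 7))
```

Under these assumptions the counts 1, 2 and 3 give values close to the 0.4846744, 0.6305759 and 0.7159228 shown for "fellow-citizens" in the output above.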