Weight the feature frequencies in a dfm

dfm_weight(
  x,
  scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave",
    "logsmooth"),
  weights = NULL,
  base = 10,
  k = 0.5,
  smoothing = 0.5,
  force = FALSE
)

dfm_smooth(x, smoothing = 1)

Arguments

x

document-feature matrix created by dfm

scheme

a label of the weight type:

count

\(tf_{ij}\), an integer feature count (default when a dfm is created)

prop

the proportion of each feature count relative to the total feature counts within the document (aka relative frequency), calculated as \(tf_{ij} / \sum_j tf_{ij}\)

propmax

the proportion of each feature count relative to the highest feature count in the document, \(tf_{ij} / \textrm{max}_j tf_{ij}\)

logcount

take 1 plus the logarithm of each count, for the given base, or 0 if the count was zero: \(1 + \textrm{log}_{base}(tf_{ij})\) if \(tf_{ij} > 0\), or 0 otherwise.

boolean

recode all non-zero counts as 1

augmented

equivalent to \(k + (1 - k) *\) dfm_weight(x, "propmax")

logave

(1 + the log of the counts) / (1 + log of the average count within document), or $$\frac{1 + \textrm{log}_{base} tf_{ij}}{1 + \textrm{log}_{base}(\sum_j tf_{ij} / N_i)}$$

logsmooth

the logarithm of the counts plus the smoothing constant, or \(\textrm{log}_{base}(tf_{ij} + s)\)
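Each scheme above is an elementwise transformation of the raw count matrix. As a language-neutral sketch of the formulas (plain Python for illustration only, not the quanteda implementation), the schemes for a single document's counts could be written as:

```python
import math

def weight(counts, scheme, base=10, k=0.5, s=0.5):
    """Toy re-implementation of the dfm_weight() schemes for one
    document's raw feature counts tf_ij. Illustrative sketch only."""
    total = sum(counts)
    if scheme == "count":        # raw counts, the default
        return list(counts)
    if scheme == "prop":         # tf_ij / sum_j tf_ij
        return [tf / total for tf in counts]
    if scheme == "propmax":      # tf_ij / max_j tf_ij
        mx = max(counts)
        return [tf / mx for tf in counts]
    if scheme == "logcount":     # 1 + log_base(tf_ij), or 0 if tf_ij == 0
        return [1 + math.log(tf, base) if tf > 0 else 0 for tf in counts]
    if scheme == "boolean":      # recode non-zero counts as 1
        return [int(tf > 0) for tf in counts]
    if scheme == "augmented":    # k + (1 - k) * propmax
        mx = max(counts)
        return [k + (1 - k) * tf / mx for tf in counts]
    if scheme == "logave":       # (1 + log tf) / (1 + log(average count))
        ave = total / len(counts)
        return [(1 + math.log(tf, base)) / (1 + math.log(ave, base))
                if tf > 0 else 0 for tf in counts]
    if scheme == "logsmooth":    # log_base(tf_ij + s)
        return [math.log(tf + s, base) for tf in counts]
    raise ValueError(scheme)

doc = [3, 0, 1, 2]
print(weight(doc, "prop"))      # [0.5, 0.0, 0.1666..., 0.3333...]
print(weight(doc, "boolean"))   # [1, 0, 1, 1]
```

Note how "augmented" is just "propmax" rescaled into the interval \([k, 1]\), as the equivalence above states.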

weights

if scheme is unused, then weights can be a named numeric vector of weights to be applied to the dfm, where the names of the vector correspond to feature labels of the dfm, and the weights will be applied as multipliers to the existing feature counts for the corresponding named features. Any features not named will be assigned a weight of 1.0 (meaning they will be unchanged).
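The named-weights behaviour (multiply matching features, leave unnamed features unchanged) can be sketched as follows; this is plain Python for illustration, not the quanteda internals:

```python
def apply_weights(dfm_row, weights):
    """Multiply each named feature's count by its weight; features not
    named in `weights` default to a multiplier of 1.0 (unchanged)."""
    return {feat: count * weights.get(feat, 1.0)
            for feat, count in dfm_row.items()}

row = {"apple": 1, "better": 1, "banana": 2, "much": 1}
print(apply_weights(row, {"apple": 5.0, "banana": 3.0, "much": 0.5}))
# {'apple': 5.0, 'better': 1.0, 'banana': 6.0, 'much': 0.5}
```

Here "better" is not named, so it keeps its original count, mirroring the numeric-weights example at the end of this page.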

base

base for the logarithm when scheme is "logcount" or "logave"

k

the k for the augmentation when scheme = "augmented"

smoothing

constant added to the dfm cells for smoothing, default is 1 for dfm_smooth() and 0.5 for dfm_weight()

force

logical; if TRUE, apply weighting scheme even if the dfm has been weighted before. This can result in invalid weights, such as weighting by "prop" after applying "logcount", or after having grouped a dfm using dfm_group().

Value

dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".

dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts the matrix from sparse to dense format, so it may exceed available memory depending on the size of your input matrix.
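Why smoothing densifies the matrix: adding a nonzero constant to every cell turns each stored zero into a nonzero value, so no sparse cells remain. A minimal sketch using a plain dense Python representation (not quanteda's sparse internals):

```python
def smooth(matrix, smoothing=1.0):
    """Add a constant to every cell; zeros become `smoothing`,
    so the result has no zero (sparse) cells left."""
    return [[cell + smoothing for cell in row] for row in matrix]

counts = [[1, 0, 0],
          [0, 2, 1]]           # 50% sparse
smoothed = smooth(counts, 0.5)
sparsity = sum(c == 0 for row in smoothed for c in row) / 6
print(smoothed)   # [[1.5, 0.5, 0.5], [0.5, 2.5, 1.5]]
print(sparsity)   # 0.0
```

This is why the dfm_smooth() example below reports 0.0% sparsity where the unsmoothed inaugural-address dfm is highly sparse.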

References

Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Examples

dfmat1 <- dfm(data_corpus_inaugural)
dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)
#>       the         ,        of       and         .        to        in       our 
#> 3.8016552 2.7729117 2.6901032 2.0842094 1.9666506 1.7694603 1.0730903 0.8761677 
#>         a        we 
#> 0.8624189 0.7764047 
dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)
#>   the    of     ,   and     .    to    in     a   our  that 
#> 10082  7103  7026  5310  4945  4526  2785  2246  2181  1789 
dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)
#>      the        ,       of      and        .       to       in        a 
#> 182.1856 174.3182 173.3837 167.1782 164.9945 163.2151 150.4070 143.6032 
#>      our     that 
#> 140.7424 138.9939 
dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)
#>       the         ,        of       and         .        to        in         a 
#> 122.09372 116.74748 116.17221 111.93230 110.44258 109.35003 100.65005  95.83956 
#>       our      that 
#>  93.90034  92.97071 
# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))
#> Document-feature matrix of: 6 documents, 9,360 features (93.8% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens of the    senate and    house representatives
#>   1789-Washington       0.4846744  0   0 0.8091855   0 1.119326        0.8031258
#>   1793-Washington       0          0   0 0           0 0               0
#>   1797-Adams            0.7159228  0   0 0.8091855   0 0               0.8031258
#>   1801-Jefferson        0.6305759  0   0 0           0 0               0
#>   1805-Jefferson        0          0   0 0           0 0               0
#>   1809-Madison          0.4846744  0   0 0           0 0               0
#>                  features
#> docs                      :     among vicissitudes
#>   1789-Washington 0.2071255 0.1299595     1.064458
#>   1793-Washington 0.2071255 0             0
#>   1797-Adams      0         0.2082030     0
#>   1801-Jefferson  0.2071255 0.1299595     0
#>   1805-Jefferson  0         0.2397881     0
#>   1809-Madison    0         0             0
#> [ reached max_nfeat ... 9,350 more features ]
# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(dfmat6 <- dfm(str, remove = stopwords("english")))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1
dfm_weight(dfmat6, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
#>        features
#> docs    apple better banana much
#>   text1     5      1      3    0
#>   text2     5      1      6  0.5
# smooth the dfm
dfmat <- dfm(data_corpus_inaugural)
dfm_smooth(dfmat, 0.5)
#> Document-feature matrix of: 58 documents, 9,360 features (0.0% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens    of   the senate   and house
#>   1789-Washington             1.5  71.5 116.5    1.5  48.5   2.5
#>   1793-Washington             0.5  11.5  13.5    0.5   2.5   0.5
#>   1797-Adams                  3.5 140.5 163.5    1.5 130.5   0.5
#>   1801-Jefferson              2.5 104.5 130.5    0.5  81.5   0.5
#>   1805-Jefferson              0.5 101.5 143.5    0.5  93.5   0.5
#>   1809-Madison                1.5  69.5 104.5    0.5  43.5   0.5
#>                  features
#> docs              representatives   : among vicissitudes
#>   1789-Washington             2.5 1.5   1.5          1.5
#>   1793-Washington             0.5 1.5   0.5          0.5
#>   1797-Adams                  2.5 0.5   4.5          0.5
#>   1801-Jefferson              0.5 1.5   1.5          0.5
#>   1805-Jefferson              0.5 0.5   7.5          0.5
#>   1809-Madison                0.5 0.5   0.5          0.5
#> [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 9,350 more features ]