Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.
textmodel_NB(x, y, smooth = 1, prior = c("uniform", "docfreq", "termfreq"), distribution = c("multinomial", "Bernoulli"), ...)
Argument | Description |
---|---|
x | the dfm on which the model will be fit. Does not need to contain only the training documents. |
y | vector of training labels associated with each document in x |
smooth | smoothing parameter for feature counts by class |
prior | prior distribution on texts; see Details |
distribution | count model for text features, can be "multinomial" or "Bernoulli" |
... | more arguments passed through |
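As a rough illustration of the three `prior` options, the base R sketch below computes class priors by hand for the toy corpus used in the examples. The variable names and the exact formulas are assumptions made for illustration, not the package's code, although the `docfreq` values agree with the priors printed in the example output.

```r
# Assumed toy inputs: labels and token totals for training documents d1..d4
labels <- factor(c("Y", "Y", "Y", "N"), levels = c("Y", "N"))
ntoken <- c(3, 3, 2, 3)  # feature counts per document

# "uniform": every class receives the same prior probability
prior_uniform <- setNames(rep(1 / nlevels(labels), nlevels(labels)),
                          levels(labels))

# "docfreq": proportion of training documents in each class
prior_docfreq <- table(labels) / length(labels)

# "termfreq": proportion of training tokens in each class
prior_termfreq <- tapply(ntoken, labels, sum) / sum(ntoken)

print(prior_uniform)
print(prior_docfreq)
print(prior_termfreq)
```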
A list of return values, consisting of:
original function call
probability of the word given the class (empirical likelihood)
class prior probability
posterior class probability given the word
baseline probability of the word
list consisting of x training class, and y test class
the distribution argument
argument passed as a prior
smoothing parameter
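The probability quantities listed above can be reproduced by hand for the multinomial case. The following base R sketch uses the toy corpus from the examples with `smooth = 1` and a `docfreq` prior; all variable names are hypothetical and the code is only an approximation of what the fitted object stores, not the package's implementation.

```r
# Feature counts for training documents d1..d4 (d5 is the unlabelled test document)
feats <- c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
counts <- matrix(c(2, 1, 0, 0, 0, 0,
                   2, 0, 1, 0, 0, 0,
                   1, 0, 0, 1, 0, 0,
                   1, 0, 0, 0, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4), feats))
labels <- c("Y", "Y", "Y", "N")
smooth <- 1

# Sum counts within each class, then add the smoothing constant
class_counts <- rbind(
  Y = colSums(counts[labels == "Y", , drop = FALSE]),
  N = colSums(counts[labels == "N", , drop = FALSE])
) + smooth

# Probability of the word given the class: each row sums to 1
word_given_class <- class_counts / rowSums(class_counts)

# Class prior probability ("docfreq": share of training documents per class)
class_prior <- c(Y = 0.75, N = 0.25)

# Joint probability of word and class; row i is scaled by class_prior[i]
joint <- word_given_class * class_prior

# Baseline probability of the word, and posterior class probability given the word
word_baseline <- colSums(joint)
class_given_word <- t(joint) / word_baseline

print(round(t(word_given_class), 7))
print(round(class_given_word, 7))
```

For `Chinese`, this gives a likelihood of 6/14 under class `Y` and a posterior of about 0.8526, in line with the fitted model printed in the examples.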
A predict method is also available for a fitted Naive Bayes object; see predict.textmodel_NB_fitted.
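To make the prediction step concrete, the sketch below scores the test document d5 from the examples by hand, using the smoothed multinomial likelihoods and the `docfreq` prior. It is a minimal base R illustration with hypothetical variable names, not the predict method itself.

```r
# Smoothed multinomial likelihoods P(word | class) for the example corpus
lik <- rbind(
  Y = c(Chinese = 6, Beijing = 2, Shanghai = 2, Macao = 2, Tokyo = 1, Japan = 1) / 14,
  N = c(Chinese = 2, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 2, Japan = 2) / 9
)
prior <- c(Y = 0.75, N = 0.25)

# Feature counts of the test document d5
d5 <- c(Chinese = 3, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)

# Unnormalised log posterior: log prior plus count-weighted log likelihoods
lp <- log(prior) + as.vector(log(lik) %*% d5)

# Normalise on the log scale (subtracting the maximum avoids underflow)
post <- exp(lp - max(lp)) / sum(exp(lp - max(lp)))

print(round(lp, 6))
print(round(post, 4))
print(names(which.max(post)))
```

The log scores come out at roughly -8.1077 for `Y` and -8.9067 for `N`, which agrees with the `lp(Y)` and `lp(N)` columns shown for d5 in the examples.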
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, D., & Martin, J. H. (2016). Speech and Language Processing. Draft of November 7, 2016. https://web.stanford.edu/~jurafsky/slp3/6.pdf
## Example from 13.1 of _An Introduction to Information Retrieval_
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

## replicate IIR p261 prediction for test set (document 5)
(nb.p261 <- textmodel_NB(trainingset, trainingclass, prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#>     textmodel_NB.dfm(x = trainingset, y = trainingclass, prior = "docfreq")
#>
#>
#> Training classes and priors:
#>    Y    N
#> 0.75 0.25
#>
#>               Likelihoods:        Class Posteriors:
#> 6 x 4 Matrix of class "dgeMatrix"
#>                   Y         N         Y         N
#> Chinese  0.42857143 0.2222222 0.8526316 0.1473684
#> Beijing  0.14285714 0.1111111 0.7941176 0.2058824
#> Shanghai 0.14285714 0.1111111 0.7941176 0.2058824
#> Macao    0.14285714 0.1111111 0.7941176 0.2058824
#> Tokyo    0.07142857 0.2222222 0.4909091 0.5090909
#> Japan    0.07142857 0.2222222 0.4909091 0.5090909
#>
predict(nb.p261, newdata = trainingset[5, ])
#> Predicted textmodel of type: Naive Bayes
#>
#>       lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d5 -8.10769 -8.906681 0.6898 0.3102         Y
#>

# contrast with other priors
predict(textmodel_NB(trainingset, trainingclass, prior = "uniform"))
#> Predicted textmodel of type: Naive Bayes
#>
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -4.333653 -5.898527 0.8271 0.1729         Y
#> d2 -4.333653 -5.898527 0.8271 0.1729         Y
#> d3 -3.486355 -4.394449 0.7126 0.2874         Y
#> d4 -6.818560 -5.205379 0.1661 0.8339         N
#> d5 -8.513155 -8.213534 0.4257 0.5743         N
#>
predict(textmodel_NB(trainingset, trainingclass, prior = "termfreq"))
#> Predicted textmodel of type: Naive Bayes
#>
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -3.958960 -6.504662 0.9273 0.0727         Y
#> d2 -3.958960 -6.504662 0.9273 0.0727         Y
#> d3 -3.111662 -5.000585 0.8686 0.1314         Y
#> d4 -6.443866 -5.811515 0.3470 0.6530         N
#> d5 -8.138462 -8.819670 0.6640 0.3360         Y
#>

## replicate IIR p264 Bernoulli Naive Bayes
(nb.p261.bern <- textmodel_NB(trainingset, trainingclass, distribution = "Bernoulli", prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#>     textmodel_NB.dfm(x = trainingset, y = trainingclass, prior = "docfreq",
#>     distribution = "Bernoulli")
#>
#>
#> Training classes and priors:
#>    Y    N
#> 0.75 0.25
#>
#>               Likelihoods:        Class Posteriors:
#> 6 x 4 Matrix of class "dgeMatrix"
#>            Y         N         Y         N
#> Chinese  0.8 0.6666667 0.7826087 0.2173913
#> Beijing  0.4 0.3333333 0.7826087 0.2173913
#> Shanghai 0.4 0.3333333 0.7826087 0.2173913
#> Macao    0.4 0.3333333 0.7826087 0.2173913
#> Tokyo    0.2 0.6666667 0.4736842 0.5263158
#> Japan    0.2 0.6666667 0.4736842 0.5263158
#>
predict(nb.p261.bern, newdata = trainingset[5, ])
#> Error in getMethod("t", "dgCMatrix"): no generic function found for 't'
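Because the recorded Bernoulli `predict` call ends in an error, the Bernoulli calculation for d5 can instead be sketched by hand in base R. The smoothing denominator (class document count plus `2 * smooth`) is an assumption chosen because it reproduces the likelihoods printed for the fitted model; the variable names are hypothetical, and the result is a cross-check against the worked example in IIR, not output from the package.

```r
# Binary incidence (feature present or absent) for training documents d1..d4
feats <- c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
present <- matrix(c(1, 1, 0, 0, 0, 0,
                    1, 0, 1, 0, 0, 0,
                    1, 0, 0, 1, 0, 0,
                    1, 0, 0, 0, 1, 1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("d", 1:4), feats))
labels <- c("Y", "Y", "Y", "N")
smooth <- 1

# P(feature present | class), assuming a denominator of n_docs + 2 * smooth
lik <- rbind(
  Y = (colSums(present[labels == "Y", , drop = FALSE]) + smooth) /
        (sum(labels == "Y") + 2 * smooth),
  N = (colSums(present[labels == "N", , drop = FALSE]) + smooth) /
        (sum(labels == "N") + 2 * smooth)
)
prior <- c(Y = 0.75, N = 0.25)

# Test document d5 as presence indicators
d5 <- c(Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)

# Bernoulli model: present features contribute P, absent features contribute 1 - P
lp <- log(prior) + as.vector(log(lik) %*% d5 + log(1 - lik) %*% (1 - d5))

print(lik)
print(signif(exp(lp), 3))
print(names(which.max(lp)))
```

Under these assumptions the unnormalised scores are about 0.005 for `Y` and 0.022 for `N`, so the Bernoulli model assigns d5 to `N`, unlike the multinomial model with the same prior.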