mallet_tidiers.Rd
Source: R/mallet_tidiers.R
Tidy LDA models fit by the mallet package, which wraps the Mallet topic modeling package in Java. The arguments and return values are similar to lda_tidiers.
Usage

# S3 method for jobjRef
tidy(x, matrix = c("beta", "gamma"), log = FALSE,
  normalized = TRUE, smoothed = TRUE, ...)

# S3 method for jobjRef
augment(x, data, ...)
Arguments

x
A jobjRef object, of type RTopicModel, such as created by MalletLDA.

matrix
Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix.

log
Whether beta/gamma should be on a log scale; defaults to FALSE.

normalized
If true (default), normalize so that each document or word sums to one across the topics. If false, values will be integers representing the actual number of word-topic or document-topic assignments.

smoothed
If true (default), add the smoothing parameter to each value to avoid any values being zero. This smoothing parameter is initialized as alpha.sum in MalletLDA.

...
Extra arguments, not used.

data
For augment, the data given to the LDA algorithm.
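For instance, the matrix, normalized, and smoothed arguments combine to return either probabilities or raw assignment counts. A minimal sketch, assuming topic_model is a trained model such as the one created in the examples below:

tidy(topic_model, matrix = "beta")    # per-term-per-topic probabilities
tidy(topic_model, matrix = "gamma")   # per-document-per-topic probabilities

# raw integer counts of word-topic assignments rather than probabilities
tidy(topic_model, matrix = "beta", normalized = FALSE, smoothed = FALSE)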
Details

augment must be provided a data argument containing one row per original document-term pair, such as is returned by tdm_tidiers, containing columns document and term. It returns that same data with an additional column .topic with the topic assignment for that document-term combination.
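A sketch of the expected shapes, with topic_model and td as in the examples below:

# td has one row per document-term pair, with columns document,
# term, and count (as returned by tidy() on a DocumentTermMatrix)
assignments <- augment(topic_model, data = td)

# assignments is the same data with an added .topic column giving
# the topic assigned to each term in each document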
Note that the LDA models from MalletLDA are technically a special case of S4 objects with class jobjRef. These are thus implemented as jobjRef tidiers, with a check for whether the toString output is as expected.
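For example, the class and toString output can be inspected directly. A sketch, assuming topic_model was created with MalletLDA() as in the examples below (the exact string depends on the Mallet version):

class(topic_model)      # "jobjRef", an rJava object reference
topic_model$toString()  # the tidiers check whether this output is as
                        # expected before treating x as a topic model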
Examples

# NOT RUN {
library(mallet)
library(dplyr)
library(tidytext)
library(ggplot2)

data("AssociatedPress", package = "topicmodels")
td <- tidy(AssociatedPress)

# mallet needs a file with stop words
tmp <- tempfile()
writeLines(stop_words$word, tmp)

# two vectors: one with document IDs, one with text
docs <- td %>%
  group_by(document = as.character(document)) %>%
  summarize(text = paste(rep(term, count), collapse = " "))

docs <- mallet.import(docs$document, docs$text, tmp)

# create and run a topic model
topic_model <- MalletLDA(num.topics = 4)
topic_model$loadDocuments(docs)
topic_model$train(20)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# examine the four topics
td_beta %>%
  group_by(topic) %>%
  top_n(8, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

# find the assignments of each word in each document
assignments <- augment(topic_model, td)
assignments
# }
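The per-document-per-topic (gamma) matrix can be explored in the same way; a short sketch continuing from the example above:

# NOT RUN {
# per-document topic probabilities
td_gamma <- tidy(topic_model, matrix = "gamma")
td_gamma

# documents most strongly associated with each topic
td_gamma %>%
  group_by(topic) %>%
  top_n(5, gamma) %>%
  ungroup() %>%
  arrange(topic, desc(gamma))
# }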