Creates a document-term matrix (DTM) from a column of text data.

ldaDtm(
  data,
  id_col,
  data_col,
  ngram_window = c(1, 3),
  stopwords = stopwords::stopwords("en", source = "snowball"),
  removalword = "",
  occ_rate = 0,
  removal_mode = "",
  removal_rate_most = 0,
  removal_rate_least = 0,
  split = 1,
  seed = 42,
  save_dir = "./results"
)

Arguments

data

(tibble) the data frame containing the text data

id_col

(string) the name of the column containing the unique id

data_col

(string) the name of the column containing the text data

ngram_window

(numeric vector) the minimum and maximum n-gram length, e.g. c(1, 3)

stopwords

(character vector) the stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball")

removalword

(string) the word to remove

occ_rate

(integer) the rate of occurrence of a word to be removed

removal_mode

(string) the mode of removal, either "most" or "least"; the default is ""

removal_rate_most

(integer) the rate of most frequent words to be removed

removal_rate_least

(integer) the rate of least frequent words to be removed

split

(float) the proportion of the data to be used for training, e.g. 0.8; the default of 1 uses all of the data

seed

(integer) the random seed for reproducibility

save_dir

(string) the directory in which to save the results; the default is "./results". If NULL, no results are saved.
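
The ngram_window, split and seed arguments can be illustrated with a small base R sketch. This is not the package's implementation: ngrams_in_window is a hypothetical helper written only for this illustration, and the sampling step assumes that split selects a random subset of documents, which the documentation above does not confirm.

```r
# Hypothetical helper (not part of the package): returns every n-gram of a
# token vector for each n between window[1] and window[2].
ngrams_in_window <- function(tokens, window = c(1, 3)) {
  out <- character(0)
  for (n in seq(window[1], window[2])) {
    if (length(tokens) < n) next  # too few tokens for this n-gram length
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- strsplit("topic models need clean text", " ", fixed = TRUE)[[1]]
grams <- ngrams_in_window(tokens, c(1, 3))  # unigrams, bigrams and trigrams

# Assumed reading of split and seed: a reproducible random training subset.
set.seed(42)
n_docs <- 10
split <- 0.8
train_idx <- sample(n_docs, size = floor(split * n_docs))
```

With five tokens and a window of c(1, 3), the sketch yields five unigrams, four bigrams and three trigrams. Setting the same seed before sampling returns the same training indices on every run.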

Value

the document-term matrix
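
Examples

A minimal usage sketch based only on the signature above. The toy data, the column names id and text, and the chosen argument values are made up for illustration. The call is guarded so that the snippet also runs in a session where ldaDtm() is not loaded, and it has not been checked whether such a tiny corpus is handled gracefully.

```r
# Toy input: one row per document (values are invented for illustration).
docs <- data.frame(
  id = 1:3,
  text = c("the cat sat on the mat", "dogs chase cats", "the dog sat"),
  stringsAsFactors = FALSE
)

# data is documented as a tibble; convert if the tibble package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  docs <- tibble::as_tibble(docs)
}

# Only call ldaDtm() when it is available in the current session.
if (exists("ldaDtm", mode = "function")) {
  dtm <- ldaDtm(
    data = docs,
    id_col = "id",
    data_col = "text",
    ngram_window = c(1, 2),  # unigrams and bigrams
    split = 1,               # use all documents for training
    seed = 42,
    save_dir = NULL          # NULL: no results are saved
  )
}
```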