Calculate the term frequency and inverse document frequency of each term in a tidy text dataset, along with their product, tf-idf, and bind them to the dataset. Each value is added as a new column. This function supports non-standard evaluation through the tidyeval framework.

bind_tf_idf(tbl, term, document, n)

Arguments

tbl

A tidy text dataset with one row per term per document

term

Column containing terms as string or symbol

document

Column containing document IDs as string or symbol

n

Column containing document-term counts as string or symbol

Details

The arguments term, document, and n are passed by expression and support quasiquotation; you can unquote strings and symbols.
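As a rough illustration of the quasiquotation described above, the sketch below passes column names held in variables by unquoting them with `!!`. The toy data and the variable names `counts`, `term_col`, and `doc_col` are made up for this example and are not part of the package.

```r
library(dplyr)
library(tidytext)

# Toy counts with one row per document-term pair (made-up data)
counts <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("cat", "dog", "cat", "fish"),
  n    = c(2L, 1L, 1L, 3L)
)

# Column names stored in variables (hypothetical names):
# one as a string, one as a symbol
term_col <- "word"
doc_col  <- rlang::sym("doc")

# Passing bare column names
direct <- bind_tf_idf(counts, word, doc, n)

# Unquoting a string and a symbol with !!
unquoted <- bind_tf_idf(counts, !!term_col, !!doc_col, n)

# Both calls are expected to give the same result
identical(direct, unquoted)
```

This pattern is mostly useful when wrapping `bind_tf_idf()` in another function whose caller supplies the column names.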

If the dataset is grouped, the groups are ignored during the calculation but retained in the output.

The dataset must have exactly one row per document-term combination for this to work.
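To make the arithmetic concrete, here is a minimal base R sketch of the three columns, not the package's actual implementation. It assumes the natural logarithm for idf, which is consistent with the idf values in the example output below (for six books, ln(6) ≈ 1.791759). The toy data is made up. The sketch also shows one way to check the one-row-per-pair requirement before computing anything.

```r
# Toy counts with one row per document-term pair (made-up data)
counts <- data.frame(
  doc  = c("a", "a", "b", "b"),
  word = c("cat", "dog", "cat", "fish"),
  n    = c(2, 1, 1, 3)
)

# Check the requirement: no document-term pair may appear twice
stopifnot(!any(duplicated(counts[c("doc", "word")])))

# tf: the term's count divided by the total count in its document
doc_totals <- ave(counts$n, counts$doc, FUN = sum)
counts$tf <- counts$n / doc_totals

# idf: natural log of (number of documents / documents containing the term).
# With one row per pair, the number of rows per term equals the number of
# documents containing it, which is why duplicates would distort the result.
n_docs <- length(unique(counts$doc))
docs_with_term <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_term)

# tf-idf: the product of the two
counts$tf_idf <- counts$tf * counts$idf
counts
```

In this toy data, "cat" appears in both documents, so its idf and tf-idf are zero, while "dog" and "fish" each appear in one document and receive a positive score.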

Examples

library(dplyr)
library(janeaustenr)
library(tidytext)  # provides unnest_tokens() and bind_tf_idf()

# count how often each word appears in each book
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

book_words
#> # A tibble: 40,379 x 3
#>    book              word      n
#>    <fctr>            <chr> <int>
#>  1 Mansfield Park    the    6206
#>  2 Mansfield Park    to     5475
#>  3 Mansfield Park    and    5438
#>  4 Emma              to     5239
#>  5 Emma              the    5201
#>  6 Emma              and    4896
#>  7 Mansfield Park    of     4778
#>  8 Pride & Prejudice the    4331
#>  9 Emma              of     4291
#> 10 Pride & Prejudice to     4162
#> # ... with 40,369 more rows

# find the words most distinctive to each document
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
#> # A tibble: 40,379 x 6
#>    book                word          n          tf      idf      tf_idf
#>    <fctr>              <chr>     <int>       <dbl>    <dbl>       <dbl>
#>  1 Sense & Sensibility elinor      623 0.005193528 1.791759 0.009305552
#>  2 Sense & Sensibility marianne    492 0.004101470 1.791759 0.007348847
#>  3 Mansfield Park      crawford    493 0.003072417 1.791759 0.005505032
#>  4 Pride & Prejudice   darcy       373 0.003052273 1.791759 0.005468939
#>  5 Persuasion          elliot      254 0.003036171 1.791759 0.005440088
#>  6 Emma                emma        786 0.004882109 1.098612 0.005363545
#>  7 Northanger Abbey    tilney      196 0.002519928 1.791759 0.004515105
#>  8 Emma                weston      389 0.002416209 1.791759 0.004329266
#>  9 Pride & Prejudice   bennet      294 0.002405813 1.791759 0.004310639
#> 10 Persuasion          wentworth   191 0.002283105 1.791759 0.004090775
#> # ... with 40,369 more rows