Split a column into tokens using the tokenizers package, reshaping the table into one-token-per-row format. This function supports non-standard evaluation through the tidyeval framework.
unnest_tokens(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL, ...)
| Argument | Description |
|---|---|
| tbl | A data frame. |
| output | Output column to be created as string or symbol. |
| input | Input column that gets split as string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols. |
| token | Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", and "regex". If a function, should take a character vector and return a list of character vectors of the same length. |
| format | Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer, and can tokenize only by "words". |
| to_lower | Whether to convert the tokens to lowercase. |
| drop | Whether original input column should get dropped. Ignored if the original input and new output column have the same name. |
| collapse | Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex". |
| ... | Extra arguments passed on to the tokenizer, such as |
If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines, paragraphs, or regex, the entire input will be collapsed together before tokenizing.
If format is anything other than "text", this uses the hunspell_parse tokenizer instead of the tokenizers package. This does not yet have support for tokenizing by any unit other than words.
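
The argument descriptions above can be tried on a small table. The following is a minimal sketch, not taken from the package's own examples: the two-row input `d` and the helper name `split_on_space` are made up for illustration, and it assumes the dplyr and tidytext packages are installed. It shows a custom function passed as `token`, the effect of `drop = FALSE`, and output/input given as strings instead of symbols.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row input; not part of the package's own examples
d <- tibble(id = 1:2, txt = c("One fish two fish", "Red fish blue fish"))

# Custom tokenizer (illustrative name): takes a character vector and returns
# a list of character vectors of the same length, splitting on single spaces
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

# A function passed as `token`; the remaining columns are repeated per token
custom <- d %>% unnest_tokens(word, txt, token = split_on_space)

# drop = FALSE keeps the original input column alongside the new token column
kept <- d %>% unnest_tokens(word, txt, drop = FALSE)

# Output and input supplied as strings; to_lower = FALSE preserves case
as_strings <- d %>% unnest_tokens("word", "txt", to_lower = FALSE)

print(custom)
print(kept)
print(as_strings)
```

Each input row here contains four space-separated words, so every result has one row per word, with `id` repeated for the tokens that came from the same input row.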
#> # A tibble: 13,030 x 1
#>    txt
#>    <chr>
#>  1 PRIDE AND PREJUDICE
#>  2
#>  3 By Jane Austen
#>  4
#>  5
#>  6
#>  7 Chapter 1
#>  8
#>  9
#> 10 It is a truth universally acknowledged, that a single man in possession
#> # ... with 13,020 more rows

d %>% unnest_tokens(word, txt)
#> # A tibble: 122,204 x 1
#>    word
#>    <chr>
#>  1 pride
#>  2 and
#>  3 prejudice
#>  4 by
#>  5 jane
#>  6 austen
#>  7 chapter
#>  8 1
#>  9 it
#> 10 is
#> # ... with 122,194 more rows

d %>% unnest_tokens(sentence, txt, token = "sentences")
#> # A tibble: 7,066 x 1
#>    sentence
#>    <chr>
#>  1 pride and prejudice by jane austen chapter 1 it is a truth universally ack
#>  2 however little known the feelings or views of such a man may be on his first ent
#>  3 "\"my dear mr."
#>  4 "bennet,\" said his lady to him one day, \"have you heard that netherfield park
#>  5 mr.
#>  6 bennet replied that he had not.
#>  7 "\"but it is,\" returned she; \"for mrs."
#>  8 "long has just been here, and she told me all about it.\""
#>  9 mr.
#> 10 bennet made no answer.
#> # ... with 7,056 more rows

d %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2)
#> # A tibble: 122,203 x 1
#>    ngram
#>    <chr>
#>  1 pride and
#>  2 and prejudice
#>  3 prejudice by
#>  4 by jane
#>  5 jane austen
#>  6 austen chapter
#>  7 chapter 1
#>  8 1 it
#>  9 it is
#> 10 is a
#> # ... with 122,193 more rows

d %>% unnest_tokens(ngram, txt, token = "skip_ngrams", n = 4, k = 2)
#> # A tibble: 366,594 x 1
#>    ngram
#>    <chr>
#>  1 pride by chapter is
#>  2 and jane 1 a
#>  3 prejudice austen it truth
#>  4 by chapter is universally
#>  5 jane 1 a acknowledged
#>  6 austen it truth that
#>  7 chapter is universally a
#>  8 1 a acknowledged single
#>  9 it truth that man
#> 10 is universally a in
#> # ... with 366,584 more rows

d %>% unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")
#> # A tibble: 62 x 1
#>    chapter
#>    <chr>
#>  1 "pride and prejudice\n\nby jane austen\n\n\n\n"
#>  2 "\n\n\nit is a truth universally acknowledged, that a single man in possession\n
#>  3 "\n\n\nmr. bennet was among the earliest of those who waited on mr. bingley. he\
#>  4 "\n\n\nnot all that mrs. bennet, however, with the assistance of her five\ndaugh
#>  5 "\n\n\nwhen jane and elizabeth were alone, the former, who had been cautious in\
#>  6 "\n\n\nwithin a short walk of longbourn lived a family with whom the bennets\nwe
#>  7 "\n\n\nthe ladies of longbourn soon waited on those of netherfield. the visit\nw
#>  8 "\n\n\nmr. bennet's property consisted almost entirely in an estate of two\nthou
#>  9 "\n\n\nat five o'clock the two ladies retired to dress, and at half-past six\nel
#> 10 "\n\n\nelizabeth passed the chief of the night in her sister's room, and in the\
#> # ... with 52 more rows

#> # A tibble: 121,567 x 1
#>    word
#>    <chr>
#>  1 pride
#>  2 and
#>  3 prejudice
#>  4 by
#>  5 jane
#>  6 austen
#>  7 chapter
#>  8 1
#>  9 it
#> 10 is
#> # ... with 121,557 more rows

# tokenize HTML
h <- data_frame(row = 1:2,
                text = c("<h1>Text <b>is</b>", "<a href='example.com'>here</a>"))
h %>% unnest_tokens(word, text, format = "html")
#> # A tibble: 3 x 2
#>     row word
#>   <int> <chr>
#> 1     1 text
#> 2     1 is
#> 3     2 here