Split a column into tokens using the tokenizers package, flattening the table into one-token-per-row. This function supports non-standard evaluation through the tidyeval framework.
unnest_tokens(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL, ...)
tbl | A data frame |
---|---|
output | Output column to be created as string or symbol. |
input | Input column that gets split as string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols. |
token | Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length. |
format | Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer, which can tokenize only by "words". |
to_lower | Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), the lowercased URLs may no longer be correct. |
drop | Whether the original input column should be dropped. Ignored if the original input and new output columns have the same name. |
collapse | Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex". |
... | Extra arguments passed on to tokenizers, such as n for "ngrams" or pattern for "regex". |
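The token argument also accepts a custom function. The following is a minimal sketch of that usage, assuming the dplyr and tidytext packages are installed; the data frame `toy` and the helper `split_on_space` are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, not part of the original examples
toy <- tibble(id = 1:2, txt = c("Hello world", "Foo bar baz"))

# Hypothetical custom tokenizer: takes a character vector and returns a
# list of character vectors of the same length (one element per input row)
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

# Pass the function itself as the token argument
custom_tokens <- toy %>%
  unnest_tokens(word, txt, token = split_on_space)

custom_tokens
```

Each input row contributes one output row per token, and the other columns (here `id`) are repeated alongside the new `word` column.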
If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines, paragraphs, or regex, the entire input will be collapsed together before tokenizing unless collapse = FALSE.
If format is anything other than "text", this uses the hunspell_parse tokenizer instead of the tokenizers package. This does not yet have support for tokenizing by any unit other than words.
#> # A tibble: 13,030 x 1
#>    txt
#>    <chr>
#>  1 PRIDE AND PREJUDICE
#>  2 ""
#>  3 By Jane Austen
#>  4 ""
#>  5 ""
#>  6 ""
#>  7 Chapter 1
#>  8 ""
#>  9 ""
#> 10 It is a truth universally acknowledged, that a single man in possession
#> # … with 13,020 more rows

d %>% unnest_tokens(word, txt)
#> # A tibble: 122,204 x 1
#>    word
#>    <chr>
#>  1 pride
#>  2 and
#>  3 prejudice
#>  4 by
#>  5 jane
#>  6 austen
#>  7 chapter
#>  8 1
#>  9 it
#> 10 is
#> # … with 122,194 more rows

d %>% unnest_tokens(sentence, txt, token = "sentences")
#> # A tibble: 7,066 x 1
#>    sentence
#>    <chr>
#>  1 pride and prejudice by jane austen chapter 1 it is a truth universally…
#>  2 however little known the feelings or views of such a man may be on his first…
#>  3 "\"my dear mr."
#>  4 "bennet,\" said his lady to him one day, \"have you heard that netherfield p…
#>  5 mr.
#>  6 bennet replied that he had not.
#>  7 "\"but it is,\" returned she; \"for mrs."
#>  8 "long has just been here, and she told me all about it.\""
#>  9 mr.
#> 10 bennet made no answer.
#> # … with 7,056 more rows

d %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2)
#> # A tibble: 122,203 x 1
#>    ngram
#>    <chr>
#>  1 pride and
#>  2 and prejudice
#>  3 prejudice by
#>  4 by jane
#>  5 jane austen
#>  6 austen chapter
#>  7 chapter 1
#>  8 1 it
#>  9 it is
#> 10 is a
#> # … with 122,193 more rows

d %>% unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")
#> # A tibble: 62 x 1
#>    chapter
#>    <chr>
#>  1 "pride and prejudice\n\nby jane austen\n\n\n\n"
#>  2 "\n\n\nit is a truth universally acknowledged, that a single man in possessi…
#>  3 "\n\n\nmr. bennet was among the earliest of those who waited on mr. bingley.…
#>  4 "\n\n\nnot all that mrs. bennet, however, with the assistance of her five\nd…
#>  5 "\n\n\nwhen jane and elizabeth were alone, the former, who had been cautious…
#>  6 "\n\n\nwithin a short walk of longbourn lived a family with whom the bennets…
#>  7 "\n\n\nthe ladies of longbourn soon waited on those of netherfield. the visi…
#>  8 "\n\n\nmr. bennet's property consisted almost entirely in an estate of two\n…
#>  9 "\n\n\nat five o'clock the two ladies retired to dress, and at half-past six…
#> 10 "\n\n\nelizabeth passed the chief of the night in her sister's room, and in …
#> # … with 52 more rows

d %>% unnest_tokens(shingle, txt, token = "character_shingles", n = 4)
#> # A tibble: 536,526 x 1
#>    shingle
#>    <chr>
#>  1 prid
#>  2 ride
#>  3 idea
#>  4 dean
#>  5 eand
#>  6 andp
#>  7 ndpr
#>  8 dpre
#>  9 prej
#> 10 reju
#> # … with 536,516 more rows

#> # A tibble: 124,032 x 1
#>    word
#>    <chr>
#>  1 pride
#>  2 and
#>  3 prejudice
#>  4 ""
#>  5 by
#>  6 jane
#>  7 austen
#>  8 ""
#>  9 ""
#> 10 ""
#> # … with 124,022 more rows

# tokenize HTML
h <- tibble(row = 1:2, text = c("<h1>Text <b>is</b>", "<a href='example.com'>here</a>"))
h %>% unnest_tokens(word, text, format = "html")
#> # A tibble: 3 x 2
#>     row word
#>   <int> <chr>
#> 1     1 text
#> 2     1 is
#> 3     2 here
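The argument descriptions note that output and input support quasiquotation, so column names held in variables can be unquoted. The sketch below illustrates this under the assumption that dplyr and tidytext are installed; `toy`, `out_col`, and `in_col` are hypothetical names chosen for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, not part of the original examples
toy <- tibble(txt = c("It is a truth", "universally acknowledged"))

# Hypothetical variables holding the column names as strings
out_col <- "word"
in_col <- "txt"

# Unquote with !! so the stored strings are used as the column names
by_string <- toy %>% unnest_tokens(!!out_col, !!in_col)

# The same call written with bare column names, for comparison
by_name <- toy %>% unnest_tokens(word, txt)

by_string
```

This pattern can be useful when the column names are only known at run time, for example inside a wrapper function.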