Tokenize the texts from a character vector or from a corpus.
tokens(x, what = c("word", "sentence", "character", "fastestword", "fasterword"),
  remove_numbers = FALSE, remove_punct = FALSE, remove_symbols = FALSE,
  remove_separators = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE,
  remove_url = FALSE, ngrams = 1L, skip = 0L, concatenator = "_",
  verbose = quanteda_options("verbose"), include_docvars = TRUE, ...)
x | a character, corpus, or tokens object to be tokenized
---|---
what | the unit for splitting the text; available alternatives are "word" (the default: smart but slower), "fasterword" (splits on all whitespace), "fastestword" (splits on the space character only), "character" (tokenization into individual characters), and "sentence" (sentence segmenter)
remove_numbers | remove tokens that consist only of numbers, but not words that start with digits, e.g. "2day"; see the sketch following this table
remove_punct | if TRUE, remove all characters in the Unicode "Punctuation" [P] class
remove_symbols | if TRUE, remove all characters in the Unicode "Symbol" [S] class
remove_separators | remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "separator" category) when remove_punct = FALSE
remove_twitter | remove Twitter characters @ and #; set to TRUE if you wish to eliminate these; note that this will be set to FALSE if remove_punct = FALSE
remove_hyphens | if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes two tokens "self" and "storage"; default is FALSE to preserve such words as is, with the hyphens
remove_url | if TRUE, find and eliminate URLs beginning with http(s)
ngrams | integer vector of the n for n-grams, defaulting to 1 (unigrams); for bigrams, for instance, use 2, and for bigrams plus unigrams, use 1:2
skip | integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words; only applies if ngrams is different from the default of 1
concatenator | character to use in concatenating n-grams, default is "_"
verbose | if TRUE, print timing messages to the console
include_docvars | if TRUE, pass docvars through to the tokens object; does not apply when the input is character data or a list of characters
... | additional arguments not used
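As an illustration of the number-removal rule described above, a minimal sketch (the input text is invented for illustration):

library("quanteda")
# "10" consists only of digits and is removed; "2day" starts with a digit
# but also contains letters, so it survives
tokens("10 ways to tokenize text 2day.", remove_numbers = TRUE, remove_punct = TRUE)
# expected tokens: "ways" "to" "tokenize" "text" "2day"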
Returns a quanteda tokens class object, by default a serialized list of integers corresponding to a vector of types.
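One way to peek at this structure is to compare the type vector with the underlying integer codes; a small sketch (the integer layout is an internal detail, not a stable API):

library("quanteda")
toks <- tokens("one two two three")
types(toks)          # the vector of unique types, e.g. "one" "two" "three"
unclass(toks)[[1]]   # integer codes indexing into the type vector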
The tokenizer is designed to be fast and flexible as well as to handle Unicode correctly. Most of the time, users will construct dfm objects from texts or a corpus, without calling tokens() as an intermediate step. Since tokens() is most likely to be used by more technical users, we have set its options to default to minimal intervention. This means that punctuation is tokenized as well, and that nothing is removed by default from the text being tokenized except inter-word spacing and equivalent characters.
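A quick sketch of the minimal-intervention default (invented input text):

library("quanteda")
# nothing is removed by default, so the comma and exclamation mark
# come through as tokens of their own
tokens("Minimal intervention, by default!")
# expected tokens: "Minimal" "intervention" "," "by" "default" "!"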
Note that a tokens constructor also works on tokens objects, which allows setting additional options that will modify the original object. It is not possible, however, to change a setting to "un-remove" something that was removed from the input tokens object. For instance, tokens(tokens("Ha!", remove_punct = TRUE), remove_punct = FALSE) will not restore the "!" token. No warning is currently issued about this, so the user should use tokens.tokens() with caution.
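Run in a session, the example above looks like this sketch:

library("quanteda")
toks <- tokens("Ha!", remove_punct = TRUE)  # "!" is removed at this step
tokens(toks, remove_punct = FALSE)          # returns only "Ha"; the "!" is not restored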
URLs are tricky to tokenize, because they contain a number of symbols and punctuation characters. If you wish to remove these, as most people do, and your text contains URLs, then you should set what = "fasterword" and remove_url = TRUE. If you wish to keep the URLs, but do not want them mangled, then your options are more limited, since removing punctuation and symbols will also remove them from URLs. We are working on improving this behaviour. See the examples below.
See also: tokens_ngrams, tokens_skipgrams, as.list.tokens
txt <- c(doc1 = "This is a sample: of tokens.",
         doc2 = "Another sentence, to demonstrate how tokens works.")
tokens(txt)
#> tokens from 2 documents.
#> doc1 :
#> [1] "This"   "is"     "a"      "sample" ":"      "of"     "tokens" "."
#>
#> doc2 :
#> [1] "Another"     "sentence"    ","           "to"          "demonstrate"
#> [6] "how"         "tokens"      "works"       "."
#>

# lowercasing and removing punctuation
tokens(char_tolower(txt), remove_punct = TRUE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "this"   "is"     "a"      "sample" "of"     "tokens"
#>
#> doc2 :
#> [1] "another"     "sentence"    "to"          "demonstrate" "how"
#> [6] "tokens"      "works"
#>

# keeping versus removing hyphens
tokens("quanteda data objects are auto-loading.", remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "quanteda"     "data"         "objects"      "are"          "auto-loading"
#>
tokens("quanteda data objects are auto-loading.", remove_punct = TRUE,
       remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "quanteda" "data"     "objects"  "are"      "auto"     "loading"
#>

# keeping versus removing symbols
tokens("<tags> and other + symbols.", remove_symbols = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "<"       "tags"    ">"       "and"     "other"   "+"       "symbols"
#> [8] "."
#>
tokens("<tags> and other + symbols.", remove_symbols = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "tags"    "and"     "other"   "symbols" "."
#>
tokens("<tags> and other + symbols.", remove_symbols = FALSE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "<tags>"   "and"      "other"    "+"        "symbols."
#>
tokens("<tags> and other + symbols.", remove_symbols = TRUE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "<tags>"   "and"      "other"    "symbols."
#>

## examples with URLs - hardly perfect!
txt <- "Repo https://githib.com/kbenoit/quanteda, and www.stackoverflow.com."
tokens(txt, remove_url = TRUE, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"                  "and"                   "www.stackoverflow.com"
#>
tokens(txt, remove_url = FALSE, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"                  "https"                 "githib.com"
#> [4] "kbenoit"               "quanteda"              "and"
#> [7] "www.stackoverflow.com"
#>
tokens(txt, remove_url = FALSE, remove_punct = TRUE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"
#> [2] "https://githib.com/kbenoit/quanteda,"
#> [3] "and"
#> [4] "www.stackoverflow.com."
#>
tokens(txt, remove_url = FALSE, remove_punct = FALSE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"
#> [2] "https://githib.com/kbenoit/quanteda,"
#> [3] "and"
#> [4] "www.stackoverflow.com."
#>

## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokens(txt, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "#textanalysis" "is"            "MY"            "3"
#> [5] "4U"            "@myhandle"     "gr8"           "#stuff"
#>
tokens(txt, remove_punct = TRUE, remove_twitter = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "textanalysis" "is"           "MY"           "3"            "4U"
#> [6] "myhandle"     "gr8"          "stuff"
#>
# tokens("great website http://textasdata.com", remove_url = FALSE)
# tokens("great website http://textasdata.com", remove_url = TRUE)

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt, verbose = TRUE)
#> ...
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "10"        "in"        "999"
#>  [7] "different" "ways"      ","         "up"        "and"       "down"
#> [13] ";"         "left"      "and"       "right"     "!"
#>
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "123"            "."
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "in"        "different" "ways"      "up"
#>  [7] "and"       "down"      "left"      "and"       "right"
#>
#> text2 :
#> [1] "@kenbenoit"     "working"        "on"             "#quanteda"
#> [5] "2day"           "4ever"          "http"           "textasdata.com"
#> [9] "page"
#>
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "10"        "in"        "999"       "different"
#>  [7] "ways"      "up"        "and"       "down"      "left"      "and"
#> [13] "right"
#>
#> text2 :
#>  [1] "@kenbenoit"     "working"        "on"             "#quanteda"
#>  [5] "2day"           "4ever"          "http"           "textasdata.com"
#>  [9] "page"           "123"
#>
tokens(txt, remove_numbers = TRUE, remove_punct = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "in"        "different" "ways"
#>  [7] ","         "up"        "and"       "down"      ";"         "left"
#> [13] "and"       "right"     "!"
#>
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "."
#>
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "10"        "in"        "999"
#>  [7] "different" "ways"      ","         "up"        "and"       "down"
#> [13] ";"         "left"      "and"       "right"     "!"
#>
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "123"            "."
#>
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      " "         "is"        " "         "$"         "10"
#>  [7] " "         "in"        " "         "999"       " "         "different"
#> [13] " "         "ways"      ","         "\n"        " "         "up"
#> [19] " "         "and"       " "         "down"      ";"         " "
#> [25] "left"      " "         "and"       " "         "right"     "!"
#>
#> text2 :
#>  [1] "@kenbenoit"     " "              "working"        ":"
#>  [5] " "              "on"             " "              "#quanteda"
#>  [9] " "              "2day"           "\t"             "4ever"
#> [13] ","              " "              "http"           ":"
#> [17] "/"              "/"              "textasdata.com" "?"
#> [21] "page"           "="              "123"            "."
#>
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "in"        "different" "ways"      "up"
#>  [7] "and"       "down"      "left"      "and"       "right"
#>
#> text2 :
#> [1] "@kenbenoit" "working"    "on"         "#quanteda"  "2day"
#> [6] "4ever"
#>

# character level
tokens("Great website: http://textasdata.com?page=123.", what = "character")
#> tokens from 1 document.
#> text1 :
#>  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/"
#> [20] "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g"
#> [39] "e" "=" "1" "2" "3" "."
#>
tokens("Great website: http://textasdata.com?page=123.", what = "character",
       remove_separators = FALSE)
#> tokens from 1 document.
#> text1 :
#>  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p"
#> [20] ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p"
#> [39] "a" "g" "e" "=" "1" "2" "3" "."
#>

# sentence level
tokens(c("Kurt Vongeut said; only assholes use semi-colons.",
         "Today is Thursday in Canberra: It is yesterday in London.",
         "Today is Thursday in Canberra: \nIt is yesterday in London.",
         "To be? Or\nnot to be?"),
       what = "sentence")
#> tokens from 4 documents.
#> text1 :
#> [1] "Kurt Vongeut said; only assholes use semi-colons."
#>
#> text2 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#>
#> text3 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#>
#> text4 :
#> [1] "To be?"        "Or not to be?"
#>
tokens(data_corpus_inaugural[c(2, 40)], what = "sentence")
#> tokens from 2 documents.
#> 1793-Washington :
#> [1] "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate."
#> [2] "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America."
#> [3] "Previous to the execution of any official act of the President the Constitution requires an oath of office."
#> [4] "This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony."
#>
#> 1945-Roosevelt :
#>  [1] "Chief Justice, Mr. Vice President, my friends, you will understand and, I believe, agree with my wish that the form of this inauguration be simple and its words brief."
#>  [2] "We Americans of today, together with our allies, are passing through a period of supreme test."
#>  [3] "It is a test of our courage -- of our resolve -- of our wisdom -- our essential democracy."
#>  [4] "If we meet that test -- successfully and honorably -- we shall perform a service of historic importance which men and women and children will honor throughout all time."
#>  [5] "As I stand here today, having taken the solemn oath of office in the presence of my fellow countrymen -- in the presence of our God -- I know that it is America's purpose that we shall not fail."
#>  [6] "In the days and in the years that are to come we shall work for a just and honorable peace, a durable peace, as today we work and fight for total victory in war."
#>  [7] "We can and we will achieve such a peace."
#>  [8] "We shall strive for perfection."
#>  [9] "We shall not achieve it immediately -- but we still shall strive."
#> [10] "We may make mistakes -- but they must never be mistakes which result from faintness of heart or abandonment of moral principle."
#> [11] "I remember that my old schoolmaster, Dr. Peabody, said, in days that seemed to us then to be secure and untroubled: \"Things in life will not always run smoothly."
#> [12] "Sometimes we will be rising toward the heights -- then all will seem to reverse itself and start downward."
#> [13] "The great fact to remember is that the trend of civilization itself is forever upward; that a line drawn through the middle of the peaks and the valleys of the centuries always has an upward trend.\""
#> [14] "Our Constitution of 1787 was not a perfect instrument; it is not perfect yet."
#> [15] "But it provided a firm base upon which all manner of men, of all races and colors and creeds, could build our solid structure of democracy."
#> [16] "And so today, in this year of war, 1945, we have learned lessons -- at a fearful cost -- and we shall profit by them."
#> [17] "We have learned that we cannot live alone, at peace; that our own well-being is dependent on the well-being of other nations far away."
#> [18] "We have learned that we must live as men, not as ostriches, nor as dogs in the manger."
#> [19] "We have learned to be citizens of the world, members of the human community."
#> [20] "We have learned the simple truth, as Emerson said, that \"The only way to have a friend is to be one.\""
#> [21] "We can gain no lasting peace if we approach it with suspicion and mistrust or with fear."
#> [22] "We can gain it only if we proceed with the understanding, the confidence, and the courage which flow from conviction." #> [23] "The Almighty God has blessed our land in many ways." #> [24] "He has given our people stout hearts and strong arms with which to strike mighty blows for freedom and truth." #> [25] "He has given to our country a faith which has become the hope of all peoples in an anguished world." #> [26] "So we pray to Him now for the vision to see our way clearly -- to see the way that leads to a better life for ourselves and for all our fellow men -- to the achievement of His will to peace on earth." #># removing features (stopwords) from tokenized texts txt <- char_tolower(c(mytext1 = "This is a short test sentence.", mytext2 = "Short.", mytext3 = "Short, shorter, and shortest.")) tokens(txt, remove_punct = TRUE)#> tokens from 3 documents. #> mytext1 : #> [1] "this" "is" "a" "short" "test" "sentence" #> #> mytext2 : #> [1] "short" #> #> mytext3 : #> [1] "short" "shorter" "and" "shortest" #>#> tokens from 3 documents. #> mytext1 : #> [1] "short" "test" "sentence" #> #> mytext2 : #> [1] "short" #> #> mytext3 : #> [1] "short" "shorter" "shortest" #># ngram tokenization tokens(txt, remove_punct = TRUE, ngrams = 2)#> tokens from 3 documents. #> mytext1 : #> [1] "this_is" "is_a" "a_short" "short_test" #> [5] "test_sentence" #> #> mytext2 : #> character(0) #> #> mytext3 : #> [1] "short_shorter" "shorter_and" "and_shortest" #>tokens(txt, remove_punct = TRUE, ngrams = 2, skip = 1, concatenator = " ")#> tokens from 3 documents. #> mytext1 : #> [1] "this a" "is short" "a test" "short sentence" #> #> mytext2 : #> character(0) #> #> mytext3 : #> [1] "short and" "shorter shortest" #>tokens(txt, remove_punct = TRUE, ngrams = 1:2)#> tokens from 3 documents. #> mytext1 : #> [1] "this" "is" "a" "short" #> [5] "test" "sentence" "this_is" "is_a" #> [9] "a_short" "short_test" "test_sentence" #> #> mytext2 : #> [1] "short" #> #> mytext3 : #> [1] "short" "shorter" "and" "shortest" #> [5] "short_shorter" "shorter_and" "and_shortest" #># removing features from ngram tokens tokens_remove(tokens(txt, remove_punct = TRUE, ngrams = 1:2), stopwords("english"))#> tokens from 3 documents. #> mytext1 : #> [1] "short" "test" "sentence" "this_is" #> [5] "is_a" "a_short" "short_test" "test_sentence" #> #> mytext2 : #> [1] "short" #> #> mytext3 : #> [1] "short" "shorter" "shortest" "short_shorter" #> [5] "shorter_and" "and_shortest" #>