Tokenize the texts from a character vector or from a corpus. is.tokenizedTexts returns TRUE if the object is of class tokenizedTexts, FALSE otherwise.
tokenize(x, ...)

# S3 method for character
tokenize(x, what = c("word", "sentence", "character", "fastestword", "fasterword"),
  remove_numbers = FALSE, remove_punct = FALSE, remove_symbols = FALSE,
  remove_separators = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE,
  remove_url = FALSE, ngrams = 1L, skip = 0L, concatenator = "_",
  simplify = FALSE, verbose = FALSE, ...)

# S3 method for corpus
tokenize(x, ...)

is.tokenizedTexts(x)

as.tokenizedTexts(x, ...)

# S3 method for list
as.tokenizedTexts(x, ...)

# S3 method for tokens
as.tokenizedTexts(x, ...)
x | text(s) or corpus to be tokenized |
---|---|
... | additional arguments not used |
what | the unit for splitting the text; available alternatives are "word" (the default: the smartest but slowest word tokenizer), "fasterword" (a faster word tokenizer that splits on whitespace), "fastestword" (the fastest, splitting on the space character only), "character" (tokenization into individual characters), and "sentence" (sentence segmentation) |
remove_numbers | remove tokens that consist only of numbers, but not words that start with digits, e.g. "2day" or "4ever" |
remove_punct | if TRUE, remove all punctuation characters |
remove_symbols | if TRUE, remove all symbol characters |
remove_separators | remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "separator" category) when what = "word"; for what = "character" you may wish to set this to FALSE to preserve the spaces (see the examples) |
remove_twitter | remove the Twitter characters @ and #; set to TRUE if you wish to eliminate these |
remove_hyphens | if TRUE, split words that are connected by hyphens, so that e.g. "auto-loading" becomes "auto" and "loading"; the default FALSE keeps such words intact |
remove_url | if TRUE, remove URLs beginning with http(s); see the note on URLs below |
ngrams | integer vector of the n for n-grams, defaulting to 1 (unigrams); for unigrams and bigrams together, for instance, use 1:2 |
skip | integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words; only applies when ngrams differs from the default of 1 |
concatenator | character to use in concatenating n-grams, default is "_" |
simplify | if TRUE, return a single character vector of tokens rather than a list of tokenized texts |
verbose | if TRUE, print timing messages to the console |
A tokenizedTexts (S3) object: a list of length ndoc(x), in which each element is a character vector of the tokens found in the corresponding text. If simplify = TRUE, a single character vector is returned instead.
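For example (a minimal sketch, using the data_corpus_inaugural object that also appears in the examples below):

toks <- tokenize(data_corpus_inaugural[1:2])
is.tokenizedTexts(toks)                              # TRUE: a list with one character vector per text
tokenize(data_corpus_inaugural[1:2], simplify = TRUE)  # a single character vector instead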
The tokenizer is designed to be fast and flexible as well as to
handle Unicode correctly. Most of the time, users will construct dfm
objects from texts or a corpus, without calling tokenize()
as an
intermediate step. Since tokenize()
is most likely to be used by
more technical users, we have set its options to default to minimal
intervention. This means that punctuation is tokenized as well, and that
nothing is removed by default from the text being tokenized except
inter-word spacing and equivalent characters.
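For instance, with all options left at their defaults, punctuation marks are kept as separate tokens; remove_punct = TRUE drops them (a small sketch of the default behaviour; approximate token values shown in the comments):

tokenize("Tokens, by default, keep punctuation!")
# roughly: "Tokens" "," "by" "default" "," "keep" "punctuation" "!"
tokenize("Tokens, by default, keep punctuation!", remove_punct = TRUE)
# roughly: "Tokens" "by" "default" "keep" "punctuation"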
as.tokenizedTexts coerces a list of character tokens to a tokenizedTexts class object, so that the methods defined for that class can be used on it. A method is also defined for hashed tokens objects (tokenizedTextsHashed), coercing them to tokenizedTexts in the same way.
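As a sketch, a plain list of tokens produced elsewhere can be coerced so that tokenizedTexts methods used in the examples below (such as removeFeatures()) apply to it; the list and document names here are made up for illustration:

myToks <- list(d1 = c("spam", "and", "eggs"), d2 = c("eggs", "only"))
myToks <- as.tokenizedTexts(myToks)
is.tokenizedTexts(myToks)                      # TRUE
removeFeatures(myToks, stopwords("english"))   # e.g. drops the stopwords "and" and "only"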
URLs are tricky to tokenize, because they contain
a number of symbols and punctuation characters. If you wish to remove
these, as most people do, and your text contains URLs, then you should set
what = "fasterword"
and remove_url = TRUE
. If you wish to
keep the URLs, but do not want them mangled, then your options are more
limited, since removing punctuation and symbols will also remove them from
URLs. We are working on improving this behaviour.
See the examples below.
# same for character vectors and for lists
tokensFromChar <- tokenize(data_corpus_inaugural[1:3])
tokensFromCorp <- tokenize(corpus_subset(data_corpus_inaugural, Year < 1798))
identical(tokensFromChar, tokensFromCorp)
#> [1] TRUE
str(tokensFromChar)
#> List of 3
#>  $ 1789-Washington: chr [1:1540] "Fellow" "-" "Citizens" "of" ...
#>  $ 1793-Washington: chr [1:147] "Fellow" "citizens" "," "I" ...
#>  $ 1797-Adams     : chr [1:2584] "When" "it" "was" "first" ...
#>  - attr(*, "class")= chr [1:2] "tokenizedTexts" "list"
#>  - attr(*, "what")= chr "word"
#>  - attr(*, "ngrams")= int 1
#>  - attr(*, "concatenator")= chr ""

# returned as a list
head(tokenize(data_corpus_inaugural[57])[[1]], 10)
#> [1] "Vice" "President" "Biden" "," "Mr" "."
#> [7] "Chief" "Justice" "," "Members"

# returned as a character vector using simplify=TRUE
head(tokenize(data_corpus_inaugural[57], simplify = TRUE), 10)
#> [1] "Vice" "President" "Biden" "," "Mr" "."
#> [7] "Chief" "Justice" "," "Members"

# removing punctuation marks and lowercasing texts
head(tokenize(char_tolower(data_corpus_inaugural[57]), simplify = TRUE, remove_punct = TRUE), 30)
#> [1] "vice" "president" "biden" "mr"
#> [5] "chief" "justice" "members" "of"
#> [9] "the" "united" "states" "congress"
#> [13] "distinguished" "guests" "and" "fellow"
#> [17] "citizens" "each" "time" "we"
#> [21] "gather" "to" "inaugurate" "a"
#> [25] "president" "we" "bear" "witness"
#> [29] "to" "the"

# keeping case and punctuation
head(tokenize(data_corpus_inaugural[57], simplify = TRUE), 30)
#> [1] "Vice" "President" "Biden" ","
#> [5] "Mr" "." "Chief" "Justice"
#> [9] "," "Members" "of" "the"
#> [13] "United" "States" "Congress" ","
#> [17] "distinguished" "guests" "," "and"
#> [21] "fellow" "citizens" ":" "Each"
#> [25] "time" "we" "gather" "to"
#> [29] "inaugurate" "a"

# keeping versus removing hyphens
tokenize("quanteda data objects are auto-loading.", remove_punct = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "quanteda" "data" "objects" "are" "auto-loading"
#>
tokenize("quanteda data objects are auto-loading.", remove_punct = TRUE, remove_hyphens = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "quanteda" "data" "objects" "are" "auto" "loading"
#>

# keeping versus removing symbols
tokenize("<tags> and other + symbols.", remove_symbols = FALSE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "<" "tags" ">" "and" "other" "+" "symbols"
#> [8] "."
#>
tokenize("<tags> and other + symbols.", remove_symbols = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "tags" "and" "other" "symbols"
#>
tokenize("<tags> and other + symbols.", remove_symbols = FALSE, what = "fasterword")
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "<tags>" "and" "other" "+" "symbols."
#>
tokenize("<tags> and other + symbols.", remove_symbols = TRUE, what = "fasterword")
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "tags" "and" "other" "symbols."
#>

## examples with URLs - hardly perfect!
txt <- "Repo https://githib.com/kbenoit/quanteda, and www.stackoverflow.com."
tokenize(txt, remove_url = TRUE, remove_punct = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "Repo" "and" "www.stackoverflow.com"
#>
tokenize(txt, remove_url = FALSE, remove_punct = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "Repo" "https" "githib.com"
#> [4] "kbenoit" "quanteda" "and"
#> [7] "www.stackoverflow.com"
#>
tokenize(txt, remove_url = FALSE, remove_punct = TRUE, what = "fasterword")
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "Repo" "httpsgithibcomkbenoitquanteda"
#> [3] "and" "wwwstackoverflowcom"
#>
tokenize(txt, remove_url = FALSE, remove_punct = FALSE, what = "fasterword")
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "Repo"
#> [2] "https://githib.com/kbenoit/quanteda,"
#> [3] "and"
#> [4] "www.stackoverflow.com."
#>

## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokenize(txt, remove_punct = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "#textanalysis" "is" "MY" "3"
#> [5] "4U" "@myhandle" "gr8" "#stuff"
#>
tokenize(txt, remove_punct = TRUE, remove_twitter = TRUE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "textanalysis" "is" "MY" "3" "4U"
#> [6] "myhandle" "gr8" "stuff"
#>
#tokenize("great website http://textasdata.com", remove_url = FALSE)
#tokenize("great website http://textasdata.com", remove_url = TRUE)

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt, verbose = TRUE)
#>
#>
#>
#>
#>
#>
#>
#>
#>
#>
#>
#>
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "$" "10" "in" "999"
#> [7] "different" "ways" "," "up" "and" "down"
#> [13] ";" "left" "and" "right" "!"
#>
#> text2 :
#> [1] "@kenbenoit" "working" ":" "on"
#> [5] "#quanteda" "2day" "4ever" ","
#> [9] "http" ":" "/" "/"
#> [13] "textasdata.com" "?" "page" "="
#> [17] "123" "."
#>
tokenize(txt, remove_numbers = TRUE, remove_punct = TRUE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "in" "different" "ways" "up"
#> [7] "and" "down" "left" "and" "right"
#>
#> text2 :
#> [1] "@kenbenoit" "working" "on" "#quanteda"
#> [5] "2day" "4ever" "http" "textasdata.com"
#> [9] "page"
#>
tokenize(txt, remove_numbers = FALSE, remove_punct = TRUE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "10" "in" "999" "different"
#> [7] "ways" "up" "and" "down" "left" "and"
#> [13] "right"
#>
#> text2 :
#> [1] "@kenbenoit" "working" "on" "#quanteda"
#> [5] "2day" "4ever" "http" "textasdata.com"
#> [9] "page" "123"
#>
tokenize(txt, remove_numbers = TRUE, remove_punct = FALSE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "$" "in" "different" "ways"
#> [7] "," "up" "and" "down" ";" "left"
#> [13] "and" "right" "!"
#>
#> text2 :
#> [1] "@kenbenoit" "working" ":" "on"
#> [5] "#quanteda" "2day" "4ever" ","
#> [9] "http" ":" "/" "/"
#> [13] "textasdata.com" "?" "page" "="
#> [17] "."
#>
tokenize(txt, remove_numbers = FALSE, remove_punct = FALSE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "$" "10" "in" "999"
#> [7] "different" "ways" "," "up" "and" "down"
#> [13] ";" "left" "and" "right" "!"
#>
#> text2 :
#> [1] "@kenbenoit" "working" ":" "on"
#> [5] "#quanteda" "2day" "4ever" ","
#> [9] "http" ":" "/" "/"
#> [13] "textasdata.com" "?" "page" "="
#> [17] "123" "."
#>
tokenize(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" " " "is" " " "$" "10"
#> [7] " " "in" " " "999" " " "different"
#> [13] " " "ways" "," "\n" " " "up"
#> [19] " " "and" " " "down" ";" " "
#> [25] "left" " " "and" " " "right" "!"
#>
#> text2 :
#> [1] "@kenbenoit" " " "working" ":"
#> [5] " " "on" " " "#quanteda"
#> [9] " " "2day" "\t" "4ever"
#> [13] "," " " "http" ":"
#> [17] "/" "/" "textasdata.com" "?"
#> [21] "page" "=" "123" "."
#>
tokenize(txt, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
#> tokenizedTexts from 2 documents.
#> text1 :
#> [1] "This" "is" "in" "different" "ways" "up"
#> [7] "and" "down" "left" "and" "right"
#>
#> text2 :
#> [1] "@kenbenoit" "working" "on" "#quanteda" "2day"
#> [6] "4ever"
#>

# character level
tokenize("Great website: http://textasdata.com?page=123.", what = "character")
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/"
#> [20] "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g"
#> [39] "e" "=" "1" "2" "3" "."
#>
tokenize("Great website: http://textasdata.com?page=123.", what = "character",
         remove_separators = FALSE)
#> tokenizedTexts from 1 document.
#> Component 1 :
#> [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p"
#> [20] ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p"
#> [39] "a" "g" "e" "=" "1" "2" "3" "."
#>

# sentence level
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.",
           "Today is Thursday in Canberra: It is yesterday in London.",
           "Today is Thursday in Canberra: \nIt is yesterday in London.",
           "To be? Or\nnot to be?"),
         what = "sentence")
#> tokenizedTexts from 4 documents.
#> Component 1 :
#> [1] "Kurt Vongeut said; only assholes use semi-colons."
#>
#> Component 2 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#>
#> Component 3 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#>
#> Component 4 :
#> [1] "To be?" "Or not to be?"
#>
tokenize(data_corpus_inaugural[c(2, 40)], what = "sentence", simplify = TRUE)
#> [1] "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate."
#> [2] "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America."
#> [3] "Previous to the execution of any official act of the President the Constitution requires an oath of office."
#> [4] "This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony."
#> [5] "Chief Justice, Mr. Vice President, my friends, you will understand and, I believe, agree with my wish that the form of this inauguration be simple and its words brief."
#> [6] "We Americans of today, together with our allies, are passing through a period of supreme test."
#> [7] "It is a test of our courage -- of our resolve -- of our wisdom -- our essential democracy."
#> [8] "If we meet that test -- successfully and honorably -- we shall perform a service of historic importance which men and women and children will honor throughout all time."
#> [9] "As I stand here today, having taken the solemn oath of office in the presence of my fellow countrymen -- in the presence of our God -- I know that it is America's purpose that we shall not fail."
#> [10] "In the days and in the years that are to come we shall work for a just and honorable peace, a durable peace, as today we work and fight for total victory in war."
#> [11] "We can and we will achieve such a peace."
#> [12] "We shall strive for perfection."
#> [13] "We shall not achieve it immediately -- but we still shall strive."
#> [14] "We may make mistakes -- but they must never be mistakes which result from faintness of heart or abandonment of moral principle."
#> [15] "I remember that my old schoolmaster, Dr. Peabody, said, in days that seemed to us then to be secure and untroubled: \"Things in life will not always run smoothly."
#> [16] "Sometimes we will be rising toward the heights -- then all will seem to reverse itself and start downward."
#> [17] "The great fact to remember is that the trend of civilization itself is forever upward; that a line drawn through the middle of the peaks and the valleys of the centuries always has an upward trend.\""
#> [18] "Our Constitution of 1787 was not a perfect instrument; it is not perfect yet."
#> [19] "But it provided a firm base upon which all manner of men, of all races and colors and creeds, could build our solid structure of democracy."
#> [20] "And so today, in this year of war, 1945, we have learned lessons -- at a fearful cost -- and we shall profit by them."
#> [21] "We have learned that we cannot live alone, at peace; that our own well-being is dependent on the well-being of other nations far away."
#> [22] "We have learned that we must live as men, not as ostriches, nor as dogs in the manger."
#> [23] "We have learned to be citizens of the world, members of the human community."
#> [24] "We have learned the simple truth, as Emerson said, that \"The only way to have a friend is to be one.\""
#> [25] "We can gain no lasting peace if we approach it with suspicion and mistrust or with fear."
#> [26] "We can gain it only if we proceed with the understanding, the confidence, and the courage which flow from conviction."
#> [27] "The Almighty God has blessed our land in many ways."
#> [28] "He has given our people stout hearts and strong arms with which to strike mighty blows for freedom and truth."
#> [29] "He has given to our country a faith which has become the hope of all peoples in an anguished world."
#> [30] "So we pray to Him now for the vision to see our way clearly -- to see the way that leads to a better life for ourselves and for all our fellow men -- to the achievement of His will to peace on earth."

# removing features (stopwords) from tokenized texts
txt <- char_tolower(c(mytext1 = "This is a short test sentence.",
                      mytext2 = "Short.",
                      mytext3 = "Short, shorter, and shortest."))
tokenize(txt, remove_punct = TRUE)
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "this" "is" "a" "short" "test" "sentence"
#>
#> mytext2 :
#> [1] "short"
#>
#> mytext3 :
#> [1] "short" "shorter" "and" "shortest"
#>
removeFeatures(tokenize(txt, remove_punct = TRUE), stopwords("english"))
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "short" "test" "sentence"
#>
#> mytext2 :
#> [1] "short"
#>
#> mytext3 :
#> [1] "short" "shorter" "shortest"
#>

# ngram tokenization
tokenize(txt, remove_punct = TRUE, ngrams = 2)
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "this_is" "is_a" "a_short" "short_test"
#> [5] "test_sentence"
#>
#> mytext2 :
#> character(0)
#>
#> mytext3 :
#> [1] "short_shorter" "shorter_and" "and_shortest"
#>
tokenize(txt, remove_punct = TRUE, ngrams = 2, skip = 1, concatenator = " ")
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "this a" "is short" "a test" "short sentence"
#>
#> mytext2 :
#> character(0)
#>
#> mytext3 :
#> [1] "short and" "shorter shortest"
#>
tokenize(txt, remove_punct = TRUE, ngrams = 1:2)
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "this" "is" "a" "short"
#> [5] "test" "sentence" "this_is" "is_a"
#> [9] "a_short" "short_test" "test_sentence"
#>
#> mytext2 :
#> [1] "short"
#>
#> mytext3 :
#> [1] "short" "shorter" "and" "shortest"
#> [5] "short_shorter" "shorter_and" "and_shortest"
#>

# removing features from ngram tokens
removeFeatures(tokenize(txt, remove_punct = TRUE, ngrams = 1:2), stopwords("english"))
#> tokenizedTexts from 3 documents.
#> mytext1 :
#> [1] "short" "test" "sentence" "this_is"
#> [5] "is_a" "a_short" "short_test" "test_sentence"
#>
#> mytext2 :
#> [1] "short"
#>
#> mytext3 :
#> [1] "short" "shorter" "shortest" "short_shorter"
#> [5] "shorter_and" "and_shortest"
#>