as.tokens.Rd
Coercion functions to and from tokens objects, checks for whether an object is a tokens object, and functions to combine tokens objects.
as.tokens(x, concatenator = "_", ...)

# S3 method for list
as.tokens(x, concatenator = "_", ...)

# S3 method for spacyr_parsed
as.tokens(x, concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE, ...)

# S3 method for tokens
as.list(x, ...)

# S3 method for tokens
as.character(x, use.names = FALSE, ...)

is.tokens(x)

# S3 method for tokens
unlist(x, recursive = FALSE, use.names = TRUE)

# S3 method for tokens
+(t1, t2)

# S3 method for tokens
c(...)
x | object to be coerced or checked |
---|---|
concatenator | character used to connect the elements of multi-word expressions; the default is the underscore character. See Details. |
include_pos | character; whether and which part-of-speech tag to use: "none" (do not use any part-of-speech indicator), "pos" (use the pos variable), or "tag" (use the tag variable); the tag is appended to the token after the concatenator |
use_lemma | logical; if TRUE, use the lemma rather than the raw (original) token |
use.names | logical; preserve the names if TRUE |
recursive | a required argument for unlist, but inapplicable to tokens objects |
t1 | first tokens object to be added |
t2 | second tokens object to be added |
... | additional arguments used by specific methods. For c.tokens, these are the tokens objects to be concatenated. |
as.tokens returns a quanteda tokens object.

as.list returns a simple list of characters from a tokens object.

as.character returns a character vector from a tokens object.

is.tokens returns TRUE if the object is of class tokens, FALSE otherwise.

unlist returns a simple vector of characters from a tokens object.

c(...) and + return a tokens object whose documents have been added as a single sequence of documents.
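For orientation, here is a compact round trip through the coercions above. This is a minimal sketch assuming the quanteda package is attached; the output comments are indicative:

```r
library("quanteda")

toks <- tokens(c(doc1 = "a b", doc2 = "c"))
is.tokens(toks)                   # TRUE

# to a plain list of character vectors, and back to tokens
lis <- as.list(toks)              # list(doc1 = c("a", "b"), doc2 = "c")
toks2 <- as.tokens(lis)           # a tokens object again

# flattened character vectors
as.character(toks)                # c("a", "b", "c") -- unnamed by default
unlist(toks, use.names = FALSE)   # the same flattened character vector
```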
The concatenator is used to automatically generate dictionary values for multi-word expressions in tokens_lookup and dfm_lookup. The underscore character is commonly used to join elements of multi-word expressions (e.g. "piece_of_cake", "New_York"), but other characters (e.g. whitespace " " or a hyphen "-") can also be used. In those cases, specify the concatenator used in your tokens so that the conversion knows to treat that character as the inter-word delimiter when reading in the elements that will become the tokens.
# create tokens object from list of characters with custom concatenator
dict <- dictionary(list(country = "United States",
                        sea = c("Atlantic Ocean", "Pacific Ocean")))
lis <- list(c("The", "United-States", "has", "the", "Atlantic-Ocean",
              "and", "the", "Pacific-Ocean", "."))
toks <- as.tokens(lis, concatenator = "-")
tokens_lookup(toks, dict)
#> tokens from 1 document.
#> text1 :
#> [1] "country" "sea"     "sea"

# combining tokens
toks1 <- tokens(c(doc1 = "a b c d e", doc2 = "f g h"))
toks2 <- tokens(c(doc3 = "1 2 3"))
toks1 + toks2
#> tokens from 3 documents.
#> doc1 :
#> [1] "a" "b" "c" "d" "e"
#>
#> doc2 :
#> [1] "f" "g" "h"
#>
#> doc3 :
#> [1] "1" "2" "3"
c(toks1, toks2)
#> tokens from 3 documents.
#> doc1 :
#> [1] "a" "b" "c" "d" "e"
#>
#> doc2 :
#> [1] "f" "g" "h"
#>
#> doc3 :
#> [1] "1" "2" "3"
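The spacyr_parsed method can be sketched as follows. This assumes the spacyr package and a working spaCy installation with a language model; the exact tags and lemmas shown in the comments depend on the model, so treat them as illustrative:

```r
library("spacyr")
spacy_initialize()

sp <- spacy_parse("The cat sat.")

# append part-of-speech tags using the default "/" concatenator,
# e.g. "The/DET" "cat/NOUN" "sat/VERB" "./PUNCT"
as.tokens(sp, include_pos = "pos")

# use lemmas rather than the raw tokens, e.g. "sit" instead of "sat"
as.tokens(sp, use_lemma = TRUE)
```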