Extracts features from a text vector.
textfeatures(x, sentiment = TRUE, word_dims = NULL, threads = 1, normalize = TRUE, export = FALSE)
Argument | Description |
---|---|
x | Input data. Should be a character vector or a data frame with the character variable of interest named "text". If a data frame, the first "id|*_id" variable, if found, is assumed to be an ID variable. |
sentiment | Logical indicating whether to return sentiment analysis features, the variables sent_afinn and sent_bing. |
word_dims | Integer indicating the desired number of word2vec dimension estimates. When NULL, the default, this function picks a reasonable number of dimensions (ranging from 2 to 200) based on the size of the input. To disable word2vec estimates, set this to 0 or FALSE. |
threads | Integer specifying the number of threads to use when generating word2vec estimates. Defaults to 1. Ignored if word2vec estimates are disabled (word_dims set to 0 or FALSE). |
normalize | Logical indicating whether to normalize (mean center, sd = 1) features. Defaults to TRUE. |
export | Logical indicating whether to store sufficient information for exporting the feature extraction process (stores the means, standard deviations, and the word2vec reference object, which can then be used to process new data). |
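The normalization described for the normalize argument (mean center, sd = 1) can be illustrated with base R alone. The sketch below is not the package's internal code; it assumes a hypothetical vector of raw URL counts for five texts in which only the first contains a URL, and it ignores any other transformation the package may apply before scaling.

```r
# Hypothetical raw counts: one URL in the first text, none in the others.
n_urls <- c(1, 0, 0, 0, 0)

# Mean-center, then divide by the sample standard deviation.
z <- (n_urls - mean(n_urls)) / sd(n_urls)

# Base R's scale() performs the same centering and scaling by default.
z_scale <- as.numeric(scale(n_urls))

print(round(z, 3))
```

For these counts the first value rounds to 1.789 and the remaining four to -0.447, so the scaled feature has mean 0 and standard deviation 1.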
A tibble data frame with extracted features as columns.
## the text of five of Trump's most retweeted tweets
trump_tweets <- c(
  "#FraudNewsCNN #FNN https://t.co/WYUnHjjUjg",
  "TODAY WE MAKE AMERICA GREAT AGAIN!",
  paste("Why would Kim Jong-un insult me by calling me \"old,\" when I would",
    "NEVER call him \"short and fat?\" Oh well, I try so hard to be his",
    "friend - and maybe someday that will happen!"),
  paste("Such a beautiful and important evening! The forgotten man and woman",
    "will never be forgotten again. We will all come together as never before"),
  paste("North Korean Leader Kim Jong Un just stated that the \"Nuclear",
    "Button is on his desk at all times.\" Will someone from his depleted and",
    "food starved regime please inform him that I too have a Nuclear Button,",
    "but it is a much bigger & more powerful one than his, and my Button",
    "works!")
)

## get the text features of a character vector
textfeatures(trump_tweets)
#> INFO [2018-11-28 17:11:02] iter 10 loglikelihood = -11.817
#> INFO [2018-11-28 17:11:02] iter 20 loglikelihood = -12.426
#> INFO [2018-11-28 17:11:02] early stopping at 20 iteration
#> # A tibble: 5 x 31
#>   id    n_urls n_hashtags n_mentions n_chars n_commas n_digits n_exclaims
#>   <chr>  <dbl>      <dbl>      <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
#> 1 1      1.79       1.79           0  -1.68    -0.730        0     -1.79
#> 2 2     -0.447     -0.447          0  -0.141   -0.730        0      0.447
#> 3 3     -0.447     -0.447          0   0.559    1.10         0      0.447
#> 4 4     -0.447     -0.447          0   0.478   -0.730        0      0.447
#> 5 5     -0.447     -0.447          0   0.784    1.10         0      0.447
#> # ... with 23 more variables: n_extraspaces <dbl>, n_lowers <dbl>,
#> #   n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, n_polite <dbl>, n_first_person <dbl>,
#> #   n_first_personp <dbl>, n_second_person <dbl>, n_second_personp <dbl>,
#> #   n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>, V3 <dbl>,
#> #   w1 <dbl>, w2 <dbl>

## data frame with a character vector named "text"
df <- data.frame(
  id = c(1, 2, 3),
  text = c("this is A!\t sEntence https://github.com about #rstats @github",
    "and another sentence here",
    "The following list:\n- one\n- two\n- three\nOkay!?!"),
  stringsAsFactors = FALSE
)

## get text features of a data frame with "text" variable
textfeatures(df)
#> Warning: dtm has 0 rows. Empty iterator?
#> INFO [2018-11-28 17:11:02] iter 10 loglikelihood = 0.000
#> INFO [2018-11-28 17:11:02] early stopping at 10 iteration
#> # A tibble: 3 x 30
#>      id n_urls n_hashtags n_mentions n_chars n_commas n_digits n_exclaims
#>   <dbl>  <dbl>      <dbl>      <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
#> 1     1  1.15       1.15       1.15   -0.792        0        0      0.173
#> 2     2 -0.577     -0.577     -0.577  -0.332        0        0     -1.08
#> 3     3 -0.577     -0.577     -0.577   1.12         0        0      0.902
#> # ... with 22 more variables: n_extraspaces <dbl>, n_lowers <dbl>,
#> #   n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, n_polite <dbl>, n_first_person <dbl>,
#> #   n_first_personp <dbl>, n_second_person <dbl>, n_second_personp <dbl>,
#> #   n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>, V2 <dbl>,
#> #   w1 <dbl>
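
The arguments documented above can also be combined to skip the slower or noisier steps. The following is a minimal sketch, assuming the textfeatures package is installed (the block does nothing otherwise): it turns off the sentiment features, disables word2vec estimates with word_dims = 0, and returns unnormalized values. The object name feats is illustrative only, and the exact set of returned columns may differ between package versions.

```r
# Runs only if the textfeatures package is available.
if (requireNamespace("textfeatures", quietly = TRUE)) {
  df <- data.frame(
    id = c(1, 2, 3),
    text = c("this is A!\t sEntence https://github.com about #rstats @github",
      "and another sentence here",
      "The following list:\n- one\n- two\n- three\nOkay!?!"),
    stringsAsFactors = FALSE
  )

  # No sentiment features, no word2vec dimensions, raw (unnormalized) values.
  feats <- textfeatures::textfeatures(
    df,
    sentiment = FALSE,
    word_dims = 0,
    normalize = FALSE
  )
  print(feats)
}
```

With normalize = FALSE the columns are no longer mean-centered, which can make a quick sanity check of individual features easier.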