Construct a sparse document-feature matrix from a character, corpus, tokens, or even another dfm object.
dfm( x, tolower = TRUE, stem = FALSE, select = NULL, remove = NULL, dictionary = NULL, thesaurus = NULL, valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, groups = NULL, verbose = quanteda_options("verbose"), ... )
x | |
---|---|
tolower | convert all features to lowercase |
stem | if |
select | a pattern of user-supplied features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set |
remove | a pattern of user-supplied features to ignore, such as "stop words". To access one possible list (from any list you wish), use |
dictionary | a dictionary object to apply to the tokens when creating the dfm |
thesaurus | a dictionary object that will be applied as if |
valuetype | the type of pattern matching: |
case_insensitive | logical; if |
groups | either: a character vector containing the names of document variables to be used for grouping; or a factor or object that can be coerced into a factor equal in length or rows to the number of documents. |
verbose | display messages if |
... | additional arguments passed to tokens; not used when |
a dfm object
The default behaviour for remove/select when constructing ngrams using dfm(x, ngrams > 1) is to remove/select any ngram constructed from a matching feature. If you wish to remove these features before constructing ngrams, you will need to first tokenize the texts, then remove the features to be ignored, then form the ngrams, and finally construct the dfm from this modified tokens object. See the code examples for an illustration.
To select on and match the features of another dfm, x must also be a dfm.
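The idea of conforming one dfm to the features of another can be sketched as follows. This is a minimal illustration, not the page's own example: it uses `dfm_match()`, which is assumed to be available in the installed quanteda version, rather than the `select` argument described above, and the toy texts and object names (`dfmat_ref`, `dfmat_new`) are made up for the purpose.

```r
library(quanteda)

# Two small dfms with partially overlapping vocabularies (hypothetical toy data)
dfmat_ref <- dfm(tokens(c(d1 = "a b c", d2 = "b c d")))
dfmat_new <- dfm(tokens(c(d3 = "c d e f")))

# Conform dfmat_new to the feature set of dfmat_ref:
# features not in dfmat_ref are dropped, and features of dfmat_ref
# that are absent from dfmat_new are added with zero counts
dfmat_matched <- dfm_match(dfmat_new, features = featnames(dfmat_ref))

featnames(dfmat_matched)
```

After matching, both objects share the same columns in the same order, which is typically what is needed before applying a model fitted on one dfm to another.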
When x is a dfm, groups provides a convenient and fast method of combining and refactoring the documents of the dfm according to the groups.
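Grouping the documents of an existing dfm might look like the sketch below. It is an assumption-laden illustration: it calls `dfm_group()` directly (assumed to be available in the installed quanteda version) instead of passing `groups` to `dfm()`, and the corpus, the `party` document variable, and the object names are hypothetical.

```r
library(quanteda)

# Toy corpus with a hypothetical document variable "party"
corp_toy <- corpus(c(t1 = "a b", t2 = "b c", t3 = "c d"),
                   docvars = data.frame(party = c("X", "Y", "X")))
dfmat <- dfm(tokens(corp_toy))

# Combine documents that share the same value of "party";
# feature counts are summed within each group
dfmat_grouped <- dfm_group(dfmat, groups = factor(docvars(dfmat, "party")))

docnames(dfmat_grouped)
```

Passing the grouping values as a factor of the same length as the number of documents matches the second form of `groups` described in the argument table, and avoids depending on how a given version resolves document variable names.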
#> Document-feature matrix of: 10 documents, 3,261 features (77.4% sparse) and 4 docvars.
#>               features
#> docs           senator hatfield   , mr   . chief justice president vice bush
#>   1981-Reagan        2        1 174  3 130     1       1         5    2    1
#>   1985-Reagan        4        0 177  0 124     1       1         3    1    1
#>   1989-Bush          2        0 166  6 142     1       2         6    1    0
#>   1993-Clinton       0        0 139  0  81     0       0         2    0    1
#>   1997-Clinton       0        0 131  0 108     0       1         1    0    0
#>   2001-Bush          0        0 110  0  96     0       3         3    1    0
#> [ reached max_ndoc ... 4 more documents, reached max_nfeat ... 3,251 more features ]

dfm(corp, tolower = FALSE)
#> Document-feature matrix of: 10 documents, 3,479 features (77.7% sparse) and 4 docvars.
#>               features
#> docs           Senator Hatfield   , Mr   . Chief Justice President Vice Bush
#>   1981-Reagan        2        1 174  3 130     1       1         5    2    1
#>   1985-Reagan        4        0 177  0 124     1       1         3    1    1
#>   1989-Bush          2        0 166  6 142     1       1         6    1    0
#>   1993-Clinton       0        0 139  0  81     0       0         1    0    1
#>   1997-Clinton       0        0 131  0 108     0       0         1    0    0
#>   2001-Bush          0        0 110  0  96     0       0         3    1    0
#> [ reached max_ndoc ... 4 more documents, reached max_nfeat ... 3,469 more features ]

# grouping documents by docvars in a corpus
dfm(corp, groups = "President", verbose = TRUE)
#> Document-feature matrix of: 5 documents, 3,261 features (64.2% sparse) and 2 docvars.
#>          features
#> docs      senator hatfield   , mr   . chief justice president vice bush
#>   Bush          2        0 396  7 336     2      11        13    3    1
#>   Clinton       0        0 270  0 189     0       1         3    0    1
#>   Obama         0        0 229  1 207     1       2         3    1    1
#>   Reagan        6        1 351  3 254     2       2         8    3    2
#>   Trump         0        0  96  0  88     1       1         5    0    1
#> [ reached max_nfeat ... 3,251 more features ]

# with English stopwords and stemming
dfm(corp, remove = stopwords("english"), stem = TRUE, verbose = TRUE)
#> Document-feature matrix of: 10 documents, 2,304 features (75.1% sparse) and 4 docvars.
#>               features
#> docs           senat hatfield   , mr   . chief justic presid vice bush
#>   1981-Reagan      2        1 174  3 130     1      1      6    2    1
#>   1985-Reagan      4        0 177  0 124     1      1      3    1    1
#>   1989-Bush        3        0 166  6 142     1      2      7    1    0
#>   1993-Clinton     0        0 139  0  81     0      0      3    0    1
#>   1997-Clinton     0        0 131  0 108     0      1      1    0    0
#>   2001-Bush        0        0 110  0  96     0      3      3    1    0
#> [ reached max_ndoc ... 4 more documents, reached max_nfeat ... 2,294 more features ]

# works for both words in ngrams too
tokens("Banking industry") %>%
    tokens_ngrams(n = 2) %>%
    dfm(stem = TRUE)
#> Document-feature matrix of: 1 document, 1 feature (0.0% sparse).
#>        features
#> docs    bank_industri
#>   text1             1

# with dictionaries
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject", "notincorpus"),
                        taxing = "taxing",
                        taxation = "taxation",
                        taxregex = "tax*",
                        country = "states"))
dfm(corpus_subset(data_corpus_inaugural, Year > 1900), dictionary = dict)
#> Document-feature matrix of: 30 documents, 6 features (73.3% sparse) and 4 docvars.
#>                 features
#> docs             christmas opposition taxing taxation taxregex country
#>   1901-McKinley          0          2      0        1        1       9
#>   1905-Roosevelt         0          0      0        0        0       0
#>   1909-Taft              0          1      0        4        6      12
#>   1913-Wilson            0          0      0        1        1       0
#>   1917-Wilson            0          0      0        0        0       2
#>   1921-Harding           0          0      0        1        2       1
#> [ reached max_ndoc ... 24 more documents ]

# removing stopwords
txt <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with the newspaper from a boy named Seamus, in his mouth."
corp <- corpus(txt)
# note: "also" is not in the default stopwords("english")
featnames(dfm(corp, select = stopwords("english")))
#> [1] "the"  "over" "with" "from" "a"    "in"   "his"
#> Warning: ngrams argument is not used.
#> [1] "the"  "over" "with" "from" "a"    "in"   "his"
#> Warning: ngrams argument is not used.
#> [1] "the"  "over" "with" "from" "a"    "in"   "his"

# removing stopwords before constructing ngrams
toks1 <- tokens(char_tolower(txt), remove_punct = TRUE)
toks2 <- tokens_remove(toks1, stopwords("english"))
toks3 <- tokens_ngrams(toks2, 2)
featnames(dfm(toks3))
#>  [1] "quick_brown"      "brown_fox"        "fox_named"        "named_seamus"
#>  [5] "seamus_jumps"     "jumps_lazy"       "lazy_dog"         "dog_also"
#>  [9] "also_named"       "seamus_newspaper" "newspaper_boy"    "boy_named"
#> [13] "seamus_mouth"

# keep only certain words
dfm(corp, select = "*s")  # keep only words ending in "s"
#> Document-feature matrix of: 1 document, 3 features (0.0% sparse).
#>        features
#> docs    seamus jumps his
#>   text1      3     1   1

dfm(corp, select = "s$", valuetype = "regex")
#> Document-feature matrix of: 1 document, 3 features (0.0% sparse).
#>        features
#> docs    seamus jumps his
#>   text1      3     1   1

# testing Twitter functions
txttweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
               "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
               "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(txttweets, select = "#*", split_tags = FALSE)  # keep only hashtags
#> Warning: split_tags argument is not used.
#> Document-feature matrix of: 3 documents, 6 features (50.0% sparse).
#>        features
#> docs    #justinbieber #la #beliebers #emabiggestfansjustinbieber #belieber
#>   text1             1   1          1                           0         0
#>   text2             1   0          0                           1         0
#>   text3             1   0          0                           1         1
#>        features
#> docs    #fetusjustin
#>   text1            0
#>   text2            0
#>   text3            1

dfm(txttweets, select = "^#.*$", valuetype = "regex", split_tags = FALSE)
#> Warning: split_tags argument is not used.
#> Document-feature matrix of: 3 documents, 6 features (50.0% sparse).
#>        features
#> docs    #justinbieber #la #beliebers #emabiggestfansjustinbieber #belieber
#>   text1             1   1          1                           0         0
#>   text2             1   0          0                           1         0
#>   text3             1   0          0                           1         1
#>        features
#> docs    #fetusjustin
#>   text1            0
#>   text2            0
#>   text3            1

#> Document-feature matrix of: 2 documents, 3,261 features (33.9% sparse) and 1 docvar.
#>             features
#> docs         senator hatfield   , mr   . chief justice president vice bush
#>   Democratic       0        0 499  1 396     1       3         6    1    2
#>   Republican       8        1 843 10 678     5      14        26    6    4
#> [ reached max_nfeat ... 3,251 more features ]