The dfm class of object is a type of Matrix-class object with additional
slots, described below. quanteda uses two subclasses of the dfm class,
depending on whether the object can be represented by a sparse matrix, in
which case it is a dfmSparse class object, or if dense, then a dfmDense
object. See Details.
# S4 method for dfmDense
t(x)

# S4 method for dfmSparse
t(x)

# S4 method for dfmSparse
colSums(x, na.rm = FALSE, dims = 1L, ...)

# S4 method for dfmSparse
rowSums(x, na.rm = FALSE, dims = 1L, ...)

# S4 method for dfmSparse
colMeans(x, na.rm = FALSE, dims = 1L, ...)

# S4 method for dfmSparse
rowMeans(x, na.rm = FALSE, dims = 1L, ...)

# S4 method for dfmSparse,numeric
+(e1, e2)

# S4 method for numeric,dfmSparse
+(e1, e2)

# S4 method for dfmDense,numeric
+(e1, e2)

# S4 method for numeric,dfmDense
+(e1, e2)

# S4 method for dfm,index,index,missing
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,index,index,logical
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,missing,missing,missing
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,missing,missing,logical
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,index,missing,missing
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,index,missing,logical
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,missing,index,missing
[(x, i, j, ..., drop = FALSE)

# S4 method for dfm,missing,index,logical
[(x, i, j, ..., drop = FALSE)
Argument | Description |
---|---|
x | the dfm object |
na.rm | if TRUE, omit missing values from the calculations |
dims | ignored |
... | additional arguments not used here |
e1 | first quantity in "+" operation for dfm |
e2 | second quantity in "+" operation for dfm |
i | index for documents |
j | index for features |
drop | always set to FALSE |
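The behaviour of the summary and subsetting methods can be sketched on a plain sparse matrix from the Matrix package, used here only as a stand-in for a dfm. The toy matrix `m` and its row and column names are hypothetical. The sketch assumes that the dfm methods follow the same conventions as the Matrix methods they extend.

```r
library(Matrix)

# Hypothetical stand-in for a dfm: 2 documents x 3 features.
m <- sparseMatrix(
  i = c(1, 1, 2),              # document (row) indices
  j = c(1, 2, 2),              # feature (column) indices
  x = c(1, 2, 3),              # cell counts
  dims = c(2, 3),
  dimnames = list(c("text1", "text2"), c("a", "b", "c"))
)

feature_totals <- colSums(m)   # one total per feature
doc_totals     <- rowSums(m)   # one total per document

# With drop = FALSE a single-document subset stays two-dimensional
# instead of collapsing to a vector.
one_doc <- m[1, , drop = FALSE]
dim(one_doc)
```

Keeping `drop = FALSE` means that a subset of one document or one feature is still a matrix, so later code does not need a special case for vectors.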
The dfm class is a virtual class that will contain one of two subclasses
for containing the cell counts of document-feature matrices: dfmSparse or
dfmDense.
The dfmSparse class is a sparse matrix version of dfm-class, inheriting
dgCMatrix-class from the Matrix package. It is the default object type
created when feature counts are the object of interest, as typical
text-based feature counts tend to contain many zeroes. As long as subsequent
transformations of the dfm preserve cells with zero counts, the dfm should
remain sparse.

When the Matrix package implements sparse integer matrices, we will switch
the default object class to this object type, as integers are 4 bytes each
(compared to the current numeric double type, which requires 8 bytes per
cell).
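The storage argument above can be made concrete with a small sketch that uses only the Matrix package. The toy matrix `counts` is hypothetical and stands in for the count data of a dfm. The byte figures cover the cell values only and ignore the index vectors and object overhead that a sparse matrix also stores.

```r
library(Matrix)

# Hypothetical toy counts: 3 documents x 1000 features, only 4 nonzero cells.
counts <- sparseMatrix(
  i = c(1, 1, 2, 3),           # document (row) indices
  j = c(1, 2, 2, 1000),        # feature (column) indices
  x = c(2, 1, 3, 1),           # counts, stored as doubles
  dims = c(3, 1000)
)

nnz          <- length(counts@x)        # cells that are actually stored
dense_bytes  <- prod(dim(counts)) * 8   # 8 bytes per cell if all 3000 cells were stored
stored_bytes <- nnz * 8                 # bytes for the stored values alone

c(nnz = nnz, dense_bytes = dense_bytes, stored_bytes = stored_bytes)
```

Only the nonzero cells occupy value storage in the sparse form, which is why typical count data with many zeroes is far cheaper to hold this way.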
The dfmDense class is a dense matrix version of dfm-class, inheriting
dgeMatrix-class from the Matrix package. dfm objects that are converted
through weighting or other transformations into cells without zeroes will
be automatically converted to the dfmDense class. This will necessarily be
a much larger object than one of dfmSparse class, because every cell is
recorded as a numeric (double) type requiring 8 bytes of storage.
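The difference between transformations that preserve zeroes and those that do not can be sketched with the Matrix package alone. The matrix `counts` is hypothetical, and the sketch shows the behaviour of the underlying Matrix classes rather than the dfm conversion code itself.

```r
library(Matrix)

# Hypothetical toy counts: 2 documents x 3 features with three zero cells.
counts <- sparseMatrix(
  i = c(1, 2, 2),
  j = c(1, 2, 3),
  x = c(2, 1, 3),
  dims = c(2, 3)
)

scaled   <- counts * 0.5   # 0 * 0.5 is still 0, so zero cells are preserved
smoothed <- counts + 1     # every zero cell becomes 1, so no zeroes remain

is(scaled, "sparseMatrix")    # the scaled matrix keeps a sparse representation
is(smoothed, "denseMatrix")   # the shifted matrix is stored densely
```

Multiplicative weighting keeps the sparse representation, while adding a constant fills every cell and forces dense storage.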
settings
settings that govern corpus handling and subsequent downstream operations, including the settings used to clean and tokenize the texts, and to create the dfm. See settings.

weighting
the feature weighting applied to the dfm. Default is "frequency", indicating that the values in the cells of the dfm are simple feature counts. To change this, use the weight method.

smooth
a smoothing parameter, defaults to zero. Can be changed using either the smooth or the weight methods.

Dimnames
These are inherited from Matrix-class but are named docs and features respectively.
# dfm subsetting
x <- dfm(tokens(c("this contains lots of stopwords",
                  "no if, and, or but about it: lots",
                  "and a third document is it"),
                remove_punct = TRUE))
x[1:2, ]
#> Document-feature matrix of: 2 documents, 16 features (59.4% sparse).
#> 2 x 16 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this contains lots of stopwords no if and or but about it a third
#>   text1    1        1    1  1         1  0  0   0  0   0     0  0 0     0
#>   text2    0        0    1  0         0  1  1   1  1   1     1  1 0     0
#>        features
#> docs    document is
#>   text1        0  0
#>   text2        0  0
x[1:2, 1:5]
#> Document-feature matrix of: 2 documents, 5 features (40% sparse).
#> 2 x 5 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this contains lots of stopwords
#>   text1    1        1    1  1         1
#>   text2    0        0    1  0         0

# fcm subsetting
y <- fcm(tokens(c("this contains lots of stopwords",
                  "no if, and, or but about it: lots"),
                remove_punct = TRUE))
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
y[1:3, ]
#> Error in eval(expr, envir, enclos): object 'y' not found
y[4:5, 1:5]
#> Error in eval(expr, envir, enclos): object 'y' not found