These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.
textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)
x | a dfm object |
---|---|
selection | character vector of document names or feature labels from x |
margin | identifies the margin of the dfm on which similarity or difference will be computed: "documents" or "features" |
method | the similarity or distance measure to be used; see Details |
upper | whether the upper triangle of the symmetric \(V \times V\) matrix is recorded |
diag | whether the diagonal of the distance matrix should be recorded |
p | the power of the Minkowski distance |
textstat_simil and textstat_dist return dist class objects.
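As a rough illustration of what a dist class object holds, the sketch below uses base R's dist() on a small dense matrix rather than a dfm. The toy matrix and its names are assumptions made for the example, not quanteda output.

```r
# Toy count matrix (hypothetical data): 3 documents by 4 features.
counts <- matrix(c(2, 0, 1, 3,
                   1, 1, 0, 2,
                   0, 4, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d")))

# A dist object stores only the n * (n - 1) / 2 pairwise values.
d <- dist(counts, method = "euclidean")
n_pairs <- length(d)

# as.matrix() expands it to the full symmetric matrix with a zero diagonal.
m <- as.matrix(d)
print(m)
```

Converting with as.matrix() is convenient when the values need to be indexed by document name, for example m["d1", "d3"].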
textstat_dist options are: "euclidean" (default), "chisquared", "chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski".
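Several of these options are special cases of the Minkowski distance controlled by the power p. The sketch below checks this with base R's dist() on two hypothetical points; it illustrates the standard definitions and does not call textstat_dist itself.

```r
# Two hypothetical points whose coordinate differences are 3 and 4.
pts <- rbind(x = c(0, 0), y = c(3, 4))

# Minkowski with p = 1 matches the Manhattan distance: |3| + |4|.
d_p1        <- as.numeric(dist(pts, method = "minkowski", p = 1))
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))

# Minkowski with p = 2 matches the Euclidean distance: sqrt(3^2 + 4^2).
d_p2        <- as.numeric(dist(pts, method = "minkowski", p = 2))
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))

# The maximum distance is the largest absolute coordinate difference.
d_maximum   <- as.numeric(dist(pts, method = "maximum"))

print(c(p1 = d_p1, p2 = d_p2, maximum = d_maximum))
```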
textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith".
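To make the first two options concrete, the following base R sketch computes cosine similarity and Pearson correlation for two hypothetical count vectors using their textbook definitions. The helper name cosine_sim is an assumption for this example and is not part of quanteda.

```r
# Hypothetical feature counts for two documents.
x <- c(1, 0, 1)
y <- c(1, 1, 0)

# Cosine similarity: dot product divided by the product of the vector norms.
cosine_sim <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

cos_xy <- cosine_sim(x, y)

# Pearson correlation is the cosine similarity of the mean-centred vectors.
cor_xy      <- cor(x, y)
cor_via_cos <- cosine_sim(x - mean(x), y - mean(y))

print(c(cosine = cos_xy, correlation = cor_xy))
```

The two measures can disagree in sign, as they do here, because correlation removes each vector's mean before comparing them.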
If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
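The effect of this normalization can be sketched in base R. The example assumes a plain count matrix in place of a dfm and uses prop.table() as a stand-in for dfm_weight(x, "prop"); Euclidean distance is used only because its sensitivity to document length is easy to see.

```r
# Hypothetical counts: "long" is "short" repeated twice, so the two
# documents have identical feature proportions but different lengths.
counts <- rbind(short = c(2, 1, 1),
                long  = c(4, 2, 2))

# On raw counts the length difference shows up as a non-zero distance.
d_raw <- as.numeric(dist(counts, method = "euclidean"))

# Row-wise proportions (stand-in for proportional dfm weighting).
props <- prop.table(counts, margin = 1)

# After normalization the two documents coincide.
d_prop <- as.numeric(dist(props, method = "euclidean"))

print(c(raw = d_raw, proportional = d_prop))
```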
The "chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271–280. doi.org/10.1007/s004420100716
The "chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision – ECCV 2010 (Vol. 6312, pp. 749–762). Berlin, Heidelberg: Springer. doi.org/10.1007/978-3-642-15552-9_54
"hamming" is \(\sum{(x \neq y)}\).
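Read literally, the formula counts the positions at which two vectors differ. A minimal base R sketch with hypothetical vectors follows; how textstat_dist preprocesses the dfm before counting is not shown here.

```r
# Hypothetical feature vectors for two documents.
x <- c(1, 0, 2, 0, 3)
y <- c(1, 1, 0, 0, 3)

# x != y is TRUE where the entries differ; summing counts those positions.
hamming <- sum(x != y)
print(hamming)
```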
"kullback" is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i) = 0\). When both \(P(x_i)\) and \(P(y_i)\) equal zero, \(P(x_i) * log(P(x_i)/P(y_i))\) is taken to be zero, its limit value. The formula is:
$$\sum{P(x)*log(P(x)/P(y))}$$
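The formula and its zero convention can be sketched in base R as follows. The function name kl_divergence and the two probability vectors are assumptions for illustration, and the code follows the formula above rather than quanteda's internal implementation.

```r
# Kullback-Leibler sum over two probability vectors p and q.
# Terms with p[i] == 0 are dropped, i.e. treated as contributing zero.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Hypothetical distributions over three features.
p <- c(0.5, 0.5, 0)
q <- c(0.25, 0.5, 0.25)

# Only the first term is non-zero: 0.5 * log(0.5 / 0.25) = 0.5 * log(2).
kl_pq <- kl_divergence(p, q)

# The measure is not symmetric: here q has mass where p has none,
# so the reverse direction diverges.
kl_qp <- kl_divergence(q, p)

print(c(kl_pq = kl_pq, kl_qp = kl_qp))
```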
All other measures are described in the proxy package.
textstat_dist, as.list.dist, dist
# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents
(d1 <- textstat_dist(presDfm, margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton     58.90671
#> 2001-Bush        52.82045     63.63961
#> 2005-Bush        62.79331     73.38256  54.32311
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump     55.21775
as.matrix(d1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton      0.00000     58.90671  52.82045  62.79331   51.66237
#> 1997-Clinton     58.90671      0.00000  63.63961  73.38256   59.95832
#> 2001-Bush        52.82045     63.63961   0.00000  54.32311   50.70503
#> 2005-Bush        62.79331     73.38256  54.32311   0.00000   62.33779
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779    0.00000
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama 2017-Trump
#> 1993-Clinton   51.30302   52.14403
#> 1997-Clinton   60.81118   65.85590
#> 2001-Bush      49.03060   48.79549
#> 2005-Bush      57.90509   58.00000
#> 2009-Obama     48.48711   55.65968
#> 2013-Obama      0.00000   55.21775
#> 2017-Trump     55.21775    0.00000

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 2017-Trump      0.00000
#> 1993-Clinton   52.14403
#> 1997-Clinton   65.85590
#> 2001-Bush      48.79549
#> 2005-Bush      58.00000
#> 2009-Obama     55.65968
#> 2013-Obama     55.21775
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "jaccard")
#>              2005-Bush
#> 2005-Bush    1.0000000
#> 1993-Clinton 0.2216867
#> 1997-Clinton 0.2392503
#> 2001-Bush    0.2591195
#> 2009-Obama   0.2502483
#> 2013-Obama   0.2505353
#> 2017-Trump   0.1852761
(d2 <- textstat_dist(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents"))
#>              2009-Obama 2013-Obama
#> 2009-Obama      0.00000   48.48711
#> 2013-Obama     48.48711    0.00000
#> 1993-Clinton   51.66237   51.30302
#> 1997-Clinton   59.95832   60.81118
#> 2001-Bush      50.70503   49.03060
#> 2005-Bush      62.33779   57.90509
#> 2017-Trump     55.65968   55.21775
as.list(d1)
#> $`1993-Clinton`
#>    2005-Bush 1997-Clinton    2001-Bush   2017-Trump   2009-Obama   2013-Obama
#>     62.79331     58.90671     52.82045     52.14403     51.66237     51.30302
#>
#> $`1997-Clinton`
#>    2005-Bush   2017-Trump    2001-Bush   2013-Obama   2009-Obama 1993-Clinton
#>     73.38256     65.85590     63.63961     60.81118     59.95832     58.90671
#>
#> $`2001-Bush`
#> 1997-Clinton    2005-Bush 1993-Clinton   2009-Obama   2013-Obama   2017-Trump
#>     63.63961     54.32311     52.82045     50.70503     49.03060     48.79549
#>
#> $`2005-Bush`
#> 1997-Clinton 1993-Clinton   2009-Obama   2017-Trump   2013-Obama    2001-Bush
#>     73.38256     62.79331     62.33779     58.00000     57.90509     54.32311
#>
#> $`2009-Obama`
#>    2005-Bush 1997-Clinton   2017-Trump 1993-Clinton    2001-Bush   2013-Obama
#>     62.33779     59.95832     55.65968     51.66237     50.70503     48.48711
#>
#> $`2013-Obama`
#> 1997-Clinton    2005-Bush   2017-Trump 1993-Clinton    2001-Bush   2009-Obama
#>     60.81118     57.90509     55.21775     51.30302     49.03060     48.48711
#>
#> $`2017-Trump`
#> 1997-Clinton    2005-Bush   2009-Obama   2013-Obama 1993-Clinton    2001-Bush
#>     65.85590     58.00000     55.65968     55.21775     52.14403     48.79549
#>

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton    0.6680262
#> 2001-Bush       0.5358898    0.5912236
#> 2005-Bush       0.5012215    0.5006142 0.5982538
#> 2009-Obama      0.6280946    0.6593018 0.6018113 0.5266249
#> 2013-Obama      0.6265428    0.6466030 0.6193608 0.5867178  0.6815711
#> 2017-Trump      0.5511398    0.5558054 0.5327058 0.5386656  0.5192075
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump    0.5160104
as.matrix(s1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton    1.0000000    0.6680262 0.5358898 0.5012215  0.6280946
#> 1997-Clinton    0.6680262    1.0000000 0.5912236 0.5006142  0.6593018
#> 2001-Bush       0.5358898    0.5912236 1.0000000 0.5982538  0.6018113
#> 2005-Bush       0.5012215    0.5006142 0.5982538 1.0000000  0.5266249
#> 2009-Obama      0.6280946    0.6593018 0.6018113 0.5266249  1.0000000
#> 2013-Obama      0.6265428    0.6466030 0.6193608 0.5867178  0.6815711
#> 2017-Trump      0.5511398    0.5558054 0.5327058 0.5386656  0.5192075
#>              2013-Obama 2017-Trump
#> 1993-Clinton  0.6265428  0.5511398
#> 1997-Clinton  0.6466030  0.5558054
#> 2001-Bush     0.6193608  0.5327058
#> 2005-Bush     0.5867178  0.5386656
#> 2009-Obama    0.6815711  0.5192075
#> 2013-Obama    1.0000000  0.5160104
#> 2017-Trump    0.5160104  1.0000000
as.list(s1)
#> $`1993-Clinton`
#> 1997-Clinton   2009-Obama   2013-Obama   2017-Trump    2001-Bush    2005-Bush
#>    0.6680262    0.6280946    0.6265428    0.5511398    0.5358898    0.5012215
#>
#> $`1997-Clinton`
#> 1993-Clinton   2009-Obama   2013-Obama    2001-Bush   2017-Trump    2005-Bush
#>    0.6680262    0.6593018    0.6466030    0.5912236    0.5558054    0.5006142
#>
#> $`2001-Bush`
#>   2013-Obama   2009-Obama    2005-Bush 1997-Clinton 1993-Clinton   2017-Trump
#>    0.6193608    0.6018113    0.5982538    0.5912236    0.5358898    0.5327058
#>
#> $`2005-Bush`
#>    2001-Bush   2013-Obama   2017-Trump   2009-Obama 1993-Clinton 1997-Clinton
#>    0.5982538    0.5867178    0.5386656    0.5266249    0.5012215    0.5006142
#>
#> $`2009-Obama`
#>   2013-Obama 1997-Clinton 1993-Clinton    2001-Bush    2005-Bush   2017-Trump
#>    0.6815711    0.6593018    0.6280946    0.6018113    0.5266249    0.5192075
#>
#> $`2013-Obama`
#>   2009-Obama 1997-Clinton 1993-Clinton    2001-Bush    2005-Bush   2017-Trump
#>    0.6815711    0.6466030    0.6265428    0.6193608    0.5867178    0.5160104
#>
#> $`2017-Trump`
#> 1997-Clinton 1993-Clinton    2005-Bush    2001-Bush   2009-Obama   2013-Obama
#>    0.5558054    0.5511398    0.5386656    0.5327058    0.5192075    0.5160104
#>

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 2017-Trump    1.0000000
#> 1993-Clinton  0.4967910
#> 1997-Clinton  0.4989669
#> 2001-Bush     0.4672634
#> 2005-Bush     0.4739241
#> 2009-Obama    0.4377484
#> 2013-Obama    0.4414144
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
#>              2017-Trump
#> 2017-Trump    1.0000000
#> 1993-Clinton  0.5511398
#> 1997-Clinton  0.5558054
#> 2001-Bush     0.5327058
#> 2005-Bush     0.5386656
#> 2009-Obama    0.5192075
#> 2013-Obama    0.5160104
textstat_simil(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents")
#>              2009-Obama 2013-Obama
#> 2009-Obama    1.0000000  0.6103693
#> 2013-Obama    0.6103693  1.0000000
#> 1993-Clinton  0.5707623  0.5725041
#> 1997-Clinton  0.6026942  0.5916516
#> 2001-Bush     0.5241995  0.5523905
#> 2005-Bush     0.4330978  0.5137096
#> 2017-Trump    0.4377484  0.4414144

# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine",
                     margin = "features")
head(as.matrix(s2), 10)
#>               fair    health     terror
#> fair     1.0000000 0.7559289 0.15430335
#> health   0.7559289 1.0000000 0.54433105
#> terror   0.1543033 0.5443311 1.00000000
#> fellow   0.4265617 0.7181848 0.67016625
#> citizen  0.6787417 0.7144508 0.49663296
#> today    0.6265515 0.8288497 0.59866609
#> celebr   0.4472136 0.6761234 0.48304589
#> mysteri  0.2672612 0.2357023 0.28867513
#> american 0.5665941 0.7335861 0.67709711
#> renew    0.5041842 0.4850713 0.09901475
as.list(s2, n = 8)
#> $fair
#>   continu    purpos    travel    failur      lead     begin    courag      call
#> 1.0000000 0.9636241 0.9561829 0.9449112 0.9166985 0.9091373 0.9091373 0.8971226
#>