Compute similarities between documents and/or features from a dfm. Uses the similarity measures defined in simil. See pr_DB for available distance measures, or how to create your own.
similarity(x, selection = NULL, n = NULL, margin = c("documents", "features"),
  method = "correlation", sorted = TRUE, normalize = FALSE)

# S4 method for dfm
similarity(x, selection = NULL, n = NULL, margin = c("documents", "features"),
  method = "correlation", sorted = TRUE, normalize = FALSE)

# S3 method for similMatrix
as.matrix(x, ...)
Argument | Description |
---|---|
x | a dfm object |
selection | character or character vector of document names or feature labels from the dfm |
n | the top |
margin | identifies the margin of the dfm on which similarity will be computed: "documents" or "features" |
method | a valid method for computing similarity from pr_DB |
sorted | sort results in descending order if TRUE |
normalize | a deprecated argument retained (temporarily) for legacy reasons. If you want to compute similarity on a "normalized" dfm object (e.g. |
... | unused |
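To make the margin and method arguments concrete, here is a minimal base R sketch of what computing similarity on the "documents" versus "features" margin amounts to. It is not the package's implementation: the toy matrix `m` and the helper `cosine_rows()` are hypothetical names introduced only for illustration, and a plain matrix stands in for a dfm.

```r
# Toy document-feature matrix: rows are documents, columns are features.
# The counts are made up for illustration.
m <- matrix(c(2, 1, 0,
              4, 2, 0,
              0, 1, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("tax", "peace", "war")))

# Hypothetical helper: cosine similarity between the rows of a numeric matrix.
cosine_rows <- function(mat) {
  norms <- sqrt(rowSums(mat^2))          # Euclidean length of each row
  tcrossprod(mat) / outer(norms, norms)  # dot products scaled by lengths
}

doc_cos  <- cosine_rows(m)     # analogous to margin = "documents"
feat_cos <- cosine_rows(t(m))  # analogous to margin = "features"
doc_cor  <- cor(t(m))          # Pearson correlation between documents
```

Here `doc2` is exactly twice `doc1`, so both the cosine and the correlation between those two documents equal 1, while `doc1` and `doc3` share only one feature and score much lower.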
a named list of the selection labels, with a sorted named vector of similarity measures.
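The shape of the return value can be sketched in base R as follows. This is only an illustration under assumptions, not the package's code: `top_similar()` and the toy matrix `sim` are hypothetical, and dropping each item's similarity with itself is inferred from the example output for "2005-Bush", which lists only the other documents.

```r
# Toy symmetric similarity matrix (made-up values).
docs <- c("a", "b", "c")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.4,
                0.2, 0.4, 1.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(docs, docs))

# Hypothetical helper: build a named list with one named vector per
# selected label, optionally sorted and truncated to the top n items.
top_similar <- function(sim, selection, n = NULL, sorted = TRUE) {
  out <- lapply(selection, function(label) {
    s <- sim[label, colnames(sim) != label]  # drop self-similarity
    if (sorted) s <- sort(s, decreasing = TRUE)
    if (!is.null(n)) s <- head(s, n)
    s
  })
  names(out) <- selection
  out
}

res <- top_similar(sim, c("a", "c"), n = 1)
```

In this sketch `res` is a list with elements named "a" and "c", each holding the single most similar other item as a named numeric vector.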
The method for computing feature similarities can be quite slow when there are large numbers of feature types. Future implementations will hopefully speed this up.
# create a dfm from inaugural addresses from Reagan onwards
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), stem = TRUE,
               remove = stopwords("english"))

# compute some document similarities
(tmp <- similarity(presDfm, margin = "documents"))
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found

# output as a matrix
as.matrix(tmp)
#> Error in as.matrix(tmp): object 'tmp' not found

# for specific comparisons
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n = 5, margin = "documents")
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
similarity(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
similarity(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents", method = "cosine")
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in getMethod("t", "dgCMatrix"): no generic function found for 't'
similarity(presDfm, "2005-Bush", margin = "documents", method = "eJaccard", sorted = FALSE)
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> similarity Matrix:
#> $`2005-Bush`
#>  1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton 1997-Clinton    2001-Bush
#>    0.8063781    0.8017041    0.8077805    0.8507758    0.8211545    0.8875512
#>   2009-Obama   2013-Obama   2017-Trump
#>    0.8172377    0.8454122    0.8393692
#>

# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method="cosine", margin = "features", 20)
#> Warning: 'similarity' is deprecated.
#> Use 'textstat_simil' instead.
#> See help("Deprecated")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found