In this vignette, we show how to perform Latent Semantic Analysis (LSA) with the quanteda package, following the worked example in Grossman and Frieder’s Information Retrieval: Algorithms and Heuristics.
LSA decomposes a document-feature matrix into a reduced vector space that is assumed to reflect semantic structure.
New documents or queries can be ‘folded-in’ to this constructed latent semantic space for downstream tasks.
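The decomposition and the fold-in step can be written compactly. The following is only a sketch in the standard textbook notation for LSA; the symbols ($X$ for the document-feature matrix with documents in rows, $k$ for the number of retained dimensions, $q$ for a query written as a row vector of feature counts) are assumptions introduced here, not names used by quanteda.

```latex
% Rank-k truncated singular value decomposition of the document-feature matrix
X \approx X_k = U_k \, \Sigma_k \, V_k^{\top}

% Rows of U_k: document coordinates; rows of V_k: feature coordinates.
% Folding a new query q (1 x m row vector of counts) into the same space:
\hat{q} = q \, V_k \, \Sigma_k^{-1}
```

Under this notation, the query is projected onto the feature coordinates and then rescaled by the inverse singular values, so that it lands in the same space as the rows of $U_k$.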
library(quanteda)
txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck")
# tokenize explicitly; recent quanteda versions expect tokens as input to dfm()
mydfm <- dfm(tokens(txt))
mydfm
## Document-feature matrix of: 3 documents, 11 features (36.4% sparse).
##     features
## docs shipment of gold damaged in a fire delivery silver arrived
##   d1        1  1    1       1  1 1    1        0      0       0
##   d2        0  1    0       0  1 1    0        1      2       1
##   d3        1  1    1       0  1 1    0        0      0       1
## [ reached max_nfeat ... 1 more feature ]
library("quanteda.textmodels")
##
## Attaching package: 'quanteda.textmodels'
## The following object is masked from 'package:quanteda':
##
## data_dfm_lbgexample
mylsa <- textmodel_lsa(mydfm)
## Warning in fun(A, k, nu, nv, opts, mattype = "dgCMatrix"): all singular values
## are requested, svd() is used instead
The coordinates of the document vectors in the reduced 2-dimensional space are:
mylsa$docs[, 1:2]
##          [,1]       [,2]
## d1 -0.4944666  0.6491758
## d2 -0.6458224 -0.7194469
## d3 -0.5817355  0.2469149
A new, unseen document can now be represented in the same reduced 2-dimensional space. The unseen query document is:
# align the query's features with those of the fitted dfm
querydfm <- dfm_match(dfm(tokens("gold silver truck")), features = featnames(mydfm))
querydfm
## Document-feature matrix of: 1 document, 11 features (72.7% sparse).
##        features
## docs    shipment of gold damaged in a fire delivery silver arrived
##   text1        0  0    1       0  0 0    0        0      1       0
## [ reached max_nfeat ... 1 more feature ]
newq <- predict(mylsa, newdata = querydfm)
newq$docs_newspace[, 1:2]
## [1] -0.2140026 -0.1820571
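To make the fold-in step less of a black box, here is a minimal sketch that redoes it by hand with base R's `svd()`, without quanteda. It assumes the standard fold-in formula (query counts times feature coordinates, rescaled by the inverse singular values); the object names (`X`, `docs_k`, `feats_k`, `q_k`, `cosine`) are hypothetical and chosen only for this illustration. The cosine ranking at the end is one possible downstream task, not something the functions above compute.

```r
# Document-feature counts copied from the dfm printed above
# (rows: documents d1-d3; columns: the 11 features in dfm order).
feats <- c("shipment", "of", "gold", "damaged", "in", "a", "fire",
           "delivery", "silver", "arrived", "truck")
X <- rbind(
  d1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
  d2 = c(0, 1, 0, 0, 1, 1, 0, 1, 2, 1, 1),
  d3 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
)
colnames(X) <- feats

# Full SVD, X = U diag(d) V^T, then keep the first k = 2 dimensions.
k <- 2
dec <- svd(X)
docs_k <- dec$u[, 1:k]            # document coordinates (rows of U_k)
rownames(docs_k) <- rownames(X)
feats_k <- dec$v[, 1:k]           # feature coordinates (rows of V_k)
sk <- dec$d[1:k]                  # leading singular values

# Query "gold silver truck" as a count vector over the same features.
q <- matrix(as.numeric(feats %in% c("gold", "silver", "truck")), nrow = 1)

# Fold-in: project the query onto the feature coordinates and rescale.
q_k <- q %*% feats_k %*% diag(1 / sk)

# Cosine similarity between the folded-in query and each document.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sims <- apply(docs_k, 1, cosine, b = as.vector(q_k))

abs(q_k)                          # compare magnitudes with newq$docs_newspace
sort(sims, decreasing = TRUE)     # documents ranked by similarity to the query
```

Singular vectors are only determined up to sign, so the coordinates from `svd()` may be mirrored relative to those printed above; their absolute values should agree. The cosine similarities are unaffected by such sign flips, and with the coordinates shown in this vignette the query should sit closest to d2, followed by d3 and then d1.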