Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.
fcm(x, context = c("document", "window"), count = c("frequency", "boolean", "weighted"), window = 5L, weights = 1L, ordered = FALSE, span_sentence = TRUE, tri = TRUE, ...)
x | character, corpus, tokens, or dfm object from which to generate the feature co-occurrence matrix |
---|---|
context | the context in which to consider term co-occurrence:
|
count | how to count co-occurrences:
|
window | positive integer value for the size of a window on either side of the target feature, default is 5, meaning 5 words before and after the target feature |
weights | a vector of weights applied to each distance from
|
ordered | if |
span_sentence | if |
tri | if |
... | not used here |
The function fcm provides a very general implementation of a "context-feature" matrix, consisting of a count of feature co-occurrence within a defined context. This context, following Momtazi et al. (2010), can be defined as the document, sentences within documents, syntactic relationships between features (nouns within a sentence, for instance), or according to a window. When the context is a window, a weighting function is typically applied that is a function of distance from the target word (see Jurafsky and Martin 2015, Ch. 16) and ordered co-occurrence of the two features is considered (see Church & Hanks 1990). fcm provides all of this functionality, returning a \(V \times V\) matrix (where \(V\) is the vocabulary size, returned by nfeature). With tri = TRUE, only the upper triangle of the matrix is returned.
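To make the window context concrete, here is a minimal base-R sketch of distance-weighted window counting. It is not the package's implementation: `window_cooc()` and its arguments are hypothetical names introduced only for illustration, and the inverse-distance weights in the second call are an assumption chosen because they agree with the weighted figures shown in the Examples for the same input.

```r
# Hypothetical helper, not part of quanteda. For every pair of token
# positions at most `window` apart, it adds weights[d] to the cell for that
# pair of features, where d is the distance between the two positions.
window_cooc <- function(toks, window = 5L, weights = rep(1, window),
                        ordered = FALSE) {
  feats <- unique(toks)  # features in order of first appearance
  m <- matrix(0, length(feats), length(feats), dimnames = list(feats, feats))
  n <- length(toks)
  for (d in seq_len(max(0L, min(window, n - 1L)))) {
    for (i in seq_len(n - d)) {
      idx <- match(c(toks[i], toks[i + d]), feats)
      # unordered counting: fold each pair into the upper triangle
      if (!ordered) idx <- sort(idx)
      m[idx[1], idx[2]] <- m[idx[1], idx[2]] + weights[d]
    }
  }
  m
}

toks <- strsplit("A D A C E A D F E B A C E D", " ")[[1]]

# plain counts within two positions
counts <- window_cooc(toks, window = 2)

# assumed inverse-distance weights 1, 1/2, 1/3
weighted <- window_cooc(toks, window = 3, weights = 1 / (1:3))

# ordered counting: rows are the earlier feature, columns the later one
ord <- window_cooc(toks, window = 3, weights = c(3, 2, 1), ordered = TRUE)
```

In this sketch a feature that falls within the window of another occurrence of itself is counted like any other pair, so the diagonal can be non-zero (here `counts["A", "A"]` is 1).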
Unlike some implementations of co-occurrences, fcm counts feature
co-occurrences with themselves, meaning that the diagonal will not be zero.
fcm also provides "boolean" counting within the context of "window",
which differs from the counting within "document".
is.fcm(x) returns TRUE if and only if x is an object of type fcm.
Momtazi, S., Khudanpur, S., & Klakow, D. (2010). "A comparative study of word co-occurrence for term clustering in language model-based sentence retrieval." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, June 2010, pp. 325-328.
Jurafsky, D., & Martin, J. H. (2015). Speech and Language Processing. Draft of April 11, 2016. Chapter 16, Semantics with Dense Vectors.
Church, K. W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1):22–29.
# see http://bit.ly/29b2zOA
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
#> Feature co-occurrence matrix of: 6 by 6 features.
#> 6 x 6 sparse Matrix of class "fcm"
#>         features
#> features A D C E F B
#>        A 1 3 3 4 1 1
#>        D . . 2 3 1 .
#>        C . . . 2 . 1
#>        E . . . . 1 1
#>        F . . . . . 1
#>        B . . . . . .
fcm(txt, context = "window", count = "weighted", window = 3)
#> Feature co-occurrence matrix of: 6 by 6 features.
#> 6 x 6 sparse Matrix of class "fcm"
#>         features
#> features         A        D        C        E         F         B
#>        A 0.8333333 3.333333 2.833333 2.833333 0.8333333 1.0000000
#>        D         .        . 1.333333 2.333333 1.0000000 0.3333333
#>        C         .        .        . 2.333333         . 0.5000000
#>        E         .        .        .        . 1.3333333 1.3333333
#>        F         .        .        .        .         . 0.5000000
#>        B         .        .        .        .         .         .
fcm(txt, context = "window", count = "weighted", window = 3, weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
#> Feature co-occurrence matrix of: 6 by 6 features.
#> 6 x 6 sparse Matrix of class "fcm"
#>         features
#> features A D C E F B
#>        A 3 7 7 5 2 0
#>        D 3 0 2 3 3 1
#>        C 2 3 0 6 0 0
#>        E 5 5 1 0 1 3
#>        F 1 0 0 3 0 2
#>        B 3 0 2 1 0 0

# with multiple documents
txts <- c("a a a b b c", "a a c e", "a c e f g")
fcm(txts, context = "document", count = "frequency")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
fcm(txts, context = "document", count = "boolean")
#> Feature co-occurrence matrix of: 6 by 6 features.
#> 6 x 6 sparse Matrix of class "fcm"
#>         features
#> features a b c e f g
#>        a 2 1 3 2 1 1
#>        b . 1 1 . . .
#>        c . . 0 2 1 1
#>        e . . . 0 1 1
#>        f . . . . 0 1
#>        g . . . . . 0
fcm(txts, context = "window", window = 2)
#> Feature co-occurrence matrix of: 6 by 6 features.
#> 6 x 6 sparse Matrix of class "fcm"
#>         features
#> features a b c e f g
#>        a 4 3 3 2 . .
#>        b . 1 2 . . .
#>        c . . . 2 1 .
#>        e . . . . 1 1
#>        f . . . . . 1
#>        g . . . . . .

# from tokens
txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
fcm(toks, context = "document")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
fcm(toks, context = "window", window = 3)
#> Feature co-occurrence matrix of: 10 by 10 features.
#> 10 x 10 sparse Matrix of class "fcm"
#>         features
#> features the quick brown fox jumped over lazy dog and ate
#>      the   .     1     1   3      3    1    1   2   2   1
#>    quick   .     .     1   1      1    .    .   .   .   .
#>    brown   .     .     .   1      1    1    .   .   .   .
#>      fox   .     .     .   .      1    1    .   .   1   1
#>   jumped   .     .     .   .      .    1    1   1   1   1
#>     over   .     .     .   .      .    .    1   1   .   .
#>     lazy   .     .     .   .      .    .    .   1   .   .
#>      dog   .     .     .   .      .    .    .   .   1   1
#>      and   .     .     .   .      .    .    .   .   .   1
#>      ate   .     .     .   .      .    .    .   .   .   .