Define the objects to be tested

Define the objects:

txt <- c(d1 = "a b c d e g h",  d2 = "a b e g h i j")
toks_uni <- tokens(txt)
dfm_uni <- dfm(toks_uni)
toks_bi <- tokens(txt, n = 2, concatenator = " ")
dfm_bi <- dfm(toks_bi)

char_uni <- c("a", "b", "g", "j")
char_bi <- c("a b", "g j")
list_uni <- list("a", "b", "g", "j")
list_bi <- list("a b", "g j")
(dict_uni <- dictionary(one = c("a", "b"), two = c("g", "j")))
## Dictionary object with 2 key entries.
## - one:
##   - a, b
## - two:
##   - g, j
(dict_bi <- dictionary(one = "a b", two = "g j"))
## Dictionary object with 2 key entries.
## - one:
##   - a b
## - two:
##   - g j
(coll_bi <- textstat_collocations(toks_uni, method = "lr", max_size = 2))
##   collocation length count       G2
## 1         a b      2     2 10.81347
## 2         e g      2     2 10.81347
## 3         g h      2     2 10.81347
(coll_tri <- textstat_collocations(toks_uni, method = "lr", min_size = 3, max_size = 3))
##   collocation length count       G2
## 1       e g h      3     2 20.01610
## 2         a b      2     2 10.81347
## 3         e g      2     2 10.81347
## 4         g h      2     2 10.81347

tokens_select() (includes tokens_remove())

With character objects, or lists of characters, it does not work on whitespace-separated sequences:

# as expected
tokens_select(toks_uni, char_uni)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" "g"
## 
## d2 :
## [1] "a" "b" "g" "j"
tokens_select(toks_uni, list_uni)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" "g"
## 
## d2 :
## [1] "a" "b" "g" "j"
# not as expected
tokens_select(toks_uni, char_bi)
## tokens from 2 documents.
## d1 :
## [1] "a" "b"
## 
## d2 :
## [1] "a" "b"
tokens_select(toks_uni, list_bi)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
# as expected
tokens_select(toks_bi, char_uni)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
# not as expected
tokens_select(toks_bi, list_uni)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
tokens_select(toks_bi, char_bi)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
tokens_select(toks_bi, list_bi)
## tokens from 2 documents.
## d1 :
## [1] "a b"
## 
## d2 :
## [1] "a b"

With dictionary objects:

# as expected
tokens_select(toks_uni, dict_uni)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" "g"
## 
## d2 :
## [1] "a" "b" "g" "j"
tokens_select(toks_bi, dict_uni)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
# not as expected
tokens_select(toks_uni, dict_bi)
## tokens from 2 documents.
## d1 :
## [1] "a" "b"
## 
## d2 :
## [1] "a" "b"
tokens_select(toks_bi, dict_bi)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)

With collocations objects:

# definitely not expected
tokens_select(toks_uni, coll_bi)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" "e" "g" "h"
## 
## d2 :
## [1] "a" "b" "e" "g" "h"
tokens_select(toks_uni, coll_tri)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" "e" "g" "h"
## 
## d2 :
## [1] "a" "b" "e" "g" "h"
# not expected
tokens_select(toks_bi, coll_bi)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
# expected
tokens_select(toks_bi, coll_tri)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)

With dfm objects:

# fails as expected
tokens_select(toks_uni, dfm_uni)
## Error in features2list(features): features must be a character vector, a list of character elements, a dictionary, or collocations
tokens_select(toks_uni, dfm_bi)
## Error in features2list(features): features must be a character vector, a list of character elements, a dictionary, or collocations
tokens_select(toks_bi, dfm_uni)
## Error in features2list(features): features must be a character vector, a list of character elements, a dictionary, or collocations
tokens_select(toks_bi, dfm_bi)
## Error in features2list(features): features must be a character vector, a list of character elements, a dictionary, or collocations

tokens_compound() and selecting token sequences

I understand that an important application is selecting token sequences, and this is the core purpose of tokens_compound().

Currently we look for sequences in the same way we use the keywords in kwic():

tokens_select(toks_uni, "c d e", padding = TRUE)
## tokens from 2 documents.
## d1 :
## [1] ""  ""  "c" "d" "e" ""  "" 
## 
## d2 :
## character(0)

For list objects:

# does not work as with character
tokens_select(toks_uni, list("c d e"), padding = TRUE)
## tokens from 2 documents.
## d1 :
## character(0)
## 
## d2 :
## character(0)
# this works for a sequence
tokens_select(toks_uni, list("a", "h", c("c", "d", "e")), padding = TRUE)
## tokens from 2 documents.
## d1 :
## [1] "a" ""  "c" "d" "e" ""  "h"
## 
## d2 :
## [1] "a" ""  ""  ""  "h" ""  ""

With ambiguous results when the sequence comes from collocations:

tokens_select(toks_uni, coll_bi, padding = TRUE)
## tokens from 2 documents.
## d1 :
## [1] "a" "b" ""  ""  "e" "g" "h"
## 
## d2 :
## [1] "a" "b" "e" "g" "h" ""  ""
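
The padded selections above can be reproduced with a small base R sketch, which may help when reasoning about what "matching a sequence" should mean. This is not the quanteda implementation: select_with_padding() is a hypothetical helper, it only does exact (fixed) matching, and it assumes each pattern is a character vector with one element per token position.

```r
# Hypothetical helper, for illustration only (exact matching, no glob/regex).
# `toks` is a character vector of tokens; `patterns` is a list of character
# vectors, each of which must match a run of consecutive tokens.
select_with_padding <- function(toks, patterns) {
  keep <- logical(length(toks))
  for (pat in patterns) {
    n <- length(pat)
    if (n == 0 || n > length(toks)) next
    for (i in seq_len(length(toks) - n + 1)) {
      idx <- i:(i + n - 1)
      # mark every position covered by a full match of the sequence
      if (all(toks[idx] == pat)) keep[idx] <- TRUE
    }
  }
  # unmatched positions become empty strings, as with padding = TRUE
  ifelse(keep, toks, "")
}

d1 <- c("a", "b", "c", "d", "e", "g", "h")
d2 <- c("a", "b", "e", "g", "h", "i", "j")
pats <- list("a", "h", c("c", "d", "e"))

select_with_padding(d1, pats)
select_with_padding(d2, pats)
```

With these inputs the sketch gives the same padded vectors as the list example above: the sequence c("c", "d", "e") matches only in d1, while "a" and "h" match in both documents.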

Recommendations

  1. Never split character elements, so that a features value of "a b" will only match the feature "a b", not "a" and "b".

  2. Eliminate the list type of input.

  3. Convert dictionaries by unlisting them, and using the compound elements as matched, if found, e.g.

    unlist(dict_bi, use.names = FALSE)
    ## [1] "a b" "g j"
  4. Use the collocations object as a character vector “as is” from the collocation element, e.g. the first column of

    coll_bi
    ##   collocation length count       G2
    ## 1         a b      2     2 10.81347
    ## 2         e g      2     2 10.81347
    ## 3         g h      2     2 10.81347
  5. Make it clear that a dfm object cannot be an input to token feature selection, although featnames(dfm_uni) (e.g.) could. The tokens_* methods do not currently accept dfm objects as features anyway, but the ?features page suggests they do.

  6. Define a new object type called sequence, which is essentially a list of character vectors in which each element of a vector is matched, in order, against one token of a sequence. This is how the list behaviour for features is supposed to work now. I propose we stop allowing a plain list altogether, and define a replacement for list() called sequence() that creates a special list, similar to what we do with dictionary() at the moment. This makes explicit the distinction between characters that contain whitespace and lists of characters. It also means a) the definition as a sequence is more explicit, and b) we have full control over the definition of this object, whereas with a plain list we are far more limited. We could put a concatenator slot in this list, for instance, so that it takes that as a default. (Most importantly, we can decide on this and modify it later.)

    I propose to use the sequence class for multi-word kwic matches too. (See below.)

  7. tokens_compound() should take only sequences and collocations inputs for features.
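
To make recommendations 1, 3, 4 and 6 concrete, here is a minimal base R sketch of what the proposed object and the input conversion might look like. Everything in it is an assumption for discussion: the constructor is named token_sequence() rather than sequence() only so the sketch does not mask base::sequence(), as_pattern() is a hypothetical helper, a plain named list stands in for a dictionary, and a data.frame with a collocation column stands in for a collocations object.

```r
# Hypothetical constructor for the proposed sequence type (illustration only).
# Each argument is a character vector with one pattern per token position.
token_sequence <- function(..., concatenator = "_") {
  x <- list(...)
  stopifnot(length(x) > 0, all(vapply(x, is.character, logical(1))))
  structure(x, class = c("token_sequence", "list"), concatenator = concatenator)
}

is_token_sequence <- function(x) inherits(x, "token_sequence")

# Hypothetical input conversion: sequences pass through untouched, everything
# else becomes a flat character vector of literal patterns. Whitespace inside
# an element is never split.
as_pattern <- function(x) {
  if (is_token_sequence(x)) return(x)
  # stand-in for a collocations object: use the collocation column "as is"
  if (is.data.frame(x) && "collocation" %in% names(x)) {
    return(as.character(x$collocation))
  }
  # stand-in for a dictionary: unlist the values and drop the keys
  if (is.list(x)) return(unlist(x, use.names = FALSE))
  as.character(x)
}

seq_cde <- token_sequence(c("c", "d", "e"))
dict_like <- list(one = "a b", two = "g j")
coll_like <- data.frame(collocation = c("a b", "e g", "g h"),
                        stringsAsFactors = FALSE)

as_pattern("a b")       # stays one literal pattern
as_pattern(dict_like)   # values only, multi-word values kept whole
as_pattern(coll_like)   # first column as a character vector
as_pattern(seq_cde)     # returned unchanged, with its concatenator attribute
```

The point of the design is that whether something is a sequence is decided by its class, not by guessing from whitespace or from whether the input happens to be a list.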

kwic() based token selection

The kwic() function is a bit different, since it takes not a features argument but rather a keywords argument. We currently tokenize the elements of the input to allow the white space to separate pattern matches. But it’s still not working as expected with character or lists of character:

# not expected
kwic(txt, char_uni)
##                                       
##  [d1, 1:2]           | a b | c d e g h
##  [d1, 6:6] a b c d e |  g  | h        
##  [d2, 1:2]           | a b | e g h i j
##  [d2, 4:4]     a b e |  g  | h i j    
##  [d2, 7:7] b e g h i |  j  |
kwic(txt, list_uni)
##                                       
##  [d1, 1:2]           | a b | c d e g h
##  [d1, 6:6] a b c d e |  g  | h        
##  [d2, 1:2]           | a b | e g h i j
##  [d2, 4:4]     a b e |  g  | h i j    
##  [d2, 7:7] b e g h i |  j  |
# missing the "g h"
kwic(txt, char_bi)
##                              
##  [d1, 1:2]  | a b | c d e g h
##  [d2, 1:2]  | a b | e g h i j
# should this return some match?
kwic(txt, list_bi)
## NULL

Behaviour with collocations objects is even weirder, as these return identical results:

kwic(txt, coll_bi)
##                                           
##  [d1, 1:2]         |    a b    | c d e g h
##  [d1, 5:7] a b c d |   e g h   |          
##  [d2, 1:5]         | a b e g h | i j
kwic(txt, coll_tri)
##                                           
##  [d1, 1:2]         |    a b    | c d e g h
##  [d1, 5:7] a b c d |   e g h   |          
##  [d2, 1:5]         | a b e g h | i j

With dictionary objects:

# should not be an "a b"
kwic(txt, dict_uni)
##                                       
##  [d1, 1:2]           | a b | c d e g h
##  [d1, 6:6] a b c d e |  g  | h        
##  [d2, 1:2]           | a b | e g h i j
##  [d2, 4:4]     a b e |  g  | h i j    
##  [d2, 7:7] b e g h i |  j  |
# missing the "g h"
kwic(txt, dict_bi)
##                              
##  [d1, 1:2]  | a b | c d e g h
##  [d2, 1:2]  | a b | e g h i j

With dfm objects supplied for keywords, kwic fails (as it should).

Recommendations

We could:

  1. require sequences to be wrapped in sequence(), e.g. sequence(c("nuclear", "power|energy|war")), or

  2. Continue to make an exception for kwic(), so that we split any keywords expressions on whitespace and consider the elements as separate matches. But this should never make c("a", "b") the same as "a b". This is consistent with the existing help page ?kwic, but I don’t think it’s implemented correctly.

  3. Do both. I propose this option for now, while encouraging users to use sequence().

I propose that if we allow a list input, then we unlist it, and also unlist a dictionary input, to treat each character (or value, for a dictionary) object the same. But I’d prefer not to allow list inputs, except for dictionaries.
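
A small base R sketch of option 2 may help pin down the intended behaviour: each keyword element is split on whitespace into one sequence, so c("a", "b") and "a b" are never equivalent. The helpers split_keywords() and match_sequence() are hypothetical, use exact matching only, and are not how kwic() is implemented.

```r
# Hypothetical helpers, for illustration only.
# Split each keyword on whitespace: "a b" becomes one two-token sequence,
# while c("a", "b") stays two separate one-token patterns.
split_keywords <- function(keywords) {
  strsplit(keywords, "\\s+")
}

# Return the start positions at which the token sequence `pat` occurs in `toks`.
match_sequence <- function(toks, pat) {
  n <- length(pat)
  if (n == 0 || n > length(toks)) return(integer(0))
  starts <- seq_len(length(toks) - n + 1)
  hit <- vapply(starts, function(i) all(toks[i:(i + n - 1)] == pat), logical(1))
  starts[hit]
}

toks_d1 <- c("a", "b", "c", "d", "e", "g", "h")

# one sequence of length two: a single match starting at position 1
lapply(split_keywords("a b"), match_sequence, toks = toks_d1)

# two separate keywords: "a" at position 1 and "b" at position 2
lapply(split_keywords(c("a", "b")), match_sequence, toks = toks_d1)

# "g j" never occurs as adjacent tokens in d1, so no match
lapply(split_keywords("g j"), match_sequence, toks = toks_d1)
```

Under this reading, kwic(txt, char_uni) would report "a" and "b" as separate one-token matches rather than a single "a b" span.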

dfm_select() (includes dfm_remove())

With character objects, or lists of characters, it does not work on whitespace-separated sequences:

# as expected
dfm_select(dfm_uni, char_uni)
## Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
##     features
## docs a b g j
##   d1 1 1 1 0
##   d2 1 1 1 1
dfm_select(dfm_uni, list_uni)
## Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
##     features
## docs a b g j
##   d1 1 1 1 0
##   d2 1 1 1 1
dfm_select(dfm_uni, char_bi)
## NULL
dfm_select(dfm_uni, list_bi)
## NULL
dfm_select(dfm_bi, char_uni)
## NULL
dfm_select(dfm_bi, list_uni)
## NULL
dfm_select(dfm_bi, char_bi)
## Document-feature matrix of: 2 documents, 1 feature (0% sparse).
## 2 x 1 sparse Matrix of class "dfmSparse"
##     features
## docs a b
##   d1   1
##   d2   1
dfm_select(dfm_bi, list_bi)
## Document-feature matrix of: 2 documents, 1 feature (0% sparse).
## 2 x 1 sparse Matrix of class "dfmSparse"
##     features
## docs a b
##   d1   1
##   d2   1
dfm_select(dfm_uni, dict_uni)
## Document-feature matrix of: 2 documents, 4 features (12.5% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
##     features
## docs a b g j
##   d1 1 1 1 0
##   d2 1 1 1 1
dfm_select(dfm_bi, dict_uni)
## NULL

There is no reason this should not work, if we unlist it and use the multi-word dictionary values as literal matches, as I propose above for tokens:

# not as expected - I think the error message is unintended (a bug iow)
dfm_select(dfm_uni, dict_bi)
## Error in dfm_select.dfm(dfm_uni, dict_bi): dfm_select not implemented for ngrams > 1 and multi-word dictionary values
# as expected
dfm_select(dfm_bi, dict_bi)
## Document-feature matrix of: 2 documents, 1 feature (0% sparse).
## 2 x 1 sparse Matrix of class "dfmSparse"
##     features
## docs a b
##   d1   1
##   d2   1

With collocations objects:

# return is as expected but should not issue a warning
dfm_select(dfm_uni, coll_bi)
## Warning in features2vector(as.list(features)): rmarkdown::render("/
## Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/tests/misc/
## features_comments.Rmd", encoding = "UTF-8") does not support multi-word
## features
## NULL
dfm_select(dfm_uni, coll_tri)
## Warning in features2vector(as.list(features)): rmarkdown::render("/
## Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/tests/misc/
## features_comments.Rmd", encoding = "UTF-8") does not support multi-word
## features
## NULL
# not expected
dfm_select(dfm_bi, coll_bi)
## Warning in features2vector(as.list(features)): rmarkdown::render("/
## Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/tests/misc/
## features_comments.Rmd", encoding = "UTF-8") does not support multi-word
## features
## Document-feature matrix of: 2 documents, 3 features (0% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##     features
## docs a b e g g h
##   d1   1   1   1
##   d2   1   1   1
# not as expected and is incorrect
dfm_select(dfm_bi, coll_tri)
## Warning in features2vector(as.list(features)): rmarkdown::render("/
## Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/tests/misc/
## features_comments.Rmd", encoding = "UTF-8") does not support multi-word
## features
## Document-feature matrix of: 2 documents, 3 features (0% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##     features
## docs a b e g g h
##   d1   1   1   1
##   d2   1   1   1

With dfm objects:

# as expected
dfm_select(dfm_uni, dfm_uni[, 1:3])
## Document-feature matrix of: 2 documents, 3 features (16.7% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##     features
## docs a b c
##   d1 1 1 1
##   d2 1 1 0
dfm_select(dfm_bi, dfm_bi[, 1:3])
## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##     features
## docs a b b c c d
##   d1   1   1   1
##   d2   1   0   0
# as expected, although unintuitive
dfm_select(dfm_uni, dfm_bi[, 1:3])
## Document-feature matrix of: 2 documents, 3 features (100% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##    a b b c c d
## d1   0   0   0
## d2   0   0   0
dfm_select(dfm_bi, dfm_uni[, 1:3])
## Document-feature matrix of: 2 documents, 3 features (100% sparse).
## 2 x 3 sparse Matrix of class "dfmSparse"
##    a b c
## d1 0 0 0
## d2 0 0 0

It would be worth testing these with wildcarded features too.

Recommendations

The multi-word feature values for dfm behave as I would expect them to. Most of the behaviors are ok as is. Still, we need to:

  1. Check and fix the warnings and error messages. It should be perfectly ok now to use multi-word features for a match (but only one element per feature).

  2. Change the way that collocations are treated, to match my suggestion above.
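
As a rough illustration of both points, the following base R sketch treats multi-word features as literal names and takes collocations "as is" from their collocation column. A plain matrix stands in for dfm_bi and a data.frame stands in for the collocations object; select_features() is a hypothetical helper with exact matching only, not the dfm_select() implementation.

```r
# Stand-in for dfm_bi: a plain matrix whose column names are the bigram
# features of the two example documents.
m_bi <- matrix(
  c(1, 1, 1, 1, 1, 1, 0, 0, 0,
    1, 0, 0, 0, 1, 1, 1, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    docs = c("d1", "d2"),
    features = c("a b", "b c", "c d", "d e", "e g", "g h", "b e", "h i", "i j")
  )
)

# Hypothetical helper: keep the columns whose names literally equal a pattern.
# Multi-word patterns are never split on whitespace.
select_features <- function(m, pattern) {
  keep <- colnames(m) %in% pattern
  m[, keep, drop = FALSE]
}

# Stand-in for coll_bi: only the collocation column matters here.
coll_like <- data.frame(collocation = c("a b", "e g", "g h"),
                        stringsAsFactors = FALSE)

select_features(m_bi, c("a b", "g j"))         # only "a b" exists as a feature
select_features(m_bi, coll_like$collocation)   # the three bigram collocations
select_features(m_bi, "e g h")                 # a trigram matches no bigram feature
```

No warning is needed in any of these cases: a multi-word pattern either names a feature or it does not.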

Overall Recommendations

  1. We change all three arguments to simply pattern.

    We are mixing up the definitions of tokens with features by having the same argument name in tokens_select() and dfm_select(). Should we consider an alternative that clearly covers both? The name features is more appropriate for dfm_select(), but even then it’s not really a vector of features, but rather a set of objects that contain pattern matches for features.

    What if we adopted the stringi convention and called this pattern? This would also make it compatible if down the road we use the ?stringr::modifiers syntax. We could use this for kwic too, but make it clear (for kwic only!) that we will consider whitespace boundaries in the pattern as a separator for sequences of tokens on which we match the patterns element by element for the sequence. For all other applications of pattern, we consider white space as part of the pattern.

    We would need to implement a deprecation for the features and keywords arguments if we do this, but if you look at dfm() you will see how this is readily done.

  2. We implement all of the above expectations as unit tests, once their behaviour has been fixed.
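
For the deprecation mentioned in item 1, one possible shape is sketched below in base R. It is not necessarily how dfm() handles its own deprecated arguments; select_sketch() is a hypothetical function operating on a plain character vector, used only to show the argument handling.

```r
# Hypothetical function showing one way to deprecate `features` in favour of
# `pattern` (illustration only, exact matching on a character vector).
select_sketch <- function(x, pattern, features = NULL) {
  if (!is.null(features)) {
    if (!missing(pattern)) {
      stop("supply only one of 'pattern' and 'features'")
    }
    warning("argument 'features' is deprecated; use 'pattern' instead",
            call. = FALSE)
    pattern <- features
  }
  x[x %in% pattern]
}

toks <- c("a", "b", "c")

select_sketch(toks, pattern = c("a", "c"))    # new argument name, no warning
select_sketch(toks, features = c("a", "c"))   # old name still works, with a warning
```

The same pattern would apply to the keywords argument of kwic(); the old names could be dropped after a release or two.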