pattern2id.Rd
pattern2id converts regex or glob patterns to type IDs so that C++ functions
can perform fast searches in a tokens object. The C++ functions use the list
of type IDs to construct a hash table, against which sub-vectors of the
tokens object are matched. This function also constructs an index of glob
patterns for faster matching.
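The hash-table lookup described above can be sketched in plain R. This is only an illustration of the idea, not the package's C++ code: the data, the `find_matches()` helper, and the use of an environment as the hash table are all assumptions made for the example.

```r
# Hypothetical data: tokens already encoded as integer type IDs.
tokens <- c(1L, 2L, 3L, 1L, 2L, 4L)
ids <- list(c(1L, 2L), 3L)   # patterns expressed as sequences of type IDs

# Hash table keyed by the ID sequence (an environment stands in for the
# C++ hash table).
table <- new.env(hash = TRUE)
for (id in ids) assign(paste(id, collapse = " "), TRUE, envir = table)

# Hypothetical helper: slide a window of each pattern length over the
# tokens and look each sub-vector up in the hash table.
find_matches <- function(tokens, table, lens) {
    hits <- list()
    for (len in lens) {
        if (len > length(tokens)) next
        for (i in seq_len(length(tokens) - len + 1L)) {
            key <- paste(tokens[i:(i + len - 1L)], collapse = " ")
            if (exists(key, envir = table, inherits = FALSE))
                hits[[length(hits) + 1L]] <- c(start = i, length = len)
        }
    }
    do.call(rbind, hits)   # one row per match: start position and length
}

matches <- find_matches(tokens, table, unique(lengths(ids)))
```

Each window costs one lookup regardless of how many patterns were supplied, which is why patterns are expanded to type IDs up front.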
pattern2fixed converts regex and glob patterns to fixed patterns.
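As a rough base-R sketch of what converting glob patterns to fixed patterns involves, assuming each element of a multi-word pattern is matched against the types separately and the matches are then combined. `glob_to_fixed()` and `expand_sequence()` are hypothetical helpers written for this illustration, not functions from the package.

```r
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# Hypothetical helper: the types matched by a single glob pattern.
glob_to_fixed <- function(pattern, types, case_insensitive = TRUE) {
    rx <- utils::glob2rx(pattern)   # e.g. "a*" becomes "^a"
    types[grepl(rx, types, ignore.case = case_insensitive)]
}

# Hypothetical helper: expand a multi-word pattern into every combination
# of matched types, one character vector per combination.
expand_sequence <- function(pattern, types) {
    matched <- lapply(pattern, glob_to_fixed, types = types)
    grid <- expand.grid(matched, stringsAsFactors = FALSE)
    lapply(seq_len(nrow(grid)),
           function(i) unlist(grid[i, ], use.names = FALSE))
}

fixed <- expand_sequence(c("a*", "b*"), types)
```

A pattern element with no matching type produces no combinations at all, so the whole sequence is dropped in this sketch.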
index_types is an auxiliary function for pattern2id that constructs an index
of "glob" or "fixed" patterns to avoid expensive sequential search. For
example, a type "cars" is indexed by keys "cars", "car?", "c*", "ca*", "car*"
and "cars*" when valuetype = "glob".
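The keys in the "cars" example can be reproduced with a few lines of base R. This sketch generates only the keys listed above and should not be read as the package's implementation; `glob_keys()` and the sample types are hypothetical.

```r
# Hypothetical helper: the index keys listed above for a single type.
glob_keys <- function(type) {
    n <- nchar(type)
    prefixes <- substring(type, 1, seq_len(n))   # "c", "ca", "car", "cars"
    c(type,                                      # the type itself
      paste0(substr(type, 1, n - 1), "?"),       # last character as "?"
      paste0(prefixes, "*"))                     # every prefix followed by "*"
}

keys_cars <- glob_keys("cars")

# Build an index mapping each key to the IDs of the types it covers.
types <- c("cars", "cart", "dog")
keys <- lapply(types, glob_keys)
index <- split(rep(seq_along(types), lengths(keys)), unlist(keys))

index[["car*"]]   # IDs of the types indexed under "car*"
```

Looking up a glob pattern then becomes a single named lookup in `index` instead of a regular-expression scan over all types.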
pattern2id(pattern, types, valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE, keep_nomatch = FALSE)

pattern2fixed(pattern, types, valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE, keep_nomatch = FALSE)

index_types(types, valuetype, case_insensitive, max_len = NULL)
Argument | Description |
---|---|
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
types | unique types of tokens obtained by types() |
valuetype | the type of pattern matching: "glob", "fixed", or "regex" |
case_insensitive | if TRUE, ignore case when matching |
keep_nomatch | keep patterns not found |
max_len | maximum length of types to be indexed |
pattern2id returns a list of integer vectors containing type IDs.

pattern2fixed returns a list of character vectors containing types.

index_types returns a list of integer vectors containing type IDs, with index
keys as an attribute.
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 1 4
#>
#> [[3]]
#> [1] 1 5
#>
#> [[4]]
#> [1] 6
#>
#> [[5]]
#> [1] 7
#>

pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 2 3
#>
#> [[3]]
#> [1] 1 4
#>
#> [[4]]
#> [1] 2 4
#>
#> [[5]]
#> [1] 1 5
#>
#> [[6]]
#> [1] 2 5
#>
#> [[7]]
#> [1] 6
#>

pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)
#> [[1]]
#> [1] "A" "B"
#>
#> [[2]]
#> [1] "A"  "BB"
#>
#> [[3]]
#> [1] "A"   "BBB"
#>
#> [[4]]
#> [1] "C"
#>
#> [[5]]
#> [1] "CC"
#>

index <- index_types(c("xxx", "yyyy", "ZZZ"), "glob", FALSE, 3)
quanteda:::search_glob("yy*", attr(index, "type_search"), index)
#> [1] 2