p_attribute_encode.Rd
Pure R implementation to generate a positional attribute from a character vector of tokens (the token stream).
p_attribute_encode(token_stream, p_attribute = "word", registry_dir, corpus, data_dir, method = c("R", "CWB"), verbose = TRUE, encoding = get_encoding(token_stream), compress = NULL)
p_attribute_recode(data_dir, p_attribute, from = c("UTF-8", "latin1"), to = c("UTF-8", "latin1"))
Argument | Description |
---|---|
token_stream | A character vector with the tokens of the corpus. |
p_attribute | The positional attribute. |
registry_dir | Registry directory (needed by |
corpus | The CWB corpus (needed by |
data_dir | The data directory for the corpus with the binary files. |
method | Either 'CWB' or 'R'. |
verbose | Logical. |
encoding | Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8'). |
compress | Logical. |
from | Character string describing the current encoding of the attribute. |
to | Character string describing the target encoding of the attribute. |
Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode. Second, create the reverse index using p_attribute_makeall. Third, compress the token stream using p_attribute_huffcode. Fourth, compress the index files using p_attribute_compress_rdx.
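The first two steps above can be illustrated with a small base R sketch. It is not the package's implementation and writes no CWB files; it only shows the idea of mapping tokens to integer ids through a lexicon and then inverting that mapping. The toy token vector and all variable names are made up for illustration, and the zero-based ids in order of first occurrence are an assumption about how the lexicon is numbered.

```r
# Conceptual sketch only: base R, no CWB binary files are written.
tokens <- c("oil", "prices", "fell", "as", "oil", "supply", "rose")

# Step 1 (encode): build a lexicon in order of first occurrence and
# replace every token by a zero-based integer id (assumed numbering).
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L

# Frequency of each lexicon entry, aligned with the lexicon order.
freqs <- tabulate(ids + 1L, nbins = length(lexicon))

# Step 2 (reverse index): for each id, the zero-based corpus positions
# at which it occurs.
cpos <- seq_along(tokens) - 1L
reverse_index <- split(cpos, factor(ids, levels = seq_along(lexicon) - 1L))

# Decoding the ids reproduces the original token stream.
decoded <- lexicon[ids + 1L]
```

Looking up reverse_index[["0"]] returns the positions of the first lexicon entry ("oil" in this toy input), which is the kind of lookup a reverse index exists to make cheap.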
The implementation for the first two steps (p_attribute_encode and p_attribute_makeall) is a pure R implementation (so far). These two steps are enough to use the CQP functionality. To run p_attribute_huffcode and p_attribute_compress_rdx, an installation of the CWB may be necessary.
See the CQP Corpus Encoding Tutorial (http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, "Indexing and compression without CWB/Perl").
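To give a feel for what an uncompressed binary token stream might look like on disk, here is a hedged base R sketch that writes integer ids to a file and reads them back. It assumes one 4-byte integer per token in big-endian (network) byte order; that layout is an assumption to verify against the tutorial, and the file name and variables are hypothetical. No RcppCWB function is called.

```r
# Assumption: ids are stored as 4-byte big-endian integers, one per token.
ids <- c(0L, 1L, 2L, 3L, 0L, 4L, 5L)

# Hypothetical file in a temporary location, not a real corpus data file.
corpus_file <- tempfile(fileext = ".corpus")

con <- file(corpus_file, open = "wb")
writeBin(ids, con, size = 4L, endian = "big")
close(con)

# Under the assumed layout the file holds 4 bytes per token.
file_size <- file.info(corpus_file)$size

con <- file(corpus_file, open = "rb")
ids_read <- readBin(con, what = integer(), n = length(ids), size = 4L, endian = "big")
close(con)

unlink(corpus_file)
```

Fixing the byte order explicitly, rather than relying on the platform default, is what would make such a file portable between machines.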
p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file also remains unchanged, so it is highly recommended to treat s_attribute_recode as a helper for corpus_recode, which recodes all s-attributes and all p-attributes and resets the encoding in the registry file.
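What recoding means at the level of a single string can be sketched with base R's iconv(). This is only an illustration of a latin1 to UTF-8 conversion, not what p_attribute_recode does internally; the sample value and variable names are invented for the example.

```r
# A latin1-encoded value, built from raw bytes so the example does not
# depend on the encoding of the script file (0xfc is u-umlaut in latin1).
value_latin1 <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6e, 0x63, 0x68, 0x65, 0x6e)))
Encoding(value_latin1) <- "latin1"

# Convert the bytes from latin1 to UTF-8.
value_utf8 <- iconv(value_latin1, from = "latin1", to = "UTF-8")

# The same characters occupy a different number of bytes after recoding.
bytes_before <- length(charToRaw(value_latin1))
bytes_after <- length(charToRaw(value_utf8))
```

Because the byte length of a value can change (7 bytes before, 8 after in this case), any index that stores byte offsets into the recoded file plausibly has to be rewritten as well, which would be consistent with the index being changed alongside the values.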
library(RcppCWB)

# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.

tokens <- readLines(system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt"))

# create new (and empty) directory structure
tmpdir <- normalizePath(tempdir(), winslash = "/")
if (.Platform$OS.type == "windows") tmpdir <- normalizePath(tmpdir, winslash = "/")
registry_tmp <- file.path(tmpdir, "registry", fsep = "/")
data_dir_tmp <- file.path(tmpdir, "data_dir", fsep = "/")
if (file.exists(file.path(data_dir_tmp, "word.corpus"))){
  file.remove(file.path(data_dir_tmp, "word.corpus"))
}
#> [1] TRUE
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp)

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens,
  p_attribute = "word",
  data_dir = data_dir_tmp,
  method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
)
#> === Makeall: processing corpus reuters ===
#> Registry directory: /private/var/folders/m_/431fjnbs1t32_62d35wvs7pr0000gp/T/RtmpF5cQGt/registry
#> ATTRIBUTE word
#> - lexicon OK
#> - frequencies OK
#> - token stream OK (COMPRESSED)
#> - index OK (COMPRESSED)
#> ========================================

regdata <- registry_data(
  id = "REUTERS",
  name = "Reuters Sample Corpus",
  home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"),
  p_attributes = "word"
)

regfile <- registry_file_write(
  data = regdata,
  corpus = "REUTERS",
  registry_dir = registry_tmp,
  data_dir = data_dir_tmp
)

if (cqp_is_initialized()) cqp_reset_registry(registry_tmp) else cqp_initialize(registry_tmp)
#> [1] TRUE
#> NULL

regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id("REUTERS", "word", registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, id = ids)
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
#>  [1] "prices for crude oil by 1.50 dlrs"
#>  [2] "light of falling oil product prices and"
#>  [3] "a weak crude oil market a company"
#>  [4] "line of U.S oil companies that have"
#>  [5] "days citing weak oil markets Reuter OPEC"
#>  [6] "current slide in oil prices oil industry"
#>  [7] "in oil prices oil industry analysts said"
#>  [8] "movement to higher oil prices was never"
#>  [9] "CERA Analysts and oil industry sources said"
#> [10] "faces is excess oil supply in world"