Read, process and write data on structural attributes.

s_attribute_encode(values, data_dir, s_attribute, corpus, region_matrix,
  method = c("R", "CWB"), registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding, delete = FALSE, verbose = TRUE)

s_attribute_recode(data_dir, s_attribute, from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1"))

s_attribute_files(s_attribute, data_dir)

s_attribute_get_values(s_attribute, data_dir)

s_attribute_get_regions(s_attribute, data_dir)

s_attribute_merge(x, y)

s_attribute_delete(corpus, s_attribute)

Arguments

values

A character vector with the values of the structural attribute.

data_dir

The data directory of the corpus, where the attribute files are written to or read from.

s_attribute

Atomic character vector, the name of the structural attribute.

corpus

A CWB corpus.

region_matrix

A two-column matrix with corpus positions.

method

Either 'R' or 'CWB'.

registry_dir

Path name of the registry directory.

encoding

Encoding of the data.

delete

Logical, whether a call to RcppCWB::cl_delete_corpus is performed.

verbose

Logical.

from

Character string describing the current encoding of the attribute.

to

Character string describing the target encoding of the attribute.

x

Data defining a first s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").

y

Data defining a second s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").

Details

In addition to using CWB functionality, the s_attribute_encode function includes a pure R implementation to add or modify structural attributes of an existing CWB corpus.

If the corpus has been loaded or used before, a new s-attribute may not be available until RcppCWB::cl_delete_corpus has been called. Set the argument delete to TRUE to have s_attribute_encode make this call.

s_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged, and it is highly recommended to consider s_attribute_recode as a helper for corpus_recode that will recode all s-attributes, all p-attributes, and will reset the encoding in the registry file.
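The following base R sketch illustrates why recoding the values also requires updating the index in the avx file. It assumes (this is not stated above) that the avs file stores the values as null-terminated strings and that the avx file refers to them by byte offset. The helper byte_offsets is hypothetical and not part of the package.

```r
# Values as they might be stored for an s-attribute (illustrative data).
values_utf8 <- c("K\u00f6ln", "Bonn", "D\u00fcsseldorf")
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Hypothetical helper: byte offset at which each value would start,
# assuming every value is followed by one terminating null byte.
byte_offsets <- function(values) {
  bytes <- nchar(values, type = "bytes") + 1L
  c(0L, cumsum(bytes)[-length(bytes)])
}

byte_offsets(values_latin1) # offsets under latin1
byte_offsets(values_utf8)   # offsets under UTF-8 (umlauts take two bytes)
```

Because non-ASCII characters occupy a different number of bytes in the two encodings, the offsets shift once the values are recoded, whereas the regions in the rng file are corpus positions and are unaffected.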

s_attribute_files will return a named character vector with the data files (extensions: "avs", "avx", "rng") in the directory indicated by data_dir for the structural attribute s_attribute.
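As a rough illustration of the return value described above, the sketch below builds such a named character vector in base R. The function s_attribute_files_sketch is hypothetical; it assumes the files are named after the s-attribute plus the extension, and it does not check whether the files exist, which the actual function may handle differently.

```r
# Hypothetical stand-in for illustration only, not the package function.
s_attribute_files_sketch <- function(s_attribute, data_dir) {
  extensions <- c("avs", "avx", "rng")
  # Assumed naming scheme: <s_attribute>.<extension> inside data_dir.
  files <- file.path(data_dir, paste(s_attribute, extensions, sep = "."))
  names(files) <- extensions
  files
}

files <- s_attribute_files_sketch(s_attribute = "id", data_dir = "/path/to/data")
files[["rng"]]
```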

s_attribute_get_values is equivalent to calling the CL function cl_struc2str for all strucs of a structural attribute. It is a "pure R" operation that is faster than using CL because it reads the files for the s-attribute in one pass. The return value is a character vector with all string values for the s-attribute.
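To make the idea of processing the files directly more concrete, here is a minimal base R sketch. It rests on assumptions about the file layout that are not spelled out above: the avs file is taken to hold null-terminated strings, and the avx file pairs of 4-byte big-endian integers (struc number, byte offset into the avs file). The function read_values_sketch and the toy files are hypothetical.

```r
# Write toy files in the assumed layout.
avs_file <- tempfile(fileext = ".avs")
avx_file <- tempfile(fileext = ".avx")
writeBin(c("ORG", "LOC"), con = avs_file) # null-terminated strings
# Strucs 0, 1, 2 pointing at byte offsets 0, 4, 0.
writeBin(c(0L, 0L, 1L, 4L, 2L, 0L), con = avx_file, size = 4L, endian = "big")

# Hypothetical reader: one value per struc, in struc order.
read_values_sketch <- function(avs_file, avx_file) {
  strings <- readBin(avs_file, what = character(), n = file.info(avs_file)$size)
  # Byte offset at which each string starts (string bytes plus one null byte).
  starts <- c(0L, cumsum(nchar(strings, type = "bytes") + 1L))[seq_along(strings)]
  avx <- readBin(
    avx_file, what = integer(), size = 4L,
    n = file.info(avx_file)$size / 4L, endian = "big"
  )
  offsets <- matrix(avx, ncol = 2L, byrow = TRUE)[, 2L]
  strings[match(offsets, starts)]
}

values <- read_values_sketch(avs_file, avx_file)
```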

s_attribute_get_regions will return a two-column integer matrix with regions for the strucs of a given s-attribute. Left corpus positions are in the first column, right corpus positions in the second column. The result is equivalent to calling RcppCWB::get_region_matrix for all strucs of an s-attribute, but may be somewhat faster: it is a "pure R" function that reads the file in one pass rather than looking up each struc separately.
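The sketch below shows how a region file could be turned into such a matrix with base R alone. It assumes that the rng file is a flat sequence of 4-byte big-endian integers, alternating left and right corpus positions; read_rng_sketch is a hypothetical helper and the toy file stands in for a real rng file.

```r
# Hypothetical reader for a file in the assumed rng layout.
read_rng_sketch <- function(rng_file) {
  n_int <- file.info(rng_file)$size / 4L
  cpos <- readBin(rng_file, what = integer(), size = 4L, n = n_int, endian = "big")
  # One row per struc: left position in column 1, right position in column 2.
  matrix(cpos, ncol = 2L, byrow = TRUE)
}

# Toy file with two regions: 0-9 and 10-24.
rng_file <- tempfile(fileext = ".rng")
writeBin(c(0L, 9L, 10L, 24L), con = rng_file, size = 4L, endian = "big")

regions <- read_rng_sketch(rng_file)
```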

s_attribute_merge combines two tables with regions for s-attributes, checking for intersections that may cause problems. The heuristic is to keep all non-intersecting annotations as well as those annotations that define the same region in object x and object y. Annotations of x and y that overlap uncleanly, i.e. without identical left and right corpus positions ("cpos_left" / "cpos_right"), are dropped. The intended scenario is to decode an s-attribute (using s_attribute_decode), mix in an additional annotation, and re-encode the enhanced s-attribute (using s_attribute_encode).
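The heuristic can be sketched in base R as follows. This is an illustration of the rule as described, not the package's implementation: merge_regions_sketch is a hypothetical function, it works on plain data frames, and it simply keeps both rows if x and y define the same region with different values, a case the description above does not cover.

```r
# Hypothetical illustration of the merge heuristic.
merge_regions_sketch <- function(x, y) {
  # TRUE for each row of `a` that overlaps a row of `b` without sharing
  # exactly the same left and right corpus position.
  unclean <- function(a, b) {
    vapply(seq_len(nrow(a)), function(i) {
      overlaps <- a$cpos_left[i] <= b$cpos_right & b$cpos_left <= a$cpos_right[i]
      same_region <- a$cpos_left[i] == b$cpos_left & a$cpos_right[i] == b$cpos_right
      any(overlaps & !same_region)
    }, logical(1))
  }
  kept <- unique(rbind(x[!unclean(x, y), ], y[!unclean(y, x), ]))
  kept <- kept[order(kept$cpos_left), ]
  rownames(kept) <- NULL
  kept
}

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
merged <- merge_regions_sketch(x, y)
```

With this input, the regions 10-12 / 11-12 and 20-21 / 20-22 overlap without being identical and are dropped, while 5-5 and 25-27 occur in both tables and are kept once.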

Function s_attribute_delete is not yet implemented.

See also

To decode a structural attribute, see s_attribute_decode.

Examples

require("RcppCWB")

registry_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "registry", fsep = "/"
)
data_dir_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "indexed_corpora", "reuters", fsep = "/"
)

corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)
#> |======================================================================| 100%

no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id", attribute_type = "s",
  registry = registry_tmp
)
cpos_list <- lapply(
  0L:(no_strucs - 1L),
  function(i)
    cl_struc2cpos(corpus = "REUTERS", struc = i, s_attribute = "id", registry = registry_tmp)
)
cpos_matrix <- do.call(rbind, cpos_list)

s_attribute_encode(
  values = as.character(1L:nrow(cpos_matrix)),
  data_dir = data_dir_tmp,
  s_attribute = "foo",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)
#> ... adding s-attribute 'foo' to registry
#> Corpus to delete (ID): REUTERS
#> Corpus name: Reuters Sample Corpus
#> Number of loads before reset: 50
#> Number of loads resetted: 1

cl_struc2str(
  "REUTERS",
  struc = 0L:(nrow(cpos_matrix) - 1L),
  s_attribute = "foo",
  registry = registry_tmp
)
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20"

unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

avs <- s_attribute_get_values(
  s_attribute = "id",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
)

rng <- s_attribute_get_regions(
  s_attribute = "id",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
)

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5, 11, 20, 25L, 30L),
  cpos_right = c(5, 12, 22, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)
#>   cpos_left cpos_right value
#> 1         1          2   ORG
#> 2         5          5   LOC
#> 3        25         27   ORG
#> 4        30         33   ORG