corpus_utils.RdUtitlity functions to keep the installation of indexed CWB corpora wrapped into R data packages simple.
corpus_install(pkg = NULL, repo = "http://polmine.sowi.uni-due.de/packages", tarball = NULL, lib = .libPaths()[1], verbose = TRUE, user = NULL, password = NULL, ...) corpus_packages() corpus_rename(old, new, registry_dir = Sys.getenv("CORPUS_REGISTRY"), verbose = TRUE) corpus_remove(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY")) corpus_as_tarball(corpus, registry_dir, tarfile, verbose = TRUE) corpus_copy(corpus, registry_dir, data_dir = NULL, registry_dir_new = file.path(normalizePath(tempdir(), winslash = "/"), "cwb", "registry", fsep = "/"), data_dir_new = file.path(normalizePath(tempdir(), winslash = "/"), "cwb", "indexed_corpora", tolower(corpus), fsep = "/"), verbose = interactive(), progress = TRUE) corpus_recode(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"), data_dir = registry_file_parse(corpus, registry_dir)[["home"]], skip = character(), to = c("latin1", "UTF-8"), verbose = TRUE)
| pkg | Name of the data package. |
|---|---|
| repo | URL of the repository. |
| tarball | The URL or local path to a tarball with a CWB indexed corpus. |
| lib | Directory for R packages, defaults to |
| verbose | Logical, whether to be verbose. |
| user | A user name that can be specified to download a corpus from a password protected site. |
| password | A password that can be specified to download a corpus from a password protected site. |
| ... | Further parameters that will be passed into
|
| old | Name of the (old) corpus. |
| new | Name of the (new) corpus. |
| registry_dir | Directory of registry. |
| corpus | A CWB corpus. |
| tarfile | Filename of tarball. |
| data_dir | The data directory where the files of the CWB corpus live. |
| registry_dir_new | Target directory with for (new) registry files. |
| data_dir_new | Target directory for corpus files. |
| progress | Logical, whether to show a progress bar. |
| skip | A character vector with s_attributes to skip. |
| to | Character string describing the target encoding of the corpus. |
A data package with a CWB corpus is assumed to include a directory
/extdata/cwb/registry for registry files and a directory
/extdata/cwb/indexed_corpora for the inexed corpus files. The
corpus_install function combines two steps necessary to install a
CWB corpus. First, it calls install.packages, then it resets the
path pointing to the directory with the indexed corpus files in the
registry file. The package will be installed to the standard library
directory for installing R packages (.libPaths()[1]). Another
location can be used by stating the param 'lib' explicitly (see
documentation for install.packages).
The function can also be used to install a corpus from a password protected
repository. Further parameters are handed over to install.packages, so you
might add method = "wget" extra = "--user donald --password duck".
See examples how to check whether the directory has been set correctly.
corpus_packages will detect the packages that include CWB
corpora.
corpus_rename will rename a corpus, affecting the name of the
registry file, the corpus id, and the name of the directory where data
files reside.
corpus_remove can be used to drop a corpus.
corpus_as_tarball will create a tarball (.tar.gz-file) with
two subdirectories. The 'registry' subdirectory will host the registry file
for the tarred corpus. The data files will be put in a subdirectory with
the corpus name in the 'indexed_corpora' subdirectory.
corpus_copy will create a copy of a corpus (useful for
experimental modifications, for instance).
For managing registry files, see registry_file_parse
for switching to a packaged corpus.
registry_file_new <- file.path( normalizePath(tempdir(), winslash = "/"), "cwb", "registry", "reuters", fsep = "/" ) if (file.exists(registry_file_new)) file.remove(registry_file_new) corpus_copy( corpus = "REUTERS", registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"), data_dir = system.file( package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters" ) )#> | | | 0% | |=== | 5% | |====== | 9% | |========== | 14% | |============= | 18% | |================ | 23% | |=================== | 27% | |====================== | 32% | |========================= | 36% | |============================= | 41% | |================================ | 45% | |=================================== | 50% | |====================================== | 55% | |========================================= | 59% | |============================================= | 64% | |================================================ | 68% | |=================================================== | 73% | |====================================================== | 77% | |========================================================= | 82% | |============================================================ | 86% | |================================================================ | 91% | |=================================================================== | 95% | |======================================================================| 100%unlink(file.path( normalizePath(tempdir(), winslash = "/"), "cwb", fsep = "/"), recursive = TRUE) corpus <- "REUTERS" pkg <- "RcppCWB" s_attr <- "places" Q <- '"oil"' registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry") data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus)) registry_dir_tmp <- file.path( normalizePath(tempdir(), winslash = "/"), "cwb", "registry", fsep = "/" ) registry_file_tmp <- file.path(registry_dir_tmp, tolower(corpus), fsep = "/") data_dir_tmp <- file.path( normalizePath(tempdir(), winslash = "/"), "cwb", "indexed_corpora", tolower(corpus), fsep = "/" ) if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp) if (!dir.exists(data_dir_tmp)){ dir.create(data_dir_tmp, recursive = TRUE) } else { if (length(list.files(data_dir_tmp)) > 0L) file.remove(list.files(data_dir_tmp, full.names = TRUE)) } corpus_copy( corpus = corpus, registry_dir = registry_dir_src, data_dir = data_dir_src, registry_dir_new = registry_dir_tmp, data_dir_new = data_dir_tmp )#> | | | 0% | |=== | 5% | |====== | 9% | |========== | 14% | |============= | 18% | |================ | 23% | |=================== | 27% | |====================== | 32% | |========================= | 36% | |============================= | 41% | |================================ | 45% | |=================================== | 50% | |====================================== | 55% | |========================================= | 59% | |============================================= | 64% | |================================================ | 68% | |=================================================== | 73% | |====================================================== | 77% | |========================================================= | 82% | |============================================================ | 86% | |================================================================ | 91% | |=================================================================== | 95% | |======================================================================| 100%#> [1] "latin1"corpus_recode( corpus = corpus, registry_dir = registry_dir_tmp, data_dir = data_dir_tmp, to = "UTF-8" )#>#>#>#>#>#> Corpus to delete (ID): REUTERS #> Corpus name: Reuters Sample Corpus #> Number of loads before reset: 1 #> Number of loads resetted: 1#> [1] 0#> Warning: CQP has already been initialized. Re-initialization is not possible. Only resetting registry.#> [1] TRUE#> [1] "utf8"n_strucs <- RcppCWB::cl_attribute_size( corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp ) strucs <- 0L:(n_strucs - 1L) struc_values <- RcppCWB::cl_struc2str( corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp ) speakers <- unique(struc_values) Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp) if (RcppCWB::cqp_is_initialized()) RcppCWB::cqp_reset_registry() else RcppCWB::cqp_initialize()#> [1] TRUE#> NULLcpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus) ids <- RcppCWB::cl_cpos2id( corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos ) str <- RcppCWB::cl_id2str( corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids ) unique(str)#> [1] "oil"