cwbtools 0.1.1 2019-12-09

MINOR IMPROVEMENTS

  • The pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
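The on-demand directory creation described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helper name ensure_cwb_dirs() and the directory layout below inst/extdata/cwb are assumptions made for the example.

```r
# Hypothetical helper (not part of cwbtools): create the registry and data
# directories of a data package only if they do not exist yet.
ensure_cwb_dirs <- function(pkg_dir, corpus) {
  # Assumed layout; the layout actually used by pkg_add_corpus() may differ.
  cwb_dir <- file.path(pkg_dir, "inst", "extdata", "cwb")
  dirs <- c(
    registry = file.path(cwb_dir, "registry"),
    data = file.path(cwb_dir, "indexed_corpora", tolower(corpus))
  )
  for (d in dirs) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  invisible(dirs)
}

# Usage with a throwaway package directory and a placeholder corpus name
pkg <- file.path(tempdir(), "mycorpuspkg")
dirs <- ensure_cwb_dirs(pkg, corpus = "REUTERS")
all(dir.exists(dirs))
```

Because the directories are created only when missing, calling the helper a second time leaves an existing package untouched, so no dummy files are needed to keep the directories alive.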

BUG FIXES

  • In the upcoming R version 4.0, the matrix class will inherit from class array. The new package version now takes into account that length(class(matrix(1:4,2,2))) will return the value 2.
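To illustrate why the changed class vector matters, the following sketch contrasts a check that depends on the length of class() with checks that behave the same before and after R 4.0. It is a generic base R example and does not reproduce the code that was changed in cwbtools.

```r
m <- matrix(1:4, 2, 2)

# Fragile: class(m) has length 1 before R 4.0 and length 2 from R 4.0 on,
# so the comparison below yields two values there and is not suitable as a
# condition in if().
# class(m) == "matrix"

# Robust alternatives that return a single logical value on either version:
is.matrix(m)
inherits(m, "matrix")
```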

DOCUMENTATION FIXES

  • The NEWS file now follows the styleguide such that pkgdown::build_site() will generate a proper changelog page.

cwbtools 0.1.0 2019-10-21

  • updated vignette so that annex explains installation of CoreNLP v3.9.2 (2018-10-05)
  • New functions s_attribute_get_regions() and s_attribute_get_values().
  • In corpus_install(), using download.file() replaces curl::curl_download() for Windows because curl apparently is not able to process target filenames that include special characters.
  • For Windows machines, there is a check for non-ASCII characters in the file path. If any are found, a path generated by a call to shortPathName() is used instead.
  • In the vignette, the registry is reset after creating the new corpora, to make the new corpus available.
  • A (preliminary) decode()-method will turn a partition into an Annotation object from the NLP package.
  • A new conll_get_regions()-function will turn a CoNLL-style annotated token stream into a table with regions that can be encoded using s_attribute_encode().
  • A new function s_attribute_merge() will merge two data.table objects defining s-attributes, checking for overlaps.
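The Windows path check mentioned in this release can be sketched as follows. The helper names has_non_ascii() and safe_path() are hypothetical and the byte-level test is an assumption about how such a check could be written; only shortPathName() is taken from the entry above.

```r
# Hypothetical helpers, not the cwbtools implementation.
has_non_ascii <- function(path) {
  # Any byte above 127 means the string is not pure ASCII.
  any(as.integer(charToRaw(path)) > 127L)
}

safe_path <- function(path) {
  if (.Platform$OS.type == "windows" && has_non_ascii(path)) {
    # shortPathName() is only available on Windows.
    utils::shortPathName(path)
  } else {
    path
  }
}

has_non_ascii("C:/Users/M\u00fcller/corpora")
has_non_ascii("C:/Users/Miller/corpora")
```

On platforms other than Windows, and for paths that are pure ASCII, safe_path() returns its input unchanged.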

cwbtools 0.0.11 Unreleased

cwbtools 0.0.10 Unreleased

  • Missing documentation written for fields of class CorpusData.
  • New fields ‘sentences’ and ‘named_entities’ added to class CorpusData, as a basis for encoding annotation of sentences and named entities.

cwbtools 0.0.9 Unreleased

  • fixed parsing of the path in registry_file_path when the path is enclosed in inverted commas (adjusted regex)
  • issue with ALTREP vector for corpus positions resolved
  • layout of progress bars consistently using pbapply package
  • sanity checks for s_attribute_encode, ensure that region_matrix is integer matrix
  • s_attribute_encode when called with method = “R” will now add s_attribute to registry
  • s_attribute_encode will add structural attribute to registry when using R implementation, too
  • corpus_as_tarball-function added
  • install_corpus able to install from tarball
  • progress option for CorpusData$import_xml()-method
  • Minimal rework of progress bar in CorpusData$add_corpus_positions() (helper function .fn)
  • Three dots (…) are passed into download.file() by install_corpus(), if argument tarball is specified. This is a precondition for passing arguments to download password-protected corpora.
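How arguments travel through the three dots into download.file() can be sketched like this. The helper fetch_tarball() is hypothetical and much simpler than install_corpus(); the URL and the header value in the commented call are placeholders.

```r
# Hypothetical helper (not the cwbtools implementation): forward any extra
# arguments given via ... to download.file().
fetch_tarball <- function(tarball, ...) {
  destfile <- tempfile(fileext = ".tar.gz")
  utils::download.file(url = tarball, destfile = destfile, ...)
  destfile
}

# Example call (not run): quiet and headers are handed on to download.file().
# fetch_tarball(
#   "https://example.org/corpus.tar.gz",
#   quiet = TRUE,
#   headers = c(Authorization = "Basic ...")
# )
```

The helper itself does not need to know which download options exist; whatever the caller supplies is passed on unchanged.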

cwbtools 0.0.8 Unreleased

  • major bug removed when writing regions to disk (s_attribute_encode) with R
  • when creating/removing files in p_attribute_encode, only the basenames of filenames are printed
  • for CorpusData$encode(), an already existing corpus will be removed

cwbtools 0.0.7 Unreleased

  • bug removed in function pkg_create_cwb_dirs causing error when a directory already exists
  • new vignette ‘europarl’: sample workflow for putting indexed corpus into package
  • for $tokenize()-method of CorpusData: stricter requirement that chunkdata is data.table
  • progress bar for $tokenize()-method, when tokenizers package is used
  • tilde expansion for paths that are passed into p_attribute_encode
  • stri_detect_regex replacing grepl to speed things up in p_attribute_encode
  • awful workaround for coping with latin1 removed in p_attribute_encode
  • strip_punct = FALSE for $tokenize() method of CorpusData
  • purging the data for the CWB has been moved away from p_attribute_encode to a $purge()-method of CorpusData (to be performed on chunkdata) as a matter of efficiency.
  • continuous removal of objects and garbage collection in p_attribute_encode to be as parsimonious with memory as possible
  • checking of encoding in p_attribute_encode has been moved to $check_encoding() method in CorpusData-class to keep necessity to copy around vectors (potentially exceeding memory) to a minimum.
  • additional parameters passed into tokenizers::tokenize_words by …
  • writing hex for content of s_attributes to cope with encoding issues
  • values coerced to character

cwbtools 0.0.6 Unreleased

  • DataPackage class turned into pkg_*-functions
  • first version that passes all tests

cwbtools 0.0.5 Unreleased

  • undocumented

cwbtools 0.0.4 Unreleased

  • askYesNo function has been replaced by readline(), to ensure compatibility with R versions < 3.5
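A yes/no prompt that does not rely on askYesNo() could look like the following sketch. It assumes that base R's readline() is used for the prompt; the helper ask_yes_no() and its read argument are hypothetical and not taken from cwbtools.

```r
# Hypothetical helper: a yes/no prompt that works on R versions without
# askYesNo(). The reader is an argument so that it can be replaced in tests.
ask_yes_no <- function(question, read = readline) {
  answer <- tolower(trimws(read(paste0(question, " (y/n) "))))
  answer %in% c("y", "yes")
}

# Interactive use (not run):
# if (ask_yes_no("Remove existing corpus?")) message("removing ...")
```

Any answer other than "y" or "yes" (ignoring case and surrounding whitespace) is treated as a refusal, which is the cautious default for destructive actions.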