Published June 5, 2018 | Version v0.6
Software Open

LanguageMachines/ticcltools: v0.6

  • 1. Radboud University
  • 2. Centre of Language and Speech Technology, Radboud University Nijmegen

Description

Intermediate release with a lot of new code to handle n-grams. A lot of refactoring has also been done for clearer, more maintainable code. This is still a work in progress.

  • TICCL-unk:

    • more extensive acronym detection
    • fixed artifreq problems in 'clean' punctuated words
    • added filters for 'unwanted' characters
    • added a ligature filter to convert evil ligatures
    • normalize all hyphens to a 'normal' one (-)
    • use a better definition of punctuation (the Unicode character class alone is not good enough to decide)
  • TICCL-lexstat:

    • the 'separator' symbol should get freq=0, so it isn't counted
    • the clip value is added to the output filename
  • TICCL-indexer:

    • indexer and indexerNT now produce the same output, using different strategies when a --foci file is used
  • TICCL-LDcalc: major overhaul for n-grams

    • added an n-gram point column to the output (so NOT backward compatible!)
    • produce a '.short' list for short word corrections
    • produce a '.ambi' file with a list of n-grams related to short words
    • prune a lot of n-grams from the output
  • TICCL-rank:

    • output is sorted now
    • honor the n-gram points from the new LDcalc (so NOT backward compatible!)
  • TICCL-chain: new module to chain ranked files

  • TICCL-lexclean:

    • added a -x option for 'inverse' alphabet

  • TICCL-anahash:

    • added a --list option to produce a list of words and anagram values
  • added metadata file: codemeta.json
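
To illustrate the TICCL-unk items on hyphen normalization and ligature conversion listed above, here is a minimal Python sketch. It is not the ticcltools code itself: the function name `normalize_word` and both character tables are assumptions made for this example, since the release notes do not say which code points the filters actually cover.

```python
# Hypothetical set of hyphen-like code points mapped to the ASCII hyphen.
# The real set used by TICCL-unk is not given in the release notes.
HYPHEN_LIKE = {
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2212",  # MINUS SIGN
}

# Hypothetical ligature table: each ligature expands to its plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def normalize_word(word: str) -> str:
    """Replace hyphen-like characters with '-' and expand ligatures."""
    out = []
    for ch in word:
        if ch in HYPHEN_LIKE:
            out.append("-")
        else:
            # Characters that are not in the ligature table pass through unchanged.
            out.append(LIGATURES.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    print(normalize_word("\ufb01lm"))          # ligature 'fi' followed by 'lm'
    print(normalize_word("well\u2013known"))   # en dash between the two parts
```

Applying both substitutions in one pass keeps the sketch simple. A real filter would likely need a larger table and a decision on how to treat characters it does not recognize.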

Files

LanguageMachines/ticcltools-v0.6.zip

Files (142.1 MB)

md5:6618336635efa1456b8a081df4df65f9

Additional details