There is a newer version of the record available.

Published October 2, 2015 | Version v0.6
Software Open

jiebaR: CRAN version 0.6

Description

Changes in Version 0.6 (2015-10-1)

  • Add: C API
  • Add: freq() to count word frequency
  • Fix: filter_segment() may occasionally remove words
  • Enhancement: filter_segment() can now handle a list of character vectors of words, and gains a unit option.
  • Enhancement: segmentation workers can now remove stop words. The default STOPPATH is not applied to segmentation workers unless a path is given.
  • Enhancement: when symbol = F, tokens such as 2010-10-13 and 10.2 are recognized.

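1. Enhancement: segmentation and part-of-speech tagging workers can now filter stop words. The default STOPPATH is not used, so no stop-word dictionary is applied by default; stop words are removed during segmentation only when you supply a custom path. The stop-word file must be encoded in UTF-8, otherwise the data read in may be garbled.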

cutter = worker()
cutter
# Worker Type: Mix Segment
# Fixed Model Components:
# ...
# $stop_word
# NULL
# $timestamp
# [1] 1442716020
# $detect $encoding $symbol $output $write $lines $bylines can be reset

cutter = worker(stop_word="../stop.txt")
cutter
# Worker Type: Mix Segment
# Fixed Model Components:
# ...
# $stop_word
# [1] "../stop.txt"
# $timestamp
# [1] 1442716020
# $detect $encoding $symbol $output $write $lines $bylines can be reset.

2. Enhancement: during segmentation with symbol = FALSE, the symbols inside text such as 2010-10-12 or 20.2 are kept. Standalone symbols are still filtered out.

cutter = worker()
cutter$symbol = F
cutter["2010-10-10"]

3. Addition: freq() counts word frequencies. It takes a character vector as input and returns a data frame of word frequencies.

freq(c("测试", "测试", "文本"))
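
As a rough illustration of the kind of result described above, the following base R sketch counts words with table() and arranges the counts in a data frame. It does not call jiebaR, the ASCII words are stand-ins for segmented text, and the column names char and freq are assumptions made for this example rather than something stated in these notes.

```r
# Base R sketch of a word-frequency count; this is not jiebaR's freq().
words <- c("test", "test", "text")   # stand-ins for segmented words

counts <- table(words)               # one cell per distinct word

# Column names `char` and `freq` are assumptions for this illustration.
word_freq <- data.frame(
  char = names(counts),
  freq = as.vector(counts),
  stringsAsFactors = FALSE
)
word_freq
```

The sketch only shows the shape of a frequency table; row order and column types in the real freq() output may differ.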

4. Enhancement: filter_segment() now accepts a list of character vectors.

cutter = worker()
result_segment = list(
  cutter["我是测试文本,用于测试过滤分词效果。"],
  cutter["我是测试文本,用于测试过滤分词效果。"])
result_segment
filter_words = c("","","","大家")
filter_segment(result_segment,filter_words)

5. Fix: filter_segment() could occasionally remove words that are not stop words.

6. Addition: filter_segment() gains a unit option.

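When there are many stop words, the generated regular expression can exceed 256 bytes and R may raise an error. The unit option processes a large stop-word list in several passes, limiting how many stop words are matched per pass and therefore the length of each generated regular expression. unit defaults to 50, and the default rarely needs to be changed.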

help(regex)
# Long regular expressions may or may not be accepted: the POSIX standard only requires up to 256 bytes.

filter_segment(result_segment,filter_words) # uses the default, which rarely needs changing
filter_segment(result_segment,filter_words, unit=10) # if you have many long stop-word entries
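
To make the batching idea concrete, here is a minimal base R sketch of filtering in groups of at most unit stop words, so that each generated pattern stays short. The helper names filter_in_chunks and escape_regex are hypothetical and written for this note; this is not jiebaR's implementation of filter_segment().

```r
# Hypothetical helpers illustrating batched stop-word removal.
# This is a sketch, not the code used inside jiebaR.

# Escape regex metacharacters so each stop word is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

filter_in_chunks <- function(words, stop_words, unit = 50) {
  # Drop empty and duplicated stop words before building patterns.
  stop_words <- unique(stop_words[nzchar(stop_words)])

  # Split the stop words into groups of at most `unit` entries.
  groups <- split(stop_words, ceiling(seq_along(stop_words) / unit))

  for (group in groups) {
    # One anchored alternation per group keeps every pattern short.
    pattern <- paste0("^(", paste(escape_regex(group), collapse = "|"), ")$")
    words <- words[!grepl(pattern, words)]
  }
  words
}

filter_in_chunks(c("a", "test", "b", "text"), c("a", "b"), unit = 1)
```

A smaller unit means more passes over the words but shorter patterns, which is the trade-off the unit option exposes.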

7. Addition: a C API, so that other R packages can call this package's C interface.

// inst/include/jiebaRAPI.h
SEXP jiebaR_filecoding(SEXP fileSEXP);
SEXP jiebaR_mp_ptr(SEXP dictSEXP, SEXP userSEXP);
....

Files

jiebaR-v0.6.zip

Files (113.2 kB)

md5:ead55b363215418c1fa93ae2c828011a