There is a newer version of this record available.

Software Open Access

jiebaR: CRAN version 0.5

Qin Wenfeng; Yanyi Wu

Changes in Version 0.5 (2015-04-29)
  • Fix: edit_dict() on Mac
  • New function: filter_segment() to filter segmentation result
  • New function: vector_keywords() to extract keywords from a string
  • Enhancement: Segmentation support: Vector input => List output
  • Enhancement: Segmentation support: Input by lines => Output by lines
  • Enhancement: Add option write = "NOFILE"
  • Enhancement: New rules for "English word + Numbers"
  • Update documentation

1. Added filter_segment(), a function for filtering segmentation results. It works like the stop-word feature used in keyword extraction.

cutter = worker()
result_segment = cutter["我是测试文本,用于测试过滤分词效果。"]
result_segment
[1] "" "" "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
filter_words = c("","","","大家")
filter_segment(result_segment,filter_words)
[1] "" "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
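
The filtering step can be approximated in base R with a set-membership test. This is a minimal sketch for illustration only, not code from the jiebaR source; the helper name drop_words is hypothetical.

```r
# Hypothetical helper: drop every token that appears in `filter_words`.
# Order and duplicates of the remaining tokens are preserved.
drop_words <- function(segments, filter_words) {
  segments[!(segments %in% filter_words)]
}

tokens <- c("测试", "文本", "大家", "用于", "测试")
drop_words(tokens, c("大家"))
```

Unlike the stop-word list of a keyword worker, this kind of filter is applied after segmentation, so the word list can be changed without rebuilding the worker.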

2. Segmentation now supports "text vector input => list output" and "file input by lines => list output".

The bylines option controls whether results are returned line by line. The default is bylines = FALSE.

cutter = worker(bylines = TRUE)
cutter
Worker Type:  Mix Segment
Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :
Write File      :  TRUE
By Lines        :  TRUE
Max Read Lines  :  1e+05
....
cutter[c("这是非常的好","大家好才是真的好")]
[[1]]
[1] "这是" "非常" "" ""
[[2]]
[1] "大家" "" "" "" "真的" ""
cutter$write = FALSE
# The input file contains:
# 这是一个分行测试文本
# 用于测试分行的输出结果
cutter["files.path"]
[[1]]
[1] "这是" "一个" "分行" "测试" "文本"
[[2]]
[1] "用于" "测试" "分行" "" "输出" "结果"
# Write the results to a file line by line
cutter$write = TRUE
cutter$bylines = TRUE
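
The "vector input => list output" shape can be sketched in base R with lapply. The tokenizer toy_segment below is a hypothetical whitespace splitter standing in for a real jiebaR worker, and the pooling behaviour shown for bylines = FALSE is an assumption rather than something confirmed by these notes.

```r
# Hypothetical stand-in tokenizer: splits on whitespace.
# A jiebaR worker performs real Chinese word segmentation instead.
toy_segment <- function(text) {
  strsplit(text, "\\s+")[[1]]
}

# bylines = TRUE keeps one result per input element, returned as a list.
# bylines = FALSE is assumed here to pool all tokens into a single vector.
segment_vector <- function(texts, bylines = FALSE) {
  per_line <- lapply(texts, toy_segment)
  if (bylines) per_line else unlist(per_line)
}

segment_vector(c("this is fine", "everyone is fine"), bylines = TRUE)
```

Keeping one list element per input line makes it possible to map each result back to the line it came from.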

3. vector_keywords() extracts keywords from a character vector of words.

keyworker = worker("keywords")
cutter = worker()
vector_keywords(cutter["这是一个比较长的测试文本。"],keyworker)
8.94485 7.14724 4.77176 4.29163 2.81755
 "文本"  "测试"  "比较"  "这是"  "一个"
vector_keywords(c("今天","天气","真的","十分","不错","","感觉"),keyworker)
6.45994 6.18823 5.64148 5.63374 4.99212
 "天气"  "不错"  "感觉"  "真的"  "今天"
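
The scores above have the shape of a TF-IDF ranking: each word gets a weight and the highest-weighted words come back first. The sketch below shows that idea in base R. It assumes a TF-IDF scheme, and the IDF table toy_idf and the function top_keywords are hypothetical; jiebaR ships its own IDF dictionary and may compute scores differently.

```r
# Hypothetical IDF table; the values are made up for illustration.
toy_idf <- c(weather = 3.0, nice = 2.5, today = 1.5, really = 1.0)

top_keywords <- function(tokens, idf, topn = 5) {
  tokens <- tokens[tokens %in% names(idf)]    # skip words without an IDF value
  tf <- table(tokens)                         # term frequency per word
  scores <- as.numeric(tf) * idf[names(tf)]   # TF * IDF, named by word
  scores <- head(sort(scores, decreasing = TRUE), topn)
  # Return the words, named by their scores, like the output above.
  setNames(names(scores), format(unname(scores)))
}

top_keywords(c("today", "weather", "really", "nice", "weather"), toy_idf, topn = 3)
```

Because the input is already a vector of words, this step does not need a segmentation worker, which is the situation vector_keywords() is meant for.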

4. Added the write = "NOFILE" option, which skips the file path check so that input is always treated as text.

cutter = worker(write = "NOFILE",symbol = TRUE)
cutter["./test.txt"] # a test.txt file exists in this directory
[1] "." "/" "test" "." "txt"
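
The effect of skipping the path check can be sketched as a small dispatch function. The function input_kind is hypothetical and the logic is an assumption drawn from the example above, not code from jiebaR.

```r
# Decide whether an input string is read as a file or segmented as text.
# `write` mirrors the option name above; the dispatch rule is an assumption.
input_kind <- function(input, write = TRUE) {
  if (!identical(write, "NOFILE") && file.exists(input)) "file" else "text"
}

path <- tempfile(fileext = ".txt")
writeLines("some text", path)

input_kind(path)                    # "file": the path exists and is checked
input_kind(path, write = "NOFILE")  # "text": the path check is skipped
```

This is useful when the text being segmented may happen to look like a path to an existing file.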

Files (94.7 kB)
jiebaR-0.5.zip (94.7 kB)
md5:eb28423c319b275aae83a80f283065cf