There is a newer version of the record available.

Published December 6, 2015 | Version 0.7
Software Open

jiebaR: Changes in Version 0.7

Description

Changes in Version 0.7

o Add: tobin() to transform a simhash to binary format.
o Add: vector_simhash() and vector_distance() to extract the simhash or compute the Hamming distance from a segmentation result.
o Add: get_tuple() to get tuples from a segmentation result.
o Add: get_idf() to generate an IDF dictionary.
o Fix: The C API now works with Clang on Mac OS X 10.11.
o Enhancement: Update tests for the C API.
o Warning: The next version will update the internal CppJieba version, and tag(), EditDict(), and ShowDictPath() will be removed.

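1. Added: get_tuple() returns the frequency of each combination of n consecutive strings in a segmentation result. The counts can serve as a reference when building a custom dictionary.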

get_tuple(c("sd","sd","sd","rd"), size = 3)
#     name count
# 4   sdsd     2
# 1   sdrd     1
# 2 sdsdrd     1
# 3 sdsdsd     1

get_tuple(list(
  c("sd","sd","sd","rd"),
  c("新浪","微博","sd","rd")
), size = 2)
#       name count
# 2     sdrd     2
# 3     sdsd     2
# 1   微博sd     1
# 4 新浪微博     1
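To make the counting rule concrete, here is a minimal base R sketch of the same idea: concatenate every run of 2 up to `size` consecutive tokens and tally the results. The helper name `count_tuples()` is hypothetical and this is not jiebaR's implementation; it is only meant to show where the counts in the output above come from.

```r
# count_tuples() is a hypothetical helper, not part of jiebaR.
# It counts concatenations of 2..size consecutive tokens, per document.
count_tuples <- function(docs, size = 2) {
  stopifnot(size >= 2)
  if (!is.list(docs)) docs <- list(docs)   # accept a single token vector
  grams <- character(0)
  for (tokens in docs) {
    n <- length(tokens)
    for (k in 2:size) {
      if (n < k) next                      # document too short for a k-tuple
      for (i in seq_len(n - k + 1)) {
        grams <- c(grams, paste(tokens[i:(i + k - 1)], collapse = ""))
      }
    }
  }
  counts <- table(grams)
  out <- data.frame(name = names(counts), count = as.integer(counts),
                    stringsAsFactors = FALSE)
  out[order(-out$count, out$name), ]       # most frequent first
}

res <- count_tuples(c("sd", "sd", "sd", "rd"), size = 3)
print(res)
```

For the four-token input this yields the same four names and counts as the first get_tuple() call; the row order and row names may differ.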

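2. Added: get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of text vectors, where each vector represents one document. A custom stop-word list can be supplied.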

get_idf(a_big_list, stop = "stop-word list", path = "IDF output directory")
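As an illustration of what an IDF dictionary contains, the sketch below computes inverse document frequency in base R using the common textbook definition, log(N / df), where N is the number of documents and df is the number of documents containing the term. Both the helper name `idf_sketch()` and the exact formula are assumptions; these notes do not state which variant get_idf() uses.

```r
# idf_sketch() is a hypothetical helper, not the jiebaR implementation.
# Assumed formula: idf(term) = log(N / number of documents containing term).
idf_sketch <- function(docs, stop_words = character(0)) {
  n_docs <- length(docs)
  # Keep each term once per document and drop stop words.
  per_doc <- lapply(docs, function(tokens) setdiff(unique(tokens), stop_words))
  doc_freq <- table(unlist(per_doc))
  idf <- log(n_docs / as.numeric(doc_freq))
  names(idf) <- names(doc_freq)
  idf
}

docs <- list(c("a", "b", "a"), c("a", "c"), c("b", "d"))
idf <- idf_sketch(docs, stop_words = "d")
print(idf)
```

Terms that occur in many documents get a low value, and rare terms get a high one, which is why the result is useful as a keyword-weighting dictionary.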

3. Added: vector_simhash() and vector_distance() compute the simhash and the Hamming distance directly from text vectors.

sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"], sim)
$simhash
[1] "9679845206667243434"

$keyword
8.94485 7.14724 4.77176 4.29163 2.81755
 "文本"  "测试"  "比较"  "这是"  "一个"

vector_simhash(c("今天","天气","真的","十分","不错","","感觉"), sim)
$simhash
[1] "13133893567857586837"

$keyword
6.45994 6.18823 5.64148 5.63374 4.99212
 "天气"  "不错"  "感觉"  "真的"  "今天"

vector_distance(c("今天","天气","真的","十分","不错","","感觉"), c("今天","天气","真的","十分","不错","","感觉"), sim)
$distance
[1] "0"

$lhs
6.45994 6.18823 5.64148 5.63374 4.99212
 "天气"  "不错"  "感觉"  "真的"  "今天"

$rhs
6.45994 6.18823 5.64148 5.63374 4.99212
 "天气"  "不错"  "感觉"  "真的"  "今天"
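The Hamming distance reported above is the number of bit positions in which two simhash fingerprints differ, so identical inputs give 0. A minimal base R sketch of that definition, operating on equal-length strings of "0" and "1", might look as follows. The helper name `hamming_bits()` is hypothetical, and this is not how jiebaR computes the distance internally.

```r
# hamming_bits() is a hypothetical helper, not part of jiebaR.
# It counts the positions at which two equal-length bit strings differ.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

d_same <- hamming_bits("10110", "10110")   # identical fingerprints
d_diff <- hamming_bits("10110", "10011")   # differ in two positions
print(c(d_same, d_diff))
```

A small distance indicates that two texts share most of their weighted keywords, which is the property simhash is used for in near-duplicate detection.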

4. Added: tobin() converts a simhash value to its binary representation.

res = vector_simhash(c("今天","天气","真的","十分","不错","","感觉"), sim)
tobin(res$simhash)
[1] "0000000000000000000000000000000000010101111100000111001010010101"
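Because a simhash is returned as a decimal string that can exceed the range of R's native integers, converting it to bits needs string arithmetic. The sketch below shows one generic way to do this in base R by repeatedly halving the decimal digits and collecting the remainders. The helper name `dec_to_bin64()` is hypothetical, the 64-bit width is an assumption, and the result is not guaranteed to match the output of tobin() bit for bit.

```r
# dec_to_bin64() is a hypothetical helper, not part of jiebaR.
# It converts a non-negative decimal string to a zero-padded 64-character
# bit string using long division by 2 on the decimal digits.
dec_to_bin64 <- function(dec) {
  stopifnot(grepl("^[0-9]+$", dec))
  digits <- as.integer(strsplit(dec, "")[[1]])
  bits <- integer(0)
  while (any(digits != 0)) {
    remainder <- 0L
    for (i in seq_along(digits)) {
      cur <- remainder * 10L + digits[i]
      digits[i] <- cur %/% 2L              # quotient digit
      remainder <- cur %% 2L               # carry into the next digit
    }
    bits <- c(remainder, bits)             # remainders arrive low bit first
  }
  stopifnot(length(bits) <= 64)            # assumed width: 64 bits
  paste(c(rep(0L, 64 - length(bits)), bits), collapse = "")
}

b5 <- dec_to_bin64("5")
print(b5)
```

Two bit strings produced this way have the same length, so they can be compared position by position to obtain a Hamming distance.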

Files

jiebaR-0.7.zip (124.1 kB)
md5:51c92a5aef988951553b0e9e6ac00fd9
