ropensci/tokenizers: tokenizers 0.2.0

Dmitriy Selivanov; Jeffrey Arnold (University of Washington); Kenneth Benoit (London School of Economics and Political Science); Os Keyes; Karthik Ram (UC Berkeley); Lincoln Mullen (George Mason University)

DOI: 10.5281/zenodo.1205017
Source: https://github.com/ropensci/tokenizers/tree/v0.2.0
License: Other (Open)
Features
<ul>
<li>Add the <code>tokenize_ptb()</code> function for Penn Treebank tokenizations (@jrnold) (#12).</li>
<li>Add a function <code>chunk_text()</code> to split long documents into pieces (#30).</li>
<li>New functions to count words, characters, and sentences without tokenization (#36).</li>
<li>New function <code>tokenize_tweets()</code> preserves usernames, hashtags, and URLs (@kbenoit) (#44).</li>
<li>The <code>stopwords()</code> function has been removed in favor of using the <strong>stopwords</strong> package (#46).</li>
<li>The package now complies with the basic recommendations of the <strong>Text Interchange Format</strong>. All tokenization functions are now methods. This enables them to take corpus inputs as either TIF-compliant named character vectors, named lists, or data frames. All outputs are still named lists of tokens, but these can be easily coerced to data frames of tokens using the <code>tif</code> package. (#49)</li>
<li>Add a new vignette "The Text Interchange Formats and the tokenizers Package" (#49).</li>
</ul>
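A minimal R sketch exercising the new functions listed above. This snippet is illustrative, not part of the release itself; it assumes tokenizers 0.2.0 (and, for the tweet tokenizer, the example strings are invented).

```r
# Sketch only: assumes tokenizers >= 0.2.0 is installed.
library(tokenizers)

txt <- "The quick brown fox jumps over the lazy dog. It barked twice."

# Penn Treebank tokenization (new in 0.2.0)
tokenize_ptb(txt)

# Split a long document into chunks of roughly 100 words each
long_doc <- paste(rep(txt, 50), collapse = " ")
chunk_text(long_doc, chunk_size = 100)

# Count words and sentences without tokenizing first
count_words(txt)
count_sentences(txt)

# Tweet-aware tokenization keeps usernames, hashtags, and URLs intact
tokenize_tweets("Thanks @rOpenSci! #rstats https://ropensci.org")

# TIF-compliant input: a named character vector in, a named list of tokens out
docs <- c(doc1 = "First document.", doc2 = "Second document here.")
tokenize_words(docs)
```

Because the tokenizers are now methods, the same calls accept TIF-compliant named lists or data frames in place of the named character vector shown here.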
Bug fixes and performance improvements
<ul>
<li><code>tokenize_skip_ngrams()</code> has been improved to also generate unigrams and bigrams, consistent with the definition of skip n-grams (#24).</li>
<li>C++98 has replaced the C++11 code used for n-gram generation, widening the range of compilers <code>tokenizers</code> supports (@ironholds) (#26).</li>
<li><code>tokenize_skip_ngrams()</code> now supports stopwords (#31).</li>
<li>If tokenizers fail to generate tokens for a particular entry, they consistently return <code>NA</code> (#33).</li>
<li>Keyboard interrupt checks have been added to Rcpp-backed functions to enable users to terminate them before completion (#37).</li>
<li><code>tokenize_words()</code> gains arguments to preserve or strip punctuation and numbers (#48).</li>
<li><code>tokenize_skip_ngrams()</code> and <code>tokenize_ngrams()</code> now return properly marked UTF-8 strings on Windows (@patperry) (#58).</li>
</ul>
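A short R sketch of the new <code>tokenize_words()</code> arguments and the stopword support in <code>tokenize_skip_ngrams()</code>. The argument names match tokenizers 0.2.0; the example strings are invented.

```r
# Sketch only: assumes tokenizers >= 0.2.0 is installed.
library(tokenizers)

x <- "Gather ye rosebuds, all 3 of them!"

# Default behavior lowercases and strips punctuation
tokenize_words(x)

# Preserve punctuation and drop numeric tokens instead
tokenize_words(x, strip_punct = FALSE, strip_numeric = TRUE)

# Skip n-grams now honor a stopword list (n-gram size n, skip distance k)
tokenize_skip_ngrams(x, n = 2, k = 1, stopwords = c("of", "all"))
```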
Published via Zenodo, 2018-03-21.
Release archive: https://zenodo.org/records/1205017/files/ropensci/tokenizers-v0.2.0.zip (md5:32aa98f5d41d70a0207fdd95d2ecd9f2)
Is supplement to: https://github.com/ropensci/tokenizers/tree/v0.2.0
Is version of: doi 10.5281/zenodo.1205016