A reliable string processing toolkit is a must-have for any data scientist.

A new release of the stringi package is available on CRAN (please allow a few days for the Windows and OS X binary builds). As of now, about 850 CRAN packages depend on stringi, either directly or recursively. Quite recently, the package was also listed among the most downloaded R extensions.

# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845

Refer to the INSTALL file for more details if you compile stringi from source (this mostly concerns Linux users).

Here’s a list of changes in version 0.5-2. There are major new features (such as date and time processing), minor new features and enhancements, as well as bug fixes. In this release we also focused on giving users of the stringr package an even better string processing experience, since stringr has been powered by stringi as of its 1.0.0 release.

stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # A and ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
x <- stri_flatten(c(
   stri_dup(LETTERS, 2),
   stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ A B
## C D E F
## G H I J
## K L M N
## O P Q R
## S T U V
## W X Y Z
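# 100 random strings, each 10000 characters long, drawn from the letters a, c, t, g
# (note that the stri_detect_fixed() call below searches for a different
# pattern than the grepl() calls)
x <- stri_rand_strings(100, 10000, "[actg]")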
microbenchmark::microbenchmark(
   stri_detect_fixed(x, "acgtgaa"),
   grepl("actggact", x),
   grepl("actggact", x, perl=TRUE),
   grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
##                                expr       min        lq       mean
##     stri_detect_fixed(x, "acgtgaa")   349.153   354.181   381.2391
##                grepl("actggact", x) 14017.923 14181.416 14457.3996
##   grepl("actggact", x, perl = TRUE)  8280.282  8367.426  8516.0124
##  grepl("actggact", x, fixed = TRUE)  3599.200  3637.373  3726.6020
##      median         uq       max neval  cld
##    362.7515   391.0655   681.267   100 a   
##  14292.2815 14594.4970 15736.535   100    d
##   8463.4490  8570.0080  9564.503   100   c 
##   3686.6690  3753.4060  4402.397   100  b
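
The date and time processing mentioned above is not illustrated by the examples so far, so here is a minimal sketch of how it might be used. The sample date, the format pattern, and the variable names (t0, s, p, f) are assumptions chosen for illustration; the time zone is set to UTC explicitly so that the result does not depend on the local settings.

```r
library("stringi")

# Build a date-time object (POSIXct) from its components; UTC is assumed here
t0 <- stri_datetime_create(2015, 5, 1, 12, 30, 0, tz="UTC")

# Format it using an ICU-style date-time pattern
s <- stri_datetime_format(t0, "yyyy-MM-dd HH:mm:ss", tz="UTC")

# Parse the formatted string back into a date-time object
p <- stri_datetime_parse(s, "yyyy-MM-dd HH:mm:ss", tz="UTC")

# Extract the individual calendar fields (year, month, day, ...) as a data frame
f <- stri_datetime_fields(p, tz="UTC")
```

Passing tz explicitly in every call keeps the round trip (create, format, parse) reproducible across machines; otherwise the default time zone of the session is used.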

Enjoy! Any comments and suggestions are welcome.